Market Shaping Interventions Dataset (MSID) Methodology

1) Determined target organizations: For the first version of MSID, we initially targeted 21 organizations for data collection based on their prominent role in market shaping activities over the last decade:

Access to Medicines Foundation (AtM); African Medical Supply Platform; African Vaccine Acquisition Trust (AVAT); Bill & Melinda Gates Foundation; Clinton Health Access Initiative (CHAI); Coalition for Epidemic Preparedness Innovations (CEPI); Developing Countries Vaccine Manufacturers Network (DCVMN); Foundation for Innovative New Diagnostics (FIND); Gavi; Glaxo Smith Kline Vaccines (IFPMA); Global Fund; Innovative Vector Control Consortium (IVCC); MedAccess; Medicines for Malaria Venture (MMV); Medicines Patent Pool (MPP); Pan American Health Organization (PAHO); PATH; Results for Development (R4D); UNICEF; Unitaid; United States Agency for International Development (USAID)

2) Gathered available press releases from each organization’s website for all years in consideration (2012-2023): We did this manually for each organization, recording each press release’s title, web address, and publication date. We excluded a press release at this stage when its title made clear that it did not describe a market shaping intervention.

3) Collected older press releases via a media archive tool: Through a subscription to NexisNews, we accessed press releases that predated those available on organizations’ websites by searching for each organization individually and limiting results to the years in consideration. All search results were downloaded from the NexisNews platform for consideration in MSID. Combined with the data from the previous step, this yielded over 9,500 press releases to review.

4) Manually reviewed and categorized a subset of press releases: Analysts with market shaping experience reviewed a subset of the 9,500 press releases to determine whether each one described a market shaping intervention. About 5,800 press releases were reviewed, leaving about 3,600 unreviewed (addressed in Step 5). Analysts determined that 298 of the reviewed press releases were market shaping interventions. For these press releases, we recorded additional details including the organizations involved, product category (e.g., vaccines, drugs, diagnostics, or devices), market shaping value chain category, intervention type, financial instrument type, amount of money committed, and health area. The market shaping intervention types and their definitions were based on those in the CHAI Market Shaping Framework.

5) Used a machine learning algorithm to categorize the remaining press releases: We built an algorithm to determine whether the remaining 3,600 press releases were market shaping interventions or not. This algorithm can also help reduce manual processing effort for data collection in a future version of MSID. We prepared the 5,800 press releases reviewed in Step 4 for classification by randomly splitting them into a “training” set (80% of the data) and a “testing” set (20% of the data), which were used to evaluate different algorithms and fine-tune the model settings.
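The random 80/20 split described above can be sketched as follows. This is an illustration only, not the code used to build MSID: the function name `split_train_test`, the fixed seed, and the synthetic records are assumptions introduced for the example.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    # A fixed seed (an assumption here) makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the roughly 5,800 manually labeled press releases.
labeled = [{"id": i, "label": "yes" if i % 20 == 0 else "no"} for i in range(5800)]

train, test = split_train_test(labeled)
print(len(train), len(test))  # 4640 1160
```

Because the split is random, the share of “yes” records in each set will vary slightly from run to run unless the seed is fixed.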

Once split, we pre-processed the data into a format compatible with classification algorithms. Categorical data (the global health organization that issued the press release) were converted into binary indicator variables using one-hot encoding.[1] The press release title text was then converted into numbers using a term frequency-inverse document frequency (TF-IDF) vectorizer,[2] which scores how important a specific word is to each document in a collection. For any given word, TF-IDF considers the number of times that word appears in an individual press release title and how many other press releases in the dataset also contain it; larger values indicate that the word is frequent in that title but rare elsewhere in the dataset. We assumed that specific words (e.g., “vaccine,” “agreement,” “access”) would be more frequent in, and more distinctive of, press releases about market shaping interventions than other press releases.
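To make the two encodings concrete, the sketch below implements one-hot encoding and a textbook TF-IDF weight (term frequency times ln(N/df)) in plain Python. It is an illustration under stated assumptions, not the MSID pipeline: the titles and organization list are hypothetical, and the scikit-learn TfidfVectorizer cited in [2] uses a smoothed and normalized variant by default, so its numbers would differ from these.

```python
import math
from collections import Counter

def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]

def tf_idf(titles):
    """Return one {term: weight} dict per title, using tf * ln(N / df)."""
    docs = [t.lower().split() for t in titles]
    n_docs = len(docs)
    # Document frequency: number of titles that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical titles and sources, for illustration only.
titles = [
    "new vaccine supply agreement announced",
    "annual report announced",
    "staff appointment announced",
]
orgs = ["Gavi", "UNICEF", "Unitaid"]

print(one_hot("UNICEF", orgs))  # [0, 1, 0]
w = tf_idf(titles)
print(w[0]["announced"])        # 0.0, because the word appears in every title
print(w[0]["vaccine"] > 0)      # True, because the word appears in one title only
```

A word that appears in every title carries no weight, while a word confined to a few titles is weighted up, which is the property the title-based classifier relies on.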

After data preparation, we fit each classification algorithm on the “training” set and then used the fitted model to predict labels for press releases in the “testing” set (i.e., “yes,” indicating that the press release was about a market shaping intervention, or “no,” indicating that it was not related to market shaping). We calculated accuracy and other evaluation metrics by comparing each model’s predicted label (“yes” or “no”) with the known label in the “testing” set. We evaluated the following classification algorithms: naïve Bayes, support vector machines, logistic regression, k-nearest neighbors, random forests,[3] and XGBoost.[4] Because the dataset was imbalanced (far more press releases were unrelated to market shaping interventions than related to them), precision[5] and recall[6] were the primary metrics used to compare the algorithms.
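The sketch below shows why accuracy alone is a weak guide on imbalanced data and how precision and recall are computed from predicted and known labels. The labels are hypothetical and the helper `precision_recall` is an assumption introduced for this example; it is not taken from the MSID analysis.

```python
def precision_recall(true_labels, predicted_labels, positive="yes"):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Report 0.0 when a ratio is undefined (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical imbalanced testing set: 2 "yes" and 8 "no".
y_true = ["yes", "yes"] + ["no"] * 8
y_all_no = ["no"] * 10                       # a model that never says "yes"
y_model = ["yes", "no", "yes"] + ["no"] * 7  # one hit, one miss, one false alarm

accuracy_all_no = sum(t == p for t, p in zip(y_true, y_all_no)) / len(y_true)
print(accuracy_all_no)                     # 0.8
print(precision_recall(y_true, y_all_no))  # (0.0, 0.0)
print(precision_recall(y_true, y_model))   # (0.5, 0.5)
```

The model that never predicts “yes” scores 80% accuracy yet finds no market shaping press releases, which precision and recall expose immediately.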

We selected XGBoost because, among the algorithms evaluated, it minimized both the number of press releases incorrectly labeled as not market shaping (i.e., false negatives) and the number incorrectly labeled as market shaping interventions (i.e., false positives). Finally, we used the trained XGBoost model to predict whether press releases in the unreviewed subset (about 3,600 records) were market shaping interventions. The model identified 250 “yeses.” To further reduce manual processing effort and the number of potential false positives, we set a 90% predicted probability threshold and carried only the records that met this criterion into the next part of the process.
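Applying a probability threshold of this kind can be sketched as below. The records, the `p_yes` field name, and the choice of “greater than or equal to” at exactly 90% are assumptions made for illustration; the text above does not specify how boundary cases were handled.

```python
def select_for_review(records, threshold=0.90):
    """Keep records whose predicted probability of "yes" meets the threshold."""
    return [r for r in records if r["p_yes"] >= threshold]

# Hypothetical model outputs: predicted probability that each title is a "yes".
scored = [
    {"title": "A", "p_yes": 0.97},
    {"title": "B", "p_yes": 0.62},
    {"title": "C", "p_yes": 0.90},
    {"title": "D", "p_yes": 0.08},
]

default_yes = [r["title"] for r in scored if r["p_yes"] >= 0.5]
high_confidence = [r["title"] for r in select_for_review(scored)]
print(default_yes)      # ['A', 'B', 'C']
print(high_confidence)  # ['A', 'C']
```

Raising the threshold trades recall for precision: fewer records reach manual review, but a borderline record such as B is dropped without being checked.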

6) Categorized outputs from the machine learning algorithm: Analysts manually reviewed 98 press releases flagged by the machine learning algorithm. Of those, we removed 48 false positives, leaving 50 market shaping intervention records. For these records, we recorded the same additional details as in Step 4: the organizations involved, product category (e.g., vaccines, drugs, diagnostics, or devices), market shaping value chain category, intervention type, financial instrument type, amount of money committed, and health area.

7) Combined the categorized records: The final dataset included 348 market shaping intervention announcements: 298 from the manually reviewed subset (Step 4) and 50 identified by the machine learning algorithm and confirmed by analysts (Steps 5 and 6).
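As a quick consistency check on the counts reported in Steps 4 through 7, the arithmetic can be written out as follows. The variable names are illustrative; the numbers are those stated in the text.

```python
# Counts as reported in the methodology steps.
confirmed_from_manual_review = 298  # Step 4
reviewed_from_model = 98            # Step 6: model-flagged records checked by analysts
false_positives_removed = 48        # Step 6

confirmed_from_model = reviewed_from_model - false_positives_removed
final_total = confirmed_from_manual_review + confirmed_from_model
print(confirmed_from_model, final_total)  # 50 348
```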

[1] Scikit-learn: Machine Learning in Python. OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

[2] Scikit-learn: Machine Learning in Python. TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

[3] Scikit-learn: Machine Learning in Python. Supervised Learning (https://scikit-learn.org/stable/supervised_learning.html)

[4] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785

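[5] The share of press releases labeled as market shaping that truly were market shaping. Precision falls as false positives (i.e., articles incorrectly labeled as market shaping) increase.

[6] The share of true market shaping press releases that were labeled as market shaping. Recall falls as false negatives (i.e., market shaping articles incorrectly labeled as unrelated to market shaping) increase.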
