Machine Learning Algorithms are Confused by Records Without an Abstract, Clinical Trial Registry Records, and Non-English Records During Screening for Systematic Reviews and Evidence Synthesis. Solutions?

Systematic Review Consultants LTD
Sep 25, 2024


The Machine ranks a relevant ClinicalTrials.gov record with no Abstract in the middle of the relevancy rating scale, whereas humans would rate the same record as five-green-star relevant based on the title alone.

I have been screening with many tools for over a decade. In many cases, I have seen a record that humans would judge very relevant from its title alone ranked as not relevant or less relevant by the Machine. So I started looking for patterns. I reported the first set of findings in my previous post on Duplication and Multiplicity; here is the second.

Why Can't Machine Learning Rank Some Relevant Records Correctly?

1. Missing Important Fields of Records

While automation companies may be secretive about their beloved algorithms, we can safely guess that textual information is what matters most to the Machine when it calculates the relevancy rating for a record. Based on the amount of information each field provides, my importance ranking for the textual fields is as follows:

  1. Abstract
  2. Title
  3. Keywords/Subject Headings
  4. Journal Name

Two caveats: (A) this ranking is based on my experience; (B) some programs may ignore the Keywords/Subject Headings and Journal Name.

Based on the list above, which mainly reflects my experience, you can probably guess the most important field for the Machine: the Abstract, not the Title. This is very different for humans, for whom the Title is the most important field.

Machine learning for screening is built on Natural Language Processing (NLP) techniques such as TF-IDF:

The total number of records, the number and density of relevant words in each record, the proportion of relevant words to other words, and the closeness or distance of the relevant words from each other all contribute to a record's relevancy ranking.

Unless the developers adjust for it, the ML model does not care whether the words appear in the Title, the Abstract, or somewhere else in the record. This makes the Abstract the main text driving the ranking, because it contains most of the words in the record. When an ML model does most of its learning from records that have an Abstract, generalizing to records with no Abstract produces unpredictable rankings.

A missing Abstract is my top reason why ML cannot assign the correct rating to relevant records that a human can easily assess from the Title.
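
To make this concrete, here is a minimal sketch, not any vendor's actual algorithm, of how a TF-IDF bag-of-words score treats a title-only record next to a full record. All record texts, and the scikit-learn setup, are illustrative assumptions.

```python
# Minimal illustration only; NOT any screening tool's real model.
# TF-IDF treats a record as one bag of words, so a title-only record
# offers very few terms: its similarity score rests on a handful of
# words and can swing unpredictably from record to record.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical profile of already-included (relevant) records.
included_profile = (
    "Metformin versus placebo in adults with type 2 diabetes: a randomized "
    "controlled trial. Methods: adults were randomized to metformin or "
    "placebo. Results: HbA1c decreased in the metformin arm."
)

# Two equally relevant candidates: one full record, one title-only registry record.
with_abstract = (
    "Metformin add-on therapy in type 2 diabetes: a randomized trial. "
    "Methods: adults with type 2 diabetes received metformin add-on therapy. "
    "Results: HbA1c improved."
)
title_only = "Metformin add-on therapy in type 2 diabetes: a randomized trial."

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([included_profile, with_abstract, title_only])

# Similarity of each candidate to the included-record profile.
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
print(f"full record: {scores[0]:.2f}  title-only record: {scores[1]:.2f}")
```

The exact numbers do not matter; the point is that the title-only record's score is driven by only a few terms, which is why its position in the ranking is hard to predict.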

2. Non-Standard or Non-Journal Items (Clinical Trials Registry Records)

Errata, corrections, and retractions are published in journals, but they do not follow the standard journal format of a proper Title and Abstract, so even though they may contain very relevant information, the Machine can miss them. The main drama, however, comes from Clinical Trial Registry Records, the majority of which come from ClinicalTrials.gov or the WHO ICTRP. Why?

A. Atypical Structure: They don’t have a typical Abstract structure like a journal paper (i.e., Introduction, Methods, Results, and Conclusion in 250–400 words).

B. Information Loss in Export: The structure and amount of information you get when exporting records from ClinicalTrials.gov depend entirely on the format you choose. If you choose RIS, you get 20 data points (fields), of which only the Title and Scientific Title are helpful for the Machine (no Abstract!); if you choose CSV, you get 30 fields, several of which are very helpful for the Machine.

The RIS (RefMan/Reference Manager) export from ClinicalTrials.gov contains 20 fields (as of 20 Sep 2024), but only the Title and Scientific Title are helpful for screening by Machine or human; many other relevant fields are missing from the RIS export.
The CSV export from ClinicalTrials.gov allows 30 fields to be exported, including the PICOS elements. Compare this to the RIS format, with 20 fields of which only the titles are helpful for the relevancy check.

C. Information Loss in Import: Even if you get complete records as CSV (from ClinicalTrials.gov) or XML (from the WHO ICTRP), they are only helpful if they are imported completely into the screening program, and into the relevant record field (i.e., the Abstract). This is where Import Filters in citation managers such as EndNote, or in screening managers, become important.

D. Import into Unhelpful Field: As mentioned above, screening programs work best when the information sits in the Abstract field, so if the exported information is imported into other fields, it may not help the Machine. A sketch of one possible workaround follows.
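
As a rough illustration of one workaround for points B–D, the sketch below reads a ClinicalTrials.gov CSV export and writes an RIS file whose AB (Abstract) field is stitched together from the richer CSV columns, so the screening tool has something in the field it relies on most. The column names ("Study Title", "Brief Summary", "Conditions", "Interventions", "Primary Outcome Measures") and file names are assumptions; check them against your own export before relying on this.

```python
# Sketch only, not an official import filter: convert a ClinicalTrials.gov
# CSV export into RIS so the screening tool receives a populated Abstract
# (AB) field. Column names below are assumed and should be verified
# against the headers in your own CSV export.
import csv

CSV_IN = "ctgov_export.csv"          # hypothetical input file name
RIS_OUT = "ctgov_for_screening.ris"  # hypothetical output file name

# CSV columns stitched into the abstract, in this order (assumed names).
ABSTRACT_COLUMNS = ["Brief Summary", "Conditions", "Interventions",
                    "Primary Outcome Measures"]

with open(CSV_IN, newline="", encoding="utf-8") as f_in, \
     open(RIS_OUT, "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        # Build a surrogate abstract from the richer CSV fields.
        abstract = " ".join(row.get(col, "").strip()
                            for col in ABSTRACT_COLUMNS if row.get(col))
        f_out.write("TY  - JOUR\n")                        # generic record type
        f_out.write(f"TI  - {row.get('Study Title', '')}\n")
        f_out.write(f"AB  - {abstract}\n")                 # the field the ML relies on
        f_out.write(f"AN  - {row.get('NCT Number', '')}\n")
        f_out.write("ER  - \n\n")
```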

3. Non-English Language (More than One Language in Set)

English bias does not disappear. Since English is the language of science, the majority of the records in any systematic review can be expected to be in English, and the discrimination against non-English records continues: because there is not enough material for the Machine to learn from, even the most relevant Chinese record can end up ranked among the least relevant.

Machine Learning cannot allocate the correct relevancy rank to a relevant Chinese record with no abstract.

Who Should Deal with this Problem?

Developers? Well, of course, if only they read this post :D But what would I do if I faced such problems myself? On the other hand, if you are only responsible for the search and not the screening, it is easy to overlook how the poor reviewers will deal with such issues when they have no idea what the problem is. So here is information useful for both Information Scientists and Evidence Synthesists:

Solution 1: “Perfect Bibliographic Record” (PBR)?

In the context of Machine Learning for Systematic Reviews, “Perfect Bibliographic Records” (PBRs) for teaching the Machine are records of journal articles that contain all of the topical textual bibliographic information, i.e., the Title, Abstract, Journal Name, and Keywords/Subject Headings, in one language.

The PBR is a reference or gold-standard concept that lets us compare other records against it and estimate how far they deviate from perfection. It also shows how imperfect our search results are and what performance we should expect from the Machine. Another use of the concept is to try to move our records toward this reference for as long as such a move is cost-effective; once the search results are close to PBRs, we can start screening. Such a move is called Data Cleaning, and I will write about it in another post. In short, we need to make sure the records that go into the screening program are of the highest possible quality for ML purposes.
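
As a sketch of how the PBR idea could be operationalized during Data Cleaning, the following assigns each record a simple completeness score so you can see how far a set deviates from "perfect" before screening. The field names and weights are my own illustrative assumptions, not a published standard.

```python
# Illustrative only: score each record against the "Perfect Bibliographic
# Record" (PBR) idea, i.e. Title, Abstract, Journal Name, and
# Keywords/Subject Headings all present. Weights are assumptions that
# mirror the importance ranking earlier in this post (Abstract first).
from dataclasses import dataclass, field

@dataclass
class Record:
    title: str = ""
    abstract: str = ""
    journal: str = ""
    keywords: list[str] = field(default_factory=list)

WEIGHTS = {"abstract": 0.5, "title": 0.3, "keywords": 0.1, "journal": 0.1}

def pbr_score(rec: Record) -> float:
    """Return 1.0 for a 'perfect' record, less for each missing field."""
    score = 0.0
    score += WEIGHTS["abstract"] if rec.abstract.strip() else 0.0
    score += WEIGHTS["title"] if rec.title.strip() else 0.0
    score += WEIGHTS["keywords"] if rec.keywords else 0.0
    score += WEIGHTS["journal"] if rec.journal.strip() else 0.0
    return score

# Hypothetical examples: a complete journal record and a title-only registry record.
records = [
    Record(title="Metformin in type 2 diabetes", abstract="Background ...",
           journal="Diabetes Care", keywords=["metformin"]),
    Record(title="Metformin add-on therapy in type 2 diabetes (registry record)"),
]
for r in records:
    print(f"{r.title[:45]!r}: PBR score {pbr_score(r):.1f}")
```

Low-scoring records are the ones most likely to receive an unpredictable ranking from the Machine, so they are the first candidates for cleaning or separate handling.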

Solution 2: Separation of Record Subsets Based on Record Types, Abstract Status, or Language

Depending on the number and categories of non-standard records among your search results, one solution is to screen Clinical Trial Registry Records and journal article records as two separate sets. Two ML models are then built, each matching one of these record types. The same can be done for records with and without an Abstract. This is worthwhile only if you have hundreds or thousands of non-standard records.

For mixed-language search results, one solution is to screen the English records with the Machine and the non-English records manually. If the number of non-English records is considerable, the same separation into two sets, with two ML models, can work; a sketch of this splitting step follows.
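
Here is a minimal sketch of the separation step: splitting one record set into subsets by record type, Abstract status, and language before building any models. The record dictionary keys are assumed field names, and `langdetect` is just one third-party language detector you could swap for another.

```python
# Illustrative split of a record set into subsets for Solution 2.
# Keys such as "source", "title", and "abstract" are assumed field names.
from collections import defaultdict
from langdetect import detect  # pip install langdetect (one option among many)

REGISTRY_SOURCES = {"clinicaltrials.gov", "ictrp"}

def subset_key(rec: dict) -> str:
    """Label a record: 'registry', 'no-abstract', 'non-english', or 'standard'."""
    if rec.get("source", "").lower() in REGISTRY_SOURCES:
        return "registry"
    abstract = rec.get("abstract", "").strip()
    if not abstract:
        return "no-abstract"
    try:
        language = detect(rec.get("title", "") + " " + abstract)
    except Exception:          # langdetect raises on empty or very short text
        language = "unknown"
    return "standard" if language == "en" else "non-english"

def split(records: list[dict]) -> dict[str, list[dict]]:
    """Group records so each subset can get its own ML model or manual screening."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[subset_key(rec)].append(rec)
    return dict(subsets)
```

Each subset can then go to its own ML model (registry records, records without an Abstract) or to manual screening (a small non-English subset), as described above.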

Solution 3: Human-Only Screening for Non-Standard Records

If there are only a few non-standard records and a human can handle them rapidly, why not screen those records quickly outside the Machine and take only the standard records to the Machine?

Final Words

  • ML algorithms perform better when the records are Standardized (journal article format), English, and Complete (having title and abstract as the minimum).
  • This does not mean you should spend a huge amount of time creating Perfect Bibliographic Records, even though that is one of the solutions; it does mean you need to be aware of ML's limitations in dealing with incomplete or non-standard records.
  • If the number of incomplete or non-standard records is low, screen them manually without the Machine’s help; however, if their number is high, consider screening them in a separate set and building a specific ML model for them.
  • Regardless, you can always ignore these solutions and keep screening the records as they are, bearing in mind that ML has its limitations.
  • ML features in automation programs are developing rapidly, and we hope to see their performance on non-standard records improve.

If supervised ML has such limitations, what would the limitations be for LLMs such as GPT and Claude? I will write about this.

If you liked this blog post, please support me by pressing the green Follow button and signing up so I can write more. Email Subscription may not function well. Thank you :D

Written by Systematic Review Consultants LTD

Evidence Synthesis, Systematic Review, and GRADE Services for Clinical Practice Guidelines, HEOR, and HTA https://systematicreview.info/