In Black Box We Trust: Machine Learning-Based Record Screening for Systematic Reviews
I had an amazing two days at Search Solutions 2022, with a lot of lively discussions and punches thrown, landed, and missed :D I never miss Search Solutions; it is the only conference I attend every year.
I’m not going to keep you waiting; I’m going straight to the point. It is time for us to start trusting machine learning-based features for the searching and screening stages of systematic reviews. Believe it or not, we will have no choice but to gradually adopt them into our routine workload or fall behind our colleagues.
Using devices and technologies with Machine Learning (ML) features in daily life and work is inevitable. Sometimes you have a choice to use them or not (turn off your smartphone), and sometimes you don’t (your bank’s chatbot, which only connects you to a human once you get mad).
If you are conducting a systematic review, you have a choice to use or not use ML-based technologies. Before writing this post, I spoke to many of my colleagues who are big fans of automation, but being a fan of automation does not necessarily mean they like or use ML-based features!
What is the Difference Between Automation and ML?
Automation of course differs from ML. In automation, the machine follows a pre-set sequence of steps and rules (an algorithm) to achieve a known outcome or complete a task, usually a repetitive one, again and again with no improvement in performance; in ML, the machine learns from data and improves its performance at a task over time.
In a systematic review context, the data could be the records themselves, the words in the records, the density of those words, and the distance of words from each other. If no human decision is involved, this data set could be enough for an unsupervised ML system. If one or more humans screen some records and make include/exclude decisions (that is, label the data), those decisions become part of the data, and that data set could be a case for a semi-supervised or supervised ML system.
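To make the supervised case concrete, here is a minimal sketch of a title/abstract classifier. It assumes Python and scikit-learn, which are my illustrative choices and not what Rayyan or EPPI-Reviewer actually run; the records and labels are invented. The point is only that word-level features plus a handful of human labels are enough to score the remaining records.

```python
# A minimal, illustrative sketch of supervised record screening.
# Assumes Python with scikit-learn installed; the titles/abstracts and
# labels below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Records already screened by a human: text plus include (1) / exclude (0) labels.
screened_texts = [
    "Randomised trial of drug X for condition Y: a systematic evaluation",
    "Proteomics of yeast under heat stress",
    "Cohort study of drug X safety in adults with condition Y",
    "Editorial: conference highlights 2022",
]
screened_labels = [1, 0, 1, 0]

# Unscreened records the model will score.
unscreened_texts = [
    "Drug X versus placebo for condition Y: randomised controlled trial",
    "A new method for sequencing plant genomes",
]

# Words (and word pairs) become TF-IDF features; a logistic regression learns
# which features separate includes from excludes.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(screened_texts, screened_labels)

# Probability of relevance for each unscreened record; a tool could use these
# scores to rank records or to suggest includes/excludes.
for text, prob in zip(unscreened_texts, model.predict_proba(unscreened_texts)[:, 1]):
    print(f"{prob:.2f}  {text[:60]}")
```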
A good example of automation in systematic reviews is using a preset filter in EndNote to find duplicates, or using EndNote’s Find Full Text tool. Record screening (screening the records based on title and abstract) using Rayyan or EPPI-Reviewer is a good example of using ML.
10 Considerations for Adopting ML in Systematic Reviews
Many of the considerations for using search filters probably apply to ML features as well. Below is a list of considerations you can use to assess an ML feature before adopting it, drawn from personal experience and study:
- Control: Switch on/off: Do you have the choice to switch the ML on and off? It is important to have that control. The TV remote control gained acceptance partly because it was called a ‘control’. Humans want to be in control of everything, including people and nature, and that is how we have done things so far!
- Control: Choosing Human over Machine: Do you have the choice to talk to a human or work with a human rather than the Machine if you want to? For example, when the Machine acts as the second reviewer for screening the search results, can you replace it with a human if required?
- Captivity of Negativity: Borrowed from T-Bag in Prison Break: does the Machine help you capture the true negative records (the irrelevant ones), the relevant ones, or both? In medical imaging, the focus is on capturing negatives. For example, mobile apps can detect that a mole is not melanoma (a true negative) with higher specificity and sensitivity than humans, far faster and at little or no cost, so the app tells us not to go to the doctor and not to waste their time. If negative cases don’t go to the doctor, waiting times and queues shrink. That saves lives, time, and money.
- Completeness and Englishness: Do all your records have titles and abstracts in English, are some of them missing abstracts, or are some in languages other than English? In my experience, the English bias continues with ML models. Even though Google’s release of its ML techniques helped the development of some language-agnostic models, their implementation in systematic review automation may take time. Another observation from experience: ML is more likely to treat records with no abstracts as irrelevant, so the completeness of records matters.
- Methods for Developing ML: Just like search filters, the quality and performance of an ML model depend on the quality of the training and test sets, and if labelled data from humans are used, human skill and expertise will have a say in the accuracy of the ML output. Garbage in, garbage out. So in systems such as Rayyan and EPPI-Reviewer, where humans label the data, their skills and expertise will affect the ML’s performance. In short, it is important to train your dragon before riding it.
- Living ML Models: While systematic reviewers are desperately looking for solutions to ‘livingness’, such solutions lie in the constant improvement of ML models as they continue to learn from humans until ML can meet and exceed human accuracy. For example, Rayyan allows you to build a model for your review and update it several times as you go, screening (including/excluding) more and more records. The same applies to EPPI-Reviewer, which allows you to build a model that tells you when you no longer need to continue screening because the rest of the records are irrelevant! It does this by learning from you and showing you only the records that are tricky for the Machine to decide and require a human decision (a sketch of this kind of active-learning loop follows this list). Lovely, isn’t it? Still, you can waste your time and look at those records (done that!), and the PRISMA 2020 flow diagram allows the Machine to exclude records. What more do you want?
- The Machine as Assistant: Believe it or not, from hammers and washing machines to ML-based screening tools, these solutions are part of life. Take it, or continue soaking and scrubbing the pot! ML can act as the second reviewer in screening, like a research assistant.
- Independent Studies and Changing Algorithms: We need more studies for the Machine disbelievers! The studies should look at both accuracy and time rather than just one of them. I recently saw a study showing that Ovid, Covidence, and Rayyan are good at detecting duplicates, but it did not mention how long it would take me to remove the duplicates with Rayyan compared with the EndNote de-duplication tool. We also need to consider that some ML tools use ‘active learning’, which means their performance changes, and mainly improves, as they continuously learn from humans. Assigning a static accuracy or performance measure to them may not be easy. Their performance may get worse if the human is angry, not an expert, drunk, or in love! Garbage in, garbage out. Love in, love out; very subjective decisions.
- Laggards and Luddites: Having lived in Nottingham for about 10 years, I cannot resist mentioning Ned Ludd and ‘Death to Machines’. Based on the Diffusion of Innovations theory, the categories of adopters are innovators, early adopters, early majority, late majority, and laggards. Depending on your research, you can be in any of these categories. I was a laggard with Windows Vista and an early adopter of Rayyan. Both go on my CV. Of course, some innovations die before they make it to us, either because of a lack of funding or a failure to engage the user community. UI/UX and all that.
- Explainability: Is there a way you can ask for a reason for the decision made by the Machine (Explainability), or is it a secret (Black Box)?
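As promised under ‘Living ML Models’, here is a minimal sketch of the kind of active-learning loop such tools use, with uncertainty sampling and a stopping rule. It assumes Python and scikit-learn; the toy records, the simulated ‘human’ labels, and the stopping threshold are my own simplifications, not the actual Rayyan or EPPI-Reviewer algorithms.

```python
# An illustrative active-learning screening loop with uncertainty sampling.
# Data, labels and the stopping rule are simplified stand-ins for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: titles and their "true" relevance, which here stands in for the
# human reviewer's decision whenever the loop asks for a label.
texts = [
    "Randomised trial of drug X for condition Y",
    "Drug X versus placebo in condition Y: multicentre RCT",
    "Editorial: highlights from a search conference",
    "Proteomics of yeast under heat stress",
    "Systematic review protocol: drug X for condition Y",
    "A new method for sequencing plant genomes",
    "Cohort study of drug X safety in condition Y",
    "Opinion piece on open access publishing",
]
human_labels = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = include, 0 = exclude

features = TfidfVectorizer().fit_transform(texts)
labelled = [0, 3]  # start with one include and one exclude already screened
unlabelled = [i for i in range(len(texts)) if i not in labelled]

while unlabelled:
    # Retrain on everything screened so far.
    model = LogisticRegression(max_iter=1000)
    model.fit(features[labelled], human_labels[labelled])
    probs = model.predict_proba(features[unlabelled])[:, 1]

    # Simplified stopping rule: if the model is confident every remaining
    # record is irrelevant, stop and let the human skip the rest.
    if probs.max() < 0.2:
        print("Model suggests the remaining records are irrelevant; stop screening.")
        break

    # Uncertainty sampling: ask the human about the record the model is least sure of.
    pick = unlabelled[int(np.argmin(np.abs(probs - 0.5)))]
    print(f"Please screen: {texts[pick]}")
    labelled.append(pick)   # the "human" answers with human_labels[pick]
    unlabelled.remove(pick)
```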
Explainability and Black Box
What is a Black Box?
In computing and ML, a Black Box is a system or Machine whose inputs and outputs can be viewed and are known, but whose process, mechanism or internal workings are unknown. The main problem with a Black Box is that it breeds distrust: some users do not trust such systems because the Machine’s decisions are not understandable to humans, that is, they lack explainability.
In my talk on “The Futures of Systematic Searching”, I mentioned that explainability is not that important as long as the system works. We eat a vegetarian English breakfast, and the cafe says it is a vegetarian sausage. We believe them and eat the sausage; we don’t know what is in it. We click icons that run hundreds of lines of code. Do we know what those commands are? We take medication that doctors say will work, like vaccines. There are medications on the market with unknown mechanisms of action, and millions of people take them because they work for most people. If that’s not a Black Box, what is? To be honest, we don’t care much about what is in things or how they work, as long as they work. Now, if ML models are just another intervention, why do we care about explainability as long as they work?
McCoy et al. 2022 put it in more philosophical terms:
Ultimately, we conclude that the value of explainability in machine learning for healthcare is not intrinsic, but is instead instrumental to achieving greater imperatives such as performance and trust. We caution against the uncompromising pursuit of explainability, and advocate instead for the development of robust empirical methods to successfully evaluate increasingly inexplicable algorithmic systems.
Religion, EPPI-Reviewer, and Explainability
So if I screen 10% of results and EPPI-Reviewer tells me that the rest are irrelevant:
- I can believe it (unexplainable)
- I continue to screen until I become a believer (unexplainable)
- I calculate word density per record and measure the distances among relevant words, irrelevant words, stop words, and unspecific words to work out what score range a record should fall into to count as relevant or irrelevant; then, based on that reference score range, I assign the records to relevant and irrelevant and ask the human to screen only the ones that fall between the ranges (explainable; a sketch of this kind of scoring follows this list).
- I don’t care, I have a life to waste, and I will screen all the records myself with no help from the Machine. The systematic review is here to save some lives and waste others (explainable).
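For the third option, here is a minimal sketch of what such an explainable scoring rule could look like. It assumes Python; the keyword lists, weights and thresholds are invented for illustration, not a validated method, but every decision the rule makes can be traced back to counted words.

```python
# An illustrative, fully explainable scoring rule for records: count occurrences
# of hand-picked "relevant" and "irrelevant" words and turn them into a score.
# The word lists and thresholds below are invented for illustration.

RELEVANT_WORDS = {"randomised", "trial", "cohort", "systematic"}
IRRELEVANT_WORDS = {"editorial", "opinion", "yeast", "genome"}

def score(record_text: str) -> float:
    """Density-style score: (relevant hits - irrelevant hits) per 100 words."""
    words = record_text.lower().split()
    relevant = sum(w.strip(".,:;") in RELEVANT_WORDS for w in words)
    irrelevant = sum(w.strip(".,:;") in IRRELEVANT_WORDS for w in words)
    return 100 * (relevant - irrelevant) / max(len(words), 1)

def triage(record_text: str, include_at: float = 5.0, exclude_at: float = -5.0) -> str:
    """Assign a record to include/exclude, or send it to a human if the score
    falls between the two reference thresholds."""
    s = score(record_text)
    if s >= include_at:
        return "include"
    if s <= exclude_at:
        return "exclude"
    return "human screening"

print(triage("Randomised trial of drug X: a systematic evaluation"))  # include
print(triage("Editorial: opinion on open access publishing"))         # exclude
print(triage("A study of drug X in adults"))                          # human screening
```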
While I leave you to choose, let’s look at how some religions gained the kind of following any social media influencer dreams of. I have always wondered how many unexplainable miracles Jesus and Muhammad had to show people before they believed, and now we have 2.2 billion Christians and 1.9 billion Muslims. Those unexplainable miracles happened centuries ago and their impact lasts to this day. I look at the Black Box the same way. ML-based solutions may need time to gain acceptance and find followers, but they are here to stay and will find billions of followers. ML will not wait centuries, not even decades. Rayyan already has over 200,000 users. Use it or lose it.
Conclusion
Sorry that I did not go into a deeper discussion of the differences among automation, AI, and ML; I thought you wouldn’t find it useful at this stage. Instead, I have tried to show that we have no choice but to embrace ML-based solutions, with enough real-life examples (religion, food, and medicine) that solutions don’t have to be explainable to be accepted or used.
We should not blindly accept or reject the claims, so I listed 10 considerations for users who intend to use ML-based solutions with caution. While I never thought I would, I currently use any ML-based tool I can to save time. I will update this post as I receive feedback.
I will soon write about other uses of ML in systematic reviews; do we need to design search strategies at all?
What’s your experience with using ML-based solutions for systematic reviewing?
If you liked this blog post, please support me by pressing the green Follow button and signing up so I can write more. Thank you :D