Why can EndNote not find some of the duplicate titles? Can Automation or EndNote help?
Currently, there is no way to identify 100% of duplicate records in EndNote during a systematic review. Do we need to identify and remove 100% of the duplicates? If yes, is it worth it? If no, why bother reading this post :D
Anyway, if you have ever been questioned by fellow review team members about missing a few duplicates among the search results, this post is for you.
Missing a few duplicates is not a big deal and will not invalidate your systematic review.
But why is it that EndNote cannot find all the duplicates no matter how you tweak the duplication filter and combine the fields?
1. To Dash or not to Dash or to double-Dash!
In typography, there is more than one type of dash or hyphen, and they come in different lengths. Besides the plain hyphen (-), we have the en dash (–, about as wide as the letter n) and the em dash (—, about as wide as the letter m). Changing the encoding (ANSI vs Unicode or UTF-8) can make such characters unreadable, and the system may replace them with the replacement character (U+FFFD), a black rhombus with a white question mark.
The databases had no choice but to avoid the surprising conversions that are common with ANSI encoding. So, they started converting the en and em dashes into simple dashes. MEDLINE decided to replace them with double dashes, and Embase decided to go for “space single dash space”. You can imagine the rest! EndNote cannot ignore the dashes in its Find Duplicates function, so it thinks the two records are different.
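For automation developers, here is a minimal Python sketch of how the dash variants described above could be collapsed before comparing titles. The function name `normalise_dashes` and the list of dash characters are my own assumptions for illustration, not something EndNote or the databases provide.

```python
import re

# Dash-like characters often found in titles: hyphen, non-breaking hyphen,
# figure dash, en dash, em dash, horizontal bar, and minus sign.
# (Assumed list; extend it if your records contain other variants.)
DASH_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
_TO_HYPHEN = {ord(ch): "-" for ch in DASH_VARIANTS}


def normalise_dashes(title: str) -> str:
    """Turn every dash variant, '--' and ' - ' into one plain hyphen."""
    # Step 1: map typographic dashes to the plain ASCII hyphen.
    title = title.translate(_TO_HYPHEN)
    # Step 2: collapse runs of hyphens and any spaces around them.
    return re.sub(r"\s*-+\s*", "-", title)


print(normalise_dashes("Aspirin \u2013 a review"))  # en dash version
print(normalise_dashes("Aspirin--a review"))        # double-dash version
print(normalise_dashes("Aspirin - a review"))       # space-dash-space version
```

The result is only meant as a comparison key. It also removes spaces around ordinary hyphens, so do not write it back into the title field.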
2. Other Encoding During Import
It is important to save your txt files in UTF-8 or Unicode format if you want to import them into EndNote. This is especially important if you have records with non-English characters, whether in author names or in the title. If you use ANSI, don’t be surprised to see that all accented characters have been converted into unreadable characters.
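To see what goes wrong, here is a small Python sketch. It assumes that "ANSI" means Windows-1252 (`cp1252`), which is the usual case on Western European Windows systems; if your files use another code page, change the `ansi_codec` argument. Both function names are made up for this example.

```python
def ansi_to_utf8(raw: bytes, ansi_codec: str = "cp1252") -> bytes:
    """Re-encode the bytes of an ANSI export file as UTF-8."""
    return raw.decode(ansi_codec).encode("utf-8")


def misread_as_ansi(text: str, ansi_codec: str = "cp1252") -> str:
    """Show what UTF-8 text looks like when it is wrongly read as ANSI."""
    return text.encode("utf-8").decode(ansi_codec, errors="replace")


# An author name saved as UTF-8 but opened as ANSI turns into gibberish.
print(misread_as_ansi("Müller"))

# An ANSI file converted properly keeps the accented character.
ansi_bytes = "Müller".encode("cp1252")
print(ansi_to_utf8(ansi_bytes).decode("utf-8"))
```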
3. Symbol to Character
The registered trademark symbol (®) is another culprit. Some databases convert it to R, some remove it, and others keep it as it is.
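A possible workaround in code is to drop such marks before comparing titles. The sketch below is only an illustration: the function name is hypothetical, and I am assuming the text versions look like (R) or (TM). A bare R with no brackets cannot be removed safely because it may be a real letter in the title.

```python
import re


def strip_trademark_marks(title: str) -> str:
    """Remove trademark-style symbols and their bracketed text versions."""
    # Registered, trademark, and copyright symbols.
    title = re.sub(r"[\u00ae\u2122\u00a9]", "", title)
    # Assumed plain-text replacements such as (R) and (TM).
    title = re.sub(r"\((?:R|TM)\)", "", title, flags=re.IGNORECASE)
    # Tidy up any double spaces left behind.
    return re.sub(r"\s+", " ", title).strip()


print(strip_trademark_marks("Botox\u00ae for chronic migraine"))
print(strip_trademark_marks("Botox (R) for chronic migraine"))
```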
4. HTML and XML Tags
While both of these markup languages revolutionised the publishing industry, no solution is without its problems. The publishers usually supply the information to the indexing databases in txt, XML, or HTML format. XML is the best so far; one day, I will write about the XML revolution.
The problem is the translation of some of these tags into simple text. For example, the trademark sign can appear as a single symbol (™), similar to ®, but it can also appear as the letters TM in superscript. In HTML and XML, you would probably describe this superscript using tags, <sup>TM</sup>, because HTML and XML are markup languages and not plain txt documents.
Some converters/parsers cannot translate such tags when converting XML/HTML into text, so the tags remain in the record as they are. If the tags are removed in one database and remain in another, EndNote will not detect the same titles as duplicates. See PubMed examples here: https://pubmed.ncbi.nlm.nih.gov/?term=%22%3Csup%3E%22
You may see other formatting tags such as subscript <sub></sub>, bold <b></b>, or inferior (another way of marking subscript) <inf></inf>. See if you can find some of them in Embase and MEDLINE titles.
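If you are writing your own de-duplication script, leftover tags like <sup>, <sub>, <b>, and <inf> can be removed with a short regular expression. This is a sketch under my own assumptions: the tag list is a guess at what you may meet, and the function name is hypothetical.

```python
import re

# Only strip a known list of inline formatting tags (assumed list).
_INLINE_TAGS = re.compile(r"</?(?:sup|sub|inf|b|i)\s*/?>", re.IGNORECASE)


def strip_inline_tags(title: str) -> str:
    """Remove leftover inline formatting tags but keep the text inside them."""
    return _INLINE_TAGS.sub("", title)


print(strip_inline_tags("Ca<sup>2+</sup> signalling in H<inf>2</inf>O"))
```

I deliberately list the tags rather than deleting everything between < and >, because titles can contain real less-than and greater-than signs, as in "p<0.05".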
5. If that looks Greek, then it is
There are 24 Greek letters. Some databases report them as Greek letters (α, β, γ, …) in the title, and some spell out their names (alpha, beta, gamma, …). You can imagine how EndNote can have difficulty finding duplicates in such cases.
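One way an automation tool could handle this is to spell out every Greek letter before comparing titles. Here is a minimal Python sketch; the names `GREEK_TABLE` and `spell_out_greek` are mine, and it only covers the 24 basic letters plus the final sigma, not accented forms.

```python
GREEK_NAMES = (
    "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu "
    "nu xi omicron pi rho sigma tau upsilon phi chi psi omega"
).split()
GREEK_LOWER = "αβγδεζηθικλμνξοπρστυφχψω"

# Map both lowercase and uppercase letters to the same spelled-out name.
GREEK_TABLE = {}
for letter, name in zip(GREEK_LOWER, GREEK_NAMES):
    GREEK_TABLE[ord(letter)] = name
    GREEK_TABLE[ord(letter.upper())] = name
GREEK_TABLE[ord("ς")] = "sigma"  # word-final form of sigma


def spell_out_greek(title: str) -> str:
    """Replace each basic Greek letter with its English name."""
    return title.translate(GREEK_TABLE)


print(spell_out_greek("TNF-α inhibitors and β-blockers"))
```

This does not fix differences in spacing or hyphenation around the letter, for example when one database writes the name with a space before it and another with a hyphen.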
6. Formatting the journal papers and how it affects the title identification
Academic journals are extensions of newspapers and magazines where formatting and presenting the fancy typography and positioning of words are important.
For example, some journals add the type of the paper on the top right or left corner or after the title: Clinical Article, Review, or Case Report.
Indexers who extract title information from the papers or type it into the record fields in the databases may have to follow their gut when there is no standard to follow. The indexer in one database may add the article type to the title after a colon; the indexer in another database may do the same after a full stop! Another indexer may ignore the article type as it is not part of the title! Enjoy the mess!
7. Language of the paper in title
You would imagine that in a database record, the language field is completely different from the title field; however, it happens that in some databases such as Embase, the language of the non-English papers may appear as part of the title between two square brackets: [Spanish]! Since MEDLINE adds the entire English title of non-English records between two square brackets without mentioning the language in the title, EndNote cannot detect these two records from MEDLINE and Embase as duplicates.
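In a script, both bracket habits can be undone before comparison. The sketch below is hedged: I am assuming the language tag sits at the end of the title, the `LANGUAGES` set is a short made-up list you would need to extend, and `unwrap_title` is a hypothetical name.

```python
import re

# Assumed, incomplete list of language names used as bracketed tags.
LANGUAGES = {"chinese", "french", "german", "italian", "japanese",
             "portuguese", "russian", "spanish"}

_WRAPPED = re.compile(r"\[([^\[\]]*)\]\.?")            # whole title in brackets
_TRAILING_TAG = re.compile(r"\s*\[([A-Za-z]+)\]\.?$")  # one bracketed word at the end


def unwrap_title(title: str) -> str:
    """Remove wrapping brackets or a trailing [Language] tag from a title."""
    title = title.strip()
    wrapped = _WRAPPED.fullmatch(title)
    if wrapped:
        # Whole title is in square brackets: keep only the inside.
        return wrapped.group(1).strip()
    tag = _TRAILING_TAG.search(title)
    if tag and tag.group(1).lower() in LANGUAGES:
        # Title ends with a language tag: cut it off.
        return title[: tag.start()]
    return title


print(unwrap_title("[Treatment of asthma in children]."))
print(unwrap_title("Treatment of asthma in children [Spanish]"))
```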
8. To err is human, so is machine
Typos are not rare in any database. Since humans or machines are involved in the indexing, you should expect typos made by typists or caused by the low machine-readability of PDF documents during the OCR process. Spelling errors do not come from the indexing process alone. Authors and editors make mistakes as well.
While it is possible to correct some of these typos post-publication using the erratum mechanism in many journals, if one database does not correct the typos in the titles, EndNote cannot detect two titles as duplicates.
Another odd situation is when the indexer detects an error in the original source and corrects it when indexing! While it seems well-intentioned and sensible, it is a mistake because it creates a unique record. During indexing and cataloguing, we must stay loyal to the original source even if it has a typo in the title. See these 7 randomised controlled trials that report Trail rather than Trial in their titles.
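Exact matching cannot catch typos, but approximate matching can. Here is a rough sketch using `difflib.SequenceMatcher` from Python's standard library. The function name and the 0.95 threshold are my assumptions; a threshold this strict suits long titles, and short titles would need checking by eye.

```python
from difflib import SequenceMatcher


def probably_same_title(a: str, b: str, threshold: float = 0.95) -> bool:
    """Return True when two titles are nearly identical, ignoring case."""
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold


print(probably_same_title(
    "A randomised controlled trial of aspirin",
    "A randomised controlled trail of aspirin",
))
print(probably_same_title("Aspirin for headache", "Statins for stroke"))
```

Treat a match as a candidate for a human to confirm, not as a decision to delete. Two different papers can have nearly identical titles.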
9. Adding Study Group in the Title
Now that you have had a good grant and finished a study, why not add the study group’s name to the title or as the subtitle to create more mess? Again, because of the lack of standards in the indexing process across databases, the study group’s name may appear in the title field rather than the authors' field.
10. Subtitles
When it is unclear where the title ends, it’s easy to add other journalistic parts of the paper as subtitles to the title field. For example, Data Sharing in Medical Research in this paper can be mistaken for part of the title.
11. [References] or Numbers in Square Brackets
PsycINFO adds [References] at the end of the title of the records that have cited references. You may also find numbers in square brackets at the end of the title: [1]. I have no idea why they do this and how it helps. What I know is that they don’t help the de-duplication process. Depending on your export and import formatting, you may be able to get rid of them sometimes.
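If you cannot fix this at export or import, a script can trim these markers. A minimal sketch, assuming the marker is always the last thing in the title (the function name is hypothetical):

```python
import re

# A trailing [References] or a trailing number in square brackets.
_TRAILING_MARKER = re.compile(r"\s*\[(?:References|\d+)\]\s*$")


def strip_reference_marker(title: str) -> str:
    """Remove a trailing [References] or [number] marker from a title."""
    return _TRAILING_MARKER.sub("", title)


print(strip_reference_marker("Sleep and memory in adolescents. [References]"))
print(strip_reference_marker("Sleep and memory in adolescents [1]"))
```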
12. Multi-Letter Indexing
Some of the top journals love to have a lively discussion among the experts on a topic or published research, so they invite commentaries and share them with the authors, who then reply. This process can create several letters and commentaries that have no titles! You can imagine the rest. Is there a standard way to index them? Some databases index them as multiple letters in a single record to make life easier for the indexer. Another database indexes each letter separately, but since they have no default titles, there are several scenarios for what ends up as the title: [no title], the title of the original paper: reply, or just ‘reply’ or ‘comment’.
This is similar to when journals shove several corrigenda or corrections onto one page without assigning a title to each erratum. The databases may index them all as one record or separately.
Only human intelligence can detect such chaos, not EndNote.
13. Erratum Indexing
The way the databases index errata varies. Some index an erratum with only the term erratum as the title. Others add the full original title and then erratum after a full stop or colon. If that’s not enough, some databases change the title of the record for the original publication to add that an erratum appears in another part of the journal.
14. Non-Print Characters
Someone may enter an extra space in the title during indexing and create a unique record that EndNote cannot detect as a duplicate. That’s not all.
If you think there is only one type of space between two words, think again. Since I’m not a typographer, I leave it to you to do the research, but there are the half-space, the non-breaking space, and so on. If you did not know about them, what do you expect from EndNote?
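For those building automation tools, here is a small sketch of how the different spaces could be flattened into ordinary single spaces. It relies on Python's `str.split()`, which treats Unicode whitespace such as the non-breaking space as a separator, and on `unicodedata.category` to drop invisible format characters such as the zero-width space. The function name is my own.

```python
import unicodedata


def normalise_spaces(title: str) -> str:
    """Drop invisible format characters and collapse all whitespace."""
    # Category "Cf" covers zero-width spaces, joiners, and soft hyphens.
    visible = "".join(ch for ch in title if unicodedata.category(ch) != "Cf")
    # split() with no argument splits on any run of Unicode whitespace.
    return " ".join(visible.split())


# Non-breaking space, double space, thin space, and a zero-width space.
messy = "Heart\u00a0failure  in\u2009adults\u200b"
print(normalise_spaces(messy))
```

Removing zero-width characters joins the letters on either side, which is fine for a comparison key but may not be what you want to store back in the record.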
15. Asterisk in Title
Yes, the asterisk is used in academic literature for specific concepts, such as this one, or to refer the reader to a footnote or a remark. The footnote could contain information about a grant or a project. What matters is that some databases will remove that asterisk and create a unique record!
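A blunt but effective trick for asterisks, and for punctuation differences in general, is to compare titles on a key that keeps only letters and digits. The sketch below is one way to do it, not how EndNote works; `title_key` is a hypothetical name. It uses Unicode NFKD normalisation so that accented letters lose their accents before the filtering step.

```python
import unicodedata


def title_key(title: str) -> str:
    """Build a comparison key of lowercase letters and digits only."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    # Keep letters and digits; drop spaces, punctuation, and combining marks.
    return "".join(ch for ch in decomposed.casefold() if ch.isalnum())


print(title_key("Quality of life* in COPD"))
print(title_key("Quality of life in COPD") == title_key("Quality of life* in COPD"))
```

Because the key throws away so much, use it only to group candidate duplicates, and check other fields such as year or authors before removing anything.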
16. Indexing Practice for Non-English Titles
When dealing with non-English records, we can have one or two titles: the original title and the English title. The databases sometimes index the English title as the title, sometimes the original title, and sometimes both titles together in the title field. What if there is no English title? Some databases create a translation; some use the authors' translation of the title (if they could find it somewhere in the paper), and some forget to index the English or original title entirely. There you go, some more duplicates.
Conclusions
Why did I write this post?
- To help the automation programme developers to factor such considerations in
- To help EndNote to make its duplicate detection better
- To help those who try to find all the duplicates consider certain tips when using the eyeball method
- To highlight the lack of indexing standardisation across the bibliographic databases
- To challenge the perfectionist idea that we have to find and delete ALL the duplicates during a single stage and just before screening the titles and abstracts for systematic reviews. I explained before why it is not possible.
So a pragmatic approach to removing duplicates is to find and remove them by tweaking EndNote’s default filter (Edit/Preferences/Duplicates) in several stages (I will write about this). You don’t have to remove 100% of duplicates; just mention to your research team that if they find more duplicates, they can simply exclude them in title-abstract screening. If they are perfectionists, then ask them to record the number of duplicates they have found and add it to the number that you reported to them. Your time is valuable; it should not be wasted on something worthless.