Eleonora Presani’s first day as executive director of arXiv was March 16, 2020 – the same day much of New York state went into lockdown.
Although she was new to arXiv, the free and open scientific research database maintained and operated by Cornell, she realized immediately that some things would have to change.
“Basically out of nowhere, we started to see more than 100 [coronavirus-related] papers per week,” Presani said during an Oct. 19 panel, “Rapid Response: How arXiv and Other Open-Access Resources Are Adapting to Research Community Needs During the Pandemic.”
“We had to do something about it,” she said, “to make sure we could process them at the same service level we provide for our other papers.”
The mission of arXiv is to make scholarly research that has not yet been published freely available to researchers around the world. When COVID-19 struck, this mission became even more urgent, as scientists raced to learn more about the coronavirus and develop ways to control the pandemic. But it also highlighted the challenges faced by preprint servers, including arXiv, bioRxiv and medRxiv, in providing easy access to a huge volume of research that has not yet been peer-reviewed while minimizing the spread of potential misinformation.
With nearly 1.8 million articles in its collection, arXiv relies on 185 volunteer moderators, supported by a small staff, to screen submitted papers. But since the majority of its content is in the fields of physics, mathematics and computer science, the organization needed to quickly build up its expertise in COVID-related research.
In addition to creating a new searchable category for coronavirus articles, arXiv hired a Cornell postdoctoral researcher in biomedicine to help screen these articles for potentially misleading content. arXiv has also partnered with other organizations compiling open-access research to make these articles more easily accessible and to speed the pace of discovery worldwide.
At bioRxiv and medRxiv – which saw huge spikes in submissions as COVID-19 spread – moderators soon realized they needed enhanced measures to combat the risk of harmful misinformation. When a paper posted in January claimed that inserts in the coronavirus’ spike protein resembled protein sequences from HIV, technical issues with the paper provoked a storm of criticism that led to its withdrawal, said John Inglis, co-founder of bioRxiv and medRxiv, during the panel.
“But we also noticed during that weekend that the paper had been picked up by conspiracy-oriented websites who were discussing the possibility that the virus had been human-engineered,” Inglis said. “We decided to put up an additional caveat about the use of preprints – a large yellow banner that reminded readers that these were preprints and, among other things, they should not be reported in the U.S. media as established information.”
The sites, hosted by the Cold Spring Harbor Laboratory, also put in place additional screening practices, including two-step scrutiny of every paper. The screeners are looking for issues such as plagiarism and assertions that could harm the public if they turn out to be wrong, Inglis said.
“Part of this was an awareness of the fact that in all the furor about hydroxychloroquine, for example, the run on the drug was such that patients who really needed hydroxychloroquine were denied access to it,” he said. “We were very conscious of not wanting to add to that problem, or create another similar problem.”
Papers from arXiv, bioRxiv and medRxiv are among those included in CORD-19 (COVID-19 Open Research Dataset), a new dataset released so that artificial intelligence can be used to make a vast trove of coronavirus-related research more easily searchable.
“We wanted to take all the research about COVID-19 and coronavirus, going back all the way to the 1960s and 1970s when initial research about coronavirus started getting published, and make that available in a machine readable format,” said panelist Sebastian Kohlmeier, CORD-19 project manager and senior manager at Semantic Scholar, Allen Institute for Artificial Intelligence, which created CORD-19 in partnership with other research groups.
When CORD-19 started in March, it contained 29,000 papers. It now has more than 280,000.
“It’s been absolutely incredible,” Kohlmeier said. “And one of the great driving forces behind making all the content available was a lot of the scientific publishers and preprint servers banded together to make COVID-19-related research open access, which was a very important component to making this research available to scientists who wanted to analyze the entire universe of coronavirus-related research.”