‘Even if all parts of a problem seem to fit together like the pieces of a jigsaw puzzle, one has to remember that the probable need not necessarily be the truth and the truth not always probable.’
SIGMUND FREUD
In January 2020, when the WIV scientists published their first manuscript about SARS-CoV-2, revealing that the most closely related virus, RaTG13, had been sequenced in their laboratory, many readers of the paper began to ponder the possibility that the pandemic might have begun with a lab leak. Such suspicions deepened with the revelation, in the addendum to the paper published the following November, that Dr Shi’s group had been studying at least eight other viruses very closely related to SARS-CoV-2 – and that these eight had also been collected from the same mineshaft in Mojiang where workers had sickened with a mysterious respiratory disease in 2012. It was almost a year since SARS-CoV-2 had been detected, and in that time the Wuhan scientists had not once mentioned these viruses. Even in the addendum they did not describe them, share their sequences or say what they had done with them. As we were writing this book, in the middle of 2021, they had still published very little about these viral cousins of the cause of the pandemic.
The addendum, as readers may recall (see Chapter 1), had cleared up quite a lot. It at last confirmed the internet sleuths’ findings about the illness of the Mojiang miners, the WIV scientists writing ‘we suspected that the patients had been infected by an unknown virus’. It confirmed that thirteen blood samples from four of the patients had been sent to the WIV, but whether they tested positive for antibodies to SARS was unclear. The addendum said no; the 2016 doctoral thesis from the Chinese CDC director’s laboratory said yes, and the 2013 medical thesis said the WIV had found virus antibodies in the patient samples, concluding that the miners had probably been infected by a SARS-like virus from bats. The addendum also corroborated the internet sleuths’ finding that the virus labelled 4991 had been renamed RaTG13, to ‘reflect the bat species, the location, and the sampling year’. It confirmed that 4991 had been described in a 2016 publication and deposited in the GenBank online database in 2016, although neither was cited in the original Nature paper – making it challenging for outsiders to know if the new RaTG13 sequence and the old 4991 sequence had come from the same sample. And finally, once again validating the findings of the internet sleuths, the WIV scientists confirmed that RaTG13 had actually been sequenced in 2018 and not after SARS-CoV-2 had emerged in Wuhan as their paper had implied. Even the admission of the existence of eight other viruses was not entirely surprising: some of the internet sleuths had already made informed guesses as to their existence and identities. So the addendum had not added much that was truly new – it had confirmed the predictions of internet sleuths and concerned scientists.
Thanks to hints in genomic databases, some clues to the identity of the eight coronaviruses had emerged well before the addendum appeared. A digression is necessary here into how genetic sequences are assembled. Think of a bat faecal sample as a huge box in which the pieces of more than one jigsaw puzzle are mixed. These days, to extract the genome sequences of viruses from such samples, a scientist would employ a process called next-generation sequencing (NGS). This generates hundreds of thousands of mixed puzzle pieces, called ‘reads’. These reads have to be carefully pieced together to see which puzzles (genomes) can be more or less completed. Unless you have a great deal of a particular virus in your bat faecal sample, there will often be gaping holes and regions of uncertainty that need checking. The problem is even worse, you can imagine, if the box contains several puzzles that are largely similar to each other, equivalent to several different viruses in the same sample: which pieces belong to which puzzle?
One way to fill these holes is to generate more puzzle pieces from your sample using another approach that specifically seeks out sequences that are close to the spot in the genome where the gap is. This is called amplicon sequencing. (Amplicon sequencing can also be performed if scientists want to look at particular genes, and is not necessarily evidence of full genome sequencing and assembly.) To explain what this means, consider that the way genetic sequences are read is by first rapidly copying lengthy fragments of DNA in a chain-reaction fashion: each new copy serving as a template for more copies, and so on. This generates many copies of the same fragment, effectively amplifying its content so that it can be reliably read by the sequencing machine to generate a letter-by-letter sequence of nucleotides, the basic building blocks of DNA and RNA. The amplified sequence of each fragment is therefore known as an amplicon. The amplicons should overlap with other known pieces of the genome, allowing the reads to be assembled to give a complete end-to-end sequence of the whole genome.
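For readers who think in code, the gap-filling idea can be sketched in a few lines of Python. This is only a toy illustration, not the pipeline used by the WIV or by anyone else mentioned here: the genome length, the coordinates and the function names `find_gaps` and `amplicons_covering` are all invented for the example. It treats each assembled stretch of NGS reads as an interval on the genome, lists the holes that remain, and then asks which amplicons cover each hole.

```python
def find_gaps(genome_length, covered):
    """Return the uncovered (start, end) intervals, given covered intervals.

    Intervals are half-open: (0, 300) covers positions 0..299.
    """
    gaps = []
    position = 0  # first position not yet known to be covered
    for start, end in sorted(covered):
        if start > position:
            gaps.append((position, start))
        position = max(position, end)
    if position < genome_length:
        gaps.append((position, genome_length))
    return gaps


def amplicons_covering(gap, amplicons):
    """Return the amplicons that span the whole of one gap."""
    gap_start, gap_end = gap
    return [(s, e) for s, e in amplicons if s <= gap_start and e >= gap_end]


# Invented coordinates for a 1,000-letter toy genome.
ngs_contigs = [(0, 300), (250, 480), (560, 900)]
amplicons = [(450, 600), (880, 1000), (100, 200)]

gaps = find_gaps(1000, ngs_contigs)
print(gaps)  # [(480, 560), (900, 1000)]
for gap in gaps:
    print(gap, "covered by", amplicons_covering(gap, amplicons))
```

In this toy case two of the three amplicons sit exactly over the holes, while the third lands on sequence that was already known. A real assembler works on the letters themselves rather than on tidy coordinates, but the question it answers is the same.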
Many scientists are fastidious and like to check whether they can assemble a published genome by starting with the mess of puzzle pieces – using both the NGS and amplicon data. In other words, can they reproduce the final genome from scratch without knowing what the puzzle should look like? Such careful researchers are not satisfied by being simply told that someone else has already correctly assembled the genome so there is no need for anyone else to try their hand at it. When the RaTG13 genome was published, some scientists who were trying to reproduce the assembly discovered that only the NGS reads from the sample had been published, but not the amplicon reads. Alina confirmed that at least one US scientist had reached out to the WIV, asking for the amplicon reads in April 2020, to verify the genome sequence of this closest relative to SARS-CoV-2.
On 19 May 2020, the amplicon sequences were deposited into the GenBank database without any fanfare. Internet sleuths soon discovered these new reads and got down to analysing them, sharing their insights on Twitter. The first thing that popped out was the dates attached to the reads. Some had been obtained in June 2017, others in September and October 2018. This was odd at the time (before the addendum) because the wording of the Nature paper had implied, and been taken by many to mean, that the sequencing of the RaTG13 genome had occurred after the outbreak of Covid-19. The key sentences read: ‘We then found that a short region of RNA-dependent RNA polymerase (RdRp) from a bat coronavirus (BatCoV RaTG13) – which was previously detected in Rhinolophus affinis from Yunnan province – showed high sequence identity to 2019-nCoV. We carried out full-length sequencing on this RNA sample.’ This impression was reinforced as late as July 2020, when Dr Peter Daszak – who, remember, had been a long-time collaborator and funder of the virus-hunting work at the WIV – told the Sunday Times that the WIV scientists ‘went back to that sample in 2020, in early January or maybe even at the end of last year, I don’t know. They tried to get full genome sequencing, which is important to find out the whole diversity of the viral genome . . . I think they tried to culture it but they were unable to, so that sample, I think, has gone.’ On 5 July 2020, Alina pointed out on Twitter the mistake in Dr Daszak’s Sunday Times account: ‘I think Daszak was misinformed because the amplicon sequencing data on NCBI clearly shows that the WIV accessed the sample repeatedly in 2017 and 2018 . . . It wasn’t just sitting forgotten in a freezer for 6 years.’
Alerted by the tweets among the internet sleuths – namely Francisco de Ribera and Babarlelephant, whom we will introduce shortly – Alina now also analysed the amplicon sequences for RaTG13. She noticed that the bulk of these reads, coincidentally or not, matched the gaps in the genome (the holes in the puzzle) that had been assembled based on the NGS reads. The short 4991 fragment had been sequenced and published in 2016. Was it possible that this sample had been chosen for next-generation sequencing in the following years, and that the amplicon sequencing was done to patch the holes in the genome?
Later that same month, Dr Shi revealed the sample history of RaTG13 in an interview with Science magazine: ‘As the sample was used many times for the purpose of viral nucleic acid extraction, there was no more sample after we finished genome sequencing, and we did not do virus isolation and other studies on it.’ And, contrary to the impression that readers had received from her Nature paper, ‘in 2018, as the NGS sequencing technology and capability in our lab was improved, we did further sequencing of the virus using our remaining samples, and obtained the full-length genome sequence of RaTG13 except the 15 nucleotides at the 5’ end.’ (A convention in genetics is that sequences are usually written in one direction, from the so-called 5-prime end to the 3-prime end, designated 5’ and 3’, the terms referring to different atoms in the pentagonal structure of sugar molecules.)
Why, if Dr Shi had a full genome for RaTG13 all along, did she not mention it in the paper, but instead first focus on the short fragment? As for whether the sample was ‘gone’, in Dr Daszak’s words, one very small section of the genome, fifteen letters long at the front (5’) end, was not uploaded until 13 October 2020. How did they obtain this new sequence data if the sample had been used up?
One thing about the May upload of RaTG13 amplicon data made the hairs stand up on the back of the neck of one careful observer. The amplicons sequenced in 2017 and 2018 had various labels. Many of the reads from 2017 and 2018 were labelled ‘RaTG13’. However, some of the earliest reads from June 2017 were labelled ‘7896’. What did this mean?
Here began a trail of clues leading to further revelations. It started with a tweet on 2 July 2020 from a Spaniard called Francisco de Ribera, who tweets under the name @franciscodeasis and is a key member of the Drastic team. Who is Ribera? A forty-year-old resident of the Chamberí district of the Spanish capital, Madrid, Ribera lost his job as a technology consultant on 24 March 2020 at the start of the pandemic. Like a lot of people he found he had time on his hands and decided to put it to good use. Drawing on his savings, he resumed work on his PhD in economics at Comillas Pontifical University, to add to his bachelor’s degree in industrial engineering and his master’s degree in business. He also began to dabble in modelling the data about the spread of the virus during the first wave. Handling data came naturally to Ribera, who has worked for banks and investment managers and proved a fiend when it came to Microsoft Excel. On 13 April, he read a CNN news article that said China was restricting academic publications on the origins of SARS-CoV-2. Around the same time, he read Matt’s Wall Street Journal essay on ‘the bats behind the pandemic’ and was struck by the sentence: ‘Unless other evidence emerges, it thus looks like a horrible coincidence that China’s Institute of Virology, a high-security laboratory where human cells were being experimentally infected with bat viruses, happens to be in Wuhan, the origin of today’s pandemic.’ The possibility that the virus might have leaked from that laboratory intrigued Ribera and he began to follow the trail of breadcrumbs online. He knew that an old trick used by company auditors is to pay attention to serial numbers on invoices: a missing number may indicate a missing document. So he started to painstakingly assemble a huge spreadsheet, which he called his ‘big sudoku’, that lists all that is known about every virus sample referred to by the WIV scientists in papers, seminars and genetic databases. By July he had an unrivalled knowledge of the identification data on bat coronaviruses.
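The auditor’s trick is simple enough to write down. The following Python sketch is an illustration only, using made-up invoice numbers and a hypothetical helper called `missing_serials`; Ribera’s own work was done in a spreadsheet, and nothing here reproduces it.

```python
def missing_serials(seen):
    """Return every serial number absent between the lowest and highest seen."""
    present = set(seen)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Made-up invoice numbers: three documents are unaccounted for.
invoices = [1001, 1002, 1004, 1005, 1008]
print(missing_serials(invoices))  # [1003, 1006, 1007]
```

A missing number proves nothing by itself, since a sample may simply have tested negative or never been written up, but it tells the auditor where to look next.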
Ribera’s crucial 2 July tweet is not immediately understandable to those not following the details: ‘One of the inconsistencies of the GenBank upload of Latinne et al. (2020) is precisely an ID collision for sample 7896: – Rhinolophus bat coronavirus HKU2 isolate 7896 . . . Bat betacoronavirus isolate 7896 . . . Could it be somehow related?’ Let us explain what he meant. Ignoring the reference to HKU2, which relates to a different issue, Latinne et al. is a paper that appeared as a preprint on 31 May 2020, with Drs Linfa Wang, Peter Daszak and Shi Zhengli as senior authors and Alice Latinne of the EcoHealth Alliance as the lead author. Entitled ‘Origin and Cross-Species Transmission of Bat Coronaviruses in China’, the Latinne et al. paper was a comprehensive review of the 630 novel bat coronavirus sequences that the WIV-EcoHealth Alliance project had collected from 2010 to 2015 across numerous Chinese provinces, including ninety-seven new SARS-like viruses from thirty-one species of bat. The paper had been submitted on 6 October 2019, before the pandemic began. It was released as a preprint on 31 May 2020 and finally published on 25 August 2020. The data files behind it were processed by GenBank on 7 November 2019, with an embargo until 1 June 2020. Because one of the genes (RdRp) of these viruses had been partially sequenced in each case, the supplementary material included a vast trove of these sequences.
Just as Ribera was getting his head round the data, another Twitter user called Babarlelephant uploaded a tree diagram that he had created from the Latinne et al. data. ‘Babar’ has retained his anonymity. He is French and has a background in maths and computer science; at the start of the pandemic he knew little biology. He spent several months reading biological papers till he knew how to interpret and create the tree diagrams of gene sequences that are a key tool of genomic analysis. In the tree that Babar created, Ribera spotted a surprise. Nestled in the middle of the SARS-like viruses in the branches of the tree right next to the pangolin viruses was a small group of eight very similar viruses. One of them was numbered 7896, the very number that had appeared among the amplicons of our old friend RaTG13.
‘There is a mysterious “7896” label in the reads of RaTG13 that may match with that virus,’ said Ribera on Twitter, implying that the virus named 7896 might be somehow connected to RaTG13. Finding a run of eight similar numbers – 7896, 7905, 7907, 7909, 7921, 7924, 7931 and 7952 – attached to viruses so close to SARS-CoV-2 in the family tree surely merited further inquiry. On 7 July, he mused that 7896 was ‘suspiciously close to RaTG13 in the sarbecovirus tree. Could it be that the 7896 was used somehow as an aid for sequencing RaTG13?’ The next day, 8 July, Ribera asked Dr Daszak, one of the senior authors of Latinne et al., to explain: ‘Hi Peter, do you know why some reads of RaTG13 have a “7896” label in their filenames? There is a “7896” sample in a clade quite close to 4991 that was recently uploaded (only RdRp) by Latinne et al. (2020). Why there is no study on that new lineage within SL CoVs?’ (A clade is a group made up of a common ancestor and all of its descendants.) Dr Daszak’s response was to block Ribera on Twitter.
Twitter user @babarlelephant’s tree from 7 July 2020.
This only made Ribera more intrigued. He began to follow the trail. In his big sudoku of all the samples collected by the WIV over the years, he had begun to work out details of who had collected each sample and where. He did this by combing published papers and other documents for clues to the dates of virus hunting trips to different sites. Each trip to a mine or a cave corresponded to a group of serial numbers. But the exact dates of the visits were not always clear.
Then, on 8 July, up popped the anonymous and indefatigable internet sleuth called the Seeker with a helpful clue. The official database of the Chinese National Genomics Data Center, known as BigD, included more details of dates and locations, the Seeker pointed out. Sure enough, this told Ribera that the sample with serial number 7896 had been collected on 30 April 2015 in Yunnan and the sequence deposited in the database two months before the Latinne et al. preprint had appeared.
Using the details in BigD, and the clues given in two published papers that provided the dates and locations of some visits, Ribera was able to identify which samples came from which location in Yunnan on which date. On 2 August, he found there was a ‘cluster of consecutive samples from 7895 to 7966 (72 samples, 47 positives, 3 co-infections) suggesting a same trip’. It was unlikely to be from the Shitou and Yanzi caves in Jinning, where the bats are mostly R. sinicus and where a different set of serial numbers had just been used for a visit. The profile of bat species looked more like those from the mineshaft in Mojiang. From this and other clues, Ribera gradually became all but certain that the eight new viruses came from the copper mine in Mojiang County, where they had been collected two years later than the RaTG13 sample, in the spring of 2015.
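The logic of grouping serial numbers into probable trips can be sketched as follows. This is a deliberately crude Python illustration, not Ribera’s method: he also relied on dates, locations and bat species from BigD and the published papers, whereas this sketch looks only at how close the numbers are to one another. The function name `cluster_serials` and the threshold of 25 are assumptions made for the example; the serial numbers are those quoted in this chapter.

```python
def cluster_serials(serials, max_gap):
    """Group serial numbers into runs whose neighbours differ by at most max_gap."""
    clusters = []
    for n in sorted(serials):
        if clusters and n - clusters[-1][-1] <= max_gap:
            clusters[-1].append(n)  # close enough: same presumed trip
        else:
            clusters.append([n])    # large jump: start a new cluster
    return clusters


# Serial numbers mentioned in the text: 4991 plus the eight later samples.
positives = [4991, 7896, 7905, 7907, 7909, 7921, 7924, 7931, 7952]
clusters = cluster_serials(positives, max_gap=25)  # threshold is arbitrary
print(clusters)
# [[4991], [7896, 7905, 7907, 7909, 7921, 7924, 7931, 7952]]

# A run of consecutive labels from 7895 to 7966 contains this many samples:
span = 7966 - 7895 + 1
print(span)  # 72
```

With this threshold, the eight numbers fall into a single run, well separated from 4991, which is consistent with their having been collected on a different and later visit.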
Three months after Ribera figured this out and tweeted publicly about it, Dr Shi’s group published the addendum to their Nature paper in which they confirmed almost all the previous discoveries of the Drastic sleuths and added that eight other viruses had been collected from the same location as RaTG13. Eight, note. Given what Ribera had already worked out in his big sudoku, that a mysterious group of eight viruses had been collected from bats in the Mojiang mine in 2015, which were very closely related to RaTG13 and to SARS-CoV-2, we could conclude with something approaching certainty that the 7896 group of viruses were the eight referred to in the addendum. We do not know why the addendum finally brought up the eight viruses. It is possible that as a result of Ribera’s tweets, the existence of the eight SARS-CoV-2-like viruses from the Mojiang mine could no longer be concealed. The addendum mentioned that in total 1,322 samples had been collected from the Mojiang site between 2012 and 2015, of which 284 were alphacoronaviruses and nine betacoronaviruses. Going through his own spreadsheets, Ribera realised he could pin down 1,320 of them, which gave him confidence that his method was working.
On 3 December 2020, Dr Shi spoke at a webinar organised by the French national academy of medicine and the Academie Veterinaire de France. She showed a slide with eight viruses clearly displayed in a family tree close to SARS-CoV-2, with their serial numbers exactly matching the eight that Ribera had pinpointed from Latinne et al.’s data. This was the first time that the existence of the 7896 group of viruses had been acknowledged publicly, as opposed to buried in a database or mentioned but not named in an addendum. The slide also appeared to suggest that all eight were taken from Rhinolophus affinis, the same species of bat as carried RaTG13: they all now had the prefix Ra in front of their serial numbers. This was yet another revelation smuggled out in the most low-key way imaginable. Despite appearing on the screen, the eight were not mentioned in the talk.
Intriguingly, Dr Shi’s slide showed that the eight are not identical. Her chart showed 7909 and 7952 to be at least near identical but different from the other six. This differed from the information that had been given in the partial sequences of the eight uploaded with the Latinne et al. paper, and used to make Babarlelephant’s tree, which showed 7952 and 7931 to be the ones that stood out from the other six. It was not clear which parts of the genomic sequences were being used to create Dr Shi’s tree, but it was probably not the partial RdRp sequences made available by Latinne et al. A bioinformatics consultant, Moreno Colaiacovo, who spotted the slide, pleaded without success on social media for Dr Shi to publish the genome sequences of these eight viruses.
Two weeks later, on 18 December, Dr Shi gave another online lecture at a conference of the European Scientific Working Group on Influenza. She showed the same slide as before but this time she showed another slide that gave away a crucial clue. In a table, two of the eight viruses, 7909 and 7924, were listed as ‘Mojiang bat CoVs’. This confirmed what Ribera had worked out: that the 7896 cluster of viruses had been collected in the Mojiang mine. On 24 February 2021, Dr Shi gave another online talk, at a conference about viruses called ‘Dangerous Liaisons’ organised by Hong Kong University and the Pasteur Research Pole. She showed this very slide again – but this time without the two viruses or the word ‘Mojiang’. On 23 March, Dr Shi gave yet another talk in which one of her slides showed a family tree that included the eight viruses. All the viruses on the tree had their location labelled – either the country or province – except for the eight. Now, however, seven of the eight had a different prefix: Rst, meaning Rhinolophus stheno. The exception, which was still named Ra for R. affinis, was 7909. This was curious because as late as 24 February Dr Shi’s slides had used the prefix ‘Ra’ for all eight. In answer to a question submitted during the seminar, we heard Dr Shi say that she had sequenced their full-length genomes and would publish these soon.
The partial sequences of the eight viruses in Latinne et al. had been embargoed on the GenBank database until June 2020, but they had been uploaded almost seven months earlier on 7 November 2019. Ribera had worked this out from the identification numbers attached to the samples, but it was eventually confirmed to him by GenBank in an email. At the time, the significance of these samples would have seemed small. They were just another set of bat viruses among hundreds and were not especially closely related to the 2003 SARS virus. That month people in Wuhan fell ill with a disease that would in due course make these eight viruses much more relevant. Indeed, in early 2020, these eight viruses would have been the most closely related to SARS-CoV-2, excepting RaTG13. Yet the WIV and EcoHealth Alliance scientists never drew attention to the eight viruses, even as they revised the manuscript of Latinne et al. through the early months of the pandemic. Indeed, looking at a key figure in the Latinne et al. paper that summarised their data, the 7896 cluster juts out right next to SARS-CoV-2, distinct from all other bat viruses the group had uncovered. Despite these eight novel SARS-like viruses being visibly pegged as very closely related to the pandemic virus, the scientists had neglected to label them in the figure or even discuss their provenance in the paper.
Had Babar not taken the trouble to build a family tree of the sequences, the relationship of the eight viruses to SARS-CoV-2 and the Mojiang mine might never have been noticed, as needles in a huge haystack of data. Indeed, it is hard to escape the conclusion that if the WIV-EcoHealth Alliance team had not uploaded these sequences in November 2019, before the pandemic, they might never have published the eight at all. ‘They were forced to disclose the 7896 clade and to concede going to the mine until 2015,’ said Ribera.
In May 2021, Dr Shi and colleagues did at last publish in a preprint the full genome of one of the eight, number 7909, simultaneously renaming it RaTG15 and adding that the genomes of all eight viruses were near identical. This genome was now reported to be less similar to SARS-CoV-2 than both RaTG13 and RmYN02. The date of collection of the eight viruses was clarified in this preprint to have been 29 May 2015.
Pause here to reflect on the situation. A Spanish business consultant working in his spare time painstakingly worked out, no thanks to Dr Shi and Dr Daszak, that they found eight viruses five years ago that are very closely related to the virus causing the pandemic and brought them more than a thousand kilometres to Wuhan from the Mojiang mine. Yet not only did the scientists fail to announce this potentially crucial information at the start of the outbreak, but when they did slip out the existence of the eight in a tranche of data in June 2020, they failed to draw attention to it. Indeed, by leaving off the words ‘SARS-like’ in the database descriptions they obscured the viruses’ relevance, whether intentionally or not. And then when, in November 2020, they did belatedly and obscurely refer to the eight extra viruses in an addendum, they gave no details and disclosed nothing about how closely related they are to SARS-CoV-2, let alone published their genome sequences. This had to be left to ordinary people to figure out for themselves. Belatedly, eighteen months into the pandemic, they published a genome sequence for the eight viruses.
We repeat for emphasis, because even as we write this, the lack of urgency and transparency is so extraordinary that we can barely believe it: the laboratory had been studying eight viruses (in addition to RaTG13) that are very closely related to a then emerging killer virus and yet did not tell the world important details about the viruses and their connection to the Mojiang miners that could shed light on the origin and characteristics of SARS-CoV-2. Imagine how differently the world would have reacted if, in January 2020, the WIV had announced that the closest relatives of the novel coronavirus came from a place where half of the infected people had died from a mysterious respiratory disease – and that nine closest relatives to SARS-CoV-2 had been under study at the Wuhan Institute of Virology long before the Covid-19 outbreak.
What put the final pieces of the 7896 jigsaw in place was the discovery of another batch of doctoral and master’s theses. A source had shared these theses with the Seeker and Francisco de Ribera, who then worked tirelessly to translate and decode their contents. This time, each of the four theses had been supervised by Dr Shi from the WIV. Two were of particular interest and described the Mojiang mine and the 7896 group.
A Master of Engineering thesis submitted by Ning Wang in May 2014 recounted the immediate follow-up of bat viruses found in the Mojiang mine. It described the cases of the Mojiang miners, albeit with the wrong year: ‘In 2011, Yunnan, Mojiang, three miners died from pneumonia . . . the 6 miners were probably infected with pathogens carried by bats.’ The thesis confirmed that the scientists investigated the bat-borne viruses in the mine, collecting seventy-seven and ninety-three bat swab samples in August and September of 2012 respectively. Besides obtaining partial RdRp sequences, they also managed to partially sequence the N (nucleocapsid) and S (spike) genes of some of the samples. They found that a whopping 64 per cent of their samples (109 out of 170) were positive for coronavirus. These sequences spoke of a variety of coronaviruses from the Mojiang mine. One was named 4991 and they successfully amplified its N gene and even translated the nucleocapsid protein from it. This had never been revealed by the WIV, not in the Nature paper, nor in the addendum.
The other relevant chapter of this 2014 thesis described the development of a method for detecting coronavirus from human samples based on the nucleocapsid protein of each virus – these included, as references, a bat SARS-like virus with 98 per cent nucleocapsid protein similarity to the 2003 SARS virus, the MERS virus and common cold coronaviruses. ‘The research method in this paper can also serve as a reference for the study of serological detection methods for novel coronaviruses that are potentially infectious to humans.’ Although the 2014 thesis did not produce a fully reliable test method, it revealed that dozens of blood samples had been taken from patients with fever in different provinces, including Yunnan.
Among the individuals thanked by Ning Wang in the thesis was Dr Linfa Wang, for the guidance he had provided during the course of Ning Wang’s master’s studies. Did Dr Wang know about the Mojiang miners? He told National Geographic in June 2020: ‘When we have prevented small outbreaks, people don’t care. It doesn’t get media attention. In Wuhan, if three people died and it was controlled, would we know it? No. This is happening all the time, it’s just in remote villages where people die. You bury them and end of the story, right?’ Was this an insider reference to the three Mojiang miners who had died in a remote county? In September 2021, it was revealed that Dr Wang had helped with the analysis of the miner cases.
Another thesis, from June 2019, by Yu Ping for a Master of Natural Science in Biochemistry and Molecular Biology was supervised by Dr Cui Jie and Dr Shi Zhengli. This thesis described SARS-like viruses sampled from bats across twenty regions in China between 2011 and 2016 – 2,815 samples from Yunnan (92 SARS-positive); 612 samples from Guangxi (11 SARS-positive); 1,979 samples from Guangdong (33 SARS-positive); and 1,383 samples from Hubei, where Wuhan city is located (3 SARS-positive). They found a total of 170 SARS-like viruses, predominantly from Yunnan province. A key takeaway from the thesis was that SARS-like viruses ‘tend to cluster together more by geographic location than by host species’. The only SARS-like viruses they had found to be capable of using the ACE2 receptor were located in Yunnan province, ‘where the ancestor of the S gene of human SARS coronaviruses is presumed to have originated’. In other words, one was unlikely to find a SARS-like virus that could use ACE2 as far north as Wuhan. Dr Shi said as much in her July 2020 interview with Science magazine, when she emphasised that many years of surveillance in Hubei province had turned up no coronaviruses that were closely related to SARS-CoV-2. A scientist from the Wuhan CDC had also sampled nearly ten thousand bats from Hubei before the pandemic and had not turned up any SARS-CoV-2-like relatives or indeed any SARS-like virus that could use ACE2 as an entry receptor.
Strikingly, the thesis revealed that four SARS-like viruses of key interest – all from southern Chinese provinces – had been selected for whole genome sequencing and the sequences carefully compared against other SARS-like viruses in the literature. Ra4991_Yunnan was among the four. In a table comparing different SARS-like viruses found in China, it was one of two outliers, considerably dissimilar from the 2003 SARS virus. The other was named Rs8561_Guangdong. What is this other novel SARS-like virus that the WIV had found in Guangdong and deemed a top priority for full genome sequencing? Will we ever see its genome sequence?
The 2019 thesis also shone much-needed light on the work that had been done on the 7896 group. It revealed that different parts of the genome had already been sequenced: parts of the spike of 7896, 7905, 7909, 7924 and 7931; as well as the gene known as ORF8 of 7896, 7909 and 7952. In particular, the full RdRp gene (2,757 RNA letters) had been sequenced for all these strains, not just the short 440-letter partial sequences deposited on GenBank for the Latinne et al. paper. It also revealed that primers (short nucleotide fragments used for sequencing) named after 7896 had been used to sequence the spike genes of other SARS-like viruses, although the RaTG13 amplicon reads labelled with the mysterious ‘7896’ had been from outside the spike gene. This confirmed that the 7896 sequence had been used to help obtain the sequences of other closely related SARS-like viruses. Remember that Francisco de Ribera had asked, as long ago as July 2020, ‘Could it be that the 7896 was used somehow as an aid for sequencing RaTG13?’
The 2019 thesis also showed that over the years, the WIV scientists had identified four different lineages of SARS-like viruses. Although the thesis was written before the emergence of SARS-CoV-2, we now know that SARS-CoV-2 belongs to lineage 4; the 2003 SARS virus is in lineage 1. In a family tree showing the four different lineages of SARS-like viruses, only nine viruses were in lineage 4: Ra4991 and the eight from the 7896 group. A map of China showed the geographical location of this special lineage. Unlike the others, lineage 4 was found in only one province: Yunnan. Remember that these had all come from a single mine in which people had sickened with severe respiratory disease that was suspected to be caused by a virus from bats.
Another curiosity now emerged. The author of the thesis, Yu Ping, had also published her work in the journal Infection, Genetics and Evolution shortly before she defended her thesis. The paper had been submitted to the journal in November 2018 and published in February 2019. A near replica of the family tree and map of China from the thesis was published in this manuscript. Except there was no lineage 4. The 2019 journal paper showed only three lineages of SARS-like viruses, even though all nine of the viruses in the fourth lineage had been sampled between 2012 and 2015. There was no discussion of a mine from which top Chinese labs were collecting bat viruses in a quest to understand how human beings had contracted a mysterious pneumonia. There was no discussion of an entirely novel group of SARS-like viruses.
Was the mandate of these virus surveillance programmes not to find pathogens with pandemic potential and alert the world in a timely manner? Why did they obscure the discovery of eight novel sarbecoviruses in the February 2019 paper, in the Latinne et al. paper submitted in October 2019, and even post-Covid in the January 2020 WIV paper?
That the emergence of SARS-CoV-2 did not catalyse the release of information regarding the Mojiang miners and the 7896 group is inexplicable. The broken-down lorries, surveillance cameras and excuses about wild elephants used to prevent journalists from visiting the mine only add to the impression that a part of SARS-CoV-2’s origin story is being covered up. What is new and even more extraordinary is the hint – from the omission of lineage 4 from the early 2019 publication – that the withholding of information perhaps even began before the pandemic.