The sequence of the human genome is a living document that catalogs the history of migration, mutation, and environmental stressors that have shaped who we are and how we came to be. Sprinkled throughout the 3.1 billion DNA bases that comprise our genome are tens of thousands of protein-coding genes, regulatory regions that stipulate when, where, and how much of these genes are expressed, palindromic repeats that stretch for thousands of bases at a clip, and even sequences that can move.
These particular sequences, called transposons, enable sections of DNA to pop out of one chromosome and move to another, often resulting in mutation or disease. Together, we all enjoy a rich, mosaic blueprint that guides our cells through growth and development and enables us to pass our genetic information to our children.
Our understanding of this blueprint is still very incomplete and is shaped by our cultural biases. STEM and the biomedical research and clinical enterprise have not been immune to the systemic prejudices and discriminations that have plagued all other facets of our society.
People of color are underrepresented in most academic research departments and this lack of representation has filtered down into how genetic studies were (and still are) designed and conducted.
Historically, genetic research has often reflected the race or ancestry of the laboratory investigator who conducted the study, predominantly the white male. It wasn’t until 2014 that the National Institutes of Health (NIH) mandated that all preclinical research consider sex as a biological factor in human and animal research studies.
This disparity in representation within both the research community and within research studies is a major failing of the biomedical community and on multiple levels has dramatically slowed the discovery of disease-causing alleles and the development of new treatments. Of all published studies linking genetic variation to a particular characteristic, whether it be a physical trait, a disease-causing mutation, or some other physiological outcome, nearly 88 percent of the study participants have been European or white.
Major breakthroughs in our understanding of the genome’s complexity have occurred in the past two decades. But this systemic bias in genomic representation in our studies narrows our ability to interpret the role of genetics in a given physiological context.
Our viewpoint, unsurprisingly, is shaped by what we overwhelmingly know about those with European ancestry. This has influenced the well-documented biases observed with patient treatment in the clinic, particularly with women and people of color. It prejudices our understanding of how an individual may respond to a specific drug treatment and/or the steps we take when we assess their genetic risk for a specific disease.
These barriers will remain until we complete our knowledge of the genome by examining the true diversity of genetic variation within all of the world’s communities. In 2015, the NIH launched the All of Us initiative to collect and sequence the DNA from over one million volunteers.
Tied together with behavioral surveys and electronic health records, the All of Us campaign is meant to better understand how lifestyle, the environment, and our biology interact to shape our health. The initiative is already a quarter of the way toward its target recruitment goal and over 80 percent of participants are from at least one, if not several, of the many underrepresented populations in research.
Capturing genomic variation with additional health data helps clinicians and researchers pin down genomic regions associated with disease or other biological processes. But what exactly is genetic variation? It is important to note that while all humans share the same set of genes and the same genome, the specific DNA base (those A’s, T’s, G’s, and C’s you may be familiar with) at a specific location in the genome sequence can be variable.
Between two unrelated individuals, there are commonly at least four to six million differences in genetic sequence spread throughout the genome. In some cases, these differences could be well over twenty million bases. Many of these variable locations have now been cataloged and are referred to as single nucleotide polymorphisms (SNPs, pronounced “snips”).
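To make the idea of a variable position concrete, here is a minimal Python sketch. It compares two short, made-up DNA strings base by base and reports where they differ, then puts the "four to six million differences" figure in proportion to the 3.1 billion bases mentioned earlier. The sequences and the function name `snp_positions` are illustrative assumptions, not real data or a standard tool.

```python
GENOME_SIZE = 3_100_000_000  # bases, the figure quoted in the text


def snp_positions(seq_a, seq_b):
    """Return 1-based positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical 20-base stretch read from two people (made-up sequences).
person_1 = "ACGTTGCAAGCTTACGGATC"
person_2 = "ACGTTGCATGCTTACGGGTC"

differences = snp_positions(person_1, person_2)
print("Variable positions:", differences)

# Share of the genome that differs if two people vary at 4 to 6 million sites.
low_share = 4_000_000 / GENOME_SIZE
high_share = 6_000_000 / GENOME_SIZE
print(f"Roughly {low_share:.2%} to {high_share:.2%} of all bases")
```

Real comparisons work on aligned reads against a reference genome rather than on raw strings, but the underlying question is the same: at which positions do the bases disagree?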
SNPs are the dominant reason for the beautiful and incredible diversity of phenotypes that humans exhibit. Historically, though there wasn’t a name for them at the time, SNPs are the culprit for the horrific eugenics discussions in America, Germany, and Britain in the early half of the twentieth century.
SNPs are responsible for variation in phenotype and biological function for nearly all pathways in the human body, contributing to differences in skin, eye, and hair color, as well as how we digest foods, how we metabolize drugs, and how we respond to our environment. Of course, SNPs can also influence our susceptibility to disease.
As an example, some of the most well-studied SNPs are located in a gene called haemoglobin beta (HBB). An individual may have the DNA base adenine (A) at position 20 within the gene’s protein-coding sequence, while another individual may have a thymine (T), or perhaps a SNP at a different position in HBB’s sequence. This polymorphism, the T instead of the A, swaps one amino acid for another in the beta globin protein, which ultimately causes their red blood cells to curve into a sickled, abnormal shape. This is the genetic cause of sickle cell disease and leads to anemia and other related health complications.
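A short sketch can show how one changed base alters a protein. It assumes that adenine is the reference base and thymine the variant at position 20 of the coding sequence. The 27-base string is meant to resemble the opening of the HBB coding sequence, but treat it as illustrative rather than a verified reference. The codon table is deliberately partial, and the helper names are hypothetical.

```python
# Partial codon table: only the codons that appear in this example.
CODON_TABLE = {
    "ATG": "Met", "GTG": "Val", "CAT": "His", "CTG": "Leu",
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu", "AAG": "Lys",
}


def translate(cds):
    """Split a coding sequence into 3-base codons and look up each amino acid."""
    usable = len(cds) - len(cds) % 3
    return [CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3)]


def substitute(cds, position, base):
    """Return a copy of cds with the base at a 1-based position replaced."""
    index = position - 1
    return cds[:index] + base + cds[index + 1:]


reference = "ATGGTGCATCTGACTCCTGAGGAGAAG"   # illustrative start of the coding sequence
variant = substitute(reference, 20, "T")   # assumed A-to-T change at position 20

ref_protein = translate(reference)
var_protein = translate(variant)

# List every codon whose amino acid changed: (codon number, before, after).
changed = [
    (i + 1, ref_aa, var_aa)
    for i, (ref_aa, var_aa) in enumerate(zip(ref_protein, var_protein))
    if ref_aa != var_aa
]
print(changed)
```

Under these assumptions the only difference is in the seventh codon, where GAG (glutamic acid) becomes GTG (valine). Every other amino acid in the stretch is untouched, which is why a single SNP can be enough to change how a protein behaves.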
Until the full human genomic sequence was published upon the completion of the Human Genome Project, the identification of SNPs and other disease-causing mutations was a slow, individualized process for diseases like sickle cell anemia, cystic fibrosis, several cancers like breast and colorectal, and others. SNPs were usually identified only once a gene was linked to a disease, and the extent of human variation, including the total number of SNPs in the genome, was still unknown.
In 2002, the International HapMap Project was launched to begin the process of truly investigating the extent of human genetic variation. Four populations were chosen for genomic sequencing during the first phases of the project: Han Chinese living in Beijing, Japanese in Tokyo, members of the Yoruba tribe living in Nigeria, and residents of Utah in the United States with northern or western European ancestry.
The HapMap Project eventually expanded to several other ethnic groups, including families of the Maasai and Luhya tribes in Kenya and families with African ancestry living in the United States. While this international project featured diverse populations, it was still short of worldwide representation.
Cost limitations severely hampered the early efforts to diversify further. In the early 2000s, sequencing the complete genome of an individual cost well over ten million dollars. That price dropped considerably (to approximately one hundred thousand dollars) around 2008 when the 1000 Genomes Project was launched with the goal of sequencing one thousand people from all over the world. Today, technology has improved so much that a single genome can be read for less than one thousand dollars and ambitious projects sequencing over one million people are within our reach. For comparison, companies like 23andMe are even cheaper as they do not sequence the entire genome, but rather provide data on a subset of preselected SNPs.
Having a better grasp of the extent of variation in the human genome has helped scientists sift through the millions of variants to identify novel disease-causing SNPs. A widely used approach is the genome-wide association study (GWAS, pronounced “Gee-woz”).
GWAS are often designed as “case vs. control” studies that leverage large population samples that have been sequenced (either with the full genome known or with just a subset of SNPs). GWAS are performed to identify genetic variants that are overrepresented in individuals with the disease or condition being examined.
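The case vs. control comparison can be sketched for a single SNP. The counts below are invented for illustration, and the function name `allelic_association` is hypothetical. The sketch counts how often the variant allele appears in cases versus controls, then computes an odds ratio and a chi-square statistic for the resulting 2x2 table. Real studies repeat this kind of test across millions of SNPs and correct for the number of tests performed, which this sketch does not do.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds ratio, chi-square statistic (1 degree of freedom), and p-value
    for a 2x2 table of allele counts in cases and controls."""
    counts = (case_alt, case_ref, ctrl_alt, ctrl_ref)
    if min(counts) <= 0:
        raise ValueError("all four cells must be positive for this simple sketch")
    total = sum(counts)
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    chi2 = total * cross ** 2 / (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical allele counts: 500 cases and 500 controls, two alleles per person.
odds_ratio, chi2, p_value = allelic_association(
    case_alt=300, case_ref=700, ctrl_alt=200, ctrl_ref=800
)
print(f"odds ratio = {odds_ratio:.2f}, chi-square = {chi2:.1f}, p = {p_value:.1e}")
```

An odds ratio above 1 means the variant allele is overrepresented among cases. With the same allele frequencies, multiplying every count by ten multiplies the chi-square statistic by ten as well, which is one way to see why larger samples can detect smaller effects.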
The first GWAS was published in 2002 on the heels of the completion of the draft of the human genome and the beginning of HapMap. Researchers in Japan identified several genetic variants found in individuals with Japanese ancestry within the gene lymphotoxin alpha (LTA).
These variants were statistically associated with an increased risk of heart attack. One SNP found within the LTA gene sequence can cause alteration of the expression of that gene and another SNP creates a functional change within the LTA protein. Having two copies of either SNP almost doubles one’s risk for heart attacks.
Because heart attacks are so common, identifying a causative SNP among the millions of SNPs in the genome would be impractical on a SNP-by-SNP basis. GWAS analysis combs through that data in thousands of people together quite effectively. The larger the sample size for a GWAS, the more rigorous the findings, and the higher the likelihood of identifying genetic variants that increase risk for a specific outcome, even if that risk is smaller than 10 percent.
But not all GWAS turn out as groundbreaking as the Japanese study on heart attacks. A downside to this approach is that it is far more likely that a study will uncover SNPs that are merely associated with a particular biological outcome or disease, rather than being the causative agent. These SNPs are often in close genomic proximity to the gene or genetic variant that has the true functional role. Follow-up studies are nearly always necessary to elucidate the true biological role the nearby regions have when a new SNP is associated with a biological outcome.
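Geneticists usually describe this tendency of nearby variants to travel together as linkage disequilibrium, and one common way to quantify it is the r-squared statistic. The sketch below computes it from a made-up list of haplotypes, where each pair records whether the variant allele (1) or the reference allele (0) is present at two neighboring SNPs on the same chromosome copy. The data and the function name `r_squared` are assumptions for illustration.

```python
def r_squared(haplotypes):
    """Linkage disequilibrium (r^2) between two SNPs.

    haplotypes is a list of (allele_at_snp_1, allele_at_snp_2) pairs,
    with alleles coded 0 (reference) or 1 (variant).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # variant frequency at SNP 1
    p_b = sum(b for _, b in haplotypes) / n           # variant frequency at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must vary in the sample")
    d = p_ab - p_a * p_b                              # departure from independence
    return d * d / denominator


# Hypothetical sample: the two variants are usually, but not always, inherited together.
mostly_linked = [(1, 1)] * 40 + [(0, 0)] * 50 + [(1, 0)] * 5 + [(0, 1)] * 5
# Hypothetical sample: the two variants always appear together.
fully_linked = [(1, 1)] * 30 + [(0, 0)] * 70

r2_mostly = r_squared(mostly_linked)
r2_fully = r_squared(fully_linked)
print(f"mostly linked: {r2_mostly:.2f}, fully linked: {r2_fully:.2f}")
```

When r-squared is close to 1, a study that flags one SNP cannot tell from the statistics alone which of the two variants does the biological work. That is why follow-up experiments are needed.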
This brings us back to the diversity problem in science: most GWAS findings have identified disease-causing variants in individuals of European ancestry. These findings are important in that we now understand how the genes in which these SNPs reside function to cause disease, but our interpretation of how these diseases may develop is lensed only through one population. It is a problem that must be addressed.
Many SNPs can actually be protective against diseases, including cancer, Alzheimer’s, heart disease, and even infectious disease. For example, a recent study compared the ability of white blood cells to clear out and destroy Listeria and Salmonella, two bacterial pathogens. Cells were isolated from individuals with either European or African ancestry for this comparison, and those cells with African genomic variation were able to clear out the bacteria faster and had stronger inflammatory responses. This response was genetically controlled, as nearly 10 percent of all genes involved in bacterial clearance had ancestry-associated differences related to their level and duration of expression.
These findings highlight the importance of inclusivity in studies examining the genetic role of our immune system, as we can further understand how our genetic variation influences disease susceptibility and progression. This may also partially explain why some inflammatory diseases are more aggressive in individuals with African ancestry and help us eliminate health disparities rooted in environmental stress and other social determinants of health.
In this time of the COVID-19 pandemic, this information could also assist in understanding how the coronavirus infects our bodies, giving us insight into new vaccine options and helping clinicians better understand how there can be a myriad of symptoms that occur upon infection.
The GWAS Diversity Monitor calculates that over half of all GWAS discovery studies in 2019 were conducted in individuals with European ancestry, with the next highest at 17 percent for studies including individuals of Asian ancestry. African ancestry was represented in only 5 percent of studies in 2019 and accounted for less than 1 percent of all participants of any ancestry in 2019 studies. Yet, it is known that genomes of those with African ancestry have the most variation of any ancestral group and have the most to contribute to our understanding of human variation. It is a sobering statistic and highlights that much more work needs to be done.
Variation in our genome also records our prior historical exposures and influences our assumptions. The abovementioned SNPs responsible for sickle cell disease are classic examples of natural selection in the human population. Those harboring variants in the HBB gene have an innate protection against malaria, a very deadly infectious disease. While many people associate sickle cell disease and malaria resistance with those of sub-Saharan African ancestry, HBB variants are also prevalent in the Middle East, parts of India, and even Greece. Malaria resistance linked to SNPs in other genes related to red blood cell function can also be found in populations in Asia and elsewhere.
Immunity to infectious disease is not the only benefit of exploring and understanding our genomic diversity. It has been well understood that our extinct hominin relatives the Neanderthals and the Denisovans bred with our ancestors tens of thousands of years ago and that this has resulted in gene flow between our species. Nearly all individuals with non-African ancestry carry a small percentage of genetic variation introduced by Neanderthals, and Denisovans have additionally contributed genetic information to the genomes of those with Oceanian ancestry.
A February 2020 study turned much-needed attention toward the African genome and discovered that additional hominin relatives contributed to West African genomic diversity well before Neanderthals even separated from our distant human ancestors. These findings establish a more comprehensive understanding of the origin of humanity. Published in March 2020, an analysis of over nine hundred individual genomes spread across fifty-four geographically—and linguistically—diverse communities identified tens of millions of SNPs and other types of genomic variation that are common in specific populations.
Exploring these differences and similarities will broaden our understanding of ourselves and finally bring the biomedical field into a more equitable place. Genetics has a tainted history of prejudice that still impacts lives today. It is time these issues are always at the front and center of every research discussion so study recruitment and focus ensure that history is not repeated and we can move forward, together.
About the Author
Douglas Dluzen, PhD, is an Assistant Professor of Biology at Morgan State University in Baltimore, MD. He is a geneticist and has studied the genetic contributors to aging, cancer, hypertension, and other age-related diseases. Currently, he studies the biology of health disparities and the microbiome in Baltimore City. He teaches evolution, genetics, and scientific thinking and you can find more about him on Twitter @ripplesintime24. He loves to write science and science fiction while sitting on the couch with his wife Julia, their dog, cat, and newborn son Parker.