The Basics

DNA variants occur at different frequencies in different places around the world, and every marker has its own pattern of geographical distribution. The 23andMe Ancestry Composition algorithm combines information about these patterns with the unique set of DNA alleles in your genome to estimate your genetic ancestry.

Here's an example of a haplogroup, a special kind of DNA marker, that illustrates the idea. This map shows the frequency of the maternal haplogroup H around the world. Haplogroup H is very common in Europe, is also found in Africa and Asia, and is rarely seen in people native to Australia or the Americas.

Worldwide distribution of maternal haplogroup H

The association between this marker and geographic location works in two ways. If you know you have European ancestry, we know that there's a decent chance you have the H haplogroup. And if you have the H haplogroup, we know that your genetic history likely includes at least one European ancestor.

Although we can't locate your ancestry with much precision based on this one DNA marker, we measure hundreds of thousands of DNA markers on the 23andMe platform. If we combine the evidence from many markers, each of which offers a little bit of information about where in the world you're from, we can develop a clear overall picture.
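One way to picture how weak evidence from many markers adds up is a naive Bayes calculation. The sketch below is only an illustration, not the Ancestry Composition method: the marker names, the allele frequencies, the uniform prior, and the assumption that markers are independent are all hypothetical.

```python
import math

# Hypothetical allele frequencies: P(allele | region) for three made-up markers.
ALLELE_FREQS = {
    "marker_1": {"Europe": 0.45, "Asia": 0.40, "Africa": 0.35},
    "marker_2": {"Europe": 0.60, "Asia": 0.50, "Africa": 0.45},
    "marker_3": {"Europe": 0.30, "Asia": 0.20, "Africa": 0.25},
}


def region_posteriors(observed_markers, allele_freqs, regions):
    """Combine per-marker evidence into P(region | markers), assuming a
    uniform prior over regions and independence between markers."""
    # Sum log-likelihoods so that many small probabilities do not underflow.
    log_scores = {
        region: sum(math.log(allele_freqs[m][region]) for m in observed_markers)
        for region in regions
    }
    # Normalise after subtracting the largest score for numerical stability.
    top = max(log_scores.values())
    weights = {r: math.exp(s - top) for r, s in log_scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


regions = ["Europe", "Asia", "Africa"]
one_marker = region_posteriors(["marker_1"], ALLELE_FREQS, regions)
all_markers = region_posteriors(list(ALLELE_FREQS), ALLELE_FREQS, regions)
print(one_marker)
print(all_markers)
```

With these made-up numbers, a single marker barely separates the regions, while three markers together already favor one region more clearly. The same effect compounds over hundreds of thousands of markers.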

Wrinkle #1: People Usually Have Multiple Ancestries

If all of your DNA came from one place in the world, figuring out where you're from would be easy. Recent research has suggested that, for a European person whose entire family comes from the same place, genetic analysis can locate their ancestral home within a range of around 100 miles!

But most people's ancestors come from many places. The technical word for this is admixture: the genetic mixing of previously separate populations. For example, it's common for people of European descent to have ancestry from all around Europe, and Latino people typically have ancestors from the Americas, Europe, and sometimes Africa.

Our Ancestry Composition algorithm handles the challenge of admixture by breaking your chromosomes into short adjacent windows, like boxcars in a train. These windows are small enough that it is generally safe to assume that you inherited all the DNA in any given window from a single ancestor many generations back.

Wrinkle #2: We Don't Know Which DNA Comes From Which Parent

Remember that for each of your 23 chromosome pairs, one chromosome in each pair comes from your mom and the other from your dad. Genotyping chips don't capture information about which markers came from which parent.

Here's a quick example to illustrate this point. Say, for a short stretch of Chromosome 1, you inherited the following genotypes at three consecutive DNA markers:

from Dad: A-T-C
from Mom: G-T-A

When we look at your raw 23andMe data in this spot on Chromosome 1, we'll see the following:

A/G - T/T - A/C

The genotypes where you inherited different variants from mom and dad (in this case, the markers on the ends) are jumbled up. There are two possible "haplotypes" that are consistent with the raw data, and we don't know which one is your actual DNA sequence. It could be:

A-T-A and G-T-C

which happens to be wrong, or it could be:

A-T-C and G-T-A

which is right. The technical term for determining which alleles reside on the same chromosome together is phasing. DNA data like our raw data is called unphased.

So what? This matters because we can learn more from long runs of many DNA markers together than we can learn from individual DNA markers alone. In the above example, the combination A-T-C will generally say more about your ancestry than the A, T, and C say when they are considered separately. Luckily, we can use statistical methods to estimate the phasing of your chromosomes. After phasing your raw data, the Ancestry Composition algorithm calculates ancestry separately for each phased chromosome.
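To make the ambiguity concrete, the short sketch below lists every pair of haplotypes that is consistent with a set of unphased genotypes. It uses the three-marker example from Dad (A-T-C) and Mom (G-T-A) above; the function name is hypothetical, and real phasing relies on statistical methods rather than enumeration.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """List every pair of haplotypes consistent with unphased genotypes.

    `genotypes` is a list of 2-tuples of alleles, one tuple per marker.
    """
    pairs = set()
    # At each marker, either allele may sit on the first chromosome copy.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap_a = "-".join(g[c] for g, c in zip(genotypes, choice))
        hap_b = "-".join(g[1 - c] for g, c in zip(genotypes, choice))
        # The two copies are unordered, so store each pair in sorted order.
        pairs.add(tuple(sorted((hap_a, hap_b))))
    return sorted(pairs)


unphased = [("A", "G"), ("T", "T"), ("A", "C")]
for pair in possible_haplotype_pairs(unphased):
    print(pair)
```

Only the two markers where the parents differ create ambiguity, so this stretch has two candidate pairs. Each additional such marker doubles the number of candidates, which is why enumeration alone does not scale to whole chromosomes.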

The Setup: Defining Ancestry Populations

Prep 1: The Datasets

The Ancestry Composition algorithm calculates your ancestry by comparing your genome to the genomes of people whose ancestries we already know. To make this work, we need a lot of reference data! Our reference datasets include genotypes from 14,437 people who were chosen generally to reflect populations that existed before transcontinental travel and migration were common (at least 500 years ago). However, because different parts of the world have their own unique demographic histories, some Ancestry Composition results may reflect ancestry from a much broader time window than the past 500 years.

Customers comprise the lion's share of the reference datasets used by Ancestry Composition. When a 23andMe research participant tells us they have four grandparents all born in the same country (and the population of that country didn't experience massive migration in the last few hundred years, as happened throughout the Americas and in Australia, for example), that person becomes a candidate for inclusion in the reference data. We filter out all but one of any set of closely related people, since including close relatives can distort the results. And we remove outliers: people whose genetic ancestry doesn't seem to match up with their survey answers. To ensure a representative dataset, we filter aggressively: roughly ten percent of reference dataset candidates don't make the cut.

We also draw from public reference datasets, including the Human Genome Diversity Project, HapMap, and the 1000 Genomes Project. Finally, we incorporate data from 23andMe-sponsored projects, which are typically collaborations with academic researchers. We perform the same filtering on public and collaboration reference data that we do on 23andMe customer data.

Prep 2: Population Selection

The 45 Ancestry Composition populations are defined by genetically similar groups of people with known ancestry. We select Ancestry Composition populations by studying the reference datasets, choosing candidate populations that appear to cluster together, and then evaluating whether we can distinguish those groups in practice. Using this method, we refined the candidate reference populations until we arrived at a set that works well.

Principal components plot of 23andMe reference European populations

Here's an example of one of the diagnostic plots we use to select populations. The genomes in the European reference datasets are plotted using principal component analysis, which shows their overall genetic distance from each other. Each point on the plot represents one person, and we labeled the points with different symbols and colors based on their known ancestry. You can see that people from the same population (labeled with the same symbol) tend to cluster together. Some populations, like the Finns (the blue triangles on the left), are relatively isolated from the other populations. Because Finns are so genetically distinct, they have their own reference population in Ancestry Composition. Most country-level populations, however, overlap to some degree. In these cases, we experimented with different groupings of country-level populations to find combinations that we could distinguish with high confidence.

Some genetic ancestries are inherently difficult to distinguish because the people in those regions mixed throughout history or have shared history. As we obtain more data, populations will become easier to distinguish, and we will be able to report on more populations in the Ancestry Composition report.

Against Bias

Historically, biomedical research has disproportionately focused on participants of European descent. Due to this bias, and to the fact that a large proportion of 23andMe customers have unmixed European ancestry, we have the most reference data from European populations, and we are able to distinguish as many sub-populations from Europe as across all of Asia.

In light of this inequity, the 23andMe Research team is constantly working to acquire new data from diverse populations. Our mission at 23andMe is to help people access, understand, and benefit from the human genome. The best way we can do that for underserved populations is to include their genetic data in our research and in our Ancestry features, maximizing the granularity of Ancestry Composition for all of our customers and helping to combat disparities in genetic science. We have worked proactively to reduce bias in genetics research by initiating projects like the Global Genetics Project, the African Genetics Project, the Population Collaborations Program, and our NIH-funded genetic health resources for African Americans. The genetic data we collect through these initiatives and others like them will help to improve features such as Ancestry Composition and will benefit the scientific community at large.

The Ancestry Composition Algorithm

Overview

The Ancestry Composition algorithm comprises four distinct steps.

First, we use a computational method to estimate the phasing of your chromosomes, that is, to determine the contribution to your genome by each of your parents. Next, we break up the chromosomes into short windows, and we compare your DNA sequence in each window to the corresponding DNA in our reference datasets. We label your DNA with the ancestry whose reference DNA it's most similar to, and then we process those assignments computationally to "smooth" them out. Each step in this process is described in more detail in the following sections.

Step 1: Phasing

Recall wrinkle #2 above. For each customer, we measure a set of genotypes (pairs of alleles). But what we really want is a pair of haplotypes for each chromosome. That is, we want to figure out the series of alleles present on each of your two copies of, for example, chromosome 7: one you received from your mother and one you received from your father. To do so, we first build a very large "phasing reference panel" using data from hundreds of thousands of customers. We then use Eagle (Loh et al., 2016) to phase these individuals jointly. Eagle uses sophisticated statistics and a very clever algorithm to do this. Once we have phased this large collection of customers, we can use the information inferred to efficiently phase new customers.

Step 2: Window Classification

After phasing your chromosomes, we segment them into consecutive windows containing ~300 genetic markers each. We measure between 7,400 and 45,000 markers per chromosome, which translates to 24 to 149 windows, depending on the chromosome's length. We consider each window in turn and compare your DNA to the reference datasets to determine which ancestry most closely corresponds to your DNA.
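The windowing arithmetic can be sketched in a few lines. This is a minimal illustration under stated assumptions: the window size is fixed at exactly 300 markers, and any short remainder at the end of a chromosome is dropped. Both choices are hypothetical, since the text gives only approximate figures.

```python
WINDOW_SIZE = 300  # approximate markers per window, taken from the text


def split_into_windows(markers, window_size=WINDOW_SIZE):
    """Cut an ordered list of markers into consecutive full-size windows.

    Dropping the short trailing remainder is an assumption of this sketch.
    """
    n_full = len(markers) // window_size
    return [
        markers[i * window_size:(i + 1) * window_size]
        for i in range(n_full)
    ]


short_chromosome = list(range(7_400))   # stand-in marker indices
long_chromosome = list(range(45_000))
print(len(split_into_windows(short_chromosome)))
print(len(split_into_windows(long_chromosome)))
```

With exactly 300 markers per window, 7,400 markers give 24 windows and 45,000 give 150. That is close to the quoted range of 24 to 149, which is based on rounded marker counts and an approximate window size.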

There are many ways to assign ancestry to DNA segments based on reference data, and we tried several. The best-performing option was a well-known classification tool called a support vector machine, or SVM. An SVM can "learn" different ancestry classifications based on a set of training examples and then assign new DNA segments to a learned category.

In the case of Ancestry Composition, we train the SVM with reference DNA sequences and tell it which ancestry population those sequences are from. Then, when we look at the DNA from a 23andMe customer with unknown ancestry (like you), we can ask the SVM to classify your DNA for us based on the reference datasets.

We chose an Ancestry Composition algorithm based on SVMs because it performed the best out of all the techniques that we tried. SVMs are also very fast, which is critical for a big and growing database.
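The train-then-classify workflow can be illustrated with a much simpler stand-in. The code below is not an SVM: it is a nearest-neighbor classifier using Hamming distance, chosen only because it runs with the Python standard library. The reference haplotypes, the population labels X and Z, and the function names are all hypothetical toy data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def classify_window(window, references):
    """Return the label of the reference haplotype closest to `window`.

    `references` is a list of (haplotype, population_label) pairs.
    """
    _, best_label = min(references, key=lambda ref: hamming(window, ref[0]))
    return best_label


# Hypothetical labelled reference windows (the "training examples").
REFERENCES = [
    ("ATCGGA", "X"),
    ("ATCGGT", "X"),
    ("GTAGCA", "Z"),
    ("GTAGCC", "Z"),
]

print(classify_window("ATCGCA", REFERENCES))
print(classify_window("GTAGGC", REFERENCES))
```

The first query differs from an X reference at one position and the second differs from a Z reference at one position, so they are labeled X and Z respectively. An SVM plays the same role of mapping a window to a learned category, but learns a decision boundary instead of comparing against every reference.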

Step 3: Smoothing

The SVM classifies each window of your genome independently, creating a "first draft" version of your ancestry result. We use another computational process, called the smoother, to smooth this raw SVM output. The smoother uses a version of a well-known mathematical tool called a Hidden Markov Model to correct, or "smooth," two kinds of errors. Hidden Markov Models are used to analyze sequential data, like biological sequences or recorded speech. As an example, suppose we had three ancestry populations: X, Y, and Z. An example of output from the SVM might look like this:

chromosome 1, parent 1: X - X - X - Z - Z - Z - Y - Z
chromosome 1, parent 2: Z - Z - Z - X - X - X - X - X

The first kind of error the smoother corrects is an unusual assignment in the middle of a run of similar assignments. In the first line above, there's a run of Zs, interrupted by a single Y: Z - Z - Z - Y - Z. It's possible that the lone Y was a close call between Y and Z that went the wrong way. If that were the case, the smoother could correct it to Z - Z - Z - Z - Z.
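A minimal sketch of how a Hidden Markov Model can remove an isolated call, using the Viterbi algorithm. The two parameters are hypothetical values picked for illustration: `p_call_correct` is the chance that a window-level call matches the true ancestry, and `p_stay` is the chance that the true ancestry is unchanged between neighboring windows. This is a toy with three populations, not the smoother's actual model.

```python
import math


def viterbi_smooth(calls, states, p_call_correct=0.8, p_stay=0.95):
    """Return the most likely sequence of true ancestries given noisy calls."""
    n = len(states)
    log_emit_hit = math.log(p_call_correct)
    log_emit_miss = math.log((1 - p_call_correct) / (n - 1))
    log_stay = math.log(p_stay)
    log_move = math.log((1 - p_stay) / (n - 1))

    def emit(state, call):
        return log_emit_hit if state == call else log_emit_miss

    # best[s] = log-probability of the best path ending in state s.
    best = {s: math.log(1 / n) + emit(s, calls[0]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for call in calls[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(
                states,
                key=lambda p: best[p] + (log_stay if p == s else log_move),
            )
            step = log_stay if prev == s else log_move
            new_best[s] = best[prev] + step + emit(s, call)
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the most likely final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


raw = ["X", "X", "X", "Z", "Z", "Z", "Y", "Z"]
print(" - ".join(viterbi_smooth(raw, ["X", "Y", "Z"])))
```

With these parameters, switching ancestry twice to accommodate a single Y costs more than treating the Y as a miscall, so the lone Y becomes Z while the longer run of Xs is kept.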

The second kind of error the smoother corrects arises from the phasing step. Phasing algorithms can make mistakes, known as switch errors, where they mix up the DNA of one parent with that of another. The smoother can switch the ancestry assignments between your mother and your father if it detects one of these errors. In this example, there may be a switch error after the third window. If the switch were reversed, then the runs of Xs and the runs of Zs would stay together. In our simplified example, the smoother might output something like this:

chromosome 1, parent 1: Z - Z - Z - Z - Z - Z - Z - Z
chromosome 1, parent 2: X - X - X - X - X - X - X - X
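One simple way to look for a candidate switch error is sketched below. It is a toy under stated assumptions, not the smoother's actual model: it tries exchanging the two copies' labels before each possible position and keeps the exchange that leaves the fewest ancestry changes along both copies. The function names are hypothetical, and the input assumes the lone Y has already been smoothed to Z.

```python
def count_changes(labels):
    """Number of adjacent windows whose ancestry labels differ."""
    return sum(a != b for a, b in zip(labels, labels[1:]))


def fix_single_switch(copy_1, copy_2):
    """Swap the labels before one position if that reduces ancestry changes.

    Tries every possible switch position and returns the pair of label
    lists with the fewest changes in total (the inputs if no swap helps).
    """
    best = (copy_1, copy_2)
    best_cost = count_changes(copy_1) + count_changes(copy_2)
    for k in range(1, len(copy_1)):
        # Exchange the first k windows between the two copies.
        new_1 = copy_2[:k] + copy_1[k:]
        new_2 = copy_1[:k] + copy_2[k:]
        cost = count_changes(new_1) + count_changes(new_2)
        if cost < best_cost:
            best, best_cost = (new_1, new_2), cost
    return best


parent_1 = ["X", "X", "X", "Z", "Z", "Z", "Z", "Z"]
parent_2 = ["Z", "Z", "Z", "X", "X", "X", "X", "X"]
fixed_1, fixed_2 = fix_single_switch(parent_1, parent_2)
print(" - ".join(fixed_1))
print(" - ".join(fixed_2))
```

Exchanging the first three windows removes every ancestry change, giving one copy that is all Z and one that is all X.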

This example illustrates the purpose of the smoother. But with real data the picture is much messier, and the answers are rarely so clean. So instead of assigning a single ancestry to each window like we did in this example, the smoother estimates the probabilities of each Ancestry Composition population matching each window of DNA. The following picture shows a concrete example:

Example plot of Ancestry Composition assignment probabilities

This is the output of the smoother analysis of one copy of chromosome 2. Starting on the left, there is a short run of pink, then a wider run of green, then another run of pink. In this chart, pink is the color for Sub-Saharan African ancestry, and green is the color for Indigenous American. The y-axis runs from 0 to 100 percent, and it shows the probability that the DNA in that region of the chromosome comes from each Ancestry Composition population. These pink and green regions fill the entire vertical space of the graph, which means that we are 100 percent confident that the DNA in those regions has Sub-Saharan African and Indigenous American genetic ancestry, respectively.

The next region to the right (between positions 50 and 100 on the x-axis) is a stretch of multi-colored blue. The thickest strip at the bottom is dark teal, which is the color for British & Irish. This segment of DNA has somewhere between a 50 percent chance and a 60 percent chance of reflecting British & Irish ancestry. The other shades of blue show that the same DNA segment also has a chance of reflecting Italian, Iberian, or French & German ancestry. If you think back to the haplogroup example above, this result makes sense: it is normal for a DNA marker to match reference DNA from lots of places, even if it matches some places better than others. In this case, the result shows that this DNA segment matches reference DNA from all over Europe. We can very confidently conclude that this stretch of DNA reflects European ancestry, but the evidence isn't strong enough to assign it to one specific region of Europe with high confidence.

Step 4: Aggregation & Reporting

The last step is to summarize the results and display them in your Chromosome Painting. The way we do this is to apply a threshold to the probability plot as in this figure:

Applying a threshold to Ancestry Composition assignment probabilities

The horizontal line in this image indicates a 70 percent confidence threshold, which we will use for this example. You can view your own Chromosome Painting at different confidence thresholds, ranging from 50 percent (speculative) to 90 percent (conservative).

We look across the entire chromosome and ask whether any ancestry has an estimated probability exceeding the specified threshold (in this case 70 percent). In this example, with the exception of the blue European stretch, the ancestry estimates exceed 70 percent over the majority of the chromosome. Each region contributes to your overall Ancestry Composition in proportion to its size. For example, the green Indigenous American segment near the end of this plot makes up about 0.26 percent of the entire genome. Even though there is some probability that the segment comes from a different population, the Indigenous American proportion exceeds the 70 percent threshold, so we add 0.26 percent Indigenous American to the overall Ancestry Composition at this threshold.

In the instance of the European segment, no single ancestry exceeds the lxx percent threshold, so we don't assign that Deoxyribonucleic acid to any fine-grained ancestries. Instead, we refer to our hierarchy of ancestries. In that location is a "Broadly Northern European" ancestry that includes iv fine-level ancestries: British & Irish, Scandinavian, Finnish, and French & German. If, when nosotros add up the contributions of each of these subgroups, the total contribution toward Broadly Northern European exceeds the lxx percent threshold, and then we volition report the region as Broadly Northern European.

In this example, the Broadly Northern European reference populations still don't exceed the 70 percent threshold, but the combined probabilities of all the European populations do. So this region is assigned "Broadly European" ancestry.

We use broad Ancestry Composition categories to avoid making assumptions about your ancestry when your DNA matches several different country-level populations. In regions where no ancestry (including the broad ancestries) exceeds the specified threshold, we report "Unassigned" ancestry. You can see the entire ancestry hierarchy in your Ancestry Composition report by clicking "See all tested populations."
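The threshold-and-roll-up logic described in this step can be sketched as follows. This is an illustration under stated assumptions: the hierarchy shown is only a small, hypothetical slice (the placement of Italian and Iberian under a Southern European group is assumed here), the probabilities are made up to resemble the European segment in the example, and the function name is hypothetical.

```python
# Hypothetical slice of the ancestry hierarchy: child -> parent.
PARENT = {
    "British & Irish": "Broadly Northern European",
    "Scandinavian": "Broadly Northern European",
    "Finnish": "Broadly Northern European",
    "French & German": "Broadly Northern European",
    "Italian": "Broadly Southern European",
    "Iberian": "Broadly Southern European",
    "Broadly Northern European": "Broadly European",
    "Broadly Southern European": "Broadly European",
}


def assign_window(fine_probs, threshold, parent=PARENT):
    """Report the most specific ancestry whose probability exceeds the threshold.

    `fine_probs` maps fine-level populations to probabilities. Probabilities
    are summed up the hierarchy one level at a time until some group exceeds
    the threshold; if none does, the window is "Unassigned".
    """
    level = dict(fine_probs)
    while level:
        name, prob = max(level.items(), key=lambda item: item[1])
        if prob > threshold:
            return name
        # Roll every group up into its parent and try again.
        rolled = {}
        for group, p in level.items():
            if group in parent:
                rolled[parent[group]] = rolled.get(parent[group], 0.0) + p
        level = rolled
    return "Unassigned"


# Made-up probabilities for one window of the blue European stretch.
window = {
    "British & Irish": 0.55,
    "French & German": 0.10,
    "Italian": 0.15,
    "Iberian": 0.15,
}
for threshold in (0.5, 0.7, 0.99):
    print(threshold, assign_window(window, threshold))
```

At a 50 percent threshold this window is reported as British & Irish. At 70 percent, neither British & Irish (0.55) nor the Northern European total (0.65) is enough, so it rolls up to Broadly European (0.95). At 99 percent nothing qualifies and the window is Unassigned.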

Connecting With Close Family

Ancestry Composition is even more powerful if you have a biological parent who is also in the 23andMe database. Click here to learn more about connecting with family and friends.

Connecting with a biological parent greatly simplifies the computational problem of figuring out what DNA you got from which parent (see Step 1: Phasing). That may translate into better Ancestry Composition results, in the sense that you might see more assignment to the fine-resolution ancestries: more Scandinavian, less Northern European.

Why is that? Remember, the smoother (which generates your final Ancestry Composition estimate) has to correct two kinds of errors: those along the chromosome and those between the chromosomes. When your chromosomes are phased using genetic information from your parent, mistakes between the chromosomes (switch errors) are extremely rare, so the smoother can be more confident.

If you connect with one or both of your biological parents, you will get an extra result. You'll be able to see the Parental Inheritance view, which shows your mother's contribution to your ancestry on one side and your father's contribution to your ancestry on the other. We can't provide this view if you don't have a parent connected because we need at least one of your parents to orient the results. Here's an example of what you can learn from Inheritance View: say your Ancestry Composition includes a small amount of Ashkenazi Jewish ancestry. When you look at your Inheritance View, you'll be able to see from which parent you inherited it.

Testing & Validation

Ancestry Composition includes a lot of steps, and each step has to be tested. We've discussed a few of those tests already while explaining our algorithm. In this section, we want to share some test results to give a sense of how well Ancestry Composition works. This section focuses on the final test we run, because that integrates the performance of each of the steps into an overall picture.

This test looks at two classic measures of model performance, precision and recall. These are the standard measurements that researchers use to test how well a prediction system works. Precision answers the question "When the system predicts that a piece of DNA comes from population A, how often is the DNA actually from population A?" Recall answers the question "Of the pieces of DNA that actually are from population A, how often does the system correctly predict that they are from population A?"
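Both definitions translate directly into a few lines of code. The sketch below computes them for one population from paired lists of true and predicted labels; the labels are hypothetical toy data, and the function name is an assumption of this example.

```python
def precision_recall(true_labels, predicted_labels, population):
    """Precision and recall for one population over paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(t == population and p == population for t, p in pairs)
    predicted = sum(p == population for _, p in pairs)
    actual = sum(t == population for t, _ in pairs)
    # Undefined ratios (nothing predicted, or nothing actual) become NaN.
    precision = true_pos / predicted if predicted else float("nan")
    recall = true_pos / actual if actual else float("nan")
    return precision, recall


# Hypothetical true and predicted labels for six pieces of DNA.
truth = ["A", "A", "A", "A", "B", "B"]
calls = ["A", "A", "A", "Unassigned", "A", "B"]
print(precision_recall(truth, calls, "A"))
print(precision_recall(truth, calls, "B"))
```

In this toy data, population B is predicted only once and that prediction is right, so its precision is 100 percent, but one of the two true B pieces was missed, so its recall is only 50 percent.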

There is a tradeoff between precision and recall, so we have to strike a balance between them. A high-precision, low-recall system will be extremely picky about assigning, say, Scandinavian ancestry. The system would only assign DNA as Scandinavian when it is very confident. That will yield high precision (since the assignment of Scandinavian is almost always correct) but low recall, because a lot of true Scandinavian ancestry is left unassigned.

With a low-precision, high-recall system the opposite problem exists. In this case, the system liberally assigns Scandinavian ancestry. Any time a piece of DNA might be Scandinavian, it is assigned that ancestry. This will yield high recall, as most genuine Scandinavian DNA will be labeled accordingly, but low precision, because non-Scandinavian DNA will often be incorrectly labeled Scandinavian.

The ideal system has both high precision and high recall, but that may be impossible in real life. Let's see how Ancestry Composition performs on these metrics. For this quality-control test, we set apart 20 percent of the reference database, approximately 2400 individuals of known ancestry. We trained and ran the entire Ancestry Composition pipeline on the other 80 percent of the reference individuals. Then we treated the "hold-out" 20 percent as though they were new 23andMe customers and used our Ancestry Composition pipeline to calculate their ancestries. Since we know these people's true ancestries, we can check to see how accurate their Ancestry Composition results are. We ran this test five times each at various minimum confidence thresholds, with a different 20 percent held out each time, and then averaged across the 5 tests to give the following results (shown here for a minimum confidence threshold of 50%, which is the default for results shown to customers):

Population                                Precision (%)  Recall (%)
Sub-Saharan African                       99             99
West African                              99             98
Senegambian & Guinean                     99             94
Ghanaian, Liberian & Sierra Leonean       97             88
Nigerian                                  92             98
Northern East African                     99             93
Sudanese                                  96             85
Ethiopian & Eritrean                      96             97
Somali                                    98             92
Congolese & Southern East African         95             100
Angolan & Congolese                       96             100
Southern East African                     92             93
African Hunter-Gatherer                   100            83
East Asian & Indigenous American          99             100
North Asian                               63             82
Siberian                                  98             91
Manchurian & Mongolian                    41             69
Indigenous American                       100            95
Chinese & Southeast Asian                 99             98
Vietnamese                                99             97
Filipino & Austronesian                   95             95
Indonesian, Khmer, Thai & Myanma          94             63
Chinese                                   96             99
Chinese Dai                               94             99
Japanese & Korean                         100            100
Japanese                                  100            100
Korean                                    99             100
European                                  98             100
Northern European                         94             98
British & Irish                           90             95
Finnish                                   96             96
French & German                           81             86
Scandinavian                              97             84
Southern European                         91             89
Greek & Balkan                            92             80
Spanish & Portuguese                      96             94
Italian                                   83             86
Sardinian                                 93             98
Eastern European                          86             91
Ashkenazi Jewish                          99             99
Western Asian & North African             98             93
Northern West Asian                       85             90
Cypriot                                   97             91
Anatolian                                 88             71
Iranian, Caucasian & Mesopotamian         73             91
Arab, Egyptian & Levantine                98             81
Peninsular Arab                           97             70
Levantine                                 97             67
Egyptian                                  77             89
Coptic Egyptian                           99             87
North African                             99             90
Central & South Asian                     99             97
Central Asian, North Indian & Pakistani   95             93
Central Asian                             95             50
Northern Indian & Pakistani               85             88
Bengali & Northeast Indian                91             99
Gujarati Patidar                          100            100
Southern Indian Subgroup                  97             81
Southern South Asian                      92             96
Southern Indian & Sri Lankan              76             95
Malayali Subgroup                         98             70
Melanesian                                100            97

This table shows that our precision numbers are high across the board, mostly above 90 percent, and rarely dipping below 75 percent. That means that when the system assigns an ancestry to a piece of DNA, that assignment is very likely to be accurate. You can also see that as you move up from the sub-regional level (e.g., British & Irish) to the regional level (e.g., Northern European) to the continental level (e.g., European), the precision approaches 100 percent.

It is important to realize that poor recall doesn't mean bad results. Some populations, like Sardinian, are just hard to tell apart from others. When Ancestry Composition fails to assign Sardinian DNA, this doesn't mean that DNA is incorrectly assigned to something else, like Italian. If it were, then the Italian population would have poor precision. Instead, Ancestry Composition often assigns Sardinian DNA to the Broadly Southern European or Broadly European populations.
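The hold-out procedure used for this test (train on 80 percent, evaluate on the remaining 20 percent, repeat five times with a different 20 percent each time) can be sketched as a five-fold split. This is a minimal illustration: the reference IDs, the shuffling seed, and the function name are hypothetical, and the training and evaluation steps themselves are omitted.

```python
import random


def five_fold_splits(individuals, n_folds=5, seed=0):
    """Yield (training, held_out) lists so each person is held out once."""
    shuffled = list(individuals)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled individuals into n_folds roughly equal groups.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, held_out


people = [f"ref_{i}" for i in range(100)]  # hypothetical reference IDs
splits = list(five_fold_splits(people))
print([(len(training), len(held_out)) for training, held_out in splits])
```

Each of the five rounds trains on 80 individuals and holds out 20, and every individual is held out exactly once. Per-population precision and recall from the five rounds would then be averaged.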

The Future of Ancestry Composition

Ancestry Composition has a modular design. This was intentional, because it allows us to improve individual components of the system (like Eagle's phasing reference database or the SVM reference populations) without affecting any of the other steps in the analysis pipeline.

We hope to update Ancestry Composition regularly. When we improve some component of the system or upgrade the reference datasets, your results will automatically be updated. You will be able to see a list of those updates in the Change Log at the bottom of your Ancestry Composition Scientific Details.

Updated October 2020