A recent blog post discussed the gut microbiome test results of someone who sent two samples of the same stool to Thryve. Surprisingly, the tests found different levels for several bacterial strains, even though both samples were collected from the same stool. For example, Alistipes came in at 0.02% in one sample and 7.2% in the other. I'm going to discuss how and why this happens when fecal samples are DNA tested.
Sequence Level Comparison of the Two Samples
To gain insight into what is going on, I compared the two samples based on their counts of distinct bacterial sequences. Before showing the plot, let me explain how this sequence-level analysis differs from the species-, genus-, or family-level tables we often see elsewhere.
When a person sends their stool sample to a microbiome analysis lab, the lab extracts DNA from the sample and sends it to a sequencing facility. This sample received by the sequencing facility is a mixture of DNA from various bacteria present in the stool sample.
A sequencing instrument "reads" those DNA sequences. A bioinformatics program then compares them with the known 16S sequences from different bacteria to compute the proportions of Faecalibacterium, Alistipes, and so on.
Why Would One Stool, Two Samples Have Different Bacteria Percentages?
Here is an important point. The genus- or species-level counts we see for bacteria like Faecalibacterium, Alistipes, etc. are aggregate counts of many distinct sequences, each representing a different bacterium. The 16S mapping process may assign the same name (e.g. Faecalibacterium) to several different sequences. In fact, sequences that differ by as much as 3% may be assigned the same name.
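To make the aggregation concrete, here is a minimal Python sketch of how sequence-level counts could roll up into genus-level percentages. The read counts and the IDs S07 and S11 are made up for illustration, and the function `genus_percentages` is a hypothetical helper, not the actual pipeline.

```python
from collections import defaultdict

# Hypothetical sequence-level counts: (sequence ID, assigned genus, reads).
# All read counts are invented; S07 and S11 are invented IDs.
sequence_counts = [
    ("S25", "Ruminococcus", 120),
    ("S32", "Ruminococcus", 80),
    ("S07", "Alistipes", 40),
    ("S11", "Faecalibacterium", 160),
]


def genus_percentages(rows):
    """Sum reads per genus name and express each as a percentage of all reads."""
    totals = defaultdict(int)
    for _seq_id, genus, reads in rows:
        totals[genus] += reads
    all_reads = sum(totals.values())
    return {genus: 100.0 * reads / all_reads for genus, reads in totals.items()}


percentages = genus_percentages(sequence_counts)
for genus, pct in sorted(percentages.items()):
    print(f"{genus}: {pct:.1f}%")
```

Two distinct sequences end up under the single name Ruminococcus here, so a large change in just one of them would shift the genus-level percentage even if the other stayed the same.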
Given that each type of sequence represents a different bacterium, only a sequence-level comparison between the samples shows the true dynamics of bacteria in each sample, as you see below.

In the above scatter plot, each dot represents a distinct sequence, positioned by its count in each of the two samples. Sequences with low counts in both samples are removed for clarity.
The dots in the plot form two clusters, as shown below. The blue cluster represents bacteria with a closely correlated presence in both samples, whereas the red cluster shows bacteria growing strongly in W3YMU but not in MLFTF.
You also find that the blue cluster has many more dots than the red cluster. That means only a handful of bacteria have rapid growth in W3YMU, but not in MLFTF.
Let us now check how those dots translate into genus- or species-level notation for the bacteria. The following table shows the scientific names of the bacterial sequences with similar levels of presence in both samples.
The S numbers in parentheses are their sequence-level identifiers in our internal database. You will find that multiple sequences (S32, S25) are associated with the name Ruminococcus.

The following table shows the names of bacteria growing rapidly in the W3YMU sample compared to MLFTF.

Let me summarize the observations. A large number of bacteria have similar levels of presence in both samples, and they are shown in the blue cluster. In addition, a small number of bacterial strains grew rapidly in W3YMU compared to MLFTF, and those strains contributed to the differences reported in the blog post mentioned at the start.
We can quantify the level of similarity between two samples by computing the linear correlation coefficient of the counts on a log scale.
This number ranges between -1 and 1, with:
• 1 representing perfect positive linear correlation
• 0 representing no linear correlation
• -1 representing perfect negative linear correlation (i.e. one number goes down as the other goes up)
For MLFTF and W3YMU, this correlation coefficient comes to 0.68. Moreover, if we remove only six sequences from the analysis, the number rises to 0.73. This increase in correlation suggests that only a handful of sequences are responsible for the difference seen at the genus or species level.
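As a rough illustration of this calculation, the following Python sketch computes a Pearson correlation on log-transformed counts and shows how dropping a single outlying sequence can raise it. The counts are invented, and adding a pseudocount of 1 before taking the logarithm is my assumption for handling zero counts. It is not necessarily how the numbers above were produced.

```python
import math


def log_correlation(counts_a, counts_b, pseudocount=1.0):
    """Pearson correlation of log10(count + pseudocount) for paired sequences.

    The pseudocount is an assumption so that zero counts can be log-transformed.
    """
    xs = [math.log10(c + pseudocount) for c in counts_a]
    ys = [math.log10(c + pseudocount) for c in counts_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts of the same six sequences in two samples.
sample_a = [10, 100, 1000, 50, 500, 20]
sample_b = [12, 90, 1100, 40, 450, 2000]  # the last sequence is an outlier

r_all = log_correlation(sample_a, sample_b)
r_trimmed = log_correlation(sample_a[:-1], sample_b[:-1])
print(f"all sequences: {r_all:.2f}, outlier removed: {r_trimmed:.2f}")
```

Working on a log scale keeps a few very abundant sequences from dominating the coefficient, which is why a single sequence that differs by orders of magnitude still stands out clearly.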
Is 0.68 (or 0.73) high or low? What kind of correlation do we expect if we randomly pick two gut samples from unrelated persons? What is the number if both samples are from the same person? We answer these questions in the following section.
Statistical Analysis of Other Gut Samples
To understand the correlation coefficient's general pattern, I analyzed about 500 pairs of samples in the Thryve database, where both samples in each pair came from the same person. Please note that the two samples in a pair may have been collected at different times, unlike MLFTF and W3YMU discussed above.
As a control, I did a similar analysis for ~500 random pairs of samples picked from the Thryve database. For each pair, I computed the correlation coefficient in the same manner as the previous section.
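The pairing step could be sketched in Python roughly as follows. The person and sample IDs are invented, both helper functions are hypothetical, and restricting the control pairs to samples from different people is my assumption about what a sensible control looks like.

```python
import random
from itertools import combinations

# Hypothetical records: (person ID, sample ID). All IDs are invented.
samples = [
    ("p1", "a1"), ("p1", "a2"),
    ("p2", "b1"), ("p2", "b2"),
    ("p3", "c1"), ("p3", "c2"),
]


def same_person_pairs(records):
    """All pairs of samples that share a person ID."""
    return [
        (s1, s2)
        for (p1, s1), (p2, s2) in combinations(records, 2)
        if p1 == p2
    ]


def random_control_pairs(records, n_pairs, seed=0):
    """Randomly drawn pairs of samples, kept only if they come from different people."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    pairs = []
    while len(pairs) < n_pairs:
        (p1, s1), (p2, s2) = rng.sample(records, 2)
        if p1 != p2:
            pairs.append((s1, s2))
    return pairs


matched = same_person_pairs(samples)
control = random_control_pairs(samples, n_pairs=3)
print("same person:", matched)
print("control:", control)
```

Each pair from either list would then get its own correlation coefficient, so the two sets of coefficients can be compared as distributions.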
