InStrain¶
InStrain is a tool for analysis of co-occurring genome populations from metagenomes that allows highly accurate genome comparisons, analysis of coverage, microdiversity, and linkage, and sensitive SNP detection with gene localization and synonymous non-synonymous identification
Source code is available on GitHub.
Publication is available in Nature Biotechnology and on bioRxiv
See links to the left for Installation instructions
Bugs reports and feature requests can be submitted through GitHub.
InStrain was developed by Matt Olm and Alex Crits-Christoph in the Banfield Lab at the University of California, Berkeley.
Contents¶
Installation¶
Installation¶
InStrain is written in python. There are a number of ways that it can be installed.
pip¶
To install inStrain using the PyPi python repository, simply run
$ pip install instrain
That’s it!
Pip is a great package with many options to change the installation parameters in various ways. For details, see pip documentation
bioconda¶
To inStrain inStrain from bioconda, run
$ conda install -c conda-forge -c bioconda -c defaults instrain
From source¶
To install inStrain from the source code, run
$ git clone https://github.com/MrOlm/instrain.git
$ cd instrain
$ pip install .
Pre-built genome database¶
An established set of public genomes can be downloaded for your inStrain analysis at the following link - https://doi.org/10.5281/zenodo.4441269. See Tutorial #2 in Tutorial for usage instructions.
Docker image¶
A Docker image with inStrain and dependencies already installed is available on Docker Hub at mattolm/instrain. This image also has a wrapper script in it to make it easier to use inStrain with AWS. See the docker folder of the GitHub page for use instructions.
Glossary & FAQ¶
Glossary of terms used in inStrain¶
Note
This glossary is meant to give a conceptual overview of the terms used in inStrain. See Expected output for explanations of specific output data.
- ANI
- Average nucleotide identity. The average nucleotide distance between two genomes or .fasta files. If two genomes have a difference every 100 base-pairs, the ANI would be 99%
- conANI
- Consensus ANI - average nucleotide identity values calculated based on consensus sequences. This is commonly reported as “ANI” in other programs. Each position on the genome is represented by the most common allele (also referred to as the consensus allele), and minor alleles are ignored.
- popANI
Population ANI - a new term to describe a unique type of ANI calculation performed by inStrain that considers both major and minor alleles. If two populations share any alleles at a loci, including minor alleles, it does not count as a difference when calculating popANI. It’s easiest to describe with an example: consider a genomic position where the reference sequence is ‘A’ and 100 reads are mapped to the position. Of the 100 mapped reads, 60 have a ‘C’ and 40 have an ‘A’ at this position. In this example the reads share a minor allele with the reference genome at the position, but the consensus allele (most common allele) is different. Thus, this position would count as a difference in conANI calculations (because the consensus alleles are different) and would not count as a difference in popANI calculations (because the reference sequence is present as an allele in the reads). See Important concepts for examples.
Representative genomes (RGs) are genomes that are used to represent some taxa. For example you could have a series of representative genomes to represent each clade of E. coli (one genome for each clade), or you could have one representative genome for the entire species of E. coli (in that case it would be a Species Representative Genome (SRG)). The base unit of inStrain-based analysis is the representative genome, and they are usually generated using the program dRep
- Species representative genome
- A Species Representative Genome (SRG) is a representative genome that is used to represent an entire single microbial species.
- Genome database
- A collection of representative genomes that are mapped to simultaneously (competitive mapping).
- nucleotide diversity
- A measurement of genetic diversity in a population (microdiversity). We measure nucleotide diversity using the method from Nei and Li 1979 (often referred to as ‘pi’ π in the population genetics world). InStrain calculates nucleotide diversity at every position along the genome, based on all reads, and averages values across genes / genomes. This metric is influenced by sequencing error, but within study error rates should be consistent and this effect is often minor compared to the extent of biological variation observed within samples. This metric is nice because it is not affected by coverage. The formula for calculating nucleotide diversity is the sum of the frequency of each base squared: 1 - [(frequency of A)^2 + (frequency of C)^2 + (frequency of G)^2 + (frequency of T)^2 ].
- microdiversity
- We use the term microdiversity to refer to intraspecific genetic variation, i.e. the genetic variation between cells within a microbial species.
- clonality
- The opposite of nucleotide diversity (1 - nucleotide diversity). A deprecated term used in older versions of the program.
- SNV
- Single nucleotide variant. A single nucleotide change that is present in a faction of a population. Can also be described as a genomic loci with multiple alleles present. We identify and call SNVs using a simple model to distinguish them from errors, and more importantly in our experience, careful read mapping and filtering of paired reads to be assured that the variants (and the reads that contain them) are truly from the species being profiled, and not from another species in the metagenome (we call it ‘mismapping’ when this happens). Note that a SNV refers to genetic variation within a read set.
- SNS
- Single nucleotide substitution. A single nucleotide change that has a fixed difference between two populations. If the reference genome has a ‘A’ at some position, but all of the reads have a ‘C’ at that position, that would be a SNS (if half of the reads have an ‘A’ and half of the reads have a ‘C’, that would be an SNV).
- divergent site
- A position in the genome where either an SNV or SNS is present.
- SNP
- Single Nucleotide Polymorphism. In our experience this term means different things to different people, so we have tried to avoid using it entirely (instead referring to SNSs, SNVs, and divert sites).
- linkage
- A measure of how likely two divergent sites are to be inherited together. If two alleles are present on the same read, they are said to be “linked”, meaning that they are found together on the same genome. Loci are said to be in “linkage disequilibrium” when the frequency of association of their different alleles is higher or lower than what would be expected if the loci were independent and associated randomly. In the context of microbial population genetics, linkage decay is often used as a way to detect recombination among members of a microbial population. InStrain uses the metrics r2 (r squared) and D’ (D prime) to measure linkage.
- coverage
- A measure of sequencing depth. We calculate coverage as the average number of reads mapping to a region. If half the bases in a scaffold have 5 reads on them, and the other half have 10 reads, the coverage of the scaffold will be 7.5
- breadth
- A measure of how much of a region is covered by sequencing reads. Breadth is an important concept that is distinct from sequencing coverage, and gives you an approximation of how well the reference sequence you’re using is represented by the reads. Calculated as the percentage of bases in a region that are covered by at least a single read. A breadth of 1 means that all bases in a region have at least one read covering them
- expected breadth
- The breadth that would be expected if reads are evenly distributed along the genome, given a specific coverage value. Based on the function breadth = 1 - e ^{-0.883 * coverage}. This is useful to establish whether or not the scaffold is actually in the reads, or just a fraction of the scaffold. If your coverage is 10x, the expected breadth will be ~1. If your actual breadth is significantly lower then the expected breadth, this means that reads are mapping only to a specific region of your scaffold (transposon, prophage, etc.). See Important concepts for more info.
- relative abundance
- The percentage of total reads that map a particular entity. If a metagenome has 1,000,000 reads and 1,000 reads to a particular genome, that genome is at 0.1% relative abundance
- contig
- A contiguous sequence of DNA. Usually used as a reference sequence for mapping reads against. The terms contig and scaffold are used interchangeably by inStrain.
- scaffold
- A sequence of DNA that may have a string of “N”s in it representing a gap of unknown length. The terms contig and scaffold are used interchangeably by inStrain.
- iRep
- A measure of how fast a population was replicating at the time of DNA extraction. Based on comparing the sequencing coverage at the origin vs. terminus of replication, as described in Brown et. al., Nature Biotechnology 2016
- mutation type
- Describes the impact of a nucleotide mutation on the amino acid sequence of the resulting protein. N = non-synonymous mutation (the encoded amino-acid changes due to the mutation). S = synonymous mutation (the encoded amino-acid does not change due to the mutation; should happen ~1/6 of the time by random chance due to codon redundancy). I = intergenic mutation. M = multi-allelic SNV with more than one change (rare).
- dN/dS
- A measure of whether the set of mutations in a gene are biased towards synonymous (S) or non-synonymous (N) mutations. dN/dS is calculated based on mutations relative to the reference genome. dN/dS > 1 means the bias is towards N mutations, indicating the gene is under active selection to mutate. dN/dS < 1 means the bias is towards S mutations, indicated the gene is under stabilizing selection to not mutate. dN/dS = 1 means that N and S mutations are at the rate expected by mutating positions randomly, potentially indicating the gene is non-functional.
- pN/pS
- Very similar to dN/dS, but calculated at positions with at least two alleles present rather than in relation to the reference genome.
- fasta file
- A file containing a DNA sequence. Details on this file format can be found on wikipedia
- bam file
- A file containing metagenomic reads mapped to a DNA sequence. Very similar to a .sam file. Details can be found online
- scaffold-to-bin file
- A .text file with two columns separated by tabs, where the first column is the name of a scaffold and the second column is the name of the bin / genome the scaffold belongs to. Can be created using the script parse_stb.py that comes with the program
dRep
See Expected output for more info - genes file
- A file containing the nucleotide sequences of all genes to profile, as called by the program Prodigal. See Expected output for more info
- mismapped read
- A read that is erroneously mapped to a genome. InStrain profiles a population by looking at the reads mapped to a genome. These reads are short, and sometimes reads that originated from one microbial population map to the representative genome of another (for example if they share homology). There are several techniques that can be used to reduce mismapping to the lowest extent possible.
- multi-mapped read
- A read that maps equally well to multiple different locations in the .fasta file. Most mapping software will randomly select one position to place multi-mapped reads. There are several techniques that can be used to reduce multi-mapped reads to the lowest extent possible, including increasing the minimum MAPQ cutoff to >2 (which will eliminate them entirely).
- inStrain profile
- An inStrain profile (aka IS_profile, IS, ISP) is created by running the
inStrain profile
command. It contains all of the program’s internal workings, cached data, and is where the output is stored. Additional commands can then be run on an IS_profile, for example to analyze genes, compare profiles, etc., and there is lots of nice cached data stored in it that can be accessed using python. - null model
- The null model describes the probability that the number of true reads that support a variant base could be due to random mutation error, assuming Q30 score. The default false discovery rate with the null model is 1e-6 (one in a million).
- mm
- The maximum number of mismatches a read-pair can have to be considered in the metric being considered. Behind the scenes, inStrain actually calculates pretty much all metrics for every read pair mismatch level. That is, only including read pairs with 0 mismatches to the reference sequences, only including read pairs with >= 1 mis-match to the reference sequences, all the way up to the number of mismatches associated with the “PID” parameter. Most of the time when it then generates user-facing output, it uses the highest mm possible and deletes the column label. If you’d like access to information on the mm-level, see the section titled “Dealing with mm”
- mapQ score
- MapQ scores are a measure of how well a read aligns to a particular position. They are assigned to each read mapped by bowtie2, but the details of how they are generated are incredibly confusing (see the following link for more information). MapQ scores of 0 and 1 have a special meaning: if a read maps equally well to multiple different locations on a .fasta file, it always gets a MapQ score of 0 or 1.
FAQ (Frequently asked questions)¶
How does inStrain compare to other bioinformatics tools for strains analysis?¶
A major difference is inStrain’s use of the popANI and conANI, which allow consideration of minor alleles when performing genomic comparisons. See Important concepts for more information.
What can inStrain do?¶
inStrain includes calculation of nucleotide diversity, calling SNPs (including non-synonymous and synonymous variants), reporting accurate coverage / breadth, and calculating linkage disequilibrium in the contexts of genomes, contigs, and individual genes.
inStrain also includes comparing the frequencies of fixed and segregating variants between sequenced populations with extremely high accuracy, out-performing other popular strain-resolved metagenomics programs.
The typical use-case is to generate a .bam file by mapping metagenomic reads to a bacterial genome that is present in the metagenomic sample, and using inStrain to characterize the microdiversity present.
Another common use-case is detailed strain comparisons that involve comparing the genetic diversity of two populations and calculating the extent to which they overlap. This allows for the calculation of population ANI values for extremely similar genomic populations (>99.999% average nucleotide identity).
See also
- Installation
- To get started using the program
- Expected output
- To view example output
- User Manual
- For information on how to prepare data for inStrain and run inStrain
- Important concepts
- For detailed information on how to make sure inStrain is running correctly
How does inStrain work?¶
The reasoning behind inStrain is that every sequencing read is derived from a single DNA molecule (and thus a single cell) in the original population of a given microbial species. During assembly, the consensus of these reads are assembled into contigs and these contigs are binned into genomes - but by returning to assess the variation in the reads that assembled into the contigs, we can characterize the genetic diversity of the population that contributed to the contigs and genomes.
The basic steps:
- Map reads to a .fasta file to create a .bam file
- Stringently filter mapped reads and calculate coverage and breadth
- Calculate nucleotide diversity and SNVs
- Calculate SNV linkage
- Optional: calculate gene statistics and SNV function
- Optional: compare SNVs between samples.
What is unique about the way that inStrain compares strains?¶
Most strain-resolved pipelines compare the dominant allele at each position. If you have two closely related strains A and B in sample 1, with B being at higher abundance, and two closely related strains A and C in sample 2, with C being at higher abundance, most strain comparison pipelines will in actuality compare strain B and C. This is because they work on the principle of finding the dominant strain in each sample and then comparing the dominant strains. InStrain, on the other hand, is able to identify the fact that A is present in both samples. This is because it doesn’t just compare the dominant alleles, but compares all alleles in the two populations. See module_descriptions and choosing_parameters for more information.
What is a population?¶
To characterize intra-population genetic diversity, it stands to reason that you first require an adequate definition of “population”. InStrain relies mainly on population definitions that are largely technically limited, but also coincide conveniently with possibly biological real microbial population constraints (see Olm et. al. mSystems 2020 and Jain et. al. Nature Communications 2018). Often, we dereplicate genomes from an environment at average nucleotide identities (ANI) from 95% to 99%, depending on the heterogeneity expected within each sample - lower ANIs might be preferred with more complex samples. We then assign reads to each genome’s population by stringently requiring that combined read pairs for SNP calling be properly mapped pairs with an similarity to the consensus of at least 95% by default, so that the cell that the read pair came from was at least 95% similar to the average consensus genotype at that position. Within an environment, inStrain makes it possible to adjust these parameters as needed and builds plots which can be used to estimate the best cutoffs for each project.
What are inStrain’s computational requirements?¶
The two computational resources to consider when running inStrain are the number of processes given (-p
) and the amount of RAM on the computer (usually not adjustable unless using cloud-based computing). Using inStrain v1.3.3, running inStrain on a .bam file of moderate size (1 Gbp of less) will generally take less than an hour with 6 cores, and use about 8Gb of RAM. InStrain is designed to handle large .bam files as well. Running a huge .bam file (30 Gbp) with 32 cores, for example, will take ~2 hours and use about 128Gb of RAM. The more processes you give inStrain the longer it will run, but also the more RAM it will use. See Important concepts for information on reducing compute requirements.
How can I infer the relative abundance of each strain cluster within the metagenomes?¶
At the moment you can only compare the relative abundance of the populations between samples. Say strain A, based on genome X, is in samples 1 and 2. You now know that genome X is the same strain in both samples, so you could compare the relative abundance of genome X in samples 1 and 2. But if multiple strains are present within genome X, there’s no way to phase them out.
InStrain compare isn’t really phasing out multiple strains in a sample, it’s just seeing if there is micro-diversity overlap between samples. Conceptually inStrain operates on the idea of “strain clouds” more than distinct strains. InStrain isn’t able to tell the number of strains that are shared between two samples either, just that there is population-level overlap for some particular genome. Doing haplotype phasing is something we’ve considered and may add in the future, but the feature won’t be coming any time in the near future.
How can I determine the relative abundance of detected populations?¶
Relative abundance can be calculated a number of different ways, but the way I like to do it “percentage of reads”. So if your sample has 100 reads, and 15 reads map to genome X, the relative abundance of genome X is 15%. Because inStrain does not know the number of reads per sample, it cannot calculate this metric for you. You have to calculate it yourself by dividing the total reads in the sample by the value filtered_read_pair_count reported in the inStrain genome_wide output.
What mapping software can be used to generate .bam files for inStrain?¶
Bowtie2 is a common one the works well, but any software that generates .bam files should work. Some mapping software modifies .fasta file headers during mapping (including the tool BBMap and SNAP). Include the flag --use_full_fasta_header
when mapping with these programs to properly handle this.
Important concepts¶
There are a number of things to be aware of when performing metagenomic analysis with inStrain. This page will address the following key concepts:
1. An overview of inStrain and the data it generates. A brief introduction to microbial population genomics.
2. Representative genomes and their utility. InStrain runs on “representative genomes”; this section describes what they are and the benefit of using them.
3. Picking and evaluating representative genomes. Some things to think about when picking and mapping to representative genomes.
4. Establishing and evaluating genome databases. Some things to think about when dereplicating genomes to create a genome database.
5. Handling and reducing mis-mapping reads. Ways to ensure that sequencing reads align to the correct genomes.
6. Detecting organisms in metagenomic data. Determining whether an organism is “present” in a sample is more complicated than meets the eye.
7. Strain-level comparisons and popANI. A description of how inStrain performs detailed strain-level comparisons with the popANI metric.
8. Thresholds for determining “same” vs. “different” strains. How similar do strains need to be for them to be considered identical?
9. Importance of representative genomes when calculating popANI. Appropriate representative genomes are need for popANI to work correctly.
10. Using inStrain for gene-based functional analysis. Some ways to tie inStrain results to gene-based functional questions.
11. Reducing inStrain resource usage. Tips to reduce inStrain run-time and RAM usage.
1. An overview of inStrain and the data it generates¶
InStrain is a program for microbial metagenomic analysis. When you sequence any microbial genome(s), you sequence a population of cells. This population may be a nearly clonal population grown up from an isolate in a culture flask, or a highly heterogeneous population in the real world, but there is always real biological genetic heterogeneity within that population. Every cell does not have the same genotype at every single position. InStrain can determine organism presence / absence in a community, measure and interrogate the genetic heterogeneity in microbial population, and perform detailed comparisons between organisms in different samples.
A community is a collection of taxa in a metagenome. After mapping your metagenomic reads to a set of representative genomes, inStrain can generate a number of metrics that help understand community composition. These include the percentage of reads that map to your representative genome database, the abundance of each microbe in the community, and a detailed picture of the organisms that are present or absent (measured using breadth of coverage, expected breadth of coverage, and coverage s.e.m).
A population is the collection of cells that make up an individual taxa in a community. After mapping your metagenomic reads to a set of representative genomes, inStrain can generate a number of metrics characterizing the population-level diversity of each detected organism. These metrics include nucleotide diversity, SNSs and SNVs, linkage, pN/pS, iRep, and others. Most metrics are calculated on the gene level, scaffold level, and genome level.
Strain-level comparisons between populations in different communities are notoriously different to perform with high accuracy. After profiling the communities of metagenomic samples, inStrain can compare the populations in the different communities in a highly-accurate manner by taking into account the population-level diversity. This analysis reports comparison metrics including the percentage of the genome covered in each sample, popANI, conNI, and the locations of all differences between strains.
The above figure provides a conceptual overview of the steps involved when running inStrain.
2. Representative genomes and their utility¶
Representative genomes are genomes that are chosen to represent some group of taxa, and they are the base unit of inStrain-based metagenomic analyses. If one wanted to study the species-level composition of a community with inStrain they would use a set of Species representative genomes (SRGs), but Representative genomes can also be used at more specific taxonomic levels. They are similar to OTUs in 16S-based analysis. There are some things to be aware of when using Representative genomes, including ensuring that they truly represent the taxa they are meant to, but using them has several advantages over other common approaches.
The above figure shows a visual representation of k-mer based metagenomic analysis, gene-based metagenomic analysis, and Representative genome based metagenomic analysis. Advantages include the ability to align full read pairs to target sequences, use the entire genome to determine presence and absence (significantly improving detection accuracy; see Benchmarks for proof), and perform high-resolution comparisons, among other things.
A collection of representative genomes is referred to as a Genome database. Genome databases can be downloaded from public repositories, generated via de novo sequence assembly and binning, or a combination of the two. It is important to ensure that each genome in the Genome database is distinct enough from other genomes in the database to avoid mapping confusion, and by mapping to all genomes in a Genome database simultaneously (competitively) one can significantly reduce the number of mis-mapped reads overall.
The figure above provides a visual overview of options for generating Genome databases for use with inStrain. For technical details on how this is done, see User Manual. For a pre-generated Genome database for immediate download, see Tutorial.
3. Picking and evaluating representative genomes¶
Representative genomes are typically chosen by first clustering a set of genomes using some ANI threshold, and second picking a single genome to represent each cluster. Choosing ANI thresholds are discussed in the section below. A good Representative genome is high quality, contiguous, shares a high degree of gene content with the taxa it is meant to represent, and has a similar ANI to all genomes it’s meant to represent. The program dRep is commonly used to pick representative genomes, and it uses a scoring system to score each genome and pick the genome with the highest score.
Running inStrain profile
will generate a plethora of information about each Representative genome detected in your sample (see Expected output). This information can be used to determine how good of a fit each representative genome is to the true population that it is recruiting reads from. Helpful metrics are mean read ANI, reference conANI, reference popANI, and breadth vs. expected breath. If there are regions of the genome with much higher coverage than the rest, it is likely that that region is recruiting reads from another population (mismapped read). Looking at these wavy coverage patterns can be confusing, however. Here is a link for more information on this phenomenon.
One way of increasing the similarity between a Representative genome and the organisms in your sample is to assemble genomes from your sample directly. Something to keep in mind is that when multiple closely related genomes are present in a sample, the assembly algorithm can break and you can fail to recover genomes from either organism. A solution to this problem is to assemble and bin genomes from all metagenomic samples individually, and dereplicate the genome set at the end. For more information on this, see the publication “dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication”
4. Establishing and evaluating genome databases¶
Genome databases are typically created by clustering a set of genomes using some ANI threshold using the program dRep. The dRep documentation describes some considerations to think about when choosing an ANI threshold. The most common thresholds are 95% ANI, which represents species-level clustering (Olm mSystems 2020), and 98% ANI, which is about the most stringent threshold recommended. Using either of these thresholds is generally a safe bet, and which threshold you choose depends on the level of resolution you’d like to perform your analysis at. These thresholds are ensure that genomes are distinct enough from each other, but not too distinct. Details on why this is important are below.
a) Ensure that genomes are distinct from one another.¶
Note
When genomes share stretches of identical sequence, read mapping software cannot reliably determine which genome a read should map to. The exact level of how distinct genomes need to be depends on the read length and the heterogeneity of differences across the genome, but having a maximum of 98% ANI between all genomes in the genome database is a good rule of thumb.
When mapping to a Genome database, if bowtie2 finds a read that maps equally well to multiple different positions in your fasta file it will randomly choose one of the two positions to place the read at. This is the best thing it could do, as you don’t want reads “duplicated” and mapped to multiple positions, but it also means that you really don’t want to have multiple positions in your .fasta file that are identical. The reason we go through the hassle of dereplication to generate a Genome database is to limit the number of positions in which the alignment algorithm cannot tell where the read should actually map to, and this is why we can’t just map to all possible genomes.
To determine how distinct genomes need to be to avoid having identical regions, we performed a simple experiment. We mapped to a randomly selected genome in isolation, and in the context of many other genomes in a Genome database dereplicated at 99.8% ANI. We then looked for reads that mapped to the genome of interest when mapping to that genome individually, but mapped elsewhere when mapping in the context of the entire Genome database. The results from this experiment are displayed below.
Each dot represents a genome in the full Genome database, the position on the x-axis indicates that genome’s ANI to the genome of interest (orange dot), and the position on the y-axis indicates the number of reads that were “stolen” from the genome of interest (stolen reads are those that mapped to the genome of interest when mapped in isolation, but mapped to a different genome when mapped in the context of the entire Genome database). As you can see, the more closely related an alternate genome is to a genome of interest, the more likely it is to steal reads. This makes sense, because assuming that the genomes represented by blue dots are not actually present in the sample (likely true in this case), the only way these genomes have reads mapped to them is by having regions that are identical to the genome that is actually present in the sample. In fact, you can even calculate the probability of having an identical region as long as a pair of reads (190bp in this case; 2 x 95bp) based on the genome ANI using the formula:
This simple formula was used to generate the black dotted line in the figure above. The line fits observed trend remarkably well, providing pretty compelling evidence that simple genome-ANI-based read stealing explains the phenomena. To be sure though, we can did final check based on mapQ score. Reads that map equally well to multiple different locations in a fasta file always get a MapQ score of 0-2. Thus, by filtering out reads with MapQ scores < 2, we can see reads that map uniquely to one genome only. Below we will re-generate the above figure while only including reads with mapQ scores above 2.
Just as we suspected, reads no longer map to alternate genomes at all. This provides near conclusive evidence that the organisms with these genomes are not truly in the sample, but are merely stealing reads from the genome of the organism that is there by having regions of identical DNA. For this reason it can be smart to set a minimum MapQ score of 2 to avoid mis-mapping, but at the same time, look at the difference in the number of reads mapping to the correct genome when the MapQ filter is used (compare the y-axis in the first and second figure)- 85% of the reads are filtered out. Using MapQ filters is a matter of debate depending on your specific use-case.
The data above can also be used to evaluate the most stringent threshold that can be used for dereplication. With 190bp reads (used in the figure above), we can see that read stealing approaches 0 at ~98% ANI. We can also plug this into the formula above to see that there is a ~2% change of genomes that are 98% ANI from each other sharing a 190bp identical stretch of DNA (0.98 ^ 190 = 0.02
). This is how we arrived at our recommended minimum of 98% ANI. However it is important to note that longer reads change the formula and differences between genomes are not uniformly spread across the genome. This is a complicated question and 98% ANI is just a good rule of thumb.
A symptom of having a Genome database in which genomes are too similar to one another is detecting lots of closely related organisms at similar abundance levels in samples.
b) Ensure that genomes aren’t too distinct from one another.¶
Note
When representative genomes are too distinct from the sample population they can have trouble with read mapping. The exact level of how similar genomes need to be depends on a number of factors, but a having a minimum of 95% ANI between all genomes in the genome database (representing species-level dereplication) is a good rule of thumb.
Genomes need to be similar enough to the population being mapped that they can properly recruit reads. If one were to generate a Genome database using an ANI threshold of 85% ANI, for example, implicit in that choice is the requirement that organisms which share ≥85% ANI to a representative genome will have their reads mapped to that genome. This begs the question- how similar do reads have to be to a genome for bowtie2 to map them? The answer is “it’s complicated”:
In the above example we generated synthetic reads that have a mean of 90% ANI to the reference genome. We then mapped these reads back to the reference genome and measured the ANI of mapped reads. Critically, the density of read ANI is not centered around 90% ANI, as it would be if all reads mapped equally well. The peak is instead centered at ~91% ANI, with a longer tail going left than right. This means that reads which have <92% ANI to the reference genome sometimes don’t map at all. Sometimes they do map, however, as we see some read pairs mapping that have ~88% ANI. The reason for this pattern is because bowtie2 doesn’t have a stringent ANI cutoff, it just maps whatever read-pairs it can. Where the SNPs are along the read, whether they’re in the seed sequence that bowtie2 uses, and other random things probably determine whether a low-ANI read pair maps or not. Thus, while bowtie2 can map reads that are up to 86% ANI with the reference genome, 92% seems to be a reasonable minimum based on this graph.
However, this does not mean that a representative genome that has 92% ANI to an organism of interest will properly recruit all it’s reads. ANI is calculated as a genome-wide average, and some regions will have more mutations than others. This is why the figure above has a wide distribution. Further, genomes that share 92% ANI have diverged from each other for a very long time, and likely have undergone changes in gene content as well. Recent studies have shown that organisms of the same species usually share >= 95% ANI, and that organisms of the same species share much more gene content than organisms from different species (Olm mSystems 2020). In sections below we also show that a buffer of ~3% ANI is needed to account for genomic difference heterogeneity, meaning that genomes dereplciated at 95% should be able to recruit reads at 92% ANI (the minimum for bowtie2). Thus for a number of reasons 95% ANI is a good minimum ANI threshold for establishing genome databases.
A symptom of having a Genome database in which genomes are too distinct from one another is genomes having low mean read ANI and breadth, and having an overall low percentage of reads mapping.
c) Ensure that all microbes in a sample have an appropriate representative genome.¶
Populations with appropriate representative genomes will be most accurately profiled, and populations that do not have a representative genome in the genome database will be invisible. Using a combination of de novo assembly and integration with public databases can result in genome databases that are both accurate and comprehensive. Instructions for how to do this are available in the Tutorial and User Manual. A great way to determine how complete your Genome database is is to calculate the percentage of reads that map to genomes in your database. The higher this percentage, the better (expect ~20-40% for soil, 60-80% for human microbiome, and 90%+ for simple, well defined communities).
5. Handling and reducing mis-mapping reads¶
As discussed above, a major aspect of using and establishing Genome databases with inStrain is reducing the number of reads that map to the wrong genome. When metagenomic sequencing is performed on a community, reads are generated from each population in that community. The goal of read mapping is to assign each read to the genome representing the population from which the read originated. When a read maps to a genome that does not represent the population from which the read originated, it is a mis-mapped read. Read mis-mapping can happen when a read maps equally well to multiple genomes (and is then randomly assigned to one or the other) or when a read from a distantly-related population maps to an inappropriate genome. Read mis-mapping can be reduced using a number of different techniques as discussed below.
Reducing read mis-mapping with competitive mapping
Competitive mapping is when reads are mapped to multiple genomes simultaneously. When we establish and map to a Genome database we are performing competitive mapping. When bowtie2 maps reads, by default, it only maps reads to a single location. That means that if a read maps at 98% ANI to one genome, and 99% ANI to another genome, it will place the read at the position with 99% ANI. If the read only maps to one scaffold at 98% ANI, however, bowtie2 will place the read there. Thus, by including more reference genome sequences when performing the mapping, reads will end up mapping more accurately overall. Ensuring that you have the most comprehensive genome set possible is a great way to reduce read mis-mapping via competitive mapping.
Reducing read mis-mapping by adjusting min_read_ani
InStrain calculates the ANI between all read-pairs and the genomes they map to. The inStrain profile parameter -l
/ --min_read_ani
dictates the minimum ANI a read pair can have; all pairs below this threshold are discarded. Adjusting this parameter can ensure that distantly related reads don’t map, but setting this parameter to be too stringent will reduce the ability of a genome to recruit reads with genuine variation.
For the figure above synthetic read pairs were generated to be 98% ANI to a random E. coli genome, reads were mapped back to that genome, and the distribution of ANI values of mapped reads was plotted. Most read pairs have 98%, as expected, but there is a wide distribution of read ANI values. This is because differences between reads and genomes are not evenly spread along the genome, a fact that is even more true when you consider that real genomes likely have even more heterogeneity in where SNPs occur than this synthetic example. You really don’t want reads to fail to map to heterogeneous areas of the genome, because those areas with more SNPs are potentially the most interesting. Based on the figure above and some other confusing tests that aren’t included in this documentation, it seems that the minimum read pair ANI should be 2-3% lower than the actual difference between the reads and the genome to account for genomic heterogeneity. Thus a --min_read_ani
of 92% should be used when reads are expected to map to genomes that are 95% ANI away, for example when using Species representative genomes.
Warning
The inStrain default is 95% minimum read pair ANI, which is ideal in the case that you’ve assembled your reference genome from the sample itself. If you plan on using inStrain to map reads to a Genome database of Species representative genome`s, you should lower the minimum read-pair ANI to ~92% (note that using the `–database_mode`` flag automatically adjusts --min_read_ani
to 0.92)
Reducing read mis-mapping by adjusting MapQ
mapQ score`s are numbers that describe how well a read maps to a genome. InStrain is able to set a minimum read-pair mapQ score using the parameter `–min_mapq``. MapQ scores in general are confusing, without consistent rules on how they’re calculated using different mapping programs, but the values 0-2 have special meaning. If a read maps equally well to multiple positions it is given a mapQ score of 1 or 2. Thus by setting --min_mapq
to 2, you can remove all reads that map equally well to multiple positions (multi-mapped read). Remember that with competitive mapping a read that maps equally well to multiple positions will be randomly assigned to one, giving that read a ≥50% chance of being mis-mapped.
Whether or not you should set --min_mapq
to 2 is a difficult decision. On one hand these reads have a high probability of being mis-mapped, which is not ideal, but on the other hand doing this mapQ filtering can result in filtering out lots of reads (see figures in the above section “Establishing and evaluating genome databases”). One way of thinking about this is by imagining two genomes A and B that are very distinct from one another but share an identical transposon. If the population represented by genome A and not genome B is present in a sample, without mapQ filtering you’ll see genome A having a breadth of 100% and genome B having a breadth of ~1%. If genome A is at 100X coverage you’ll see the coverage across most of the genome at 100x, and at the transposon it will be at 50x. Genome B will have 0x coverage across most of the genome, and the transposon will be at 50x coverage. The benefit of this scenario is that we are still able detect that genome A has the transposon; the downside is that it that genome B is erroneously detected has having a transposon present in the sample (however when using recommended threshold of 50% breadth to determine detection genome B will still correctly be identified as not being present in the sample). Performing mapQ filtering on the above situation will result in genome A having a breadth of 99%, 0x coverage at the transposon, and no reads mapping to genome B. The benefit of this scenario is that we properly detect that no reads are mapping to genome B; the downside is that we incorrectly think that genome A does not have a transposon in this sample.
Note
In conclusion, filtering reads by mapQ score is not ideal for a number of reasons. It is best to instead reduce the number of multi-mapped reads using the advice in the sections above to make it so --min_mapq
filtering isn’t necessary.
6. Detecting organisms in metagenomic data.¶
Note
Mis-mapping can fool abundance-based presence/absence thresholds. We recommend using a 50% breadth threshold to determine presence/absence instead.
A critical first step in metagenomic analysis is determining which Representative genomes are “present” or “absent” (and therefore the microbial populations they represent as well). This is actually more complex than meets the eye, mostly due to multi-mapped reads and mismapped reads. Details on these phenomena are discussed above, but the upshot is that just because a genome has reads mapping to it does not mean that that genome is actually present in a sample.
Many studies determine presence/absence based on metrics like coverage or relative abundance. This isn’t great though, since there can easily be substantial numbers of mis-mapped reads. There are countless examples of a genome being detected at 100x coverage and 2% relative abundance, but when looking at the mapping it is discovered that all reads are mapped to a single prophage on the genome. The problem with these metrics is that they are genome-wide averages, so they cannot account for cases where substantial numbers of reads are map to a small region of the genome. Most would agree that detecting solely a prophage or transposon on a genome should not count as that genome being “present”, so we need metrics beyond coverage and 2% relative abundance to determine presence / absence. See Benchmarks for more real-world examples of this phenomena.
A great metric for determining presence/absence is breadth, the percentage of a genome that’s covered by at least one read. Using breadth to determine presence/absence allows the user to account for the problems above. Deciding on an appropriate breadth threshold requires the user to answer the question “How much of the genome do I need to have detected in a sample before I am confident that it’s actually present”? The answer to this question depends on the particular study details and questions, but we can use data to help us decide on a rational breadth cutoff.
The figure above shows the expected genome overlap between genomes of various ANI values from different environments (adapted from “Consistent metagenome-derived metrics verify and define bacterial species boundaries”). As you can see, genomes that share >95% ANI tend to share ~75% of their genome content. Therefore, using a breadth detection cutoff of somewhere around 50-75% seems to be reasonable when using Species representative genome s. In my experience using a 50% breadth cutoff does a great job of ensuring that genomes are actually present when you say they are, and leads to very few false positives. It’s exceedingly rare for mis-mapping to lead to >50% genome breadth. See Benchmarks for real-world examples of the 50% breadth threshold in action.
A caveat of using a breadth threshold is that it requires thousands of reads to map to a genome for it to be considered present. This makes it less ideal for samples with low sequencing depth. To determine the coverage needed to detect a genome at some breadth, we performed an experiment based on synthetic E. coli and C. albicans reads). By generating reads, subsetting them to a number of different total read numbers, and mapping them back to the genome, we generated the following figure
This figure allows us to visually see the relationship between coverage and breadth when reads are mapped randomly across the genome. To achieve a 50% breadth an organism needs to have just under 1x coverage. At over 6x coverage, all organisms should have ~100% breadth. This data also allowed us to fit a curve to calculate the following formula:
Applying this formula allows inStrain to calculate and report expected breadth for a given coverage value. Effective use of expected breadth can allow users to lower their breadth thresholds and still have confidence in determining presence/absence. Imagine that you detect an organism at 10x coverage and 85% breadth. The expected breadth at 10x coverage is 100%, but you only have 85% breadth. This means that 15% of your genome is likely not in the reads set, and that your representative genome has genome content that is 15% different from the organism in your sample. Now imagine that you detect an organism at 3x coverage with 85% breadth. The expected breadth and actual breadth are approximately the same now, meaning that reads and randomly aligning to all parts of the genome and you likely have a very dialed in representative genome. Now imagine you detect organism A with 10% breadth and 0.1x coverage, and organism B with 10% breadth and 10x coverage. Both organisms have the same breadth, but organism A is much more likely to be actually present in your sample. That’s because while few reads overall are mapping, they’re mapping all across the genome in a random way (you know this because breadth is about equal to expected breadth), which is indicative of a true low abundance population. Organism B, however, should be abundant enough for reads to map all over the genome (expected breadth is 100%), but reads are only mapping to 10% of it. This indicates that no matter how deeply you sequence you will not see the rest of organism B’s genome, and the 10% of it that you are seeing is likely due to mis-mapping.
Note
Theoretical models have determined breadth to be: 1 - exp(-coverage) (Lander and Waterman (1988)), slightly different from the empirical derivation presented here and used in inStrain. More information on this subject can be found at this technical note from Illumina.
7. Strain-level comparisons and popANI.¶
InStrain is able to perform detailed, accurate, microdiversity-aware strain-level comparisons between organisms detected in multiple metagenomic samples. The is done using the command inStrain compare
on multiple samples that have been profiled using the command inStrain profile
, and technical details on how this is done is available in the User Manual.
To understand why “microdiversity-aware” genomic comparisons are important, consider the fact that all natural microbial populations have some level of genomic heterogeneity present within them.
The image above incorporates data from Zhao et. al. 2019 and Zanini et. al. 2015 (left and middle phylogenetic trees). In each case different colors represent different individuals, and each leaf represents an individual isolate. You can see from these data that although each individual has a distinct microbial population, there is substantial diversity within each individual as well (referred to as intraspecific genetic variation (within species), intrapatient genetic variation (within patient), or microdiversity). Knowledge of this fact leads to the question- how does one accurately compare populations that have intraspecific genetic variation? Some common approaches include comparing the “average” genome in each sample (the consensus genome) or comparing a number of individual isolates. See Benchmarks for some data on how well these approaches hold up.
InStrain performs microdiversity-aware comparisons using the metric popANI, depicted above, which is also reported alongside the more common consensus-based ANI metric conANI. The calculation of popANI and conANI is not complicated once you understand it (really), but describing can be tricky, and the simplest way of describing it is with examples like those displayed above.
While not depicted in the above figure, the first step of calculating conANI and popANI is identifying all positions along the genome in which both samples have ≥5x coverage. This number is reported as the compared_bases_count
, and it describes the number of base-pairs (bp) that are able to be compared. Next, inStrain goes through each one of these comparable base-pairs and determines if there is a conANI substitution at that position and/or if there is a popANI substitution at that position. The left half of the above figure describes the conditions that will lead to popANI and conANI substitutions. If both samples have the same major allele (e.g. the most common base at that position is the same in both samples), no substitutions will be called. If samples have different major alleles (e.g. the most common base in sample 1 is A, and the most common base in sample 2 is C), a conANI substitution will be called. If there are no alleles that are shared between the two samples, major or minor, a popANI substitution will be called. The calculations that determine whether or not a base is considered “present” as a minor allele in a sample (vs. it just being a sequencing error) are discussed in the User Manual.
On the right side of the above figure we see several examples of this in action. In the top row there are no alleles that are the same in both samples, therefore the site will count as both a conANI SNP and a popANI SNP. In the second row the consensus allele is different in both samples (its G in the sample on the left and T in the sample on the right), so a conANI SNP will be called. However the samples DO share an allele (T is present in both samples), so this will NOT be considered a popANI substitution. In the third row both samples have the same consensus allele and share alleles, so no substitutions are called. In the last row the samples have different consensus alleles (G on the left and T on the right), so a conANI substitution will be called, but there is allele overlap between the samples (both samples have G and T) so a popANI substitution will NOT be called.
Once we have the compared_bases_count
, number of conANI SNPs, and number of popANI SNPs, calculation of conANI and popANI is trivial.
Note
Notice that compared_bases_count
is integral to conANI and popANI calculations. It essentially determines the “denominator” in the calculations, as it let’s you know how bases were compared in the calculation. Attempting to calculate conANI and popANI using SNP-calling data from other programs will likely leave out this critical information. Remember- compared_bases_count
is a measure of how many bases have at least 5x coverage in BOTH samples. Consideration of compared_bases_count
is critical to ensure that popANI isn’t high simply because one or both sample’s doesn’t have high enough coverage to detect SNPs
8. Thresholds for determining “same” vs. “different” strains.¶
Once inStrain performs it’s strain-level comparisons, one must decide on some threshold to define microbes as being the “same” or “different” strains. The figure above illustrates some common ANI values used for defining various relationships between microbes (top left), some previously reported rates of in situ microbial evolution (bottom left), and estimates of divergence times for various ANI thresholds (top left). On the right is an analogy using canine taxonomy.
The figure above illustrates how loose ANI thresholds can be used to define relatively broad groups of organisms, for example the genus Canis or the species Canis Familiaris. Sub-species taxonomic levels, referred to as strains in the microbe world and breeds in the dog world, describe groups of organisms within particular species. Strain definitions in the microbial world are not consistent, but some example strain ANI thresholds are shown. There is still generally some variation within strains, however. This is exemplified by the fact that while dogs of the same breed are similar to one another, they’re not identical to one another. Similarly, microbes of the same strain based on a 99% ANI definition can have diverged for roughly 44,000 years (based on the in situ mutation rate in bold in the bottom left). Clearly microbes that have diverged for tens of thousands of years are not identical to one another. Thus if we want to know whether samples are linked by a recent microbial transmission event, we need an extremely stringent definition of “same” that is beyond the typical strain level. Note that the dogs in the bottom right are clones that truly do represent identical dogs..
To identify microbes that are linked by a recent transmission event we want the most stringent ANI threshold possible. 99.9999% ANI, for example, represents less than 10 years of divergence time and could be a useful metric. Metagenomic sequencing is messy, however, and when working with this level of stringency we need to think about our limit of detection. The Benchmarks section contains data on the limit of detection for inStrain using defined microbial communities (see section “Benchmark with true microbial communities”) The conclusion is that 99.999% popANI is a good, highly stringent definition for identical strains that is within the limit of detection for metagenomic analysis.. In addition to popANI, one must also consider the fraction of the genome that was at sufficient coverage in both samples being compared. This value (reported as percent_genome_compared
) is more of a judgement call, but we recommend requiring a minimum of 25% or 50% percent_genome_compared
in addition to the popANI threshold.
Note
In conclusion, organisms in different samples that are linked by a recent transmission event should have ≥99.999% popANI and ≥50% percent_genome_compared
9. Importance of representative genomes when calculating popANI¶
InStrain strain-level comparisons are based on mappings to representative genomes. In order for this to work well, however reads with variation must be able to map to the representative genomes within the ``–min_read_ani`` threshold. Note that inStrain compare
will use the --min_read_ani
selected during the inStrain profile
commands by default.
Below are a series of plots generated from synthetic data demonstrating this fact. In these plots a reference genome was downloaded from NCBI, mutated to a series of known ANI values, synthetic reads were generated from each of these mutated genomes, and synthetic reads were then mapped back to the original genome.
In the above plot the --min_read_ani
is set to 95%. As you can see, when the true ANI value between the genomes is below 98%, popANI values reported by inStrain are not accurate. The reason that this happens is because reads with genuine variation are being filtered out by inStrain, leaving only the reads without variation, which artificially increases the reported popANI values. In sections above we demonstrated that ``–min_read_ani`` should be ~3% looser than the population you’d like to recruit reads from; the same rule applies here. If you’d like to compare organisms that have a popANI of 95%, your --min_read_ani
needs to be 92%. Here we have a --min_read_ani
of 95%, so we can detect accurate popANI values of 98% or above (as shown in the above figure). This phenomena is explored further in the following way.
The above figure displays the percent_genome_compared
for each of the comparisons in the first figure in this section. As expected, when comparing genomes of low ANI values with a read-pair ANI threshold of 95%, only a small amount of the genome is actually being compared. This genome fraction represents the spaces of the genome that happen to be the most similar, and thus the inStrain calculated ANI value is overestimated. The conclusion here is that in order to get an accurate ANI value, you need to set your ``–min_read_ani`` at least 3% below the ANI value that you wish to detect.
10. Using inStrain for gene-based functional analysis¶
The above figure shows a visual representation of k-mer based metagenomic analysis, gene-based metagenomic analysis, and Representative genome based metagenomic analysis. As you can see, among the advantages of genome-based metagenomic analysis is the ability to perform context-aware functional profiling.
InStrain does not have the ability to annotate genes. However, inStrain does have the ability to deeply profile all genes in a sample, including analysis of coverage, coverage variation, gene pN/pS, nucleotide diversity, individual SNVs, etc. This gene-level information can then be combined with gene annotations to perform robust functional analysis. Any database can be used for this type of analysis, including pFam for protein domain annotations, ABRicate for antibiotic resistance gene annotation, UniRef100 for general protein annotation, and dbCAN for CAZyme annotation.
For examples of inStrain-based functional annotation in action, see Table 1 and Figure 6 of the inStrain publication and this GitHub repo focused on COVID-19 population genomics analysis
11. Reducing inStrain resource usage¶
Note
When mapping to a Genome database with more than a handful of genomes make sure to use the flag --database_mode
The two computational resources to consider when running inStrain are the number of processes given (-p
) and the amount of RAM on the computer (usually not adjustable unless using cloud-based computing).
Using inStrain v1.3.3, running inStrain on a .bam file of moderate size (1 Gbp or less) will generally take less than an hour with 6 cores, and use about 8Gb of RAM. InStrain is designed to handle large .bam files as well. Running a huge .bam file (30 Gbp) with 32 cores, for example, will take ~2 hours and use about 128Gb of RAM. The more processes you give inStrain the faster it will run, but also the more RAM it will use.
In the log folder InStrain provides a lot of information on where it’s spending it’s time and where it’s using it’s RAM.
To reduce RAM usage, you can try the following things:
- Use the
--skip_mm
flag. This won’t profile things on the mm level, and will treat every read pair as perfectly mapped. This is perfectly fine for most applications - Make sure and use the
--database_mode
flag when mapping to genome databases. This will do a couple of things to try and reduce RAM usage - Use less processes (
-p
). Using more processes will make inStrain run faster, but it will also use more RAM while doing so
Tutorial¶
The above figure provides a conceptual overview of the steps involved when running inStrain. Step 1 is generating sequencing reads, step 2 is mapping those sequencing reads to a Genome database, step 3 is profiling the mapping with inStrain profile
, step 4 is comparing inStrian profiles using inStrain compare
.
Quick Start¶
The two main operations of inStrain are compare
and profile
.
Profile¶
InStrain profile
takes as input a fasta file and a bam file and runs a series of steps to characterize the nucleotide diversity, SNSs and SNVs, linkage, etc.. If one provides a scaffold-to-bin file it will calculate genome-level metrics, and if one provides a genes file it will calculate gene level metrics.
The most basic inStrain profile
command has this form:
$ inStrain profile .bam_file .fasta_file -o IS_output_name
Compare¶
InStrain compare
takes as input multiple inStrain profile
objects (generated using the command above) and performs strain-level comparisons. Each inStrain profile
object used by InStrain compare
must be made from reads mapped to the same fasta file.
The most basic inStrain compare command looks like this:
$ inStrain compare -i IS_output_1 IS_output_2 IS_output_3
Other¶
There are a number of other operations that inStrain can perform as well, although these generally perform more niche tasks. Check the program help (inStrain -h
) to see a full list of the available operations
$ inStrain -h
...::: inStrain v1.4.0 :::...
Matt Olm and Alex Crits-Christoph. MIT License. Banfield Lab, UC Berkeley. 2019
Choose one of the operations below for more detailed help. See https://instrain.readthedocs.io for documentation.
Example: inStrain profile -h
Workflows:
profile -> Create an inStrain profile (microdiversity analysis) from a mapping.
compare -> Compare multiple inStrain profiles (popANI, coverage_overlap, etc.)
Single operations:
profile_genes -> Calculate gene-level metrics on an inStrain profile [DEPRECATED; PROVIDE GENES TO profile]
genome_wide -> Calculate genome-level metrics on an inStrain profile [DEPRECATED; PROVIDE .stb FILES TO profile / compare]
quick_profile -> Quickly calculate coverage and breadth of a mapping using coverM
filter_reads -> Commands related to filtering reads from .bam files
plot -> Make figures from the results of "profile" or "compare"
other -> Other miscellaneous operations
See also
- Installation
- To get started using the program
- User Manual
- For descriptions of what the modules can do and information on how to prepare data for inStrain
- Expected output
- To view example output and how to interpret it
Example inStrain commands¶
Running inStrain profile on a single genome
inStrain profile mappingfile.bam genomefile.fasta -o outputlocation.IS -p 6 -g genesfile.fasta
Running inStrain profile on a large set of genomes
inStrain profile mappingfile.bam genomesfile.fasta -o outputlocation.IS -p 6 -g genesfile.fasta -s scaffoldtobin.stb --database_mode
Running inStrain compare on a large set of genomes
inStrain compare -i genomefile-vs-sample1.IS/ genomefile-vs-sample2.IS/ -o genomefile.IS.COMPARE -p 6 -s scaffoldtobin.stb --database_mode
Tutorials¶
The following tutorials give step-by-step instructions on how to run inStrain a couple of different ways. The main difference between these tutorials is in how the Genome database used for inStrain analysis is generated. The user to inStrain decides on their own what genomes should be used for analysis, and there are a couple of broad options as depicted in the figure below.
Tutorial #1 uses test data that comes packaged with the inStrain source code to go through the basic steps of the program. It also describes how you can run using genomes that you have generated on your own.
Tutorial #2 describes how to run inStrain using an existing, public genome database. This way of running inStrain avoids the need for metagenomic assembly and genome binning.
Tutorial #3 describes how to combine custom genomes with an existing genome database. This allows users to include both sample-specific representative genomes and an existing genome database, and allows for comprehensive, accurate analysis.
Tutorial #1) Running inStrain on provided test data¶
The following tutorial goes through an example run of inStrain. You can follow along with your own data, or use a small set of reads that are included in the inStrain source code for testing. They can be found in the folder test/test_data/
of your install folder, or can be downloaded from the inStrain source code at this link on GitHub. The files that we’ll use for this tutorial are the forward and reverse metagenomic reads (N5_271_010G1.R1.fastq.gz
and N5_271_010G1.R2.fastq.gz
) and a .fasta file to map to (N5_271_010G1_scaffold_min1000.fa
). In case you’re curious, these metagenomic reads come from a premature infant fecal sample.
Preparing input files¶
After downloading the genome file that you would like to profile (the fasta file) and at least one set of paired reads, the first thing to do is to map the reads to the .fasta file in order to generate a bam file.
When this mapping is performed it is important that you map to all genomes simultaneously (see Important concepts for why this is important). This involves combining all of the genomes that you’d like to map into a single .fasta file. In our case our .fasta file already has all of the genomes that we’d like to profile within it, but if you did want to profile a number of different genomes, you could combine them using a command like this
$ cat raw_data/S2_002_005G1_phage_Clostridioides_difficile.fasta raw_data/S2_018_020G1_bacteria_Clostridioides_difficile.fasta > allGenomes_v1.fasta
When we do this we also need to generate a file to let inStrain know which scaffolds came from which genomes. We can do this by giving inStrain a list of the .fasta files that went into making the concatenated .fasta file, or we can make a scaffold-to-bin file file, which lists the genome assignment of each scaffold in a tab-delimited file. This is how to do the later method using the parse_stb.py script that comes with the program dRep
(Installed with the command pip install drep --upgrade
)
$ parse_stb.py --reverse -f raw_data/S2_002_005G1_phage_Clostridioides_difficile.fasta raw_data/S2_018_020G1_bacteria_Clostridioides_difficile.fasta -o genomes.stb
Next we must map our reads to this fasta file to create bam files. In this tutorial we will use the mapping program Bowtie 2
$ mkdir bt2
$ bowtie2-build ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa bt2/N5_271_010G1_scaffold_min1000.fa
$ bowtie2 -p 6 -x bt2/N5_271_010G1_scaffold_min1000.fa -1 ~/Programs/inStrain/test/test_data/N5_271_010G1.R1.fastq.gz -2 ~/Programs/inStrain/test/test_data/N5_271_010G1.R2.fastq.gz > N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam
At this point we have generated a .sam file, the precursor to .bam files. Lets make sure it’s there and not empty
$ ls -lht
total 34944
-rw-r--r-- 1 mattolm staff 16M Jan 23 11:56 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam
drwxr-xr-x 8 mattolm staff 256B Jan 23 11:54 bt2/
Perfect. At this point we could convert the .sam file to a sorted and indexed .bam file using samtools, but since inStrain can do that for us automatically we won’t bother.
If we want inStrain to do gene-level profiling we need to give it a list of genes to profile. Note - this is an optional step that is not required for inStrain to work in general, but without this you will not get gene-level profiles
We will profile our genes using the program prodigal, which can be run using the following example command
$ prodigal -i ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa -d N5_271_010G1_scaffold_min1000.fa.genes.fna -a N5_271_010G1_scaffold_min1000.fa.genes.faa
Running inStrain profile¶
Now that we’ve gotten everything set up it’s time to run inStrain. To see all of the options, run
$ inStrain profile -h
A long list of arguments and options will show up. For more details on what these do, see User Manual. The only arguments that are absolutely required, however, are a .sam or .bam mapping file, and the .fasta file that the mapping file is mapped to.
Note
In this case we’re going to have inStrain profile the mapping, call genes, make the results genome wide, and plot the results all in one command. This is the recommended way to do things for the most computational efficiency. The other, not recommended way would be to run these all as separate steps (using the subcommands inStrain profile
, inStrain profile_genes
, inStrain genome_wide
, and inStrain plot
). See User Manual for more information.
Using all of the files we generated above, here is going to be our inStrain command
$ inStrain profile N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa -o N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS -p 6 -g N5_271_010G1_scaffold_min1000.fa.genes.fna -s ~/Programs/inStrain/test/test_data/N5_271_010G1.maxbin2.stb
You should see the following as inStrain runs (should only take a few minutes)
You gave me a sam- I'm going to make it a .bam now
Converting N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam to N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.ba
m
samtools view -S -b N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam > N5_271_010G1_scaffold_min1000.fa-vs-N5_271_
010G1.bam
sorting N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.bam
samtools sort N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.bam -o N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1
.sorted.bam -@ 6
[bam_sort_core] merging from 0 files and 6 in-memory blocks...
Indexing N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam
samtools index N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam N5_271_010G1_scaffold_min1000.fa-vs-N5_271_
010G1.sorted.bam.bai -@ 6
***************************************************
..:: inStrain profile Step 1. Filter reads ::..
***************************************************
Filtering reads: 100%|██████████████████████████████████████████████████████████████| 178/178 [00:02<00:00, 70.99it/s]
37.3% of reads were removed during filtering
1,727 read pairs remain (0.0004472 Gbp)
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************
Profiling splits: 100%|█████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.21it/s]
Merging splits and profiling genes: 100%|███████████████████████████████████████████████| 7/7 [00:08<00:00, 1.18s/it]
***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************
Scaffold to bin was made using .stb file
85.66% of scaffolds have a genome
93.82% of scaffolds have a genome
99.30% of scaffolds have a genome
***************************************************
.:: inStrain profile Step 5. Generate plots ::..
***************************************************
making plots 1, 2, 3, 4, 5, 6, 7, 8, 9
Plotting plot 1
Plotting plot 2
85.66% of scaffolds have a genome
Plotting plot 3
57.37% of scaffolds have a genome
Plotting plot 4
97.33% of scaffolds have a genome
Plotting plot 5
Plotting plot 6
Plotting plot 7
97.33% of scaffolds have a genome
Plotting plot 8
94.32% of scaffolds have a genome
Plotting plot 9
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
..:: inStrain profile finished ::..
Output tables........ N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/output/
Figures.............. N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/figures/
Logging.............. N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/log/
See documentation for output descriptions - https://instrain.readthedocs.io/en/latest/
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
The last but of the output shows you where the plots and figures have been made. Here’s a list of the files that you should see
$ ls -lht N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/output/
total 91K
-rw-rw-r-- 1 mattolm infantgi 35K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_SNVs.tsv
-rw-rw-r-- 1 mattolm infantgi 1.2K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_genome_info.tsv
-rw-rw-r-- 1 mattolm infantgi 23K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_mapping_info.tsv
-rw-rw-r-- 1 mattolm infantgi 92K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_gene_info.tsv
-rw-rw-r-- 1 mattolm infantgi 15K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_linkage.tsv
-rw-rw-r-- 1 mattolm infantgi 30K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_scaffold_info.tsv
$ ls -lht N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/figures/
total 3.5M
-rw-rw-r-- 1 mattolm infantgi 386K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_GeneHistogram_plot.pdf
-rw-rw-r-- 1 mattolm infantgi 379K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_LinkageDecay_types_plot.pdf
-rw-rw-r-- 1 mattolm infantgi 404K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_ScaffoldInspection_plot.pdf
-rw-rw-r-- 1 mattolm infantgi 375K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_ReadFiltering_plot.pdf
-rw-rw-r-- 1 mattolm infantgi 378K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_LinkageDecay_plot.pdf
-rw-rw-r-- 1 mattolm infantgi 377K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_MajorAllele_frequency_plot.pdf
-rw-rw-r-- 1 mattolm infantgi 375K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_readANI_distribution.pdf
-rw-rw-r-- 1 mattolm infantgi 400K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_genomeWide_microdiveristy_metrics.pdf
-rw-rw-r-- 1 mattolm infantgi 376K Jan 15 10:10 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_CoverageAndBreadth_vs_readMismatch.pdf
We have now successfully generated an inStrain profile! For help interpreting the output files, see Expected output
Running inStrain parse_annotations¶
InStrain parse_annotations
creates output files that make it easier to perform functional gene analysis. One input file is the InStrain profile object, which we just created above, and the other input file is a table of gene annotations.
You can annotate your genes using whatever gene annotation database you like (depending on your specific project and questions). The section Gene Annotation in User Manual has instructions for a few databases. In this tutorial let’s just annotate with KEGG Orthologies (KOs) and Carbohydrate-Active enZYmes (CAZymes).
To do the annotations we’ll need the amino acid sequences of the genes (the file ending in .faa, created using the prodigal command above) even though the gene nucleotide sequences is what we provided to inStrain profile. Following the section Gene Annotation in User Manual, we’ll then run the following commands
$ exec_annotation -p profiles -k ko_list --cpu 10 --tmp-dir ./tmp -o N5_271_010G1_scaffold_min1000.fa.genes.faa.KO N5_271_010G1_scaffold_min1000.fa.genes.faa
$ hmmscan --domtblout N5_271_010G1_scaffold_min1000.fa.genes.faa_vs_dbCAN_v11.dm dbCAN-HMMdb-V11.txt N5_271_010G1_scaffold_min1000.fa.genes.faa > /dev/null ; sh /hmmscan-parser.sh N5_271_010G1_scaffold_min1000.fa.genes.faa_vs_dbCAN_v11.dm > N5_271_010G1_scaffold_min1000.fa.genes.faa_vs_dbCAN_v11.dm.ps ; cat N5_271_010G1_scaffold_min1000.fa.genes.faa_vs_dbCAN_v11.dm.ps | awk '$5<1e-15&&$10>0.35' > N5_271_010G1_scaffold_min1000.fa.genes.faa_vs_dbCAN_v11.dm.ps.stringent
We’ll now need to use python / R / Excel to parse and re-format the output of these (and any other) annotations. In the end they need be transformed into a single .csv file with the columns “gene” and “anno”. See User Manual for more details on the specific formatting requirements. In our case the file, which I called geneAnnotations_v1.csv, should look like this:
gene | anno |
N5_271_010G1_scaffold_12_3 | K06956 |
N5_271_010G1_scaffold_14_1 | K09890 |
N5_271_010G1_scaffold_15_2 | K07482 |
N5_271_010G1_scaffold_19_7 | K09890 |
N5_271_010G1_scaffold_25_1 | K20386 |
N5_271_010G1_scaffold_25_1 | K15558 |
N5_271_010G1_scaffold_25_1 | K19762 |
N5_271_010G1_scaffold_25_2 | K06864 |
N5_271_010G1_scaffold_28_3 | K07482 |
Now we can run inStrain parse_annotations with a command like the following
$ inStrain parse_annotations -i N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/ -o genes_output_v1 -a geneAnnotations_v1.csv
Done! To see what output files you can expect, see Expected output
Running inStrain compare¶
InStrain compare
compares genomes that have been profiled by multiple different metagenomic mappings. To compare genomes in the sample we just profiled above, we need to generate another bam file of reads from another sample to the same .fasta file. Provided in the inStrain test_data folder is exactly that- another different set of reads mapped to the same .fasta file (N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam). Let’s run inStrain on this to make a new inStrain profile
$ inStrain profile test_data/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam N5_271_010G1_scaffold_min1000.fa -o N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.IS -p 6 -g N5_271_010G1_scaffold_min1000.fa.genes.fna -s N5_271_010G1.maxbin2.stb
To see the help section for inStrain compare
run:
$ inStrain compare -h
As above, this will print out a whole list of parameters that can be turned depending on your specific use-case. Important concepts and User Manual provide some insight into what these parameters do and how to tune them. For the purposes of this tutorial we’re going to use mostly default parameters, giving us the following command
$ inStrain compare -i N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/ N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.IS/ -s .N5_271_010G1.maxbin2.stb -p 6 -o N5_271_010G1_scaffold_min1000.fa.IS.COMPARE
This command should produce the following output
Scaffold to bin was made using .stb file
***************************************************
..:: inStrain compare Step 1. Load data ::..
***************************************************
Loading Profiles into RAM: 100%|████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 67.45it/s]
158 of 167 scaffolds are in at least 2 samples
***************************************************
..:: inStrain compare Step 2. Run comparisons ::..
***************************************************
Running group 1 of 1
Comparing scaffolds: 100%|██████████████████████████████████████████████████████████| 158/158 [00:04<00:00, 36.12it/s]
***************************************************
..:: inStrain compare Step 3. Auxiliary processing ::..
***************************************************
***************************************************
..:: inStrain compare Step 4. Store results ::..
***************************************************
making plots 10
Plotting plot 10
/home/mattolm/.pyenv/versions/3.6.10/lib/python3.6/site-packages/inStrain/plottingUtilities.py:963: UserWarning: FixedFormatter should only be used together with FixedLocator
axes.set_xticklabels(labels)
/home/mattolm/.pyenv/versions/3.6.10/lib/python3.6/site-packages/inStrain/plottingUtilities.py:963: UserWarning: FixedFormatter should only be used together with FixedLocator
axes.set_xticklabels(labels)
Done!
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
..:: inStrain compare finished ::..
Output tables........ N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/output/
Figures.............. N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/figures/
Logging.............. N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/log/
See documentation for output descriptions - https://instrain.readthedocs.io/en/latest/
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
As before, the last part of the output shows you where the plots and figures have been made. Here’s a list of the files that you should see
$ ls -lht N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/output/
total 14K
-rw-rw-r-- 1 mattolm infantgi 28K Jan 15 10:33 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_comparisonsTable.tsv
-rw-rw-r-- 1 mattolm infantgi 352 Jan 15 10:33 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_strain_clusters.tsv
-rw-rw-r-- 1 mattolm infantgi 554 Jan 15 10:33 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_genomeWide_compare.tsv
$ ls -lht N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/figures/
total 393K
-rw-rw-r-- 1 mattolm infantgi 376K Jan 15 10:33 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_inStrainCompare_dendrograms.pdf
Success! As before, for help interpreting this output see Expected output .
Tutorial #2) Running inStrain using a public genome database¶
If you don’t want to assemble and bin your metagenomic samples it is also possible to run inStrain using publicly available reference genomes. Here we will go through a tutorial on how to do this with the UHGG genome collection, a collection of all microbial species known to exist in the human gut. The steps in this tutorial could be repeated with any set of genomes though, including genomes assembled from non-industrialized human populations, as available at the following link - https://doi.org/10.5281/zenodo.7782709
Preparing a genome database¶
Note
The genome database created in this section is available for direct download at the following link - https://doi.org/10.5281/zenodo.4441269 . You can download those files directly and skip this section if you would like. This genome set is based on UHGG version 1 and was created on Jan 14, 2021.
Note
An alternative genome database that includes UHGG genomes AND genomes assembled from non-industrialized human populations is available for direct download at the following link - https://doi.org/10.5281/zenodo.7782709 . This genome set is described in the following publication - https://doi.org/10.1101/2022.03.30.486478
In order to create a genome database we need to download the genomes, create a scaffold-to-bin file, create a genes file, and merge all genomes into a single fasta file that we can make a bowtie2 mapping index out of. All genomes in a genome need to database need to be distinct from one another, but not too distinct. See section “Establishing and evaluating genome databases” in Important concepts for more info.
First we must download the UHGG genomes themselves. The FTP site is here, and metadata on genomes is genomes-all_metadata.tsv. Let’s download this metadata file using curl:
$ curl http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v1.0/genomes-all_metadata.tsv -o genomes-all_metadata.tsv
Now that we have this metadata file we need to download all species representative genomes. There are a number of ways to do this, but we’re going to do it by parsing the metadata table in unix. Running the following command will 1) identify columns of species representatives, 2) parse the row to determine their FTP location, 3) create and run a curl
command to download the genome:
$ cat genomes-all_metadata.tsv | awk -F "\t" '{if ($17 == $1) print "curl ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v1.0/uhgg_catalogue/" substr($18,0,13) "/" $18 "/genome/" $18 ".fna -o UHGG_reps/" $1 ".fna"}' | bash
The following command will let us check and make sure that we downloaded all 4644 genomes:
$ ls UHGG_reps/ | wc -l
4644
Next we need to create a scaffold-to-bin file. This can easily be done using the script parse_stb.py that comes with the program dRep
:
$ parse_stb.py --reverse -f UHGG_reps/* -o UHGG.stb
Next we’ll profile the genes for each genome using Prodigal to create a genes file. This can be done on the concatenated genome file (created below) or on the individual genomes (as shown in this code chunk). The benefit of the later is that it allows Prodigal to be run in single genome mode, as opposed to metagenome mode, which can be more accurate:
$ mkdir UHGG_genes
$ cd UHGG_reps/
$ for genome in $(ls *.fna); do echo prodigal -i $genome -o ../UHGG_genes/$genome.genes -a ../UHGG_genes/$genome.gene.faa -d ../UHGG_genes/$genome.gene.fna -m -p single; done | parallel -j 6
$ cat UHGG_genes/*.gene.fna > UHGG_reps.genes.fna
$ cat UHGG_genes/*.gene.faa > UHGG_reps.genes.faa
Finally we need to concatenate all genomes together into a single fasta file and create a bowtie2 mapping index from it:
$ cat UHGG_reps/* > UHGG_reps.fasta
$ bowtie2-build UHGG_reps.fasta UHGG_reps.fasta.bt2 --large-index --threads 20
Mapping to the Genome Database¶
Here we will use the program Bowtie2 to align our reads to the reference database. If you downloaded the pre-made version of bowtie2 index, you’ll need to extract it using the following command
$ tar -zxvf UHGG_reps_v1.bt2.tgz
This should yield a set of 5 files that end in .bt2l
Next we need to map our metagenomic reads to the database. For the purposes of this tutorial we’ll use metagenomic reads that came from a premature infant fecal sample. The bowtie2 command to map these reads is
$ bowtie2 -p 10 -x /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.fasta.bt2 -1 /groups/banfield/projects/human/data8/raw.d/NIH5/reads/HR/N5_216_039G1_T0140F_S50_L002.HR.R1.fastq.gz -2 /groups/banfield/projects/human/data8/raw.d/NIH5/reads/HR/N5_216_039G1_T0140F_S50_L002.HR.R2.fastq.gz > UHGG_reps.fasta-vs-N5_216_039G1.sam
7032881 reads; of these:
7032881 (100.00%) were paired; of these:
1690938 (24.04%) aligned concordantly 0 times
1905098 (27.09%) aligned concordantly exactly 1 time
3436845 (48.87%) aligned concordantly >1 times
----
1690938 pairs aligned concordantly 0 times; of these:
139804 (8.27%) aligned discordantly 1 time
----
1551134 pairs aligned 0 times concordantly or discordantly; of these:
3102268 mates make up the pairs; of these:
1851642 (59.69%) aligned 0 times
279669 (9.01%) aligned exactly 1 time
970957 (31.30%) aligned >1 times
86.84% overall alignment rate
This mapping took just over 10 minutes on my computer. Notice how the bowtie2 states that over 85% of reads align to the database- this is an important number to consider, as all reads that do not align to the database will be invisible to inStrain. For human microbiome samples 85% is pretty good, but performing de novo genome assembly and including sample-specific genomes would undoubtedly raise this number.
Running inStrain profile¶
Next we’ll profile the .sam file created above with inStrain. To do this we’ll need the scaffold-to-bin file, genes file, and fasta file for the database that we created in the first step. If you downloaded them you can decompress them with the commands
$ gzip -d UHGG_reps.fasta.gz
$ gzip -d UHGG_reps.genes.fna.gz
**When running inStrain on a big database like we have here it’s critical to add the flag --database mode
. This flag does some quick calculations to figure out which genomes are probably not present, and stops working on them right away. This leads to dramatic reductions in RAM usage and computational time.
The inStrain profile command we’ll use now is
$ inStrain profile UHGG_reps.fasta-vs-N5_216_039G1.sam /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.fasta -o UHGG_reps.fasta-vs-N5_216_039G1.IS -p 10 -g /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.genes.fna -s /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.stb --database_mode
This took just over an hour to run on my computer. We have now successfully generated an inStrain profile! For help interpreting the output files, see Expected output. To link the genomes in the UHGG database with their taxonomy, use the file genomes-nr_metadata.tsv
which we downloaded above and is part of the overall download as well. To subset to just the species representative genomes (SRGs) that make up this database, subset this table to only include rows where the column “Genome” is equal to the column “Species_rep”.
Running inStrain compare¶
InStrain compare
compare genomes that have been profiled by multiple different metagenomic mappings. To compare genomes in the sample we just profiled above, we need to generate another bam file of reads from another sample to the same .fasta file. For example, something like the command (based on reads from this fecal sample from the same premature infant):
bowtie2 -p 10 -x /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.fasta.bt2 -1 /groups/banfield/projects/human/data8/raw.d/NIH5/reads/HR/N5_216_046G1_T0140F_S50_L002.HR.R1.fastq.gz -2 /groups/banfield/projects/human/data8/raw.d/NIH5/reads/HR/N5_216_046G1_T0140F_S50_L002.HR.R2.fastq.gz > UHGG_reps.fasta-vs-N5_216_046G1.sam
$ inStrain profile UHGG_reps.fasta-vs-N5_216_046G1.sam /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.fasta -o UHGG_reps.fasta-vs-N5_216_046G1.IS -p 10 -g /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.genes.fna -s /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.stb --database_mode
Now that we have two inStrain profile objects based on reads mapped to the same .fasta file, we can compare all detected genomes using the following command:
inStrain compare -i UHGG_reps.fasta-vs-N5_216_039G1.IS/ UHGG_reps.fasta-vs-N5_216_046G1.IS/ -s /groups/banfield/projects/human/data8/ExternalData/UHGG/UHGG_reps.stb -p 6 -o UHGG_reps.fasta.IS.COMPARE --database_mode
Success! As before, for help interpreting this output see Expected output.
Tutorial #3) Merging custom genomes with an existing genome database.¶
Using a combination of sample-specific genomes for accuracy and public genome databases for comprehensiveness can provide the best of both worlds. The steps are as follows:
- Establish a set of data-set specific genomes through de novo genome assembly and binning. This could be done using a tool such as anvi’o, for example.
- Download an entire database of individual genomes. See the top of Tutorial #2 for instructions on downloading UHGG.
- Dereplicate both sets of genomes. The specific threshold you use for dereplication is important and some thoughts about choosing thresholds is available at Important concepts. A program that can be used for this purpose is dRep; just make sure you have dRep version 3 which is able to handle much larger genome sets than previous versions. An example command that could be used for this step is
dRep dereplicate MergedGenomeSet -g FullListOfGenomes.txt –S_algorithm fastANI –multiround_primary_clustering –clusterAlg greedy -ms 10000 -pa 0.9 -sa 0.95 -nc 0.30 -cm larger -p 16
This command will result in a species-level dereplicated set of genomes that include both your custom genomes and the database genomes. More details on genome dereplication can be found here. To prioritize your custom genomes over the database genomes, use the flag extra_weight_table within dRep.
- Create a genome database out of the genomes in the dereplicated_genomes folder produced in the step above. This can be done following the instructions at the top of Tutorial #2.
User Manual¶
Generating inStrain input¶
There are two main inputs to inStrain: a fasta file containing reference genome sequences, and a bam file containing reads mapped to these sequences. Additionally and optionally, by providing a genes .fna file inStrain can calculate gene-level metrics, and by providing a scaffold-to-bin file inStrain can calculate metrics on a genome level. Here we go over some considerations involved in generating these inputs.
Preparing the .fasta file¶
A fasta file contains the DNA sequences of the contigs that you map your reads to. Choosing what fasta file you will use (consensus / reference genomes) is important and will affect the interpretation of your inStrain results. Below we describe the three most common strategies.
Please note that the fasta file provided to inStrain must always be the same as, or a subset of, the fasta file used to create the bam file (i.e. the fasta file that reads were mapped to).
Using de novo assembled genomes (recommended)¶
This strategy involves assembling genomes from the metagenomic samples that you’d like to profile. This is the recommended workflow for running inStrain:
- Assemble reads into contigs for each sample collected from the environment. Recommended software: IDBA_UD, MEGAHIT, metaSPADES.
- Bin genomes out of each assembly using differential coverage binning. Recommended software: Bowtie2 (for mapping), MetaBAT, CONCOCT, DasTOOL (for binning).
- Dereplicate the entire set of genomes that you would like to profile (all genomes from all environments) at 97-99% identity, and filter out low quality genomes. Recommended software: dRep, checkM.
- Create a scaffold-to-bin file from the genome set. Recommended software: parse_stb.py
- Create a bowtie2 index of the representative genomes from this dereplicated set and map reads to this set from each sample. Recommended software: Bowtie2
- Profile the resulting mapping .bam files using inStrain to calculate genome-level microdiveristy metrics for each originally binned genome.
An important aspect of this workflow is to map to many genomes at once. Mapping to just one genome at a time is highly discouraged, because this encourages mismapped reads from other genomes to be recruited by this genome. By including many (dereplicated) genomes in your bowtie2 index, you will be able to far more accurately filter out mismapped reads and reduce false positive SNPs. See Important concepts for more info.
For instructions on merging your genomes with a public database, see Tutorial #3 of Tutorial.
Using a single genome .fasta file¶
If your .fasta file is a single genome, the main consideration is that it should be a good representative genome for some organism in your sample. Ideally, it was assembled directly from that sample, isolated from that sample, or you have some other evidence that this genome is highly representative of a species in that sample. Regardless, you should check your inStrain plot output and scaffold_info.tsv output file to be sure that your inStrain run had decent coverage and breadth of coverage of the genome that you use before attempting to interpret the results.
Remember, your .fasta file can be a subset of the .fasta file that was used to create the .bam file. You can create a .bam with all dereplicated genomes from your environment, but then just pass a .fasta file for only the genomes of particular interest. This approach is recommended as opposed to creating a bam file for just each genome, as it reduces mismapped reads
Using a metagenomic assembly¶
You can also pass inStrain an entire metagenomic assembly from a sample, including both binned and unbinned contigs. In this case, the output inStrain profile will include population information for each contig in the set. To break it down by microbial genome / species, you can include a scaffold-to-bin file to generate results by genome.
Preparing the .bam file¶
InStrain is designed primarily for paired-end Illumina read sequencing, though un-paired reads can also be used by adjusting the run-time parameters. We recommend using the program Bowtie2 to map your reads to your genome.
Bowtie2 default parameters are what we use for mapping, but it may be worth playing around with them to see how different settings perform on your data. It is important to note that the -X
flag (capital X) is the expected insert length and is by default 500
. In many cases (e.g., 2x250 bp or simply datasets with longer inserts) it may be worthwhile to increase this value up to -X 1000
for passing to Bowtie2. By default, if a read maps equally well to multiple genomes, Bowtie2 will pick one of the positions randomly and give the read a MAPQ score of 1. Thus, if you’d like to remove multi-mapped reads, you can set the minimum mapQ score to 2.
Other mapping software can also be used to generate .bam files for inStrain. However, some software (e.g. BBmap and SNAP) use the fasta file scaffold descriptions when generating the .bam files, which causes problems for inStrain. If using mapping software that does this, include the flag --use_full_fasta_header
to let inStrain account for this.
Note
If the reads that you’d like to run with inStrain are not working, please post an issue on GitHub. We’re happy to upgrade inStrain to work with new mapping software and/or reads from different technologies.
Preparing the genes file¶
You can run prodigal on your fasta file to generate an .fna file with the gene-level information. This .fna file can then be provided to inStrain profile to get gene-level characterizations.
Example:
$ prodigal -i assembly.fasta -d genes.fna -a genes.faa
Preparing a scaffold-to-bin file¶
After running inStrain profile
, most results are presented on a scaffold-by-scaffold basis. There are a number of ways of telling inStrain which scaffold belongs to which genome, so that results can be analyzed on a genome-by-gene level as well.
- Individual .fasta files. As recommended above, if you want to run inStrain on multiple genomes in the same sample, you should first concatenate all of the individual genomes into a single .fasta file and map to that. To view the results of the individual genomes used to create the concatenated .fasta file, you can pass a list of the individual .fasta files the
-s
argument. - Scaffold-to-bin file. This is a text file consists of two columns, with one column listing the scaffold name, and the second column listing the genome bin name. Columns should be separated by tabs. The script parse_stb.py can help you create a scaffold-to-bin file from a list of individual .fasta files, or to split a concatenated .fasta file into individual genomes. The script comes packaged with the program dRep, and can be installed with the command
pip install drep
. - Nothing. If all of your scaffolds belong to the same genome, by running
inStrain profile
without any -s options it will summarize the results of all scaffolds together as if they all belong to the same genome.
Description of inStrain modules and arguments¶
The functionality of inStrain is broken up into modules. To see a list of available modules, check the help:
$ inStrain -h
...::: inStrain v1.3.2 :::...
Matt Olm and Alex Crits-Christoph. MIT License. Banfield Lab, UC Berkeley. 2019
Choose one of the operations below for more detailed help. See https://instrain.readthedocs.io for documentation.
Example: inStrain profile -h
Workflows:
profile -> Create an inStrain profile (microdiversity analysis) from a mapping.
compare -> Compare multiple inStrain profiles (popANI, coverage_overlap, etc.)
Single operations:
profile_genes -> Calculate gene-level metrics on an inStrain profile [DEPRECATED; USE profile INSTEAD]
genome_wide -> Calculate genome-level metrics on an inStrain profile
quick_profile -> Quickly calculate coverage and breadth of a mapping using coverM
filter_reads -> Commands related to filtering reads from .bam files
plot -> Make figures from the results of "profile" or "compare"
other -> Other miscellaneous operations
profile¶
Module description¶
The heart of inStrain. The input is a fasta file and a bam file, and the output is an IS_profile. The functionality of inStrain profile
is broken into several steps.
First, all reads in the .bam file are filtered to only keep those that map with sufficient quality. All non-paired reads will be filtered out by default, and an additional set of filters are applied to each read pair (not the individual reads). Command line parameters can be adjusted to change the specifics, but in general:
- Pairs must be mapped in the proper orientation with an expected insert size. The minimum insert distance can be set with a command line parameter. The maximum insert distance is a multiple of the median insert distance. So if pairs have a median insert size of 500bp, by default all pairs with insert sizes over 1500bp will be excluded. For the max insert cutoff, the median_insert for all scaffolds is used.
- Pairs must have a minimum mapQ score. MapQ scores are confusing and how they’re calculated varies based on the mapping algorithm being used, but are meant to represent both the number of mismatches in the mapping and how unique that mapping is. With bowtie2, if the read maps equally well to two positions on the genome (multi-mapped read), its mapQ score will be set to 2. The read in the pair with the higher mapQ is used for the pair.
- Pairs must be above some minimum nucleotide identity (ANI) value. For example if reads in a pair are 100bp each, and each read has a single mismatch, the ANI of that pair would be 0.99
Next, using only read pairs that pass filters, a number of microdiversity metrics are calculated on a scaffold-by-scaffold basis. This includes:
- Calculate the coverage at each position along the scaffold
- Calculate the nucleotide diversity at each position along the scaffold in which the coverage is greater than the min_cov argument.
- Identify SNSs and SNVs. The criteria for being reported as a divergent site are 1) More than min_cov number of bases at that position, 2) More than min_freq percentage of reads that are a variant base, 3) The number of reads with the variant base is more than the null model for that coverage.
- Calculate linkage between divergent sites on the same read pair. For each pair harboring a divergent site, calculate the linkage of that site with other divergent sites within that same pair. This is only done for pairs of divergent sites that are both on at least MIN_SNP reads
- Calculate scaffold-level properties. These include things like the overall coverage, breadth of coverage, average nucleotide identity (ANI) between the reads and the reference genome, and the expected breadth of coverage based on that true coverage.
Finally, this information is stored as an IS_profile object. This includes the locations of divergent sites, the number of read pairs that passed filters (and other information) for each scaffold, the linkage between SNV pairs, ect.
Module parameters¶
To see the command-line arguments for inStrain profile, check the help:
$ inStrain profile -h
usage: inStrain profile [-o OUTPUT] [--use_full_fasta_header] [-p PROCESSES]
[-d] [-h] [--version] [-l MIN_READ_ANI]
[--min_mapq MIN_MAPQ]
[--max_insert_relative MAX_INSERT_RELATIVE]
[--min_insert MIN_INSERT]
[--pairing_filter {paired_only,all_reads,non_discordant}]
[--priority_reads PRIORITY_READS]
[--detailed_mapping_info] [-c MIN_COV] [-f MIN_FREQ]
[-fdr FDR] [-g GENE_FILE] [-s [STB [STB ...]]]
[--mm_level] [--skip_mm_profiling] [--database_mode]
[--min_scaffold_reads MIN_SCAFFOLD_READS]
[--min_genome_coverage MIN_GENOME_COVERAGE]
[--min_snp MIN_SNP] [--store_everything]
[--scaffolds_to_profile SCAFFOLDS_TO_PROFILE]
[--rarefied_coverage RAREFIED_COVERAGE]
[--window_length WINDOW_LENGTH] [--skip_genome_wide]
[--skip_plot_generation]
bam fasta
REQUIRED:
bam Sorted .bam file
fasta Fasta file the bam is mapped to
I/O PARAMETERS:
-o OUTPUT, --output OUTPUT
Output prefix (default: inStrain)
--use_full_fasta_header
Instead of using the fasta ID (space in header before
space), use the full header. Needed for some mapping
tools (including bbMap) (default: False)
SYSTEM PARAMETERS:
-p PROCESSES, --processes PROCESSES
Number of processes to use (default: 6)
-d, --debug Make extra debugging output (default: False)
-h, --help show this help message and exit
--version show program's version number and exit
READ FILTERING OPTIONS:
-l MIN_READ_ANI, --min_read_ani MIN_READ_ANI
Minimum percent identity of read pairs to consensus to
use the reads. Must be >, not >= (default: 0.95)
--min_mapq MIN_MAPQ Minimum mapq score of EITHER read in a pair to use
that pair. Must be >, not >= (default: -1)
--max_insert_relative MAX_INSERT_RELATIVE
Multiplier to determine maximum insert size between
two reads - default is to use 3x median insert size.
Must be >, not >= (default: 3)
--min_insert MIN_INSERT
Minimum insert size between two reads - default is 50
bp. If two reads are 50bp each and overlap completely,
their insert will be 50. Must be >, not >= (default:
50)
--pairing_filter {paired_only,all_reads,non_discordant}
How should paired reads be handled?
paired_only = Only paired reads are retained
non_discordant = Keep all paired reads and singleton reads that map to a single scaffold
all_reads = Keep all reads regardless of pairing status (NOT RECOMMENDED; See documentation for deatils)
(default: paired_only)
--priority_reads PRIORITY_READS
The location of a list of reads that should be
retained regardless of pairing status (for example
long reads or merged reads). This can be a .fastq file
or text file with list of read names (will assume file
is compressed if ends in .gz (default: None)
READ OUTPUT OPTIONS:
--detailed_mapping_info
Make a detailed read report indicating deatils about
each individual mapped read (default: False)
VARIANT CALLING OPTIONS:
-c MIN_COV, --min_cov MIN_COV
Minimum coverage to call an variant (default: 5)
-f MIN_FREQ, --min_freq MIN_FREQ
Minimum SNP frequency to confirm a SNV (both this AND
the FDR snp count cutoff must be true to call a SNP).
(default: 0.05)
-fdr FDR, --fdr FDR SNP false discovery rate- based on simulation data
with a 0.1 percent error rate (Q30) (default: 1e-06)
GENE PROFILING OPTIONS:
-g GENE_FILE, --gene_file GENE_FILE
Path to prodigal .fna genes file. If file ends in .gb
or .gbk, will treat as a genbank file (EXPERIMENTAL;
the name of the gene must be in the gene qualifier)
(default: None)
GENOME WIDE OPTIONS:
-s [STB [STB ...]], --stb [STB [STB ...]]
Scaffold to bin. This can be a file with each line
listing a scaffold and a bin name, tab-seperated. This
can also be a space-seperated list of .fasta files,
with one genome per .fasta file. If nothing is
provided, all scaffolds will be treated as belonging
to the same genome (default: [])
READ ANI OPTIONS:
--mm_level Create output files on the mm level (see documentation
for info) (default: False)
--skip_mm_profiling Dont perform analysis on an mm level; saves RAM and
time; impacts plots and raw_data (default: False)
PROFILE OPTIONS:
--database_mode Set a number of parameters to values appropriate for
mapping to a large fasta file. Will set:
--min_read_ani 0.92 --skip_mm_profiling
--min_genome_coverage 1 (default: False)
--min_scaffold_reads MIN_SCAFFOLD_READS
Minimum number of reads mapping to a scaffold to
proceed with profiling it (default: 1)
--min_genome_coverage MIN_GENOME_COVERAGE
Minimum number of reads mapping to a genome to proceed
with profiling it. MUST profile .stb if this is set
(default: 0)
--min_snp MIN_SNP Absolute minimum number of reads connecting two SNPs
to calculate LD between them. (default: 20)
--store_everything Store intermediate dictionaries in the pickle file;
will result in significantly more RAM and disk usage
(default: False)
--scaffolds_to_profile SCAFFOLDS_TO_PROFILE
Path to a file containing a list of scaffolds to
profile- if provided will ONLY profile those scaffolds
(default: None)
--rarefied_coverage RAREFIED_COVERAGE
When calculating nucleotide diversity, also calculate
a rarefied version with this much coverage (default:
50)
--window_length WINDOW_LENGTH
Break scaffolds into windows of this length when
profiling (default: 10000)
OTHER OPTIONS:
--skip_genome_wide Do not generate tables that consider groups of
scaffolds belonging to genomes (default: False)
--skip_plot_generation
Do not make plots (default: False)
compare¶
Module description¶
Compare provides the ability to compare multiple inStrain profiles (created by running inStrain profile
).
Note
inStrain can only compare inStrain profiles that have been mapped to the same .fasta file
inStrain compare
does pairwise comparisons between each input inStrain profile. For each pair, a series of steps are undertaken.
- All positions in which both IS_profile objects have at least min_cov coverage (5x by default) are identified. This information can be stored in the output by using the flag –store_coverage_overlap, but due to it’s size, it’s not stored by default
- Each position identified in step 1 is compared to calculate both conANI and popANI. The way that it compares positions is by testing whether the consensus base in sample 1 is detected at all in sample 2 and vice-versa. Detection of an allele in a sample is based on that allele being above the set -min_freq and -fdr. All detected differences between each pair of samples can be reported if the flag –store_mismatch_locations is set.
- The coverage overlap and the average nucleotide identity for each scaffold is reported. For details on how this is done, see Expected output
- New in v1.6 Tables that list the coverage and base-frequencies of each SNV in all samples can be generated using the –bams parameter within inStrain compare. For each inStrain profile provided with the -i parameter, the corresponding bam file must be provided with the –bams parameter. The same read filtering parameters used in the original profile command will be used when running this analysis. See section SNV POOLING OPTIONS: in the help below for full information about this option, and see Expected output for the tables it creates.
Module parameters¶
To see the command-line options, check the help:
$ inStrain compare -h
usage: inStrain compare -i [INPUT [INPUT ...]] [-o OUTPUT] [-p PROCESSES] [-d]
[-h] [--version] [-s [STB [STB ...]]] [-c MIN_COV]
[-f MIN_FREQ] [-fdr FDR] [--database_mode]
[--breadth BREADTH] [-sc SCAFFOLDS] [--genome GENOME]
[--store_coverage_overlap]
[--store_mismatch_locations]
[--include_self_comparisons] [--skip_plot_generation]
[--group_length GROUP_LENGTH] [--force_compress]
[-ani ANI_THRESHOLD] [-cov COVERAGE_TRESHOLD]
[--clusterAlg {centroid,weighted,ward,single,complete,average,median}]
[-bams [BAMS [BAMS ...]]] [--skip_popANI]
REQUIRED:
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
A list of inStrain objects, all mapped to the same
.fasta file (default: None)
-o OUTPUT, --output OUTPUT
Output prefix (default: instrainComparer)
SYSTEM PARAMETERS:
-p PROCESSES, --processes PROCESSES
Number of processes to use (default: 6)
-d, --debug Make extra debugging output (default: False)
-h, --help show this help message and exit
--version show program's version number and exit
GENOME WIDE OPTIONS:
-s [STB [STB ...]], --stb [STB [STB ...]]
Scaffold to bin. This can be a file with each line
listing a scaffold and a bin name, tab-seperated. This
can also be a space-seperated list of .fasta files,
with one genome per .fasta file. If nothing is
provided, all scaffolds will be treated as belonging
to the same genome (default: [])
VARIANT CALLING OPTIONS:
-c MIN_COV, --min_cov MIN_COV
Minimum coverage to call an variant (default: 5)
-f MIN_FREQ, --min_freq MIN_FREQ
Minimum SNP frequency to confirm a SNV (both this AND
the FDR snp count cutoff must be true to call a SNP).
(default: 0.05)
-fdr FDR, --fdr FDR SNP false discovery rate- based on simulation data
with a 0.1 percent error rate (Q30) (default: 1e-06)
DATABASE MODE PARAMETERS:
--database_mode Using the parameters below, automatically determine
which genomes are present in each Profile and only
compare scaffolds from those genomes. All profiles
must have run Profile with the same .stb (default:
False)
--breadth BREADTH Minimum breadth_minCov required to count a genome
present (default: 0.5)
OTHER OPTIONS:
-sc SCAFFOLDS, --scaffolds SCAFFOLDS
Location to a list of scaffolds to compare. You can
also make this a .fasta file and it will load the
scaffold names (default: None)
--genome GENOME Run scaffolds belonging to this single genome only.
Must provide an .stb file (default: None)
--store_coverage_overlap
Also store coverage overlap on an mm level (default:
False)
--store_mismatch_locations
Store the locations of SNPs (default: False)
--include_self_comparisons
Also compare IS profiles against themself (default:
False)
--skip_plot_generation
Dont create plots at the end of the run. (default:
False)
--group_length GROUP_LENGTH
How many bp to compare simultaneously (higher will use
more RAM and run more quickly) (default: 10000000)
--force_compress Force compression of all output files (default: False)
GENOME CLUSTERING OPTIONS:
-ani ANI_THRESHOLD, --ani_threshold ANI_THRESHOLD
popANI threshold to cluster genomes at. Must provide
.stb file to do so (default: 0.99999)
-cov COVERAGE_TRESHOLD, --coverage_treshold COVERAGE_TRESHOLD
Minimum percent_genome_compared for a genome
comparison to count; if below the popANI will be set
to 0. (default: 0.1)
--clusterAlg {centroid,weighted,ward,single,complete,average,median}
Algorithm used to cluster genomes (passed to
scipy.cluster.hierarchy.linkage) (default: average)
SNV POOLING OPTIONS:
-bams [BAMS [BAMS ...]], --bams [BAMS [BAMS ...]]
Location of .bam files used during inStrain profile
commands; needed to pull low-frequency SNVs. MUST BE
IN SAME ORDER AS THE INPUT FILES (default: None)
--skip_popANI Only run SNV Pooling; skip other compare operations
(default: False)
Other modules¶
The other modules are not commonly used, and mainly provide auxiliary functions or allow you to run certain steps of profile
after the fact. It is recommended to provide a genes file and/or a scaffold-to-bin file during inStrain profile
rather than using profile_genes
or genome_wide
, as it is more computationally efficient to do things that way.
parse_annotations¶
This is a new module (released in inStrain version 1.7) that provides a straightforward way to annotate genes for functional analysis with inStrain. It’s inputs are 1 or more inStrain profile objects (a set of genes must of been provided when running profile) and a gene annotation table. The format of the gene annotation MUST be a .csv file that looks like the following:
gene | anno |
AF21-42.Scaf7_79 | K14374 |
AF21-42.Scaf7_79 | K12633 |
AF21-42.Scaf7_79 | K16034 |
AF21-42.Scaf7_79 | K12705 |
AF21-42.Scaf7_79 | K15467 |
AF21-42.Scaf7_79 | K20592 |
AF21-42.Scaf7_79 | K15958 |
AF21-42.Scaf7_79 | K22590 |
AF21-42.Scaf7_80 | K02835 |
AF21-42.Scaf7_81 | K03431 |
The top line MUST read gene,anno exactly, and all other lines must have the format gene, comma (,), annotation. The gene column must match the names of the genes provided to inStrain profile, and the second column can be whatever you want it to be. It is fine if a gene as multiple annotations, just make that gene have multiple lines in this file.
For tips on running the annotations themselves, see the section Gene Annotation below. Also see the section Running inStrain parse_annotations in Tutorial for more info.
To see the command-line options, check the help:
$ inStrain parse_annotations -h
usage: inStrain parse_annotations -i [INPUT [INPUT ...]] -a [ANNOTATIONS [ANNOTATIONS ...]] [-o OUTPUT] [-p PROCESSES] [-d] [-h] [--version] [-b MIN_GENOME_BREADTH]
[-g MIN_GENE_BREADTH] [--store_rawdata]
REQUIRED:
-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
A list of inStrain objects, all mapped to the same .fasta file (default: None)
-a [ANNOTATIONS [ANNOTATIONS ...]], --annotations [ANNOTATIONS [ANNOTATIONS ...]]
A table or set of tables with gene annotations.
Must be in specific format; see inStrain website for details (default: None)
-o OUTPUT, --output OUTPUT
Output prefix (default: annotation_output)
SYSTEM PARAMETERS:
-p PROCESSES, --processes PROCESSES
Number of processes to use (default: 6)
-d, --debug Make extra debugging output (default: False)
-h, --help show this help message and exit
--version show program's version number and exit
OTHER OPTIONS:
-b MIN_GENOME_BREADTH, --min_genome_breadth MIN_GENOME_BREADTH
Only annotate genomes on genomes with at least this genome breadth. Requires having genomes called. Set to 0 to include all genes. (default: 0.5)
-g MIN_GENE_BREADTH, --min_gene_breadth MIN_GENE_BREADTH
Only annotate genes with at least this breadth. Set to 0 to include all genes. (default: 0.8)
--store_rawdata Store the raw data dictionary (default: False)
quick_profile¶
This is a quirky module that is not really related to any of the others. It is used to quickly profile a bam file to pull out scaffolds from genomes that are at a sufficient breadth. To use it you must provide a .bam file, the .fasta file that you mapped to to generate the .bam file, and a scaffold to bin file (see above section for details). On the backend this module is really just calling the program coverM
To see the command-line options, check the help:
$ inStrain quick_profile -h
usage: inStrain quick_profile [-p PROCESSES] [-d] [-h] [--version]
[-s [STB [STB ...]]] [-o OUTPUT]
[--breadth_cutoff BREADTH_CUTOFF]
[--stringent_breadth_cutoff STRINGENT_BREADTH_CUTOFF]
bam fasta
REQUIRED:
bam Sorted .bam file
fasta Fasta file the bam is mapped to
SYSTEM PARAMETERS:
-p PROCESSES, --processes PROCESSES
Number of processes to use (default: 6)
-d, --debug Make extra debugging output (default: False)
-h, --help show this help message and exit
--version show program's version number and exit
OTHER OPTIONS:
-s [STB [STB ...]], --stb [STB [STB ...]]
Scaffold to bin. This can be a file with each line
listing a scaffold and a bin name, tab-seperated. This
can also be a space-seperated list of .fasta files,
with one genome per .fasta file. If nothing is
provided, all scaffolds will be treated as belonging
to the same genome (default: [])
-o OUTPUT, --output OUTPUT
Output prefix (default: QuickProfile)
--breadth_cutoff BREADTH_CUTOFF
Minimum genome breadth to pull scaffolds (default:
0.5)
--stringent_breadth_cutoff STRINGENT_BREADTH_CUTOFF
Minimum breadth to let scaffold into coverm raw
results (done with greater than; NOT greater than or
equal to) (default: 0.0)
plot¶
This module produces plots based on the results of inStrain profile and inStrain compare. In both cases, before plots can be made, inStrain genome_wide must be run on the output folder first. In order to make plots 8 and 9, inStrain profile_genes must be run first as well.
The recommended way of running this module is with the default -pl a. It will just try and make all of the plots that it can, and will tell you about any plots that it fails to make.
See Expected output for an example of the plots it can make.
To see the command-line options, check the help:
$ inStrain plot -h
usage: inStrain plot -i IS [-pl [PLOTS [PLOTS ...]]] [-p PROCESSES] [-d] [-h]
REQUIRED:
-i IS, --IS IS an inStrain profile object (default: None)
-pl [PLOTS [PLOTS ...]], --plots [PLOTS [PLOTS ...]]
Plots. Input 'all' or 'a' to plot all
1) Coverage and breadth vs. read mismatches
2) Genome-wide microdiversity metrics
3) Read-level ANI distribution
4) Major allele frequencies
5) Linkage decay
6) Read filtering plots
7) Scaffold inspection plot (large)
8) Linkage with SNP type (GENES REQUIRED)
9) Gene histograms (GENES REQUIRED)
10) Compare dendrograms (RUN ON COMPARE; NOT PROFILE)
(default: a)
SYSTEM PARAMETERS:
-p PROCESSES, --processes PROCESSES
Number of processes to use (default: 6)
-d, --debug Make extra debugging output (default: False)
-h, --help show this help message and exit
other¶
This module holds odds and ends functionalities. As of version 1.4, all it can do is convert old IS_profile objects (>v0.3.0) to newer versions (v0.8.0) and create runtime summaries of complete inStrain runs. As the code base around inStrain matures, we expect more functionalities to be included here.
To see the command-line options, check the help:
$ inStrain other -h
usage: inStrain other [-p PROCESSES] [-d] [-h] [--version] [--old_IS OLD_IS]
[--run_statistics RUN_STATISTICS]
SYSTEM PARAMETERS:
-p PROCESSES, --processes PROCESSES
Number of processes to use (default: 6)
-d, --debug Make extra debugging output (default: False)
-h, --help show this help message and exit
--version show program's version number and exit
OTHER OPTIONS:
--old_IS OLD_IS Convert an old inStrain version object to the newer
version. (default: None)
--run_statistics RUN_STATISTICS
Generate runtime reports for an inStrain run.
(default: None)
Expected output¶
InStrain produces a variety of output in the IS folder depending on which operations are run. Generally, output that is meant for human eyes to be easily interpretable is located in the output
folder.
inStrain profile¶
A typical run of inStrain will yield the following files in the output folder:
scaffold_info.tsv¶
This gives basic information about the scaffolds in your sample at the highest allowed level of read identity.
scaffold | length | coverage | breadth | nucl_diversity | coverage_median | coverage_std | coverage_SEM | breadth_minCov | breadth_expected | nucl_diversity_median | nucl_diversity_rarefied | nucl_diversity_rarefied_median | breadth_rarefied | conANI_reference | popANI_reference | SNS_count | SNV_count | divergent_site_count | consensus_divergent_sites | population_divergent_sites |
N5_271_010G1_scaffold_100 | 1148 | 1.89808362369338 | 0.9764808362369338 | 0.0 | 2 | 1.0372318863390368 | 0.030626273060932862 | 0.018292682926829267 | 0.8128805020451009 | 0.0 | 0.0 | 1.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | ||
N5_271_010G1_scaffold_102 | 1144 | 2.388986013986014 | 0.9956293706293706 | 0.003678160837326971 | 2 | 1.3042095721915248 | 0.038576628450898466 | 0.07604895104895107 | 0.8786983245100435 | 0.0 | 0.0 | 1.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | ||
N5_271_010G1_scaffold_101 | 1148 | 1.7439024390243902 | 0.9599303135888502 | 2 | 0.8728918441975071 | 0.025773816178570358 | 0.0 | 0.7855901382035807 | 0.0 | 0.0 | 0.0 | 0 | 00 | 0 | 0 | |||||
N5_271_010G1_scaffold_103 | 1142 | 2.039404553415061 | 0.9938704028021016 | 0.0 | 2 | 1.1288397384374758 | 0.03341869350286944 | 0.04028021015 |
- scaffold
- The name of the scaffold in the input .fasta file
- length
- Full length of the scaffold in the input .fasta file
- coverage
- The average depth of coverage on the scaffold. If half the bases in a scaffold have 5 reads on them, and the other half have 10 reads, the coverage of the scaffold will be 7.5
- breadth
- The percentage of bases in the scaffold that are covered by at least a single read. A breadth of 1 means that all bases in the scaffold have at least one read covering them
- nucl_diversity
- The mean nucleotide diversity of all bases in the scaffold that have a nucleotide diversity value calculated. So if only 1 base on the scaffold meets the minimum coverage to calculate nucleotide diversity, the nucl_diversity of the scaffold will be the nucleotide diversity of that base. Will be blank if no positions have a base over the minimum coverage.
- coverage_median
- The median depth of coverage value of all bases in the scaffold, included bases with 0 coverage
- coverage_std
- The standard deviation of all coverage values
- coverage_SEM
- The standard error of the mean of all coverage values (calculated using scipy.stats.sem)
- breadth_minCov
- The percentage of bases in the scaffold that have at least min_cov coverage (e.g. the percentage of bases that have a nucl_diversity value and meet the minimum sequencing depth to call SNVs)
- breadth_expected
- expected breadth; this tells you the breadth that you should expect if reads are evenly distributed along the genome, given the reported coverage value. Based on the function breadth = -1.000 * e^(0.883 * coverage) + 1.000. This is useful to establish whether or not the scaffold is actually in the reads, or just a fraction of the scaffold. If your coverage is 10x, the expected breadth will be ~1. If your actual breadth is significantly lower then the expected breadth, this means that reads are mapping only to a specific region of your scaffold (transposon, prophage, etc.)
- nucl_diversity_median
- The median nucleotide diversity value of all bases in the scaffold that have a nucleotide diversity value calculated
- nucl_diversity_rarefied
- The average nucleotide diversity among positions that have at least
--rarefied_coverage
(50x by default). These values are also calculated by randomly subsetting the reads at that position to--rarefied_coverage
reads - nucl_diversity_rarefied_median
- The median rarefied nucleotide diversity (similar to that described above)
- breadth_rarefied
- The percentage of bases in a scaffold that have at least
--rarefied_coverage
- conANI_reference
- The conANI between the reads and the reference genome
- popANI_reference
- The popANI between the reads and the reference genome
- SNS_count
- The total number of SNSs called on this scaffold
- SNV_count
- The total number of SNVs called on this scaffold
- divergent_site_count
- The total number of divergent sites called on this scaffold
- consensus_divergent_sites
- The total number of divergent sites in which the reads have a different consensus allele than the reference genome. These count as “differences” in the conANI_reference calculation, and
breadth_minCov
*length
counts as the denominator. - population_divergent_sites
- The total number of divergent sites in which the reads do not have the reference genome base as any allele at all (major or minor). These count as “differences” in the popANI_reference calculation, and
breadth_minCov
*length
counts as the denominator.
mapping_info.tsv¶
This provides an overview of the number of reads that map to each scaffold, and some basic metrics about their quality. The header line (starting with #; not shown in the table below) describes the parameters that were used to filter the reads
scaffold | pass_pairing_filter | filtered_pairs | median_insert | mean_PID | pass_min_insert | unfiltered_reads | unfiltered_pairs | pass_min_read_ani | filtered_priority_reads | unfiltered_singletons | mean_insert_distance | pass_min_mapq | mean_mistmaches | mean_mapq_score | unfiltered_priority_reads | pass_max_insert | filtered_singletons | mean_pair_length |
all_scaffolds | 22886 | 9435 | 318.75998426985933 | 0.942328296264744 | 22804.0 | 71399 | 22886 | 9499.0 | 0 | 25627 | 322.1849602376999 | 22886.0 | 14.325963471117715 | 17.16896792799091 | 0 | 22828.0 | 0 | 255.52 |
N5_271_010G1_scaffold_1 | 432 | 346 | 373.0 | 0.9719013034762376 | 432.0 | 959 | 432 | 346.0 | 0 | 95 | 373.72222222222223 | 432.0 | 7.643518518518517 | 33.030092592592595 | 0 | 432.0 | 0 | 274.7106481481 |
N5_271_010G1_scaffold_0 | 741 | 460 | 389.0 | 0.9643004762700924 | 740.0 | 1841 | 741 | 461.0 | 0 | 359 | 387.94466936572195 | 741.0 | 10.2361673414305 | 26.537112010796218 | 0 | 741.0 | 0 | 285.5033738191 |
N5_271_010G1_scaffold_2 | 348 | 252 | 369.5 | 0.965446218901576 | 347.0 | 865 | 348 | 253.0 | 0 | 169 | 349.0172413793104 | 348.0 | 8.227011494252874 | 31.557471264367813 | 0 | 347.0 | 0 | 243.3103448275 |
N5_271_010G1_scaffold_3 | 301 | 205 | 367.0 | 0.9639376512009891 | 301.0 | 1088 | 301 | 205.0 | 0 | 486 | 327.81395348837214 | 301.0 | 8.70764119601329 | 29.089700996677745 | 0 | 300.0 | 0 | 251.2624584717 |
N5_271_010G1_scaffold_4 | 213 | 153 | 389.0 | 0.9649427929020106 | 213.0 | 502 | 213 | 153.0 | 0 | 76 | 372.3896713615024 | 213.0 | 9.27699530516432 | 30.70422535211268 | 0 | 213.0 | 0 | 269.2300469483 |
N5_271_010G1_scaffold_5 | 134 | 114 | 366.0 | 0.977820509122326 | 134.0 | 349 | 134 | 116.0 | 0 | 81 | 376.4552238805969 | 134.0 | 5.164179104477612 | 37.61194029850746 | 0 | 132.0 | 0 | 246.8059701492 |
N5_271_010G1_scaffold_6 | 140 | 130 | 384.5 | 0.9813174696928879 | 140.0 | 316 | 140 | 130.0 | 0 | 36 | 372.45 | 140.0 | 4.864285714285714 | 38.43571428571428 | 0 | 140.0 | 0 | 261.3071428571429 |
- scaffold
- The name of the scaffold in the input .fasta file. For the top row this will read
all_scaffolds
, and it has the sum of all rows. - pass_pairing_filter
- The number of individual reads that pass the selecting pairing filter (only paired reads will pass this filter by default)
- filtered_pairs
- The number of pairs of reads that pass all cutoffs
- median_insert
- Among all pairs of reads mapping to this scaffold, the median insert distance.
- mean_PID
- Among all pairs of reads mapping to this scaffold, the average percentage ID of both reads in the pair to the reference .fasta file
- pass_min_insert
- The number of pairs of reads mapping to this scaffold that pass the minimum insert size cutoff
- unfiltered_reads
- The raw number of reads that map to this scaffold
- unfiltered_pairs
- The raw number of pairs of reads that map to this scaffold. Only paired reads are used by inStrain
- pass_min_read_ani
- The number of pairs of reads mapping to this scaffold that pass the min_read_ani cutoff
- filtered_priority_reads
- The number of priority reads that pass the rest of the filters (will only be non-0 if a priority reads input file is provided)
- unfiltered_singletons
- The number of reads detected in which only one read of the pair is mapped
- mean_insert_distance
- Among all pairs of reads mapping to this scaffold, the mean insert distance. Note that the insert size is measured from the start of the first read to the end of the second read (2 perfectly overlapping 50bp reads will have an insert size of 50bp)
- pass_min_mapq
- The number of pairs of reads mapping to this scaffold that pass the minimum mapQ score cutoff
- mean_mistmaches
- Among all pairs of reads mapping to this scaffold, the mean number of mismatches
- mean_mapq_score
- Among all pairs of reads mapping to this scaffold, the average mapQ score
- unfiltered_priority_reads
- The number of reads that pass the pairing filter because they were part of the
priority_reads
input file (will only be non-0 if a priority reads input file is provided). - pass_max_insert
- The number of pairs of reads mapping to this scaffold that pass the maximum insert size cutoff- that is, their insert size is less than 3x the median insert size of all_scaffolds. Note that the insert size is measured from the start of the first read to the end of the second read (2 perfectly overlapping 50bp reads will have an insert size of 50bp)
- filtered_singletons
- The number of reads detected in which only one read of the pair is mapped AND which make it through to be considered. This will only be non-0 if the filtering settings allows non-paired reads
- mean_pair_length
- Among all pairs of reads mapping to this scaffold, the average length of both reads in the pair summed together
Warning
Adjusting the pairing filter will result in odd values for the “filtered_pairs” column; this column reports the number of pairs AND singletons that pass the filters. To calculate the true number of filtered pairs, use the formula filtered_pairs - filtered_singletons
SNVs.tsv¶
This describes the SNVs and SNSs that are detected in this mapping. While we should refer to these mutations as divergent sites, sometimes SNV is used to refer to both SNVs and SNSs
Warning
inStrain reports 0-based values for “position”. The first base in a scaffold will be position “0”, second based position “1”, etc.
scaffold | position | position_coverage | allele_count | ref_base | con_base | var_base | ref_freq | con_freq | var_freq | A | C | T | G | gene | mutation | mutation_type | cryptic | class |
N5_271_010G1_scaffold_120 | 174 | 5 | 2 | C | C | A | 0.6 | 0.6 | 0.4 | 2 | 3 | 0 | 0 | I | False | SNV | ||
N5_271_010G1_scaffold_120 | 195 | 6 | 1 | T | C | A | 0.0 | 1.0 | 0.0 | 0 | 6 | 0 | 0 | I | False | SNS | ||
N5_271_010G1_scaffold_120 | 411 | 8 | 2 | A | A | C | 0.75 | 0.75 | 0.25 | 6 | 2 | 0 | 0 | N5_271_010G1_scaffold_120_1 | N:V163G | N | False | SNV |
N5_271_010G1_scaffold_120 | 426 | 9 | 2 | G | G | T | 0.7777777777777778 | 0.7777777777777778 | 0.2222222222222222 | 0 | 0 | 2 | 7 | N5_271_010G1_scaffold_120_1 | N:S178Y | N | False | SNV |
N5_271_010G1_scaffold_120 | 481 | 6 | 2 | C | T | C | 0.3333333333333333 | 0.6666666666666666 | 0.3333333333333333 | 0 | 2 | 4 | 0 | N5_271_010G1_scaffold_120_1 | N:D233N | N | False | con_SNV |
N5_271_010G1_scaffold_120 | 484 | 6 | 2 | G | A | G | 0.3333333333333333 | 0.6666666666666666 | 0.3333333333333333 | 4 | 0 | 0 | 2 | N5_271_010G1_scaffold_120_1 | N:P236S | N | False | con_SNV |
N5_271_010G1_scaffold_120 | 488 | 5 | 1 | T | C | T | 0.2 | 0.8 | 0.2 | 0 | 4 | 1 | 0 | N5_271_010G1_scaffold_120_1 | S:240 | S | False | SNS |
N5_271_010G1_scaffold_120 | 811 | 5 | 1 | T | A | T | 0.2 | 0.8 | 0.2 | 4 | 0 | 1 | 0 | N5_271_010G1_scaffold_120_1 | N:N563Y | N | False | SNS |
N5_271_010G1_scaffold_120 | 897 | 7 | 2 | G | G | T | 0.7142857142857143 | 0.7142857142857143 | 0.2857142857142857 | 0 | 0 | 2 | 5 | I | False | SNV |
See the module_descriptions for what constitutes a SNP (what makes it into this table)
- scaffold
- The scaffold that the SNV is on
- position
- The genomic position of the SNV
- position_coverage
- The number of reads detected at this position
- allele_count
- The number of bases that are detected above background levels (according to the null model. An allele_count of 0 means no bases are supported by the reads, an allele_count of 1 means that only 1 base is supported by the reads, an allele_count of 2 means two bases are supported by the reads, etc.
- ref_base
- The reference base in the .fasta file at that position
- con_base
- The consensus base (the base that is supported by the most reads)
- var_base
- Variant base; the base with the second most reads
- ref_freq
- The fraction of reads supporting the ref_base
- con_freq
- The fraction of reds supporting the con_base
- var_freq
- The fraction of reads supporting the var_base
- A, C, T, and G
- The number of mapped reads encoding each of the bases
- gene
- If a gene file was included, this column will be present listing if the SNV is in the coding sequence of a gene
- mutation
- Short-hand code for the amino acid switch. If synonymous, this will be S: + the position. If nonsynonymous, this will be N: + the old amino acid + the position + the new amino acid. NOTE - the position of the amino acid is always calculated from left to right on the genome file, whether or not it’s the forward or reverse strand. Codons are calculated correctly (considering strandedness), this only impacts the reported “position” in this column. I know this is weird behavior and it will change in future inStrain versions.
- mutation_type
- What type of mutation this is. N = nonsynonymous, S = synonymous, I = intergenic, M = there are multiple genes with this base so you cant tell
- cryptic
- If an SNV is cryptic, it means that it is detected when using a lower read mismatch threshold, but becomes undetected when you move to a higher read mismatch level. See “dealing with mm” in the advanced_use section for more details on what this means.
- class
- The classification of this divergent site. The options are SNS (meaning allele_count is 1 and con_base does not equal ref_base), SNV (meaning allele_count is > 1 and con_base does equal ref_base), con_SNV (meaning allele_count is > 1, con_base does not equal ref_base, and ref_base is present in the reads; these count as differences in conANI calculations), pop_SNV (meaning allele_count is > 1, con_base does not equal ref_base, and ref_base is not present in the reads; these count as differences in popANI and conANI calculations), DivergentSite (meaning allele count is 0), and AmbiguousReference (meaning the ref_base is not A, C, T, or G)
linkage.tsv¶
This describes the linkage between pairs of SNPs in the mapping that are found on the same read pair at least min_snp times.
Warning
inStrain reports 0-based values for “position”. The first base in a scaffold will be position “0”, second based position “1”, etc.
scaffold | position_A | position_B | distance | r2 | d_prime | r2_normalized | d_prime_normalized | allele_A | allele_a | allele_B | allele_b | countab | countAb | countaB | countAB | total |
N5_271_010G1_scaffold_93 | 58 | 59 | 1 | 0.021739130434782702 | 1.0 | 0.031141868512110725 | 1.0 | C | T | G | A | 0 | 3 | 4 | 20 | 27 |
N5_271_010G1_scaffold_93 | 58 | 70 | 12 | 0.012820512820512851 | 1.0 | C | T | T | A | 0 | 2 | 4 | 22 | 28 | ||
N5_271_010G1_scaffold_93 | 58 | 80 | 22 | 0.016722408026755814 | 1.0 | 0.005847953216374271 | 1.0 | C | T | G | A | 0 | 2 | 5 | 21 | 28 |
N5_271_010G1_scaffold_93 | 58 | 84 | 26 | 0.7652173913043475 | 1.0000000000000002 | 0.6296296296296297 | 1.0 | C | T | G | C | 4 | 0 | 1 | 22 | 27 |
N5_271_010G1_scaffold_93 | 58 | 101 | 43 | 0.00907029478458067 | 1.0 | C | T | C | A | 0 | 2 | 2 | 19 | 23 | ||
N5_271_010G1_scaffold_93 | 58 | 126 | 68 | 0.01754385964912257 | 1.0 | 0.002770083102493075 | 1.0 | C | T | A | T | 0 | 2 | 3 | 16 | 21 |
N5_271_010G1_scaffold_93 | 58 | 133 | 75 | 0.008333333333333352 | 1.0 | C | T | G | T | 0 | 1 | 3 | 17 | 21 | ||
N5_271_010G1_scaffold_93 | 59 | 70 | 11 | 0.010869565217391413 | 1.0 | 0.02777777777777779 | 1.0 | G | A | T | A | 0 | 2 | 3 | 21 | 26 |
N5_271_010G1_scaffold_93 | 59 | 80 | 21 | 0.6410256410256397 | 1.0 | 1.0 | 1.0 | G | A | G | A | 2 | 0 | 1 | 25 | 28 |
Linkage is used primarily to determine if organisms are undergoing horizontal gene transfer or not. It’s calculated for pairs of SNPs that can be connected by at least min_snp
reads. It’s based on the assumption that each SNP has two alleles (for example, a A and b B). This all gets a bit confusing and has a large amount of literature around each of these terms, but I’ll do my best to briefly explain what’s going on
- scaffold
- The scaffold that both SNPs are on
- position_A
- The position of the first SNP on this scaffold
- position_B
- The position of the second SNP on this scaffold
- distance
- The distance between the two SNPs
- r2
- This is the r-squared linkage metric. See below for how it’s calculated
- d_prime
- This is the d-prime linkage metric. See below for how it’s calculated
- r2_normalized, d_prime_normalized
- These are calculated by rarefying to
min_snp
number of read pairs. See below for how it’s calculated - allele_A
- One of the two bases at position_A
- allele_a
- The other of the two bases at position_A
- allele_B
- One of the bases at position_B
- allele_b
- The other of the two bases at position_B
- countab
- The number of read-pairs that have allele_a and allele_b
- countAb
- The number of read-pairs that have allele_A and allele_b
- countaB
- The number of read-pairs that have allele_a and allele_B
- countAB
- The number of read-pairs that have allele_A and allele_B
- total
- The total number of read-pairs that have have information for both position_A and position_B
Python code for the calculation of these metrics:
freq_AB = float(countAB) / total
freq_Ab = float(countAb) / total
freq_aB = float(countaB) / total
freq_ab = float(countab) / total
freq_A = freq_AB + freq_Ab
freq_a = freq_ab + freq_aB
freq_B = freq_AB + freq_aB
freq_b = freq_ab + freq_Ab
linkD = freq_AB - freq_A * freq_B
if freq_a == 0 or freq_A == 0 or freq_B == 0 or freq_b == 0:
r2 = np.nan
else:
r2 = linkD*linkD / (freq_A * freq_a * freq_B * freq_b)
linkd = freq_ab - freq_a * freq_b
# calc D-prime
d_prime = np.nan
if (linkd < 0):
denom = max([(-freq_A*freq_B),(-freq_a*freq_b)])
d_prime = linkd / denom
elif (linkD > 0):
denom = min([(freq_A*freq_b), (freq_a*freq_B)])
d_prime = linkd / denom
################
# calc rarefied
rareify = np.random.choice(['AB','Ab','aB','ab'], replace=True, p=[freq_AB,freq_Ab,freq_aB,freq_ab], size=min_snp)
freq_AB = float(collections.Counter(rareify)['AB']) / min_snp
freq_Ab = float(collections.Counter(rareify)['Ab']) / min_snp
freq_aB = float(collections.Counter(rareify)['aB']) / min_snp
freq_ab = float(collections.Counter(rareify)['ab']) / min_snp
freq_A = freq_AB + freq_Ab
freq_a = freq_ab + freq_aB
freq_B = freq_AB + freq_aB
freq_b = freq_ab + freq_Ab
linkd_norm = freq_ab - freq_a * freq_b
if freq_a == 0 or freq_A == 0 or freq_B == 0 or freq_b == 0:
r2_normalized = np.nan
else:
r2_normalized = linkd_norm*linkd_norm / (freq_A * freq_a * freq_B * freq_b)
# calc D-prime
d_prime_normalized = np.nan
if (linkd_norm < 0):
denom = max([(-freq_A*freq_B),(-freq_a*freq_b)])
d_prime_normalized = linkd_norm / denom
elif (linkd_norm > 0):
denom = min([(freq_A*freq_b), (freq_a*freq_B)])
d_prime_normalized = linkd_norm / denom
rt_dict = {}
for att in ['r2', 'd_prime', 'r2_normalized', 'd_prime_normalized', 'total', 'countAB', \
'countAb', 'countaB', 'countab', 'allele_A', 'allele_a', \
'allele_B', 'allele_b']:
rt_dict[att] = eval(att)
gene_info.tsv¶
This describes some basic information about the genes being profiled
Warning
inStrain reports 0-based values for “position”, including the “start” and “stop” in this table. The first base in a scaffold will be position “0”, second based position “1”, etc.
scaffold | gene | gene_length | coverage | breadth | breadth_minCov | nucl_diversity | start | end | direction | partial | dNdS_substitutions | pNpS_variants | SNV_count | SNV_S_count | SNV_N_count | SNS_count | SNS_S_count | SNS_N_count | divergent_site_count |
N5_271_010G1_scaffold_0 | N5_271_010G1_scaffold_0_1 | 141.0 | 0.7092198581560284 | 0.7092198581560284 | 0.0 | 143 | 283 | -1 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
N5_271_010G1_scaffold_0 | N5_271_010G1_scaffold_0_2 | 219.0 | 4.849315068493151 | 1.0 | 0.45662100456620996 | 0.012312216758728069 | 2410 | 2628 | -1 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||
N5_271_010G1_scaffold_0 | N5_271_010G1_scaffold_0_3 | 282.0 | 7.528368794326241 | 1.0 | 0.9609929078014184 | 0.00805835530326815 | 3688 | 3969 | -1 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||
N5_271_010G1_scaffold_1 | N5_271_010G1_scaffold_1_1 | 336.0 | 2.7261904761904763 | 1.0 | 0.0625 | 0.0 | 0 | 335 | -1 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ||
N5_271_010G1_scaffold_1 | N5_271_010G1_scaffold_1_2 | 717.0 | 7.714086471408647 | 1.0 | 0.8926080892608089 | 0.011336830817162968 | 378 | 1094 | -1 | False | 0.554203539823008 | 9.0 | 2.0 | 6.0 | 0.0 | 0.0 | 0.0 | 9.0 | |
N5_271_010G1_scaffold_1 | N5_271_010G1_scaffold_1_3 | 114.0 | 13.105263157894735 | 1.0 | 1.0 | 0.016291986431991808 | 1051 | 1164 | -1 | False | 0.3956834532374099 | 4.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 4.0 | |
N5_271_010G1_scaffold_1 | N5_271_010G1_scaffold_1_4 | 111.0 | 11.342342342342342 | 1.0 | 1.0 | 0.02102806761458109 | 1164 | 1274 | -1 | False | 5.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 5.0 | ||
N5_271_010G1_scaffold_1 | N5_271_010G1_scaffold_1_5 | 174.0 | 9.057471264367816 | 1.0 | 1.0 | 0.006896087493019509 | 1476 | 1649 | -1 | False | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | |
N5_271_010G1_scaffold_1 | N5_271_010G1_scaffold_1_6 | 174.0 | 6.195402298850576 | 1.0 | 0.7413793103448276 | 0.028698649055273976 | 1656 | 1829 | -1 | False | 0.5790697674418601 | 4.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 4.0 |
- scaffold
- Scaffold that the gene is on
- gene
- Name of the gene being profiled
- gene_length
- Length of the gene in nucleotides
- breadth
- The number of bases in the gene that have at least 1x coverage
- breadth_minCov
- The number of bases in the gene that have at least min_cov coverage
- nucl_diversity
- The mean nucleotide diversity of all bases in the gene that have a nucleotide diversity value calculated. So if only 1 base on the scaffold meets the minimum coverage to calculate nucleotide diversity, the nucl_diversity of the scaffold will be the nucleotide diversity of that base. Will be blank if no positions have a base over the minimum coverage.
- start
- Start of the gene (position on scaffold; 0-indexed)
- end
- End of the gene (position on scaffold; 0-indexed)
- direction
- Direction of the gene (based on prodigal call). If -1, means the gene is not coded in the direction expressed by the .fasta file
- partial
- If True this is a partial gene; based on not having partial=00 in the record description provided by Prodigal
- dNdS_substitutions
- The dN/dS of SNSs detected in this gene. Will be blank if 0 N and/or 0 S substitutions are detected
- pNpS_variants
- The pN/pS of SNVs detected in this gene. Will be blank if 0 N and/or 0 S SNVs are detected
- SNV_count
- Total number of SNVs detected in this gene
- SNV_S_count
- Number of synonymous SNVs detected in this gene
- SNV_N_count
- Number of non-synonymous SNVs detected in this gene
- SNS_count
- Total number of SNSs detected in this gens
- SNS_S_count
- Number of synonymous SNSs detected in this gens
- SNS_N_count
- Number of non-synonymous SNSs detected in this gens
- divergent_site_count
- Number of divergent sites detected in this gens
genome_info.tsv¶
Describes many of the above metrics on a genome-by-genome level, rather than a scaffold-by-scaffold level.
genome | coverage | breadth | nucl_diversity | length | true_scaffolds | detected_scaffolds | coverage_median | coverage_std | coverage_SEM | breadth_minCov | breadth_expected | nucl_diversity_rarefied | conANI_reference | popANI_reference | iRep | iRep_GC_corrected | linked_SNV_count | SNV_distance_mean | r2_mean | d_prime_mean | consensus_divergent_sites | population_divergent_sites | SNS_count | SNV_count | filtered_read_pair_cou |
nt | reads_unfiltered_pairs | reads_mean_PID | reads_unfiltered_reads | divergent_site_count | |||||||||||||||||||||
fobin.fasta | 132.07770270270268 | 0.9974662162162162 | 0.035799449026225894 | 1184 | 1 | 1 | 113 | 114.96590198492832 | 3.6668428018497408 | 0.9822635135135136 | 1.0 | 0.034319907739082 | 0.979363714531 | ||||||||||||
3844 | 0.9939810834049873 | False | 1064.0 | 120.48214285714286 | 0.07781470898619759 | 0.8710788695476385 | 24 | 7 | 7 | 97 | 926 | 5991 | 0.9239440924157436 | 19260 | 104 | ||||||||||
maxbin2.maxbin.001.fasta | 6.5637243038012985 | 0.8940915760335204 | 0.007116301715134402 | 264436 | 166 | 166 | 5 | 9.475490303923918 | 0.019704930458769948 | 0.5080246259964604 | 0.99695960719657 | 0.0002 | |||||||||||||
8497234066195295 | 0.997201131457496 | 0.9990248622897128 | False | 777.0 | 80.73101673101674 | 0.2979679685064011 | 0.9518999449773424 | 376 | 131 | 127 | 1246 | 7368 | 9309 | 0.9783316024248924 | 2 | ||||||||||
5281 | 1373 |
- genome
- The name of the genome being profiled. If all scaffolds were a single genome, this will read “all_scaffolds”
- coverage
- Average coverage depth of all scaffolds of this genome
- breadth
- The breadth of all scaffolds of this genome
- nucl_diversity
- The average nucleotide diversity of all scaffolds of this genome
- length
- The full length of this genome across all scaffolds
- true_scaffolds
- The number of scaffolds present in this genome based off of the scaffold-to-bin file
- detected_scaffolds
- The number of scaffolds with at least a single read-pair mapping to them
- coverage_median
- The median coverage among all bases in the genome
- coverage_std
- The standard deviation of all coverage values
- coverage_SEM
- The standard error of the mean of all coverage values (calculated using scipy.stats.sem)
- breadth_minCov
- The percentage of bases in the scaffold that have at least min_cov coverage (e.g. the percentage of bases that have a nucl_diversity value and meet the minimum sequencing depth to call SNVs)
- breadth_expected
- This tells you the breadth that you should expect if reads are evenly distributed along the genome, given the reported coverage value. Based on the function breadth = -1.000 * e^(0.883 * coverage) + 1.000. This is useful to establish whether or not the scaffold is actually in the reads, or just a fraction of the scaffold. If your coverage is 10x, the expected breadth will be ~1. If your actual breadth is significantly lower then the expected breadth, this means that reads are mapping only to a specific region of your scaffold (transposon, prophage, etc.)
- nucl_diversity_rarefied
- The average nucleotide diversity among positions that have at least
--rarefied_coverage
(50x by default). These values are also calculated by randomly subsetting the reads at that position to--rarefied_coverage
reads - conANI_reference
- The conANI between the reads and the reference genome
- popANI_reference
- The popANI between the reads and the reference genome
- iRep
- The iRep value for this genome (if it could be successfully calculated)
- iRep_GC_corrected
- A True / False value of whether the iRep value was corrected for GC bias
- linked_SNV_count
- The number of divergent sites that could be linked in this genome
- SNV_distance_mean
- Average distance between linked divergent sites
- r2_mean
- Average r2 between linked SNVs (see explanation of linkage.tsv above for more info)
- d_prime_mean
- Average d prime between linked SNVs (see explanation of linkage.tsv above for more info)
- consensus_divergent_sites
- The total number of divergent sites in which the reads have a different consensus allele than the reference genome. These count as “differences” in the conANI_reference calculation, and
breadth_minCov
*length
counts as the denominator. - population_divergent_sites
- The total number of divergent sites in which the reads do not have the reference genome base as any allele at all (major or minor). These count as “differences” in the popANI_reference calculation, and
breadth_minCov
*length
counts as the denominator. - SNS_count
- The total number of SNSs called on this genome
- SNV_count
- The total number of SNVs called on this genome
- filtered_read_pair_count
- The total number of read pairs that pass filtering and map to this genome
- reads_unfiltered_pairs
- The total number of pairs, filtered or unfiltered, that map to this genome
- reads_mean_PID
- The average ANI of mapped read pairs to the reference genome for this genome
- reads_unfiltered_reads
- The total number of reads, filtered or unfiltered, that map to this genome
- divergent_site_count
- The total number of divergent sites called on this genome
inStrain parse_annotations¶
A typical run of inStrain parse_gene_annotations will yield the following files in the output folder. For more information, see User Manual
LongFormData.csv¶
All of the annotation information a very long table
sample | anno | genomes | genes | bases |
2bag10_1.bam | K03737 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’} | 1 | 6666 |
2bag10_1.bam | K06973 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’} | 1 | 1068 |
2bag10_1.bam | K04066 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’, ‘Bifidobacterium_longum_subsp_infantis_ATCC_15697.fna’} | 2 | 195761 |
2bag10_1.bam | K15558 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’, ‘Bifidobacterium_longum_subsp_infantis_ATCC_15697.fna’} | 96 | 10748749 |
2bag10_1.bam | K19762 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’, ‘Bifidobacterium_longum_subsp_infantis_ATCC_15697.fna’} | 97 | 10920075 |
2bag10_1.bam | 3000025 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’, ‘Bifidobacterium_longum_subsp_infantis_ATCC_15697.fna’} | 2 | 168916 |
2bag10_1.bam | K18888 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’, ‘Bifidobacterium_longum_subsp_infantis_ATCC_15697.fna’} | 3 | 504008 |
2bag10_1.bam | K20386 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’, ‘Bifidobacterium_longum_subsp_infantis_ATCC_15697.fna’} | 98 | 11007871 |
2bag10_1.bam | K07979 | {‘REFINED_METABAT215_TOP10_CONTIGS_1500_ASSEMBLY_K77_MERGED__Hadza_MoBio_hadza-E-H_A_23_1707.16.fa’} | 1 | 742 |
- sample
- The sample this row refers to (based on the name of the .bam file used to create the inStrain profile)
- anno
- The annotation this row refers to (based on the input annotation table)
- genomes
- The specific genomes that have this particular annotation. Represented as a python set
- genes
- The total number of genes detected with this annotation in this sample
- bases
- The total number of base-pairs mapped to all genes with this annotation in this sample
SampleAnnotationTotals.csv¶
Totals for each sample. Used to generate the _fraction tables enumerated below.
sample | detected_genes | detected_genomes | bases_mapped_to_genes | detected_annotations | detected_genes_with_anno |
2bag10_1.bam | 2625 | 2 | 222405987 | 3302 | 1677 |
2bag10_2.bam | 20909 | 10 | 2418511040 | 32225 | 15513 |
- sample
- The sample this row refers to (based on the name of the .bam file used to create the inStrain profile)
- detected_genes
- The total number of genes detected in this sample after passing the set filters
- detected_genomes
- The total number of genomes detected in this sample after passing the set filters
- bases_mapped_to_genes
- The total number of bases mapped to detected genes. See ParsedGeneAnno_bases.csv below for more info
- detected_annotations
- The total number of annotations detected; this can be higher than detected_genes_with_anno if some genes have multiple annotations
- detected_genes_with_anno
- The total number of genes detected with at least one annotation
ParsedGeneAnno_*.csv¶
There are a total of 6 tables like this generated in the output folder, each looking like the following:
sample | 3000005 | 3000024 | 3000025 | 3000026 | 3000027 | 3000074 | 3000118 | 3000165 | 3000166 |
2bag10_1.bam | 131097 | 1286827 | 168916 | 1656 | 0 | 0 | 0 | 0 | 0 |
2bag10_2.bam | 104013 | 5016854 | 955645 | 2552 | 633275 | 1034042 | 95617 | 409295 | 541951 |
In each case the column sample is the sample the row refers to (based on the name of the .bam file used to create the inStrain profile), and all other columns are annotations from the input annotation_table provides. The number values differ depending on the individual output table being analyzed. Below you can find descriptions on what the numbers refer to:
- ParsedGeneAnno_bases.csv
- The total number of base pairs mapped to all genes with this annotation. The number of base pairs mapped for each gene with this annotation is calculated as the gene length * the coverage of the gene, and the number reported is the sum of this value of all genes
- ParsedGeneAnno_bases_fraction.csv
- The values in ParsedGeneAnno_bases.csv divided by the total number of bases mapped to all detected genes (the value bases_mapped_to_genes reported in SampleAnnotationTotals.csv)
- ParsedGeneAnno_genes.csv
- The total number of detected genes with this annotation
- ParsedGeneAnno_genes_fraction.csv
- The values in ParsedGeneAnno_genes.csv divided by the total number of genes detected (the value detected_genes reported in SampleAnnotationTotals.csv)
- ParsedGeneAnno_genomes.csv
- The total number of genomes with at least one detected gene with this annotation
- ParsedGeneAnno_genomes_fraction.csv
- The values in ParsedGeneAnno_genomes.csv divided by the total number of genomes detected (the value detected_genomes reported in SampleAnnotationTotals.csv)
inStrain compare¶
A typical run of inStrain will yield the following files in the output folder:
comparisonsTable.tsv¶
Summarizes the differences between two inStrain profiles on a scaffold by scaffold level
scaffold | name1 | name2 | coverage_overlap | compared_bases_count | percent_genome_compared | length | consensus_SNPs | population_SNPs | popANI | conANI |
N5_271_010G1_scaffold_98 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 61 | 0.05290546400693842 | 1153 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_133 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 78 | 0.0741444866920152 | 1052 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_144 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 172 | 0.16715257531584066 | 1029 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_158 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 36 | 0.035749751737835164 | 1007 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_57 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 24 | 0.0183206106870229 | 1310 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_139 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 24 | 0.023121387283236997 | 1038 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_92 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 336 | 0.286934244235696 | 1171 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_97 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 22 | 0.01901469317199654 | 1157 | 0 | 0 | 1.0 | 1.0 |
N5_271_010G1_scaffold_100 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 21 | 0.018292682926829267 | 1148 | 0 | 0 | 1.0 | 1.0 |
- scaffold
- The scaffold being compared
- name1
- The name of the first inStrain profile being compared
- name2
- The name of the second inStrain profile being compared
- coverage_overlap
- The percentage of bases that are either covered or not covered in both of the profiles (covered = the base is present at at least min_snp coverage). The formula is length(coveredInBoth) / length(coveredInEither). If both scaffolds have 0 coverage, this will be 0.
- compared_bases_count
- The number of considered bases; that is, the number of bases with at least min_snp coverage in both profiles. Formula is length([x for x in overlap if x == True]).
- percent_genome_compared
- The percentage of bases in the scaffolds that are covered by both. The formula is length([x for x in overlap if x == True])/length(overlap). When ANI is np.nan, this must be 0. If both scaffolds have 0 coverage, this will be 0.
- length
- The total length of the scaffold
- consensus_SNPs
- The number of locations along the genome where both samples have the base at >= 5x coverage, and the consensus allele in each sample is different. Used to calculate conANI
- population_SNPs
- The number of locations along the genome where both samples have the base at >= 5x coverage, and no alleles are shared between either sample. Used to calculate popANI
- popANI
- The average nucleotide identity among compared bases between the two scaffolds, based on population_SNPs. Calculated using the formula popANI = (compared_bases_count - population_SNPs) / compared_bases_count
- conNI
- The average nucleotide identity among compared bases between the two scaffolds, based on consensus_SNPs. Calculated using the formula conANI = (compared_bases_count - consensus_SNPs) / compared_bases_count
pairwise_SNP_locations.tsv¶
Warning
inStrain reports 0-based values for “position”. The first base in a scaffold will be position “0”, second based position “1”, etc.
Lists the locations of all differences between profiles. Because it’s a big file, this will only be created is you include the flag --store_mismatch_locations
in your inStrain compare command.
scaffold | position | name1 | name2 | consensus_SNP | population_SNP | con_base_1 | ref_base_1 | var_base_1 | position_coverage_1 | A_1 | C_1 | T_1 | G_1 | con_base_2 | ref_base_2 | var_base_2 | position_coverage_2 | A_2 | C_2 | T_2 | G_2 |
N5_271_010G1_scaffold_9 | 823 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | G | G | A | 10.0 | 3.0 | 0.0 | 0.0 | 7.0 | A | G | G | 6.0 | 3.0 | 0.0 | 0.0 | 3.0 |
N5_271_010G1_scaffold_11 | 906 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | T | T | C | 6.0 | 0.0 | 2.0 | 4.0 | 0.0 | C | T | T | 7.0 | 0.0 | 4.0 | 3.0 | 0.0 |
N5_271_010G1_scaffold_29 | 436 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | C | T | T | 6.0 | 0.0 | 3.0 | 3.0 | 0.0 | T | T | C | 7.0 | 0.0 | 3.0 | 4.0 | 0.0 |
N5_271_010G1_scaffold_140 | 194 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | A | A | T | 6.0 | 4.0 | 0.0 | 2.0 | 0.0 | T | A | A | 9.0 | 4.0 | 0.0 | 5.0 | 0.0 |
N5_271_010G1_scaffold_24 | 1608 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | G | G | A | 8.0 | 2.0 | 0.0 | 0.0 | 6.0 | A | G | G | 6.0 | 5.0 | 0.0 | 0.0 | 1.0 |
N5_271_010G1_scaffold_112 | 600 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | A | G | G | 6.0 | 4.0 | 0.0 | 0.0 | 2.0 | ||||||||
N5_271_010G1_scaffold_88 | 497 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | A | G | G | 5.0 | 3.0 | 0.0 | 0.0 | 2.0 | ||||||||
N5_271_010G1_scaffold_53 | 1108 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | A | A | G | 5.0 | 3.0 | 0.0 | 0.0 | 2.0 | G | A | A | 15.0 | 6.0 | 0.0 | 0.0 | 9.0 |
N5_271_010G1_scaffold_46 | 710 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | True | False | A | C | C | 6.0 | 4.0 | 2.0 | 0.0 | 0.0 | C | C | A | 6.0 | 2.0 | 4.0 | 0.0 | 0.0 |
- scaffold
- The scaffold on which the difference is located
- position
- The position where the difference is located (0-based)
- name1
- The name of the first inStrain profile being compared
- name2
- The name of the second inStrain profile being compared
- consensus_SNP
- A True / False column listing whether or not this difference counts towards conANI calculations
- population_SNP
- A True / False column listing whether or not this difference counts towards popANI calculations
- con_base_1
- The consensus base of the profile listed in
name1
at this position - ref_base_1
- The reference base of the profile listed in
name1
at this position (will be the same asref_base_2
) - var_base_1
- The variant base of the profile listed in
name1
at this position - position_coverage_1
- The number of reads mapping to this position in
name1
- A_1, C_1, T_1, G_1
- The number of mapped reads with each nucleotide in
name1
- con_base_2, ref_base_2, …
- The above columns are also listed for the
name2
sample
genomeWide_compare.tsv¶
A genome-level summary of the differences detected by inStrain compare. Generated by running inStrain genome_wide
on the results of inStrain compare
, or by providing an stb file to the original inStrain compare command.
genome | name1 | name2 | coverage_overlap | compared_bases_count | consensus_SNPs | population_SNPs | popANI | conANI | percent_compared |
all_scaffolds | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | 1.0 | 100712 | 0 | 0 | 1.0 | 1.0 | 0.3605549091560011 |
all_scaffolds | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | 0.6852932198159855 | 71900 | 196 | 50.9999304589707928 | 0.9972739916550765 | 0.25740624720307886 | |
all_scaffolds | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | 1.0 | 145663 | 0 | 0 | 1.0 | 1.0 | 0.5214821444553835 |
- genome
- The genome being compared
- name1
- The name of the first inStrain profile being compared
- name2
- The name of the second inStrain profile being compared
- coverage_overlap
- The percentage of bases that are either covered or not covered in both of the profiles (covered = the base is present at at least min_snp coverage). The formula is length(coveredInBoth) / length(coveredInEither). If both scaffolds have 0 coverage, this will be 0.
- compared_bases_count
- The number of considered bases; that is, the number of bases with at least min_snp coverage in both profiles. Formula is length([x for x in overlap if x == True]).
- percent_genome_compared
- The percentage of bases in the scaffolds that are covered by both. The formula is length([x for x in overlap if x == True])/length(overlap). When ANI is np.nan, this must be 0. If both scaffolds have 0 coverage, this will be 0.
- length
- The total length of the genome
- consensus_SNPs
- The number of locations along the genome where both samples have the base at >= 5x coverage, and the consensus allele in each sample is different. Used to calculate conANI
- population_SNPs
- The number of locations along the genome where both samples have the base at >= 5x coverage, and no alleles are shared between either sample. Used to calculate popANI
- popANI
- The average nucleotide identity among compared bases between the two scaffolds, based on population_SNPs. Calculated using the formula popANI = (compared_bases_count - population_SNPs) / compared_bases_count
- conNI
- The average nucleotide identity among compared bases between the two scaffolds, based on consensus_SNPs. Calculated using the formula conANI = (compared_bases_count - consensus_SNPs) / compared_bases_count
strain_clusters.tsv¶
The result of clustering the pairwise comparison data provided in genomeWide_compare.tsv
to generate strain-level clusters. Performed using hierarchical clustering in the same manner as the program dRep
; see the dRep documentation for some info on the oddities of hierarchical clustering
cluster | sample | genome |
1_1 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | fobin.fasta |
1_1 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | fobin.fasta |
2_1 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | maxbin2.maxbin.001.fasta |
2_2 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | maxbin2.maxbin.001.fasta |
- cluster
- The strain identity of this genome in this sample. Each “strain” assigned by the hierarchical clustering algorithm will have a unique cluster. In the example table above strains of the genome
fobin.fasta
are the same in both samples (they have the same “cluster” identities), but strains of the genomemaxbin2.maxbin.001.fasta
are different in the two samples (they have different “cluster” identities). - sample
- The sample that the genome was detected in.
- genome
- The genome that the cluster is referring to.
pooled_SNV_info.tsv, pooled_SNV_data.tsv, and pooled_SNV_data_keys.tsv¶
The tables pooled_SNV_info.tsv, pooled_SNV_data.tsv, and pooled_SNV_data_keys.tsv can be generated by inStrain compare by providing .bam files to the inStrain compare command. See User Manual for more information.
scaffold | position | depth | A | C | T | G | ref_base | con_base | var_base | sample_detections | sample_5x_detections | DivergentSite_count | SNS_count | SNV_count | con_SNV_count | pop_SNV_count | sample_con_bases |
N5_271_010G1_scaffold_114 | 3 | 10 | 2 | 0 | 7 | 1 | T | T | A | 2 | 1 | 0 | 0 | 1 | 0 | 0 | [‘T’] |
N5_271_010G1_scaffold_114 | 20 | 33 | 0 | 31 | 2 | 0 | C | C | T | 2 | 2 | 0 | 0 | 1 | 0 | 0 | [‘C’] |
N5_271_010G1_scaffold_114 | 24 | 35 | 29 | 0 | 2 | 4 | A | A | G | 2 | 2 | 0 | 0 | 2 | 0 | 0 | [‘A’] |
N5_271_010G1_scaffold_114 | 25 | 38 | 2 | 36 | 0 | 0 | C | C | A | 2 | 2 | 0 | 0 | 1 | 0 | 0 | [‘C’] |
N5_271_010G1_scaffold_114 | 55 | 71 | 66 | 5 | 0 | 0 | A | A | C | 2 | 2 | 0 | 0 | 1 | 0 | 0 | [‘A’] |
N5_271_010G1_scaffold_114 | 57 | 67 | 2 | 0 | 0 | 65 | G | G | A | 2 | 2 | 0 | 0 | 1 | 0 | 0 | [‘G’] |
N5_271_010G1_scaffold_114 | 75 | 95 | 4 | 90 | 0 | 1 | C | C | A | 2 | 2 | 0 | 0 | 1 | 0 | 0 | [‘C’] |
N5_271_010G1_scaffold_114 | 76 | 95 | 0 | 90 | 2 | 3 | C | C | G | 2 | 2 | 0 | 0 | 2 | 0 | 0 | [‘C’] |
N5_271_010G1_scaffold_114 | 79 | 98 | 0 | 3 | 0 | 95 | G | G | C | 2 | 2 | 0 | 0 | 1 | 0 | 0 | [‘G’] |
This table has information about each SNV, summarized across all samples
- scaffold
- The scaffold being analyzed
- position
- The position in the scaffold where the SNV is located (0-based)
- depth
- The total number of reads mapping to this scaffold across samples
- A
- The number of reads with A at this position in this scaffold across samples
- C
- The number of reads with C at this position in this scaffold across samples
- T
- The number of reads with T at this position in this scaffold across samples
- G
- The number of reads with G at this position in this scaffold across samples
- ref_base
- The reference base at this position in this scaffold across samples
- con_base
- The consensus base (most common) at this position in this scaffold across samples
- var_base
- The variant base (second most common) at this position in this scaffold across samples
- sample_detections
- The number of samples in which this position at this scaffold has at least one read mapping to it
- sample_5x_detections
- The number of samples in which this position at this scaffold has at least 5 reads mapping to it
- DivergentSite_count
- The number of samples with a divergent sites detected at this position
- SNS_count
- The number of samples with a SNSs detected at this position
- SNV_count
- The number of samples with a SNVs detected at this position
- con_SNV_count
- The number of samples with consenus SNPs (conANI) detected at this position
- pop_SNV_count
- The number of samples with population SNPs (popANI) detected at this position
- sample_con_bases
- The number of different consensus bases at this position across all analyzed samples
sample | scaffold | position | A | C | T | G |
0 | 0 | 3 | 2 | 0 | 5 | 1 |
0 | 0 | 20 | 0 | 21 | 2 | 0 |
0 | 0 | 24 | 21 | 0 | 0 | 4 |
0 | 0 | 25 | 2 | 26 | 0 | 0 |
0 | 0 | 55 | 38 | 5 | 0 | 0 |
0 | 0 | 57 | 2 | 0 | 0 | 38 |
0 | 0 | 75 | 3 | 55 | 0 | 0 |
0 | 0 | 76 | 0 | 56 | 0 | 3 |
0 | 0 | 79 | 0 | 1 | 0 | 57 |
This table has information about each SNV in each sample. Because the table can be huge, names of scaffolds and samples are listed as “keys” to be translated using the also-provided pooled_SNV_data_keys.tsv table
- sample
- The key for the sample being analyzed (as detailed in the pooled_SNV_data_keys.tsv table below)
- scaffold
- The key for the scaffold being analyzed (as detailed in the pooled_SNV_data_keys.tsv table below)
- position
- The position in the scaffold where the SNV is located (0-based)
- A,C,T,G
- The number of reads with this base in this sample in this scaffold at this position
key | sample | scaffold |
0 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | N5_271_010G1_scaffold_114 |
1 | N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.sorted.bam | N5_271_010G1_scaffold_63 |
2 | N5_271_010G1_scaffold_89 | |
3 | N5_271_010G1_scaffold_33 | |
4 | N5_271_010G1_scaffold_95 | |
5 | N5_271_010G1_scaffold_11 | |
6 | N5_271_010G1_scaffold_74 | |
7 | N5_271_010G1_scaffold_71 | |
8 | N5_271_010G1_scaffold_96 |
This table has “keys” needed to translate the pooled_SNV_data.tsv table
- key
- The key in question. This is the number presented in the sample or scaffold column in the pooled_SNV_data.tsv table above
- sample
- The name of the sample with this key. For example: for the row with a 0 as the key, sample 0 in pooled_SNV_data.tsv refers to the sample listed here
- scaffold
- The name of the scaffold with this key. For example: for the row with a 0 as the key, scaffold 0 in pooled_SNV_data.tsv refers to the sample listed here
inStrain plot¶
This is what the results of inStrain plot look like.
1) Coverage and breadth vs. read mismatches¶
Breadth of coverage (blue line), coverage depth (red line), and expected breadth of coverage given the depth of coverage (dotted blue line) versus the minimum ANI of mapped reads. Coverage depth continues to increase while breadth of plateaus, suggesting that all regions of the reference genome are not present in the reads being mapped.
2) Genome-wide microdiversity metrics¶
SNV density, coverage, and nucleotide diversity. Spikes in nucleotide diversity and SNV density do not correspond with increased coverage, indicating that the signals are not due to read mis-mapping. Positions with nucleotide diversity and no SNV-density are those where diversity exists but is not high enough to call a SNV
3) Read-level ANI distribution¶
Distribution of read pair ANI levels when mapped to a reference genome; this plot suggests that the reference genome is >1% different than the mapped reads
4) Major allele frequencies¶
Distribution of the major allele frequencies of bi-allelic SNVs (the Site Frequency Spectrum). Alleles with major frequencies below 50% are the result of multiallelic sites. The lack of distinct puncta suggest that more than a few distinct strains are present.
5) Linkage decay¶
Metrics of SNV linkage vs. distance between SNVs; linkage decay (shown in one plot and not the other) is a common signal of recombination.
6) Read filtering plots¶
Bar plots showing how many reads got filtered out during filtering. All percentages are based on the number of paired reads; for an idea of how many reads were filtered out for being non-paired, compare the top bar and the second to top bar.
7) Scaffold inspection plot (large)¶
This is an elongated version of the genome-wide microdiversity metrics that is long enough for you to read scaffold names on the y-axis
8) Linkage with SNP type (GENES REQUIRED)¶
Linkage plot for pairs of non-synonymous SNPs and all pairs of SNPs
10) Compare dendrograms (RUN ON COMPARE; NOT PROFILE)¶
A dendrogram comparing all samples based on popANI and based on shared_bases.
Raw data access and API¶
API for accessing raw data¶
inStrain stores much more data than is shown in the output folder. It is kept in the raw_data
folder, and is mostly stored in compressed formats. This data can be easily accessed using python, as described below.
To access the data, you first make an SNVprofile object of the inStrain output profile, and then you access data from that object. For example, the following code accessed the raw SNP table
import inStrain
import inStain.SNVprofile
IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
raw_snps = IS.get('raw_snp_table')
You can use the example above (IS.get()
) to access any of the raw data described in the following section. There are also another special things that are accessed in other ways, as described in the section “Accessing other data”
Basics of raw_data¶
A typical run of inStrain will yield a folder titled “raw_data”, with lots of individual files in it. The specifics of what files are in there depend on how inStrain was run, and whether or not additional commands were run as well (like profile_genes).
There will always be a file titled “attributes.tsv”. This describes some basic information about each item in the raw data. Here’s an example
name | value | type | description |
location | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test | value | Location of SNVprofile object |
version | 1.3.3 | value | Version of inStrain |
Rdic | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/Rdic.json | dictionary | Read pair -> mismatches |
mapping_info | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/mapping_info.csv.gz | pandas | Report on reads |
fasta_loc | /Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa | value | Location of .fasta file used during profile |
scaffold2length | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold2length.json | dictionary | Dictionary of scaffold 2 length |
object_type | profile | value | Type of SNVprofile (profile or compare) |
bam_loc | /Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam | value | Location of .bam file |
scaffold_list | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold_list.txt | list | 1d list of scaffolds that were profiled |
raw_linkage_table | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/raw_linkage_table.csv.gz | pandas | Raw table of linkage information |
raw_snp_table | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/raw_snp_table.csv.gz | pandas | Contains raw SNP information on a mm level |
cumulative_scaffold_table | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/cumulative_scaffold_table.csv.gz | pandas | Cumulative coverage on mm level. Formerly scaffoldTable.csv |
cumulative_snv_table | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/cumulative_snv_table.csv.gz | pandas | Cumulative SNP on mm level. Formerly snpLocations.pickle |
scaffold_2_mm_2_read_2_snvs | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold_2_mm_2_read_2_snvs.pickle | pickle | crazy nonsense needed for linkage |
genes_coverage | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_coverage.csv.gz | pandas | Coverage of individual genes |
genes_clonality | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_clonality.csv.gz | pandas | Clonality of individual genes |
genes_SNP_count | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_SNP_count.csv.gz | pandas | SNP density and counts of individual genes |
SNP_mutation_types | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/SNP_mutation_types.csv.gz | pandas | The mutation types of SNPs |
covT | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/covT.hd5 | special | Scaffold -> mm -> position based coverage |
clonT | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/clonT.hd5 | special | Scaffold -> mm -> position based clonality |
clonTR | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/clonTR.hd5 | special | Scaffold -> mm -> rarefied position based clonality |
genes_fileloc | /Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa.genes.fna | value | Location of genes file that was used to call genes |
genes_table | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_table.csv.gz | pandas | Location of genes in the associated genes_file |
scaffold2bin | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold2bin.json | dictionary | Dictionary of scaffold 2 bin |
bin2length | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/bin2length.json | dictionary | Dictionary of bin 2 total length |
genome_level_info | /Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genome_level_info.csv.gz | pandas | Table of genome-level information |
This is what the columns correspond to:
- name
- The name of the data. This is the name that you put into
IS.get()
to have inStrain retrieve the data for you. See the section “Accessing raw data” for an example. - value
- This lists the path to where the data is located within the raw_data folder. If the type of data is a value, than this just lists the value
- type
- This describes how the data is stored. Value = the data is whatever is listed under value; list = a python list; numpy = a numpy array; dictionary = a python dictionary; pandas = a pandas dataframe; pickle = a piece of data that’s stored as a python pickle object; special = a piece of data that is stored in a special way that inStrain knows how to de-compress
- description
- A one-sentence description of what’s in the data.
Warning
Many of these pieces of raw data have the column “mm” in them, which means that things are calculated at every possible read mismatch level. This is often not what you want. See the section “Dealing with mm” for more information.
Accessing other data¶
In addition to the raw_data described above, there are a couple of other things that inStrain can make for you. You access these from methods that run on the IS object itself, instead of using the get
method. For example:
import inStrain
import inStain.SNVprofile
IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
coverage_table = IS.get_raw_coverage_table()
The following methods work like that:
- get_nonredundant_scaffold_table()
- Get a scaffold table with just one line per scaffold, not multiple mms
- get_nonredundant_linkage_table()
- Get a linkage table with just one line per scaffold, not multiple mms
- get_nonredundant_snv_table()
- Get a SNP table with just one line per scaffold, not multiple mms
- get_clonality_table()
- Get a raw clonality table, listing the clonality of each position. Pass nonredundant=False to keep multiple mms
Dealing with “mm”¶
Behind the scenes, inStrain actually calculates pretty much all metrics for every read pair mismatch level. That is, only including read pairs with 0 mis-match to the reference sequences, only including read pairs with >= 1 mis-match to the reference sequences, all the way up to the number of mismatches associated with the “PID” parameter.
For most of the output that inStrain makes in the output folder, it removes the “mm” column and just gives the results for the maximum number of mismatches. However, it’s often helpful to explore other mismatches levels, to see how parameters vary with more or less stringent mappings. Much of the data stored in “read_data” is on the mismatch level. Here’s an example of what the looks like (this is the cumulative_scaffold_table):
,scaffold,length,breadth,coverage,coverage_median,coverage_std,bases_w_0_coverage,mean_clonality,median_clonality,unmaskedBreadth,SNPs,breadth_expected,ANI,mm
0,N5_271_010G1_scaffold_102,1144,0.9353146853146853,5.106643356643357,5,2.932067325774674,74,1.0,1.0,0.6145104895104895,0,0.9889923642060382,1.0,0
1,N5_271_010G1_scaffold_102,1144,0.9353146853146853,6.421328671328672,6,4.005996333777764,74,0.9992001028104149,1.0,0.6748251748251748,0,0.9965522492489882,1.0,1
2,N5_271_010G1_scaffold_102,1144,0.9423076923076923,7.3627622377622375,7,4.2747074564903285,66,0.9993874800638958,1.0,0.7928321678321678,0,0.998498542620078,1.0,2
3,N5_271_010G1_scaffold_102,1144,0.9423076923076923,7.859265734265734,8,4.748789115369562,66,0.9992251555869703,1.0,0.7928321678321678,0,0.9990314705263914,1.0,3
4,N5_271_010G1_scaffold_102,1144,0.9423076923076923,8.017482517482517,8,4.952541407151938,66,0.9992251555869703,1.0,0.7928321678321678,0,0.9991577528529144,1.0,4
5,N5_271_010G1_scaffold_102,1144,0.9458041958041958,8.271853146853147,8,4.9911156795536105,62,0.9992512780077317,1.0,0.8024475524475524,0,0.9993271891539499,1.0,7
As you can see, the same scaffold is shown multiple times, and the last column is mm
. At the row with mm = 0, you can see what the stats are when only considering reads that perfectly map to the reference sequence. As the mm goes higher, so do stats like coverage and breadth, as you now allow reads with more mismatches to count in the generation of these stats. In order to convert this files to what is provided in the output folder, the following code is run:
import inStrain
import inStain.SNVprofile
IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
scdb = IS.get('cumulative_scaffold_table')
ScaffDb = scdb.sort_values('mm')\
.drop_duplicates(subset=['scaffold'], keep='last')\
.sort_index().drop(columns=['mm'])
The last line looks complicated, but it’s very simple what is going on. First, you sort the database by mm
, with the lowest mms at the top. Next, for each scaffold, you only keep the row with the lowest mm. That’s done using the drop_duplicates(subset=['scaffold'], keep='last')
command. Finally, you re-sort the DataFrame to the original order, and remove the mm
column. In the above example, this would mean that the only row that would survive would be where mm = 7, because that’s the bottom row for that scaffold.
You can of course subset to any level of mismatch by modifying the above code slightly. For example, to generate this table only using reads with <=5 mismatches, you could use the following code:
import inStrain
import inStain.SNVprofile
IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
scdb = IS.get('cumulative_scaffold_table')
scdb = scdb[scdb['mm'] <= 5]
ScaffDb = scdb.sort_values('mm')\
.drop_duplicates(subset=['scaffold'], keep='last')\
.sort_index().drop(columns=['mm'])
Warning
You usually do not want to subset these DataFrames using something like scdb = scdb[scdb['mm'] == 5]
. That’s because if there are no reads that have 5 mismatches, as in the case above, you’ll end up with an empty DataFrame. By using the drop_duplicates technique described above you avoid this problem, because in the cases where you don’t have 5 mismatches, you just get the next-highest mm level (which is usually what you want)
A note for programmers¶
If you’d like to edit inStrain to add functionality for your data, don’t hesitate to reach out to the authors of this program for help. Additionally, please consider submitting a pull request on GitHub so that others can use your changes as well.
Benchmarks¶
This page contains data from tests performed to evaluate the accuracy inStrain. In most cases similar tests were performed to compare inStrain’s accuracy to other leading tools. Most of these benchmarks are adapted from the inStrain publication where you can find more details on how they were performed.
Strain-level comparisons¶
This section contains a series of benchmarks evaluating the ability of inStrain to perform detailed strain-level comparisons. In all cases inStrain is benchmarked against three leading tools:
MIDAS - an integrated pipeline to estimate bacterial species abundance and strain-level genomic variation. Strain-level comparisons are based on consensus alleles called on whole genomes. The script strain_tracking.py was used for benchmarks.
StrainPhlAn - a tool for strain-level resolution of species across large sample sets, based on consensus single nucleotide polymorphisms (SNPs) called on species marker genes. Based on the MetaPhlAn2 db_v20 database.
dRep - a genome alignment program. dRep does not purport to have accuracy over 99.9% ANI and was just used for comparison purposes.
Benchmark with synthetic data¶
A straightforward in silico test. A randomly selected E. coli genome was downloaded and mutated to various chosen ANI levels using SNP Mutator. The original genome was compared to the mutated genome, and we looked for the difference between the actual ANI and the calculated ANI (the ANI reported by each program).
All four methods performed well on this test, although dRep, inStrain, and MIDAS had lower errors in the ANI calculation than StrainPhlAn overall (0.00001%, 0.002%, 0.006% and 0.03%, respectively; average discrepancy between the true and calculated ANI). This is likely because dRep, inStrain, and MIDAS compare positions from across the entire genome (99.99998%, 99.7%, and 85.8% of the genome, respectively) and StrainPhlAn does not (0.3% of the genome).
Methods for synthetic data benchmark:
For dRep, mutated genomes were compared to the reference genome using dRep on default settings. For inStrain, MIDAS, andStrainPhlAn, Illumina reads were simulated for all genomes at 20x coverage using pIRS.
For inStrain, synthetic reads were mapped back to the reference genome using Bowtie 2, profiled using “inStrain profile” under default settings, and compared using “inStrain compare” under default settings.
For StrainPhlAn, synthetic reads profiled with Metaphlan2, resulting marker genes were aligned using StrainPhlan, and the ANI of resulting nucleotide alignments was calculated using the class “Bio.Phylo.TreeConstruction.DistanceCalculator(‘identity’)” from the BioPython python package.
For MIDAS, synthetic reads were provided to the program directly using the “run_midas.py species” command, and compared using the “run_midas.py snps” command. The ANI of the resulting comparisons was calculated as “[mean(sample1_bases, sample2_bases) - count_either] / mean(sample1_bases, sample2_bases)”.
Benchmark with defined microbial communities¶
This test (schematic above) involved comparing real metagenomes derived from defined bacterial communities. The ZymoBIOMICS Microbial Community Standard, which contains cells from eight bacterial species at defined abundances, was divided into three aliquots and subjected to DNA extraction, library preparation, and metagenomic sequencing. The same community of 8 bacterial species was sequenced 3 times, so each program should report 100% ANI for all species comparisons. Deviations from this ideal either represent errors in sequence alignment, the presence of microdiversity consistent with maintenance of cultures in the laboratory, or inability of programs to handle errors and biases introduced during routine DNA extraction, library preparation, and sequencing with Illumina).
MIDAS, dRep, StrainPhlAn, and inStrain reported average ANI values of 99.97%, 99.98%, 99.990% and 99.999998%, respectively, with inStrain reporting average popANI values of 100% for 23 of the 24 comparisons and 99.99996% for one comparison. The difference in performance likely arises because the Zymo cultures contain non-fixed nucleotide variants that inStrain uses to confirm population overlap but that confuse the consensus sequences reported by dRep, StrainPhlAn, and MIDAS (conANI). We also used this data to establish a threshold for the detection of “same” versus “different” strains. The thresholds for MIDAS, dRep, StrainPhlAn, and inStrain, calculated based on the comparison with the lowest average ANI across all 24 sequence comparisons, are shown in the table below.
Program | Minimum reported ANI | Years divergence |
---|---|---|
MIDAS | 99.92% | 3771 |
dRep | 99.94% | 2528 |
StrainPhlAn | 99.97% | 1307 |
InStrain | 99.99996% | 2.2 |
Years divergence was calculated from “minimum reported ANI” using the previously reported rate of 0.9 SNSs accumulated per genome per year in the gut microbiome of healthy human adults (Zhao 2019) . This benchmark demonstrates that inStrain can be used for detection of identical microbial strains with a stringency that is substantially higher than other tools. Stringent thresholds are useful for strain tracking, as strains that have diverged for hundreds to thousands of years are clearly not linked by a recent transmission event.
We also performed an additional benchmark with this data on inStrain only. InStrain relies on representative genomes to calculate popANI, so we wanted to know whether using non-ideal reference genomes would impact it’s accuracy. By mapping reads to all 4,644 representative genomes in the Unified Human Gastrointestinal Genome (UHGG) collection we identified the 8 relevant representative genomes. These genomes had between 93.9% - 99.6% ANI to the organisms present in the Zymo samples. InStrain comparisons based on these genomes were still highly accurate (average 99.9998% ANI, lowest 99.9995% ANI, limit of detection 32.2 years), highlighting that inStrain can be used with reference genomes from databases when sample-specific reference genomes cannot be assembled.
Methods for defined microbial community benchmark:
Reads from Zymo samples are available under BioProject PRJNA648136
For dRep, reads from each sample were assembled independently using IDBA_UD, binned into genomes based off of alignment to the provided reference genomes using nucmer, and compared using dRep on default settings.
For StrainPhlAn, reads from Zymo samples profiled with Metaphlan2, resulting marker genes were aligned using StrainPhlan, and the ANI of resulting nucleotide alignments was calculated as described in the synthetic benchmark above.
For MIDAS, reads from Zymo samples were provided to MIDAS directly and the ANI of sample comparisons was calculated as described in the synthetic benchmark above.
For inStrain, reads from Zymo samples were aligned to the provided reference genomes using Bowtie 2, profiled using “inStrain profile” under default settings, and compared using “inStrain compare” under default settings. popANI values were used for inStrain.
Eukaryotic genomes were excluded from this analysis, and raw values are available in Supplemental Table S1 of the inStrain manuscript. To evaluate inStrain when using genomes from public databases, all reference genomes from the UHGG collection were downloaded and concatenated into a single .fasta file. Reads from the Zymo sample were mapped against this database and processed with inStrain as described above. The ability of each method to detect genomes was performed using all Zymo reads concatenated together.
Benchmark with true microbial communities¶
This test evaluated the stringency with which each tool can detect shared strains in genuine microbial communities. Tests like this are hard to perform because it is difficult to know the ground truth. We can never really know whether two true genuine communities actually share strains. For this test we leveraged the fact that new-born siblings share far more strains than unrelated newborns.. In this test, we compared the ability of the programs to detect strains shared by twin premature infants (presumably True Positives) vs. their detection of strains shared by unrelated infants (presumably False Positives).
All methods identified significantly more strain sharing among twin pairs than pairs of unrelated infants, as expected, and inStrain remained sensitive at substantially higher ANI thresholds than the other tools. The reduced ability of StrainPhlAn and MIDAS to identify shared strains is likely based on their reliance on consensus-based ANI (conANI) measurements. We know that microbiomes can contain multiple coexisting strains, and when two or more strains of a species are in a sample at similar abundance levels it can lead to pileups of reads from multiple strains and chimeric sequences. The popANI metric is designed to account for this complexity.
It is also worth discussing Supplemental Figure S5 from the inStrain manuscript here.
This figure was generated from genomic comparisons between genomes present in the same infant over time (longitudinal data). In cases where the same genome was detected in multiple time-points over the time-series sampling of an infant, the percentage of comparisons between genomes that exceed various popANI (a) and conANI (b) thresholds is plotted. This figure shows that the use of popANI allows greater stringency than conANI.
Note
Based on the data presented in the above tests, a threshold of 99.999% popANI was chosen as the threshold to define bacterial, bacteriophage, and plasmid strains for the work presented in the inStrain manuscript. This is likely a good threshold for a variety of communities.
Methods for true microbial community benchmark:
Twin-based comparisons were performed on three randomly chosen sets of twins that were sequenced during a previous study (Olm 2019). Reads can be found under Bioproject PRJNA294605
For StrainPhlAn, all reads sequenced from each infant were concatenated and profiled using Metaphlan2, compared using StrainPhlAn, and the ANI of resulting nucleotide alignments was calculated as described for the synthetic benchmark.
For MIDAS, all reads sequenced from each infant were concatenated and profiled with MIDAS, and the ANI of species profiled in multiple infants was calculated as described for the synthetic benchmark.
For dRep, all de-replicated bacterial genomes assembled and binned from each infant (available from (Olm 2019)) were compared in a pairwise manner using dRep under default settings.
For inStain, strain-sharing from these six infants was determined using the methods described below.
ANI values from all compared genomes and the number of genomes shared at a number of ANI thresholds are available for all methods in Supplemental Table S1 of the inStrain publication.
Species-level community profiling¶
This section contains tests evaluating the ability of inStrain and other tools to accurately profile microbial communities. Here inStrain is benchmarked against two other tools:
MIDAS - an integrated pipeline to estimate bacterial species abundance and strain-level genomic variation.
MetaPhlAn 2 - a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn 2 uses unique clade-specific marker genes.
Benchmark with defined microbial communities¶
This test evaluated the ability of programs to identify the microbial species present in metagenomes of defined bacterial communities. For this test we purchased, extracted DNA from, and sequenced a ZymoBIOMICS Microbial Community Standard. The reads used for this test are available here. This community contains 8 defined bacterial species, and we simply evaluated the ability of each program to identify those and only those 8 bacterial species. Results in the table below.
Program | True species detected | False species detected | Accuracy |
---|---|---|---|
MIDAS | 8 | 15 | 35% |
MetaPhlAn 2 | 8 | 11 | 42% |
InStrain | 8 | 0 | 100% |
All programs successfully identified the 8 bacteria present in the metagenome, but MIDAS and StrainPhlAn detected an additional 15 and 11 bacterial species as well. The raw tables produced by each tool are available at the bottom of this section. Looking at these tables, you’ll notice that many of these False positive species detected are related to species that are actually present in the community. For example, MetaPhlAn2 reported the detection of Bacillus cereus thuringiensis (False Positive) as well as the detection of Bacillus subtilis (True Positive). Similarly, MIDAS reported the detection of Escherichia fergusonii (related to True Positive Escherichia coli) and Bacillus anthracis (related to True Positive Bacillus subtilis).
Importantly inStrain detected many of these same False Positives as well. However inStrain also provides a set of other metrics that properly filter out erroneous detections. Taking a look at the information reported by inStrain (at the very bottom of this page) shows that many genomes besides the 8 True Positives were detected. When using the recommended genome breadth cutoff of 50%, only the 8 True Positive genomes remain (see section “Detecting organisms in metagenomic data” in Important concepts for more info). You’ll notice that no such info is reported with MIDAS or MetaPhlAn 2. While relative abundance could conceivably be used to filter out erroneous taxa with these tools, doing so would majorly limit their ability to detect genuine low-abundance taxa.
It’s also worth noting that if one is just interested in measuring community presence / absence, as in this test, any program that accurately reports breadth should give similar results to inStrain when mapped against the UHGG genome set. One such program is coverM, a fast program for calculating genome coverage and breadth that can be run on its own or through inStrain using the command inStrain quick_profile
.
Methods for defined microbial community profiling experiment:
For inStrain, all reference genomes from the UHGG collection were downloaded and concatenated into a single .fasta file, reads from the Zymo sample were mapped against this database, and inStrain profile was run on default settings.
Note
The UHGG genome database used for this section is available for download in the Tutorial section.
For MIDAS, the command run_midas.py species
was used along with the default database. In cases where the same species was detected multiple times as part of multiple genomes, the species was only counted once.
For MetaPhlAn 2, the command metaphlan2.py
was used along with the MetaPhlAn2 db_v20 database.
Eukaryotic genomes were excluded from this analysis.
Raw data for defined microbial community profiling experiment:
MetaPhlAn 2:
species | abundance | Metaphlan2_species |
Lactobacillus fermentum | 23.1133 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_fermentum |
Escherichia coli | 20.0587 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli |
Salmonella enterica | 18.44954 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Salmonella|s__Salmonella_enterica |
Pseudomonas aeruginosa | 14.42109 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Pseudomonadaceae|g__Pseudomonas|s__Pseudomonas_aeruginosa |
Enterococcus faecalis | 12.21137 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus_faecalis |
Staphylococcus aureus | 6.36267 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Staphylococcaceae|g__Staphylococcus|s__Staphylococcus_aureus |
Bacillus subtilis | 2.44228 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_subtilis |
Listeria monocytogenes | 1.8644 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Listeriaceae|g__Listeria|s__Listeria_monocytogenes |
Salmonella unclassified | 0.67363 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Salmonella|s__Salmonella_unclassified |
Saccharomyces cerevisiae | 0.20426 | k__Eukaryota|p__Ascomycota|c__Saccharomycetes|o__Saccharomycetales|f__Saccharomycetaceae|g__Saccharomyces|s__Saccharomyces_cerevisiae |
Cryptococcus neoformans | 0.05417 | k__Eukaryota|p__Basidiomycota|c__Tremellomycetes|o__Tremellales|f__Tremellaceae|g__Filobasidiella|s__Cryptococcus_neoformans |
Listeria unclassified | 0.02341 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Listeriaceae|g__Listeria|s__Listeria_unclassified |
Klebsiella oxytoca | 0.0165 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Klebsiella|s__Klebsiella_oxytoca |
Naumovozyma unclassified | 0.01337 | k__Eukaryota|p__Ascomycota|c__Saccharomycetes|o__Saccharomycetales|f__Saccharomycetaceae|g__Naumovozyma|s__Naumovozyma_unclassified |
Klebsiella unclassified | 0.01307 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Klebsiella|s__Klebsiella_unclassified |
Bacillus cereus thuringiensis | 0.00809 | k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_cereus_thuringiensis |
Clostridium perfringens | 0.00554 | k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_perfringens |
Eremothecium unclassified | 0.00319 | k__Eukaryota|p__Ascomycota|c__Saccharomycetes|o__Saccharomycetales|f__Saccharomycetaceae|g__Eremothecium|s__Eremothecium_unclassified |
Veillonella parvula | 0.0015 | k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales|f__Veillonellaceae|g__Veillonella|s__Veillonella_parvula |
Clostridium butyricum | 0.00054 | k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_butyricum |
Enterobacter cloacae | 0.00051 | k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Enterobacter|s__Enterobacter_cloacae |
MIDAS:
species_id | count_reads | coverage | relative_abundance | species |
Lactobacillus_fermentum_54035 | 22305 | 322.661072 | 0.202032 | Lactobacillus fermentum |
Salmonella_enterica_58156 | 18045 | 296.117276 | 0.185412 | Salmonella enterica |
Escherichia_coli_58110 | 19262 | 286.702733 | 0.179517 | Escherichia coli |
Pseudomonas_aeruginosa_57148 | 14214 | 214.266462 | 0.134162 | Pseudomonas aeruginosa |
Enterococcus_faecalis_56297 | 12382 | 183.37939 | 0.114822 | Enterococcus faecalis |
Staphylococcus_aureus_56630 | 6146 | 89.116402 | 0.0558 | Staphylococcus aureus |
Bacillus_subtilis_57806 | 3029 | 44.275375 | 0.027723 | Bacillus subtilis |
Salmonella_enterica_58266 | 3027 | 41.774295 | 0.026157 | Salmonella enterica |
Listeria_monocytogenes_53478 | 2250 | 33.367947 | 0.020893 | Listeria monocytogenes |
Escherichia_fergusonii_56914 | 2361 | 33.034998 | 0.020685 | Escherichia fergusonii |
Pseudomonas_aeruginosa_55861 | 927 | 12.402473 | 0.007766 | Pseudomonas aeruginosa |
Salmonella_enterica_53987 | 791 | 10.982231 | 0.006876 | Salmonella enterica |
Escherichia_coli_57907 | 713 | 9.860496 | 0.006174 | Escherichia coli |
Escherichia_albertii_56276 | 457 | 6.543769 | 0.004097 | Escherichia albertii |
Citrobacter_youngae_61659 | 455 | 6.248948 | 0.003913 | Citrobacter youngae |
Salmonella_bongori_55351 | 314 | 4.187424 | 0.002622 | Salmonella bongori |
Staphylococcus_aureus_37016 | 62 | 0.907418 | 0.000568 | Staphylococcus aureus |
Klebsiella_oxytoca_54123 | 29 | 0.418764 | 0.000262 | Klebsiella oxytoca |
Bacillus_sp_58480 | 17 | 0.233451 | 0.000146 | Bacillus sp |
Clostridium_perfringens_56840 | 12 | 0.182686 | 0.000114 | Clostridium perfringens |
Listeria_monocytogenes_56337 | 11 | 0.162597 | 0.000102 | Listeria monocytogenes |
Bacillus_subtilis_55718 | 9 | 0.127828 | 0.00008 | Bacillus subtilis |
Bacillus_anthracis_57688 | 2 | 0.031576 | 0.00002 | Bacillus anthracis |
Bacillus_cereus_58113 | 1 | 0.014684 | 0.000009 | Bacillus cereus |
Enterococcus_faecium_56947 | 1 | 0.014791 | 0.000009 | Enterococcus faecium |
Klebsiella_pneumoniae_54788 | 1 | 0.014852 | 0.000009 | Klebsiella pneumoniae |
Veillonella_parvula_57794 | 1 | 0.014925 | 0.000009 | Veillonella parvula |
Haemophilus_haemolyticus_58350 | 1 | 0.01351 | 0.000008 | Haemophilus haemolyticus |
Veillonella_parvula_58184 | 1 | 0.012646 | 0.000008 | Veillonella parvula |
Enterobacter_sp_59441 | 1 | 0.003478 | 0.000002 | Enterobacter sp |
Pseudomonas_sp_59807 | 1 | 0.003203 | 0.000002 | Pseudomonas sp |
InStrain:
genome | species | breadth | relative_abundance | coverage | nucl_diversity | length | true_scaffolds | detected_scaffolds | coverage_median | coverage_std | coverage_SEM | breadth_minCov | breadth_expected | nucl_diversity_rarefied | conANI_reference | popANI_reference | iRep | iRep_GC_corrected | linked_SNV_count | SNV_distance_mean | r2_mean | d_prime_mean | consensus_divergent_sites | population_divergent_sites | SNS_count | SNV_count | filtered_read_pair_count | reads_unfiltered_pairs | reads_mean_PID | reads_unfiltered_reads | divergent_site_count | Genome | lineage | genus |
GUT_GENOME142031.fna.gz | Salmonella enterica | 0.890900711 | 0.192920699 | 418.6273152 | 0.001575425 | 4955431 | 2 | 1 | 470 | 167.6327653 | 0.075307073 | 0.889570251 | 1 | 0.0012784 | 0.988936764 | 0.989338515 | FALSE | 29019 | 66.11144423 | 0.579492141 | 0.962855588 | 48769 | 46998 | 46575 | 5776 | 7448284 | 7508196 | 0.988009952 | 15332041 | 52351 | GUT_GENOME142031 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica | Salmonella | |
GUT_GENOME143383.fna.gz | Pseudomonas aeruginosa | 0.894646033 | 0.176078828 | 280.4839772 | 0.001768407 | 6750396 | 80 | 74 | 312 | 110.1119447 | 0.042431183 | 0.892788808 | 1 | 0.001509706 | 0.99020306 | 0.990485637 | FALSE | 38085 | 95.96318761 | 0.622403987 | 0.959864075 | 59043 | 57340 | 56922 | 6330 | 6780948 | 6815597 | 0.985681579 | 13897895 | 63252 | GUT_GENOME143383 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__Pseudomonas aeruginosa | Pseudomonas | |
GUT_GENOME144544.fna.gz | Escherichia coli | 0.777878058 | 0.138653747 | 279.2311026 | 0.001677588 | 5339468 | 2 | 2 | 358 | 191.117164 | 0.082711711 | 0.772478831 | 1 | 0.001249049 | 0.976197599 | 0.976755468 | FALSE | 38443 | 87.25424655 | 0.594190893 | 0.979230955 | 98176 | 95875 | 95304 | 7777 | 5341780 | 5396883 | 0.949021155 | 11455710 | 103081 | GUT_GENOME144544 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli_D | Escherichia | |
GUT_GENOME000862.fna.gz | Lactobacillus fermentum | 0.862275194 | 0.096747331 | 528.5687445 | 0.002324426 | 1968193 | 80 | 71 | 512 | 531.1341113 | 0.380139439 | 0.861009566 | 1 | 0.001853701 | 0.992704615 | 0.99333307 | FALSE | 24523 | 43.65012437 | 0.472475022 | 0.941868743 | 12363 | 11298 | 10943 | 4409 | 3897780 | 3915245 | 0.987724733 | 8141064 | 15352 | GUT_GENOME000862 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus_H;s__Lactobacillus_H fermentum | Lactobacillus_H | |
GUT_GENOME103721.fna.gz | Enterococcus faecalis | 0.890365837 | 0.064796999 | 247.8988054 | 0.001359966 | 2810675 | 1 | 1 | 263 | 106.7589636 | 0.063681687 | 0.889176088 | 1 | 0.001009117 | 0.992572379 | 0.992928895 | FALSE | 16206 | 70.9866099 | 0.521939243 | 0.942993799 | 18563 | 17672 | 17443 | 3088 | 2521061 | 2542269 | 0.991283432 | 5173192 | 20531 | GUT_GENOME103721 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus;s__Enterococcus faecalis | Enterococcus | |
GUT_GENOME141183.fna.gz | Staphylococcus aureus | 0.941567947 | 0.03353288 | 131.1378903 | 0.001375635 | 2749621 | 2 | 2 | 142 | 41.45182417 | 0.024999936 | 0.93883157 | 1 | 0.000914462 | 0.99279082 | 0.993222751 | FALSE | 22344 | 72.09188149 | 0.540395915 | 0.915284188 | 18610 | 17495 | 17253 | 4097 | 1305000 | 1316045 | 0.959624337 | 2691675 | 21350 | GUT_GENOME141183 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus | Staphylococcus | |
GUT_GENOME145983.fna.gz | Escherichia fergusonii | 0.336139906 | 0.013061031 | 30.24324694 | 0.001418115 | 4643861 | 2 | 1 | 0 | 73.62873041 | 0.034168543 | 0.286148746 | 1 | 0.00110964 | 0.96878243 | 0.969196326 | FALSE | 5682 | 86.23970433 | 0.70271658 | 0.99040633 | 41483 | 40933 | 40787 | 1849 | 507148 | 517498 | 0.971752174 | 1312986 | 42636 | GUT_GENOME145983 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia fergusonii | Escherichia | |
GUT_GENOME141005.fna.gz | Listeria monocytogenes | 0.924613578 | 0.012249816 | 43.6464116 | 0.000913049 | 3017944 | 1 | 1 | 47 | 15.93616519 | 0.009173661 | 0.920719868 | 1 | 0.000838075 | 0.995876101 | 0.995969311 | FALSE | 4390 | 69.92277904 | 0.693771377 | 0.988098539 | 11459 | 11200 | 11171 | 1341 | 475888 | 477867 | 0.994961092 | 980123 | 12512 | GUT_GENOME141005 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria;s__Listeria monocytogenes_B | Listeria | |
GUT_GENOME000031.fna.gz | Bacillus subtilis | 0.796819131 | 0.011738601 | 31.12210212 | 0.00098138 | 4055810 | 14 | 13 | 28 | 32.20366662 | 0.015996189 | 0.729697397 | 1 | 0.000859228 | 0.939115003 | 0.93935457 | FALSE | 1163 | 21.13241617 | 0.702766252 | 0.973300401 | 180190 | 179481 | 179281 | 2155 | 454156 | 593524 | 0.940236231 | 1292915 | 181436 | GUT_GENOME000031 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus subtilis | Bacillus | |
GUT_GENOME140826.fna.gz | Escherichia sp000208585 | 0.278930014 | 0.007440286 | 17.72017163 | 0.001540747 | 4514939 | 29 | 17 | 0 | 56.92691685 | 0.0268084 | 0.206155831 | 0.99999984 | 0.001388545 | 0.967896852 | 0.968409325 | FALSE | 5947 | 105.5668404 | 0.766140483 | 0.983050761 | 29881 | 29404 | 29302 | 1691 | 290320 | 305163 | 0.960887088 | 820252 | 30993 | GUT_GENOME140826 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia sp000208585 | Escherichia | |
GUT_GENOME145378.fna.gz | Escherichia albertii | 0.122179339 | 0.003167813 | 6.860465646 | 0.00235464 | 4965193 | 164 | 83 | 0 | 39.5967056 | 0.017829135 | 0.081533789 | 0.997660437 | 0.00218204 | 0.961660545 | 0.962826463 | FALSE | 12166 | 157.8004274 | 0.794119571 | 0.973439113 | 15521 | 15049 | 14939 | 1495 | 123735 | 140151 | 0.953424167 | 407112 | 16434 | GUT_GENOME145378 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia albertii | Escherichia | |
GUT_GENOME143726.fna.gz | Salmonella bongori | 0.070875018 | 0.003160648 | 7.43337583 | 0.001600002 | 4572147 | 84 | 48 | 0 | 96.31442795 | 0.045126398 | 0.038681827 | 0.998589302 | 0.001128128 | 0.972543099 | 0.973068942 | FALSE | 552 | 38.54710145 | 0.364521691 | 0.826322255 | 4856 | 4763 | 4731 | 305 | 122896 | 130428 | 0.968771686 | 341983 | 5036 | GUT_GENOME143726 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella bongori | Salmonella | |
GUT_GENOME140808.fna.gz | Escherichia marmotae | 0.120972135 | 0.002155978 | 5.167057447 | 0.001726413 | 4486744 | 47 | 38 | 0 | 33.4691403 | 0.015817374 | 0.074067966 | 0.989564186 | 0.001588357 | 0.965091296 | 0.965708164 | FALSE | 2054 | 74.47565725 | 0.741056343 | 0.989146608 | 11601 | 11396 | 11330 | 792 | 84663 | 96297 | 0.955863071 | 277452 | 12122 | GUT_GENOME140808 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia marmotae | Escherichia | |
GUT_GENOME142492.fna.gz | Listeria monocytogenes | 0.126213955 | 0.001176886 | 4.302069537 | 0.001059694 | 2941624 | 14 | 11 | 0 | 34.20030999 | 0.01995002 | 0.059155079 | 0.977600741 | 0.000372324 | 0.982937958 | 0.983231042 | FALSE | 206 | 9.058252427 | 0.844509747 | 1 | 2969 | 2918 | 2910 | 151 | 45956 | 47236 | 0.983439086 | 109110 | 3061 | GUT_GENOME142492 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria;s__Listeria monocytogenes | Listeria | |
GUT_GENOME146010.fna.gz | Metakosakonia intermedia | 0.011269041 | 0.0007257 | 1.265470484 | 0.001803324 | 6166452 | 5 | 4 | 0 | 20.17351043 | 0.008124545 | 0.0086103 | 0.672874189 | 0.001151362 | 0.987399943 | 0.988096808 | FALSE | 68 | 6.632352941 | 0.519849079 | 0.971330957 | 669 | 632 | 623 | 100 | 28027 | 28249 | 0.990376613 | 65337 | 723 | GUT_GENOME146010 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Metakosakonia;s__Metakosakonia intermedia | Metakosakonia | |
GUT_GENOME143527.fna.gz | Cronobacter malonaticus | 0.0056892 | 0.000591325 | 1.422193205 | 0.001895836 | 4470927 | 309 | 24 | 0 | 24.52026825 | 0.011677475 | 0.004481621 | 0.715151153 | 0.000863784 | 0.991316065 | 0.99186505 | FALSE | 43 | 3.348837209 | 0.323536849 | 0.895980567 | 174 | 163 | 157 | 52 | 22952 | 23584 | 0.95668831 | 51969 | 209 | GUT_GENOME143527 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Cronobacter;s__Cronobacter malonaticus | Cronobacter | |
GUT_GENOME147796.fna.gz | Staphylococcus argenteus | 0.072414548 | 0.000549275 | 2.122002981 | 0.002476746 | 2783391 | 89 | 66 | 0 | 13.35139068 | 0.008028467 | 0.046540353 | 0.846449939 | 0.001206762 | 0.96824147 | 0.969777675 | FALSE | 1626 | 68.76383764 | 0.880724477 | 0.994276953 | 4114 | 3915 | 3871 | 560 | 21703 | 24408 | 0.955482052 | 67154 | 4431 | GUT_GENOME147796 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus argenteus | Staphylococcus | |
GUT_GENOME095995.fna.gz | Citrobacter portucalensis_A | 0.009471768 | 0.000474738 | 0.990436461 | 0.001709209 | 5154159 | 10 | 8 | 0 | 19.06540862 | 0.008399463 | 0.005825781 | 0.5829526 | 0.001430662 | 0.953575116 | 0.954008059 | FALSE | 36 | 67.25 | 0.565195465 | 1 | 1394 | 1381 | 1381 | 49 | 18414 | 20414 | 0.952624755 | 48144 | 1430 | GUT_GENOME095995 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Citrobacter;s__Citrobacter portucalensis_A | Citrobacter | |
GUT_GENOME000024.fna.gz | Lactobacillus_B murinus | 0.001870138 | 0.000441384 | 2.136753757 | 0.014148006 | 2221226 | 144 | 7 | 0 | 107.2864275 | 0.072457344 | 0.001074182 | 0.84843695 | 0.009892552 | 0.959346186 | 0.964375524 | FALSE | 542 | 135.7435424 | 0.169030486 | 0.84185153 | 97 | 85 | 84 | 72 | 17418 | 17595 | 0.941170306 | 40278 | 156 | GUT_GENOME000024 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus_B;s__Lactobacillus_B murinus | Lactobacillus_B | |
GUT_GENOME078306.fna.gz | Lactobacillus_H oris | 0.001220271 | 0.000399444 | 2.141072453 | 0.012295828 | 2006111 | 94 | 4 | 0 | 77.23755615 | 0.054789295 | 0.000866353 | 0.849013821 | 0.010541041 | 0.995972382 | 0.998849252 | FALSE | 398 | 16.92713568 | 0.302731013 | 0.826865848 | 7 | 2 | 2 | 40 | 16099 | 16158 | 0.936115413 | 38348 | 42 | GUT_GENOME078306 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus_H;s__Lactobacillus_H oris | Lactobacillus_H | |
GUT_GENOME225144.fna.gz | 0.009928145 | 0.000347535 | 1.549658971 | 0.002163833 | 2411528 | 1340 | 22 | 0 | 23.45474186 | 0.016020135 | 0.009063133 | 0.745473131 | 0.000727385 | 0.979730966 | 0.980646047 | FALSE | 66 | 3.333333333 | 0.700603722 | 0.963204482 | 443 | 423 | 414 | 70 | 13562 | 13757 | 0.97127631 | 33858 | 484 | GUT_GENOME225144 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Faecalicatena;s__ | Faecalicatena | ||
GUT_GENOME038289.fna.gz | 0.00099285 | 0.000296215 | 1.742384732 | 0.010683495 | 1828071 | 235 | 2 | 0 | 74.35939765 | 0.055717981 | 0.000973157 | 0.785302607 | 0.008367741 | 0.983136594 | 0.988757729 | FALSE | 240 | 68.54166667 | 0.330322272 | 0.915762753 | 30 | 20 | 20 | 36 | 11886 | 11920 | 0.982379242 | 28904 | 56 | GUT_GENOME038289 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Pasteurellaceae;g__Haemophilus_D;s__ | Haemophilus_D | ||
GUT_GENOME143493.fna.gz | Lactobacillus_G kefiri | 0.00864699 | 0.000280612 | 1.173767593 | 0.006727598 | 2570721 | 10 | 3 | 0 | 19.31409016 | 0.0120508 | 0.007831655 | 0.645283636 | 0.005352071 | 0.992251527 | 0.995380718 | FALSE | 1002 | 65.8992016 | 0.448661201 | 0.961561942 | 156 | 93 | 82 | 276 | 11319 | 11437 | 0.990954152 | 28111 | 358 | GUT_GENOME143493 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus_G;s__Lactobacillus_G kefiri | Lactobacillus_G | |
GUT_GENOME212929.fna.gz | 0.004865114 | 0.000265953 | 1.504297544 | 0.006577654 | 1901086 | 52 | 12 | 0 | 31.30317094 | 0.022765581 | 0.004161306 | 0.735071349 | 0.003925862 | 0.982176716 | 0.986348123 | FALSE | 730 | 96.23287671 | 0.844539753 | 0.995313875 | 141 | 108 | 99 | 114 | 10256 | 10633 | 0.958751213 | 27570 | 213 | GUT_GENOME212929 | d__Bacteria;p__Firmicutes_C;c__Negativicutes;o__Veillonellales;f__Veillonellaceae;g__F0422;s__ | F0422 | ||
GUT_GENOME141398.fna.gz | Lactobacillus crispatus | 0.001957446 | 0.000245098 | 1.440641732 | 0.016147943 | 1829425 | 63 | 3 | 0 | 38.97971904 | 0.028918934 | 0.001950886 | 0.719753764 | 0.01075873 | 0.985149902 | 0.995236761 | FALSE | 919 | 85.0968444 | 0.532244527 | 0.953133466 | 53 | 17 | 15 | 120 | 9893 | 10121 | 0.984396172 | 25843 | 135 | GUT_GENOME141398 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus | Lactobacillus | |
GUT_GENOME229203.fna.gz | 0.001076576 | 0.000237003 | 1.216154076 | 0.009783963 | 2095533 | 369 | 2 | 0 | 50.30343773 | 0.035378211 | 0.0010656 | 0.658314327 | 0.005923767 | 0.991939095 | 0.997313032 | FALSE | 176 | 7.352272727 | 0.255200979 | 0.773989489 | 18 | 6 | 4 | 42 | 9197 | 9229 | 0.996035052 | 24260 | 46 | GUT_GENOME229203 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__ | Prevotella | ||
GUT_GENOME140701.fna.gz | Lactobacillus_H mucosae | 0.006795042 | 0.00022553 | 1.023402847 | 0.003734144 | 2369669 | 12 | 9 | 0 | 18.31360326 | 0.011902826 | 0.005363618 | 0.594917575 | 0.001824337 | 0.959480724 | 0.960582219 | FALSE | 154 | 89.28571429 | 0.726658579 | 0.984646084 | 515 | 501 | 493 | 83 | 9190 | 10224 | 0.968193502 | 25547 | 576 | GUT_GENOME140701 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus_H;s__Lactobacillus_H mucosae | Lactobacillus_H | |
GUT_GENOME001416.fna.gz | Vagococcus teuberi | 0.001203634 | 0.000195751 | 0.932135043 | 0.001248896 | 2258161 | 50 | 4 | 0 | 37.87233052 | 0.02525855 | 0.001052626 | 0.560920699 | 0.001109488 | 0.991165334 | 0.991586033 | FALSE | 4 | 6.25 | 1 | 1 | 21 | 20 | 19 | 5 | 7731 | 7736 | 0.976960693 | 21662 | 24 | GUT_GENOME001416 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Vagococcaceae;g__Vagococcus;s__Vagococcus teuberi | Vagococcus |
Acknowledgements¶
People¶
InStrain was developed by Matt Olm and Alex Crits-Christoph in the Banfield Lab at the University of California, Berkeley.
Special thanks to all those who have provided feedback on GitHub and otherwise, especially early adopters Keith Bouma-Gregson, Ben Siranosian, Yue “Clare” Lou, Naïma Madi, and Antônio Pedro Camargo.
Many of the ideas in Important concepts were honed over many years during countless discussions with members of the Banfield Lab, the Sonnenburg Lab, and others. Special thanks to Christopher Brown, Keith Bouma-Gregson, Yue “Clare” Lou, Spencer Diamond, Alex Thomas, Patrick West, Alex Jaffe, Bryan Merrill, Matt Carter, and Dylan Dahan.
Software¶
InStrain relies on several previously published programs and python modules to run - see here and here for a full list of dependencies. Of special importance are samtools (the basis for parsing .bam files) and coverM (the heart of quick_profile
).
While not a direct dependency, the open-source program anvi’o was used as significant inspiration for several implementation details, especially related to multiprocessing efficiency and memory management.
Citation¶
The manuscript describing inStrain is available in Nature Biotechnology and on bioRxiv and can be cited as follows:
Olm, M.R., Crits-Christoph, A., Bouma-Gregson, K., Firek, B.A., Morowitz, M.J., Banfield, J.F., 2021. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nature Biotechnology. https://doi.org/10.1038/s41587-020-00797-0