Tutorial¶

Quick Start¶

The functionality of inStrain is broken up into several core modules. For more details on these modules, see module_descriptions.:

$ inStrain -h

              ...::: inStrain v1.3.0 :::...

Matt Olm and Alex Crits-Christoph. MIT License. Banfield Lab, UC Berkeley. 2019

Choose one of the operations below for more detailed help. See https://instrain.readthedocs.io for documentation.
Example: inStrain profile -h

Workflows:
  profile         -> Create an inStrain profile (microdiversity analysis) from a mapping.
  compare         -> Compare multiple inStrain profiles (popANI, coverage_overlap, etc.)

Single operations:
  profile_genes   -> Calculate gene-level metrics on an inStrain profile
  genome_wide     -> Calculate genome-level metrics on an inStrain profile
  quick_profile   -> Quickly calculate coverage and breadth of a mapping using coverM
  filter_reads    -> Commands related to filtering reads from .bam files
  plot            -> Make figures from the results of "profile" or "compare"
  other           -> Other miscellaneous operations

Below is a list of brief descriptions of each of the modules. For more information see module_descriptions, for help understanding the output, see Example output and explanations, and to change the parameters see choosing_parameters

profile¶

inStrain profile is the main method of the program. It takes a .fasta file and a .bam file (consisting of reads mapping to the .fasta file) and runs a series of steps to characterize the microdiversity, SNPs, linkage, etc. Details on how to generate the mapping, how the profiling is done, explanations of the output, how to choose the parameters can be found at preparing_input and module_descriptions

To run inStrain on a mapping run the following command:

$ inStrain profile .bam_file .fasta_file -o IS_output_name

compare¶

inStrain is able to compare multiple read mappings to the same .fasta file. Each mapping file must first be make into an inStrain profile using the above command. The coverage overlap and popANI between all pairs is calculated:

$ inStrain compare -i IS_output_1 IS_output_2 IS_output_3

profile_genes¶

Once you’ve run inStrain profile, you can also calculate gene-wise microdiversity, coverage, and SNP functions using this command. It relies on having gene calls in the .fna format from the program prodigal:

$ inStrain profile_genes -i IS_output -g called_genes.fna

genome_wide¶

This module is able to translate scaffold-level results to genome-level results. If the .fasta file you mapped to consists of a single genome, running this module on its own will average the results among all scaffolds. If the .fasta file you mapped to consists of several genomes, by providing a scaffold to bin file or a list of the individual .fasta files making up the combined .fasta file, you can get summary results for each individual genome. Running this module is also required before generating plots.:

$ inStrain genome_wide -i IS_output -s genome1.fasta genome2.fasta genome3.fasta

quick_profile¶

This auxiliary module is merely a quick way to calculate the coverage and breadth using the blazingly fast program coverM. This can be useful for quickly figuring out which scaffolds have any coverage, and then generating a list of these scaffolds to profile with inStrain profile, making it run faster:

$ inStrain quick_profile -b .bam_file -f .fasta_file -s scaffold_to_bin_file -o output_name

filter_reads¶

This auxiliary module lets you do various tasks to filter and/or characterize a mapping file, and then generate a new mapping file with those filters applied:

$ inStrain filter_reads .bam_file .fasta_file -g new_sam_file_location

plot¶

This method makes a number of plots from an inStrain object. It is required that you run genome_wide first before running this module:

$ inStrain plot -i IS_output

other¶

This module lets you do random small things, like convert IS_profile objects that are in an old format to the newest format.

Running inStrain with custom genomes¶

The following tutorial goes through an example run of inStrain using your own set of genomes. You can follow along with your own data, or use a small set of reads that are included in the inStrain install for testing. They can be found in the folder test/test_data/ of your install folder, or can be downloaded from the inStrain source code at this link on GitHub. The only files that you’ll need for this tutorial are forward and reverse metagenomic reads (N5_271_010G1.R1.fastq.gz and N5_271_010G1.R2.fastq.gz) and a .fasta file to map to (N5_271_010G1_scaffold_min1000.fa). In case you’re curious, these metagenomic reads come from a premature infant fecal sample.

Preparing .bam and .fasta files¶

After downloading the genome file that you would like to profile (.fasta file) and at least one set of paired reads, the first thing to do is to map the reads to the .fasta file in order to generate a .bam file.

When this mapping is performed it is important that you map to all genomes simultaneously, so the first thing to do is to combine all of the genomes that you’d like to map into a single .fasta file. In our case our .fasta file already has all of the genomes that we’d like to profile within it, but if you did want to profile a number of different genomes, you could combine them using a command like this

$  cat raw_data/S2_002_005G1_phage_Clostridioides_difficile.fasta raw_data/S2_018_020G1_bacteria_Clostridioides_difficile.fasta > allGenomes_v1.fasta

Next we must map our reads to this .fasta file to create .bam files. In this tutorial we will use the mapping program Bowtie 2

$ mkdir bt2

$ bowtie2-build ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa bt2/N5_271_010G1_scaffold_min1000.fa

$ bowtie2 -p 6 -x bt2/N5_271_010G1_scaffold_min1000.fa -1 ~/Programs/inStrain/test/test_data/N5_271_010G1.R1.fastq.gz -2 ~/Programs/inStrain/test/test_data/N5_271_010G1.R2.fastq.gz > N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam

At this point we have generated a .sam file, the precursor to .bam files. Lets make sure it’s there and not empty

$ ls -lht

total 34944
-rw-r--r--  1 mattolm  staff    16M Jan 23 11:56 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam
drwxr-xr-x  8 mattolm  staff   256B Jan 23 11:54 bt2/

Perfect. At this point we could convert the .sam file to a sorted and indexed .bam file, but since inStrain can do that for us automatically we won’t bother.

Preparing genes file¶

If we want inStrain to do gene-level profiling we need to give it a list of genes to profile. Note - this is an optional step that is not required for inStrain to work in general, but without this you will not get gene-level profiles

We will profile our genes using the program prodigal, which can be run using the following example command

$ prodigal -i ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa -d N5_271_010G1_scaffold_min1000.fa.genes.fna

Preparing for genome-level characterization¶

In the step above (“Preparing .bam and .fasta files”), we combined all of our genomes into a single .fasta file for mapping. However we likely want to profile the microdiversity of the individual genomes in that .fasta file. In order to do that we need to tell inStrain which scaffolds belong to which genomes.

There are two ways of providing this information. One is to give inStrain a list of the .fasta files that went into making the concatenated .fasta file. The other is to provide inStrain with a “scaffold to bin” file, which lists the genome assignment of each scaffold in a tab-delimited file. In this case we’re going to use the scaffold to bin file provided by inStrain (called “N5_271_010G1.maxbin2.stb”). Here’s what it looks like

$ head ~/Programs/inStrain/test/test_data/N5_271_010G1.maxbin2.stb
N5_271_010G1_scaffold_0        maxbin2.maxbin.001.fasta
N5_271_010G1_scaffold_1        maxbin2.maxbin.001.fasta
N5_271_010G1_scaffold_2        maxbin2.maxbin.001.fasta
N5_271_010G1_scaffold_3        maxbin2.maxbin.001.fasta
N5_271_010G1_scaffold_4        maxbin2.maxbin.001.fasta

Running inStrain profile¶

Now that we’ve gotten everything set up, it’s time to run inStrain. To see all of the options, run

$ inStrain -h

A long list of arguments and options will show up. For more details on what these do, see Program documentation. The only arguments that are absolutely required, however, are a .sam or .bam mapping file, and the .fasta file that the mapping file is mapped to.

Note

In this case we’re going to have inStrain profile the mapping, call genes, make the results genome wide, and plot the results all in one command. It is possible to do these all as separate steps, however, using the subcommands “inStrain profile”, “inStrain profile_genes”, “inStrain genome_wide”, and “inStrain plot”. See Program documentation for more information.

Using all of the files we generated above, here is going to be our inStrain command

$ inStrain profile N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa -o N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS -p 6 -g N5_271_010G1_scaffold_min1000.fa.genes.fna -s ~/Programs/inStrain/test/test_data/N5_271_010G1.maxbin2.stb

You should see the following as inStrain runs (should only take a few minutes)

You gave me a sam- I'm going to make it a .bam now
Converting N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam to N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.bam
samtools view -S -b N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sam > N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.bam
sorting N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.bam
samtools sort N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.bam -o N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam
Indexing N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam
samtools index N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam.bai
***************************************************
    ..:: inStrain profile Step 1. Filter reads ::..
***************************************************

Getting read pairs: 100%|██████████████████████████████████████████████████████████| 178/178 [00:00<00:00, 715.57it/s]
Making read report
/Users/mattolm/.pyenv2/versions/3.6.9/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/Users/mattolm/.pyenv2/versions/3.6.9/lib/python3.6/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Filtering reads
1,727 read pairs remain after filtering
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************

Profiling scaffolds: 100%|████████████████████████████████████████████████████████████| 23/23 [00:06<00:00,  3.44it/s]
Storing output
***************************************************
  .:: inStrain profile Step 3. Profile genes ::..
***************************************************

20.67703568161025% of the input 1093 genes were marked as incomplete
161 scaffolds with genes, 169 in the IS, 153 to compare
Running gene-level calculations on scaffolds: 100%|█████████████████████████████████| 153/153 [00:18<00:00,  8.16it/s]
***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************

Scaffold to bin was made using .stb file
85.66% of scaffolds have a genome
93.82% of scaffolds have a genome
***************************************************
 .:: inStrain profile Step 5. Generate plots ::..
***************************************************

making plots 1, 2, 3, 4, 5, 6, 7, 8, 9
85.66% of scaffolds have a genome
Plotting plot 1
Plotting plot 2
85.66% of scaffolds have a genome
Plotting plot 3
57.37% of scaffolds have a genome
Plotting plot 4
97.33% of scaffolds have a genome
Plotting plot 5
93.82% of scaffolds have a genome
Plotting plot 6
Plotting plot 7
97.33% of scaffolds have a genome
Plotting plot 8
93.96% of scaffolds have a genome
Plotting plot 9
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

    ..:: inStrain profile finished ::..

Output tables........ /Users/mattolm/Programs/testing_house/tutorial/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/output/
Figures.............. /Users/mattolm/Programs/testing_house/tutorial/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/figures/

See documentation for output descriptions - https://instrain.readthedocs.io/en/latest/

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

The last note shows you where the plots and figures have been made. Here’s a list of the files that you should see

$ ls -lht N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/output/
total 512
-rw-r--r--  1 mattolm  staff   545B Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_genomeWide_mapping_info.tsv
-rw-r--r--  1 mattolm  staff   602B Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_genomeWide_scaffold_info.tsv
-rw-r--r--  1 mattolm  staff    25K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_SNP_mutation_types.tsv
-rw-r--r--  1 mattolm  staff   125K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_gene_info.tsv
-rw-r--r--  1 mattolm  staff    19K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_mapping_info.tsv
-rw-r--r--  1 mattolm  staff    14K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_linkage.tsv
-rw-r--r--  1 mattolm  staff    26K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_SNVs.tsv
mattolm@Matts-MacBook-Pro-3:~/Programs/testing_house/tutorial$ caffold_min1000.fa-vs-N5_271_010G1.IS_scaffold_info.tsv

$ ls -lht N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/figures
total 7792
-rw-r--r--  1 mattolm  staff   432K Jan 23 15:17 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_GeneHistogram_plot.pdf
-rw-r--r--  1 mattolm  staff   422K Jan 23 15:17 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_LinkageDecay_types_plot.pdf
-rw-r--r--  1 mattolm  staff   448K Jan 23 15:17 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_ScaffoldInspection_plot.pdf
-rw-r--r--  1 mattolm  staff   419K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_ReadFiltering_plot.pdf
-rw-r--r--  1 mattolm  staff   421K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_LinkageDecay_plot.pdf
-rw-r--r--  1 mattolm  staff   420K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_MajorAllele_frequency_plot.pdf
-rw-r--r--  1 mattolm  staff   419K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_readANI_distribution.pdf
-rw-r--r--  1 mattolm  staff   443K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_genomeWide_microdiveristy_metrics.pdf
-rw-r--r--  1 mattolm  staff   419K Jan 23 15:16 N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS_CoverageAndBreadth_vs_readMismatch.pdf

For help interpreting these output files, see Example output and explanations

inStrain compare¶

inStrain compare allows you to compare genomes that have been profiled by multiple mappings. To compare a genome in multiple samples, you must first map reads from multiple samples to the same .fasta file, then run run `inStrain profile on each mapping.

In this tutorial we profiled reads mapped to the .fasta file “N5_271_010G1_scaffold_min1000.fa”. Provided in the inStrain test_data folder (<https://github.com/MrOlm/inStrain/tree/master/test/test_data>) is also a different set of reads mapped to the same .fasta file. We’ve also already run inStrain on this mapping for you! The resulting inStrain profile is the folder N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.IS/

To compare these inStrain profiles we will use the following command

$ inStrain compare -i N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/ ~/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.IS/ -o N5_271_010G1_scaffold_min1000.fa.IS.COMPARE -p 6

 Loading N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.IS/
 Loading /Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G2.IS/
 Warning! Your inStrain folder is from version 1.0.0, while the installed version is 1.2.1.
 If you experience weird behavior, this might be why
 169 of 178 scaffolds are in at least 2 samples
 Profiling scaffolds: 100%|█████████████████████████████████████████████████████| 169/169 [00:22<00:00,  7.38it/s]

You should now have the following output file created

$ ls -lht N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/output/
total 64
-rw-r--r--  1 mattolm  staff    30K Jan 23 15:20 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_comparisonsTable.tsv

This file shows the comparison values between scaffolds, however. To make these on the genome level, we can run inStrain genome_wide

$ inStrain genome_wide -i N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/ -s ~/Programs/inStrain/test/test_data/N5_271_010G1.maxbin2.stb
Scaffold to bin was made using .stb file
89.62% of scaffolds have a genome

Now we should also have a table that compares these genomes on the genome level

$ ls -lht N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/output/
total 72
-rw-r--r--  1 mattolm  staff   556B Jan 23 15:23 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_genomeWide_compare.tsv
-rw-r--r--  1 mattolm  staff    30K Jan 23 15:20 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_comparisonsTable.tsv

Finally, we can also plot these results using the inStrain plot function

$ inStrain plot -i N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/
making plots 10
89.62% of scaffolds have a genome
Plotting plot 10
Done!

This should make a figure in the figures folder

$ ls -lht N5_271_010G1_scaffold_min1000.fa.IS.COMPARE/figures/
total 936
-rw-r--r--  1 mattolm  staff   419K Jan 23 15:25 N5_271_010G1_scaffold_min1000.fa.IS.COMPARE_inStrainCompare_dendrograms.pdf

As before, for help interpreting this output see Example output and explanations .

Running inStrain with public reference genomes¶

The following tutorial goes through running inStrain with a set of publically available reference genomes.