Raw data access and API
==============================

API for accessing raw data
----------------------------

inStrain stores much more data than is shown in the output folder. It is kept in the ``raw_data`` folder, and is mostly stored in compressed formats. This data can be easily accessed using python, as described below.

To access the data, you first make an SNVprofile object of the inStrain output profile, and then you access data from that object. For example, the following code accessed the raw SNP table ::

  import inStrain
  import inStain.SNVprofile

  IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
  raw_snps = IS.get('raw_snp_table')


You can use the example above (``IS.get()``) to access any of the raw data described in the following section. There are also another special things that are accessed in other ways, as described in the section "Accessing other data"

Basics of raw_data
++++++++++++++++++++

A typical run of inStrain will yield a folder titled "raw_data", with lots of individual files in it. The specifics of what files are in there depend on how inStrain was run, and whether or not additional commands were run as well (like profile_genes).

There will always be a file titled "attributes.tsv". This describes some basic information about each item in the raw data. Here's an example

.. csv-table:: attributes.tsv

    name,value,type,description
    location,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test,value,Location of SNVprofile object
    version,1.3.3,value,Version of inStrain
    Rdic,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/Rdic.json,dictionary,Read pair -> mismatches
    mapping_info,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/mapping_info.csv.gz,pandas,Report on reads
    fasta_loc,/Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa,value,Location of .fasta file used during profile
    scaffold2length,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold2length.json,dictionary,Dictionary of scaffold 2 length
    object_type,profile,value,Type of SNVprofile (profile or compare)
    bam_loc,/Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa-vs-N5_271_010G1.sorted.bam,value,Location of .bam file
    scaffold_list,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold_list.txt,list,1d list of scaffolds that were profiled
    raw_linkage_table,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/raw_linkage_table.csv.gz,pandas,Raw table of linkage information
    raw_snp_table,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/raw_snp_table.csv.gz,pandas,Contains raw SNP information on a mm level
    cumulative_scaffold_table,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/cumulative_scaffold_table.csv.gz,pandas,Cumulative coverage on mm level. Formerly scaffoldTable.csv
    cumulative_snv_table,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/cumulative_snv_table.csv.gz,pandas,Cumulative SNP on mm level. Formerly snpLocations.pickle
    scaffold_2_mm_2_read_2_snvs,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold_2_mm_2_read_2_snvs.pickle,pickle,crazy nonsense needed for linkage
    genes_coverage,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_coverage.csv.gz,pandas,Coverage of individual genes
    genes_clonality,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_clonality.csv.gz,pandas,Clonality of individual genes
    genes_SNP_count,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_SNP_count.csv.gz,pandas,SNP density and counts of individual genes
    SNP_mutation_types,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/SNP_mutation_types.csv.gz,pandas,The mutation types of SNPs
    covT,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/covT.hd5,special,Scaffold -> mm -> position based coverage
    clonT,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/clonT.hd5,special,Scaffold -> mm -> position based clonality
    clonTR,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/clonTR.hd5,special,Scaffold -> mm -> rarefied position based clonality
    genes_fileloc,/Users/mattolm/Programs/inStrain/test/test_data/N5_271_010G1_scaffold_min1000.fa.genes.fna,value,Location of genes file that was used to call genes
    genes_table,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genes_table.csv.gz,pandas,Location of genes in the associated genes_file
    scaffold2bin,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/scaffold2bin.json,dictionary,Dictionary of scaffold 2 bin
    bin2length,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/bin2length.json,dictionary,Dictionary of bin 2 total length
    genome_level_info,/Users/mattolm/Programs/inStrain/test/test_backend/testdir/test/raw_data/genome_level_info.csv.gz,pandas,Table of genome-level information

This is what the columns correspond to:

name
  The name of the data. This is the name that you put into ``IS.get()`` to have inStrain retrieve the data for you. See the section "Accessing raw data" for an example.

value
  This lists the path to where the data is located within the raw_data folder. If the type of data is a value, than this just lists the value

type
  This describes how the data is stored. Value = the data is whatever is listed under value; list = a python list; numpy = a numpy array; dictionary = a python dictionary; pandas = a pandas dataframe; pickle = a piece of data that's stored as a python pickle object; special = a piece of data that is stored in a special way that inStrain knows how to de-compress

description
  A one-sentence description of what's in the data.

.. warning::

  Many of these pieces of raw data have the column "mm" in them, which means that things are calculated at every possible read mismatch level. This is often not what you want. See the section "Dealing with mm" for more information.

Accessing other data
++++++++++++++++++++++++++++++

In addition to the raw_data described above, there are a couple of other things that inStrain can make for you. You access these from methods that run on the IS object itself, instead of using the ``get`` method. For example::

  import inStrain
  import inStain.SNVprofile

  IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
  coverage_table = IS.get_raw_coverage_table()

The following methods work like that:

get_nonredundant_scaffold_table()
  Get a scaffold table with just one line per scaffold, not multiple mms

get_nonredundant_linkage_table()
  Get a linkage table with just one line per scaffold, not multiple mms

get_nonredundant_snv_table()
  Get a SNP table with just one line per scaffold, not multiple mms

get_clonality_table()
  Get a raw clonality table, listing the clonality of each position. Pass `nonredundant=False` to keep multiple mms

Dealing with "mm"
++++++++++++++++++++

Behind the scenes, inStrain actually calculates pretty much all metrics for every read pair mismatch level. That is, only including read pairs with 0 mis-match to the reference sequences, only including read pairs with >= 1 mis-match to the reference sequences, all the way up to the number of mismatches associated with the "PID" parameter.

For most of the output that inStrain makes in the output folder, it removes the "mm" column and just gives the results for the maximum number of mismatches. However, it's often helpful to explore other mismatches levels, to see how parameters vary with more or less stringent mappings. Much of the data stored in "read_data" is on the mismatch level. Here's an example of what the looks like (this is the cumulative_scaffold_table)::

  ,scaffold,length,breadth,coverage,coverage_median,coverage_std,bases_w_0_coverage,mean_clonality,median_clonality,unmaskedBreadth,SNPs,breadth_expected,ANI,mm
  0,N5_271_010G1_scaffold_102,1144,0.9353146853146853,5.106643356643357,5,2.932067325774674,74,1.0,1.0,0.6145104895104895,0,0.9889923642060382,1.0,0
  1,N5_271_010G1_scaffold_102,1144,0.9353146853146853,6.421328671328672,6,4.005996333777764,74,0.9992001028104149,1.0,0.6748251748251748,0,0.9965522492489882,1.0,1
  2,N5_271_010G1_scaffold_102,1144,0.9423076923076923,7.3627622377622375,7,4.2747074564903285,66,0.9993874800638958,1.0,0.7928321678321678,0,0.998498542620078,1.0,2
  3,N5_271_010G1_scaffold_102,1144,0.9423076923076923,7.859265734265734,8,4.748789115369562,66,0.9992251555869703,1.0,0.7928321678321678,0,0.9990314705263914,1.0,3
  4,N5_271_010G1_scaffold_102,1144,0.9423076923076923,8.017482517482517,8,4.952541407151938,66,0.9992251555869703,1.0,0.7928321678321678,0,0.9991577528529144,1.0,4
  5,N5_271_010G1_scaffold_102,1144,0.9458041958041958,8.271853146853147,8,4.9911156795536105,62,0.9992512780077317,1.0,0.8024475524475524,0,0.9993271891539499,1.0,7

As you can see, the same scaffold is shown multiple times, and the last column is ``mm``. At the row with mm = 0, you can see what the stats are when only considering reads that perfectly map to the reference sequence. As the mm goes higher, so do stats like coverage and breadth, as you now allow reads with more mismatches to count in the generation of these stats. In order to convert this files to what is provided in the output folder, the following code is run::

  import inStrain
  import inStain.SNVprofile

  IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
  scdb = IS.get('cumulative_scaffold_table')
  ScaffDb = scdb.sort_values('mm')\
              .drop_duplicates(subset=['scaffold'], keep='last')\
              .sort_index().drop(columns=['mm'])

The last line looks complicated, but it's very simple what is going on. First, you sort the database by ``mm``, with the lowest mms at the top. Next, for each scaffold, you only keep the row with the lowest mm. That's done using the ``drop_duplicates(subset=['scaffold'], keep='last')`` command. Finally, you re-sort the DataFrame to the original order, and remove the ``mm`` column. In the above example, this would mean that the only row that would survive would be where mm = 7, because that's the bottom row for that scaffold.

You can of course subset to any level of mismatch by modifying the above code slightly. For example, to generate this table only using reads with <=5 mismatches, you could use the following code::

  import inStrain
  import inStain.SNVprofile

  IS = inStain.SNVprofile.SNVprofile(``/home/mattolm/inStrainOutputTest/``)
  scdb = IS.get('cumulative_scaffold_table')
  scdb = scdb[scdb['mm'] <= 5]
  ScaffDb = scdb.sort_values('mm')\
              .drop_duplicates(subset=['scaffold'], keep='last')\
              .sort_index().drop(columns=['mm'])

.. warning::

  You usually do not want to subset these DataFrames using something like ``scdb = scdb[scdb['mm'] == 5]``. That's because if there are no reads that have 5 mismatches, as in the case above, you'll end up with an empty DataFrame. By using the drop_duplicates technique described above you avoid this problem, because in the cases where you don't have 5 mismatches, you just get the next-highest mm level (which is usually what you want)

A note for programmers
+++++++++++++++++++++++

If you'd like to edit inStrain to add functionality for your data, don't hesitate to reach out to the authors of this program for help. Additionally, please consider submitting a pull request on GitHub so that others can use your changes as well.