Execution

Execution is the main stage of EnTAP, in which the transcriptome supplied by the user is annotated. Once the installation/configuration steps have been followed at least once, all of the required databases have been downloaded and configured, and Execution can be run. Configuration does not need to be run again unless you would like to update any databases or configure more.

The following stages will be run:

  1. Expression Filtering (optional)

  2. Frame Selection (optional)

  3. Similarity Search

  4. Orthologous Group Assignment

  5. InterProScan (optional)

Input Files:

Required:

  • .FASTA formatted transcriptome file (either protein or nucleotide)

  • .dmnd (DIAMOND) indexed databases, which can be formatted in the Configuration stage.

Optional:

  • .BAM/.SAM alignment file. If left unspecified, expression filtering will not be performed.
    • This can be generated by software that does not perform gapped alignments such as Bowtie (not Bowtie2). All you need to generate an alignment file is a pair of reads and your assembled transcriptome!

Sample Run:

A specific run flag (runP/runN) must be used:

  • runP: Indicates blastp. Frame selection will be run if nucleotide sequences are input

  • runN: Indicates blastx. Frame selection will not be run with this input

An example run with a nucleotide transcriptome (transcriptome.fasta), two reference DIAMOND databases, an alignment file (alignment.sam), and 8 threads:

EnTAP --runP -i path/to/transcriptome.fasta -d path/to/database.dmnd -d path/to/database2.dmnd -a path/to/alignment.sam --ini path/to/ini_file -t 8

With the above command, the entire EnTAP pipeline will run. Both Frame Selection and Expression Filtering can be skipped if preferred by the user. If a protein transcriptome is input with the runP flag, or the runN flag is used, Frame Selection will be skipped. If no short-read alignment file is provided in SAM/BAM format, Expression Filtering via RSEM will be skipped.
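The skipping rules above can be sketched in a few lines of Python. This is only an illustration of the logic described in this section, not EnTAP's own code: the function name and its arguments are hypothetical, and InterProScan is left out because this section does not say what enables it.

```python
def planned_stages(run_flag, input_is_protein, has_alignment):
    """Return the stages expected to run, in pipeline order (illustrative only)."""
    if run_flag not in ("runP", "runN"):
        raise ValueError("run_flag must be 'runP' or 'runN'")
    stages = []
    # Expression Filtering needs a SAM/BAM alignment file.
    if has_alignment:
        stages.append("Expression Filtering")
    # Frame Selection only applies to nucleotide input searched with runP.
    if run_flag == "runP" and not input_is_protein:
        stages.append("Frame Selection")
    # These two stages are not marked optional in the stage list.
    stages.append("Similarity Search")
    stages.append("Orthologous Group Assignment")
    return stages


# The sample run above: runP, nucleotide input, alignment file supplied.
print(planned_stages("runP", input_is_protein=False, has_alignment=True))
```

For the sample run, all four stages are listed. With runN, or with a protein input, "Frame Selection" drops out of the list.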

Expression Analysis

The goal of expression filtering, or transcript quantification, is to determine the relative abundance of transcripts based on how the sequenced reads map back to the assembled transcriptome, and to use this information to filter out suspect expression profiles that may have originated from poor or incomplete assemblies. Filtering is based on FPKM (fragments per kilobase of transcript per million mapped reads), a normalized measure of expression. The threshold can be specified with the --fpkm flag. EnTAP will remove any sequences whose FPKM falls below this threshold.
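To make the threshold concrete, here is a minimal sketch of the textbook FPKM definition and a filter built on it. EnTAP obtains its FPKM values from RSEM, so this is not how EnTAP computes them; the function names, the input layout, and the example numbers are all made up for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def filter_by_fpkm(transcripts, total_mapped_fragments, threshold):
    """Keep transcripts whose FPKM is at or above the threshold.

    `transcripts` maps a transcript name to (mapped fragments, length in bp).
    """
    kept = {}
    for name, (fragments, length_bp) in transcripts.items():
        value = fpkm(fragments, length_bp, total_mapped_fragments)
        if value >= threshold:
            kept[name] = value
    return kept


# Hypothetical counts for two transcripts and one million mapped fragments.
example = {"transcript_1": (500, 2000), "transcript_2": (1, 4000)}
print(filter_by_fpkm(example, total_mapped_fragments=1_000_000, threshold=0.5))
```

Here transcript_1 has an FPKM of 250 and is kept, while transcript_2 has an FPKM of 0.25 and is dropped by the 0.5 threshold.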

Frame Selection

Frame selection is the process of determining the coding region of a transcript. Oftentimes, due to assembly errors or other factors, a coding region may not be found for a transcript and EnTAP will remove this sequence. When a coding region is found, EnTAP will include the sequence for further annotation.

Taxonomic Favoring and Contaminant Filtering

Taxonomic contaminant filtering (as well as taxonomic favoring) is based upon the NCBI Taxonomy database. As a result, all species/genus/lineage names must be contained within this database in order to be recognized by EnTAP.

Contaminant Filtering:

Contaminants can be introduced during collection or processing of a sample. A contaminant is essentially any organism other than the target species you are collecting. Some common contaminants are bacteria and fungi, which can sometimes be found within collected samples. Transcripts flagged as contaminants will be written to a file appended with “_contam”, but will not be removed from the final annotations file. Oftentimes, researchers would like to remove these sequences from the dataset.

One or more contaminants can be specified in the ini file (separated by a comma). An example of flagging bacteria and fungi as contaminants can be seen below:

contam=fungi,bacteria

Some common contaminants:

  • insecta

  • fungi

  • bacteria
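As a rough illustration of how a comma-separated contaminant entry could be read and applied, consider the sketch below. The helper names, the case-insensitive matching, and the example lineage are assumptions made for this example; EnTAP resolves names against the NCBI Taxonomy database, which is not reproduced here.

```python
def parse_contaminants(ini_line):
    """Split a 'contam=a,b' line into a list of lowercase names."""
    key, _, value = ini_line.partition("=")
    if key.strip() != "contam":
        raise ValueError("expected a line starting with 'contam='")
    return [name.strip().lower() for name in value.split(",") if name.strip()]


def is_contaminant(lineage, contaminants):
    """Flag a hit if any contaminant name appears in its lineage."""
    lineage_names = {name.strip().lower() for name in lineage}
    return any(name in lineage_names for name in contaminants)


contaminants = parse_contaminants("contam=fungi,bacteria")
# Hypothetical lineage for the species of a similarity search hit.
hit_lineage = ["Eukaryota", "Fungi", "Ascomycota"]
print(contaminants, is_contaminant(hit_lineage, contaminants))
```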

Taxonomic Favoring:

During best hit selection of similarity search results, taxonomic information can be taken into account. If a certain lineage (for example, pinus) is specified, hits closer in taxonomic lineage to this selection will be favored. Any lineage such as species/kingdom/phylum can be used as long as it is contained within the Taxonomic Database. If it is not found in the database, EnTAP will stop the execution immediately and let you know!

This feature can be utilized via the ini file. An example can be seen below (Note: replace any spaces with an underscore):

taxon=pinus_taeda

Another example could be:

taxon=pinus
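If you generate ini entries from a script, the underscore rule is easy to apply automatically. A tiny sketch follows; the helper name is hypothetical, and only the taxon key and the space-to-underscore rule come from the text above.

```python
def format_taxon(name):
    """Build a 'taxon=' ini line, joining words with underscores."""
    return "taxon=" + "_".join(name.split())


print(format_taxon("pinus taeda"))  # prints: taxon=pinus_taeda
```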

Keep in mind, EnTAP will weigh the E-Value (within a database) and Coverage of the alignment before taxonomic weight in order to provide the most accurate result. If both the E-Value and Coverage are relatively similar, EnTAP will leverage taxonomic information.
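The ordering described above (E-Value and Coverage first, taxonomy only as a tie-breaker) might look something like the following sketch. It is not EnTAP's implementation. The Hit class, the two tolerance constants that stand in for "relatively similar", and the reduction of "closer in lineage" to simple membership are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str
    evalue: float
    coverage: float   # percent of the alignment covered
    lineage: tuple    # lowercase lineage names of the subject species


# Hypothetical tolerances for "relatively similar"; EnTAP's values are not given here.
EVALUE_LOG_TOLERANCE = 1.0   # within one order of magnitude
COVERAGE_TOLERANCE = 5.0     # within five percentage points


def _log_evalue(evalue):
    # Guard against an E-Value of exactly 0.
    return math.log10(max(evalue, 1e-300))


def similar(a, b):
    close_evalue = abs(_log_evalue(a.evalue) - _log_evalue(b.evalue)) <= EVALUE_LOG_TOLERANCE
    close_coverage = abs(a.coverage - b.coverage) <= COVERAGE_TOLERANCE
    return close_evalue and close_coverage


def best_hit(hits, taxon):
    """Pick the best hit, preferring the favored taxon only among similar hits."""
    if not hits:
        raise ValueError("no hits to choose from")
    # Rank by E-Value first (lower is better), then by coverage (higher is better).
    ranked = sorted(hits, key=lambda h: (h.evalue, -h.coverage))
    best = ranked[0]
    favored = taxon.replace("_", " ").lower()
    for candidate in ranked[1:]:
        if similar(best, candidate) and favored in candidate.lineage and favored not in best.lineage:
            return candidate
    return best


hits = [
    Hit("A", 1e-50, 90.0, ("bacteria",)),
    Hit("B", 5e-50, 88.0, ("viridiplantae", "pinus")),
]
print(best_hit(hits, "pinus").subject)  # prints: B
```

Hit B wins here only because its E-Value and Coverage are close to those of hit A. A pinus hit with a much worse E-Value would not displace A.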

Picking Up Where You Left Off

In order to save time and make it easier to run different analyses on the same data, EnTAP allows you to pick up where you left off if certain stages were already run and you’d like to analyze the data with different contaminant flags or taxonomic favoring. As an example, if similarity searching was run previously, you can skip aligning against the database and go straight to analyzing the results. However, this will not work with the --overwrite flag, as it removes previous runs so they cannot be recognized.

In order to pick up and skip re-running certain stages, the files generated previously must be in the same directories and have the same names. With an input transcriptome name of ‘transcriptome’ and an example database of ‘complete.protein’:

  • Expression Filtering
    • transcriptome.genes.results

  • Frame Selection
    • transcriptome.fasta.faa

    • transcriptome.fasta.fnn

    • transcriptome.fasta.lst

  • Similarity Search
    • blastp_transcriptome_complete.protein.faa.out

  • Gene Family
    • blastp_transcriptome_eggnog_proteins.out (for runP)

    • blastp_transcriptome_eggnog_proteins.out (for runN)
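Before re-running, it can be handy to check whether the expected files are still in place. The sketch below does that for the example names listed above. The function name is hypothetical, the directory to look in is left to the caller because the output layout is not described here, and only the runP gene family file is included.

```python
from pathlib import Path

# File names copied from the example above (transcriptome 'transcriptome',
# database 'complete.protein').
EXPECTED_OUTPUTS = {
    "Expression Filtering": ["transcriptome.genes.results"],
    "Frame Selection": [
        "transcriptome.fasta.faa",
        "transcriptome.fasta.fnn",
        "transcriptome.fasta.lst",
    ],
    "Similarity Search": ["blastp_transcriptome_complete.protein.faa.out"],
    "Gene Family": ["blastp_transcriptome_eggnog_proteins.out"],
}


def stage_outputs_present(directory, stage):
    """Return True if every expected file for `stage` exists in `directory`."""
    folder = Path(directory)
    return all((folder / name).is_file() for name in EXPECTED_OUTPUTS[stage])


# Example: check the current directory for a previous similarity search.
print(stage_outputs_present(".", "Similarity Search"))
```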

Since file naming is based on your input as well, the flags below must remain the same:

  • (--runN / --runP)

  • (--ontology)

  • (--protein)

  • (-i / --input)

  • (-a / --align)

  • (-d / --database)
    • Does not necessarily need to remain the same. If additional databases are added, EnTAP will recognize the new ones and run similarity searching on them while skipping those that have already been run

  • (--qcoverage)

  • (--tcoverage)

  • (--no-trim)

  • (--out-dir)
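The rule above can be written down as a small check: every listed flag must match the previous run, except that new databases may be added. This is a sketch only. The dictionary representation of a run (with a single "run" key standing in for runN/runP) and the function name are hypothetical, not part of EnTAP.

```python
# Flags that must be identical between the previous and the current run.
MUST_MATCH = (
    "run", "ontology", "protein", "input", "align",
    "qcoverage", "tcoverage", "no-trim", "out-dir",
)


def can_resume(previous, current):
    """Return (ok, new_databases) for two runs described as flag dictionaries."""
    for key in MUST_MATCH:
        if previous.get(key) != current.get(key):
            return False, []
    already_searched = set(previous.get("database", []))
    new_databases = [db for db in current.get("database", []) if db not in already_searched]
    return True, new_databases


old_run = {"run": "runP", "input": "transcriptome.fasta", "database": ["database.dmnd"]}
new_run = {"run": "runP", "input": "transcriptome.fasta",
           "database": ["database.dmnd", "database2.dmnd"]}
print(can_resume(old_run, new_run))  # prints: (True, ['database2.dmnd'])
```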

State Control

Warning

This is experimental and certain configurations may not work. It is not needed if you only want to re-run certain portions of the pipeline; that is already covered by “picking up where you left off”!

State control of EnTAP allows you to further customize your runs. This is separate from omitting the --align flag to skip expression filtering, or using runN instead of runP to skip frame selection. You will probably never need to use this feature! Nonetheless, state control is based around the following stages of EnTAP:

  1. Expression Filtering

  2. Frame Selection

  3. Transcriptome Filtering (selection of final transcriptome)

  4. Similarity Search

  5. Gene Ontology / Gene Families

With this functionality of EnTAP, you can execute whatever states you would like with certain commands. Using a ‘+’ will execute from that state to the end, while using an ‘x’ will stop at that state. These basic commands can be combined to execute whatever you would like. It’s easier if I lay out some examples:

  • (--state 1+)
    • This will start at expression filtering and continue to the end of the pipeline

  • (--state 1+4x)
    • This will start at expression filtering and stop after similarity search

  • (--state 4x)
    • This will just execute similarity search and stop

  • (--state 1+3x5)
    • This will essentially execute every stage besides similarity searching

The default ‘state’ of EnTAP is merely ‘+’. This executes every stage of the pipeline (or attempts to if the correct commands are in place).
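One way to read the state strings above is sketched below. This is an interpretation that reproduces the listed examples, not EnTAP's actual parser; the function name and the handling of any combination not shown above are assumptions.

```python
import re

LAST_STAGE = 5  # stages 1-5 as numbered in the list above


def parse_state(state="+"):
    """Return the sorted stage numbers a state string would execute."""
    if not re.fullmatch(r"(?:[1-5][+x]?|\+)+", state):
        raise ValueError("unrecognized state string: " + repr(state))
    selected = set()
    run_from = None  # start of an open 'N+' range, if any
    for token in re.findall(r"[1-5][+x]?|\+", state):
        if token == "+":
            run_from = 1                      # bare '+': start from the first stage
        elif token.endswith("+"):
            run_from = int(token[0])          # 'N+': run from stage N onwards
        elif token.endswith("x"):
            stop = int(token[0])              # 'Nx': stop after stage N
            start = run_from if run_from is not None else stop
            selected.update(range(start, stop + 1))
            run_from = None
        else:
            selected.add(int(token))          # bare 'N': run just that stage
    if run_from is not None:
        # An open range with no 'x' runs to the end of the pipeline.
        selected.update(range(run_from, LAST_STAGE + 1))
    return sorted(selected)


for example in ("+", "1+", "1+4x", "4x", "1+3x5"):
    print(example, parse_state(example))
```

Under this reading, 1+3x5 yields stages 1, 2, 3 and 5, which matches "every stage besides similarity searching".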