Interpreting the Results

EnTAP writes output files at each stage of execution so you can see how the data is handled throughout the pipeline. In addition to the files generated at each stage in their respective directories, the following files summarize the overall process and are worth reviewing after EnTAP has finished:

  1. Final Annotation Results

  2. Log File

  3. Transcriptomes

/final_results Directory

The overall EnTAP annotations are contained within the /final_results directory. These files combine the information from every stage of the pipeline, so they can be considered the most important files. Each file generated by EnTAP can be written in several file types for the given subset of information; the file type is appended to the base name. The file types that are written can be controlled through the output-format flag.

EnTAP File Types

file type

description

.tsv

All of the TSV files will have the header information from each stage of the pipeline (which can be seen on the other pages of the docs). Keep in mind that some headers will not be shown if that part of the pipeline was skipped or the information was not found for any of the input sequences. TSV-formatted files are compatible with the Tidyverse (including ‘NA’ being used for empty data cells).

.faa

FASTA-formatted amino acid / protein file. Sequences without protein information available will not be printed here.

.fnn

FASTA-formatted nucleotide file. Sequences without nucleotide information available will not be printed here.

_gene_ontology_terms.tsv

Tab-delimited file that can be used for Gene Enrichment. Columns are as follows: Sequence ID, Gene Ontology Term ID, Gene Ontology Term, Gene Ontology Category, and Effective Length. Note: the Effective Length column will not be printed when Expression Filtering has not been performed.

_enrich_geneid_go.tsv

Tab-delimited file that can be used for Gene Enrichment. The first column contains the gene ID and the second column contains the Gene Ontology term corresponding to the gene ID. This file is not printed by default, but can be selected through the output-format flag.

_enrich_geneid_len.tsv

Tab-delimited file that can be used for Gene Enrichment. The first column contains the gene ID and the second column contains the effective length from Expression Analysis. This file will not be printed if Expression Analysis has not been run. Note: the Length column will not be printed when Expression Filtering has not been performed. This file is not printed by default, but can be selected through the output-format flag.
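As an illustration of how the _gene_ontology_terms.tsv layout relates to the two-column gene ID / GO term form, here is a minimal Python sketch. It assumes the file has a single header row and that the first two columns are Sequence ID and Gene Ontology Term ID, as listed above. The function name, the header labels in the sample data, and the sample rows are made up for illustration and are not taken from EnTAP.

```python
import csv
from collections import defaultdict
from io import StringIO


def gene_to_go_terms(handle, has_header=True):
    """Map each sequence ID to a sorted list of its GO term IDs.

    Assumes column 1 is the sequence ID and column 2 is the GO term ID.
    Rows whose GO term ID is empty or 'NA' are skipped.
    """
    mapping = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        seq_id, go_id = row[0].strip(), row[1].strip()
        if not seq_id or go_id in ("", "NA"):
            continue
        mapping[seq_id].add(go_id)
    return {seq_id: sorted(terms) for seq_id, terms in mapping.items()}


# Made-up sample data; the header labels here are hypothetical.
sample = (
    "seq_id\tgo_id\tgo_term\tgo_category\n"
    "gene1\tGO:0005524\tATP binding\tmolecular_function\n"
    "gene1\tGO:0006412\ttranslation\tbiological_process\n"
    "gene2\tGO:0005524\tATP binding\tmolecular_function\n"
    "gene3\tNA\tNA\tNA\n"
)

if __name__ == "__main__":
    for seq_id, terms in gene_to_go_terms(StringIO(sample)).items():
        print(seq_id, ",".join(terms))
```

To run this on a real file, pass an open file handle instead of the StringIO object. Reading columns by position rather than by header name avoids depending on the exact header text, which is not given here.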

The following files will be printed to the /final_results directory. As mentioned above, each of these can have a variety of file types printed.

EnTAP Final Summary Files

base filename

description

file types

entap_results

This file is essentially a final report from EnTAP, summarizing the results of the entire pipeline. Since it includes every single transcript, it contains annotated, unannotated, and contaminated sequences. Further filtering of transcripts (for example, if you are only interested in transcripts that were annotated) can be done with this file by filtering on its columns, or with the files below.

.tsv

annotated

Contains all sequences that either aligned against databases through Similarity Searching or aligned against EggNOG/InterProScan databases.

all

unannotated

Contains all sequences that aligned neither against databases through Similarity Searching nor against EggNOG/InterProScan databases.

all

annotated_contam

Contains all annotated sequences that were flagged as a contaminant. These are sequences that either aligned against databases through Similarity Searching or aligned against EggNOG/InterProScan databases.

all

annotated_without_contam

Contains all annotated sequences that were not flagged as a contaminant. Sequences are flagged as a contaminant if the species aligned through Similarity Searching matches anything input through the contam flag.

all
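The descriptions above imply some simple relationships between the summary files: annotated and unannotated sequences together should make up entap_results, and annotated_contam and annotated_without_contam together should make up annotated. The following Python sketch checks those relationships on sets of sequence IDs. It assumes the sequence ID is the first column of each .tsv file and that each file has one header row; both are assumptions, and the function names are hypothetical.

```python
import csv


def first_column_ids(handle, has_header=True):
    """Collect the values of the first column (assumed to be the sequence ID)."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    return {row[0] for row in reader if row and row[0]}


def partition_problems(all_ids, annotated, unannotated, contam, without_contam):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if annotated & unannotated:
        problems.append("annotated and unannotated overlap")
    if (annotated | unannotated) != all_ids:
        problems.append("annotated + unannotated != entap_results")
    if contam & without_contam:
        problems.append("annotated_contam and annotated_without_contam overlap")
    if (contam | without_contam) != annotated:
        problems.append("contam + without_contam != annotated")
    return problems


if __name__ == "__main__":
    # Made-up IDs standing in for the contents of the five .tsv files.
    print(partition_problems(
        all_ids={"a", "b", "c"},
        annotated={"a", "b"},
        unannotated={"c"},
        contam={"a"},
        without_contam={"b"},
    ))
```

In practice each set would come from calling first_column_ids on the corresponding .tsv file in /final_results. The .tsv files are used rather than the FASTA files because sequences without protein or nucleotide information are not printed to .faa or .fnn.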

EnTAP Log File (log_file)

The log file contains a statistical analysis of each stage of the pipeline that you ran. The following is a brief outline of some of the statistics reported:

  1. Initial Statistics

    • Transcriptome statistics: n50, n90, average gene length, longest/shortest gene

    • Summary of user flags

    • Summary of execution paths (from config file)

  2. Expression Analysis

    • Transcriptome statistics: n50, n90, average gene length, longest/shortest gene

    • Summary of sequences kept/removed after filtering

  3. Frame Selection

    • Transcriptome statistics: n50, n90, average gene length, longest/shortest gene

    • Summary of frame selection: partial, internal, and complete genes, plus genes where no frame was found

  4. Similarity Searching

    • Contaminant/uninformative/informative count

    • Phylogenetic/contaminant distribution of alignments

    • Alignment distribution based upon frame results (partial/internal/complete)

    • Sequence count that did not align against a database reference

    • Statistics calculated for each individual database and final results

  5. Gene Family Assignment

    • Phylogenetic distribution of gene family assignments

    • Gene Ontology category distribution (biological process, molecular function, cellular component)

  6. Final Annotation Statistics

    • Statistical summary of each stage

    • Runtime
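For readers unfamiliar with the transcriptome statistics named above, the following Python sketch computes n50, n90, average length, and longest/shortest length from a list of sequence lengths. It uses the conventional definition of N50 (the length of the sequence at which the running total, taken from longest to shortest, first reaches 50% of all bases). Whether EnTAP computes its values in exactly this way is an assumption, and the function names are hypothetical.

```python
def nxx(lengths, percent):
    """Return the Nxx value: walking from longest to shortest, the length at
    which the running total first reaches `percent` % of the summed length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * percent:
            return length
    raise ValueError("no sequence lengths given")


def length_summary(lengths):
    """Summarize a collection of sequence lengths."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no sequence lengths given")
    return {
        "n50": nxx(lengths, 50),
        "n90": nxx(lengths, 90),
        "mean": sum(lengths) / len(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
    }


if __name__ == "__main__":
    # Made-up lengths for illustration.
    print(length_summary([100, 200, 300, 400, 500]))
```

Comparing these values between the Initial Statistics, Expression Analysis, and Frame Selection sections of the log shows how each filtering step changed the length distribution of the transcriptome.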

/transcriptomes Directory

The /transcriptomes directory contains the transcriptomes used at various stages of the EnTAP pipeline. Stages such as Frame Selection and Expression Filtering can change the transcriptome by removing sequences (because they fall below the FPKM threshold or have no reading frame). In the following filenames, “transcriptome” stands for the filename of your input transcriptome.

EnTAP Transcriptomes

filename

description

transcriptome.fasta

This file is essentially a copy of your input transcriptome. The sequence IDs may be changed depending on whether you selected the ‘trim’ flag.

transcriptome_expression_filtered.fasta

As the name implies, this transcriptome is the result of the Expression Filtering stage, with sequences removed that fall below the FPKM threshold you have specified.

transcriptome_frame_selected.fasta

This transcriptome is the result of Frame Selection. Sequences in which a frame was not selected are removed and those with a frame are kept in this file. As a result, this file will always be in protein format.

transcriptome_final.fasta

This is your final transcriptome following the “Transcriptome Filtering” stage of EnTAP. This transcriptome will be used for the later stages of the pipeline (Similarity Searching / Ontology / HGT). Depending on which method of execution you chose (runN / runP), the result may be either protein or nucleotide, with Frame Selection and/or Expression Filtering applied.
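One quick way to see how many sequences each stage removed is to count the FASTA records in each of the files listed above. The Python sketch below does this for whichever of the four files exist in a given directory. The function names are hypothetical, and the filename pattern (base name plus the suffixes from the table, with a .fasta extension) is assumed from the table rather than confirmed for every input.

```python
from pathlib import Path

# Suffixes taken from the filenames in the table above.
STAGE_SUFFIXES = ["", "_expression_filtered", "_frame_selected", "_final"]


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def stage_counts(transcriptome_dir, base="transcriptome"):
    """Return {filename: record count} for each stage file that exists.

    `base` stands in for the filename of the input transcriptome.
    """
    counts = {}
    for suffix in STAGE_SUFFIXES:
        path = Path(transcriptome_dir) / f"{base}{suffix}.fasta"
        if path.exists():  # stages that were skipped produce no file
            counts[path.name] = count_fasta_records(path)
    return counts


if __name__ == "__main__":
    # Hypothetical location; replace with your own /transcriptomes directory.
    for name, n in stage_counts("transcriptomes").items():
        print(f"{name}\t{n}")
```

A drop in the count between transcriptome.fasta and transcriptome_expression_filtered.fasta reflects sequences below the FPKM threshold, and a drop after Frame Selection reflects sequences for which no frame was selected.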