Interpreting the Results

EnTAP writes output files at each stage of execution so you can see how the data is handled throughout the pipeline. In addition to the files each stage generates in its own directory, the following are important to review after EnTAP has finished, as they summarize the overall process:

- Final Annotation Results
- Log File
- Transcriptomes

`/final_results` Directory

The overall EnTAP annotations are contained within the `/final_results` directory. These files combine the information from every stage of the pipeline, so they can be considered the most important files! Each file generated by EnTAP can be written in several file types for the given subset of information; the file type is appended to the base name. Which file types are written can be controlled through the `output-format` flag.

**EnTAP File Types**
| File type | Description |
| --- | --- |
| `.tsv` | All TSV files contain the header information from each stage of the pipeline (described on the other pages of the docs). Keep in mind that some headers will not be shown if that part of the pipeline was skipped or the information was not found for any of the input sequences. TSV-formatted files support Tidyverse format (including ‘NA’ being used for empty data cells). |
| `.faa` | FASTA-formatted amino acid / protein file. Sequences without protein information available will not be printed here. |
| `.fnn` | FASTA-formatted nucleotide file. Sequences without nucleotide information available will not be printed here. |
| `_gene_ontology_terms.tsv` | Tab-delimited file that can be used for Gene Enrichment. Columns are as follows: Sequence ID, Gene Ontology Term ID, Gene Ontology Term, Gene Ontology Category, and Effective Length. Note: the Effective Length column will not be printed when Expression Filtering has not been performed. |
| `_enrich_geneid_go.tsv` | Tab-delimited file that can be used for Gene Enrichment. The first column contains the gene ID and the second column contains the Gene Ontology term corresponding to that gene ID. This file is not printed by default, but can be selected through the `output-format` flag. |
| `_enrich_geneid_len.tsv` | Tab-delimited file that can be used for Gene Enrichment. The first column contains the gene ID and the second column contains the effective length from Expression Analysis. This file will not be printed if Expression Analysis has not been run. Note: the Length column will not be printed when Expression Filtering has not been performed. This file is not printed by default, but can be selected through the `output-format` flag. |
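
As a rough illustration of working with `_gene_ontology_terms.tsv`, the following sketch groups Gene Ontology term IDs by sequence ID. It reads columns by position (Sequence ID first, Gene Ontology Term ID second, as listed above) because the exact header strings are not given here. The function name `load_go_terms`, the header names in the sample, and the sample rows are made up for illustration and are not EnTAP output.

```python
import csv
import io
from collections import defaultdict


def load_go_terms(handle):
    """Map each sequence ID to its GO term IDs, reading columns by position."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    go_by_seq = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or truncated lines
        seq_id, go_id = row[0], row[1]
        # 'NA' marks an empty cell in the TSV output
        if go_id and go_id != "NA":
            go_by_seq[seq_id].append(go_id)
    return dict(go_by_seq)


# Hypothetical sample data; header names here are assumptions.
sample = (
    "seq_id\tgo_id\tgo_term\tgo_category\n"
    "tx1\tGO:0008150\tbiological_process\tbiological_process\n"
    "tx1\tGO:0003674\tmolecular_function\tmolecular_function\n"
    "tx2\tNA\tNA\tNA\n"
)
go_terms = load_go_terms(io.StringIO(sample))
print(go_terms)
```

With a real file, you would pass an open file handle (for example `open(path, newline="")`) instead of the in-memory sample.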

The following files will be printed to the `/final_results` directory. As mentioned above, each of these can be printed in a variety of file types.

**EnTAP Final Summary Files**
| Base filename | Description | File types |
| --- | --- | --- |
| `entap_results` | This file is essentially a final report from EnTAP, summarizing the results of the entire pipeline. Since it includes every single transcript, it contains annotated, unannotated, and contaminated sequences. Further filtering of transcripts (for example, if you are only interested in transcripts that were annotated) can be done with this file by filtering on its columns, or with the files below. | `.tsv` |
| `annotated` | Contains all sequences that either aligned against databases through Similarity Searching or aligned against EggNOG/InterProScan databases. | all |
| `unannotated` | Contains all sequences that aligned neither against databases through Similarity Searching nor against EggNOG/InterProScan databases. | all |
| `annotated_contam` | Contains all annotated sequences that were flagged as a contaminant. These are sequences that either aligned against databases through Similarity Searching or aligned against EggNOG/InterProScan databases. | all |
| `annotated_without_contam` | Contains all annotated sequences that were not flagged as a contaminant. Sequences are flagged as a contaminant if the species aligned through Similarity Searching matches anything input through the `contam` flag. | all |
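
Filtering `entap_results.tsv` by column, as described for that file, might look like the sketch below. The column names `query`, `description`, and `contaminant`, the `Yes`/`No` values, and the helper `select_rows` are assumptions made for illustration; check the header of your own file for the actual column names before adapting it.

```python
import csv
import io


def select_rows(handle, column, predicate):
    """Return the rows of a TSV whose value in `column` satisfies `predicate`."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header")
    return [row for row in reader if predicate(row[column])]


# Hypothetical sample data; real column names will differ.
sample = (
    "query\tdescription\tcontaminant\n"
    "tx1\tkinase\tNo\n"
    "tx2\tNA\tNA\n"
    "tx3\tchitinase\tYes\n"
)

# Keep rows that have a description ('NA' marks an empty cell).
annotated_rows = select_rows(io.StringIO(sample), "description", lambda v: v != "NA")
# Of those, keep rows not flagged as a contaminant.
clean_rows = [row for row in annotated_rows if row["contaminant"] == "No"]

print([row["query"] for row in annotated_rows])
print([row["query"] for row in clean_rows])
```

Raising an error for a missing column is a deliberate choice here, since some headers are absent when a stage of the pipeline was skipped.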

EnTAP Log File (`log_file`)

The log file contains a statistical summary of each stage of the pipeline that you ran. Below is a brief outline of some of the statistics reported:

Initial Statistics
- Transcriptome statistics: n50, n90, average gene length, longest/shortest gene
- Summary of user flags
- Summary of execution paths (from config file)
Expression Analysis
- Transcriptome statistics: n50, n90, average gene length, longest/shortest gene
- Summary of sequences kept/removed after filtering
Frame Selection
- Transcriptome statistics: n50, n90, average gene length, longest/shortest gene
- Summary of frame selection: partial, internal, and complete genes, plus genes where no frame was found
Similarity Searching
- Contaminant/uninformative/informative count
- Phylogenetic/contaminant distribution of alignments
- Alignment distribution based upon frame results (partial/internal/complete)
- Sequence count that did not align against a database reference
- Statistics calculated for each individual database and final results
Gene Family Assignment
- Phylogenetic distribution of gene family assignments
- Gene Ontology category distribution (biological processes, molecular function, cellular component)
Final Annotation Statistics
- Statistical summary of each stage
- Runtime
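
For readers unfamiliar with the transcriptome statistics listed above, here is a minimal sketch of how n50 and n90 are conventionally computed from sequence lengths. It uses the common definition (the length at which the longest sequences together cover at least 50% or 90% of all bases); this is an assumption about the definition, not a description of EnTAP's internal code, and the helper name `nx` and the sample lengths are made up.

```python
def nx(lengths, percent):
    """Length L such that sequences of length >= L cover at least `percent`% of all bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length


# Hypothetical sequence lengths.
lengths = [100, 200, 300, 400, 500]
stats = {
    "n50": nx(lengths, 50),
    "n90": nx(lengths, 90),
    "average": sum(lengths) / len(lengths),
    "longest": max(lengths),
    "shortest": min(lengths),
}
print(stats)
```

Comparing these values across the Initial Statistics, Expression Analysis, and Frame Selection sections of the log shows how each filtering stage changed the transcriptome.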

`/transcriptomes` Directory

The `/transcriptomes` directory contains the transcriptomes used at various stages of the EnTAP pipeline. Stages such as Frame Selection and Expression Filtering can change the transcriptome by removing sequences (because they fall below the FPKM threshold or have no reading frame). In the following filenames, “transcriptome” is replaced by the filename of your input transcriptome.

**EnTAP Transcriptomes**
| Filename | Description |
| --- | --- |
| `transcriptome.fasta` | This file is essentially a copy of your input transcriptome. The sequence IDs may be changed depending on whether you selected the ‘trim’ flag. |
| `transcriptome_expression_filtered.fasta` | As the name implies, this transcriptome is the result of the Expression Filtering stage, with sequences removed that fall under the FPKM threshold you specified. |
| `transcriptome_frame_selected.fasta` | This transcriptome is the result of Frame Selection. Sequences for which no frame was selected are removed, and those with a frame are kept. As a result, this file will always be in protein format. |
| `transcriptome_final.fasta` | This is your final transcriptome following the “Transcriptome Filtering” stage of EnTAP. It is used for the later stages of the pipeline (Similarity Searching / Ontology / HGT). Depending on which method of execution you chose (runN / runP), the result may be either protein or nucleotide, with Frame Selection and/or Expression Filtering applied. |
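
To see which sequences were removed between two of these files, one option is to compare their sequence IDs. The sketch below takes the ID to be the first whitespace-delimited token of each FASTA header line, which is a common convention and an assumption here. The helper `fasta_ids` and the sample records are made up for illustration.

```python
import io


def fasta_ids(handle):
    """Collect sequence IDs (first token of each '>' header line) from a FASTA file."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip headers with no ID
                ids.append(fields[0])
    return ids


# Hypothetical contents standing in for two transcriptome files.
before = io.StringIO(">tx1 len=6\nATGAAA\n>tx2 len=3\nATG\n>tx3 len=9\nATGCCCGGG\n")
after = io.StringIO(">tx1\nMK\n>tx3\nMPG\n")

kept = set(fasta_ids(after))
removed = sorted(seq_id for seq_id in fasta_ids(before) if seq_id not in kept)
print(removed)
```

With real data, open the two FASTA files in place of the in-memory samples. Compare against the `transcriptome.fasta` copy rather than your original input, since the sequence IDs may have been changed by the ‘trim’ flag.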
