Executing EnTAP

Execution is the main stage of EnTAP, in which the transcriptome supplied by the user is annotated. Once the installation/configuration steps have been completed at least once, all of the required databases have been downloaded and configured, so Execution can be run. Configuration does not need to be run again unless you would like to update any databases or configure more.

The following stages will be run:

Input Files:

Required:

.FASTA formatted transcriptome file (either protein or nucleotide)
.dmnd (DIAMOND) indexed databases, which can be formatted in Configuration

Optional:

.BAM/.SAM alignment file. If left unspecified, expression filtering will not be performed. Refer to the Expression Analysis section for further detail
The frame_selection flag, which can be set to true to perform Frame Selection on your input nucleotide sequences
GFF, Donor, and Recipient databases. If left unspecified, Horizontal Gene Transfer analysis will not be performed

Sample Run:

The following is an example run with a nucleotide transcriptome (transcriptome.fnn), two reference DIAMOND databases, an alignment file (alignment.sam), and 8 threads. Expression analysis, frame selection, similarity search, and EggNOG analysis will be run with the settings below.

In entap_run.params, change the following lines for our frame selection test:

input=path/to/transcriptome.fnn
database=path/to/database.dmnd,path/to/database2.dmnd
align=path/to/alignment.sam
frame_selection=true

EnTAP --run --run-ini path/to/entap_run.params --entap-ini path/to/entap_config.ini -t thread_number
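
The parameter edits and the command above could also be scripted. The following is a minimal sketch, assuming entap_run.params consists of plain key=value lines as in the example; the helper names set_params and build_command are hypothetical and are not part of EnTAP. It only uses the flags shown above and does not launch EnTAP itself.

```python
def set_params(text, updates):
    """Return params text with every key in `updates` set to its new value.

    Assumes one key=value pair per line; other lines are kept unchanged.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Keys that were not already present are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


def build_command(run_ini, entap_ini, threads):
    """Build the argument list for the run command shown above."""
    return ["EnTAP", "--run", "--run-ini", run_ini,
            "--entap-ini", entap_ini, "-t", str(threads)]


# Hypothetical starting contents of entap_run.params.
original = "input=\ndatabase=\nalign=\nframe_selection=false\n"
updated = set_params(original, {
    "input": "path/to/transcriptome.fnn",
    "database": "path/to/database.dmnd,path/to/database2.dmnd",
    "align": "path/to/alignment.sam",
    "frame_selection": "true",
})
command = build_command("path/to/entap_run.params",
                        "path/to/entap_config.ini", 8)
print(updated)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run once the placeholder paths have been replaced with real ones.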

Further detail on each stage of EnTAP can be found in the page for each stage!

Resuming an EnTAP Run

To save time and make it easier to analyze data in different ways, EnTAP can pick up where a previous run left off if certain stages have already been run and you would like to analyze the data with different contaminant flags or taxonomic favoring (or more!). For example, if similarity searching was run previously, you can skip aligning against the database and only analyze the data. This is controlled by the resume flag in entap_run.params. Setting it to TRUE allows EnTAP to resume from data in a previous run, while FALSE causes EnTAP to stop execution when data from a previous run is found.

To skip re-running a stage, the files it produced previously must be in the same directories and have the same names. With an input transcriptome named ‘transcriptome’ and an example DIAMOND database named ‘complete.protein.dmnd’, the expected files are:

Expression Filtering
- transcriptome.genes.results
Frame Selection
- transcriptome.fasta.faa
- transcriptome.fasta.fnn
- transcriptome.fasta.lst
Similarity Search
- blastp_transcriptome_complete.protein.out
Gene Family (EggNOG)
- blastp_transcriptome_eggnog_proteins.out
Horizontal Gene Transfer
- blastp_transcriptome_complete.protein.out
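
The naming pattern in the list above can be sketched in code to check which stages already have output. This is only an illustration under assumptions: the function names are hypothetical, the file names simply mirror the example above (including the blastp_ prefix), and because the subdirectory layout is not described here, the sketch searches the whole output directory recursively.

```python
from pathlib import Path


def expected_resume_files(transcriptome, database):
    """Map each stage to the file names listed above for the given inputs."""
    # 'complete.protein.dmnd' appears as 'complete.protein' in the file names.
    suffix = ".dmnd"
    db = database[:-len(suffix)] if database.endswith(suffix) else database
    return {
        "Expression Filtering": [f"{transcriptome}.genes.results"],
        "Frame Selection": [
            f"{transcriptome}.fasta.faa",
            f"{transcriptome}.fasta.fnn",
            f"{transcriptome}.fasta.lst",
        ],
        "Similarity Search": [f"blastp_{transcriptome}_{db}.out"],
        "Gene Family (EggNOG)": [f"blastp_{transcriptome}_eggnog_proteins.out"],
        "Horizontal Gene Transfer": [f"blastp_{transcriptome}_{db}.out"],
    }


def stages_with_output(out_dir, expected):
    """Return the stages whose expected files all exist somewhere under out_dir."""
    found = {p.name for p in Path(out_dir).rglob("*") if p.is_file()}
    return [stage for stage, names in expected.items()
            if all(name in found for name in names)]


expected = expected_resume_files("transcriptome", "complete.protein.dmnd")
for stage, names in expected.items():
    print(stage, names)
```

Calling stages_with_output with your EnTAP output directory and the expected mapping then lists the stages whose files are all present. This checks file names only, not whether the files are complete or located where EnTAP looks for them.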

To resume a run, some things must stay the same! Changing any of the following flags will cause the affected stages to be run again:

ontology
protein
input
align
database
- Does not necessarily need to remain the same. If additional databases are added, EnTAP will recognize the new ones and run similarity searching on them while skipping those that have already been run
no-trim
out-dir
hgt-donor/hgt-recipient
- Similar to the DIAMOND databases above, if additional databases are added, EnTAP will recognize the new ones and run similarity searching on them while skipping those that have already been run
hgt-gff
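
Before resuming, it can help to compare the parameters of the previous run with the new ones. The sketch below is an illustration, not part of EnTAP. It assumes both parameter files use plain key=value lines with comma-separated database lists, and it assumes the recipient flag is spelled hgt-recipient to match hgt-donor and hgt-gff, so adjust the names to match your own file. It reports only which of the flags listed above differ, and treats added databases separately from other changes.

```python
# Flags from the list above (spelling of hgt-recipient is an assumption).
RESUME_FLAGS = ["ontology", "protein", "input", "align", "database",
                "no-trim", "out-dir", "hgt-donor", "hgt-recipient", "hgt-gff"]
# Flags for which adding entries does not invalidate earlier results.
ADDITIVE_FLAGS = {"database", "hgt-donor", "hgt-recipient"}


def parse_params(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def resume_differences(old, new):
    """Report resume-relevant flags that differ between two parameter sets."""
    report = {}
    for flag in RESUME_FLAGS:
        before, after = old.get(flag, ""), new.get(flag, "")
        if flag in ADDITIVE_FLAGS:
            old_set = {item for item in before.split(",") if item}
            new_set = {item for item in after.split(",") if item}
            if old_set == new_set:
                continue
            if old_set < new_set:
                report[flag] = "added: " + ",".join(sorted(new_set - old_set))
                continue
        elif before == after:
            continue
        report[flag] = "changed"
    return report


old = parse_params("input=a.fnn\ndatabase=db1.dmnd\nout-dir=out\n")
new = parse_params("input=a.fnn\ndatabase=db1.dmnd,db2.dmnd\nout-dir=out2\n")
print(resume_differences(old, new))
```

An "added" entry corresponds to the case described above where only the new databases are searched, while a "changed" entry marks a flag whose change is expected to trigger re-running the affected stages.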