Executing EnTAP
Execution is the main stage of EnTAP, in which the transcriptome you provide is annotated. Once the installation/configuration steps have been completed at least once, all of the required databases have been downloaded and configured, and Execution can be run. Configuration does not need to be run again unless you would like to update any databases or configure additional ones.
The following stages will be run:
Input Files:
Required:
.FASTA formatted transcriptome file (either protein or nucleotide)
.dmnd (DIAMOND) indexed databases, which can be formatted in Configuration
Optional:
.BAM/.SAM alignment file. If left unspecified, expression filtering will not be performed. Refer to the Expression Analysis section for further detail.
frame_selection flag can be set to true to perform Frame Selection on your input nucleotide sequences
GFF, Donor, and Recipient databases. If left unspecified, Horizontal Gene Transfer analysis will not be performed.
Sample Run:
An example run with a nucleotide transcriptome (transcriptome.fnn), two reference DIAMOND databases, an alignment file (alignment.sam), and 8 threads. Expression analysis, frame selection, similarity search, and EggNOG analysis will be run with the following settings.
In entap_run.params, change the following lines for our frame selection test:
input=path/to/transcriptome.fnn
database=path/to/database.dmnd,path/to/database2.dmnd
align=path/to/alignment.sam
frame_selection=true
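If you prefer to script these changes rather than edit the file by hand, the following is a minimal sketch. It assumes only that entap_run.params holds one key=value pair per line, as the lines above suggest; the helper name set_params and the example starting text are hypothetical and not part of EnTAP.

```python
def set_params(text, updates):
    """Return params text with each key in `updates` set to its new value.

    Assumes one key=value pair per line; all other lines are left untouched.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Keys that were not already present are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


updates = {
    "input": "path/to/transcriptome.fnn",
    "database": "path/to/database.dmnd,path/to/database2.dmnd",
    "align": "path/to/alignment.sam",
    "frame_selection": "true",
}
# Hypothetical starting content, for illustration only.
example = "input=\ndatabase=\nalign=\nframe_selection=false\n"
print(set_params(example, updates))
```

In practice you would read the text from your own entap_run.params, pass it through the helper, and write the result back.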
EnTAP --run --run-ini path/to/entap_run.params --entap-ini path/to/entap_config.ini -t thread_number
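When launching EnTAP from a pipeline script, it can help to assemble the command as an argument list. This is a sketch, not an official wrapper: build_entap_command is a hypothetical helper, it uses only the flags shown in the command above, and it assumes the EnTAP executable is on your PATH.

```python
import shlex


def build_entap_command(run_ini, entap_ini, threads):
    """Build the argument list for the EnTAP run command shown above."""
    if int(threads) < 1:
        raise ValueError("threads must be at least 1")
    return [
        "EnTAP", "--run",
        "--run-ini", str(run_ini),
        "--entap-ini", str(entap_ini),
        "-t", str(int(threads)),
    ]


cmd = build_entap_command(
    "path/to/entap_run.params", "path/to/entap_config.ini", 8
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
# To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting problems when paths contain spaces.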
Further detail on each stage of EnTAP can be found in the page for each stage!
Resuming an EnTAP Run
To save time and make it easier to run different analyses on the same data, EnTAP can pick up where you left off when certain stages have already been run and you would like to analyze the data with different contaminant flags or taxonomic favoring (or more!). As an example, if similarity searching was run previously, you can skip aligning against the database and only analyze the data. This is controlled by the resume flag in entap_run.params. Setting it to TRUE allows EnTAP to resume from the data of a previous run, while FALSE causes EnTAP to stop execution when data from a previous run is found.
For a stage to be skipped rather than re-run, the files generated by the previous run must be in the same directories and have the same names. With an input transcriptome name of ‘transcriptome’ and an example DIAMOND database of ‘complete.protein.dmnd’, the files are:
- Expression Filtering
transcriptome.genes.results
- Frame Selection
transcriptome.fasta.faa
transcriptome.fasta.fnn
transcriptome.fasta.lst
- Similarity Search
blastp_transcriptome_complete.protein.out
- Gene Family (EggNOG)
blastp_transcriptome_eggnog_proteins.out
- Horizontal Gene Transfer
blastp_transcriptome_complete.protein.out
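The naming pattern in the list above can be sketched as a small helper that derives the expected file names for other transcriptome and database names. This generalizes from the single example given here, so treat it as an assumption: in particular, the blastp prefix and the removal of the .dmnd extension are inferred from that example, and expected_resume_files is a hypothetical name, not an EnTAP function.

```python
from pathlib import Path


def expected_resume_files(transcriptome, database):
    """Map each stage to the file names listed above, for the given names.

    Assumption: names follow the single documented example exactly.
    """
    db = Path(database).name
    if db.endswith(".dmnd"):
        db = db[: -len(".dmnd")]
    return {
        "Expression Filtering": [f"{transcriptome}.genes.results"],
        "Frame Selection": [
            f"{transcriptome}.fasta.{ext}" for ext in ("faa", "fnn", "lst")
        ],
        "Similarity Search": [f"blastp_{transcriptome}_{db}.out"],
        "Gene Family (EggNOG)": [f"blastp_{transcriptome}_eggnog_proteins.out"],
        "Horizontal Gene Transfer": [f"blastp_{transcriptome}_{db}.out"],
    }


files = expected_resume_files("transcriptome", "path/to/complete.protein.dmnd")
for stage, names in files.items():
    print(stage, names)
```

Comparing these names against the contents of your previous output directories is a quick way to see which stages are likely to be skipped.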
In order to resume a run, some settings must stay the same! Changing any of the following flags will cause the affected stages to be run again:
ontology
protein
input
align
database
Does not necessarily need to remain the same. If additional databases are added, EnTAP will recognize the new ones and run similarity searching on them while skipping those that have already been run
no-trim
out-dir
hgt-donor/hgt_recipient
Similar to the DIAMOND databases above, if additional databases are added, EnTAP will recognize the new ones and run similarity searching on them while skipping those that have already been run
hgt-gff
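Before resuming, it can be useful to compare the settings of the previous run with the new ones. The sketch below is a hypothetical check, not something EnTAP provides: it takes two dictionaries of flag values, reports which of the flags listed above differ, and separately reports newly added entries for the flags that are allowed to grow. Flag names are spelled as in the list above, and it is an assumption that the donor and recipient flags take comma-separated values like database does. The source does not say what happens when a database is removed, so removals are not judged here.

```python
# Flags whose values must be identical to the previous run.
MUST_MATCH = ("ontology", "protein", "input", "align",
              "no-trim", "out-dir", "hgt-gff")
# Flags where adding new entries is fine (spelled as in the list above).
MAY_GROW = ("database", "hgt-donor", "hgt_recipient")


def split_list(value):
    """Split a comma-separated flag value into its non-empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resume_report(previous, current):
    """Return (changed, added) for two dicts of flag name -> value."""
    changed = [f for f in MUST_MATCH if previous.get(f) != current.get(f)]
    added = {}
    for flag in MAY_GROW:
        old = split_list(previous.get(flag, ""))
        new = [x for x in split_list(current.get(flag, "")) if x not in old]
        if new:
            added[flag] = new
    return changed, added


previous = {"input": "a.fnn", "database": "db1.dmnd", "ontology": "0"}
current = {"input": "a.fnn", "database": "db1.dmnd,db2.dmnd", "ontology": "1"}
changed, added = resume_report(previous, current)
print("changed:", changed)
print("added:", added)
```

Flags in the first result would cause the affected stages to be run again, while entries in the second would only trigger searches against the new databases.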