Configuration¶

After installation is complete, EnTAP must be configured for use. This stage downloads and configures the databases needed for full functionality. Configuration can be run at any time by a user or the system admin to update databases. It is covered in the installation section because it can be easier to set up common databases beforehand so that multiple users can share them.

I’ll break this up into two sections, Ini File and Usage. The Ini File section describes how to ensure EnTAP is reading from the correct paths, which can be easily changed in the entap_config.ini (more on that later!). It also goes over the directories included in the installation. The Usage sections go over basic usage during the Configuration stage of EnTAP and how to set up reference databases.

Ini File¶

From here on out, the “execution”, or “EnTAP”, directory will refer to the directory containing the EnTAP install (or binary file). Typically, this will just be at the root directory that was downloaded from the repository. All paths mentioned in this documentation will be relative to this directory.

Why is this important? EnTAP relies on several accompanying software packages and databases in order to run properly. Correct recognition of these paths is crucial and, as such, needs an entire section! The entap_config.ini is the answer to this pathing issue. It contains all of the paths required for EnTAP to run (among many other commands) and can be configured as you see fit.

When a user executes EnTAP, they must specify the path to this ini file with the --ini flag. By default, the ini file comes with some preset paths based on the installation directory. However, these should be checked for validity. If the ini file is not specified and there is not one in the working directory, a new entap_config.ini will be generated with the following presets for execution paths. The ini file contains many other commands, but only the execution paths are required for configuration (specifically DIAMOND), so I will get to the others later on.

diamond-exe=/EnTAP/libs/diamond-2.1.8/bin/diamond

rsem-sam-validator=/EnTAP/libs/RSEM-1.3.3/rsem-sam-validator

rsem-calculate-expression=/EnTAP/libs/RSEM-1.3.3/rsem-calculate-expression

rsem-prepare-reference=/EnTAP/libs/RSEM-1.3.3/rsem-prepare-reference

convert-sam-for-rsem=/EnTAP/libs/RSEM-1.3.3/convert-sam-for-rsem

genemarkst-exe=/EnTAP/libs/gmst_linux_64/gmst.pl

transdecoder-long-exe=/EnTAP/libs/TransDecoder-v5.7.1/TransDecoder.LongOrfs

transdecoder-predict-exe=/EnTAP/libs/TransDecoder-v5.7.1/TransDecoder.Predict

interproscan-exe=interproscan.sh

If something is globally installed, such as “interproscan-exe” above, put how you’d normally run the software after the ‘=’. As an example, running DIAMOND through a global installation may simply be “diamond”. The Ini File line for DIAMOND will simply read:

diamond-exe=diamond

Warning

Be sure to at least set the DIAMOND path before moving on
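One way to check this before moving on is a small shell helper that reads a key from the ini file and tests whether its value can be executed. This is only a sketch: the function names ini_get and ini_exe_ok are made up for illustration, and it assumes the ini value is either a command on your PATH or a path that resolves from the directory where you run the check.

```shell
# Print the value of a key from an EnTAP-style ini file
# (plain key=value lines, '#' comment lines). Prints nothing if absent.
ini_get() {
    # $1 = ini file, $2 = key
    sed -n "s/^$2=//p" "$1" | head -n 1
}

# Succeed if the value stored under a key can be executed.
# $1 = ini file, $2 = key (defaults to diamond-exe)
ini_exe_ok() (
    exe=$(ini_get "$1" "${2:-diamond-exe}")
    [ -n "$exe" ] || exit 1
    case "$exe" in
        # A value containing '/' is treated as a path to an executable file.
        */*) [ -f "$exe" ] && [ -x "$exe" ] ;;
        # Anything else is looked up on PATH, like a global install.
        *)   command -v "$exe" >/dev/null 2>&1 ;;
    esac
)

# Example:
#   ini_exe_ok path/to/entap_config.ini && echo "DIAMOND path looks usable"
```

The same helper can be pointed at any other execution key, for example ini_exe_ok path/to/entap_config.ini interproscan-exe.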

Below is a sample ini file with all of the defaults that should be included in the EnTAP repository.

#-------------------------------
# [ini_instructions]
#When using this ini file keep the following in mind:
#	1. Do not edit the input keys to the left side of the '=' sign
#	2. Be sure to use the proper value type (either a string, list, or number)
#	3. Do not add unnecessary spaces to your input
#	4. When inputting a list, only add a ',' between each entry
#-------------------------------
# [configuration]
#-------------------------------
#Specify which EnTAP database you would like to download/generate or use throughout execution. Only one is required.
#    0. Serialized Database (default)
#    1. SQLITE Database
#It is advised to use the default Serialized Database as this is fastest.
#type:list (integer)
data-type=0,
#-------------------------------
# [entap]
#-------------------------------
#Path to the EnTAP binary database
#type:string
entap-db-bin=/bin/entap_database.bin
#Path to the EnTAP SQL database (not needed if you are using the binary database)
#type:string
entap-db-sql=/databases/entap_database.db
#Path to the EnTAP graphing script (entap_graphing.py)
#type:string
entap-graph=/src/entap_graphing.py
#-------------------------------
# [entap-api]
#-------------------------------
#-------------------------------
# [expression_analysis]
#-------------------------------
#Specify the FPKM threshold with expression analysis. EnTAP will filter out transcripts below this value. (default: 0.5)
#type:decimal
fpkm=0.5
#Specify this flag if your BAM/SAM file was generated through single-end reads
#Note: this is only required in expression analysis
#Default: paired-end
#type:boolean (true/false)
single-end=false
#-------------------------------
# [expression_analysis-rsem]
#-------------------------------
#Execution method of RSEM Calculate Expression.
#Example: rsem-calculate-expression
#type:string
rsem-calculate-expression=/libs/RSEM-1.3.3//rsem-calculate-expression
#Execution method of RSEM SAM Validate.
#Example: rsem-sam-validator
#type:string
rsem-sam-validator=/libs/RSEM-1.3.3//rsem-sam-validator
#Execution method of RSEM Prep Reference.
#Example: rsem-prepare-reference
#type:string
rsem-prepare-reference=/libs/RSEM-1.3.3//rsem-prepare-reference
#Execution method of RSEM Convert SAM
#Example: convert-sam-for-rsem
#type:string
convert-sam-for-rsem=/libs/RSEM-1.3.3//convert-sam-for-rsem
#-------------------------------
# [frame_selection]
#-------------------------------
#Select this option if all of your sequences are complete proteins.
#At this point, this option will merely flag the sequences in your output file
#type:boolean (true/false)
complete=false
#Specify the Frame Selection software you would like to use. Only one flag can be specified.
#Specify flags as follows:
#    1. GeneMarkS-T
#    2. Transdecoder (default)
#type:integer
frame-selection=2
#-------------------------------
# [frame_selection-genemarks-t]
#-------------------------------
#Method to execute GeneMarkST. This may be the path to the executable.
#type:string
genemarkst-exe=gmst.pl
#-------------------------------
# [frame_selection-transdecoder]
#-------------------------------
#Method to execute TransDecoder.LongOrfs. This may be the path to the executable or simply TransDecoder.LongOrfs
#type:string
transdecoder-long-exe=TransDecoder.LongOrfs
#Method to execute TransDecoder.Predict. This may be the path to the executable or simply TransDecoder.Predict
#type:string
transdecoder-predict-exe=TransDecoder.Predict
#Transdecoder only. Specify the minimum protein length
#type:integer
transdecoder-m=100
#Specify this flag if you would like to pipe the TransDecoder command '--no_refine_starts' when it is executed. Default: False
#By default, TransDecoder's start refinement identifies potential start codons for 5' partial ORFs using a PWM; setting this flag turns that step off.
#type:boolean (true/false)
transdecoder-no-refine-starts=false
#-------------------------------
# [general]
#-------------------------------
#Specify the output format for the processed alignments. EnTAP will generally try to output these unless the data is unavailable. Multiple flags can be specified:
#    1. TSV Format (default)
#    2. CSV Format
#    3. FASTA Amino Acid (default)
#    4. FASTA Nucleotide (default)
#    5. Gene Enrichment Sequence ID vs. Effective Length TSV
#    6. Gene Enrichment Sequence ID vs. GO Term TSV
#    7. Gene Ontology Terms TSV (default)
#type:list (integer)
output-format=1,3,4,7,
#-------------------------------
# [ontology]
#-------------------------------
# Specify the ontology software you would like to use
#Note: it is possible to specify more than one! Just use multiple --ontology flags
#Specify flags as follows:
#    0. EggNOG (default)
#    1. InterProScan
#type:list (integer)
ontology=0,
#-------------------------------
# [ontology-eggnog]
#-------------------------------
#Path to the EggNOG SQL database that was downloaded during the Configuration stage.
#type:string
eggnog-sql=/databases/eggnog.db
#Path to EggNOG DIAMOND configured database that was generated during the Configuration stage.
#type:string
eggnog-dmnd=/bin/eggnog_proteins.dmnd
#-------------------------------
# [ontology-interproscan]
#-------------------------------
#Execution method of InterProScan. This is how InterProScan is generally run on your system. It could be as simple as 'interproscan.sh' if it is globally installed.
#type:string
interproscan-exe=interproscan.sh
#Select which databases you would like for InterProScan. Databases must be one of the following:
#    -tigrfam
#    -sfld
#    -prodom
#    -hamap
#    -pfam
#    -smart
#    -cdd
#    -prositeprofiles
#    -prositepatterns
#    -superfamily
#    -prints
#    -panther
#    -gene3d
#    -pirsf
#    -coils
#    -morbidblite
#Make sure the database is downloaded, EnTAP will not check!
#--protein tigrfam --protein pfam
#type:list (string)
protein=
#-------------------------------
# [similarity_search]
#-------------------------------
#Method to execute DIAMOND. This can be a path to the executable or simply 'diamond' if installed globally.
#type:string
diamond-exe=/libs/diamond-v2.1.8/bin/diamond
#Specify the type of species/taxon you are analyzing and would like alignments closer in taxonomic relevance to be favored (based on NCBI Taxonomic Database)
#Note: replace all spaces with underscores '_'
#type:string
taxon=
#Select the minimum query coverage to be allowed during similarity searching
#type:decimal
qcoverage=50
#Select the minimum target coverage to be allowed during similarity searching
#type:decimal
tcoverage=50
#Specify the contaminants you would like to flag for similarity searching. Contaminants can be selected by species or through a specific taxon (insecta) from the NCBI Taxonomy Database. If your taxon is more than one word just replace the spaces with underscores (_).
#Note: since hits are based upon a multitude of factors, a contaminant might end up being the best hit for an alignment. In this scenario, EnTAP will flag the contaminant and it can be removed if you would like.
#type:list (string)
contam=
#Specify the E-Value that will be used as a cutoff during similarity searching.
#type:decimal
e-value=1e-05
#List of keywords that should be used to specify uninformativeness of hits during similarity searching. Generally something along the lines of 'hypothetical' or 'unknown' is used. Each term should be separated by a comma (,). This can be used if you would like to tag certain descriptions or would like to weigh certain alignments differently (see full documentation)
#Example (defaults):
#conserved, predicted, unknown, hypothetical, putative, unidentified, uncultured, uninformative, unnamed
#type:list (string)
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,

Preparing Your Reference Databases¶

All source databases must be provided in FASTA format (protein) so that they can be indexed for use by DIAMOND. Indexing can be done independently of EnTAP with DIAMOND (the makedb command) or as part of the Configuration phase of EnTAP. This section focuses on downloading and preparing some of the more common FASTA source databases. If you already have DIAMOND databases configured, you can skip to Running Configuration. Even if you have a DIAMOND database already configured, Configuration must still be run!
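For the independent route, the sketch below wraps DIAMOND's makedb command so the database is written next to the FASTA file. The function name make_dmnd_db and the DIAMOND_EXE variable are hypothetical conveniences for this example; set DIAMOND_EXE to the same value you use for diamond-exe in the ini file.

```shell
# Index a protein FASTA for DIAMOND, writing <name>.dmnd beside the input.
# DIAMOND_EXE is an optional override for this sketch; it defaults to 'diamond'.
make_dmnd_db() (
    fasta=$1
    # Drop the last extension (e.g. .fasta) to form the database name.
    db=${fasta%.*}
    "${DIAMOND_EXE:-diamond}" makedb --in "$fasta" --db "$db"
)

# Example:
#   make_dmnd_db path/to/uniprot_sprot.fasta
```

Indexing this way does not replace the Configuration stage; it only prepares the reference database ahead of time.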

While any protein FASTA database can be used, it is recommended to use NCBI (GenBank) sourced databases such as the RefSeq databases or NR. In addition, EnTAP can easily accept EBI databases such as UniProt/SwissProt.

EnTAP can recognize the species information from these header formats ONLY (NCBI and UniProt):

[homo sapiens]
OS=homo sapiens

If the individual FASTAs in a custom database you create do not adhere to one of these two formats, it will not be possible to weight taxonomic or contaminant status from them. You will need to change the headers to ensure they align.
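Before configuring a custom database, it can help to estimate how many headers carry species information in one of the two styles above. The following is a rough heuristic written for this page, not EnTAP's own parser: it counts any header containing a bracketed term or an OS= field, so bracketed text that is not a species name will also be counted. The function name count_species_headers is hypothetical.

```shell
# Report "<headers with species info>/<total headers>" for a protein FASTA.
# A header counts if it contains "[...]" (NCBI style) or "OS=..." (UniProt style).
count_species_headers() (
    fasta=$1
    # grep -c exits non-zero when the count is 0, so tolerate that.
    total=$(grep -c '^>' "$fasta" || true)
    tagged=$(grep '^>' "$fasta" | grep -c -E '\[[^]]+\]|OS=[^ ]' || true)
    echo "$tagged/$total"
)

# Example:
#   count_species_headers path/to/custom_database.fasta
```

If the first number is much smaller than the second, the headers likely need to be rewritten before taxonomic or contaminant weighting can work.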

The following FTP sites contain common reference databases that EnTAP can recognize:

RefSeq: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
Plant RefSeq: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/
Mammalian Vertebrate RefSeq: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_mammalian/
Other Vertebrate RefSeq: https://ftp.ncbi.nih.gov/refseq/release/vertebrate_other/
Invertebrate RefSeq: https://ftp.ncbi.nih.gov/refseq/release/invertebrate/
NR: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
SwissProt: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
- Reviewed
- It is highly recommended to use the UniProt SwissProt database as EnTAP will map all UniProt alignments to additional database cross-references
TrEMBL: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
- Unreviewed

Both UniProt databases (SwissProt and TrEMBL) can be downloaded on a Unix system with wget. For the SwissProt database:

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Or, for the TrEMBL database:

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz

The NCBI databases, by contrast, must be downloaded as separate, smaller files and concatenated together. As an example, the following commands will download and combine the NR database files:

Download:

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz

Decompress/Concatenate:

for f in nr.*.tar.gz; do tar -xvzf "$f"; done

cat nr.* > nr_database.fasta
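The same download-then-concatenate pattern applies to the RefSeq release directories listed earlier, where the protein FASTA is split into numbered gzip parts. The helper below is a minimal sketch of the concatenation step only; the function name combine_fasta_parts and the file names in the example are assumptions, so check the actual file names in the directory you downloaded from.

```shell
# Decompress several gzipped FASTA parts into one FASTA file.
# $1 = output FASTA; remaining arguments = .gz parts, in the order given.
combine_fasta_parts() (
    out=$1
    shift
    # Refuse to run with no parts (gunzip would otherwise wait on stdin).
    [ "$#" -ge 1 ] || exit 1
    gunzip -c "$@" > "$out"
)

# Example (file names are hypothetical):
#   combine_fasta_parts plant_refseq.fasta plant.*.protein.faa.gz
```

Writing the output under a name that the input glob cannot match avoids feeding the combined file back into itself on a rerun.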

It is generally recommended that a user select at least three databases with varying levels of curation. Unless the species is very non-model (i.e. does not have close relatives in databases such as RefSeq), it is not necessary to use the full NR database, which is less curated. Once your FASTA databases are ready, move on to Running Configuration.

Running Configuration¶

Once you have your protein FASTA database ready, you can begin to run the Configuration stage. As mentioned before, Configuration will only need to be run once prior to Execution unless you would like to configure/update more databases.

To run Configuration with one or more FASTA databases and write the results to path/to/output (the default is the current working directory), use the following command. Each additional database is passed with its own -d flag, and the number of threads is set with the -t flag:

EnTAP --config -d path/to/database.fasta -d path/to/database2.fasta --out-dir path/to/output -t 8 --ini path/to/ini

If your databases are already indexed for DIAMOND, you can simply provide the paths in the .ini file and run the following command with 8 threads:

EnTAP --config -t 8 --ini path/to/ini
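Both commands above share the same shape and differ only in how many -d flags they carry. As an illustration, the sketch below assembles the command line from an ini path, a thread count, and any number of FASTA databases, using only the flags shown above. The function name entap_config_cmd is hypothetical, and the sketch assumes paths without spaces.

```shell
# Print the Configuration command for an ini file, a thread count, and
# zero or more FASTA databases (each database becomes its own -d flag).
entap_config_cmd() (
    ini=$1
    threads=$2
    shift 2
    cmd="EnTAP --config"
    for db in "$@"; do
        cmd="$cmd -d $db"
    done
    echo "$cmd -t $threads --ini $ini"
)

# Example:
#   entap_config_cmd path/to/ini 8 path/to/database.fasta path/to/database2.fasta
```

Printing the command first makes it easy to review the database list before starting a long download and indexing run.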

Note

This is the only stage that requires connection to the Internet.

In both cases, the following databases will be downloaded and configured:

EnTAP Binary Database:
- Comprised of Gene Ontology, UniProt, and Taxonomic mappings for use during Execution. FTP downloaded file.
- Downloaded from https://treegenesdb.org/FTP/EnTAP/latest/databases/entap_database.bin.gz
- Filename: entap_database.bin
- The SQL version is the same database, but formatted as a SQL database. Only one version of the database is needed (binary is used by default)
EggNOG DIAMOND Reference:
- Reference database containing EggNOG database entries
- FASTA file is downloaded and configured for DIAMOND from http://eggnog5.embl.de/download/eggnog_4.1/eggnog-mapper-data/eggnog4.clustered_proteins.fa.gz
- Filename: eggnog_proteins.dmnd
EggNOG SQL Database:
- SQL database containing EggNOG mappings
- Downloaded from http://eggnog5.embl.de/download/eggnog_4.1/eggnog-mapper-data/eggnog.db.gz
- Filename: eggnog.db

Note

Either the EnTAP binary database (default) or the EnTAP SQL database is required for execution. Both are not needed.

The EnTAP database is downloaded from the FTP addresses below. By default, the binary version is downloaded and used; only one version is required. If you have trouble downloading, you can specify the --data-generate flag during Configuration to build the database locally (more on that later). The database for the newest version of EnTAP always resides in the “latest” FTP directory. If you are using an older version of EnTAP, do not download from the “latest” directory; instead, pick the directory that matches the version you are using. The FTP is updated only when a new database version is created, so choose the most recent directory that is not newer than your EnTAP version. For example, if you see v0.8.2 and v0.8.5 on the FTP while you are using v0.8.3, download the database located in the v0.8.2 directory.

https://treegenesdb.org/FTP/EnTAP/latest/databases/entap_database.bin.gz

https://treegenesdb.org/FTP/EnTAP/latest/databases/entap_database.db.gz
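The version rule described above (take the newest database directory that is not newer than your EnTAP version) can be sketched in shell. This is an illustration only: the function name pick_db_version is made up, the directory names must be supplied by you after looking at the FTP listing, and it relies on sort -V (version sort), which is an extension available in GNU and recent BSD sort.

```shell
# Read database version names (one per line) on stdin and print the newest
# one that is not newer than the EnTAP version given as $1.
pick_db_version() (
    mine=$1
    best=""
    # Walk the candidates from oldest to newest.
    for v in $(sort -V); do
        # If $v sorts first (or ties) against $mine, then v <= mine.
        first=$(printf '%s\n%s\n' "$v" "$mine" | sort -V | head -n 1)
        [ "$first" = "$v" ] && best=$v
    done
    echo "$best"
)

# Example, matching the case described above:
#   printf 'v0.8.2\nv0.8.5\n' | pick_db_version v0.8.3
```

An empty result means every listed database is newer than the EnTAP version in use.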

Warning

DIAMOND databases must be configured and eventually executed with the same version of DIAMOND.
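One simple way to guard against a version mismatch is to record the DIAMOND version when a database is configured and compare it again before Execution. The sketch below assumes that running DIAMOND with the version command prints a line ending in the version number; the function names, the DIAMOND_EXE variable, and the .version sidecar file are conventions invented for this example and are not part of EnTAP or DIAMOND.

```shell
# Print the version number reported by DIAMOND (last field of its version line).
# DIAMOND_EXE is an optional override for this sketch; it defaults to 'diamond'.
diamond_version() {
    "${DIAMOND_EXE:-diamond}" version | awk '{print $NF; exit}'
}

# Save the current DIAMOND version beside a database.
# $1 = path to the .dmnd file
record_diamond_version() {
    diamond_version > "$1.version"
}

# Succeed only if the recorded version matches the DIAMOND in use now.
same_diamond_version() {
    [ "$(cat "$1.version" 2>/dev/null)" = "$(diamond_version)" ]
}

# Example:
#   record_diamond_version bin/eggnog_proteins.dmnd     # after Configuration
#   same_diamond_version bin/eggnog_proteins.dmnd || echo "DIAMOND version changed"
```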