EnTAP Ini Files

Before starting to Configure EnTAP, it’s important to understand the ini files EnTAP uses and to setup the proper paths. EnTAP relies on several accompanying software packages and databases in order to run properly. This requires an easy way to specify parameters and paths rather than overloading the command line with a bunch of arguments. To resolve this, separate ini files are mandatory to be specified when running EnTAP as of version 1.1.0:

  1. entap_config.ini

  2. entap_run.params

entap_config.ini File

The entap_config.ini contains database paths and execution paths for accompanying software. The intention is that this can be setup once and distributed to all users in a shared cluster environment, or all users of the EnTAP installation. It should likely not change between separate runs of EnTAP, unlike the entap_run.params.

Many of the paths for accompanying software can simply use the globally installed command, such as diamond. Or they full path to the executable. Below is the default entap_config.ini located in the EnTAP repository with it’s defaults:

#-------------------------------
# [ini_instructions]
#When using this ini file keep the following in mind:
#	1. Do not edit the input keys to the left side of the '=' sign
#	2. Be sure to use the proper value type (either a string, list, or number)
#	3. Do not add unecessary spaces to your input
#	4. When inputting a list, only add a ',' between each entry
#-------------------------------
# [configuration]
#-------------------------------
#Specify whether you would like to generate EnTAP databases locally instead of downloading them. By default, EnTAP will download the databases. This may be used if you are experiencing  errors with the default process.
#type:boolean (true/false)
data-generate=false
#Specify which EnTAP database you would like to download/generate or use throughout execution. Only one is required.
#    0. Serialized Database (default)
#    1. SQLITE Database
#It is advised to use the default Serialized Database as this is fastest.
#type:list (integer)
data-type=0,
#-------------------------------
# [entap]
#-------------------------------
#Path to the EnTAP binary database
#type:string
entap-db-bin=/bin/entap_database.bin
#Path to the EnTAP SQL database (not needed if you are using the binary database)
#type:string
entap-db-sql=/databases/entap_database.db
#Path to the EnTAP graphing script (entap_graphing.py)
#type:string
entap-graph=/src/entap_graphing.py
#-------------------------------
# [expression_analysis-rsem]
#-------------------------------
#Execution method of RSEM Calculate Expression.
#Example: rsem-calculate-expression
#type:string
rsem-calculate-expression=/libs/RSEM-1.3.3/rsem-calculate-expression
#Execution method of RSEM SAM Validate.
#Example: rsem-sam-validator
#type:string
rsem-sam-validator=/libs/RSEM-1.3.3/rsem-sam-validator
#Execution method of RSEM Prep Reference.
#Example: rsem-prepare-reference
#type:string
rsem-prepare-reference=/libs/RSEM-1.3.3/rsem-prepare-reference
#Execution method of RSEM Convert SAM
#Example: convert-sam-for-rsem
#type:string
convert-sam-for-rsem=/libs/RSEM-1.3.3/convert-sam-for-rsem
#-------------------------------
# [frame_selection-transdecoder]
#-------------------------------
#Method to execute TransDecoder.LongOrfs. This may be the path to the executable or simply TransDecoder.LongOrfs
#type:string
transdecoder-long-exe=TransDecoder.LongOrfs
#Method to execute TransDecoder.Predict. This may be the path to the executable or simply TransDecoder.Predict
#type:string
transdecoder-predict-exe=TransDecoder.Predict
#-------------------------------
# [similarity_search]
#-------------------------------
#Method to execute DIAMOND. This can be a path to the executable or simply 'diamond' if installed globally.
#type:string
diamond-exe=/libs/diamond-v2.1.8/bin/diamond
#-------------------------------
# [ontology-eggnog-mapper]
#-------------------------------
#Path to EggNOG-mapper executable. Likely just 'emapper.py'
#type:string
eggnog-map-exe=emapper.py
#Path to directory containing the EggNOG SQL database(s).
#type:string
eggnog-map-data=/databases
#Path to EggNOG DIAMOND configured database that was generated during the Configuration stage
#type:string
eggnog-map-dmnd=/bin/eggnog_proteins.dmnd
#-------------------------------
# [ontology-interproscan]
#-------------------------------
#Execution method of InterProScan. This is how InterProScan is generally ran on your system.  It could be as simple as 'interproscan.sh' depending on if it is globally installed.
#type:string
interproscan-exe=interproscan.sh

entap_run.params File

The entap_run.params contains paths and parameters specific to each EnTAP run. The intention is that this parameter file will change between separate runs of EnTAP, either utilizing different databases or different parameters/cutoffs. Unlike the entap_config.ini this will be updated often by the user!

EnTAP gives a lot of power to the user to customize their run. However, defaults are provided for all, non-mandatory, parameters. Below is an example of the entap_run.params:

#-------------------------------
# [ini_instructions]
#When using this ini file keep the following in mind:
#	1. Do not edit the input keys to the left side of the '=' sign
#	2. Be sure to use the proper value type (either a string, list, or number)
#	3. Do not add unnecessary spaces to your input
#	4. When inputting a list, only add a ',' between each entry
#-------------------------------
# [general]
#-------------------------------
#Specify the output directory you would like the data produced by EnTAP to be saved to.
#type:string
out-dir=entap_outfiles
#Select this option if you would like to overwrite files from a previous execution of EnTAP. This will supercede the 'resume' flag which enables you to continue an annotation from where you left off before. Refer to the documentation for more information.
#type:boolean (true/false)
overwrite=false
#Select this option if you would like EnTAP to continue execution if existing files are found from a previous run. If false is selected, EnTAP will stop execution once it finds files from a previous run and print a warning. Note, the 'overwrite' option will supercede this command if 'overwrite' is being used (set to TRUE).
#type:boolean (true/false)
resume=false
#Path to the input transcriptome file
#type:string
input=
#Provide the paths to the databases you would like to use for either 'run' or 'configuration'.
#For running/execution:
#    - Ensure the databases selected are in a DIAMOND configured format with an extension of .dmnd
#For configuration:
#    - Ensure the databases are in a typical FASTA format
#    - Alternatively, you can provide the type of protein database you'd like EnTAP to download and configure for you. Simply add any of the following to download/configure for DIAMOND the appropriate database: refseq_archaea, refseq_bacteria, refseq_complete, refseq_fungi, refseq_invertebrate, refseq_other, refseq_plant, refseq_protozoa, refseq_vertebrate_mammalian, refseq_vertebrate_other, ncbi_nr, uniprot_sprot, uniprot_trembl
#Note: if your databases do not have the typical NCBI or UniProt header format, taxonomic  information and filtering may not be utilized. Refer to the documentation to see how to properly format any data.
#type:list (string)
database=
#By default, EnTAP will trim the sequence ID to the nearest space to help with compatibility across software. This command will instead remove the spaces in a sequence ID rather than trimming.
#'>TRINITY_231.1 Protein Information' will become...
#'>TRINITY_231.1ProteinInformation' 
#type:boolean (true/false)
no-trim=false
#Specify the number of threads that will be used throughout EnTAP execution
#type:integer
threads=1
#Specify the output format for the processed alignments. EnTAP will generally try to output these unless the data is unavailable. Multiple flags can be specified:
#    1. TSV Format (default)
#    2. CSV Format
#    3. FASTA Amino Acid (default)
#    4. FASTA Nucleotide (default)
#    5. Gene Enrichment Sequence ID vs. Effective Length TSV
#    6. Gene Enrichment Sequence ID vs. GO Term TSV
#    7. Gene Ontology Terms TSV (default)
#type:list (integer)
output-format=1,3,4,7,
#-------------------------------
# [expression_analysis]
#-------------------------------
#Specify the FPKM threshold with expression analysis. EnTAP will filter out transcripts below this value. (default: 0.5)
#type:decimal
fpkm=0.5
#Specify the path to the BAM/SAM file for expression analysis
#type:string
align=
#Specify this flag if your BAM/SAM file was generated through single-end reads
#Note: this is only required in expression analysis
#Default: paired-end
#type:boolean (true/false)
single-end=false
#-------------------------------
# [frame_selection]
#-------------------------------
#Specify if you would like to perform frame selection/gene prediction on your input nucleotide transcriptome. This flag will be ignored if you input protein sequences. If this is set to false with nucleotide input, EnTAP will perform blastx functionality where appropriate.
#type:boolean (true/false)
frame-selection=true
#-------------------------------
# [frame_selection-transdecoder]
#-------------------------------
#Transdecoder only. Specify the minimum protein length
#type:integer
transdecoder-m=100
#Specify this flag if you would like to pipe the TransDecoder command '--no_refine_starts' when it is executed. Default: False
#This will 'start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.' 
#type:boolean (true/false)
transdecoder-no-refine-starts=false
#-------------------------------
# [similarity_search]
#-------------------------------
#Specify the type of species/taxon you are analyzing and would like alignments closer in taxonomic relevance to be favored (based on NCBI Taxonomic Database). This field is also used for determination of horizontally transferred genes.
#Note: replace all spaces with underscores '_'
#type:string
taxon=
#Select the minimum query coverage to be allowed during similarity searching
#type:decimal
qcoverage=50
#Select the minimum target coverage to be allowed during similarity searching
#type:decimal
tcoverage=50
#Specify the contaminants you would like to flag for similarity searching. Contaminants can be selected by species or through a specific taxon (insecta) from the NCBI Taxonomy Database. If your taxon is more than one word just replace the spaces with underscores (_).
#Note: since hits are based upon a multitude of factors, a contaminant might end up being the best hit for an alignment. In this scenario, EnTAP will flag the contaminant and it can be removed if you would like.
#type:list (string)
contam=
#Specify the E-Value that will be used as a cutoff during similarity searching.
#type:decimal
e-value=1e-05
#List of keywords that should be used to specify uninformativeness of hits during similarity searching. Generally something along the lines of 'hypothetical' or 'unknown' are used. Each term should be separated by a comma (,) This can be used if you would like to tag certain descriptions or would like to weigh certain alignments differently (see full documentation)
#Example (defaults):
#conserved, predicted, unknown, hypothetical, putative, unidentified, uncultured, uninformative, unnamed
#type:list (string)
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
#-------------------------------
# [similarity_search-diamond]
#-------------------------------
#Specify the DIAMOND sensitivity used against input DIAMOND databases (Similarity Searching and HGT Analysis). Sensitivities are based off of DIAMOND documentation with a higher sensitivity generally taking longer but giving a higher alignment rate. Sensitivity options are fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, ultra-sensitive.
#type:string
diamond-sensitivity=very-sensitive
#-------------------------------
# [ontology]
#-------------------------------
# Specify the ontology source databases you would like to use
#Note: it is possible to specify more than one! Just use multiple --ontology_source flags
#Specify flags as follows:
#    0. EggNOG (default)
#    1. InterProScan
#type:list (integer)
ontology_source=0,
#-------------------------------
# [ontology-eggnog-mapper]
#-------------------------------
#Specify this to turn on/off EggNOG contaminant analysis. This leverages the taxon input from the contaminant Similarity Search command to determine if an EggNOG annotation should be flagged as a contaminant. EggNOG contaminant analysis can only be performed alongside Similarity Search contaminant analysis (not on its own) and will only be utilized if no alignments were found for a given transcript during Similarity Searching
#type:boolean (true/false)
eggnog-contaminant=true
#Specify this to use the '--dbmem' flag with EggNOG-mapper. This will load the entire eggnog.db sqlite3 database into memory which can require up to ~44GB of memory. However, this will significantly speed up EggNOG annotations.
#type:boolean (true/false)
eggnog-dbmem=true
#Specify the DIAMOND sensitivity used during EggNOG mapper execution against the EggNOG database. Sensitivities are based off of DIAMOND documentation with a higher sensitivity generally taking longer but giving a higher alignment rate. Sensitivity options are fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, ultra-sensitive.
#type:string
eggnog-sensitivity=more-sensitive
#-------------------------------
# [ontology-interproscan]
#-------------------------------
#Select which databases you would like for InterProScan. Databases must be one of the following:
#    -tigrfam
#    -sfld
#    -prodom
#    -hamap
#    -pfam
#    -smart
#    -cdd
#    -prositeprofiles
#    -prositepatterns
#    -superfamily
#    -prints
#    -panther
#    -gene3d
#    -pirsf
#    -pirsr
#    -coils
#    -antifam
#    -mobidblite
#Make sure the database is downloaded, EnTAP will not check!
#--interproscan-db tigrfam --interproscan-db pfam
#type:list (string)
interproscan-db=
#-------------------------------
# [horizontal-gene-transfer]
#-------------------------------
#Specify the DIAMOND configured (.dmnd extension) donor databases for Horizontal Gene Transfer 
#analysis. Separate databases with a comma (',')
#type:list (string)
hgt-donor=
#Specify the DIAMOND configured (.dmnd extension) recipient databases for Horizontal Gene Transfer 
#analysis. Separate databases with a comma (',')
#type:list (string)
hgt-recipient=
#Specify the path to the GFF file associated with your dataset. Ensure that all headers match those in your 
#input transcript file.
#type:string
hgt-gff=
#-------------------------------
# [ncbi-api]
#-------------------------------
#Enter your personal NCBI API key, if available. This can be assigned to you through your NCBI account. Although not required, enabling EnTAP to use your API key will allow for much quicker accessions to the NCBI database. Your API key will only be used to access the NCBI database and only stored locally. If you are a contributor to EnTAP, DO NOT accidentally commit your API key to git!
#type:string
ncbi-api-key=
#Allow EnTAP to access the NCBI database API to pull additional information for your data during Similarity Searching if searching against a NCBI database.
#type:boolean (true/false)
ncbi-api-enable=true

Warning

Be sure to at least set the DIAMOND and database flags before moving on as this is used during Configuration if you would like to configure databases for DIAMOND