Configuration

Once Installation is complete, EnTAP must be configured for use. This stage downloads and configures the databases needed for full functionality and only needs to be run once, unless you later want to update or add DIAMOND databases for similarity searching. It is recommended to run Configuration once after installation; users of EnTAP can then run it again if more databases are required. Before starting Configuration, review both EnTAP ini files and make sure that at least the DIAMOND and database flags are set up properly.

This stage is covered in two sections. If you would like to configure databases for searching with DIAMOND, refer to Preparing Your Reference DIAMOND Databases. If you already have DIAMOND databases configured, simply move on to Running Configuration.

Preparing Your Reference DIAMOND Databases

All source databases must be provided in FASTA format (protein) so that they can be indexed for use by DIAMOND. Indexing can be done independently of EnTAP with DIAMOND (makedb command) or as part of the Configuration phase of EnTAP. This section focuses on downloading and preparing some of the more common FASTA source databases. If you already have DIAMOND databases configured, you can skip to Running Configuration. Even if you already have a DIAMOND database configured, Configuration must still be run to download the other databases!
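For the standalone route, the indexing step can be sketched as a small shell helper around DIAMOND's makedb command. The function name build_diamond_db and the file names are placeholders chosen for illustration; they are not part of EnTAP or DIAMOND.

```shell
# Index a protein FASTA file for DIAMOND, independent of EnTAP.
# build_diamond_db is a hypothetical helper name; only "diamond makedb" is DIAMOND's own.
#   $1 = protein FASTA file
#   $2 = output database name (DIAMOND appends the .dmnd extension)
build_diamond_db() {
    diamond makedb --in "$1" -d "$2"
}
```

For example, `build_diamond_db uniprot_sprot.fasta uniprot_sprot` would be expected to write uniprot_sprot.dmnd in the current directory.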

While any protein FASTA database can be used, it is highly recommended to use UniProt Swiss-Prot to allow EnTAP to pull additional information. In addition, an NCBI (GenBank) sourced database, such as one of the RefSeq databases or NR, is recommended.

EnTAP can recognize the species information from these header formats ONLY (NCBI and UniProt):

[homo sapiens]
OS=homo sapiens

If the individual FASTA entries in a custom database you create do not adhere to one of these two formats, EnTAP will not be able to use them to weight hits by taxonomy or to determine contaminant status. You will need to change the headers so that they match one of the formats above.
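Before configuring a custom database, it can help to check how many headers lack a recognizable species tag. The following is a minimal sketch, assuming a plain (uncompressed) FASTA file and GNU or BSD grep; the function name count_unrecognized_headers is hypothetical, and the patterns only approximate the two formats shown above rather than reproduce EnTAP's own parser.

```shell
# Count FASTA headers that carry neither "[species]" nor "OS=species".
# count_unrecognized_headers is a hypothetical helper, not an EnTAP command.
#   $1 = uncompressed protein FASTA file
count_unrecognized_headers() {
    # First grep keeps header lines only; second grep counts (-c) the
    # headers that do NOT match (-v) either species pattern.
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function usable in scripts that run under "set -e".
    grep '^>' "$1" | grep -Evc '\[[^]]+]|OS=.' || true
}
```

A result of 0 suggests every header matches one of the two formats; any other number is the count of headers that would need to be edited.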

EnTAP can download and configure commonly used databases for you, or you are welcome to generate a FASTA database yourself. The commonly used databases are listed below, along with the values that can be passed to the --database flag:

**Reference Databases**
flag	database	ftp address
refseq_archaea	NCBI Refseq Archaea	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/archaea
refseq_bacteria	NCBI Refseq Bacteria	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria
refseq_complete	NCBI Refseq Complete	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete
refseq_fungi	NCBI Refseq Fungi	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi
refseq_invertebrarte	NCBI Refseq Invertebrate	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate
refseq_other	NCBI Refseq Other	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/other
refseq_plant	NCBI Refseq Plant	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant
refseq_protozoa	NCBI Refseq Protozoa	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/protozoa
refseq_vertebrate_mammalian	NCBI Refseq Vertebrate Mammalian	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_mammalian
refseq_vertebrate_other	NCBI Refseq Vertebrate Other	ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_other
ncbi_nr	NCBI NR	ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
uniprot_sprot	UniProt Swiss-Prot	ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
uniprot_trembl	UniProt Trembl	ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz

If you would like EnTAP to configure any of the above databases for you, simply pass the values from the flag column above to the --database flag.

It is generally recommended to select at least three databases with varying levels of curation. Unless the species is very non-model (i.e. it does not have close relatives in databases such as RefSeq), it is not necessary to use the full NR database, which is less curated. Once your FASTA databases are ready, move on to Running Configuration.

Running Configuration

As mentioned before, Configuration only needs to be run once to download the files EnTAP requires, unless you would like to configure or update more databases.

Before running Configuration, please review the relevant Configuration Flags to be sure it goes smoothly and that duplicate databases are not downloaded.

In the entap_run.params file we’re mostly concerned about the --database flag. To add more FASTA databases to configure, simply add them as follows:

database=path/to/database.fasta, ncbi_nr

In the entap_config.ini file, you should check the following flags. If you have already downloaded these files, make sure to point to them here; otherwise they will be downloaded again!

entap-db-bin=path/to/entap_database.bin
eggnog-map-data=/path/to/eggnog/directory
eggnog-map-dmnd=path/to/eggnog_proteins.dmnd
diamond-exe=path/to/diamond (only needed if you are configuring DIAMOND databases)

With the above in mind, the following command is typical with Configuration.

EnTAP --config --run-ini path/to/entap_run.params --entap-ini path/to/entap_config.ini -t thread_number

Warning

DIAMOND databases are not always cross-compatible between different versions of DIAMOND. To avoid problems, configure your databases and later run Execution with the same version of DIAMOND.

In both cases, the following databases will be downloaded and configured:

EnTAP Database:
- Comprised of Gene Ontology, UniProt, PFAM, and Taxonomic mappings for use during Execution. FTP downloaded file.
- Downloaded from https://treegenesdb.org/FTP/EnTAP/latest/databases/entap_database.bin.gz
- Filename: entap_database.bin
- The SQL version is the same database, but formatted as a SQL database. Only one version of the database is needed (binary is used by default and SQL is much slower but uses less memory)
- If you experience any trouble in downloading, you can simply specify the --data-generate flag during Configuration to configure it locally (more on that later)
- The database for the newest version of EnTAP always resides in the “latest” FTP directory. If you are using an older version of EnTAP, do not download from the “latest” directory; use the directory that corresponds to your version instead. The FTP is only updated when a new database version is created, so not every EnTAP version has its own directory: pick the newest directory that is not newer than your EnTAP version. For example, if you see v0.8.2 and v0.8.5 on the FTP while you are using v0.8.3, download the database located in the v0.8.2 directory.
- If using the binary database, the following lines MUST appear as the first 3 lines of the file in the proper format. This is just an example; your versions and date of creation may vary. Unless you are using an older version of the database or have edited the file, these lines should already be there.
  VERSION_MAJOR=3
  VERSION_MINOR=0
  DATE_CREATED=
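Two of the points above lend themselves to small shell helpers: choosing which FTP version directory to download from, and checking the header of the binary database. The sketch below assumes GNU sort (for the -V version ordering) and assumes the three header keys appear in the order shown above. The function names version_le, pick_db_version, and has_version_header are hypothetical and not part of EnTAP.

```shell
# True when version $1 sorts at or before version $2 (relies on GNU "sort -V").
version_le() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print the newest database directory that is not newer than your EnTAP version.
#   $1 = your EnTAP version, remaining arguments = version directories seen on the FTP
pick_db_version() {
    mine=$1
    shift
    best=""
    for candidate in "$@"; do
        if version_le "$candidate" "$mine"; then
            # Keep the candidate if it is the first match or newer than the current best.
            if [ -z "$best" ] || version_le "$best" "$candidate"; then
                best=$candidate
            fi
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Succeed only if the first three lines of the file start with the expected keys.
#   $1 = path to entap_database.bin
has_version_header() {
    # Read just the first three lines; the remainder of the file is binary.
    header=$(head -n 3 "$1")
    printf '%s\n' "$header" | sed -n '1p' | grep -q '^VERSION_MAJOR=[0-9]' &&
    printf '%s\n' "$header" | sed -n '2p' | grep -q '^VERSION_MINOR=[0-9]' &&
    printf '%s\n' "$header" | sed -n '3p' | grep -q '^DATE_CREATED='
}
```

With the example from the text, `pick_db_version v0.8.3 v0.8.2 v0.8.5` prints v0.8.2. The header check can be used as `has_version_header path/to/entap_database.bin && echo "header looks fine"`.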
EggNOG DIAMOND Reference:
- Reference database containing EggNOG database entries
- DIAMOND formatted database is downloaded from http://eggnog6.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
- Filename: eggnog_proteins.dmnd
EggNOG SQL Database:
- SQL database containing EggNOG mappings
- Downloaded from http://eggnog6.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
- Filename: eggnog.db
- Note: when referencing this file in entap_config.ini, you must give the eggnog-map-data flag the directory that contains this file, rather than the path to the file itself
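The directory-versus-file distinction for eggnog.db can be illustrated with a short shell sketch. The path below is a placeholder, and the printed line simply mirrors the eggnog-map-data entry shown earlier for entap_config.ini.

```shell
# Placeholder path to the downloaded EggNOG SQL database.
eggnog_db=/path/to/eggnog/directory/eggnog.db

# entap_config.ini expects the containing directory, not the file itself.
eggnog_dir=$(dirname "$eggnog_db")

# Print the line as it would appear in entap_config.ini.
printf 'eggnog-map-data=%s\n' "$eggnog_dir"
```

Here the printed value is /path/to/eggnog/directory, without the trailing eggnog.db.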

Note

Either the EnTAP binary database (default) or the EnTAP SQL database is required for execution. Both are not needed.