4. Gene Family / Ontology Analysis

This stage of EnTAP functionally annotates the filtered transcriptome (after Frame Selection and Expression Analysis have been performed), assigning information such as gene families, orthologous groups, protein domains, and Gene Ontology terms. EnTAP can use EggNOG (default and recommended), InterProScan, or both, selected with the ontology flag.

Set the ontology flag in the entap_run.params file to 0 (EggNOG) and/or 1 (InterProScan), given as a comma-separated list. To run EggNOG only (the default):

ontology=0

To include InterProScan, simply do the following:

ontology=0,1
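
To illustrate how such a comma-separated value maps to software packages, here is a minimal Python sketch. The function name parse_ontology_flag and the SOURCES table are hypothetical and are not part of EnTAP; only the meanings of 0 and 1 come from the text above.

```python
# Hypothetical helper, not part of EnTAP: interprets an ontology value
# such as "0,1" using the 0/1 meanings described above.
SOURCES = {0: "EggNOG", 1: "InterProScan"}


def parse_ontology_flag(value):
    """Return the package names selected by a comma-separated flag value."""
    selected = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing comma or stray spaces
        code = int(token)
        if code not in SOURCES:
            raise ValueError(f"unknown ontology source: {code}")
        name = SOURCES[code]
        if name not in selected:
            selected.append(name)  # keep first-seen order, drop duplicates
    return selected


print(parse_ontology_flag("0,1"))
```

A check like this can catch a mistyped value before a long run is started, but how EnTAP itself validates the flag is not covered here.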

4.1. General Ontology Analysis Flags

Ontology Flags

| param | description | location (cmd/R-ini,E-ini) | qualifier | example |
|---|---|---|---|---|
| ontology_source | Specify which ontology source packages you would like to use. Multiple flags may be used to specify execution of multiple software packages. 0 - EggNOG (default); 1 - InterProScan | R-ini | multi-integer | 0 |

4.2. EggNOG Analysis

By default, EnTAP uses EggNOG-mapper (https://github.com/jhcepas/eggnog-mapper) to query the collection of EggNOG databases (http://eggnog5.embl.de/#/app/home) and assign functional information based on orthology relationships. This is especially valuable for non-model transcriptomes, where functional data may be limited.

EggNOG analysis is executed by default with EnTAP, so you only need to make sure that the database and execution paths are correct within both ini files.

4.2.1. EggNOG Commands

Ontology - EggNOG Specific Flags

| param | description | location (cmd/R-ini,E-ini) | qualifier | example |
|---|---|---|---|---|
| eggnog-map-data | Path to the directory containing the EggNOG SQL database eggnog.db that was downloaded during the Configuration stage. EnTAP will check for the eggnog.db database within this specified directory. | E-ini | string | /path/to/eggnog_db_directory |
| eggnog-map-dmnd | Path to the EggNOG DIAMOND configured database eggnog_proteins.dmnd that was generated during the Configuration stage. | E-ini | string | /databases/eggnog_proteins.dmnd |
| eggnog-map-exe | Path to the EggNOG-mapper executable, or method of execution. If installed globally, this is simply emapper.py. | E-ini | string | emapper.py |
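
Since a successful EggNOG run depends on these three paths, a quick check before starting a run can help. The following is a minimal sketch, not part of EnTAP: the function name check_eggnog_settings is hypothetical, and it assumes eggnog-map-exe is a single path or command name (a multi-word execution method such as an interpreter followed by a script would need different handling).

```python
import os
import shutil


def check_eggnog_settings(data_dir, dmnd_path, exe):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # eggnog-map-data: a directory expected to contain eggnog.db
    if not os.path.isfile(os.path.join(data_dir, "eggnog.db")):
        problems.append(f"eggnog.db not found in {data_dir}")
    # eggnog-map-dmnd: the DIAMOND-formatted database file itself
    if not os.path.isfile(dmnd_path):
        problems.append(f"DIAMOND database not found: {dmnd_path}")
    # eggnog-map-exe: either a path to the script or a command on PATH
    if not (os.path.isfile(exe) or shutil.which(exe)):
        problems.append(f"executable not found: {exe}")
    return problems


if __name__ == "__main__":
    # Example values taken from the table above.
    for problem in check_eggnog_settings(
        "/path/to/eggnog_db_directory",
        "/databases/eggnog_proteins.dmnd",
        "emapper.py",
    ):
        print(problem)
```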

4.2.2. Interpreting EggNOG Results

The /gene_family/EggNOG directory will contain all of the relevant information for the EggNOG stage of the pipeline. This folder will contain the files generated from EggNOG-mapper alongside the files generated by EnTAP. EnTAP files can be found within the /gene_family/EggNOG/processed directory.

Below are example files with a transcriptome labelled ‘transcriptome’ utilizing runP.

EggNOG Results

| filename | description | directory |
|---|---|---|
| blastp_transcriptome.emapper.annotations | Generated from EggNOG-mapper. Contains important functional annotation information pulled from orthologous group alignment within EggNOG databases. The filename is prefixed with blastp or blastx depending on whether runP or runN was used. | /gene_family/EggNOG |
| blastp_transcriptome.emapper.seed_orthologs | Generated from EggNOG-mapper. Contains all assigned seed orthologs for the sequences that were run through EggNOG-mapper. The information is similar to that seen with DIAMOND or BLAST runs, such as e-value and coverage. The filename is prefixed with blastp or blastx depending on whether runP or runN was used. | /gene_family/EggNOG |
| blastp_transcriptome.emapper.hits | Generated from EggNOG-mapper. Contains all of the hits against the EggNOG database (from DIAMOND). EggNOG-mapper first aligns the input transcriptome to the EggNOG database, which can result in multiple hits per sequence. The selected hits appear in the .emapper.seed_orthologs file while the rest remain here. The filename is prefixed with blastp or blastx depending on whether runP or runN was used. | /gene_family/EggNOG |
| eggnog_unannotated.fnn/faa | Generated from EnTAP. Sequences where NO alignment was made with the EggNOG database (nucleotide/protein). | /gene_family/EggNOG/processed |
| eggnog_annotated.fnn/faa | Generated from EnTAP. Sequences where an alignment was made with the EggNOG database (nucleotide/protein). | /gene_family/EggNOG/processed |
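
The annotated and unannotated FASTA files give a quick way to estimate how much of the transcriptome received an EggNOG alignment. The sketch below is a hedged illustration, not an EnTAP feature: the function names are hypothetical, and it assumes the two files together contain every sequence that was sent to EggNOG.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def annotation_rate(annotated_path, unannotated_path):
    """Fraction of sequences that received an EggNOG alignment."""
    annotated = count_fasta_records(annotated_path)
    unannotated = count_fasta_records(unannotated_path)
    total = annotated + unannotated
    # Avoid dividing by zero when both files are empty.
    return annotated / total if total else 0.0
```

For a protein run, the two arguments would be the eggnog_annotated.faa and eggnog_unannotated.faa files in the processed directory, given relative to wherever the EnTAP output was written.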

4.2.3. EggNOG Headers

TSV files generated from EnTAP will have the following headers from EggNOG analysis.

  • EggNOG Seed Ortholog

  • EggNOG Seed E-Value

  • EggNOG Seed Score

  • EggNOG Tax Scope Max

  • EggNOG Member OGs

  • EggNOG Description

  • EggNOG COG Abbreviation

  • EggNOG COG Description

  • EggNOG BIGG Reaction

  • EggNOG KEGG KO

  • EggNOG KEGG Pathway

  • EggNOG KEGG Module

  • EggNOG KEGG Reaction

  • EggNOG KEGG RClass

  • EggNOG BRITE

  • EggNOG GO Biological

  • EggNOG GO Molecular

  • EggNOG Protein Domains
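
As an illustration of working with these headers, the following minimal Python sketch collects the non-empty EggNOG columns for each row of an EnTAP TSV file. The function name eggnog_columns is hypothetical, and the name of the column holding the sequence ID is passed in as an argument because it is not listed above.

```python
import csv


def eggnog_columns(tsv_path, query_column):
    """Map each row's ID to its non-empty 'EggNOG ...' columns."""
    result = {}
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Keep only columns whose header starts with "EggNOG"
            # and whose value is present for this sequence.
            result[row[query_column]] = {
                name: value
                for name, value in row.items()
                if name and name.startswith("EggNOG") and value
            }
    return result
```

Selecting columns by the shared "EggNOG" prefix keeps the sketch independent of the exact set of headers present in a given file.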

4.3. InterProScan Analysis

The user has the option to use InterProScan (https://www.ebi.ac.uk/interpro/) as an additional method of determining functional annotation of our transcripts. InterProScan is a powerful tool that will classify our transcripts into families to predict domains and other important functional information.

4.3.1. Running InterProScan

To run InterProScan, the ontology flag within the entap_run.params file must include ‘1’, as mentioned above. Additional parameters can also be set, such as which databases InterProScan should search (the interproscan-db parameter). These can be seen below.

4.3.2. InterProScan Commands

Ontology - InterProScan Specific Flags

| param | description | location (cmd/R-ini,E-ini) | qualifier | example |
|---|---|---|---|---|
| interproscan-db | Use this option if you would like to run InterProScan against specific databases. Multiple databases can be selected: tigrfam, sfld, prodom, hamap, pfam, smart, cdd, prositeprofiles, prositepatterns, superfamily, prints, panther, gene3d, pirsf, coils, mobidblite | R-ini | multi-string | pfam |
| interproscan-exe | Specify the execution method for InterProScan. Commonly this can be the path to the interproscan.sh file. | E-ini | string | interproscan.sh |
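
Because interproscan-db only accepts the database names listed above, a small check can catch typos before a run. This is a minimal sketch with a hypothetical function name; it assumes that multiple databases are written as a comma-separated list, mirroring the ontology flag, which should be confirmed against your own parameter file.

```python
# Database names copied from the interproscan-db description above.
IPR_DATABASES = {
    "tigrfam", "sfld", "prodom", "hamap", "pfam", "smart", "cdd",
    "prositeprofiles", "prositepatterns", "superfamily", "prints",
    "panther", "gene3d", "pirsf", "coils", "mobidblite",
}


def format_interproscan_db(names):
    """Validate database names and join them into one parameter line."""
    cleaned = [name.strip().lower() for name in names]
    unknown = sorted(set(cleaned) - IPR_DATABASES)
    if unknown:
        raise ValueError(f"unsupported database(s): {', '.join(unknown)}")
    # Assumption: values are comma-separated, as with the ontology flag.
    return "interproscan-db=" + ",".join(cleaned)


print(format_interproscan_db(["pfam", "panther"]))
```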

4.4. Interpreting InterProScan Results

The /gene_family/InterProScan directory will contain all of the relevant information for the optional InterProScan stage of the pipeline. This folder will contain files generated by InterProScan as well as those by EnTAP (/gene_family/InterProScan/processed).

Below are the example files you may find when including InterProScan:

InterProScan Results

| filename | description | directory |
|---|---|---|
| interproscan.tsv/xml | Generated from InterProScan. Tab-delimited or XML file containing information on the sequences with domain matches, including signature accession and description information and GO/Pathway alignments. | /gene_family/InterProScan |
| unannotated_sequences.fnn/faa | Generated from EnTAP. Sequences where NO domain could be assigned (nucleotide/protein) through InterProScan. | /gene_family/InterProScan/processed |
| annotated_sequences.fnn/faa | Generated from EnTAP. Sequences where a domain could be assigned (nucleotide/protein) through InterProScan. | /gene_family/InterProScan/processed |
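
To show how the tab-delimited interproscan.tsv file might be inspected directly, here is a minimal Python sketch that groups signature matches by sequence. It assumes the standard InterProScan TSV layout with no header row, where the first column is the sequence ID, the fourth is the analysis (member database), and the fifth is the signature accession. The function name is hypothetical.

```python
from collections import defaultdict


def signatures_by_sequence(tsv_path):
    """Group (analysis, signature accession) pairs by sequence ID."""
    hits = defaultdict(set)
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip blank or truncated lines
            # Assumed column positions: 0 = sequence ID,
            # 3 = analysis, 4 = signature accession.
            sequence_id, analysis, signature = fields[0], fields[3], fields[4]
            hits[sequence_id].add((analysis, signature))
    return dict(hits)
```

Sequences that do not appear in the returned dictionary had no domain match in the file, which should correspond to the unannotated sequences written by EnTAP.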

4.4.1. InterProScan Headers

TSV files generated from EnTAP will have the following headers from InterProScan analysis.

  • IPScan GO Biological

  • IPScan GO Cellular

  • IPScan GO Molecular

  • IPScan Pathways

  • IPScan InterPro ID

  • IPScan Protein Database

  • IPScan Protein Description

  • IPScan E-Value