eggNOG: automated construction and annotation of orthologous groups of genes

Nucleic Acids Research, 2007, 1–5
doi:10.1093/nar/gkm796

Lars Juhl Jensen1, Philippe Julien1, Michael Kuhn1, Christian von Mering2, Jean Muller1, Tobias Doerks1 and Peer Bork1,3,*

1European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany, 2University of Zurich and Swiss Institute of Bioinformatics, Winterthurerstrasse 190, 8057 Zurich, Switzerland and 3Max-Delbrück-Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany

Received August 14, 2007; Revised September 14, 2007; Accepted September 17, 2007

ABSTRACT

The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database (‘evolutionary genealogy of genes: Non-supervised Orthologous Groups’), which contains orthologous groups constructed from Smith–Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains.
The orthologous groups in eggNOG contain 1241751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at [...]

The vast majority of the functionally annotated genes in genomes or metagenomes are derived by comparative analysis and inference from existing functional knowledge via homology. With the sequencing of entire genomes, it became possible to increase the resolution of the functional transfer by distinguishing between orthologs and paralogs, that is, gene pairs that trace back to speciation and gene duplication events, respectively (1). These concepts have since been extended and refined to include orthologous groups (2), in-paralogs and out-paralogs (3,4), but the identification and classification of homologous genes remains very difficult. In contrast to the definition of orthology, the classification of genes into orthologous groups is always with respect to a taxonomic position: two paralogous genes from human and mouse may be orthologs of the same gene in fruit fly and will belong to either the same or different orthologous groups depending on whether these are defined with respect to the last common ancestor of metazoans or mammals. This is further complicated by evolutionary processes such as gene fusion and domain shuffling, due to which each domain of a multi-domain protein is not guaranteed to have evolved through the same series of speciation and duplication events. Finally, because we do not know how each gene evolved, one in practice always relies on operational definitions rather than the evolutionary definitions given above.

Numerous methods have been developed to derive orthologs and orthologous groups, ranging from the simple reciprocal-best-hit approach, via InParanoid (5), MultiParanoid (6), identification of best-hit triangles (2,7,8) and clustering-based approaches (9), to tree-based methods (10–13).
By contrast, there has been only one major effort to provide functionally annotated orthologous groups, namely the COG/KOG database (2,8), but it lacks phylogenetic resolution and is not regularly updated due to the manual labor required. There is thus a need for a hierarchical system of orthology classification with function annotation.

Here, we provide such a system, eggNOG, which (1) can be updated without the requirement for manual curation, (2) covers more genes and genomes than existing databases, (3) contains a hierarchy of orthologous groups to balance phylogenetic coverage and resolution and (4) provides automatic function annotation of similar quality to that obtained through manual inspection.

*To whom correspondence should be addressed. Tel: +49 6221 387 526; Fax: +49 6221 387 517; Email: bork@embl.de
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ([...]), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research Advance Access published October 16, 2007

CONSTRUCTION OF HIERARCHICAL ORTHOLOGOUS GROUPS

We assemble proteins into orthologous groups using an automated procedure similar to the original COG/KOG approach (2,8). When constructing coarse-grained orthologous groups across all three domains of life or for all eukaryotes, we first assign the proteins encoded by the genomes in eggNOG to the respective COGs or KOGs based on best hits to the manually assigned sequences in the COG/KOG database. In case of multiple hits to the same part of the sequence, only the best hit was considered.
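The best-hit rule for competing matches can be illustrated with a minimal Python sketch. It is not the eggNOG implementation: the `Hit` record, the function name `best_hits_per_region` and the group labels are hypothetical, and treating any overlap in query coordinates as ‘the same part of the sequence’ is an assumption, since the text does not state an overlap criterion.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    group: str    # label of the COG/KOG whose member was matched (hypothetical)
    start: int    # first aligned residue on the query, inclusive
    end: int      # last aligned residue on the query, inclusive
    score: float  # alignment score, higher is better


def best_hits_per_region(hits: List[Hit]) -> List[Hit]:
    """Keep only the best-scoring hit for each part of the query sequence.

    Assumption: two hits cover 'the same part' if their query ranges overlap
    by at least one residue.
    """
    kept: List[Hit] = []
    # Visit hits from best to worst, so a kept hit always outranks
    # any later hit that overlaps it.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        overlaps = any(hit.start <= k.end and k.start <= hit.end for k in kept)
        if not overlaps:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("groupA", 1, 100, 250.0),
    Hit("groupB", 50, 150, 200.0),   # overlaps groupA with a lower score
    Hit("groupC", 160, 300, 180.0),  # covers a different part of the query
]
assigned = best_hits_per_region(hits)
print([h.group for h in assigned])
```

Under these assumptions a protein can still be assigned to more than one group when its hits cover different parts of the sequence, which matches the wording of the rule above.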
The many proteins that cannot be assigned to existing COGs or KOGs are subsequently assembled into non-supervised orthologous groups using the procedure described below. When constructing more fine-grained orthologous groups, this initial step is skipped.

Briefly, we first compute all-against-all Smith–Waterman similarities among all proteins in eggNOG. We then group recently duplicated sequences into in-paralogous groups, which are subsequently treated as single units to ensure that they will be assigned to the same orthologous groups. To form the in-paralogous groups, we first assemble highly related genomes into clades, usually encompassing all sequenced strains of a particular species in a single clade, but also other close pairs such as human and chimpanzee. In these clades, we join into in-paralogous groups all proteins that are more similar to each other (within the clade) than to any other protein outside the clade. For this, there is no fixed cutoff in similarity, but instead we start with a stringent similarity cutoff and relax it in a step-wise fashion until all in-paralogous proteins are joined, requiring that all members of a group must align to each other with at least 20 residues.

After grouping in-paralogous proteins, we start assigning orthology between proteins by joining triangles of reciprocal best hits involving three different species (here, in-paralogous groups are represented by their best-matching member). Again, we start with a stringent similarity cutoff and relax it to identify groups of proteins that all align to each other by at least 20 residues. This procedure occasionally causes an orthologous group to be split in two; such cases are identified by an abundance of reciprocal best hits between groups, which are then joined. Next, we relax the triangle criterion and allow remaining unassigned proteins to join a group by simple bidirectional best hits.
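The triangle step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the eggNOG pipeline: it uses a single fixed score table instead of a step-wise relaxed cutoff, ignores in-paralogous groups and the 20-residue alignment requirement, and merges triangles only when they share an edge (the rule used by the original COG approach, assumed here). The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def reciprocal_best_hits(scores, species):
    """Return the protein pairs that are each other's best hit across species.

    scores maps (query, target) -> similarity; species maps protein -> species.
    """
    best = {}  # (query, target species) -> (score, target)
    for (query, target), score in scores.items():
        if species[query] == species[target]:
            continue  # only hits between different species count
        key = (query, species[target])
        if key not in best or score > best[key][0]:
            best[key] = (score, target)
    rbh = set()
    for (query, _target_species), (_score, target) in best.items():
        back = best.get((target, species[query]))
        if back is not None and back[1] == query:
            rbh.add(frozenset((query, target)))
    return rbh


def triangle_groups(rbh, species):
    """Merge edge-sharing triangles of reciprocal best hits into groups."""
    neighbours = defaultdict(set)
    for a, b in (tuple(pair) for pair in rbh):
        neighbours[a].add(b)
        neighbours[b].add(a)
    triangles = set()
    for a, b in (tuple(pair) for pair in rbh):
        for c in neighbours[a] & neighbours[b]:
            if len({species[a], species[b], species[c]}) == 3:
                triangles.add(frozenset((a, b, c)))

    parent = {t: t for t in triangles}  # union-find over triangles

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    edge_owner = {}
    for t in triangles:
        for edge in combinations(sorted(t), 2):
            if edge in edge_owner:
                parent[find(t)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = t
    groups = defaultdict(set)
    for t in triangles:
        groups[find(t)] |= t
    return sorted(sorted(g) for g in groups.values())


species = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse",
           "f1": "fly", "y1": "yeast"}
pairs = {("h1", "m1"): 90, ("h1", "f1"): 60, ("m1", "f1"): 58,
         ("h1", "y1"): 35, ("f1", "y1"): 40, ("h2", "m2"): 80, ("h1", "m2"): 20}
scores = {}
for (a, b), s in pairs.items():  # make the toy score table symmetric
    scores[(a, b)] = s
    scores[(b, a)] = s
rbh = reciprocal_best_hits(scores, species)
groups = triangle_groups(rbh, species)
print(groups)
```

In the toy data, `h2` and `m2` are reciprocal best hits but form no triangle, so they stay unassigned at this stage; in the procedure described above they would be candidates for the later, relaxed step that admits simple bidirectional best hits.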
Finally, we automatically identify gene fusion events by searching for proteins that bridge otherwise unrelated orthologous groups. In these cases, the different parts of the fusion protein are assigned to their respective orthologous groups. This step is a distinguishing feature of our approach and is crucial for the analysis of eukaryotic multi-domain proteins, as these would otherwise cause unrelated orthologous groups to be fused.

To construct a hierarchy of orthologous groups, the procedure described above was applied to several subsets of organisms. To make a set of coarse-grained orthologous groups across all three domains of life, we constructed non-supervised orthologous groups (NOGs) from the genes that could not be mapped to a COG or KOG. Focusing on eukaryotic genes, we constructed more fine-grained eukaryotic NOGs (euNOGs) from the genes that could not be mapped to a KOG. Finally, we built sets of NOGs of increasing resolution for five eukaryotic clades, namely fungi (fuNOGs), metazoans (meNOGs), insects (inNOGs), vertebrates (veNOGs) and mammals (maNOGs).

AUTOMATIC ANNOTATION OF PROTEIN FUNCTION

An important feature of eggNOG is that it provides functional annotations for the orthologous groups. These annotations are produced by a pipeline, which summarizes the available functional information on the proteins in each cluster: (1) the textual annotation for these proteins, (2) their annotated Gene Ontology (GO) terms (14), (3) their membership in KEGG pathways (15) and (4) the presence of protein domains from SMART (16) and Pfam (17).
As the textual descriptions allow for the most fine-grained annotation of protein function, we first use Ukkonen’s algorithm (18) to identify the longest common subsequence (LCS) between the description lines of any two proteins within a cluster. We then score each LCS based on the number of protein descriptions matched within the cluster, the number of occurrences of each word of the LCS in these descriptions, and the presence of words such as ‘hypothetical’, ‘putative’ or ‘unknown’. These scores are finally normalized against a score distribution based on randomized clusters of the same size, and the highest scoring LCS is chosen, provided that it scores above a threshold.

For each orthologous group, our pipeline also searches for overrepresented GO terms, KEGG pathways or protein domains. To find terms that are sufficiently specific and at the same time are likely to describe the entire orthologous group, we devised a scoring function that takes into account term frequency within the group, background frequency, and the ratio of the two (i.e. the fold overrepresentation). In case no satisfactory LCS was found, a description line is constructed based on the highest scoring GO term or KEGG pathway. As a single domain may not properly reflect the function of a complete protein, description lines are constructed based on overrepresented domains only if all other options have been exhausted.

QUALITY ASSESSMENT AND SUMMARY STATISTICS

To assess the quality of the function annotations provided by our automated pipeline, we manually checked a random sample of 100 NOGs and 100 euNOGs and classified their annotations into three categories: 87.5% were correct (i.e. they describe a function that the proteins have in common), 12.5% were uninformative (i.e.
they do not describe a function) and, due to our stringent rule set, no wrong functions were assigned. Uninformative annotations of orthologous groups are in many cases due to a lack of functional knowledge on the corresponding proteins.

Our function annotation pipeline enables us to provide description lines for 6583 of the 33858 (19%) coarse-grained NOGs. Combined with the 9724 COGs and KOGs, this yields 43582 global orthologous groups of which 14356 (33%) have an annotated function. In addition, eggNOG contains 94240 more fine-grained orthologous groups of which 55753 (59%) could be functionally annotated. This enables us to assign 1241751 of 1513782 genes (82% of the genes in the analyzed genomes) to an orthologous group and to provide at least a broad functional description of 951918 of them (77% of the genes that could be assigned to an orthologous group). The corresponding numbers for each set of orthologous groups as well as for each individual genome are summarized in Figure 1.

USING eggNOG

The eggNOG resource is accessible via a web interface at [...] The main page allows the user to input the names of one or more genes or orthologous groups and to optionally select the organism of interest. Alternatively, the user can choose to upload a set of protein sequences to be searched against the full-length sequences in eggNOG. In case of ambiguous names or query sequences with multiple hits, the user is prompted to disambiguate the input.

Figure 2 shows the result of a query for the three G1-type cyclins in budding yeast, which belong to two distinct fungal orthologous groups. Function descriptions are displayed both for the orthologous groups and for the individual genes. The web interface enables the user to view the complete set of genes that belong to each orthologous group and provides external links to additional information on the protein products.

By default, eggNOG shows the most fine-grained orthologous groups that are possible given the input: just as entering a set of genes from budding yeast results in fungal orthologous groups being shown, a set of human genes will yield mammalian orthologous groups, whereas a combination of human and fruit fly genes will yield metazoan orthologous groups. A navigation tree at the top of the page (Figure 2) allows the user to select more coarse-grained orthologous groups if desired; for example, selecting ‘eukaryotes’ reveals that the three budding yeast cyclins all belong to the same eukaryotic orthologous group. This key feature enables the user to choose the balance between phylogenetic coverage and resolution within our hierarchy of orthologous groups.

Figure 2. Screenshot of the main results page. The eggNOG database was queried for the three G1-type cyclins in budding yeast, namely Cln1–Cln3. These have been correctly assigned to two fungal orthologous groups. The navigation tree at the top of the page allows the user to change the view to more coarse-grained orthologous groups, for example the eukaryotic orthologous groups in which these cyclins are all grouped together.

Whereas the web interface is convenient for small-scale studies, users interested in genome-wide analyses will be better served by downloading the complete content of the underlying relational database. For this reason, the orthologous groups, functional annotations and protein sequences are all available from the eggNOG download page under the Creative Commons Attribution 3.0 License.

ACKNOWLEDGEMENTS

The authors thank Eugene Koonin for comments on the manuscript.
This work was supported by the Bundesministerium für Bildung und Forschung (Nationales Genomforschungsnetz grant 01GR0454) as well as through the GeneFun Specific Targeted Research Project, contract number LSHG-CT-2004-503567, and through the BioSapiens Network of Excellence, contract number LSHG-CT-2003-503265, both funded by the European Commission FP6 Programme. Funding to pay the Open Access publication charges for this article was provided by the European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

REFERENCES

1. Fitch,W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99–113.
2. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.
3. Sonnhammer,E.L. and Koonin,E.V. (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet., 18, 619–620.
4. Koonin,E.V. (2005) Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet., 39, 309–338.
5. O’Brien,K.P., Remm,M. and Sonnhammer,E.L.L. (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res., 33, D476–D480.
6. Alexeyenko,A., Tamas,I., Liu,G. and Sonnhammer,E.L.L. (2006) Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics, 22, e9–e15.
7. Lee,Y., Sultana,R., Pertea,G., Cho,J., Karamycheva,S., Tsai,J., Parvizi,B., Cheung,F., Antonescu,V. et al. (2002) Cross-referencing eukaryotic genomes: TIGR orthologous gene assignments (TOGA). Genome Res., 12, 493–502.
8. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
9. Li,L., Stoeckert,C.J.,Jr. and Roos,D.S. (2003) OrthoMCL: identification of orthologous groups for eukaryotic genomes. Genome Res., 13, 2178–2189.
10. Li,H., Coghlan,A., Ruan,J., Coin,L.J., Hériché,J.-K., Osmotherly,L., Li,R., Liu,T., Zhang,Z. et al. (2006) TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res., 34, D572–D580.
11. van der Heijden,R.T.J.M., Snel,B., van Noort,V. and Huynen,M.A. (2007) Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics, 8, 83.
12. Hubbard,T.J.P., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cunningham,F. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610–D617.
13. Wapinski,I., Pfeffer,A., Friedman,N. and Regev,A. (2007) Automatic genome-wide reconstruction of phylogenetic trees. Bioinformatics, 23, i549–i558.
14. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S. et al. (2000) Gene Ontology: tool for the unification of biology. Nature Genet., 25, 25–29.
15. Kanehisa,M., Goto,S., Hattori,M., Aoki-Kinoshita,K.F., Itoh,M., Kawashima,S., Katayama,T., Araki,M. and Hirakawa,M. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34, D354–D357.
16. Letunic,I., Copley,R.R., Schmidt,S., Ciccarelli,F., Doerks,T., Schultz,J., Ponting,C.P. and Bork,P. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144.
17. Finn,R.D., Mistry,J., Schuster-Böckler,B., Griffiths-Jones,S., Hollich,V., Lassmann,T., Moxon,S., Marshall,M., Khanna,A. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res., 34, D247–D251.
18. Ukkonen,E. (1995) On-line construction of suffix trees. Algorithmica, 14, 249–260.
19. von Mering,C., Jensen,L.J., Kuhn,M., Chaffron,S., Doerks,T., Krüger,B., Snel,B. and Bork,P. (2007) STRING 7—recent developments in the integration and prediction of protein interactions. Nucleic Acids Res., 35, D358–D362.
20. Letunic,I. and Bork,P. (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics, 23, 127–128.