Supplemental Material for Kim et al., 2018
Figure S1: Changes in gene content in Ensembl Compara v87, v88 and v89. The Ensembl Compara database was updated three times while we were compiling OL2. Over this period, the set of worm genes with predicted human orthologs changed with each update: each version had ~2% of genes unique to it, while another ~2-4% of genes were found in only two of the three versions (see also Materials and Methods).
Figure S2: Updated OL1. We updated OL1 by addressing changes in worm gene structure, classification and nomenclature for the genes present in our original compendium. We then combined results from the corrected OL1 programs. The Venn diagram (A) shows overlap in corrected gene content between the four programs, while the table (B) gives an overall measure of how many genes were found by one or more programs (regardless of which one(s) found them).
Table S1: Data sources for orthology-prediction programs used to compile OL2. The source data for each program is found at each program’s website.
File S1: Changed OL1 worm genes. This file lists genes whose classification, or ID, changed since the release of OL1. Type I changes correspond to genes that were re-classified as pseudogenes, ncRNA, or transposon-derived, or that were killed due to lack of evidence. Type II changes result from combining, or “merging”, two or more genes that had each, separately, been found to have a human ortholog in OL1. Type III changes represent genes that were assigned new IDs, either because experimental evidence suggested that they should be merged with genes previously not in OL1 (marked red), or due to addition of previously unpredicted gene segments (denoted as a red “?”).
File S2: Corrected C. elegans genes in OL1. Tab (A) shows all corrected worm genes found by the OL1-era version of each orthology-prediction method. Tab (B) shows the distribution of results between OL1-era orthology-prediction methods, while tab (C) shows the corrected OL1 as well as the distribution of genes by support class (supported by one, two, three or all methods).
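The support class described for tab (C) can be illustrated with a minimal Python sketch. This is not the code used to compile OrthoList; the method names and gene IDs below are hypothetical placeholders, and the only assumption is that each method contributes a set of worm gene IDs.

```python
from collections import Counter


def support_classes(predictions):
    """Map each gene to the number of methods that predict it.

    `predictions` maps a method name to an iterable of gene IDs.
    """
    support = Counter()
    for genes in predictions.values():
        # set() guards against a gene listed twice by the same method
        support.update(set(genes))
    return dict(support)


def class_distribution(support):
    """Count how many genes fall into each support class."""
    return Counter(support.values())


# Hypothetical placeholder data: four methods, four genes
predictions = {
    "method_A": {"w1", "w2", "w3"},
    "method_B": {"w1", "w2"},
    "method_C": {"w1", "w4"},
    "method_D": {"w1"},
}

support = support_classes(predictions)
distribution = class_distribution(support)
for n_methods in sorted(distribution):
    print(f"supported by {n_methods} method(s): {distribution[n_methods]} gene(s)")
```

Here a gene found by all four placeholder methods lands in support class 4, and a gene found by a single method in class 1.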
File S3: Changed OL1 human genes. Human ENSG gene IDs from OL1 are listed for each orthology-prediction method in tab (A). This tab also shows the 574 ENSG IDs that are no longer found in current versions of the Ensembl genome browser. Tab (B) shows the Ensembl-provided history for the 574 lost ENSG IDs, indicating that most are now simply classified as “retired”. Tab (C) shows a randomly selected subset of 20 IDs that were “retired”. Note that the gene name (HGNC-approved symbol) associated with each “retired” ENSG ID is always associated with current ENSG IDs, demonstrating that curation of ENSG IDs rarely links “retired” IDs with their current counterparts. Tab (D) lists the sixteen human ENSG IDs that we could confirm were deprecated.
File S4: C. elegans genes in OL1.1. Tab (A) shows the worm genes found to have human orthologs by updated versions of prediction methods used in OL1. Tab (B) shows the distribution of results between orthology-prediction methods. Tab (C) shows the final OL1.1, as well as the distribution of genes by support class (supported by one, two, three or all methods). Tab (D) lists those genes found only in OL1 (termed “lost”), and those added upon update to OL1.1.
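The “lost” and added gene sets in tab (D) amount to set differences between two releases. The following Python sketch shows the idea under that assumption; the function name and gene IDs are hypothetical placeholders, not taken from the OrthoList code.

```python
def compare_releases(old_genes, new_genes):
    """Split gene IDs into lost, added and retained between two releases."""
    old_genes, new_genes = set(old_genes), set(new_genes)
    return {
        "lost": old_genes - new_genes,       # only in the older release
        "added": new_genes - old_genes,      # only in the newer release
        "retained": old_genes & new_genes,   # present in both
    }


# Hypothetical placeholder gene IDs
ol1 = {"w1", "w2", "w3"}
ol1_1 = {"w2", "w3", "w4", "w5"}

changes = compare_releases(ol1, ol1_1)
for label in ("lost", "added", "retained"):
    print(label, sorted(changes[label]))
```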
File S5: C. elegans OMA and OrthoInspector results, their relationship to OL1.1 and genes not supported by current versions of orthology-prediction methods. Tab (A) shows the worm genes found to have human orthologs by OMA, OrthoInspector and those already in OL1.1. Tab (B) shows the distribution of results amongst these three sets. Tab (C) lists all worm genes with human orthologs supported by current orthology-prediction methods (OL2) as well as those no longer supported (the “legacy” set).
File S6: The “legacy” set. Tab (A) lists the 256 C. elegans genes previously predicted to have human orthologs, but not supported by current versions of orthology-prediction methods, and their predicted protein domains determined by SMART and InterPro. Tab (B) lists the human “legacy” set: 165 human genes that were previously predicted to have worm orthologs, but for which orthology is no longer supported.
File S7: OL2 and legacy master list. This file, which underlies the database hosted at ortholist.shaye-lab.org, contains all orthology predictions (current and legacy), with C. elegans and human gene identifiers, as well as associated protein domain (SMART and InterPro) and human disease (OMIM) information.
File S8: Freeze of code used to compile OrthoList 2. The code was downloaded from https://github.com/danshaye/OrthoList2 at the time of submission.