The Ensembl Browser
From genomewiki
Revision as of 10:05, 30 October 2009
I am trying to learn how Ensembl is structured. As Ensembl itself has no wiki, forum, or public mailing list for user discussions, I'll document it here.
Some ideas:
- Everything is in MySQL databases. No flat text files. See the database schema documentation.
- Can be accessed via the Perl API (slow), via biomart.org (~table browser, fast and convenient), or via direct SQL queries (very complex schema)
- An update of everything is done every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older archived versions from the current ones.
- Genes are not re-predicted each time but only when new data is added to the gene build. The month of the last update of a gene build is stored in genome_db.genebuild
- The current version (Oct 2009) is 56
- Usually, each species has its own database, like in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'
- The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful.
- The database structure is very normalized. While this is nice from a software engineering perspective, it makes large-scale requests impractical. E.g. downloading all homologs between two genomes involves queries on self-referencing tables which take ages to resolve and will time out if run on their server. Use biomart for these types of requests.
The databases:
Confusing, because all versions are on the same server. Some ideas to help you find your way:
- species_name_version_obscureNumber is the format of the individual species database (see below)
- ensembl_compara includes homologies between proteins and genomes
- ensembl_go_version: Not used anymore? Was used to store gene ontology links.
- ensembl_website_version: Ensembl includes some sort of content management system. This database includes help articles, bugs, news, the list of species on the front page etc. (This database looks somewhat similar to hgcentral)
- ensembl_ancestral_version ??
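To tell the species databases apart from the shared ones when looking at a server listing, the naming pattern above can be split mechanically. This is only a sketch: the helper `parse_species_db_name` and the field names (`species`, `group`, `release`, `suffix`) are my own labels, not Ensembl terminology, and the trailing "obscureNumber" is kept as an opaque string.

```python
from typing import NamedTuple


class SpeciesDbName(NamedTuple):
    """Parts of a species database name (field names are hypothetical)."""
    species: str   # e.g. "homo_sapiens"
    group: str     # e.g. "core"
    release: int   # Ensembl version, e.g. 56
    suffix: str    # the "obscureNumber", kept as-is, e.g. "37a"


def parse_species_db_name(name: str) -> SpeciesDbName:
    # Split from the right so that species names containing
    # underscores stay in one piece.
    parts = name.rsplit("_", 3)
    if len(parts) != 4 or not parts[2].isdigit():
        raise ValueError(f"not a species database name: {name!r}")
    species, group, release, suffix = parts
    return SpeciesDbName(species, group, int(release), suffix)


parsed = parse_species_db_name("homo_sapiens_core_56_37a")
print(parsed)
```

Names that do not follow the four-part pattern, such as a bare `ensembl_compara`, raise a `ValueError`, which makes it easy to filter them out of a list of databases.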
The species database:
- Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. They are set up in the table 'coord_system'
- The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences but it's empty.
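How the three tables relate can be sketched with a toy in-memory database. The table names are the ones mentioned above; the columns are a cut-down assumption of the real schema, the rows are invented toy data, and sqlite3 stands in for MySQL so that the example runs without a server. `fetch_sequence` is a hypothetical helper, not part of any Ensembl API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the core tables (assumed columns).
CREATE TABLE coord_system (
    coord_system_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE seq_region (
    seq_region_id INTEGER PRIMARY KEY,
    name TEXT,
    coord_system_id INTEGER,
    length INTEGER
);
CREATE TABLE dna (
    seq_region_id INTEGER PRIMARY KEY,
    sequence TEXT
);
-- Toy data, not real Ensembl content.
INSERT INTO coord_system VALUES (1, 'chromosome'), (2, 'contig');
INSERT INTO seq_region VALUES (1, '1', 1, 12), (2, 'toy_contig_1', 2, 8);
INSERT INTO dna VALUES (2, 'ACGTACGT');
""")


def fetch_sequence(conn, coord_system, region_name):
    """Return the stored sequence of a region, or None if there is none."""
    row = conn.execute(
        """
        SELECT d.sequence
        FROM seq_region AS sr
        JOIN coord_system AS cs ON cs.coord_system_id = sr.coord_system_id
        JOIN dna AS d ON d.seq_region_id = sr.seq_region_id
        WHERE cs.name = ? AND sr.name = ?
        """,
        (coord_system, region_name),
    ).fetchone()
    return row[0] if row else None


print(fetch_sequence(conn, "contig", "toy_contig_1"))
```

The point of the sketch is only the shape of the join: a region is identified by its name plus its coordinate system, and the sequence hangs off the region's id.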
The pipeline:
- Their pipeline system also inserts its jobs into a MySQL database
- The genebuild step predicts genes
- The xref step connects predicted genes to external identifiers
- The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins
- The biomart step de-normalizes all databases for faster access (It seems that biomart is not archived. If this is true, then one cannot rely on it for whole-genome work as one might end up with inconsistent data)