The Ensembl Browser

I am trying to learn how Ensembl is structured. As Ensembl itself does not have a wiki, nor a forum, nor a public mailing list for user discussions, I'll document it here.

Some ideas:

Everything is stored in MySQL databases; there are no flat text files. The tables are described in the database schema documentation.
The data can be accessed via the Perl API (slow), via biomart.org (roughly a table browser; fast and convenient), or via direct SQL queries (the schema is very complex).
Everything is updated every 6 months. The old code, the old API and all databases are archived. Different MySQL servers running on different ports are used to separate older archived versions from current ones.
- Genes are not re-predicted at every release, only when new data is added to the gene build. The starting month of the most recent gene build update is stored in genome_db.genebuild (not the month in which the genebuild ended, so I don't see how you can tell whether genes changed).
The current version (Oct 09) is 56.
Usually, each species has its own database, as in the UCSC browser. The current human one is 'homo_sapiens_core_56_37a'.
The web interface is called "webcode". It is written in Perl and makes extensive use of inheritance (uh-oh), so tool support for reading the code might be helpful.
The database structure is very normalized. While this is nice from a software engineering perspective, it makes large-scale requests impractical. E.g. downloading all homologs between two genomes involves queries on self-referencing tables, which take ages to resolve and will time out if run on their server. Use BioMart for these types of requests.
There are still a lot of older functions lingering in the source code. If a function returns null although it shouldn't, have a look at the source code. Often such functions have been replaced by others. The ensembl-dev mailing list is usually the only way to get more information.

The databases:

All versions of the genomes are on the same server. Some ideas to help you find your way:

species_name_version_obscureNumber is the format of the individual species database (see below)
ensembl_compara includes homologies between proteins and genomes
ensembl_go_version: Not used anymore? Was used to store gene ontology links.
ensembl_website_version: Ensembl includes some sort of content management system. This database includes help articles, bugs, news, the list of species on the front page etc. (This database looks somewhat similar to hgcentral.)
ensembl_ancestral_version ??
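
To make the species database naming pattern concrete, here is a small Python sketch that splits a name such as 'homo_sapiens_core_56_37a' into its parts. The function and field names are my own, and reading the 'core' part as a database type is my assumption; the notes above only give the overall format and the one example.

```python
from typing import NamedTuple


class SpeciesDbName(NamedTuple):
    species: str  # e.g. "homo_sapiens"
    db_type: str  # e.g. "core" (reading this part as a database type is an assumption)
    release: int  # Ensembl version, e.g. 56
    suffix: str   # the "obscure number", e.g. "37a"


def parse_species_db_name(name: str) -> SpeciesDbName:
    """Split a species database name into its four parts.

    Splitting from the right keeps species names with several
    underscores intact.
    """
    parts = name.rsplit("_", 3)
    if len(parts) != 4 or not parts[2].isdigit():
        raise ValueError(f"not a species database name: {name!r}")
    species, db_type, release, suffix = parts
    return SpeciesDbName(species, db_type, int(release), suffix)


if __name__ == "__main__":
    print(parse_species_db_name("homo_sapiens_core_56_37a"))
```

Names that do not follow the pattern, such as ensembl_compara, raise a ValueError, which is handy for filtering a full database listing down to the species databases.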

The species database:

Sequences can be accessed using different "coordinate systems", e.g. you can type in a chromosome location or alternatively a contig location. Both will be mapped to chromosome sequences. They are set up in the table 'coord_system'.
The sequences themselves are stored in the table 'dna' and information about them in 'seq_region'. There is a table 'dnac' for compressed sequences, but it is empty.
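
To get a feeling for how these three tables relate, here is a toy model in Python using an in-memory SQLite database instead of the real MySQL server. It is only a sketch: the column names follow my reading of the schema (the real tables have more columns, so check the schema documentation), the data rows are invented, and the mapping between coordinate systems is not modelled at all.

```python
import sqlite3

# Simplified stand-ins for the three tables; the real ones have more columns.
SCHEMA = """
CREATE TABLE coord_system (coord_system_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT,
                         coord_system_id INTEGER, length INTEGER);
CREATE TABLE dna (seq_region_id INTEGER PRIMARY KEY, sequence TEXT);
"""


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with invented example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO coord_system VALUES (?, ?)",
                     [(1, "chromosome"), (2, "contig")])
    conn.executemany("INSERT INTO seq_region VALUES (?, ?, ?, ?)",
                     [(10, "toy_chr", 1, 8), (20, "toy_contig", 2, 8)])
    # Only one toy region gets a sequence row here.
    conn.execute("INSERT INTO dna VALUES (?, ?)", (20, "ACGTACGT"))
    return conn


def fetch_sequence(conn, coord_system, region_name, start, end):
    """Return bases start..end (1-based, inclusive), or None if no row matches."""
    row = conn.execute(
        """SELECT SUBSTR(d.sequence, ?, ?)
           FROM dna d
           JOIN seq_region sr ON sr.seq_region_id = d.seq_region_id
           JOIN coord_system cs ON cs.coord_system_id = sr.coord_system_id
           WHERE cs.name = ? AND sr.name = ?""",
        (start, end - start + 1, coord_system, region_name),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_toy_db()
    print(fetch_sequence(conn, "contig", "toy_contig", 3, 6))
```

The point is the join path: 'coord_system' names the kind of region, 'seq_region' lists the individual regions, and 'dna' holds the bases keyed by seq_region_id.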

The pipeline:

Their pipeline system also inserts its jobs into a MySQL database.
The genebuild step predicts genes.
The xref step connects predicted genes to external identifiers.
The compara step aligns all genomes and predicted genes and then builds phylogenetic trees for all proteins.
The biomart step de-normalizes all databases for faster access. (All older BioMart versions are accessible via the archived old Ensembl versions.)

Documentation:

Most documentation is not accessible from the Ensembl homepage. You have to dig into the CVS repositories to find "pipeline_docs" [1]. The file overview.txt gives a very good introduction.
