Ensembl minimum install: Difference between revisions

From genomewiki
Jump to navigationJump to search
No edit summary
No edit summary
Line 1: Line 1:
Necessary files for this example, see [[Image:EnsemblWorkshopFiles.tar.gz]]. The examples are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on [http://www.voeglein.de/gi_workshop_2008/Exercise%201.doc Jan Vogel's homepage] and in the tar archive referenced above, in the doc/ directory.
For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their AGP files (which maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises.
== Load coordinate systems into tables seq_region and coord_system ==
== Load coordinate systems into tables seq_region and coord_system ==


Necessary files for this example, see [[Image:EnsemblWorkshopFiles.tar.gz]]. The examples are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on [http://www.voeglein.de/gi_workshop_2008/Exercise%201.doc Jan Vogel's homepage] and in the tar archive referenced above, in the doc/ directory.
You need the fasta and AGP files for an assembly. Ensembl supports multiple coordinate systems: Any piece of DNA can be referenced by its chromosomal location (1:1000), or other systems, like its super_contig location (NT_039500:1-1000).


You need the fasta and AGP files for an assembly. Ensembl supports multiple coordinate systems: Any piece of DNA can be referenced by it's chromosomal location (1:1000), its super_contig location (NT_039500:1-1000) or other coordinates
Coordinate systems have a "rank" of importance (the higher the better), and a "version" (so the database contains information for several possible assemblies of the same contigs and annotations can be loaded that are based on several different versions).


Coordinate systems have a "rank" of importance (the higher the better), and a "version" (so the database contains information for several possible assemblies of the same contigs and annotations can be loaded that are based on several different versions)
Only sequence fragments marked as "toplevel" are shown on the final website (see below). Therefore, you might have never seen an individual contig NT_123453 on the website.


* set a little shortcut:
* The make things easier, let's set a little shortcut:
   export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
   export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
* Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
* Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
   mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
   mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
* Load coordinates and actual sequences into the empty core database:
* Load coordinates and actual sequences into the empty core database:
* chromosome -> super_contig mappings:
* chromosome -> super_contig mappings into seq_region and coord_system:
<code>perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp</code>
<code>perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp</code>
* super_contig -> contig mappings:
* super_contig -> contig mappings into seq_region and coord_system:
<code>perl $PS/load_seq_region.pl -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose</code>
<code>perl $PS/load_seq_region.pl -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose</code>
* See what's going on with:
* See what's going on with:
Line 20: Line 24:
   select * from coord_system  
   select * from coord_system  
   select * from dna;
   select * from dna;
* contigs:
* contigs (this is the only step that actually loads sequences into the "dna" table):
   perl $PS/load_seq_region.pl -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa </code>
   perl $PS/load_seq_region.pl -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa </code>
* clones (only this command loads sequences into the "dna" table):
* clones (only this command loads sequences into the "dna" table):
Line 29: Line 33:
   select * from dna;
   select * from dna;


* Delete version numbers from coord_system table for contig and clone:
* Delete version numbers from coord_system table for contig and clone (why is this necessary?):


   select * from coord_system ;
   select * from coord_system ;
Line 54: Line 58:
* check what's going on:
* check what's going on:
   select * from assembly
   select * from assembly
<tt>
<code>
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori |
Line 73: Line 77:
|                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 |
|                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
</tt>
</code>
The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).
The original documentation writes about this table: ''The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).''


* clone to contig (ignore all warnings)
* clone to contig (ignore all warnings)
   perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
   perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp


== Some nice additions ===
== Complete the assembly database tables ===
* As assemblies can contain references to external databases, we load a list of default references:
* As assemblies can contain references to external databases, we load a list of default references:
   perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt
   perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt
* contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, first we need to define the attribute "toplevel", then link the toplevel attribute to the unmapped contigs.  
* contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, first we need to define the attribute "toplevel", then link the toplevel attribute to the unmapped contigs.  
   perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl  $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt
   perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl  $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt
  perl $HOME/cvs_checkout/ensembl-pipeline/scripts/set_toplevel.pl $DBSPEC
* Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc), so we also populate the "unmapped_reasons" table, though you can skip this.  
* Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc), so we also populate the "unmapped_reasons" table, though you can skip this.  
   perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt
   perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt
== Check the sequences in the database ==
  perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp

Revision as of 13:44, 13 September 2010

Necessary files for this example, see File:EnsemblWorkshopFiles.tar.gz. The examples are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on Jan Vogel's homepage and in the tar archive referenced above, in the doc/ directory.

For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their AGP files (which maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises.

Load coordinate systems into tables seq_region and coord_system

You need the fasta and AGP files for an assembly. Ensembl supports multiple coordinate systems: Any piece of DNA can be referenced by its chromosomal location (1:1000), or other systems, like its super_contig location (NT_039500:1-1000).

Coordinate systems have a "rank" of importance (the higher the better), and a "version" (so the database contains information for several possible assemblies of the same contigs and annotations can be loaded that are based on several different versions).

Only sequence fragments marked as "toplevel" are shown on the final website (see below). Therefore, you might have never seen an individual contig NT_123453 on the website.

  • The make things easier, let's set a little shortcut:
 export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
  • Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
 mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
  • Load coordinates and actual sequences into the empty core database:
  • chromosome -> super_contig mappings into seq_region and coord_system:

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp

  • super_contig -> contig mappings into seq_region and coord_system:

perl $PS/load_seq_region.pl -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose

  • See what's going on with:
 select * from seq_region
 select * from coord_system 
 select * from dna;
  • contigs (this is the only step that actually loads sequences into the "dna" table):
 perl $PS/load_seq_region.pl -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa 
  • clones (only this command loads sequences into the "dna" table):

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp

  • See what's going on with:
 select * from seq_region
 select * from coord_system 
 select * from dna;
  • Delete version numbers from coord_system table for contig and clone (why is this necessary?):
 select * from coord_system ;
 update coord_system set version=NULL where name ='clone' ;
 update coord_system set version=NULL where name ='contig' ;
 select * from coord_system ;

+-----------------+-------------+---------+------+--------------------------------+ | coord_system_id | name        | version | rank | attrib                         | +-----------------+-------------+---------+------+--------------------------------+ |               1 | chromosome  | NCBIM37 |    1 | default_version                | |               2 | supercontig | NCBIM37 |    2 | default_version                | |               3 | contig      | NULL    |    3 | default_version,sequence_level | |               4 | clone       | NULL    |    4 | default_version                | +-----------------+-------------+---------+------+--------------------------------+

Load assembly information into MySQL tables assembly and meta

  • chromosome to contig mapping:
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp 
  • supercontig to contig (ignore all warnings):
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp 
  • check what's going on:
 select * from assembly

+-------------------+-------------------+-----------+-----------+-----------+---------+-----+ | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end   | cmp_start | cmp_end | ori | +-------------------+-------------------+-----------+-----------+-----------+---------+-----+ |                 1 |                14 | 129260521 | 129429106 |     41261 |  209846 |   1 | |                 2 |                17 |  69703858 |  69950060 |      2001 |  248203 |   1 | |                 3 |                13 |  94665450 |  94866948 |     22953 |  224451 |   1 | |                 4 |                16 |  21549325 |  21662718 |      2001 |  115394 |   1 | |                 4 |                16 |  21662719 |  21672889 |    115471 |  125641 |   1 | |                 5 |                15 |  81038621 |  81119472 |         1 |   80852 |  -1 | |                 6 |                18 |   3208471 |   3436586 |      2001 |  230116 |   1 | |                 7 |                15 |  23370008 |  23450859 |         1 |   80852 |  -1 | |                 8 |                14 |  81573483 |  81742068 |     41261 |  209846 |   1 | |                 9 |                13 |  44047803 |  44249301 |     22953 |  224451 |   1 | |                10 |                16 |  10375984 |  10489377 |      2001 |  115394 |   1 | |                10 |                16 |  10489378 |  10499548 |    115471 |  125641 |   1 | |                11 |                17 |  35208736 |  35454938 |      2001 |  248203 |   1 | |                12 |                18 |    208471 |    436586 |      2001 |  230116 |   1 | +-------------------+-------------------+-----------+-----------+-----------+---------+-----+ The original documentation writes about this table: The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).

  • clone to contig (ignore all warnings)
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file  /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp

Complete the assembly database tables =

  • As assemblies can contain references to external databases, we load a list of default references:
 perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt
  • contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, first we need to define the attribute "toplevel", then link the toplevel attribute to the unmapped contigs.
 perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl  $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt
 perl $HOME/cvs_checkout/ensembl-pipeline/scripts/set_toplevel.pl $DBSPEC
  • Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc), so we also populate the "unmapped_reasons" table, though you can skip this.
 perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt

Check the sequences in the database

 perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp