Ensembl minimum install: Difference between revisions
No edit summary |
No edit summary |
||
(13 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
This example uses some demo files prepared by Jan Vogel and Bronwen Aken. Download and extract them from [[Image:EnsemblWorkshopFiles.tar.gz]], then run your tcsh and do "source set_perl_path.sh". | |||
The example commands on this page are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on [http://www.voeglein.de/gi_workshop_2008/Exercise%201.doc Jan Vogel's homepage] and in the tar archive referenced above, in the doc/ directory. | |||
For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml#FORMAT AGP] files (a golden path = maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises. | |||
Another reason for not loading a whole chromosome into a single MySQL row is that the maximum size of mysql packets is ~10MB (depending on your settings). In addition, MySQL's get-substring implementation will always retrieve the whole row which creates slows down retrieval of short sequences. These are reasons why usually 10-200kb sequences are loaded into the databases and these fragments are then linked by .AGP files. | |||
* set a little shortcut: | == Load coordinate systems into tables seq_region and coord_system == | ||
export | |||
You need the fasta and AGP files for an assembly. Ensembl supports multiple coordinate systems: Any piece of DNA can be referenced by its chromosomal location (1:1000), or other systems, like its super_contig location (NT_039500:1-1000). | |||
Coordinate systems have a "rank" of importance (the higher the better), and a "version" (so the database contains information for several possible assemblies of the same contigs and annotations can be loaded that are based on several different versions). | |||
Only sequence fragments marked as "toplevel" are shown on the final website (see below). Therefore, you might have never seen an individual contig NT_123453 on the website. | |||
* The make things easier, let's set a little shortcut: | |||
export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop" | |||
* Create an empty database named mouse37_mini_ref and populate it with the CORE schema: | * Create an empty database named mouse37_mini_ref and populate it with the CORE schema: | ||
mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql | mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql | ||
* Load coordinates and actual sequences into the empty core database: | * Load coordinates and actual sequences into the empty core database: | ||
* chromosome -> super_contig mappings into seq_region and coord_system: | |||
<code>perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp</code> | <code>perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp</code> | ||
* super_contig -> contig mappings into seq_region and coord_system: | |||
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version \ | |||
-rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose | |||
* See what's going on with: | |||
select * from seq_region | |||
select * from coord_system | select * from coord_system | ||
select * from dna; | select * from dna; | ||
* contigs (this is the only step that actually loads sequences into the "dna" table): | |||
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa | |||
* clones : | |||
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 \ | |||
-rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp | |||
* See what's going on with: | |||
select * from seq_region | |||
select * from coord_system | |||
select * from dna; | |||
* Delete version numbers from coord_system table for contig and clone (why is this necessary?): | |||
select * from coord_system ; | |||
update coord_system set version=NULL where name ='clone' ; | |||
update coord_system set version=NULL where name ='contig' ; | |||
select * from coord_system ; | |||
<code> | |||
+-----------------+-------------+---------+------+--------------------------------+ | |||
| coord_system_id | name | version | rank | attrib | | |||
+-----------------+-------------+---------+------+--------------------------------+ | |||
| 1 | chromosome | NCBIM37 | 1 | default_version | | |||
| 2 | supercontig | NCBIM37 | 2 | default_version | | |||
| 3 | contig | NULL | 3 | default_version,sequence_level | | |||
| 4 | clone | NULL | 4 | default_version | | |||
+-----------------+-------------+---------+------+--------------------------------+ | |||
</code> | |||
== Load assembly information into MySQL tables assembly and meta == | |||
* chromosome to contig mapping: | |||
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp | |||
* supercontig to contig (ignore all warnings): | |||
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp | |||
* check what's going on: | |||
select * from assembly | |||
<code> | |||
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+ | |||
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end | cmp_start | cmp_end | ori | | |||
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+ | |||
| 1 | 14 | 129260521 | 129429106 | 41261 | 209846 | 1 | | |||
| 2 | 17 | 69703858 | 69950060 | 2001 | 248203 | 1 | | |||
| 3 | 13 | 94665450 | 94866948 | 22953 | 224451 | 1 | | |||
| 4 | 16 | 21549325 | 21662718 | 2001 | 115394 | 1 | | |||
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+ | |||
</code> | |||
The original documentation writes about this table: ''The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).'' | |||
* clone to contig (ignore all warnings) | |||
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp | |||
== Complete the assembly database tables === | |||
* As assemblies can contain references to external databases, we load a list of default references: | |||
perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt | |||
* contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, first we need to define the attribute "toplevel", then link the toplevel attribute to the unmapped contigs. | |||
perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt | |||
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/set_toplevel.pl $DBSPEC | |||
* Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc), so we also populate the "unmapped_reasons" table, though you can skip this. | |||
perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt | |||
== Check the sequences in the database == | |||
Output all sequences that are part of a coordinate system to fasta files in a given directory: | |||
perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp |
Latest revision as of 10:38, 15 September 2010
This example uses some demo files prepared by Jan Vogel and Bronwen Aken. Download and extract them from File:EnsemblWorkshopFiles.tar.gz, then run your tcsh and do "source set_perl_path.sh".
The example commands on this page are just copy/pasted from the workshop documentation prepared by Bronwen Aken and Jan Vogel, with some notes added by myself. The original documentation can be found on Jan Vogel's homepage and in the tar archive referenced above, in the doc/ directory.
For historical reasons, Ensembl is not loading the big NCBI chromosome fasta files, but only the small contigs and their AGP files (a golden path = maps contigs -> chromosomes). The following sections are describing the loading of this whole "coordinate system" with its sequences. It will also load some default data into three tables to bring the database into a state that is accepted by next exercises.
Another reason for not loading a whole chromosome into a single MySQL row is that the maximum size of mysql packets is ~10MB (depending on your settings). In addition, MySQL's get-substring implementation will always retrieve the whole row which creates slows down retrieval of short sequences. These are reasons why usually 10-200kb sequences are loaded into the databases and these fragments are then linked by .AGP files.
Load coordinate systems into tables seq_region and coord_system
You need the fasta and AGP files for an assembly. Ensembl supports multiple coordinate systems: Any piece of DNA can be referenced by its chromosomal location (1:1000), or other systems, like its super_contig location (NT_039500:1-1000).
Coordinate systems have a "rank" of importance (the higher the better), and a "version" (so the database contains information for several possible assemblies of the same contigs and annotations can be loaded that are based on several different versions).
Only sequence fragments marked as "toplevel" are shown on the final website (see below). Therefore, you might have never seen an individual contig NT_123453 on the website.
- The make things easier, let's set a little shortcut:
export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
- Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
- Load coordinates and actual sequences into the empty core database:
- chromosome -> super_contig mappings into seq_region and coord_system:
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp
- super_contig -> contig mappings into seq_region and coord_system:
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version \ -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose
- See what's going on with:
select * from seq_region select * from coord_system select * from dna;
- contigs (this is the only step that actually loads sequences into the "dna" table):
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa
- clones :
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 \ -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
- See what's going on with:
select * from seq_region select * from coord_system select * from dna;
- Delete version numbers from coord_system table for contig and clone (why is this necessary?):
select * from coord_system ; update coord_system set version=NULL where name ='clone' ; update coord_system set version=NULL where name ='contig' ; select * from coord_system ;
+-----------------+-------------+---------+------+--------------------------------+
| coord_system_id | name | version | rank | attrib |
+-----------------+-------------+---------+------+--------------------------------+
| 1 | chromosome | NCBIM37 | 1 | default_version |
| 2 | supercontig | NCBIM37 | 2 | default_version |
| 3 | contig | NULL | 3 | default_version,sequence_level |
| 4 | clone | NULL | 4 | default_version |
+-----------------+-------------+---------+------+--------------------------------+
Load assembly information into MySQL tables assembly and meta
- chromosome to contig mapping:
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name chromosome -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_chr_contig.agp
- supercontig to contig (ignore all warnings):
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name supercontig -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_supercontig_contig.agp
- check what's going on:
select * from assembly
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end | cmp_start | cmp_end | ori |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
| 1 | 14 | 129260521 | 129429106 | 41261 | 209846 | 1 |
| 2 | 17 | 69703858 | 69950060 | 2001 | 248203 | 1 |
| 3 | 13 | 94665450 | 94866948 | 22953 | 224451 | 1 |
| 4 | 16 | 21549325 | 21662718 | 2001 | 115394 | 1 |
+-------------------+-------------------+-----------+-----------+-----------+---------+-----+
The original documentation writes about this table: The column asm_seq_region_id links to the seq_region table and refers to the assembled sequence (longer sequence). The column cmp_seq_region_id links to the seq_region table and refers to the component sequence (shorter sequence).
- clone to contig (ignore all warnings)
perl $HOME/cvs_checkout/ensembl-pipeline/scripts/load_agp.pl $DBSPEC -assembled_name clone -component_name contig -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
Complete the assembly database tables =
- As assemblies can contain references to external databases, we load a list of default references:
perl $HOME/cvs_checkout/ensembl/misc-scripts/external_db/update_external_dbs.pl $DBSPEC -nonreleasemode -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/external_db/external_dbs.txt
- contigs that are not located on chromosomes but exist by themselves need the toplevel attribute set. For this, first we need to define the attribute "toplevel", then link the toplevel attribute to the unmapped contigs.
perl $HOME/cvs_checkout/ensembl/misc-scripts/attribute_types/upload_attributes.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/attribute_types/attrib_type.txt perl $HOME/cvs_checkout/ensembl-pipeline/scripts/set_toplevel.pl $DBSPEC
- Ensembl has a system to track sequences that are not mapped at all (ESTs, cDNAs, contigs, etc), so we also populate the "unmapped_reasons" table, though you can skip this.
perl $HOME/cvs_checkout/ensembl/misc-scripts/unmapped_reason/update_unmapped_reasons.pl $DBSPEC -file /home/ensembl/cvs_checkout/ensembl/misc-scripts/unmapped_reason/unmapped_reason.txt
Check the sequences in the database
Output all sequences that are part of a coordinate system to fasta files in a given directory:
perl $HOME/cvs_checkout/ensembl-analysis/scripts/sequence_dump.pl $DBSPEC -coord_system_name chromosome -output_dir /tmp