Ensembl minimum install
Revision as of 13:00, 13 September 2010
Load sequences into MySQL tables
For the files needed in this example, see File:EnsemblWorkshopFiles.tar.gz.
You need the FASTA and AGP files for an assembly. Ensembl supports multiple coordinate systems: any piece of DNA can be referenced by its chromosomal location (1:1000), its super_contig location (NT_039500:1-1000), or other coordinates.
Each coordinate system has a "rank" (1 is the top level, here the chromosome; larger numbers are lower levels) and a "version", so one database can hold several possible assemblies of the same contigs, and annotations based on different assembly versions can be loaded side by side.
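To make the relationship between coordinate systems concrete, here is a minimal sketch of how an AGP file expresses it. Each component line says which stretch of a lower-level sequence (columns 6 to 9) fills which stretch of a higher-level object (columns 1 to 3). The identifiers and coordinates below are made up for illustration and are not taken from the workshop files; agp_sample and agp_mappings are hypothetical helper names.

```shell
# Hypothetical AGP lines (tab-separated); identifiers and coordinates are invented.
agp_sample() {
  printf 'chr1\t1\t5000\t1\tF\tAC000001.1\t1\t5000\t+\n'
  printf 'chr1\t5001\t5100\t2\tN\t100\tfragment\tyes\n'
  printf 'chr1\t5101\t9100\t3\tF\tAC000002.1\t101\t4100\t-\n'
}

# Print one mapping per component line.
# Comment lines (#) and gap lines (component type N or U) are skipped.
agp_mappings() {
  awk -F'\t' '!/^#/ && $5 != "N" && $5 != "U" {
    printf "%s:%d-%d <= %s:%d-%d (%s)\n", $1, $2, $3, $6, $7, $8, $9
  }'
}

agp_sample | agp_mappings
# chr1:1-5000 <= AC000001.1:1-5000 (+)
# chr1:5101-9100 <= AC000002.1:101-4100 (-)
```

To inspect a real file the same way, pipe it through the helper instead of the sample, for example `agp_mappings < mini_chr_contig.agp`.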
- Set a shortcut for the database connection options (note: no "$" in front of the variable name when assigning it):

export DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"
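The shortcut only works if the variable is expanded without quotes, so that the shell (bash or sh) splits it into separate options. A small sketch of the difference; count_args is a hypothetical helper used only for this demonstration.

```shell
# Connection options exactly as used on this page.
DBSPEC="-dbhost 127.0.0.1 -dbuser ens-training -dbport 3306 -dbname mouse37_mini_ref -dbpass workshop"

# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

count_args $DBSPEC     # unquoted: split into 10 separate arguments
count_args "$DBSPEC"   # quoted: passed on as 1 long string
```

This is why the commands on this page write $DBSPEC and not "$DBSPEC".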
- Create an empty database named mouse37_mini_ref and populate it with the CORE schema:
 
mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref < $HOME/cvs_checkout/ensembl/sql/table.sql
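The command above loads the schema into a database that must already exist. If it does not, something like the following sketch could create it first. It assumes the ens-training account is allowed to create databases, which this page does not state; DB_NAME, MYSQL_OPTS and CREATE_CMD are names introduced here for illustration.

```shell
# Assumption: same server and credentials as elsewhere on this page.
DB_NAME=mouse37_mini_ref
MYSQL_OPTS="-uens-training -pworkshop -h127.0.0.1 -P3306"

# Build the command as a string so it can be inspected before running it.
CREATE_CMD="mysql $MYSQL_OPTS -e 'CREATE DATABASE IF NOT EXISTS $DB_NAME'"
echo "$CREATE_CMD"

# Uncomment to actually run it against the server:
# eval "$CREATE_CMD"
```

Once the database exists, load table.sql into it as shown above.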
- Load coordinates and actual sequences into the empty core database ($PS must point to the directory that contains load_seq_region.pl):
 - chromosome -> super_contig mappings:
 
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name chromosome -coord_system_version NCBIM37 -rank 1 -default_version -agp_file $HOME/workshop/genebuild/assembly/mini_chr_contig.agp
- super_contig -> contig mappings:
 
perl $PS/load_seq_region.pl $DBSPEC -coord_system_name supercontig -default_version -rank 2 -coord_system_version NCBIM37 -agp_file $HOME/workshop/genebuild/assembly/mini_supercontig_contig.agp -verbose
- See what's going on (in the mysql client) with:

select * from seq_region;
select * from coord_system;
select * from dna;
- contigs (the only step that loads sequences into the "dna" table, because it passes -sequence_level and -fasta_file):

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name contig -default_version -rank 3 -sequence_level -coord_system_version NCBIM37 -fasta_file /home/ensembl/workshop/genebuild/assembly/clones_finished_mini.fa
- clones (coordinates only, read from the AGP file):

perl $PS/load_seq_region.pl $DBSPEC -coord_system_name clone -default_version -coord_system_version NCBIM37 -rank 4 -agp_file /home/ensembl/workshop/genebuild/assembly/mini_clone_contig.agp
- See what's going on (in the mysql client) with:

select * from seq_region;
select * from coord_system;
select * from dna;
- Note that DNA sequences are not stored as a raw ASCII string in MySQL.
- Delete the version numbers from the coord_system table for contig and clone:

mysql -uens-training -pworkshop -h127.0.0.1 -P3306 -D mouse37_mini_ref
select * from coord_system;
update coord_system set version=NULL where name='clone';
update coord_system set version=NULL where name='contig';
select * from coord_system;
+-----------------+-------------+---------+------+--------------------------------+
| coord_system_id | name        | version | rank | attrib                         |
+-----------------+-------------+---------+------+--------------------------------+
|               1 | chromosome  | NCBIM37 |    1 | default_version                |
|               2 | supercontig | NCBIM37 |    2 | default_version                |
|               3 | contig      | NULL    |    3 | default_version,sequence_level |
|               4 | clone       | NULL    |    4 | default_version                |
+-----------------+-------------+---------+------+--------------------------------+