Building a new genome database

From genomewiki
Revision as of 02:32, 24 July 2008 by Hiram (talk | contribs)

Prerequisites: you have the kent source tree checked out in your home directory at ~/kent/src/ and you are familiar with the contents of the README files in ~/kent/src/product/README.*


1. Organize your work. Decide on a database name. UCSC bases its names on binomial nomenclature. The naming scheme is abcDef1, where abc is the first three letters of the genus, Def is the first three letters of the species, and 1 is the version of the genome. Versions start at 1. At UCSC, /cluster/data/abcDef1 is a symlink to a filesystem that has enough space to contain the build, so every genome build can be found under /cluster/data/ regardless of its actual NFS filesystem location. For this discussion, the new genome name is simply abcDef1. A few files are kept in /cluster/data/abcDef1/ itself, but most work files for tracks are kept in /cluster/data/abcDef1/bed/trackName/
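The naming rule above can be sketched as a small shell function. This is only an illustration: make_db_name is a hypothetical helper, not part of the kent tree, and it assumes plain ASCII genus and species names.

```shell
# Hypothetical helper (not a kent utility): derive an abcDef1-style name
# from a genus, a species, and an optional version number (default 1).
make_db_name() {
    genus=$1
    species=$2
    version=${3:-1}
    # First three letters of the genus, lowercased.
    g=$(printf '%s' "$genus" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    # First three letters of the species, lowercased, then first letter capitalized.
    s=$(printf '%s' "$species" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    first=$(printf '%s' "$s" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$s" | cut -c2-3)
    printf '%s%s%s%s\n' "$g" "$first" "$rest" "$version"
}

make_db_name Gallus gallus 1    # prints galGal1
```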

2. You should have fasta file(s) of your genome sequence and an AGP file describing their construction from contigs into scaffolds, or scaffolds into chromosomes, or combinations thereof. If no AGP file exists, one can be constructed purely from the fasta files. Assemblers usually produce an AGP file.
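As a rough illustration of building an AGP file from fasta alone, the sketch below writes one AGP line per sequence and treats each sequence as a single ungapped component. The function name fa_to_simple_agp, the component type W, and the _ctg1 component suffix are assumptions made for this example. It does not split sequences at runs of N, so it is only a starting point for assemblies that contain gaps.

```shell
# Hypothetical helper: emit a minimal AGP with one component line per
# fasta record. Columns: object, start, end, part number, component type,
# component id, component start, component end, orientation.
fa_to_simple_agp() {
    awk '
        function emit() {
            if (name != "" && len > 0)
                printf "%s\t1\t%d\t1\tW\t%s_ctg1\t1\t%d\t+\n", name, len, name, len
        }
        /^>/ { emit(); name = substr($1, 2); len = 0; next }  # new record
        { len += length($0) }                                   # sequence line
        END { emit() }
    ' "$1"
}

# Usage (file names as in the steps of this page):
#   fa_to_simple_agp newGenome.fa > original.agp
```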

3. Convert your fasta to 2bit format:

	$ faToTwoBit newGenome.fa abcDef1.2bit
	$ mkdir /gbdb/abcDef1
	$ ln -s `pwd`/abcDef1.2bit /gbdb/abcDef1/abcDef1.2bit

4. Verify that your AGP file matches your fasta file:

	$ sort -k1,1 -k2n,2n original.agp > abcDef1.agp
	$ checkAgpAndFa abcDef1.agp abcDef1.2bit
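Before running the check, it can help to confirm that the sorted AGP is internally consistent. The sketch below approximates just one such property: within each object, the first line starts at 1 and every following line starts one base after the previous line ends. check_agp_contiguous is a hypothetical helper, not a kent utility, and it is no substitute for checkAgpAndFa, which also compares the AGP against the sequence itself.

```shell
# Hypothetical helper: report AGP lines whose object start coordinate does
# not follow on from the previous line of the same object. Expects input
# sorted by object name and start, as produced by the sort command above.
check_agp_contiguous() {
    awk -F'\t' '
        /^#/ { next }                      # skip AGP comment lines
        {
            expected = ($1 == prev) ? prev_end + 1 : 1
            if ($2 != expected) {
                printf "%s: line %d starts at %d, expected %d\n", $1, NR, $2, expected
                bad = 1
            }
            prev = $1; prev_end = $3
        }
        END { exit (bad ? 1 : 0) }         # non-zero exit if any problem was seen
    ' "$1"
}

# Usage:
#   check_agp_contiguous abcDef1.agp && echo "coordinates are contiguous"
```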

5. Create a chromInfo file:

	$ twoBitInfo abcDef1.2bit stdout | sort -k2nr > chrom.sizes
	$ mkdir -p bed/chromInfo
	$ awk '{printf "%s\t%d\t/gbdb/abcDef1/abcDef1.2bit\n", $1, $2}' \
		chrom.sizes > bed/chromInfo/chromInfo.tab
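To see what the resulting file looks like, the same awk formatting can be run on a tiny, made-up chrom.sizes. In this sketch, make_chrom_info is a hypothetical wrapper that takes the database name as a parameter, and the two sequence sizes are invented sample data.

```shell
# Hypothetical wrapper: read chrom.sizes lines (name, size) on stdin and
# write chromInfo rows (name, size, path to the 2bit file) on stdout.
make_chrom_info() {
    db=$1
    awk -v db="$db" '{printf "%s\t%d\t/gbdb/%s/%s.2bit\n", $1, $2, db, db}'
}

# Invented two-sequence example, largest first as in chrom.sizes:
printf 'chr1\t6\nchr2\t2\n' | make_chrom_info abcDef1
```

Each output row has three tab-separated fields: the sequence name, its size, and the /gbdb/ path of the 2bit file.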

6. Start your new database:

	$ hgsql -e "create database abcDef1;" mysql