Building a new genome database

Revision as of 18:08, 5 October 2010

Prerequisites: you have the kent source tree checked out in your home directory ~/kent/src/ and you are familiar with the contents of the README files in ~/kent/src/product/README.*
You have built all of the utilities (see those README files ...)


1. Organize your work. Decide on a database name. UCSC bases its names on binomial nomenclature. The naming scheme is abcDef1, where abc is the first three letters of the genus, Def is the first three letters of the species, and 1 is the version of the genome. Versions start at 1. At UCSC, /cluster/data/abcDef1 is a symlink to a filesystem with enough space to contain the build, so every genome build can be found under /cluster/data/ regardless of its actual NFS filesystem location. For this discussion, the new genome name is simply abcDef1. A few files are kept in /cluster/data/abcDef1/, but most work files for tracks are kept in /cluster/data/abcDef1/bed/trackName/
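
As an illustration of the naming scheme, the sketch below derives a database name from a genus, a species and a version number. The function name make_db_name is hypothetical; it is not part of the kent tools, and it assumes plain ASCII names.

```shell
# Hypothetical helper (not a kent utility): build an abcDef1-style name.
# Usage: make_db_name Genus species [version]
make_db_name() {
    # first three letters of the genus, lower case
    genus=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | cut -c1-3)
    # first three letters of the species, lower case
    species=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | cut -c1-3)
    # capitalize the first letter of the species part
    first=$(printf '%s' "$species" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$species" | cut -c2-)
    # version defaults to 1 when not given
    printf '%s%s%s%s\n' "$genus" "$first" "$rest" "${3:-1}"
}

make_db_name Danio rerio 1    # prints danRer1
```

Not every UCSC database follows this pattern exactly, so treat the result as a starting point and check it against existing names.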

2. You should have fasta file(s) of your genome sequence and an AGP file describing their construction from contigs into scaffolds, or scaffolds into chromosomes, or combinations thereof. Assemblers usually produce an AGP file. If no AGP file exists, one can be constructed purely from the fasta file with hgFakeAgp. To mark all N's as gaps in your fake AGP:

   $ hgFakeAgp -minContigGap=1 newGenome.fa abcDef1.agp

3. Convert your fasta to 2bit format:

   $ faToTwoBit newGenome.fa abcDef1.2bit
   $ mkdir /gbdb/abcDef1
   $ mkdir /gbdb/abcDef1/html
   $ ln -s `pwd`/abcDef1.2bit /gbdb/abcDef1/abcDef1.2bit

4. Verify that your AGP file matches your fasta file:

   $ sort -k1,1 -k2n,2n original.agp > abcDef1.agp
   $ checkAgpAndFa abcDef1.agp abcDef1.2bit
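
If checkAgpAndFa reports a problem, a rough first check is whether the largest end coordinate of each object in the AGP equals the length of that sequence. The sketch below is only a partial sanity check, not a replacement for checkAgpAndFa. The function name agp_ends_match_sizes is hypothetical, and it assumes a tab-separated AGP file plus a two-column name/size file such as the chrom.sizes file made in step 5.

```shell
# Hypothetical helper (not a kent utility).
# Usage: agp_ends_match_sizes file.agp sizes.tab
# Returns 0 when every AGP object ends exactly at its listed size.
agp_ends_match_sizes() {
    awk -F'\t' '
        NR == FNR { size[$1] = $2; next }        # first file: name<TAB>size
        /^#/ { next }                            # skip AGP comment lines
        $3 + 0 > end[$1] + 0 { end[$1] = $3 }    # largest object_end per object
        END {
            bad = 0
            for (name in end)
                if (!(name in size) || end[name] + 0 != size[name] + 0) {
                    printf "mismatch: %s agp_end=%s size=%s\n", name, end[name], size[name]
                    bad = 1
                }
            exit bad
        }
    ' "$2" "$1"
}

# Example call, once both files exist:
#   agp_ends_match_sizes abcDef1.agp chrom.sizes && echo "ends agree"
```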

5. Create a chromInfo file:

   $ twoBitInfo abcDef1.2bit stdout | sort -k2nr > chrom.sizes
   $ mkdir -p bed/chromInfo
   $ awk '{printf "%s\t%d\t/gbdb/abcDef1/abcDef1.2bit\n", $1, $2}' \
          chrom.sizes > bed/chromInfo/chromInfo.tab
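
To see what the awk command writes, the sketch below runs the same transformation on made-up sizes. Each output row has three tab-separated columns: the sequence name, its size, and the path to the 2bit file. The function name chrom_info_rows and the sample sizes are assumptions for illustration only.

```shell
# Hypothetical wrapper around the awk command above.
# Reads name<TAB>size lines on stdin, writes chromInfo rows for database $1.
chrom_info_rows() {
    awk -v db="$1" '{printf "%s\t%d\t/gbdb/%s/%s.2bit\n", $1, $2, db, db}'
}

# Made-up sizes, largest first, as sort -k2nr would order them.
printf 'chr1\t2000\nchr2\t1500\nchrM\t16\n' | chrom_info_rows abcDef1
```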

6. Start your new database:

   $ hgsql -e "create database abcDef1;" mysql

7. Load the grp table:

   $ hgsql abcDef1 < $HOME/kent/src/hg/lib/grp.sql

8. Load the chromInfo table:

   $ hgLoadSqlTab abcDef1 chromInfo $HOME/kent/src/hg/lib/chromInfo.sql \
             bed/chromInfo/chromInfo.tab

9. Load the gold and gap tables from your AGP file:

   $ hgGoldGapGl abcDef1 abcDef1.agp

10. Generate the gc5Base data and load table:

   $ mkdir bed/gc5Base
   $ hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 abcDef1 \
                 abcDef1.2bit | wigEncode stdin bed/gc5Base/gc5Base.{wig,wib}
   $ hgLoadWiggle -pathPrefix=/gbdb/abcDef1/wib \
                 abcDef1 gc5Base bed/gc5Base/gc5Base.wig
   $ mkdir /gbdb/abcDef1/wib
   $ ln -s `pwd`/bed/gc5Base/gc5Base.wib /gbdb/abcDef1/wib

11. Create the dbDb SQL insert statement. The orderKey sets the position of this genome in the pulldown menus on the gateway page; choose its value by looking at existing dbDb entries. Place the statement in a file named dbDbInsert.sql and load it with the command: hgsql hgcentral < dbDbInsert.sql

INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName)
VALUES
    ("abcDef1", "July 2008", "/gbdb/abcDef1", "A. organism",
     "chr1:10459784-10469783", 1, 123, "A. organism", "Genus species",
     "/gbdb/abcDef1/html/description.html", 0, 0, "new genome version 1.0");

12. Create defaultDb and genomeClade table SQL entries. For example:

INSERT INTO defaultDb (genome, name) VALUES ("A. organism", "abcDef1");
INSERT INTO genomeClade (genome, clade, priority) VALUES ("A. organism", "vertebrate", 123);

The genomeClade.priority value helps choose the default genome for that clade. See examples of these values in the existing defaultDb and genomeClade tables.
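
When setting up several genomes, the two statements can be generated rather than typed by hand. The sketch below is one possible way; the function name central_inserts is hypothetical, and it does no quoting or escaping, so it assumes the values contain no double quotes.

```shell
# Hypothetical helper (not a kent utility).
# Usage: central_inserts "genome" dbName clade priority
central_inserts() {
    printf 'INSERT INTO defaultDb (genome, name) VALUES ("%s", "%s");\n' "$1" "$2"
    printf 'INSERT INTO genomeClade (genome, clade, priority) VALUES ("%s", "%s", %d);\n' "$1" "$3" "$4"
}

# Prints the two statements; redirect to a file to load it with hgsql hgcentral.
central_inserts "A. organism" abcDef1 vertebrate 123
```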

13. Make a trackDb hierarchy for your genome. Populate it with trackDb.ra files and a description.html file. See existing examples in the source tree ~/kent/src/hg/makeDb/trackDb/. Your hierarchy does not have to be in the source tree. You can load it into a separate database with the hgTrackDb and hgFindSpec commands, then point the browser at this extra trackDb database through the options in your cgi-bin/hg.conf.

See also

Minimal Browser Installation