Building a new genome database

* Executable and Source Code Downloads -> http://hgdownload.soe.ucsc.edu/downloads.html#source_downloads
* Utilities, README -> http://genome-source.soe.ucsc.edu/gitlist/kent.git/blob/master/src/userApps/README



Latest revision as of 23:56, 12 February 2020

Prerequisites: you have the kent source tree checked out in your home directory ~/kent/src/ and you are familiar with the contents of the README files in ~/kent/src/product/README.*

You have built all of the utilities (see those README files in src/product: http://genome-source.soe.ucsc.edu/gitlist/kent.git/tree/master/src/product).

1. Organize your work. Decide on a database name. UCSC bases its names on binomial nomenclature. The UCSC naming scheme is abcDef1, where abc is the first three letters of the genus, Def is the first three letters of the species, and 1 is the version of the genome. Versions start at 1. At UCSC we use a symlink from /cluster/data/abcDef1 to a filesystem that has enough space to contain the build. Thus, all genome builds can be found under /cluster/data/ regardless of their actual NFS filesystem location. For this discussion, our new genome name will be simply abcDef1. A few files are kept in /cluster/data/abcDef1/, but most work files for tracks are kept in /cluster/data/abcDef1/bed/trackName/
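The naming scheme above can be sketched as a small shell function. This is only an illustration of the abcDef1 pattern as described here; the function name make_db_name and the sample genus and species are hypothetical, and real assemblies may deviate from the pattern.

```shell
# Hypothetical helper: build an abcDef1-style name from genus, species, version.
make_db_name() {
    genus=$1
    species=$2
    version=$3
    # first three letters of the genus, lower case
    g=$(printf '%s' "$genus" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    # first three letters of the species, lower case
    s=$(printf '%s' "$species" | cut -c1-3 | tr '[:upper:]' '[:lower:]')
    # capitalize the first letter of the species part
    first=$(printf '%s' "$s" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$s" | cut -c2-)
    printf '%s%s%s%s\n' "$g" "$first" "$rest" "$version"
}

make_db_name Abcus defus 1    # prints abcDef1
```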

2. You should have fasta file(s) of your genome sequence and an AGP file describing their construction from contigs into scaffolds, or scaffolds into chromosomes, or combinations thereof. Assemblers usually produce an AGP file. If no AGP file exists, one can be constructed purely from the fasta files with hgFakeAgp. To mark all N's as gaps in your fake AGP:

   $ hgFakeAgp -minContigGap=1 newGenome.fa abcDef1.agp

3. Convert your fasta to 2bit format:

   $ faToTwoBit newGenome.fa abcDef1.2bit
   $ mkdir /gbdb/abcDef1
   $ mkdir /gbdb/abcDef1/html
   $ ln -s `pwd`/abcDef1.2bit /gbdb/abcDef1/abcDef1.2bit

4. Verify that your AGP file matches your fasta file:

   $ sort -k1,1 -k2n,2n original.agp > abcDef1.agp
   $ checkAgpAndFa abcDef1.agp abcDef1.2bit
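Before running checkAgpAndFa, a quick coordinate check on the sorted AGP can catch obvious problems. The following is a minimal sketch, not a replacement for checkAgpAndFa: it assumes a tab-separated AGP sorted as above, with the object name, start, and end in columns 1 to 3, and it only verifies that each object starts at 1 and that consecutive lines abut. The function name check_agp_contiguous and the sample lines are hypothetical.

```shell
# Hypothetical pre-check: each object must start at 1 and every line must
# begin exactly one base after the previous line of the same object ends.
check_agp_contiguous() {
    awk -F'\t' '
        /^#/ { next }                              # skip AGP header comments
        $1 != obj { obj = $1; prevEnd = 0 }        # new object: reset
        $2 != prevEnd + 1 {
            printf "gap or overlap at %s:%d\n", $1, $2
            bad = 1
        }
        { prevEnd = $3 }
        END { exit bad + 0 }
    ' "$1"
}

# Small made-up AGP: contig, gap, contig on one chromosome.
tmp=$(mktemp)
printf 'chr1\t1\t100\t1\tW\tcontig1\t1\t100\t+\n'          >  "$tmp"
printf 'chr1\t101\t150\t2\tN\t50\tscaffold\tyes\tpaired-ends\n' >> "$tmp"
printf 'chr1\t151\t300\t3\tW\tcontig2\t1\t150\t+\n'        >> "$tmp"
check_agp_contiguous "$tmp" && echo "coordinates are contiguous"
rm -f "$tmp"
```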

5. Create a chromInfo file:

   $ twoBitInfo abcDef1.2bit stdout | sort -k2nr > chrom.sizes
   $ mkdir -p bed/chromInfo
   $ awk '{printf "%s\t%d\t/gbdb/abcDef1/abcDef1.2bit\n", $1, $2}' \
          chrom.sizes > bed/chromInfo/chromInfo.tab
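To see what the chromInfo.tab rows look like without running twoBitInfo, the sort and awk steps can be tried on made-up sizes. This is a sketch under assumptions: the function name make_chrom_info and the two sample chromosomes are hypothetical, and the database name is passed as a parameter instead of being hard-coded.

```shell
# Hypothetical helper: read "name<TAB>size" lines on stdin, sort by size
# (largest first), and emit chromInfo rows: name, size, path to the 2bit file.
make_chrom_info() {
    db=$1
    sort -k2,2nr |
        awk -v db="$db" '{printf "%s\t%d\t/gbdb/%s/%s.2bit\n", $1, $2, db, db}'
}

# Made-up sizes standing in for twoBitInfo output.
printf 'chr2\t500\nchr1\t1000\n' | make_chrom_info abcDef1
```

Each output row has three tab-separated fields, with the largest sequence first.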

6. Start your new database:

   (You may need to update your MySQL database permissions to allow access to your new database.)
   $ hgsql -e "create database abcDef1;" mysql

7. Load the grp table:

   $ hgsql abcDef1 < $HOME/kent/src/hg/lib/grp.sql

8. Load the chromInfo table:

   $ hgLoadSqlTab abcDef1 chromInfo $HOME/kent/src/hg/lib/chromInfo.sql \
             bed/chromInfo/chromInfo.tab

9. Load the gold and gap tables from your AGP file:

   $ hgGoldGapGl abcDef1 abcDef1.agp

10. Generate the gc5Base data and load table:

   $ mkdir bed/gc5Base
   $ cd bed/gc5Base
   $ hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 abcDef1 \
                 ../../abcDef1.2bit | wigEncode stdin gc5Base.{wig,wib}
   $ hgLoadWiggle -pathPrefix=/gbdb/abcDef1/wib \
                 abcDef1 gc5Base gc5Base.wig
   $ mkdir /gbdb/abcDef1/wib
   $ ln -s `pwd`/gc5Base.wib /gbdb/abcDef1/wib
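For intuition about what the gc5Base track holds, the idea of GC percent in 5-base windows can be illustrated on a short string. This sketch is an assumption-laden illustration only: it does not reproduce the hgGcPercent output format, it ignores gap handling (-doGaps), and the function name gc_windows and the sample sequence are hypothetical.

```shell
# Hypothetical illustration: percent G+C in consecutive, non-overlapping
# windows. Prints "windowStart<TAB>percent" with 1-based starts.
gc_windows() {
    printf '%s\n' "$1" | awk -v win="$2" '{
        seq = toupper($0)
        for (i = 1; i + win - 1 <= length(seq); i += win) {
            w = substr(seq, i, win)
            gc = gsub(/[GC]/, "", w)      # gsub returns the number of matches
            printf "%d\t%d\n", i, 100 * gc / win
        }
    }'
}

gc_windows GGCATATATAGCGCG 5
```

For the sample sequence the three windows are GGCAT, ATATA, and GCGCG, giving 60, 0, and 100 percent.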

11. Create the dbDb SQL insert statement. The orderKey is determined from existing dbDb entries. This creates the order of the pulldown menus in the gateway page. Place this into a file: dbDbInsert.sql and load it with the command: hgsql hgcentral < dbDbInsert.sql

INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName, taxId)
VALUES
    ("abcDef1", "July 2008", "/gbdb/abcDef1", "A. organism",
     "chr1:10459784-10469783", 1, 123, "A. organism", "Genus species",
     "/gbdb/abcDef1/html/description.html", 0, 0, "new genome version 1.0", 12345);

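If you build several genomes, the insert statement above can be generated from a few parameters. The following is a minimal sketch that assumes the same column list and placeholder values as the example; the function name make_dbdb_insert is hypothetical, and the arguments are not escaped, so they must not contain double quotes.

```shell
# Hypothetical helper: print a dbDb INSERT statement for one assembly.
# Arguments: db description organism scientificName orderKey taxId
make_dbdb_insert() {
    db=$1
    desc=$2
    organism=$3
    sciName=$4
    orderKey=$5
    taxId=$6
    cat <<EOF
INSERT INTO dbDb
    (name, description, nibPath, organism,
     defaultPos, active, orderKey, genome, scientificName,
     htmlPath, hgNearOk, hgPbOk, sourceName, taxId)
VALUES
    ("$db", "$desc", "/gbdb/$db", "$organism",
     "chr1:10459784-10469783", 1, $orderKey, "$organism", "$sciName",
     "/gbdb/$db/html/description.html", 0, 0, "new genome version 1.0", $taxId);
EOF
}

# Redirect to dbDbInsert.sql to use it with: hgsql hgcentral < dbDbInsert.sql
make_dbdb_insert abcDef1 "July 2008" "A. organism" "Genus species" 123 12345
```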
12. Create defaultDb and genomeClade table SQL entries. For example:

hgsql hgcentral -e 'INSERT INTO defaultDb (genome, name) VALUES ("A. organism", "abcDef1");'
hgsql hgcentral -e 'INSERT INTO genomeClade (genome, clade, priority) VALUES ("A. organism", "vertebrate", 123);'

The genomeClade.priority value helps choose the default genome for that clade. See examples of these values in the existing defaultDb and genomeClade tables. You can verify the hgcentral table relationships with a join on these tables:

hgsql -e "SELECT d.name,d.orderKey,g.genome,g.priority,g.clade,d.scientificName FROM
dbDb d, genomeClade g
WHERE d.organism = g.genome
ORDER by d.orderKey;" hgcentral


13. Make a trackDb hierarchy for your genome. Populate it with trackDb.ra files and a description.html file. See existing examples in the source tree ~/kent/src/hg/makeDb/trackDb/. Your hierarchy does not have to be in the source tree. You can load it into a separate database with the hgTrackDb and hgFindSpec commands. Reference this extra trackDb database via your cgi-bin/hg.conf options. The following page has information on trackDb files, which are identical for hubs and native assembly tracks: https://genome.ucsc.edu/goldenPath/help/hubQuickStart.html
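As a starting point, a trackDb.ra file can hold a single stanza for the gc5Base table loaded in step 10. The sketch below prints such a stanza; the labels, the group name map, and the function name make_trackdb_stanza are assumptions for illustration, and the group must match an entry in your grp table. Compare against the existing examples in the source tree before relying on it.

```shell
# Hypothetical helper: print a minimal trackDb.ra stanza for gc5Base.
make_trackdb_stanza() {
    cat <<'EOF'
track gc5Base
shortLabel GC Percent
longLabel GC Percent in 5-Base Windows
group map
visibility hide
type wig 0 100
EOF
}

# Redirect to trackDb.ra inside your trackDb hierarchy to use it.
make_trackdb_stanza
```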

See also

Minimal Browser Installation