Building GenArk genomes
GenArk
The Genome Archive (GenArk) are assembly hubs that come pre-loaded with several annotation tracks, gene models, and the ability to align genomic sequence to the reference assembly using the BLAT alignment tool. The genomes in the GenArk are sourced from NCBI RefSeq, the Vertebrate Genomes Project (VGP), and other projects.
Steps from Hiram's Talk (by Gerardo) 2/2/2022
See Recorded Demo here: https://genome-test.gi.ucsc.edu/~hiram/zoomRecordings/2022-02-02.GenArkBuild/video1137793487.mp4
Before starting
Check to see if you have the goto and gotos command. If not, add the following to your .bashrc file:
function goto() { export destDir="" export asmId=$1 case "${asmId}" in GC*) gcX=${asmId:0:3} d0=${asmId:4:3} d1=${asmId:7:3} d2=${asmId:10:3} if [ "${gcX}" = "GCF" ]; then destDir="/hive/data/genomes/asmHubs/refseqBuild/${gcX}/${d0}/${d1}/${d2}" else destDir="/hive/data/genomes/asmHubs/genbankBuild/${gcX}/${d0}/${d1}/${d2}" fi export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null` if [ -d "${cdDir}" ]; then destDir="${cdDir}/trackData" else printf "# can not find ${destDir}/${asmId}*\n" 1>&2 fi ;; *) destDir="/hive/data/genomes/${asmId}/bed" if [ -d "/hive/data/genomes/${asmId}/trackData" ]; then destDir="/hive/data/genomes/${asmId}/trackData" fi ;; esac if [ -d "${destDir}" ]; then printf "cd $destDir\n" 1>&2 cd "$destDir" else printf "# can not find ${destDir}\n" 1>&2 fi } function gotos() { export asmId=$1 export gcX=${asmId:0:3} export d0=${asmId:4:3} export d1=${asmId:7:3} export d2=${asmId:10:3} export destDir="/hive/data/outside/ncbi/genomes/${gcX}/${d0}/${d1}/${d2}" export dirCount=`ls -d ${destDir}/${asmId}* 2> /dev/null | wc -l` if [ "${dirCount}" -ne 1 ]; then printf "# can not find ${destDir}/${asmId}*\n" 1>&2 else export cdDir=`ls -d ${destDir}/${asmId}* 2> /dev/null` if [ -d "${cdDir}" ]; then printf "cd $cdDir\n" 1>&2 cd "$cdDir" else printf "# can not find ${destDir}/${asmId}*\n" 1>&2 fi fi }
Finding the info for the runBuild command
Hiram recommends checking if a genome browser already exists in the source tree (grep number part of the GCA/GCF ID: 002776525).
Example:
$ grep 002776525 ~/kent/src/hg/makeDb/doc/*AsmHub/*.tsv
Hiram has list of genomes pre-ready to be run in the allBuild directory. There are GenBank versions or RefSeq versions. Decide which one is best or most up to date, RefSeq is always first choice.
Genbank: /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.*
RefSeq: /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*
grep number part of the GCA/GCF ID: 002776525 to find genomes that are pre-ready to be run:
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/rs.todo.*
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.* /hive/data/outside/ncbi/genomes/reports/newAsm/gb.todo.primates.txt:./runBuild GCA_002776525.5_ASM277652v5 primates Piliocolobus_tephrosceles 2019_12_12
If it doesn't show up in the list of genomes pre-ready to be run, check the 'master' listings from NCBI:
$ grep 002776525 /hive/data/outside/ncbi/genomes/reports/assembly_summary*.txt
Running the runBuild command
Move to a screen:
$ screen -S screenName
The command can be run anywhere but preferably accumulated into one directory.
$ cd /hive/data/genomes/asmHubs/allBuild
The runBuild command template (scientific name should have underscores instead of blanks):
time (./runBuild assemblyName clade Scientific_Name)
Running the command:
$ time (./runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea) >> GCA_001632725.1.log 2>&1
Checking progress
Check progress by looking at what track is getting built in the trackData/ directory for the assembly. Use goto command to find the directory:
$ goto GCA_001632725.1
$ ls addMask allGaps assemblyGap augustus chromAlias cpgIslands cytoBand gapOverlap gc5Base idKeys ncbiGene repeatMasker simpleRepeat tandemDups windowMasker xenoRefGene
You can confirm the run finished when it makes a trackDb:
GCA_001632725.1_ASM163272v1.trackDb.txt
You will get an email when the run finishes: genArk build done: GCA_001632725.1_ASM163272v1
Post runBuild steps
Saved the command in a file:
$ cd /hive/data/genomes/asmHubs/allBuild/history $ echo './runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea' > GCA_001632725.1.list
After run finishes, you will need to add and commit the runBuild command to the master run list in the source tree and ordered by GC identifier:
Edit: ~/kent/src/hg/makeDb/doc/asmHubs/master.run.list
Add:
./runBuild GCA_001632725.1_ASM163272v1 invertebrate Corbicula_fluminea
Commit changes, ex.
$ git commit -m 'adding GCA_001632725.1_ASM163272v1 per user request, refs #29545'
Add and commit an entry with the assembly name and common name in the orderList.tsv file. Entry should be in alphabetical order (case insensitive) by the second column which is the common name. You can use Hiram’s commonNames.pl script to find a common name:
Place the name in a temporary file, in this case called '1'
$ echo GCA_001632725.1_ASM163272v1 > 1 $ ~/kent/src/hg/makeDb/doc/asmHubs/commonNames.pl 1 GCA_001632725.1_ASM163272v1 asian clam (DE_01 2016)
Edit: ~/kent/src/hg/makeDb/doc/invertebrateAsmHub/invertebrate.orderList.tsv
Add:
GCA_001632725.1_ASM163272v1 Asian clam (DE_01 2016)
Commit changes, ex.
$ git commit -m 'adding Corbicula_fluminea Asian clam (DE_01 2016) refs #29545'
Pushing to the RR
Run the following make in makeDb/doc/[animalGroup]AsmHub to get the symlinks constructed and make this assembly appear in the outside world:
$ time (make) > dbg 2>&1 &
Check for errors:
$ grep -i error dbg $ tail dbg
This make sets up a number of file links to get pushed out.
Same directory, verify that it worked:
$ time (make verifyTestDownload) >> test.down.log 2>&1 &
Its going through each assembly in that assembly hub and making sure it has enough tracks to be a legitimate assembly hub. The test.down.log file should end in the line something like:
$ tail test.down.log
# checked 598 hubs, 598 success, 0 fail, total tracks: 10288, 2023-06-17 12:40:15
And if you accumulate that test.down.log file, you can see the changes over time via:
$ grep check test.down.log
for each of those 'checked' lines.
You should be able to see for yourself the assembly in the hgDownload-test listing and file directory. Example below:
https://hgdownload-test.gi.ucsc.edu/hubs/GCA/011/100/615/GCA_011100615.1/ https://hgdownload-test.gi.ucsc.edu/hubs/primates/index.html
Verify you can login to hgdownload and dynablat. For the first time, you may need to update your keys. If it doesn't work, you may already have a key and you may need to delete it. The command with "date" afterward is a way to check your access without actually staying in that server.
ssh qateam@hgdownload.soe.ucsc.edu date
$ ssh qateam@dynablat-01.soe.ucsc.edu date
Run RR push from ~/kent/src/hg/makeDb/doc/[animalGroup]AsmHub:
$ time (make sendDownload ) >> send.down.log 2>&1 &
check for errors:
$ grep -i error send.down.log $ tail send.down.log
Then final check that it is on hgdownload:
$ time (make verifyDownload) > verify.down.log 2>&1 &
Should have a valid last 'checked' line
$ tail verify.down.log