Assembly QA Part 1 DEV Steps: Difference between revisions

<div style="border: 2px solid #A3B1BF; padding: .5em 1em 1em 1em; border-top: none; background-color: #FFFBF1; color: #000;">
<div style="border: 2px solid #A3B1BF; padding: .5em 1em 1em 1em; border-top: none;">
<span style="color:olive">These steps were revised in 2017, but you can also see the old steps: [[Releasing_an_assembly | Releasing an assembly (old steps)]] </span>


== Welcome to the '''Assembly QA Part 1: DEV Steps''' page! 😎 ==


{| class=wikitable width=415 align=right
|
<span style="color:darkorange">Navigation Menu</span>
:[[Assembly_Release_QA_Steps | Home Page]]
::[[Assembly QA Part 1 DEV Steps | Assembly QA Part 1: DEV Steps]]
::[[Assembly QA Part 2 Track Steps | Assembly QA Part 2: Track Steps]]
::[[Assembly QA Part 3 BETA Steps | Assembly QA Part 3: BETA Steps]]
::[[Assembly QA Part 4 RR Steps | Assembly QA Part 4: RR Steps]]
|}


====<span style="color:dodgerblue">Setup: Create a Google spreadsheet checklist from a template====
</span>

<span style="color:darkred">Page created Fall 2016 by Cath, Jairo, and ChrisV.</span>

'''Steps:'''
# Open a new Google Spreadsheet.
# Go to the Google spreadsheet [https://docs.google.com/spreadsheets/d/17sAdd_Z8kLif74ig2UBmuHMpGP6hyKknXPk-hcU9sHo/edit?usp=sharing Template: Assembly Release Checklist]
# Copy the template: File > Make a copy
# Give your new spreadsheet a title, like "manPen1 Assembly Release Checklist".
# Move your spreadsheet to a good folder on your Google Drive so that you can easily find it later.
# All set! You can now use your checklist.
 
'''Tips:'''
# Note: This system works best when you create one spreadsheet per assembly.
# See the tab, "README" for more info.
# Any wiki section heading at level h4, that is, surrounded by exactly 4 equal signs ("====Wiki Section===="), appears as a step in your checklist.
# To add a new step to your checklist, do not add it directly to your spreadsheet. Instead, add a new h4 section to the wiki. Just copy an existing h4 and edit it!
# Your h4 will become a URL, so the only punctuation you can use is a colon (":"); anything else will break the wiki link in column A.
# To see your change, toggle the "#" character in your formula. The "#" is not really needed in the formula, and removing it or adding it back in will re-load the page.
 
====<span style="color:dodgerblue">Setup: Make a directory in your hive====
</span>
During this assembly release process,  you will be generating a lot of output, and you'll need a place to put everything. The use of the "hive" directory is encouraged as the best location because of ample space.
<pre>
mkdir /hive/users/userName/assemblies/assemblyName 
 
e.g.:  mkdir /hive/users/cath/assemblies/manPen1
</pre>
 
====<span style="color:dodgerblue">Setup:  Create an alias to your new dir====
</span>
 
When you add an alias from your .bashrc file, you can simply type that alias in your command line as a shortcut to the associated command. A "shortcut" alias can be created to allow fast access to your hive directory for this assembly.
 
To do this, follow the steps below:
 
#In your terminal, connect to hgwdev and type "cd" (go to your home directory).
#Confirm the location of .bashrc. Type "ls -a" in your home directory to list all files, including hidden files whose names begin with a ".", and check that .bashrc is there.
#Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes. Run "source ~/.bashrc" (or open a new terminal) so that the alias takes effect.
 
<pre>
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'
 
e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'
</pre>
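
If you prefer not to edit the file by hand, the same step can be sketched as a small shell function. This is only an illustration: <code>add_alias</code> is a hypothetical helper name (not an existing UCSC tool), and the third argument is an assumption added so the function can be tried on a scratch file first.

```shell
# Sketch: append an alias to a bashrc-style file only if it is not already there.
# add_alias is a hypothetical helper name, not an existing tool.
add_alias() {
    local name="$1" target="$2" rcfile="${3:-$HOME/.bashrc}"
    local line="alias ${name}='cd ${target}'"
    # -x matches the whole line, -F treats the alias text literally
    grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}

# Example (paths follow the examples above):
#   add_alias hive /hive/users/cath/assemblies/manPen1
#   source ~/.bashrc    # reload so the alias works in the current shell
```

Running the function twice leaves a single alias line in the file, so it is safe to re-run.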
 
 
====<span style="color:dodgerblue">Redmine: Review the Redmine as PushQ wiki====
</span>
 
* As of March 2017, the PushQ has been replaced with Redmine to track and release new assemblies.
* Review the [http://redmine.soe.ucsc.edu/projects/genomebrowser/wiki/PushQ Redmine as the pushQ replacement wiki page].
* Go to Redmine > GB > Issues > Filter: "Ready for QA"
* Find the assembly you will QA/Release
 
====<span style="color:dodgerblue">Redmine: Set assignee as yourself====
</span>
 
====<span style="color:dodgerblue">Redmine: Set the engineer as watcher if they are not the developer====
</span>
 
====<span style="color:dodgerblue">Redmine: Set Status to Reviewing====
</span>
 
====<span style="color:dodgerblue">Dev: Check minimal browser criteria====
</span>
 
Does this assembly have the required tracks?
 
Visit this [http://genomewiki.ucsc.edu/genecats/index.php/Minimal_browser page] to check that the assembly contains the required tracks to be considered a minimal browser on the RR.
 
<span style="color:red">To add explanation: GenBank mRNAs & ESTs (/cluster/data/genbank/data/organism.lst). How to view/interpret the file
</span>
 
====<span style="color:dodgerblue">Dev: Check that BLAT Server is running====
</span>
 
To check whether your organism has blat servers set up, run the following command (beware that copyHgcentral creates many temp files):
<pre>copyHgcentral test $db blatServers dev beta</pre>
A better option that does not create many temporary files is to query hgcentraltest yourself:
<pre>hgsql -e "select * from blatServers where db='$db'" hgcentraltest</pre>
 
The developer has often already requested that the blat servers be set up for the new assembly. If not, and/or if entries for your assembly are missing from hgcentraltest.blatServers, please make a note in the Redmine ticket and ask the assembly builder to 1) request the setup of the blat servers and to 2) manually add the entries to hgcentraltest.blatServers.
Make sure that this assembly is not hosted on the "blatx" BLAT server. That server is less stable and is therefore reserved for assemblies that are not destined for the RR. For more information about where blat servers should be hosted, go to [[Updating blat servers]].
 
You should see results like the ones below, since the blat servers should only be set up on dev so far:
 
<pre>
 
copyHgcentral test manPen1 blatServers dev beta
 
--------------------------------------------------
--------------------------------------------------
<<< blatServers >>>
 
hgcentraltest
-------------
manPen1 blat1b 17878 1 0
manPen1 blat1b 17879 0 1
 
hgcentralbeta
-------------
 
 
hgcentral
-------------
 
 
*** There are blatServers differences between dev and beta ***
 
*** The blatServers data on beta and rr is identical ***
 
</pre>
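
The checks described in this section can be sketched as a small filter over the blatServers rows. The sketch assumes whitespace-separated rows in the column order shown in the sample output (db, host, port, and then two flags taken here to be isTrans and canPcr); <code>summarize_blat_rows</code> is a hypothetical name.

```shell
# Sketch: summarize blatServers rows read from stdin.
# Assumed column order: db host port isTrans canPcr (as in the sample output above).
summarize_blat_rows() {
    awk '
        $2 ~ /^blatx/ { print "WARNING: " $1 " is hosted on " $2 }
        $4 == 1 { trans++ }      # assumed: isTrans=1 marks the translated (protein) server
        $4 == 0 { untrans++ }    # assumed: isTrans=0 marks the untranslated (DNA) server
        END { printf "translated=%d untranslated=%d\n", trans, untrans }
    '
}

# Possible usage, assuming the hgsql query shown above:
#   hgsql -Ne "select * from blatServers where db='$db'" hgcentraltest | summarize_blat_rows
```

An assembly with a complete setup would be expected to report at least one row of each kind and no warning.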
 
====<span style="color:dodgerblue">Dev: Do a BLAT search: DNA====
</span>
 
From BLAT tool on dev:
#Go to your browser and copy some DNA sequence
#Go to BLAT: Home > Tools > [http://hgwdev.gi.ucsc.edu/cgi-bin/hgBlat Blat]
#Paste in sequence
#Change query type to DNA and press submit
#Click on various blat results to make sure they look as expected
#Make a custom track of blat results and then look at them in the browser.
 
====<span style="color:dodgerblue">Dev: Do a BLAT search: protein====
</span>
 
From BLAT tool on dev:
#Go to your browser and copy some DNA sequence -> translate to amino acid sequence*
#Go to BLAT: Home > Tools > [http://hgwdev.gi.ucsc.edu/cgi-bin/hgBlat Blat]
#Paste in sequence
#Change query type to "protein" (amino acid) and press submit
#Click on various blat results to make sure they look as expected
#Make a custom track of blat results and then look at them in the browser.
 
====<span style="color:dodgerblue">Dev: isPCR test====
</span>
 
* Go to dev's [http://hgwdev.gi.ucsc.edu/cgi-bin/hgPcr PCR Tool] and test a PCR search for your assembly.
 
For example, on hg38:
*You want to get DNA, about 20-23 bases, that "book end" the region of DNA that will be amplified.
* For example, here's a 70bp region in hg38: chr1:11,131,574-11,131,643
 
*Go to this region on hg38
*hgTracks, View > DNA (v + d keyboard shortcut)
*Click "get DNA" with the default selections.
*Copy the DNA to your clipboard:
 
<pre>
>hg38_dna range=chr1:11131574-11131643 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
</pre>
 
 
:Go to isPCR for hg38 (Tools >InSilico PCR)
:Genome: Human
::Assembly: hg38
::Target: genome assembly
:Forward Primer: The first 20(ish) bp of the region, e.g., CCTGGTCCCAACACCTAGCC (in green below)
:::<span style="color:green">CCTGGTCCCAACACCTAGCC</span>CACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
::Reverse Primer: The last 20(ish) bp of your region, e.g., GCTTGAAGGAAGAACCGCTGG (in red below)
:::CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAAT<span style="color:red">GCTTGAAGGAAGAACCGCTGG</span>
::Check "Flip reverse primer". This will change the region in red to its reverse complement (that is, complement it and also flip it 180 degrees).
:The idea is that you are finding the DNA between the green and the red chunks to amplify.
:Click submit.
 
For the reverse primer in red, you could instead have output the "-" strand DNA (the reverse complement) in "Tools > View DNA" by selecting the radio button for the reverse complement. If you paste that sequence into the "Reverse Primer" field in isPCR, then you do not have to select "Flip Reverse Primer."
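
If you want to double-check a reverse complement on the command line, here is a minimal sketch using standard tools (<code>rev</code> and <code>tr</code>). The function name <code>revcomp</code> is made up for this example.

```shell
# Sketch: print the reverse complement of a DNA string.
revcomp() {
    # rev reverses the string; tr swaps each base for its complement
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Reverse primer from the hg38 example above:
revcomp GCTTGAAGGAAGAACCGCTGG    # prints CCAGCGGTTCTTCCTTCAAGC
```

The printed sequence is what you would enter in the "Reverse Primer" field if you leave "Flip Reverse Primer" unchecked.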
 
====<span style="color:dodgerblue">Dev: Compare chrom sizes====
</span>
: Skip this if your assembly is the first for a species (hosted by UCSC); there will be no chrom sizes to compare to!
: For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. Be aware that chrom names were changed for some assemblies. You are not checking annotations on the reference sequence; you are just checking the number of base pairs per chrom/contig and making sure that nothing has changed drastically (i.e., by millions of base pairs). Also look for general differences, such as chrom labels or the number of chroms/contigs.
:Output chrom sizes into two files, sort each file by using the command below
:Compare the sorted files
 
:There are two ways to compare chromosomes:
 
:1. Navigate to http://hgwdev.gi.ucsc.edu/cgi-bin/hgGateway, find your assembly, and click the "View Sequences" button. Bring up two windows side by side to view the old and new assemblies and compare the chromosome sizes.
 
or
 
:2. Open a terminal window and run the following commands:
<pre>
hgsql -Ne "select chrom, size from chromInfo" $prevDb | sort > oldChromSizes
hgsql -Ne "select chrom, size from chromInfo" $db | sort > newChromSizes
sdiff -s oldChromSizes newChromSizes
</pre>
 
You may want to preview each file first (for example, "head oldChromSizes"); we are only concerned with the "chr# ####" lines.
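
To spot only the drastic changes (millions of base pairs), the comparison can be sketched with <code>join</code> and <code>awk</code>. This assumes both files hold tab-separated "chrom size" lines sorted by chrom name, and that chrom names did not change between versions; <code>chrom_size_delta</code> and the default threshold of 1,000,000 are illustrative choices.

```shell
# Sketch: list chroms present in both files whose size differs by more than a threshold.
# Inputs: two tab-separated "chrom<TAB>size" files, sorted by chrom name.
chrom_size_delta() {
    local old="$1" new="$2" threshold="${3:-1000000}"
    local tab
    tab=$(printf '\t')
    join -t "$tab" "$old" "$new" |
        awk -F'\t' -v t="$threshold" '{
            d = $3 - $2; if (d < 0) d = -d      # absolute size difference
            if (d > t) print $1, $2, $3, d      # chrom, old size, new size, difference
        }'
}

# Possible usage with the files created above:
#   chrom_size_delta oldChromSizes newChromSizes
```

Chroms that exist in only one of the two files are not reported by this sketch; the sdiff output is still the place to look for those.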
 
====<span style="color:dodgerblue">Dev: Gateway: Check the tree====
</span>
On hgGateway, make sure your db appears in the tree.
# Type the first few letters of your assembly name in the search field above the "Represented Species" tree, "m-a-n-P-e..." and the rest should populate.
# Your assembly should now be highlighted in the tree, and the tree position should have moved so that you are now centered on the tree position for your org.
# Hover over the name of your org within the tree, you should see the scientific name.
# Hover over the horizontal branch leading to your org, you should see the genus - family - order.
# Hover over the vertical branch leading to your org, you should see the superorder.
# Go to a different organism on hgGateway. Then scroll down the tree and find your organism. Click on the name of your organism in the tree and you should go to the default assembly for your organism.
 
====<span style="color:dodgerblue">Dev: Gateway: Check default position====
</span>
 
# Go to the gateway page
# Reset all user settings (Home > Genome Browser > Reset All User Settings)
# Select the assembly you're testing
# Press "Go" on hgGateway
# You will be taken to the default position for your assembly.
# Make sure that the resulting area is scientifically interesting and aesthetically pleasing!
# You can edit the default location here: hgcentraltest.dbDb.defaultPos:
<pre>
hgsql -e "update dbDb set defaultPos='chr6:43426669-43433274' where name='danRer11'" hgcentraltest
</pre>
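
Before running an update like the one above, it can help to sanity-check the position string. The following is a minimal sketch; <code>valid_pos</code> is a hypothetical helper, and it assumes the chrom name itself contains no ":" or "-" characters.

```shell
# Sketch: succeed if the argument looks like chrom:start-end with start < end.
valid_pos() {
    printf '%s\n' "$1" | awk -F'[:-]' '
        NF == 3 && $1 != "" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 + 0 < $3 + 0 { ok = 1 }
        END { exit ok ? 0 : 1 }'
}

# Possible usage:
#   valid_pos "chr6:43426669-43433274" && echo "position looks well-formed"
# To see the value currently stored (same hgsql style as used elsewhere on this page):
#   hgsql -Ne "select defaultPos from dbDb where name='danRer11'" hgcentraltest
```

This only checks the format; whether the region is interesting still needs a look in the browser.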
On an unrelated dbDb note, setting the field hgPbOk=1 changes the bases shown on hgTracks from T to U. This also affects zoomed-in MAF files, but shouldn't matter unless we're displaying an RNA genome like SARS-CoV-2.  This field was left over from the protein browser and was repurposed, so it should be 0 for all DNA genomes.
 
====<span style="color:dodgerblue">Dev: Gateway: Check default tracks====
</span>
* Each assembly has certain tracks that are hidden or visible by default.
* You can edit the default tracks here: /kent/src/hg/makeDb/trackDb/$db/trackDb.ra.
 
Below is an example for turning on a default gene track that was off when the developer released the assembly to dev.
 
Resource: https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbDoc.html
 
* manPen1 has no gene tracks on by default.
* I want the augustus track to be on by default, with pack visibility.
* Looking at ~/kent/src/hg/makeDb/trackDb/$db/trackDb.ra, I see that there is no stanza for the augustus track, because it is inheriting the parent *.ra file's configuration, which makes it hidden.
* I need to override the parent config in the manPen1 .ra file.
 
Steps:
* go to dev, Genome Browser > Reset All User Settings
*  note which track you would like to turn on, see if you want it in 'pack' or 'full' etc.
* vim ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra
* Add something like this:
<pre>
#Local declaration so that augustus genes is picked up.
track augustusGene override
visibility pack
</pre>
* cd ~/kent/src/hg/makeDb/trackDb
* make alpha DBS=manPen1
* refresh your dev hgTracks browser and see that your track is now on with the visibility you set in the override (pack, in this case).
* if all looks good, add, commit, push your .ra file.
 
If your assembly is already public on the RR, then continue the push:
* make beta DBS=manPen1
* make public DBS=manPen1
* Push request to admins: Make trackDb & friends for manPen1
* Check the rr/euro/asia for your newly visible track.
 
====<span style="color:dodgerblue">Dev: Gateway: Check trackDb priority====
</span>
* Each assembly has certain tracks that are hidden or visible by default.
** Our standard is to have the visible tracks at the beginning of each group.
* You can edit the default tracks here: /kent/src/hg/makeDb/trackDb/$db/trackDb.ra.
 
 
Tracks that are on by default should be the first tracks within a group; for example, GENCODE v29 is first in the "Genes and Gene Predictions" group for hg38. All other tracks, which are hidden by default, follow the visible tracks in alphabetical order. The only exception to this rule is for the chain/net tracks inside the "Comparative Genomics" group. These chain/net tracks are in phylogenetic order and should not be in alphabetical order.
 
To change the order of the tracks on hgTracks, you can use the '''priority''' trackDb setting:
<pre>
priority 1</pre>
 
hgTracks will display the tracks with the lowest priority value first, then followed by any tracks without a priority setting in alphabetical order. The priority value can be a floating point number, so a priority value of <code>1.1</code> will be displayed after a track with a priority value set to <code>1</code>.
 
====<span style="color:dodgerblue">Dev: Gateway:  Organism image check====
</span>
The image on hgGateway is referenced from trackDb directory's description.html file.
 
From your file.list from Redmine, make sure a scientificName.jpg image is listed, check to see that it does exist on dev.
 
The image file that appears on the gateway page should reside in the kent source tree in:
~/kent/src/hg/htdocs/images/
and a copy should exist at:
hgwdev > /usr/local/apache/htdocs/images/
If the image is not showing up on genome-test, cd to kent/src/hg/htdocs, ensure the image in the images directory is committed, then run make alpha.
 
====<span style="color:dodgerblue">Dev: Gateway: Accession ID check====
</span>
 
 
Assemblies/sequences, from various organizations, are submitted to the mother ship [https://www.ncbi.nlm.nih.gov/genbank/ GenBank].<br>
Those assemblies [https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/ might be included in RefSeq if criteria are met].
 
The QA check should be to go to NCBI and double check that the accessionID is correct, possibly by searching the Accession ID in https://www.ncbi.nlm.nih.gov/assembly/.
 
:RefSeq assemblies:
::use accession ID: '''GCF'''_000002315.4 (e.g., galGal5)
::are delivered '''with''' chrMt (if they exist)
::are delivered with NCBI gene predictions
 
:GenBank assemblies:
::use accession ID: '''GCA'''_000001305.2
::delivered '''without''' a chrMt.
::do '''not''' have gene predictions.
 
For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data').
This is a "learn as we go" direction; historically GenBank was preferred.
 
Helpful article: [http://www.nature.com/nrg/journal/v13/n5/full/nrg3174.html Nature, 2012 A beginner's guide to eukaryotic genome annotation]
 
====<span style="color:dodgerblue">Dev: Gateway: Check the NCBI assembly version link====
</span>


Check that there is an NCBI link to the exact assembly version, either by clicking the link on the Gateway or searching
http://www.ncbi.nlm.nih.gov/assembly/organism/
====<span style="color:dodgerblue">Dev: Verify make doc for all tracks====
</span>
* The makedoc file(s), e.g., initialBuild.txt, for your assembly describe the browser build.
* Location should be here:  ~/kent/src/hg/makeDb/doc/$db/*
In Nov 2017, Cath asked Hiram which tables should be mentioned in the makedoc. Below is an example from xenLae2. The makedoc correctly lists all of the necessary tracks.
'''Mentioned in the makedoc'''
#augustusGene
#chromAlias
#cpgIslandExt
#cpgIslandExtUnmasked
#cytoBandIdeo
#gap
#genscan
#gold
#microsat
#rmsk
#simpleRepeat
#trackDb
#ucscToINSDC
#ucscToRefSeq
#windowmaskerSdust
#(Sometimes) ensGene
'''Not mentioned in the makedoc, and it is ok that they are not mentioned:'''
* Constructed by genbank processes:
#all_est
#all_mrna
#intronEst
#refFlat
#refGene
#refSeqAli
#xenoRefFlat
#xenoRefGene
#xenoRefSeqAli
#estOrientInfo
#mrnaOrientInfo
* Constructed by the doBlastzChainNet.pl script:
#chainHg38
#chainHg38Link
#chainMm10
#chainMm10Link
#chainXenTro9
#chainXenTro9Link
#netHg38
#netMm10
#netXenTro9
* Constructed by the makeGenome.pl script:
#chromInfo
#gc5BaseBw
#grp


* Constructed by 'make' in trackDb hierarchy:
#hgFindSpec
* Added to by many loader commands:
#history
* Constructed by doRepeatMasker.pl script:
#nestedRepeats
* Constructed by cron job every night:
#tableDescriptions


Grep tips:
Use the list of tables you'll push (see [http://genomewiki.ucsc.edu/genecats/index.php/Assembly_QA_Part_3_BETA_Steps#Beta:_Make_table_PUSH_lists_for_each_assembly BETA STEPS]) as the grep pattern list, and look in the makedoc to see which tables are NOT mentioned:


* 1. grep -of allTables.xenLae2 ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt | sort -u > tablesListed.makedoc
* 2. sort -u allTables.xenLae2 | comm -13 tablesListed.makedoc -

Step 2 prints the tables that are in allTables.xenLae2 but not in tablesListed.makedoc (comm -13 keeps only the lines unique to the second input).


This is also a helpful grep: grep "DONE" ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt
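
The goal of the grep tips (find tables that the makedoc never mentions) can also be sketched as one function. <code>tables_not_in_makedoc</code> is a hypothetical name, and the use of <code>-w</code> (whole-word matches) and <code>-F</code> (literal strings) is a choice made for this sketch so that a short table name does not match inside a longer one.

```shell
# Sketch: print table names from a list that never appear as whole words in a makedoc.
tables_not_in_makedoc() {
    local tables="$1" makedoc="$2"
    local listed
    listed=$(mktemp)
    # -o prints only the matched text, -w requires whole-word matches,
    # -F treats patterns literally, -f reads the patterns from the table list
    grep -owFf "$tables" "$makedoc" | sort -u > "$listed"
    # comm -13 keeps lines that appear only in the second input (the full table list)
    sort -u "$tables" | comm -13 "$listed" -
    rm -f "$listed"
}

# Possible usage with the file names from the tips above:
#   tables_not_in_makedoc allTables.xenLae2 ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt
```

Compare the output against the lists of tables that are expected to be absent from the makedoc.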


====<span style="color:dodgerblue">Dev: Review downloads dir====
</span>


View the contents of the downloads directory.
<pre>
ls -R /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
</pre>
LiftOver files and vs* directories are for the chain/net tracks; and the multiz*way, phastCons*way and phyloP*way directories are for conservation tracks.


Note that $db/database dir will be empty except for README.txt. This directory will contain a dump of the database on the RR, but will always remain empty on hgwdev.


Also note that these files:
est.fa.gz      mrna.fa.gz      refMrna.fa.gz      xenoMrna.fa.gz
est.fa.gz.md5  mrna.fa.gz.md5  refMrna.fa.gz.md5  xenoMrna.fa.gz.md5
will not be present on hgwdev.  They are generated automatically and rsync'ed to hgdownload after an assembly is added to hgwbeta.dbs and "make etc-update-server" is run in the kent/src/hg/makeDb/genbank/ directory on hgwbeta.


====<span style="color:dodgerblue">Dev: Run dbCheck====
</span>
Run the following command to check that all MySQL tables are in good repair:
<pre>
sudo dbCheck.sh $db
</pre>


====<span style="color:dodgerblue">Dev: Alignment files are to valid assemblies====
</span>
In Redmine for your assembly, the engineer should have provided a path to redmine.$db.file.list
E.g.,
/hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list
From hive, copy the file list to your assembly dir:
<pre>
cp /hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list /hive/users/yourUserName/assemblies/manPen1/
</pre>
Take a look at the alignment "To" and "From" files, and make sure they are to valid assemblies on the RR.
:'''LiftOver Files'''
::A liftOver file is a chain file; it is a subset of all chains used in creating the net file.
:::hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
:::::The file names reflect the assembly conversion data contained within, in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.
:'''Chain Files'''
::Chain files contain all possible chains generated by lastz, before a subset of "best" chains is filtered into the liftOver file.
:::hg38.anoCar2.all.chain.gz: chained blastz alignments.
:::::The chain format is described in [http://genome.ucsc.edu/goldenPath/help/chain.html on the chain help page].
:'''Net Files'''
::hg38.anoCar2.net.gz: "net" file.
::::This file describes rearrangements between the species and the best Lizard match to any part of the Human genome.  The net format is described in [http://genome.ucsc.edu/goldenPath/help/net.html on the net help page].
:'''Axt Files'''
::hg38.anoCar2.net.axt.gz: chained and netted alignments.
:::: i.e. the best chains in the Human genome, with gaps in the best chains filled in by next-best chains where possible.  The axt format is described in [http://genome.ucsc.edu/goldenPath/help/axt.html the axt help page].
====<span style="color:dodgerblue">Dev: liftOver exists: old to new, new to old====
</span>
Skip this if your assembly is the first version for the organism.
Otherwise, check that liftOver files exist in both directions between the previous assembly version and the new one:
* new to old: /gbdb/[your Db]/liftOver/[your Db]To[the older version of your org].over.chain.gz
* old to new: /gbdb/[the older version of your org]/liftOver/[the older version of your org]To[your Db].over.chain.gz
====<span style="color:dodgerblue">Dev: liftOver exists: other orgs====
</span>
Your assembly will probably also have liftOver files to/from other major orgs, such as the newer human and mouse assemblies.
Check that liftOver files exist in BOTH directories,
/gbdb/[your database]/liftOver/
/gbdb/[some other org database]/liftOver
For example, if your assembly is manPen1, see what liftOver files are there. These should also match what is in your filelist from Redmine.
<pre>
★  /gbdb/manPen1/liftOver
ls
manPen1ToHg38.over.chain.gz  manPen1ToMm10.over.chain.gz
</pre>
Note that there are liftOver files to TWO other orgs, human and mouse. If this assembly was not the first, it should also have liftOver files to the previous assembly version.
Let's go look at liftOver files for hg38:
<pre>
★  /gbdb/hg38/liftOver
ls | grep ManPen
hg38ToManPen1.over.chain.gz
</pre>
and then we'll check mm10:
<pre>
★  /gbdb/mm10/liftOver
ls | grep ManPen
mm10ToManPen1.over.chain.gz
</pre>
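
The naming pattern in the listings above (the target db name gets a capital first letter, as in manPen1ToHg38) can be turned into a quick existence check. This is a sketch only: <code>liftover_path</code> and <code>check_liftover_pair</code> are hypothetical helpers, and the optional root argument is an assumption added so the functions can be tried outside /gbdb.

```shell
# Sketch: build the expected liftOver path from one db to another.
liftover_path() {
    local from="$1" to="$2" root="${3:-/gbdb}"
    local first rest
    # Capitalize the first letter of the target db (hg38 -> Hg38)
    first=$(printf '%s' "$to" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$to" | cut -c2-)
    printf '%s/%s/liftOver/%sTo%s%s.over.chain.gz\n' "$root" "$from" "$from" "$first" "$rest"
}

# Sketch: report whether both directions exist; returns non-zero if either is missing.
check_liftover_pair() {
    local a="$1" b="$2" root="${3:-/gbdb}"
    local f status=0
    for f in "$(liftover_path "$a" "$b" "$root")" "$(liftover_path "$b" "$a" "$root")"; do
        if [ -e "$f" ]; then echo "OK      $f"; else echo "MISSING $f"; status=1; fi
    done
    return $status
}

# Possible usage:
#   check_liftover_pair manPen1 hg38
#   check_liftover_pair manPen1 mm10
```

The same pair check applies to the previous assembly version of your organism, if there is one.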
====<span style="color:dodgerblue">Dev: Check Tools: LiftOver====
</span>
* Go to dev's [http://hgwdev.gi.ucsc.edu/cgi-bin/hgLiftOver LiftOver Tool] and test lifts to & from other assembly versions and other organisms that you have liftOver files for.
====<span style="color:dodgerblue">Dev: Review notes and make temp dir for md5sum checks====
</span>
There is a way to check all md5sums at once using one command. This should save you a lot of time and typing. You'll need two directories in your home folder, temp and temp2 (create them with "mkdir -p ~/temp ~/temp2").
First go to the downloads directory for your assembly:
<pre>
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
</pre>
Then run the following loop to compare freshly computed md5sums against the md5sum.txt files generated when the files were uploaded. If the only output concerns README.txt and md5sum.txt, that's great and nothing has changed. If something else comes up, ask your developer.
<pre>
for dir in */; do dir=${dir%/}; ( cd "$dir" || exit; md5sum * | sort > ~/temp/"$dir"; sort md5sum.txt > ~/temp2/"$dir"; echo "$dir"; diff ~/temp/"$dir" ~/temp2/"$dir" ); done
</pre>
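
As a complementary check, GNU md5sum can verify a checksum list directly with its <code>-c</code> option. The sketch below assumes md5sum.txt is in the standard "hash  filename" format that md5sum itself writes; <code>verify_md5_dir</code> is a hypothetical name.

```shell
# Sketch: verify every file listed in a directory's md5sum.txt.
# With --quiet, only files that fail or are missing are reported.
verify_md5_dir() {
    local dir="$1"
    # Subshell keeps the caller's working directory unchanged
    ( cd "$dir" && md5sum -c --quiet md5sum.txt )
}

# Possible usage:
#   verify_md5_dir /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
```

Unlike the diff approach, this does not report files that are present in the directory but absent from md5sum.txt, so it supplements the loop rather than replacing it.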
------------
OUTDATED BELOW, use above command
First, make a dir "temp" in your home directory. You'll use this in the steps below. The remainder of the text below explains how the md5sum checks will work.
Review the following section, which is a guide to verify that the download files exist and are not corrupt in the following directory and sub dirs:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
We will be using a computer program called [https://en.wikipedia.org/wiki/Md5sum Md5sum] to generate MD5 hashes to verify the integrity of the files, since any change to a file will cause its MD5 hash to change. The MD5 hashes for each file were generated and stored in the md5sum.txt file.
An easy way to compare the MD5 hashes of each file is to do a diff. This can be easily automated by running the following commands.
The first command is to run md5sum for all files in your current directory (these will be listed in the steps below), sort them, and then redirect the output to a file.
md5sum * | sort > ~/temp/filename_1
The second command sorts the md5sum.txt file and redirects the output to a different file.
sort md5sum.txt > ~/temp/filename_2
The final command compares the two files created and displays the lines that differ between the two files.
diff ~/temp/filename_1 ~/temp/filename_2
Note that the md5sum.txt file obviously does not contain md5sum.txt and it was created before there was a README.txt file, so your diff will show md5sum.txt and README.txt in the results. If everything is ok, those should be the only results. In the vs* directories, the XXXX.net.axt.gz file will show up as axtNet/XXXX.net.axt.gz.
Continue on to the next steps to begin running these checks in the following directories.
====<span style="color:dodgerblue">Dev: bigZips: check md5sum====
</span>
Change your directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
====<span style="color:dodgerblue">Dev: bigZips: check README ====
</span>
These README checks can be automated so that you don't have to run the per-directory commands below. You still have to read or skim the README.txt files that are printed. Here are the commands to display all README files under the assembly's download directory:
cd  /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
for dir in *; do cd $dir; echo $dir; cat README.txt; cd ..; done
REDUNDANT BELOW, above step does all README.txt prints in this assembly directory
--------------------
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
* Verify that the README.txt exists
* cat the file and read it, check the contents (such as urls listed, etc.)
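
The "verify that the README.txt exists" part of these checks can be sketched as a quick scan over all subdirectories; reading the files is still up to you. <code>missing_readmes</code> is a hypothetical helper name.

```shell
# Sketch: list subdirectories of a download dir that lack a non-empty README.txt.
missing_readmes() {
    local root="$1" d
    for d in "$root"/*/; do
        [ -d "$d" ] || continue                      # no subdirectories: the glob stays literal
        [ -s "${d}README.txt" ] || echo "${d%/}"     # -s: file exists and is not empty
    done
}

# Possible usage:
#   missing_readmes /usr/local/apache/htdocs-hgdownload/goldenPath/$db
```

No output means every immediate subdirectory has a README.txt with some content.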
====<span style="color:dodgerblue">Dev: bigZips: check for corruption====
</span>
Change your working directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
Run the following in the directory:
for file in *; do zcat $file | head; zcat $file | tail; done
Scroll the output and make sure all the text is ASCII.
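
In addition to eyeballing the head and tail of each file, gzip's own integrity test (<code>gzip -t</code>) can flag truncated or corrupt archives. This is a different technique from the zcat loop above and is offered only as a sketch; <code>find_bad_gz</code> is a hypothetical name.

```shell
# Sketch: print the .gz files in a directory that fail gzip's integrity test.
find_bad_gz() {
    local dir="$1" f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue              # no .gz files: the glob stays literal
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Possible usage:
#   find_bad_gz /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
```

No output means every .gz file decompressed cleanly; it does not tell you whether the contents are sensible, so the visual check is still worthwhile.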
====<span style="color:dodgerblue">Dev: database: check README====
</span>
cat /usr/local/apache/htdocs-hgdownload/goldenPath/$db/database/README.txt
* Verify that the README.txt exists
* cat the file and read it, check the contents (such as urls listed, etc.)
====<span style="color:dodgerblue">Dev: liftOver: check md5sum====
</span>
Change your directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
then run this command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
====<span style="color:dodgerblue">Dev: liftOver: check README====
</span>
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
* Verify that the README.txt exists
* cat the file and read it, check the contents (such as urls listed, etc.)
====<span style="color:dodgerblue">Dev: liftOver: corruption====
</span>
Run the following in each directory and check the output:
for file in *; do zcat $file | head; zcat $file | tail; done
Scroll the output and make sure all the text is ASCII.
====<span style="color:dodgerblue">Dev: vsXXX: check md5sum====
</span>
This section is only relevant if your assembly has chain/net files to another organism.
Note: there may be multiple organisms that your assembly has alignment files to, check them all.
Change your directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
'''REPEAT''' this process for subdirectories:
* reciprocalBest
* reciprocalBest/axtRBestNet
====<span style="color:dodgerblue">Dev: vsXXX: check README====
</span>
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsMm10
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsHg38
* Verify that the README.txt exists
* cat the file and read it, check the contents (such as urls listed, etc.)
====<span style="color:dodgerblue">Dev: vsXXX: corruption====
</span>
Change to the directories that hold other assemblies' chains to your assembly:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/hg38/vs${db^}
cd /usr/local/apache/htdocs-hgdownload/goldenPath/mm10/vs${db^}
Run the following in each directory and check the output:
for file in *; do zcat $file | head; zcat $file | tail; done
Scroll the output and make sure all the text is ASCII.
'''REPEAT''' this process for subdirectories:
* reciprocalBest
* reciprocalBest/axtRBestNet
====<span style="color:dodgerblue">Dev: for :queryDb:vsYourDb: check README ====
</span>
/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
Check the README.txt files for any other organisms that your assembly has alignments (chain/net/liftover/etc) to:
* Verify that the README.txt exists
* cat the file and read it, check the contents (such as urls listed, etc.)
'''REPEAT''' this process for subdirectory:
* reciprocalBest (this readme covers the subdir, axtRBestNet).
====<span style="color:dodgerblue">Dev: for :queryDb:vsYourDb: check md5sum====
</span>
Note: your assembly may have alignment files to multiple organisms; check them all.
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
'''REPEAT''' this process for subdirectories:
* reciprocalBest
* reciprocalBest/axtRBestNet
====<span style="color:dodgerblue">Dev: for :queryDb:vsYourDb: check corruption ====
</span>
/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
Just do a zcat for otherOrg -> yourDb:
zcat $file | head
'''REPEAT''' this process for subdirectories:
* reciprocalBest
* reciprocalBest/axtRBestNet
====<span style="color:dodgerblue">Dev: md5sum check with 2bitCompare====
</span>
2bitCompare $db
The .2bit files contain the new assembly sequence in a compact, binary format.  The .2bit files are located at:
* /scratch/$db (on the blat server)
* /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips/ (on hgwdev)
* /gbdb/$db/ (on hgwdev)
* /gbdb/$db/ (on hgwbeta)
Make sure that the .2bit files are identical by running the 2bitCompare script.  This matters particularly if the assembly has been part of a multiz track without a Browser, because the file may already exist on beta and the RR and may not have been masked.
Below is some sample output:
hgwdev> 2bitCompare allMis1
  Checking md5sums.  This could take a few minutes.  Please be patient...
        blat4a md5sum: 134e740c05eedadc24de3a96775a25d6 /scratch/allMis1/allMis1.2bit
      download md5sum: 134e740c05eedadc24de3a96775a25d6 /usr/local/apache/htdocs-hgdownload/goldenPath/allMis1/bigZips/allMis1.2bit
    hgwdev gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
  hgwbeta gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
        blat4a date,size: Jun 19 11:03 569794406
      download date,size: Jul 3 10:55 53
    hgwdev gbdb date,size: Jun 7 13:34 39
  hgwbeta gbdb date,size: Jun 7 13:33 569794406
The first part of the script output lists the md5sums of all four .2bit files.  These should be identical.
The second part of the script output lists the timestamps and filesizes.
* The download and hgwdev gbdb files should be symlinks, as evidenced by a small filesize.
* The blat and hgwbeta gbdb files should be the actual files, as evidenced by a large filesize.
* The two symlink filesizes will likely be different, but the filesize of the two actual files should be identical.
If the blat .2bit is not the same as the other .2bit files, ask the pushers to restart the assembly and to pull the newest .2bit file from /gbdb.
Note: if there is no gbdb data on beta yet (pushing it is part of the data push process), the output may show the following instead:
  hgwbeta/rr gbdb md5sum: The $db directory does not exist in /gbdb on hgwbeta
  hgwbeta/rr gbdb date,size: N/A
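To spot-check just the copies that are visible from one machine, a small helper can print the checksums and flag a mismatch. This is only a sketch: <code>same_md5</code> is a hypothetical function, and it covers only files reachable from the current host, so it does not replace 2bitCompare, which also checks the blat server and hgwbeta.

```shell
# same_md5 FILE...: print the md5sum of each file (symlinks are followed)
# and succeed only if every checksum is identical. (Hypothetical helper.)
same_md5() {
    md5sum "$@" | awk '
        { print; seen[$1] = 1 }
        END {
            n = 0
            for (h in seen) n++
            if (n != 1) { print "MISMATCH: " n " distinct checksums"; exit 1 }
        }'
}
```

On hgwdev, for example: <code>same_md5 /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips/$db.2bit /gbdb/$db/$db.2bit</code>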
====<span style="color:dodgerblue">Dev: Permissions check: downloads dir====
</span>
The developer may need to update the permissions on the files in the download directory so that they are at least 664.
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
<pre>
ls -lLR *
-rw-rw-r--
</pre>
This output can be thousands of lines if you have lots of alignments.  To shorten it, you can filter out the lines that match the expected permission, the vs labels, the total bytes lines, the directory permissions (drwxrwxr-x), and blank lines, so that only unexpected entries remain.  If anything with fewer permissions shows up, investigate thoroughly.
<pre>
ls -lLR * | grep -ve '-rw-rw-r--\|vs\|total\|drwxrwxr-x\|^$'
</pre>
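Another way to get the same short list is to let <code>find</code> test the permission bits directly instead of grepping the <code>ls</code> output. The sketch below assumes a POSIX <code>find</code>; <code>list_under_664</code> is a hypothetical name. <code>! -perm -664</code> selects files that are missing at least one of the rw-rw-r-- bits, and <code>-L</code> follows symlinks the same way <code>ls -L</code> does.

```shell
# list_under_664 DIR: list every regular file under DIR (following symlinks)
# whose mode lacks any of the rw-rw-r-- bits. No output means nothing to fix.
# (Hypothetical helper.)
list_under_664() {
    find -L "$1" -type f ! -perm -664 -exec ls -lL {} +
}
```

For example: <code>list_under_664 /usr/local/apache/htdocs-hgdownload/goldenPath/$db/</code>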
====<span style="color:dodgerblue">Dev: Ensure your dbs is defined in trackDb makefile====
</span>


cat ~/kent/src/hg/makeDb/trackDb/makefile | grep $db




🔵  Done with DEV steps? Go to  [[Assembly QA Part 2 Track Steps | Assembly QA Part 2: Track Steps]]






[[Category:Browser QA]]
[[Category:Browser QA New Assembly]]
[[Category:Browser QA Assembly Release Guide]]

Latest revision as of 17:18, 29 September 2020

These steps were revised in 2017, but you can also see the old steps: Releasing an assembly (old steps)


Navigation Menu

Home Page
Assembly QA Part 1: DEV Steps
Assembly QA Part 2: Track Steps
Assembly QA Part 3: BETA Steps
Assembly QA Part 4: RR Steps

Setup: Create a Google spreadsheet checklist from a template

Steps:

  1. Open a new Google Spreadsheet.
  2. Go to the Google spreadsheet Template: Assembly Release Checklist
  3. Copy the template: File > Make a copy
  4. Give your new spreadsheet a title, like "manPen1 Assembly Release Checklist".
  5. Move your spreadsheet to a good folder on your Google Drive so that you can easily find it later.
  6. All set! You can now use your checklist.

Tips:

  1. Note: This system works best when you create one spreadsheet per assembly.
  2. See the tab, "README" for more info.
  3. If a wiki section is h4 ("====Wiki Section===="), denoted by surrounding the section with exactly 4 equal signs, then the h4 section will appear as a step in your checklist.
  4. To add a new step to your checklist - do not add it directly to your spreadsheet. Instead add a new h4 section to the wiki. Just copy an existing h4 and edit it!
  5. Your h4 will become a url, so the only punctuation you can use is a colon " : " otherwise the wiki link in column A will break.
  6. To see your change, toggle the "#" character in your formula. The "#" is not really needed in the formula, and removing it or adding it back in will re-load the page.

Setup: Make a directory in your hive

During this assembly release process, you will be generating a lot of output, and you'll need a place to put everything. The use of the "hive" directory is encouraged as the best location because of ample space.

 
mkdir /hive/users/userName/assemblies/assemblyName  

e.g.:  mkdir /hive/users/cath/assemblies/manPen1

Setup: Create an alias to your new dir

When you add an alias from your .bashrc file, you can simply type that alias in your command line as a shortcut to the associated command. A "shortcut" alias can be created to allow fast access to your hive directory for this assembly.

To do this, follow the steps below:

  1. In your terminal, connect to hgwdev and type "cd" (go to your home directory).
  2. Confirm the location of .bashrc. Type "ls -a" in your home directory to see all hidden files that have a " . " in the filename. This way you can confirm the location of your .bashrc file.
  3. Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes.
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'

e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'


Redmine: Review the Redmine as PushQ wiki

  • As of March 2017, the PushQ has been replaced with Redmine to track and release new assemblies.
  • Review the Redmine as the pushQ replacement wiki page.
  • Go to Redmine > GB > Issues > Filter: "Ready for QA"
  • Find the assembly you will QA/Release

Redmine: Set assignee as yourself

Redmine: Set the engineer as watcher if they are not the developer

Redmine: Set Status to Reviewing

Dev: Check minimal browser criteria

Does this assembly have the required tracks?

Visit this page to check that the assembly contains the required tracks to be considered a minimal browser on the RR.

To add explanation: genbank mrnas & ests (/cluster/data/genbank/data/organism.lst) How to view/interpret the file

Dev: Check that BLAT Server is running

To check if your organism has blat servers set up, run the following command (beware that copyHgcentral creates many temp files):

copyHgcentral test $db blatServers dev beta

A better command, which does not create many temporary files, is to query hgcentraltest yourself:

hgsql -e "select * from blatServers where db='$db'" hgcentraltest

The developer has often already requested that the blat servers be set up for the new assembly. If not, or if entries for your assembly are missing from hgcentraltest.blatServers, make a note in the Redmine ticket and ask the assembly builder to 1) request the setup of the blat servers and 2) manually add the entries to hgcentraltest.blatServers. Make sure that this assembly is not hosted on the "blatx" BLAT server. That server is less stable and is therefore reserved for assemblies that are not destined for the RR. For more information about where the blat servers for different machines should be hosted, go to Updating blat servers.

You should see results like those below, since this should only be set up on dev so far:


copyHgcentral test manPen1 blatServers dev beta

--------------------------------------------------
--------------------------------------------------
<<< blatServers >>>

hgcentraltest
-------------
manPen1	blat1b	17878	1	0
manPen1	blat1b	17879	0	1

hgcentralbeta
-------------


hgcentral
-------------


*** There are blatServers differences between dev and beta ***

*** The blatServers data on beta and rr is identical ***

Dev: Do a BLAT search: DNA

From BLAT tool on dev:

  1. Go to your browser and copy some DNA sequence
  2. Go to BLAT: Home > Tools > Blat
  3. Paste in sequence
  4. Change query type to DNA and press submit
  5. Click on various blat results to make sure they look as expected
  6. Make a custom track of blat results and then look at them in the browser.

Dev: Do a BLAT search: protein

From BLAT tool on dev:

  1. Go to your browser and copy some DNA sequence -> translate to amino acid sequence*
  2. Go to BLAT: Home > Tools > Blat
  3. Paste in sequence
  4. Change query type to "protein" (amino acid) and press submit
  5. Click on various blat results to make sure they look as expected
  6. Make a custom track of blat results and then look at them in the browser.

Dev: isPCR test

  • Go to dev's PCR Tool and test a PCR search for your assembly.

For example, on hg38:

  • You want to get DNA, about 20-23 bases, that "book end" the region of DNA that will be amplified.
  • For example, here's a 70bp region in hg38: chr1:11,131,574-11,131,643
  • Go to this region on hg38
  • hgTracks, View > DNA (v + d keyboard shortcut)
  • Click "get DNA" with the default selections.
  • Copy the DNA to your clipboard:
>hg38_dna range=chr1:11131574-11131643 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG


Go to isPCR for hg38 (Tools >InSilico PCR)
Genome: Human
Assembly: hg38
Target: genome assembly
Forward Primer: The first 20(ish) bp of the region, e.g., CCTGGTCCCAACACCTAGCC (in green below)
CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
Reverse Primer: The last 20(ish) bp of your region, e.g., GCTTGAAGGAAGAACCGCTGG (in red below)
CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
Check "Flip reverse primer". This will change the region in red to the reverse complement and also flip it 180 degrees.
The idea is that you are finding the DNA between the green and the red chunks to amplify.
Click submit.

For the reverse primer in red, you could have output the "-" strand DNA (the reverse complement) in "Tools > View DNA" by selecting the radio button for the reverse complement. If you do this for the "Reverse Primer" field in isPCR, then you do not have to select "Flip Reverse Primer."
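If you want to double check what the reverse complement of a primer should look like before pasting it in, it can be computed on the command line. This is a minimal sketch; <code>revcomp</code> is a hypothetical helper name, and it assumes the <code>rev</code> and <code>tr</code> utilities are available.

```shell
# revcomp SEQ: print the reverse complement of a DNA sequence.
# (Hypothetical helper.)
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The reverse primer from the hg38 example above:
revcomp GCTTGAAGGAAGAACCGCTGG
# prints CCAGCGGTTCTTCCTTCAAGC
```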

Dev: Compare chrom sizes

Skip this if your assembly is the first for a species (hosted by UCSC), there will be no chrom sizes to compare to!
For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. For some assemblies, chrom names were changed, be aware of this if comparing. You are not checking annotations on the reference sequence, you are just checking the number of base pairs per chrom/contig, and making sure that nothing has changed drastically (i.e., millions of base pairs different). Also take a look for general differences, such as chrom labels or number of chrom/contigs.
Output chrom sizes into two files, sort each file by using the command below
Compare the sorted files
There are two ways to compare chromosomes:
1. Navigate to http://hgwdev.gi.ucsc.edu/cgi-bin/hgGateway, find your assembly and click on the "View Sequences" button - bring up 2 windows side by side to view both old and new assemblies. Now, you can compare the chromosome sizes.

or

2. open up a terminal window and input the following commands:

hgsql -Ne "select chrom, size from chromInfo" $prevDb | sort > oldChromSizes
hgsql -Ne "select chrom, size from chromInfo" $db | sort > newChromSizes
sdiff -s oldChromSizes newChromSizes

You may want to use "$cat oldChromSizes | head" to clean up the output in both old and new chromSize files, we are only concerned with the "chr# ####" labels.
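When the sdiff output is long, it can help to print only the chroms that changed drastically. The sketch below is one way to do that with awk, working on the oldChromSizes and newChromSizes files created above. The function name <code>chrom_size_delta</code> and the default threshold of 1,000,000 bases are assumptions for this example; pick whatever threshold makes sense for your assembly.

```shell
# chrom_size_delta OLD NEW [THRESHOLD]: both files hold "chrom<TAB>size" lines.
# Prints chroms whose size changed by more than THRESHOLD bases (default
# 1000000), plus chroms that are present in only one of the two files.
# (Hypothetical helper.)
chrom_size_delta() {
    awk -v t="${3:-1000000}" '
        NR == FNR { old[$1] = $2; next }
        {
            seen[$1] = 1
            if (!($1 in old)) { print "only in new:", $1, $2; next }
            d = $2 - old[$1]; if (d < 0) d = -d
            if (d > t) print "changed:", $1, old[$1], "->", $2
        }
        END { for (c in old) if (!(c in seen)) print "only in old:", c, old[c] }
    ' "$1" "$2"
}
```

For example: chrom_size_delta oldChromSizes newChromSizes. Keep in mind that if chrom names were changed between versions, every chrom will show up as "only in old" or "only in new".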

Dev: Gateway: Check the tree

On hgGateway, make sure your db appears in the tree.

  1. Type the first few letters of your assembly name in the search field above the "Represented Species" tree, "m-a-n-P-e..." and the rest should populate.
  2. Your assembly should now be highlighted in the tree, and the tree position should have moved so that you are now centered on the tree position for your org.
  3. Hover over the name of your org within the tree, you should see the scientific name.
  4. Hover over the horizontal branch leading to your org, you should see the genus - family - order.
  5. Hover over the vertical branch leading to your org, you should see the superorder.
  6. Go to a different organism on hgGateway. Then scroll down the tree and find your organism. Click on the name of your organism in the tree and you should go to the default assembly for your organism.

Dev: Gateway: Check default position

  1. Go to gateway page
  2. Reset all user settings (Home > Genome Browser > Reset All User Settings)
  3. Select the assembly you're testing
  4. Press "Go" on hgGateway
  5. You will be taken to the default position for your assembly.
  6. Make sure that the resulting area is scientifically interesting and aesthetically pleasing!
  7. You can edit the default location here: hgcentraltest.dbDb.defaultPos:
hgsql -e "update dbDb set defaultPos='chr6:43426669-43433274' where name='danRer11'" hgcentraltest

On an unrelated dbDb note, setting the field hgPbOk=1 sets the base pairs shown on hgTracks from T's to U's. This also affects zoomed in MAF files, but shouldn't matter unless we're displaying an RNA genome like SARS-CoV-2. This field was left over from the protein browser and was repurposed, so it should be 0 for all DNA genomes.

Dev: Gateway: Check default tracks

  • Each assembly has certain tracks that are hidden or visible by default.
  • You can edit the default tracks here: /kent/src/hg/makeDb/trackDb/$db/trackDb.ra.

Below is an example for turning on a default gene track that was off when the developer released the assembly to dev.

Resource: https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbDoc.html

  • manPen1 has no gene tracks on by default.
  • I want to turn on the augustus track (on by default, pack visibility).
  • Looking at ~/kent/src/hg/makeDb/trackDb/$db/trackDb.ra, I see that there is no stanza for the augustus track, because it is inheriting the parent *.ra file's configuration, making it hidden.
  • I need to override the parent config in the manPen1 .ra file.

Steps:

  • go to dev, Genome Browser > Reset All User Settings
  • note which track you would like to turn on, see if you want it in 'pack' or 'full' etc.
  • vim ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra
  • Add something like this:
#Local declaration so that augustus genes is picked up.
track augustusGene override
visibility pack
  • cd ~/kent/src/hg/makeDb/trackDb
  • make alpha DBS=manPen1
  • refresh your dev hgTracks browser and see that your track is now on, inheriting the parent's visibility (pack, in this case).
  • if all looks good, add, commit, push your .ra file.

If your assembly is already public on the RR, then continue the push:

  • make beta DBS=manPen1
  • make public DBS=manPen1
  • Push request to admins: Make trackDb & friends for manPen1
  • Check the rr/euro/asia for your newly visible track.

Dev: Gateway: Check trackDb priority

  • Each assembly has certain tracks that are hidden or visible by default.
    • Our standard is to have the visible tracks at the beginning of each group.
  • You can edit the default tracks here: /kent/src/hg/makeDb/trackDb/$db/trackDb.ra.


Tracks that are on by default should be the first tracks within a group, for example, GENCODE v29 is first in the "Genes and Gene Predictions" group for hg38. All other tracks that are hidden by default follow the visible tracks in alphabetical order. The only exception to this rule is for the chain/net tracks inside of the "Comparative Genomics" group. These chain/net tracks are in phylogenetic order and should not be in alphabetical order.

To change the order of the tracks on hgTracks, you can use the priority trackDb setting:

priority 1

hgTracks will display the tracks with the lowest priority value first, then followed by any tracks without a priority setting in alphabetical order. The priority value can be a floating point number, so a priority value of 1.1 will be displayed after a track with a priority value set to 1.

Dev: Gateway: Organism image check

The image on hgGateway is referenced from trackDb directory's description.html file.

From your file.list from Redmine, make sure a scientificName.jpg image is listed, check to see that it does exist on dev.

The image file that appears on the gateway page should reside in the kent source tree in:

~/kent/src/hg/htdocs/images/

and a copy should exist at:

hgwdev > /usr/local/apache/htdocs/images/

If the image is not showing up on genome-test, cd to kent/src/hg/htdocs, ensure the image in the images directory is committed, then run make alpha.

Dev: Gateway: Accession ID check


Assemblies/sequences, from various organizations, are submitted to the mother ship GenBank.
Those assemblies might be included in RefSeq if criteria are met.

The QA check should be to go to NCBI and double check that the accessionID is correct, possibly by searching the Accession ID in https://www.ncbi.nlm.nih.gov/assembly/.

RefSeq assemblies:
use accession ID: GCF_000002315.4 (e.g., galGal5)
are delivered with chrMt (if they exist)
are delivered with NCBI gene predictions
Genbank assemblies:
use accession ID: GCA_000001305.2
delivered without a chrMt.
do not have gene predictions.

For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data'). This is a "learn as we go" direction; historically GenBank was preferred.

Helpful article: Nature, 2012 A beginner's guide to eukaryotic genome annotation

Dev: Gateway: Check the NCBI assembly version link

Check that there is an NCBI link to the exact assembly version, either by clicking the link on the Gateway or searching http://www.ncbi.nlm.nih.gov/assembly/organism/

Dev: Verify make doc for all tracks

  • The makefile/s or initialbuild.txt file for your assembly describes the browser build.
  • Location should be here: ~/kent/src/hg/makeDb/doc/$db/*

Cath asked Hiram about tables that should be mentioned in the makedoc Nov 2017. Below is an example from xenLae2. The makedoc correctly lists all of the necessary tracks.

Mentioned in the makedoc

  1. augustusGene
  2. chromAlias
  3. cpgIslandExt
  4. cpgIslandExtUnmasked
  5. cytoBandIdeo
  6. gap
  7. genscan
  8. gold
  9. microsat
  10. rmsk
  11. simpleRepeat
  12. trackDb
  13. ucscToINSDC
  14. ucscToRefSeq
  15. windowmaskerSdust
  16. (Sometimes) ensGene

Not mentioned in the makedoc, and it is ok that they are not mentioned:

  • Constructed by genbank processes:
  1. all_est
  2. all_mrna
  3. intronEst
  4. refFlat
  5. refGene
  6. refSeqAli
  7. xenoRefFlat
  8. xenoRefGene
  9. xenoRefSeqAli
  10. estOrientInfo
  11. mrnaOrientInfo
  • Constructed by the doBlastzChainNet.pl script:
  1. chainHg38
  2. chainHg38Link
  3. chainMm10
  4. chainMm10Link
  5. chainXenTro9
  6. chainXenTro9Link
  7. netHg38
  8. netMm10
  9. netXenTro9
  • Constructed by the makeGenome.pl script:
  1. chromInfo
  2. gc5BaseBw
  3. grp
  • Constructed by 'make' in trackDb hierarchy:
  1. hgFindSpec
  • Added to by many loader commands:
  1. history
  • Constructed by doRepeatMasker.pl script:
  1. nestedRepeats
  • Constructed by cron job every night:
  1. tableDescriptions

Grep tips: Use your list of tables you'll push (see BETA STEPS) as the grep search string list, and look in the make file to see which tables are NOT mentioned

  • 1. grep -of allTables.xenLae2 ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt | sort -u > tablesListed.makedoc
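  • 2. comm -13 tablesListed.makedoc allTables.xenLae2 (allTables.xenLae2 must be sorted; -13 prints the lines found only in allTables.xenLae2, i.e. the tables NOT mentioned in the makedoc)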

This is also a helpful grep: cat ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt | grep "DONE"
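The same question ("which tables are NOT mentioned?") can also be answered with a small loop that does not depend on sort order. This is just a sketch: <code>tables_not_in_makedoc</code> is a hypothetical helper, and it assumes the table list has one table name per line. It uses whole-word matching, so chainHg38 is not counted as mentioned merely because chainHg38Link is.

```shell
# tables_not_in_makedoc TABLE_LIST MAKEDOC: print each table name from
# TABLE_LIST (one per line) that never appears as a whole word in MAKEDOC.
# (Hypothetical helper.)
tables_not_in_makedoc() {
    while IFS= read -r table; do
        [ -n "$table" ] || continue
        grep -qwF -- "$table" "$2" || printf '%s\n' "$table"
    done < "$1"
}
```

For example: tables_not_in_makedoc allTables.xenLae2 ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt. Compare the result against the list of tables above that are ok to leave out of the makedoc.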

Dev: Review downloads dir

View the contents of the downloads directory.

ls -R /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

LiftOver files and vs* directories are for the chain/net tracks; and the multiz*way, phastCons*way and phyloP*way directories are for conservation tracks.

Note that $db/database dir will be empty except for README.txt. This directory will contain a dump of the database on the RR, but will always remain empty on hgwdev.

Also note that these files:

est.fa.gz      mrna.fa.gz      refMrna.fa.gz      xenoMrna.fa.gz
est.fa.gz.md5  mrna.fa.gz.md5  refMrna.fa.gz.md5  xenoMrna.fa.gz.md5

will not be present on hgwdev. They are generated automatically and rsync'ed to hgdownload after an assembly is added to hgwbeta.dbs and "make etc-update-server" is run in the kent/src/hg/makeDb/genbank/ directory on hgwbeta.

Dev: Run dbCheck

Run the following command to check that all MySQL tables are in good repair:

sudo dbCheck.sh $db

Dev: Alignment files are to valid assemblies

In Redmine for your assembly, the engineer should have provided a path to redmine.$db.file.list E.g., /hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list

From hive, copy the file list to your assembly dir:

/hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list

Take a look at the alignment "To" and "From" files, and make sure they are to valid assemblies on the RR.

LiftOver Files
A liftOver file is a chain file, it is a subset of all chains used in creating the net file.
hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
The file names reflect the assembly conversion data contained within, in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.
Chain Files
Chain files contain all possible chains generated by lastz, before a subset of "best" chains are filtered into the liftOver file.
hg38.anoCar2.all.chain.gz: chained blastz alignments.
The chain format is described on the chain help page.
Net Files
hg38.anoCar2.net.gz: "net" file.
This file describes rearrangements between the species and the best Lizard match to any part of the Human genome. The net format is described on the net help page.
Axt Files
hg38.anoCar2.net.axt.gz: chained and netted alignments.
i.e. the best chains in the Human genome, with gaps in the best chains filled in by next-best chains where possible. The axt format is described in the axt help page.

Dev: liftOver exists: old to new, new to old

Skip this if your assembly is the first version for the organism. Otherwise, check that liftOver files exist in both directions between the previous assembly version and the new version:

/gbdb/[your Db]/liftOver/[your Db]To[the older version of your org].over.chain.gz
/gbdb/[the older version of your org]/liftOver/[the older version of your org]To[your Db].over.chain.gz

Dev: liftOver exists: other orgs

Your assembly will probably also have liftOver files to/from other major orgs, such as the newer human and mouse assemblies. Check that liftOver files exist in BOTH directories,

/gbdb/[your database]/liftOver/
/gbdb/[some other org database]/liftOver

For example, if your assembly is manPen1, see what liftOver files are there. These should also match what is in your filelist from Redmine.

 ★  /gbdb/manPen1/liftOver
ls
manPen1ToHg38.over.chain.gz  manPen1ToMm10.over.chain.gz

Note that there are liftOver files to TWO other orgs, human and mouse. If this assembly was not the first, it should also have liftOver files to the previous assembly version.

Let's go look at liftOver files for hg38:

 ★  /gbdb/hg38/liftOver
ls | grep ManPen
hg38ToManPen1.over.chain.gz

and then we'll check mm10:

 ★  /gbdb/mm10/liftOver
ls | grep ManPen
mm10ToManPen1.over.chain.gz
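The pair of checks above can be sketched as a small helper that builds both file names and reports whether each one exists. The names <code>lift_name</code> and <code>check_liftover_pair</code> are hypothetical, and the first argument (the gbdb root) is a parameter only so that the function can be tried out on a test directory; on hgwdev you would pass /gbdb.

```shell
# lift_name FROM TO: build the liftOver file name; the first letter of the
# target db is capitalized, e.g. hg38ToManPen1.over.chain.gz
# (Hypothetical helper.)
lift_name() {
    first=$(printf '%s' "$2" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$2" | cut -c2-)
    printf '%sTo%s%s.over.chain.gz\n' "$1" "$first" "$rest"
}

# check_liftover_pair GBDB_ROOT DB OTHER_DB: report whether the liftOver
# files exist in BOTH directions; returns non-zero if either is missing.
check_liftover_pair() {
    rc=0
    fwd="$1/$2/liftOver/$(lift_name "$2" "$3")"
    back="$1/$3/liftOver/$(lift_name "$3" "$2")"
    for f in "$fwd" "$back"; do
        if [ -e "$f" ]; then echo "found   $f"; else echo "MISSING $f"; rc=1; fi
    done
    return $rc
}
```

For example: check_liftover_pair /gbdb manPen1 hg38, then again with mm10 (and with the previous assembly version, if there is one).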

Dev: Check Tools: LiftOver

  • Go to dev's LiftOver Tool and test lifts to & from other assembly versions and other organisms that you have liftOver files for.

Dev: Review notes and make temp dir for md5sum checks

There is a way to check all md5sums at once using one command. This should save you lots of time and typing. You'll need two directories in your home folder, temp and temp2.

First go to your test directory:

cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

Then run the following loop command to compare freshly computed md5sums against the md5sum.txt files that were generated when the files were uploaded. If the only output is about README.txt and md5sum.txt, that's great and nothing has changed. If something else comes up, ask your developer.

for dir in *; do cd $dir; md5sum * | sort > ~/temp/$dir; sort md5sum.txt > ~/temp2/$dir; echo $dir; diff ~/temp/$dir ~/temp2/$dir; cd ..; done

OUTDATED BELOW, use above command

First, make a dir "temp" in your home directory. You'll use this in the steps below. The remainder of the text below explains how the md5sum checks will work.

Review the following section, which is a guide to verify that the download files exist and are not corrupt in the following directory and sub dirs:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

We will be using a computer program called Md5sum to generate MD5 hashes to verify the integrity of the files, since any change to a file will cause its MD5 hash to change. The MD5 hashes for each file were generated and stored in the md5sum.txt file.

An easy way to compare the MD5 hashes of each file is to do a diff. This can be easily automated by running the following commands.

The first command is to run md5sum for all files in your current directory (these will be listed in the steps below), sort them, and then redirect the output to a file.

md5sum * | sort > ~/temp/filename_1

The second command sorts the md5sum.txt file and redirects the output to a different file.

sort md5sum.txt > ~/temp/filename_2

The final command compares the two files created and displays the lines that differ between the two files.

diff ~/temp/filename_1 ~/temp/filename_2

Note that the md5sum.txt file obviously does not contain md5sum.txt and it was created before there was a README.txt file, so your diff will show md5sum.txt and README.txt in the results. If everything is ok, those should be the only results. In the vs* directories, the XXXX.net.axt.gz file will show up as axtNet/XXXX.net.axt.gz.

Continue on to the next steps to begin running these checks in the following directories.

Dev: bigZips: check md5sum

Change your directory to:

cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips

then run the following command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

Dev: bigZips: check README

These check README commands can be automated so you don't have to do any of the below commands. You still do have to read or skim the README.txt files that output. Here is the command to display all README files for the whole directory:

cd  /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
for dir in *; do cd $dir; echo $dir; cat README.txt; cd ..; done

REDUNDANT BELOW, above step does all README.txt prints in this assembly directory



/usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: bigZips: check for corruption

Change your working directory to:

cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips

Run the following in the directory:

for file in *; do zcat $file | head; zcat $file | tail; done

Scroll the output and make sure all the text is ASCII.

Dev: database: check README

cat /usr/local/apache/htdocs-hgdownload/goldenPath/$db/database/README.txt
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: liftOver: check md5sum

Change your directory to:

cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver

then run this command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

Dev: liftOver: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: liftOver: corruption

Run the following in each directory and check the output:

for file in *; do zcat $file | head; zcat $file | tail; done

Scroll the output and make sure all the text is ASCII.

Dev: vsXXX: check md5sum

This section is only relevant if your assembly has chain/net files to another organism.

Note: your assembly may have alignment files to multiple organisms; check them all.

Change your directory to:

cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX

then run the following command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

REPEAT this process for subdirectories:

  • reciprocalBest
  • reciprocalBest/axtRBestNet

Dev: vsXXX: check README

cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsMm10
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsHg38


  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

Dev: vsXXX: corruption

Change your directory to the other assemblies' chain directories for your assembly:

cd /usr/local/apache/htdocs-hgdownload/goldenPath/hg38/vs${db^}
cd /usr/local/apache/htdocs-hgdownload/goldenPath/mm10/vs${db^}

Run the following in each directory and check the output:

for file in *; do zcat $file | head; zcat $file | tail; done

Scroll the output and make sure all the text is ASCII.

REPEAT this process for subdirectories:

  • reciprocalBest
  • reciprocalBest/axtRBestNet

Dev: for :queryDb:vsYourDb: check README

/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb

Check the README.txt files for any other organisms that your assembly has alignments (chain/net/liftover/etc) to:

  • Verify that the README.txt exists
  • cat the file and read it, check the contents (such as urls listed, etc.)

REPEAT this process for subdirectory:

  • reciprocalBest (this readme covers the subdir, axtRBestNet).

Dev: for :queryDb:vsYourDb: check md5sum

Note: your assembly may have alignment files to multiple organisms; check them all.

Change your directory to:

hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb

then run the following command:

md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2

REPEAT this process for subdirectories:

  • reciprocalBest
  • reciprocalBest/axtRBestNet

Dev: for :queryDb:vsYourDb: check corruption

/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb

Just do a zcat for otherOrg -> yourDb:

zcat $file | head

REPEAT this process for subdirectories:

  • reciprocalBest
  • reciprocalBest/axtRBestNet

Dev: md5sum check with 2bitCompare

2bitCompare $db

The .2bit files contain the new assembly sequence in a compact, binary format. The .2bit files are located at:

  • /scratch/$db (on the blat server)
  • /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips/ (on hgwdev)
  • /gbdb/$db/ (on hgwdev)
  • /gbdb/$db/ (on hgwbeta)

Check that the .2bit files are identical by running the 2bitCompare script. This matters particularly if the assembly has been part of a multiz track without a Browser: in that case the file may already exist on beta and RR and may not have been masked.

Below is some sample output:

hgwdev> 2bitCompare allMis1

  Checking md5sums.  This could take a few minutes.  Please be patient...

        blat4a md5sum: 134e740c05eedadc24de3a96775a25d6 /scratch/allMis1/allMis1.2bit
      download md5sum: 134e740c05eedadc24de3a96775a25d6 /usr/local/apache/htdocs-hgdownload/goldenPath/allMis1/bigZips/allMis1.2bit
   hgwdev gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
  hgwbeta gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit

        blat4a date,size: Jun 19 11:03 569794406
      download date,size: Jul 3 10:55 53
   hgwdev gbdb date,size: Jun 7 13:34 39
  hgwbeta gbdb date,size: Jun 7 13:33 569794406

The first part of the script output lists the md5sums of all four .2bit files. These should be identical.

The second part of the script output lists the timestamps and filesizes.

  • The download and hgwdev gbdb files should be symlinks, as evidenced by a small filesize.
  • The blat and hgwbeta gbdb files should be the actual files, as evidenced by a large filesize.
  • The two symlink filesizes will likely be different, but the filesize of the two actual files should be identical.
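
The md5sum comparison can also be done mechanically. The sketch below assumes the 2bitCompare output format shown in the sample above, where each md5sum appears as a 32-character hexadecimal string. twobit_md5_agree is a hypothetical helper, not part of 2bitCompare.

```shell
#!/usr/bin/env bash
# Sketch: read 2bitCompare output on stdin and confirm that the expected
# number of md5sums is present and that they are all identical.
# twobit_md5_agree is a hypothetical name; the format is assumed from
# the sample output.
twobit_md5_agree() {
    local expected="${1:-4}" hashes total unique
    hashes=$(grep -oE '[0-9a-f]{32}' || true)
    total=$(printf '%s\n' "$hashes" | awk 'NF {n++} END {print n+0}')
    unique=$(printf '%s\n' "$hashes" | sort -u | awk 'NF {n++} END {print n+0}')
    echo "md5sums found: $total, distinct: $unique"
    [ "$total" -eq "$expected" ] && [ "$unique" -eq 1 ]
}
```

Usage, for example: 2bitCompare $db | twobit_md5_agree. The optional argument sets how many md5sums to expect (default 4). The symlink and filesize points in the list above still need to be read from the date,size lines by eye.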

If the blat .2bit is not the same as the other .2bit files, ask the pushers to restart the assembly and to pull the newest .2bit file from /gbdb.

Note: the script output may include the following lines, because there is no gbdb data on hgwbeta yet (pushing it is part of the data push process):

 hgwbeta/rr gbdb md5sum: The $db directory does not exist in /gbdb on hgwbeta
 hgwbeta/rr gbdb date,size: N/A

Dev: Permissions check: downloads dir

The developer may need to update permissions to the download directory to be at least 664.

hgwdev > cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/

ls -lLR *
-rw-rw-r--

This output can run to thousands of lines if you have many alignments. To shorten it, filter out the lines that match the expected file permission, the vs labels, the total bytes lines, the expected directory permissions, and blank lines, so that only unexpected lines remain. If anything with more restrictive permissions shows up, investigate thoroughly.

ls -lLR * | grep -ve '-rw-rw-r--\|vs\|total\|drwxrwxr-x\|^$'
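
An alternative sketch of the same check uses find instead of filtering ls output. It assumes GNU find. find_restrictive_files is a hypothetical name, and -perm -664 matches files that have at least the 664 permission bits set, so the negation lists files with anything less.

```shell
#!/usr/bin/env bash
# Sketch: list regular files (following symlinks, like ls -L) that lack
# any of the 664 permission bits. find_restrictive_files is a
# hypothetical name.
find_restrictive_files() {
    find -L "$1" -type f ! -perm -664
}
```

Usage, for example: find_restrictive_files /usr/local/apache/htdocs-hgdownload/goldenPath/$db. No output means every file is at least 664. Directory permissions are not checked by this sketch.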

Dev: Ensure your db is defined in trackDb makefile

cat ~/kent/src/hg/makeDb/trackDb/makefile | grep $db
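
A plain grep for $db also matches longer names that contain it (for example, manPen1 would match a hypothetical manPen10). The sketch below uses grep -w to match the name as a whole word and prints line numbers. db_in_makefile is a hypothetical name, and the default path is the one used in the command above.

```shell
#!/usr/bin/env bash
# Sketch: check that a db name appears as a whole word in the trackDb
# makefile. db_in_makefile is a hypothetical name.
db_in_makefile() {
    local db="$1"
    local makefile="${2:-$HOME/kent/src/hg/makeDb/trackDb/makefile}"
    # -n prints line numbers; -w requires a whole-word match.
    grep -nw -- "$db" "$makefile"
}
```

Usage, for example: db_in_makefile $db. It prints the matching lines and returns non-zero when the name is absent.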


🔵 Done with DEV steps? Go to Assembly QA Part 2: Track Steps