Assembly QA Part 1 DEV Steps
These steps were revised in 2017, but you can also see the old steps: Releasing an assembly (old steps)
Setup: Create a Google spreadsheet checklist from a template
Steps:
- Open a new Google Spreadsheet.
- Go to the Google spreadsheet Template: Assembly Release Checklist
- Copy the template: File > Make a copy
- Give your new spreadsheet a title, like "manPen1 Assembly Release Checklist".
- Move your spreadsheet to a good folder on your Google Drive so that you can easily find it later.
- All set! You can now use your checklist.
Tips:
- Note: This system works best when you create one spreadsheet per assembly.
- See the tab, "README" for more info.
- If a wiki section is h4 ("====Wiki Section===="), denoted by surrounding the section with exactly 4 equal signs, then the h4 section will appear as a step in your checklist.
- To add a new step to your checklist - do not add it directly to your spreadsheet. Instead add a new h4 section to the wiki. Just copy an existing h4 and edit it!
- Your h4 will become a URL, so the only punctuation you can use is a colon (":"); any other punctuation will break the wiki link in column A.
- To see your change, toggle the "#" character in your formula. The "#" is not really needed in the formula, and removing it or adding it back in will re-load the page.
Setup: Make a directory in your hive
During this assembly release process, you will be generating a lot of output, and you'll need a place to put everything. The use of the "hive" directory is encouraged as the best location because of ample space.
mkdir /hive/users/userName/assemblies/assemblyName
e.g.: mkdir /hive/users/cath/assemblies/manPen1
Setup: Create an alias to your new dir
When you add an alias to your .bashrc file, you can type that alias on your command line as a shortcut to the associated command. A "shortcut" alias can be created to allow fast access to your hive directory for this assembly.
To do this, follow the steps below:
- In your terminal, connect to hgwdev and type "cd" (go to your home directory).
- Confirm the location of .bashrc. Type "ls -a" in your home directory to list all files, including hidden files whose names begin with ".", and check that your .bashrc file is there.
- Open your .bashrc file for editing. If you're using the vi editor, you can type "vi .bashrc" to edit the file. Add an alias by typing in the line below, then save your changes.
alias hive='cd /hive/users/yourUserName/assemblies/yourAssembly'
e.g., alias hive='cd /hive/users/cath/assemblies/manPen1'
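The edit above can also be scripted. This is a minimal sketch, assuming it is acceptable to append to the end of your .bashrc; the function name add_hive_alias is hypothetical. It adds the alias line only if no "alias hive=" line is already present, so running it twice is harmless.

```shell
# Hypothetical helper: append the "hive" alias to a bashrc file unless one exists.
# $1 = path to the bashrc file, $2 = assembly directory to jump to.
add_hive_alias() {
    local bashrc=$1 dir=$2
    if grep -q '^alias hive=' "$bashrc" 2>/dev/null; then
        echo "alias hive already defined in $bashrc"
        return 0
    fi
    printf "alias hive='cd %s'\n" "$dir" >> "$bashrc"
}

# Example (paths follow the naming used on this page):
# add_hive_alias ~/.bashrc /hive/users/cath/assemblies/manPen1
# source ~/.bashrc
```

After sourcing .bashrc (or opening a new terminal), typing "hive" should take you to the assembly directory.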
Redmine: Review the Redmine as PushQ wiki
- As of March 2017, the PushQ has been replaced with Redmine to track and release new assemblies.
- Review the Redmine as the pushQ replacement wiki page.
- Go to Redmine > GB > Issues > Filter: "Ready for QA"
- Find the assembly you will QA/Release
Redmine: Set assignee as yourself
Redmine: Set the engineer as watcher if they are not the developer
Redmine: Set Status to Reviewing
Dev: Check minimal browser criteria
Does this assembly have the required tracks?
Visit this page to check that the assembly contains the required tracks to be considered a minimal browser on the RR.
To add explanation: genbank mrnas & ests (/cluster/data/genbank/data/organism.lst). How to view/interpret the file.
Dev: Check that BLAT Server is running
To check whether your organism has blat servers set up, run the following command (beware that copyHgcentral creates many temp files):
copyHgcentral test $db blatServers dev beta
A better command, which does not create many temporary files, is to query hgcentraltest yourself:
hgsql -e "select * from blatServers where db='$db'" hgcentraltest
The developer has often already requested that the blat servers be set up for the new assembly. If not, and/or if entries for your assembly are missing from hgcentraltest.blatServers, please make a note in the Redmine ticket and ask the assembly builder to 1) request the setup of the blat servers and 2) manually add the entries to hgcentraltest.blatServers. Make sure that this assembly is not hosted on the "blatx" BLAT server. That server is not as stable and is therefore meant for assemblies that are not destined for the RR. For more information about where the blat servers for different machines should be hosted, go to Updating blat servers.
You should see results like the following, since this should only be set up on dev so far:
copyHgcentral test manPen1 blatServers dev beta
--------------------------------------------------
--------------------------------------------------
<<< blatServers >>>
hgcentraltest
-------------
manPen1 blat1b 17878 1 0
manPen1 blat1b 17879 0 1
hgcentralbeta
-------------
hgcentral
-------------
*** There are blatServers differences between dev and beta ***
*** The blatServers data on beta and rr is identical ***
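The two checks described in this step (entries exist, and none are on "blatx") can be sketched as a small filter over the hgsql output. This is only an illustration: the function name check_blat_rows is hypothetical, and it assumes the first two columns of each row are the db and the host, as in the sample output above.

```shell
# Hypothetical helper: read blatServers rows (db host port ...) on stdin
# and flag a missing entry or a "blatx" host for the given db.
check_blat_rows() {
    awk -v db="$1" '
        $1 == db { n++; if ($2 ~ /^blatx/) bad++ }
        END {
            if (n == 0)  { print "MISSING: no blatServers rows for " db; exit 1 }
            if (bad > 0) { print "WARNING: " db " is hosted on blatx"; exit 1 }
            print "OK: " n " blatServers rows for " db
        }'
}

# Example, using the query from this step:
# hgsql -Ne "select * from blatServers where db='$db'" hgcentraltest | check_blat_rows "$db"
```

A MISSING or WARNING result is the cue to make a note in the Redmine ticket as described above.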
Dev: Do a BLAT search: DNA
From BLAT tool on dev:
- Go to your browser and copy some DNA sequence
- Go to BLAT: Home > Tools > Blat
- Paste in sequence
- Change query type to DNA and press submit
- Click on various blat results to make sure they look as expected
- Make a custom track of blat results and then look at them in the browser.
Dev: Do a BLAT search: protein
From BLAT tool on dev:
- Go to your browser and copy some DNA sequence -> translate to amino acid sequence*
- Go to BLAT: Home > Tools > Blat
- Paste in sequence
- Change query type to "protein" (amino acid) and press submit
- Click on various blat results to make sure they look as expected
- Make a custom track of blat results and then look at them in the browser.
Dev: isPCR test
- Go to dev's PCR Tool and test a PCR search for your assembly.
For example, on hg38:
- You want to get DNA, about 20-23 bases, that "book end" the region of DNA that will be amplified.
- For example, here's a 70bp region in hg38: chr1:11,131,574-11,131,643
- Go to this region on hg38
- hgTracks, View > DNA (v + d keyboard shortcut)
- Click "get DNA" with the default selections.
- Copy the DNA to your clipboard:
>hg38_dna range=chr1:11131574-11131643 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
- Go to isPCR for hg38 (Tools >InSilico PCR)
- Genome: Human
- Assembly: hg38
- Target: genome assembly
- Forward Primer: The first 20(ish) bp of the region, e.g., CCTGGTCCCAACACCTAGCC (in green below)
- CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
- Reverse Primer: The last 20(ish) bp of your region, e.g., GCTTGAAGGAAGAACCGCTGG (in red below)
- CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
- Check "Flip reverse primer". This will change the region in red to its reverse complement.
- The idea is that you are finding the DNA between the green and the red chunks to amplify.
- Click submit.
For the reverse primer in red, you could instead have output the "-" strand DNA (the reverse complement) in "Tools > View DNA" by selecting the radio button for the reverse complement. If you paste that sequence into the "Reverse Primer" field in isPCR, then you do not have to select "Flip Reverse Primer."
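To make the primer selection concrete, here is a small sketch that cuts the two primers out of the 70 bp hg38 region above and reverse-complements the reverse primer. The revcomp function is a hypothetical helper, not part of isPCR; it uses only awk, cut, and tr.

```shell
# Reverse-complement a DNA string (hypothetical helper, not part of isPCR).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

region=CCTGGTCCCAACACCTAGCCCACGGCCTGACAGAGAACCAGTGCTCAATGCTTGAAGGAAGAACCGCTGG
forward=$(printf '%s\n' "$region" | cut -c1-20)    # first 20 bases
reverse=$(printf '%s\n' "$region" | cut -c50-70)   # last 21 bases of the 70 bp region

echo "forward: $forward"
echo "reverse: $reverse"
echo "reverse, already flipped: $(revcomp "$reverse")"
```

The last line prints CCAGCGGTTCTTCCTTCAAGC, which is the sequence you would paste into the "Reverse Primer" field if you leave "Flip Reverse Primer" unchecked.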
Dev: Compare chrom sizes
- Skip this if your assembly is the first for a species (hosted by UCSC); there will be no chrom sizes to compare to!
- For a new assembly version, compare the chrom sizes from the last assembly to this new assembly version. For some assemblies, chrom names were changed; be aware of this when comparing. You are not checking annotations on the reference sequence; you are just checking the number of base pairs per chrom/contig and making sure that nothing has changed drastically (i.e., millions of base pairs different). Also look for general differences, such as chrom labels or the number of chroms/contigs.
- Output chrom sizes into two files, sort each file by using the command below
- Compare the sorted files
- There are two ways to compare chromosomes:
- 1. Navigate to http://hgwdev.gi.ucsc.edu/cgi-bin/hgGateway, find your assembly, and click on the "View Sequences" button. Bring up two windows side by side to view both the old and new assemblies, then compare the chromosome sizes.
or
2. Open up a terminal window and input the following commands:
hgsql -Ne "select chrom, size from chromInfo" $prevDb | sort > oldChromSizes
hgsql -Ne "select chrom, size from chromInfo" $db | sort > newChromSizes
sdiff -s oldChromSizes newChromSizes
You may want to use "$cat oldChromSizes | head" to clean up the output in both old and new chromSize files, we are only concerned with the "chr# ####" labels.
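When the sdiff output is long, it can help to list only the chroms whose size changed drastically. The sketch below assumes the two files written by the commands above ("chrom size" per line) and that chrom names are unchanged between the versions; the function name big_size_changes and the default threshold of 1,000,000 bp are illustrative choices.

```shell
# Hypothetical helper: list chroms present in both files whose size changed
# by more than a threshold (default 1,000,000 bp). Input lines: "chrom size".
# Output lines: chrom oldSize newSize.
big_size_changes() {
    local old=$1 new=$2 threshold=${3:-1000000}
    awk -v t="$threshold" '
        NR == FNR { old[$1] = $2; next }
        ($1 in old) {
            d = $2 - old[$1]; if (d < 0) d = -d
            if (d > t) print $1, old[$1], $2
        }' "$old" "$new"
}

# Example:
# big_size_changes oldChromSizes newChromSizes
```

Chroms that exist in only one of the two files are not reported here; the sdiff command above still shows those.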
Dev: Gateway: Check the tree
On hgGateway, make sure your db appears in the tree.
- Type the first few letters of your assembly name in the search field above the "Represented Species" tree, "m-a-n-P-e..." and the rest should populate.
- Your assembly should now be highlighted in the tree, and the tree position should have moved so that you are now centered on the tree position for your org.
- Hover over the name of your org within the tree, you should see the scientific name.
- Hover over the horizontal branch leading to your org, you should see the genus - family - order.
- Hover over the vertical branch leading to your org, you should see the superorder.
- Go to a different organism on hgGateway. Then scroll down the tree and find your organism. Click on the name of your organism in the tree and you should go to the default assembly for your organism.
Dev: Gateway: Check default position
- Go to gateway page
- Reset all user settings (Home > Genome Browser > Reset All User Settings)
- Select the assembly you're testing
- Press "Go" on hgGateway
- You will be taken to the default position for your assembly.
- Make sure that the resulting area is scientifically interesting and aesthetically pleasing!
- You can edit the default location here: hgcentraltest.dbDb.defaultPos:
hgsql -e "update dbDb set defaultPos='chr6:43426669-43433274' where name='danRer11'" hgcentraltest
On an unrelated dbDb note, setting the field hgPbOk=1 sets the base pairs shown on hgTracks from T's to U's. This also affects zoomed in MAF files, but shouldn't matter unless we're displaying an RNA genome like SARS-CoV-2. This field was left over from the protein browser and was repurposed, so it should be 0 for all DNA genomes.
Dev: Gateway: Check default tracks
- Each assembly has certain tracks that are hidden or visible by default.
- You can edit the default tracks here: /kent/src/hg/makeDb/trackDb/$db/trackDb.ra.
Below is an example for turning on a default gene track that was off when the developer released the assembly to dev.
Resource: https://genome.ucsc.edu/goldenpath/help/trackDb/trackDbDoc.html
- manPen1 has no gene tracks on by default.
- I want to turn on the augustus track (on by default, pack visibility).
- Looking at ~/kent/src/hg/makeDb/trackDb/$db/trackDb.ra, I see that there is no stanza for the augustus track, because it is inheriting the parent *.ra file's configuration, making it hidden.
- I need to override the parent config in the manPen1 .ra file.
Steps:
- go to dev, Genome Browser > Reset All User Settings
- note which track you would like to turn on, see if you want it in 'pack' or 'full' etc.
- vim ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra
- Add something like this:
#Local declaration so that augustus genes is picked up.
track augustusGene override
visibility pack
- cd ~/kent/src/hg/makeDb/trackDb
- make alpha DBS=manPen1
- refresh your dev hgTracks browser and see that your track is now on, using the visibility from your override (pack, in this case).
- if all looks good, add, commit, push your .ra file.
If your assembly is already public on the RR, then continue the push:
- make beta DBS=manPen1
- make public DBS=manPen1
- Push request to admins: Make trackDb & friends for manPen1
- Check the rr/euro/asia for your newly visible track.
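Before adding an override, it is worth confirming that the assembly's trackDb.ra really has no stanza for the track yet. A minimal sketch, with a hypothetical function name has_stanza; it only looks for a "track <name>" line and does not parse the .ra format any further.

```shell
# Hypothetical helper: succeed if a trackDb .ra file already has a
# "track <name>" line (with or without "override") for the given track.
has_stanza() {
    local ra=$1 track=$2
    grep -Eq "^[[:space:]]*track[[:space:]]+${track}([[:space:]]|\$)" "$ra"
}

# Example:
# has_stanza ~/kent/src/hg/makeDb/trackDb/pangolin/manPen1/trackDb.ra augustusGene \
#     && echo "stanza exists, edit it" || echo "no stanza, add an override"
```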
Dev: Gateway: Check trackDb priority
- Each assembly has certain tracks that are hidden or visible by default.
- Our standard is to have the visible tracks at the beginning of each group.
- You can edit the default tracks here: /kent/src/hg/makeDb/trackDb/$db/trackDb.ra.
Tracks that are on by default should be the first tracks within a group. For example, GENCODE v29 is first in the "Genes and Gene Predictions" group for hg38. All other tracks, which are hidden by default, follow the visible tracks in alphabetical order. The only exception to this rule is for the chain/net tracks inside of the "Comparative Genomics" group. These chain/net tracks are in phylogenetic order and should not be in alphabetical order.
To change the order of the tracks on hgTracks, you can use the priority trackDb setting:
priority 1
hgTracks will display the tracks with the lowest priority value first, followed by any tracks without a priority setting in alphabetical order. The priority value can be a floating point number, so a track with a priority value of 1.1 will be displayed after a track with a priority value of 1.
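The ordering rule can be illustrated with a small sort. This is a sketch of the rule as described here, not of hgTracks itself: the function name order_tracks and the "trackName [priority]" input format are made up for the example, and the alphabetical part uses plain byte order.

```shell
# Hypothetical helper: read "trackName [priority]" lines on stdin and print
# names in the order described above: ascending priority first, then
# tracks without a priority, alphabetically.
order_tracks() {
    awk 'NF >= 2 { print "0|" $2 "|" $1; next }
         NF == 1 { print "1|0|" $1 }' |
        LC_ALL=C sort -t '|' -k1,1n -k2,2n -k3,3 |
        cut -d '|' -f3
}

printf 'zeta\nknownGene 1\nalpha\nrefGene 1.1\n' | order_tracks
```

This prints knownGene, refGene, alpha, zeta (one per line): the two prioritized tracks in numeric order, then the rest alphabetically.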
Dev: Gateway: Organism image check
The image on hgGateway is referenced from trackDb directory's description.html file.
In your file.list from Redmine, make sure a scientificName.jpg image is listed, and check that it exists on dev.
The image file that appears on the gateway page should reside in the kent source tree in:
~/kent/src/hg/htdocs/images/
and a copy should exist at:
hgwdev > /usr/local/apache/htdocs/images/
If the image is not showing up on genome-test, cd to kent/src/hg/htdocs, ensure the image in the images directory is committed, then run make alpha.
Dev: Gateway: Accession ID check
Assemblies/sequences, from various organizations, are submitted to the mother ship GenBank.
Those assemblies might be included in RefSeq if criteria are met.
The QA check should be to go to NCBI and double check that the accessionID is correct, possibly by searching the Accession ID in https://www.ncbi.nlm.nih.gov/assembly/.
- RefSeq assemblies:
- use accession ID: GCF_000002315.4 (e.g., galGal5)
- are delivered with chrMt (if it exists)
- are delivered with NCBI gene predictions
- Genbank assemblies:
- use accession ID: GCA_000001305.2
- delivered without a chrMt.
- do not have gene predictions.
For the UCSC Genome Browser, it is preferable to use RefSeq assemblies (in part due to 'more data'). This is a "learn as we go" direction; historically GenBank was preferred.
Helpful article: Nature, 2012 A beginner's guide to eukaryotic genome annotation
Dev: Gateway: Check the NCBI assembly version link
Check that there is an NCBI link to the exact assembly version, either by clicking the link on the Gateway or searching http://www.ncbi.nlm.nih.gov/assembly/organism/
Dev: Verify make doc for all tracks
- The makefile/s or initialbuild.txt file for your assembly describes the browser build.
- Location should be here: ~/kent/src/hg/makeDb/doc/$db/*
Cath asked Hiram about tables that should be mentioned in the makedoc Nov 2017. Below is an example from xenLae2. The makedoc correctly lists all of the necessary tracks.
Mentioned in the makedoc
- augustusGene
- chromAlias
- cpgIslandExt
- cpgIslandExtUnmasked
- cytoBandIdeo
- gap
- genscan
- gold
- microsat
- rmsk
- simpleRepeat
- trackDb
- ucscToINSDC
- ucscToRefSeq
- windowmaskerSdust
- (Sometimes) ensGene
Not mentioned in the makedoc, and it is ok that they are not mentioned:
- Constructed by genbank processes:
- all_est
- all_mrna
- intronEst
- refFlat
- refGene
- refSeqAli
- xenoRefFlat
- xenoRefGene
- xenoRefSeqAli
- estOrientInfo
- mrnaOrientInfo
- Constructed by the doBlastzChainNet.pl script:
- chainHg38
- chainHg38Link
- chainMm10
- chainMm10Link
- chainXenTro9
- chainXenTro9Link
- netHg38
- netMm10
- netXenTro9
- Constructed by the makeGenome.pl script:
- chromInfo
- gc5BaseBw
- grp
- Constructed by 'make' in trackDb hierarchy:
- hgFindSpec
- Added to by many loader commands:
- history
- Constructed by doRepeatMasker.pl script:
- nestedRepeats
- Constructed by cron job every night:
- tableDescriptions
Grep tips: Use the list of tables you'll push (see BETA STEPS) as the grep search string list, and look in the make file to see which tables are NOT mentioned:
- 1. grep -of allTables.xenLae2 ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt | sort -u > tablesListed.makedoc
- 2. sort -u allTables.xenLae2 | comm -13 tablesListed.makedoc -
This is also a helpful grep: cat ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt | grep "DONE"
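The "which tables are NOT mentioned" check can be wrapped into one function. A sketch under a few assumptions: the function name unmentioned_tables is hypothetical, table names are matched as fixed strings and whole words (grep -w -F), and comm -13 prints the lines that appear only in the full table list.

```shell
# Hypothetical helper: print tables from a table list that never appear
# as a whole word in the makedoc. $1 = table list, $2 = makedoc file.
unmentioned_tables() {
    local tables=$1 makedoc=$2 listed
    listed=$(mktemp) || return 1
    # Tables that ARE mentioned (fixed strings, whole words only).
    grep -owF -f "$tables" "$makedoc" | LC_ALL=C sort -u > "$listed"
    # Lines only in the full table list, i.e. not mentioned.
    LC_ALL=C sort -u "$tables" | LC_ALL=C comm -13 "$listed" -
    rm -f "$listed"
}

# Example:
# unmentioned_tables allTables.xenLae2 ~/kent/src/hg/makeDb/doc/xenLae2/initialBuild.txt
```

Compare the result against the "Not mentioned in the makedoc, and it is ok" list above; anything left over is worth asking the developer about.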
Dev: Review downloads dir
View the contents of the downloads directory.
ls -R /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
LiftOver files and vs* directories are for the chain/net tracks; and the multiz*way, phastCons*way and phyloP*way directories are for conservation tracks.
Note that $db/database dir will be empty except for README.txt. This directory will contain a dump of the database on the RR, but will always remain empty on hgwdev.
Also note that these files:
est.fa.gz mrna.fa.gz refMrna.fa.gz xenoMrna.fa.gz
est.fa.gz.md5 mrna.fa.gz.md5 refMrna.fa.gz.md5 xenoMrna.fa.gz.md5
will not be present on hgwdev. They are generated automatically and rsync'ed to hgdownload after an assembly is added to hgwbeta.dbs and "make etc-update-server" is run in the kent/src/hg/makeDb/genbank/ directory on hgwbeta.
Dev: Run dbCheck
Run the following command to check that all MySQL tables are in good repair:
sudo dbCheck.sh $db
Dev: Alignment files are to valid assemblies
In Redmine for your assembly, the engineer should have provided a path to redmine.$db.file.list E.g., /hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list
From hive, copy the file list to your assembly dir:
/hive/data/genomes/manPen1/redmine5515/redmine.manPen1.file.list
Take a look at the alignment "To" and "From" files, and make sure they are to valid assemblies on the RR.
- LiftOver Files
- hg38ToAnoCar2.over.chain.gz: These files are required for the liftOver utility.
- A liftOver file is a chain file; it is a subset of all chains used in creating the net file.
- The file names reflect the assembly conversion data contained within, in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg38ToAnoCar2.over.chain.gz contains the liftOver data needed to convert hg38 coordinates to the anoCar2 assembly.
- Chain Files
- hg38.anoCar2.all.chain.gz: chained blastz alignments.
- Chain files contain all possible chains generated by lastz, before a subset of "best" chains is filtered into the liftOver file.
- The chain format is described on the chain help page.
- Net Files
- hg38.anoCar2.net.gz: "net" file.
- This file describes rearrangements between the species and the best Lizard match to any part of the Human genome. The net format is described on the net help page.
- Axt Files
- hg38.anoCar2.net.axt.gz: chained and netted alignments.
- i.e., the best chains in the Human genome, with gaps in the best chains filled in by next-best chains where possible. The axt format is described on the axt help page.
Dev: liftOver exists: old to new, new to old
Skip this if your assembly is the first version for the organism. Otherwise, check that the previous assembly version has a liftOver file to the new version, and that the new version has a reciprocal file to the previous version:
/gbdb/[your Db]/liftOver/[your Db]To[the older version of your org].over.chain.gz
/gbdb/[the older version of your org]/liftOver/[the older version of your org]To[your Db].over.chain.gz
Dev: liftOver exists: other orgs
Your assembly will probably also have liftOver files to/from other major orgs, such as the newer human and mouse assemblies. Check that liftOver files exist in BOTH directories,
/gbdb/[your database]/liftOver/
/gbdb/[some other org database]/liftOver
For example, if your assembly is manPen1, see what liftOver files are there. These should also match what is in your filelist from Redmine.
★ /gbdb/manPen1/liftOver ls
manPen1ToHg38.over.chain.gz manPen1ToMm10.over.chain.gz
Note that there are liftOver files to TWO other orgs, human and mouse. If this assembly was not the first, it should also have liftOver files to the previous assembly version.
Let's go look at liftOver files for hg38:
★ /gbdb/hg38/liftOver ls | grep ManPen
hg38ToManPen1.over.chain.gz
and then we'll check mm10:
★ /gbdb/mm10/liftOver ls | grep ManPen
mm10ToManPen1.over.chain.gz
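The "both directories" check follows directly from the file naming pattern, so it can be sketched as two small functions. Both names (liftover_name, check_liftover_pair) are hypothetical, and the optional third argument exists only so the check can be pointed at a directory other than /gbdb.

```shell
# Hypothetical helper: print the expected file name for a lift from one db
# to another; the target db's first letter is capitalized.
liftover_name() {
    local from=$1 to=$2 first rest
    first=$(printf '%s\n' "$to" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s\n' "$to" | cut -c2-)
    printf '%sTo%s%s.over.chain.gz\n' "$from" "$first" "$rest"
}

# Hypothetical helper: report whether both directions exist under a gbdb
# root (default /gbdb). Returns non-zero if either file is missing.
check_liftover_pair() {
    local a=$1 b=$2 root=${3:-/gbdb} f status=0
    for f in "$root/$a/liftOver/$(liftover_name "$a" "$b")" \
             "$root/$b/liftOver/$(liftover_name "$b" "$a")"; do
        if [ -e "$f" ]; then echo "found:   $f"; else echo "MISSING: $f"; status=1; fi
    done
    return "$status"
}

# Example:
# check_liftover_pair manPen1 hg38
# check_liftover_pair manPen1 mm10
```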
Dev: Check Tools: LiftOver
- Go to dev's LiftOver Tool and test lifts to & from other assembly versions and other organisms that you have liftOver files for.
Dev: Review notes and make temp dir for md5sum checks
There is a way to check all md5sums at once using one command. This should save you lots of time and typing. You'll need two directories in your home folder, temp and temp2.
First go to your test directory:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
Then run the following loop command to compare freshly computed md5sums with the md5sum.txt files generated when the files were uploaded. If the only output is about README.txt and md5sum.txt, that's great: nothing has changed. If something else comes up, ask your developer.
for dir in *; do cd $dir; md5sum * | sort > ~/temp/$dir; sort md5sum.txt > ~/temp2/$dir; echo $dir; diff ~/temp/$dir ~/temp2/$dir; cd ..; done
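A variant of the loop above that needs no ~/temp directories and leaves README.txt and md5sum.txt out of the comparison is sketched below. The function name check_md5_dir is hypothetical. It runs in a subshell, so your working directory is unchanged, and it returns non-zero when anything differs.

```shell
# Hypothetical helper: compare fresh md5sums of the files in one directory
# with its md5sum.txt, skipping md5sum.txt and README.txt themselves.
# Prints diff output and returns non-zero if anything differs.
check_md5_dir() (
    cd "$1" || exit 2
    fresh=$(mktemp) || exit 2
    stored=$(mktemp) || exit 2
    for f in *; do
        [ -f "$f" ] || continue
        case $f in md5sum.txt|README.txt) continue ;; esac
        md5sum "$f"
    done | sort > "$fresh"
    sort md5sum.txt > "$stored"
    diff "$fresh" "$stored"
    status=$?
    rm -f "$fresh" "$stored"
    exit "$status"
)

# Example:
# for dir in /usr/local/apache/htdocs-hgdownload/goldenPath/$db/*/; do
#     echo "$dir"; check_md5_dir "$dir"
# done
```

Entries that md5sum.txt lists with a subdirectory prefix, such as axtNet/XXXX.net.axt.gz in the vs* directories, will still show up as differences and need a manual look.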
OUTDATED BELOW, use above command
First, make a dir "temp" in your home directory. You'll use this in the steps below. The remainder of the text below explains how the md5sum checks will work.
Review the following section, which is a guide to verify that the download files exist and are not corrupt in the following directory and sub dirs:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
We will be using a program called md5sum to generate MD5 hashes to verify the integrity of the files, since any change to a file will cause its MD5 hash to change. The MD5 hashes for each file were generated and stored in the md5sum.txt file.
An easy way to compare the MD5 hashes of each file is to do a diff. This can be easily automated by running the following commands.
The first command is to run md5sum for all files in your current directory (these will be listed in the steps below), sort them, and then redirect the output to a file.
md5sum * | sort > ~/temp/filename_1
The second command sorts the md5sum.txt file and redirects the output to a different file.
sort md5sum.txt > ~/temp/filename_2
The final command compares the two files created and displays the lines that differ between the two files.
diff ~/temp/filename_1 ~/temp/filename_2
Note that the md5sum.txt file obviously does not contain md5sum.txt and it was created before there was a README.txt file, so your diff will show md5sum.txt and README.txt in the results. If everything is ok, those should be the only results. In the vs* directories, the XXXX.net.axt.gz file will show up as axtNet/XXXX.net.axt.gz.
Continue on to the next steps to begin running these checks in the following directories.
Dev: bigZips: check md5sum
Change your directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev: bigZips: check README
These check README commands can be automated so you don't have to do any of the below commands. You still do have to read or skim the README.txt files that output. Here is the command to display all README files for the whole directory:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
for dir in *; do cd $dir; echo $dir; cat README.txt; cd ..; done
REDUNDANT BELOW, above step does all README.txt prints in this assembly directory
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: bigZips: check for corruption
Change your working directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips
Run the following in the directory:
for file in *; do zcat $file | head; zcat $file | tail; done
Scroll the output and make sure all the text is ASCII.
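Scrolling by eye can be backed up with an automated check. The sketch below (hypothetical function name gz_looks_ascii) first tests the gzip stream with gzip -t, then looks at the same head and tail lines for any byte that is not printable ASCII or whitespace. It only applies to .gz files; other files in the directory, such as .2bit, are binary by design.

```shell
# Hypothetical helper: return 0 if a .gz file decompresses cleanly and the
# first and last 10 lines contain only printable ASCII and whitespace,
# 1 if other bytes are found, 2 if the file is not a valid gzip stream.
gz_looks_ascii() {
    local file=$1
    gzip -t "$file" 2>/dev/null || return 2
    if { gzip -dc "$file" 2>/dev/null | head; gzip -dc "$file" 2>/dev/null | tail; } |
        LC_ALL=C grep -qa '[^[:print:][:space:]]'; then
        return 1
    fi
    return 0
}

# Example:
# for file in *.gz; do gz_looks_ascii "$file" || echo "check $file"; done
```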
Dev: database: check README
cat /usr/local/apache/htdocs-hgdownload/goldenPath/$db/database/README.txt
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: liftOver: check md5sum
Change your directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
then run this command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
Dev: liftOver: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/$db/liftOver
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: liftOver: corruption
Run the following in each directory and check the output:
for file in *; do zcat $file | head; zcat $file | tail; done
Scroll the output and make sure all the text is ASCII.
Dev: vsXXX: check md5sum
This section is only relevant if your assembly has chain/net files to another organism.
Note: there may be multiple organisms that your assembly has alignment files to, check them all.
Change your directory to:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsXXX
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
REPEAT this process for subdirectories:
- reciprocalBest
- reciprocalBest/axtRBestNet
Dev: vsXXX: check README
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsMm10
cd /usr/local/apache/htdocs-hgdownload/goldenPath/$db/vsHg38
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
Dev: vsXXX: corruption
Change your directory to the other assemblies' chain directories for your assembly:
cd /usr/local/apache/htdocs-hgdownload/goldenPath/hg38/vs${db^}
cd /usr/local/apache/htdocs-hgdownload/goldenPath/mm10/vs${db^}
Run the following in each directory and check the output:
for file in *; do zcat $file | head; zcat $file | tail; done
Scroll the output and make sure all the text is ASCII.
REPEAT this process for subdirectories:
- reciprocalBest
- reciprocalBest/axtRBestNet
Dev: for :queryDb:vsYourDb: check README
/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
Check the README.txt files for any other organisms that your assembly has alignments (chain/net/liftover/etc) to:
- Verify that the README.txt exists
- cat the file and read it, check the contents (such as urls listed, etc.)
REPEAT this process for subdirectory:
- reciprocalBest (this readme covers the subdir, axtRBestNet).
Dev: for :queryDb:vsYourDb: check md5sum
Note: there may be multiple organisms that your assembly has alignment files to, check them all.
Change your directory to:
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
then run the following command:
md5sum * | sort > ~/temp/filename_1; sort md5sum.txt > ~/temp/filename_2; diff ~/temp/filename_1 ~/temp/filename_2
REPEAT this process for subdirectories:
- reciprocalBest
- reciprocalBest/axtRBestNet
Dev: for :queryDb:vsYourDb: check corruption
/usr/local/apache/htdocs-hgdownload/goldenPath/queryDb/vsYourDb
Just do a zcat for otherOrg -> yourDb:
zcat $file | head
REPEAT this process for subdirectories:
- reciprocalBest
- reciprocalBest/axtRBestNet
Dev: md5sum check with 2bitCompare
2bitCompare $db
The .2bit files contain the new assembly sequence in a compact, binary format. The .2bit files are located at:
- /scratch/$db (on the blat server)
- /usr/local/apache/htdocs-hgdownload/goldenPath/$db/bigZips/ (on hgwdev)
- /gbdb/$db/ (on hgwdev)
- /gbdb/$db/ (on hgwbeta)
Check to make sure that the .2bit files are identical by running the 2bitCompare script. Particularly if the assembly has been part of a multiz track without a Browser, the file may exist on beta and RR and may not have been masked.
Below is some sample output:
hgwdev> 2bitCompare allMis1
Checking md5sums. This could take a few minutes. Please be patient...
blat4a md5sum: 134e740c05eedadc24de3a96775a25d6 /scratch/allMis1/allMis1.2bit
download md5sum: 134e740c05eedadc24de3a96775a25d6 /usr/local/apache/htdocs-hgdownload/goldenPath/allMis1/bigZips/allMis1.2bit
hgwdev gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
hgwbeta gbdb md5sum: 134e740c05eedadc24de3a96775a25d6 /gbdb/allMis1/allMis1.2bit
blat4a date,size: Jun 19 11:03 569794406
download date,size: Jul 3 10:55 53
hgwdev gbdb date,size: Jun 7 13:34 39
hgwbeta gbdb date,size: Jun 7 13:33 569794406
The first part of the script output lists the md5sums of all four .2bit files. These should be identical.
The second part of the script output lists the timestamps and filesizes.
- The download and hgwdev gbdb files should be symlinks, as evidenced by a small filesize.
- The blat and hgwbeta gbdb files should be the actual files, as evidenced by a large filesize.
- The two symlink filesizes will likely be different, but the filesize of the two actual files should be identical.
If the blat .2bit is not the same as the other .2bit files, ask the pushers to restart the assembly and to pull the newest .2bit file from /gbdb.
Note: the output may include lines like the following:
hgwbeta/rr gbdb md5sum: The $db directory does not exist in /gbdb on hgwbeta
hgwbeta/rr gbdb date,size: N/A
This can happen because there is no gbdb data on beta yet; that is part of the data push process.
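Comparing four 32-character hashes by eye is error-prone, so the "these should be identical" check can be sketched as a filter over the script output. The function name same_md5sums is hypothetical, and it assumes the output format shown in the sample above. Lines without a hash, such as the "does not exist" line, are ignored.

```shell
# Hypothetical helper: read 2bitCompare output on stdin and succeed only if
# every "md5sum:" line that carries a hash carries the same hash.
same_md5sums() {
    local n
    n=$(grep 'md5sum:' | grep -Eo '[0-9a-f]{32}' | sort -u | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# Example:
# 2bitCompare $db | same_md5sums && echo "all .2bit md5sums match" || echo "MISMATCH"
```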
Dev: Permissions check: downloads dir
The developer may need to update permissions to the download directory to be at least 664.
hgwdev > /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
ls -lLR *
-rw-rw-r--
This output can be thousands of lines if you have lots of alignments. To shorten it, you can hide the lines that match that permission, the vs label, the total bytes line, the directory permissions, and blank lines, so that only the exceptions remain. If it finds anything with fewer permissions, investigate thoroughly.
ls -lLR * | grep -ve '-rw-rw-r--\|vs\|total\|drwxrwxr-x\|^$'
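An alternative to filtering ls output is to let find test the permission bits directly. This is a sketch with a hypothetical function name, lacking_664; "-perm -664" matches files that have all of the rw-rw-r-- bits set, so negating it lists the files with fewer permissions, and -L follows symlinks the way "ls -L" does.

```shell
# Hypothetical helper: list regular files (following symlinks) that lack
# any of the rw-rw-r-- (664) permission bits.
lacking_664() {
    find -L "${1:-.}" -type f ! -perm -664
}

# Example:
# lacking_664 /usr/local/apache/htdocs-hgdownload/goldenPath/$db/
```

No output means every file is at least 664.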
Dev: Ensure your dbs is defined in trackDb makefile
cat ~/kent/src/hg/makeDb/trackDb/makefile | grep $db
🔵 Done with DEV steps? Go to Assembly QA Part 2: Track Steps