Conservation Track QA
Preparing to QA
Please follow the instructions in this wiki page, in addition to the New Track Checklist. The Conservation track will consist of one or more of the following subtracks: multiz#way, phyloP, phastCons and Most Conserved/Conserved Elements. Sometimes if there is only a multiz#way and a Most Conserved/Conserved Elements tracks, they will be in separate tracks. Otherwise, they should all be together.
Make sure that pushQ entry has all gbdb files and downloads needed for the track
For gbdb, there should be either be one big .maf file or one .maf file for each chromosome for the multiz#way track. Both phyloP and phastCons will have one .wib file associated with each. There are no gbdb files for the Most Conserved/Conserved Elements track.
For the downloads there will typically be a directory for multiz#way, phastCons and phyloP. Typically all the contents of these directories are pushed, unless there are special files that the developer does not want to go out. There is no directory for the Most Conserved/Conserved Elements track.
To see the list of assemblies actually used to build the tracks
cat /hive/data/genomes/$db/bed/multiz#way/species.lst
(Note that sometimes the file extension is .list; depends on the developer. Hiram tends to use .list)
Make a list of all organisms in the Conservation track
This will help you make sure that the correct name and date are associated with each assembly
human | Homo sapiens | Mar 2006 | hg18 |
chimpanzee | Pan troglodytes | Mar 2006 | panTro2 |
... | ... | ... | ... |
QAing the track
Don't rearrange the species list for the hgTracks order
If you look at some past RM tix for conservation tracks, you can see that in some cases, folks have re-arranged the species order (sGroup in trackDb.30way.ra, for example) list so that it matches the tree order. Cath talked to Hiram about this in Dec 2017, Hiram said it's better not to change the ordering, because he puts the species in phylogenic order, which is different than the order of the tree. So, there would be no need to change the ordering, such as in this example: http://redmine.soe.ucsc.edu/issues/20216#note-35
Examine and possibly push records for the extFile and seq tables
- if they are standard maf files, there will be no entries in the seq table, but you should have entries in the extFile table.
- you might have "upsteam" files in your file push list (e.g., upstream5000.knownGene.maf.gz), but these won't be/should not be in the extFile table.
- There may be more than one set of entries in the extFile table. Make sure you only push the set that pertains to the actual files you are pushing to hgnfs1 (e.g. /gbdb/ponAbe2/multiz8way/anno/maf/*)
- These are the ones that will need pushing to beta:
mysql> select path from extFile where path like "%anno/maf%";
Do not push extFile and seq tables, you must use this script to copy the rows from dev to beta:
copyExtSeqRows.csh
- If this is a new assembly and this is the first Conservation track, the extFile table probably doesn't exist on beta and likely only has a single entry on dev. If this is the case, it is ok to just push the entire table from dev to beta.
- Include extFile table in push request when you are ready to push to the RR.
For example, for RM#20216, Cath did:
1. Count your files on dev.
db=hg38 #Makes copy-pasting easier. You will still have to change '30way' to your number. hgsql -Ne 'select count(*) from extFile where path like "%30way%";' $db +-----+ | 358 | +-----+
2. Obviously, there won't be such files on beta yet, we haven't pushed them over:
hgsql -h mysqlbeta -Ne 'select count(*) from extFile where path like "%30way%";' $db +----------+ | 0 | +----------+
3. Get a list of your files from dev.
hgsql -Ne 'select path from extFile where path like "%30way%";' $db > extFileListPush
4. Do a test run
copyExtSeqRows.csh $db extFileListPush new setup
5. Copy the actual records from dev to beta (e.g., this is for hg38.extFile table)
copyExtSeqRows.csh $db extFileListPush new real
6. Check beta, should be the same count as dev.
hgsql -h mysqlbeta -Ne 'select count(*) from extFile where path like "%30way%";' $db +-----+ | 358 | +-----+
7. Do some spot checking to compare the dev table to the beta table
hgsql -h mysqlbeta -Ne 'select * from extFile where path like "%30way%" limit 3;' $db +------+--------------------------+-----------------------------------------------------+-------------+ | 3807 | chr1.maf | /gbdb/hg38/multiz30way/maf/chr1.maf | 13292820287 | | 3808 | chr10.maf | /gbdb/hg38/multiz30way/maf/chr10.maf | 7690007924 | | 3809 | chr10_GL383545v1_alt.maf | /gbdb/hg38/multiz30way/maf/chr10_GL383545v1_alt.maf | 3237645 | +------+--------------------------+-----------------------------------------------------+-------------+ hgsql -Ne 'select * from extFile where path like "%30way%" limit 3;' $db +------+--------------------------+-----------------------------------------------------+-------------+ | 3807 | chr1.maf | /gbdb/hg38/multiz30way/maf/chr1.maf | 13292820287 | | 3808 | chr10.maf | /gbdb/hg38/multiz30way/maf/chr10.maf | 7690007924 | | 3809 | chr10_GL383545v1_alt.maf | /gbdb/hg38/multiz30way/maf/chr10_GL383545v1_alt.maf | 3237645 | +------+--------------------------+-----------------------------------------------------+-------------+
Check trackDb.ra file for default codon species
For the multizway track make sure there is a speciesCodonDefault entry (usually is this species). This is the species that is used as the default for codon translation. This entry determines which species are turned off by default and, therefore, which ones are on by default.
e.g.,
cat ~/kent/src/hg/makeDb/trackDb/human/hg38/trackDb.30way.ra | grep speciesCodonDefault
Test the tables
Check multizway#wayFrames table
- check that each species in the description page has entries in the Frames table (select distinct(src) from multiz8wayFrames), including entries for the species that this track is for. If there are species missing, look in the makedoc (/kent/src/hg/makeDb/doc/$db) to see if there was a reason they were excluded (e.g. no good gene set). If there is nothing in the makedoc, ask the developer.
For example:
hgsql -Ne 'select distinct src from multiz30wayFrames;' hg38
Find out how phastCons was run
The --not-informative flag is used when subsets of a multiple alignment are used to calculate phastCons (i.e. when this track has subgroups/subtrees). If this is the case there will be a *.non-inf file that contains the list of orgs not used in the calculations for that subgroup. An example of a *.non-inf file can be found here:
/cluster/data/ponAbe2/bed/multiz8way/cons/primates/primates.non-inf
it includes: monDom4,ornAna1
These species are not primates and so are not informative when doing the primate-only phastCons. Check in the make doc to see if phastCons was run with the --not-informative flag. If it was, check for a *.non-inf file and see if the species given in that file belong there (i.e. for a 'primates' subgroup, there shouldn't be any primates specified in the *.non-inf file)
e.g.,
cat ~/kent/src/hg/makeDb/doc/hg38/multiz30way.txt | grep "non-inf"
Make sure hgPal works with this set of tables
- the hgPal CGI should work with the multiz table for this Conservation track. To do this go to a RefSeq track, click on a gene and then click on the "CDS FASTA" link on the trackUi page. This should take you to a page that links to the MAF file. This can also be reached by clicking on the "Protein FASTA" link for a UCSC Gene, and by selecting "cds FASTA" output for either track (refseq/ucsc genes) in the Table Browser. For both, make sure that all the buttons work.
Read through the description page
- Conservation track:
- Check image that displays on conservation details page.
- Check "Gene tracks used for codon translation" table against make doc.
- Make sure organisms are listed (in all places, including the gif & track controls (Multiz Alignments Configuration box)) in the correct phylogenetic order. If the check boxes in the track control are not in order, you can reorder them in trackDb.ra next to the sGroup_### line under "track multiz##way".
- Make sure that this page includes all the extra sections (if the multizs have been annotated). Sections should include: Gap Annotation, Genomic Breaks, Base Level, and a chart for protein/codon translation
- Make sure there is a tree model available.
Test in the Genome Browser
- Zoom out past 1M bps (this tests the multiz*waySummary table)
- Check out codon translation for a few species.
- make sure that Conserved Elements (a.k.a. Most Conserved) track items have names like "lod=22", not like "chr3.172".
Check downloads
Read all README files in hgdownload directories
Make sure that there are no spelling or grammatical errors. Also check that the information, including the help links, are correct and relevant.
Check one maf file
Verify that it has all the species involved in the track, and that at least one entry in the file is formatted according correctly (more information here: http://genome.ucsc.edu/FAQ/FAQformat.html#format5).
hgwdev: zcat chr#.maf.gz | head
or if you have many species (like a 20way):
$ cat /hive/users/chmalee/tracks/rn6/20way/species.list.txt | xargs -I{} zgrep -m 1 -w {} rn6.20way.maf.gz
Modify downloads.html
Modify the Multiple Alignments section for this assembly in downloads.html. If this is the first Conservation track for this assembly, you'll need to create a new Multiple Alignments section underneath the Pairwise Alignments section.
Things to not do
joinerCheck, searching
Precaution when pushing to the RR
With the release of the 100way the size of the files and tables has become so large that the push takes a very long time. This is problematic as the *RR will be broken* unless the extFile table entry matches whichever files are pushed out so this needs to be timed so that the extFile table push and the gbdb files that match it arrive at the same time.
A trick to facilitate this - taken from the UCSC genes staging process - is as follows: Copy the tables from hgwbeta into a temporary database on the RR servers. When it's time for the switch, just do a unix mv into the real database and push the extFile table at the same time.
Note: the hg38.cons30way #RM20216 was pushed out all at once with no issues. As of 2017, much of our hardware has been updated, and it's possible the note above no longer applies. That said, be on the safe side and check with admins before doing a push larger than the 30way.
Final Steps
- Add the "new" label pennantIcon for any new track
- Do the usual post-RR checks, check Euro and Asia, etc.
Example push requests
- Push gbdb files
- Push conservation tables, extFile table, and trackDb and friends
- Push the tree.png image file
- Push download files - note, push directory/* instead of listing files individually.
- Push downloads.html
Things we used to do to QA the conservation track but no longer do
Check annotated maf files for overlapping blocks
(maf files are generated automatically now, so we no longer need to do this) [hgwdev:/gbdb/ponAbe2/multiz8way/anno/maf>
foreach f (*.maf) echo -n "${f}: " mafFilter -overlap -minRow=1 $f > /dev/null end
If there are 'rejected blocks', contact the developer.
Check upstream files to make sure that the species name doesn't appear in an "s" line
(the upstream maf files are generated automatically now, so we no longer need to do this)
[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | grep "s ponAbe2" | wc -l 0
If this is not zero, contact the developer.
Check upstream files to make sure gene names haven't been truncated (to 9 chars)
(the upstream maf files are generated automatically now, so we no longer need to do this)
[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | head
##maf version=1 scoring=zero a score=0.000000 s CG13384-RD_up_1000_chr2L_8383467_f 0 1000 + 1000
If the gene names are short (9 characters) contact the developer.
To generate a list of gene names and pipe to a file called "test": zcat upstream1000.maf.gz | grep "NM_.*" | cut -f2 -d" " > test
To see how many file names are 10 or more characters long: grep "[0-9]\{10,\}" test
Check upstream files to make sure sequence hasn't been reverse-complemented incorrectly
(the upstream maf files are generated automatically now, so we no longer need to do this)
Since reverse-complement is a relative thing, the MAF sequence is supposed to be in the direction of transcription. That is, for a negative strand gene, its reversed-complement of the genome sequence. So it is supposed to be r-c of the genome, and not r-c of the direction of transcription.
From the MAF file documentation:
strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.
Search in the MySQL database for a gene on the minus strand. Then find that gene in the upstream1000.maf.gz file then check for correct r-c.
e.g.
mysql> select name, chrom, strand from ensGene where strand = "-" limit 1\G
name: ENSORLT00000000020
chrom: chr1
strand: -
[hgwdev:~/goldenPath/oryLat2/multiz5way> zcat ensGene.upstream1000.maf.gz | grep "ENSORLT00000000020"
s ENSORLT00000000020 0 1000 + 1000 gacactgaaggacgtGGACGTTATTTACCAACATCAAAGCACACAAATATAtggcacagaaac [ -clip - ]
Check this sequence with the sequence just upstream from this gene in the browser.
Test in the Genome Browser
- Zoom back in to <50,000 bp and find example areas of all annotation types (check against the maf file for that location):
- pale yellow bar
- green square brackets
- vertical blue bar
- gaps
- Note that it is much easier to see the yellow, green and blue areas when Multiz Alignments is set to "pack" in the track controls.
Check the Most Conserved description page (this no longer exists)
- Most Conserved track:
- Make sure the text refers to the correct species.
Make a list of all organisms for which there are nets & chains and put them in phylogenetic order
This will provide you with a reference for the phylogenetic order of species. Wikipedia is a good place to look for species relationships to confirm that our tree is correct (e.g., euarchontoglires, Laurasiatheria, primates, carnivores). NCBI's Taxonomy Browser and Wellcome's Interactive Tree of Life can also be helpful.
- hg18
- panTro2
- ...
We no longer build a phylogenetic tree to "confirm that our tree is correct" because, in short, our tree is better. To quote Hiram:
We only start with NCBI taxonomy guidelines. Since that start we have received numerous inputs on how this tree is constructed from experts in the various clades. There are no trees that are accurate, since the truth isn't actually known, there are only approximations. Our tree is one approximation, it will be improved in the future. We most likely have enough data to make our own calculation of what the tree might better be, but we don't do taxonomy estimations even though they would be fun. This 100-way tree is only a subset of the more extensive 191-way tree that is in the source tree in kent/src/hg/utils/phyloTrees/*.nh This is an evolving project.