Conservation Track QA: Difference between revisions

From genomewiki
Jump to navigationJump to search
Line 149: Line 149:
=== Contact Galaxy: ===
=== Contact Galaxy: ===
* send an email to Anton at Galaxy to let him know that a new conservation track is about to be released.  They can decide if they want to pre-load the data at their site.
* send an email to Anton at Galaxy to let him know that a new conservation track is about to be released.  They can decide if they want to pre-load the data at their site.
==Things we used to do to QA the conservation track but no longer do==
==== Check annotated maf files for overlapping blocks ====
(maf files are generated automatically now, so we no longer need to do this)
[hgwdev:/gbdb/ponAbe2/multiz8way/anno/maf>
foreach f (*.maf)
  echo -n "${f}: "
  mafFilter -overlap -minRow=1 $f > /dev/null
end
''If there are 'rejected blocks', contact the developer.''
==== Check upstream files to make sure that the species name doesn't appear in an "s" line: ====
(the upstream maf files are generated automatically now, so we no longer need to do this)
[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | grep "s ponAbe2" | wc -l
0
''If this is not zero, contact the developer.''
==== Check upstream files to make sure gene names haven't been truncated (to 9 chars): ====
(the upstream maf files are generated automatically now, so we no longer need to do this)
[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | head
##maf version=1 scoring=zero
a score=0.000000
s CG13384-RD_up_1000_chr2L_8383467_f 0 1000 + 1000
''If the gene names are short (9 characters) contact the developer.''
To generate a list of gene names and pipe to a file called "test":
zcat upstream1000.maf.gz | grep "NM_.*" | cut -f2 -d" " > test
To see how many file names are 10 or more characters long:
grep "[0-9]\{10,\}" test
==== Check upstream files to make sure sequence hasn't been reverse-complemented incorrectly: ====
(the upstream maf files are generated automatically now, so we no longer need to do this)
Since reverse-complement is a relative thing, the MAF sequence is supposed to be in the direction of transcription.  That is, for a negative strand gene, its reversed-complement of the
genome sequence.  So it is supposed to be r-c of the genome, and not r-c of the direction of transcription.
From the MAF file documentation:
strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.
Search in the MySQL database for a gene on the minus strand.  Then find that gene in the upstream1000.maf.gz file then check for correct r-c.
e.g.
mysql> select name, chrom, strand from ensGene where strand = "-" limit 1\G<BR>
name: ENSORLT00000000020<BR>
chrom: chr1<BR>
strand: -<BR>
[hgwdev:~/goldenPath/oryLat2/multiz5way> zcat ensGene.upstream1000.maf.gz | grep "ENSORLT00000000020"<BR>
s ENSORLT00000000020 0 1000 + 1000 gacactgaaggacgtGGACGTTATTTACCAACATCAAAGCACACAAATATAtggcacagaaac [ -clip - ]
Check this sequence with the sequence just upstream from this gene in the browser.


[[Category:Browser QA tracks]]
[[Category:Browser QA tracks]]

Revision as of 00:08, 25 February 2011

Preparing to QA

The Conservation track will consist of one or more of the following subtracks: multiz#way, phyloP, phastCons and Most Conserved/Conserved Elements. Sometimes if there is only a multiz#way and a Most Conserved/Conserved Elements tracks, they will be in separate tracks. Otherwise, they should all be together

Make sure that pushQ entry has all gbdb files and downloads needed for the track

For gbdb, there should be one .maf file for the multiz#way track and a .wib file for both phyloP and phastCons. There are no gbdb files for the Most Conserved/Conserved Elements track.

For the downloads there will typically be a directory for multiz#way, phastCons and phyloP. Typically all the contents of these directories are pushed, unless there are special files that the developer does not want to go out. There is no directory for the Most Conserved/Conserved Elements track.

To see the list of assemblies actually used to build the tracks

cat /hive/data/genomes/$db/bed/multiz#way/species.lst

Make a list of all organisms in the Conservation track

This will help you make sure that the correct name and date are associated with each assembly

human Homo sapiens Mar 2006 hg18
chimpanzee Pan troglodytes Mar 2006 panTro2
... ... ... ...

Make a list of all organisms for which there are nets & chains and put them in phylogenetic order

This will provide you with a reference for the phylogenetic order of species

  • hg18
  • panTro2
  • ...

Check the following in the files:

Check annotated maf files for overlapping blocks:

  • Note that upstream*.maf files do not need to be checked in this way.

[hgwdev:/gbdb/ponAbe2/multiz8way/anno/maf>

foreach f (*.maf)
  echo -n "${f}: "
  mafFilter -overlap -minRow=1 $f > /dev/null
end

If there are 'rejected blocks', contact the developer.


Read both README files:

/goldenPath/ponAbe2/phastCons8way/README.txt
/goldenPath/ponAbe2/multiz8way/README.txt

Check upstream files to make sure that the species name doesn't appear in an "s" line:

(because the upstream maf files are generated automatically now, we no longer need to do this)
[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | grep "s ponAbe2" | wc -l
0
If this is not zero, contact the developer.

Check upstream files to make sure gene names haven't been truncated (to 9 chars):

(because the upstream maf files are generated automatically now, we no longer need to do this)

[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | head

##maf version=1 scoring=zero
a score=0.000000
s CG13384-RD_up_1000_chr2L_8383467_f 0 1000 + 1000

If the gene names are short (9 characters) contact the developer.

To generate a list of gene names and pipe to a file called "test": zcat upstream1000.maf.gz | grep "NM_.*" | cut -f2 -d" " > test

To see how many file names are 10 or more characters long: grep "[0-9]\{10,\}" test

Check upstream files to make sure sequence hasn't been reverse-complemented incorrectly:

(because the upstream maf files are generated automatically now, we no longer need to do this)

Since reverse-complement is a relative thing, the MAF sequence is supposed to be in the direction of transcription. That is, for a negative strand gene, its reversed-complement of the genome sequence. So it is supposed to be r-c of the genome, and not r-c of the direction of transcription.

From the MAF file documentation:

strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.

Search in the MySQL database for a gene on the minus strand. Then find that gene in the upstream1000.maf.gz file then check for correct r-c.

e.g. mysql> select name, chrom, strand from ensGene where strand = "-" limit 1\G
name: ENSORLT00000000020
chrom: chr1
strand: -

[hgwdev:~/goldenPath/oryLat2/multiz5way> zcat ensGene.upstream1000.maf.gz | grep "ENSORLT00000000020"
s ENSORLT00000000020 0 1000 + 1000 gacactgaaggacgtGGACGTTATTTACCAACATCAAAGCACACAAATATAtggcacagaaac [ -clip - ]

Check this sequence with the sequence just upstream from this gene in the browser.

Check one maf file:

[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat chrX.maf.gz | head

##maf version=1 scoring=autoMZ.v1
a score=21236.000000
s ponAbe2.chrX     2 249 + 156195299 cagtggcatgatcacagatgactgcagcctcggcctccatagc

Read through both description pages:

  • Conservation track:
    • Check image that displays on conservation details page.
    • Check "Gene tracks used for codon translation" table against make doc.
    • Make sure organisms are listed (in all places) in the correct phylogenetic order.
    • Make sure that this page includes all the extra sections (if the multizs have been annotated). Sections should include: Gap Annotation, Genomic Breaks, Base Level, and a chart for protein/codon translation
    • Make sure there is a tree model available.
  • Most Conserved track:
    • Make sure the text refers to the correct species.

Check trackDb.ra file:

  • Conservation track:
    • Make sure there is a speciesCodonDefault entry (usually is this species).
    • Make sure Jim has signed off on the species listed in the speciesDefaultOff entry.
  • Most Conserved track:

Check multizway#wayFrames table:

  • check that each species in the description page has entries in the Frames table (select distinct(src) from multiz8wayFrames), including entries for the species that this track is for. If there are species missing, look in the makedoc to see if there was a reason they were excluded (e.g. no good gene set). If there is nothing in the makedoc, ask the developer.

Figure out extFile and seq tables:

  • if they are standard maf files, there will be no entries in the seq table.
  • There may be more than one set of entries in the extFile table. Make sure you only push the set that pertains to the actual files you are pushing to hgnfs1 (e.g. /gbdb/ponAbe2/multiz8way/anno/maf/*)
  • These are the ones that will need pushing to beta:

mysql> select path from extFile where path like "%anno/maf%";

You can use this script to copy the rows from dev to beta:

copyExtSeqRows.csh

Test in the Genome Browser:

  • Zoom out past 1M bps (this tests the multiz*waySummary table)
  • Zoom back in to <50,000 bp and find example areas of all annotation types (check against the maf file for that location):
    • pale yellow bar
    • green square brackets
    • vertical blue bar
    • gaps
  • Check out codon translation for a few species.
  • make sure that Conserved Elements (a.k.a. Most Conserved) track items have names like "lod=22", not like "chr3.172".


Test in the tables:

  • joinerCheck
  • featureBits

[hgwdev:~/qa/tracks/conservation/ponAbe2> nice featureBits ponAbe2 multiz8way gap -bed=output.bed 162920397 bases of 3093572278 (5.266%) in intersection

  • countPerChrom.csh ponAbe2 multiz8way
  • find out how phastCons was run (from make doc). See if the species listed in the non-inf list make sense. In this case, they do not add to the phastCons wiggle. --not-informative


Make sure hgPal works with this set of tables:

  • the hgPal CGI should work with the multiz table for this Conservation track. To do this go to a RefSeq track, click on a gene and then click on the "CDS FASTA" link on the trackUi page. This should take you to a page that links to the MAF file. This can also be reached by clicking on the "Protein FASTA" link for a UCSC Gene. For both, make sure that all the buttons work.

Contact Galaxy:

  • send an email to Anton at Galaxy to let him know that a new conservation track is about to be released. They can decide if they want to pre-load the data at their site.

Things we used to do to QA the conservation track but no longer do

Check annotated maf files for overlapping blocks

(maf files are generated automatically now, so we no longer need to do this) [hgwdev:/gbdb/ponAbe2/multiz8way/anno/maf>

foreach f (*.maf)
  echo -n "${f}: "
  mafFilter -overlap -minRow=1 $f > /dev/null
end

If there are 'rejected blocks', contact the developer.

Check upstream files to make sure that the species name doesn't appear in an "s" line:

(the upstream maf files are generated automatically now, so we no longer need to do this)

[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | grep "s ponAbe2" | wc -l
0

If this is not zero, contact the developer.

Check upstream files to make sure gene names haven't been truncated (to 9 chars):

(the upstream maf files are generated automatically now, so we no longer need to do this)

[hgwdev:~/goldenPath/ponAbe2/multiz8way/maf> zcat upstream*.maf.gz | head

##maf version=1 scoring=zero
a score=0.000000
s CG13384-RD_up_1000_chr2L_8383467_f 0 1000 + 1000

If the gene names are short (9 characters) contact the developer.

To generate a list of gene names and pipe to a file called "test": zcat upstream1000.maf.gz | grep "NM_.*" | cut -f2 -d" " > test

To see how many file names are 10 or more characters long: grep "[0-9]\{10,\}" test

Check upstream files to make sure sequence hasn't been reverse-complemented incorrectly:

(the upstream maf files are generated automatically now, so we no longer need to do this)

Since reverse-complement is a relative thing, the MAF sequence is supposed to be in the direction of transcription. That is, for a negative strand gene, its reversed-complement of the genome sequence. So it is supposed to be r-c of the genome, and not r-c of the direction of transcription.

From the MAF file documentation:

strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.

Search in the MySQL database for a gene on the minus strand. Then find that gene in the upstream1000.maf.gz file then check for correct r-c.

e.g. mysql> select name, chrom, strand from ensGene where strand = "-" limit 1\G
name: ENSORLT00000000020
chrom: chr1
strand: -

[hgwdev:~/goldenPath/oryLat2/multiz5way> zcat ensGene.upstream1000.maf.gz | grep "ENSORLT00000000020"
s ENSORLT00000000020 0 1000 + 1000 gacactgaaggacgtGGACGTTATTTACCAACATCAAAGCACACAAATATAtggcacagaaac [ -clip - ]

Check this sequence with the sequence just upstream from this gene in the browser.