Known genes III
From genomewiki
Goals and History
The UCSC Known Genes track tries to gather together information from many sources into a nonredundant unified view of all genes for which there is solid evidence. The Known Genes track has been through a number of iterations:
- Known Genes 0 - Mapping of RefSeq mRNAs to the genome
- Known Genes 1 - UniProt driven. Find the RNA corresponding to each protein. Map that. Add in DNA-based RefSeqs
- Known Genes 2 - Similar to Known Genes 1, but weeding out mappings that yield bad proteins because of insertions, deletions, truncations, etc. See also the online Known Genes methods description.
With Known Genes 3 we want to restore many of the mappings thrown out in Known Genes 2, fixing them when possible, and marking them as uncertain where a fix is not possible. We also want to design a process that will include noncoding genes, such as Xist and Har1, in the known genes set.
Goals
- Broader coverage than KG II
- include some of the genes thrown out by KG II
- non-"NM_..." RefSeqs?
- should NP_... proteins from NCBI be included as initial candidates?
- include non-coding genes
- not sure about this, downstream software changes may be extensive (Fan)
- other sources besides Ensembl non-coding genes?
- More accurate gene model than KG II
- would gene-check or something similar still be used?
- and would the error flags it raises be kept and made available to users?
- 3 or more quality classes
- stable UCSC KG gene ID (?)
- nightly updates (?)
- probably KG IV?
Plans
Here is a possible process for building Known Genes 3, taken from our grant application.
- Align all the RNAs in GenBank against the genome with BLAT and high-stringency filters. ESTs will not be included in this starting set. Certain mRNA libraries may be excluded as well.
- Most of this will fall pretty easily out of the GenBank automated builds. I think all that's required here is a little program that takes as input a list of bad libraries (which we'll hand curate), and a list of bad accessions (also hand curated) and creates as output a list of bad accessions. -jk
- Cluster the alignments that overlap.
- For this we could use clusterRna. Alternatively we might use something based on ExonWalk for this and the next few steps. -jk
- For each exon in the alignment, come to a consensus on the exon boundaries based on all of the RNA alignments. This consensus will allow for alternative 5’ and 3’ ends of the exons if there are clean alignments with good splice sites.
- Definitely it's worth spending a day or two reading ExonWalk here, since it does pretty much exactly this -jk
- Pick a representative RNA for each splicing variant. When there is a choice of representatives, pick the one that is longest and most similar to the genome.
- I'd be inclined to fold this step into the exon-walker-like program too, since the pick criteria are likely to be pretty straightforward, and presumably the exon-walker already has most of the information needed to make the pick loaded. -jk
- For any base in an RNA that differs from the aligned DNA base in the reference genome, determine if that RNA base is more likely to be (a) a common allelic variant, (b) a post-transcriptional modification or (c) a rare variant or artifact in the RNA sequence or its alignment to the genome. This determination is made by examining all cDNA alignments (including ESTs) to this gene and to very similar paralogs, consulting dbSNP and other special information sources, and determining common haplotypes for the region in question. In case (c), either fix the alignment or replace the questionable RNA base with the corresponding base from the most similar common haplotype, preferring the reference genome if its haplotype is not an extremely rare one for this region and there is not extremely strong evidence of an error in the reference genome (that would trigger a different action). A record of the original base value and reason for the correction is kept.
- Hmm, it's worth considering doing this as a separate program from the walker, just because it involves database access and inputs that the walker doesn't already have. -jk
- Add genes from RefSeq and perhaps other trusted sources that are known purely at the DNA level.
- Perhaps these actually should be folded into the process sooner, generating fake .psl files for them? This way if there ended up being RNA records as well as DNA records, they could be properly merged. -jk
- Map UniProt proteins to the corrected RNAs to determine the coding regions, if any. Use bestOrf, pseudogene and evolutionary analysis on the RNAs to determine additional protein coding genes and find suspected errors in UniProt. Report any suspected errors to SwissProt staff.
- We could use BLAT for the protein/RNA mapping, I think. It's probably better to do it here rather than blatting proteins against the genome and looking for overlaps between the mRNA and protein alignments, since blatting protein vs. mRNA is an easier job. I'm not sure what the best tools are for the pseudogene and evolutionary analysis. -jk
- Separate the resulting gene models into gold/silver/bronze sets as discussed above.
- Here the gold set is basically CCDS. The silver set is the ones that look good according to some criteria I'd like to leave in Mark's hands: basically a nice full-length protein with no weird splicing. The bronze set will be everything else, including the noncoding genes. -jk
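
As a rough illustration of three of the steps above (building the bad-accession list, clustering overlapping alignments, and sorting gene models into gold/silver/bronze), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the Alignment record is a simplified stand-in for a PSL row, the function names are hypothetical, the accession and library names are made up, and clustering is done on whole alignment spans on the same chromosome and strand rather than on exons as clusterRna or an ExonWalk-based tool might do. The silver criteria are reduced to two boolean flags, since the real criteria are left open in the plan.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Alignment:
    """Simplified stand-in for one RNA-to-genome alignment (hypothetical fields)."""
    accession: str
    chrom: str
    strand: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def bad_accessions(acc_to_library: Dict[str, str],
                   bad_libraries: Set[str],
                   curated_bad: Iterable[str]) -> Set[str]:
    """Combine hand-curated bad accessions with all accessions from bad libraries."""
    from_libraries = {acc for acc, lib in acc_to_library.items()
                      if lib in bad_libraries}
    return from_libraries | set(curated_bad)


def cluster_overlapping(alignments: Iterable[Alignment]) -> List[List[Alignment]]:
    """Group alignments on the same chrom and strand whose spans overlap."""
    clusters: List[List[Alignment]] = []
    last_key = None
    cluster_end = -1
    ordered = sorted(alignments,
                     key=lambda a: (a.chrom, a.strand, a.start, a.end))
    for aln in ordered:
        key = (aln.chrom, aln.strand)
        if key != last_key or aln.start >= cluster_end:
            # New chrom/strand, or a gap after the current cluster: start a new one.
            clusters.append([aln])
            cluster_end = aln.end
            last_key = key
        else:
            # Overlaps the current cluster: extend it.
            clusters[-1].append(aln)
            cluster_end = max(cluster_end, aln.end)
    return clusters


def quality_class(in_ccds: bool,
                  full_length_protein: bool,
                  clean_splicing: bool) -> str:
    """Assign gold/silver/bronze; the two silver flags are placeholder criteria."""
    if in_ccds:
        return "gold"
    if full_length_protein and clean_splicing:
        return "silver"
    return "bronze"  # everything else, including noncoding genes


if __name__ == "__main__":
    # Made-up accessions and libraries, for illustration only.
    acc_to_library = {"ACC1": "libA", "ACC2": "libB", "ACC3": "libB"}
    bad = bad_accessions(acc_to_library, {"libA"}, {"ACC3"})
    alignments = [
        Alignment("ACC1", "chr1", "+", 50, 120),
        Alignment("ACC2", "chr1", "+", 100, 200),
        Alignment("ACC4", "chr1", "+", 150, 300),
        Alignment("ACC5", "chr1", "+", 500, 600),
    ]
    kept = [a for a in alignments if a.accession not in bad]
    for cluster in cluster_overlapping(kept):
        print([a.accession for a in cluster])
    print(quality_class(in_ccds=False, full_length_protein=True,
                        clean_splicing=True))
```

Keeping the bad-accession filter as its own small function mirrors the suggestion that it be a little standalone program, while the clustering and classification pieces could just as well live inside an exon-walker-like tool.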