PCR on cDNA
This is a sort of after-the-fact, in-progress design document. I would love to get input on any part of it. See any holes?
Overview
We will enhance hgPcr to offer not only genomic assembly sequence, but also cDNA sequences such as UCSC Genes and quarterly snapshots of GenBank native mRNAs, as targets for search.
hgPcr
When the necessary tables, files and gfServers are in place, hgPcr's front page gets a new select box labelled Target. The first and default choice is "genome assembly" -- no change to the current behavior. The subsequent choices have a brief shortLabel-like description, e.g. "UCSC Genes" or "Human mRNAs".
If the user selects one of the new targets, then the primer pair is passed to a gfServer running on cDNA sequences. The gfPcr result page for cDNA looks much like the genome assembly result page, except cDNA coordinates and sequence are displayed and the hgTracks links have position=accession instead of a chrom:start-end range. Unlike the genomic result page, if a match is to the opposite strand of a cDNA (assumed to be mRNA at this point), a message is printed out.
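The difference between the two kinds of result links can be sketched as follows. This is a minimal illustration, not hgPcr's actual code (which lives in the kent source tree): the `match` dictionary layout, the function names, and the wording of the strand message are all assumptions made for the example.

```python
from urllib.parse import urlencode


def result_link(db, match):
    """Build an hgTracks link for one PCR match (hypothetical helper).

    A cDNA match carries an "accession" key and links by accession;
    a genomic match links by a chrom:start-end range.
    """
    if "accession" in match:
        position = match["accession"]
    else:
        position = "%s:%d-%d" % (match["chrom"], match["start"], match["end"])
    return "hgTracks?" + urlencode({"db": db, "position": position})


def strand_note(match):
    """Return a note for an opposite-strand cDNA match, else None.

    The message text is a placeholder; the real wording is not specified here.
    """
    if "accession" in match and match.get("strand") == "-":
        return "Note: primers match the opposite strand of %s" % match["accession"]
    return None
```

For example, `result_link("hg18", {"accession": "NM_000518"})` yields a link with `position=NM_000518`, while a genomic match yields a URL-encoded `chrom:start-end` range.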
Infrastructure
A new central database table describes the targets. Each new target requires a new blat server, two new /gbdb files per genome db and three new tables per genome db.
Synchronization across {central db, blat servers, /gbdb, genome db tables} x {hgwdev, hgwbeta, RR} will be interesting -- more on that below.
Implementation
Central database
A new table targetDb describes each pair of target and genome db to which it has been aligned. Fields (as in kent/src/hg/lib/targetDb.{as,sql}):
field | description | example |
---|---|---|
name | Unique identifier of target | hg18KgApr08 |
description | Brief description for select box, like shortLabel | UCSC Genes |
db | Genome assembly database to which this target has been aligned | hg18 |
pslTable | PSL table in db that maps target coords to db coords | kgTargetAli |
seqTable | Table in db that has extFileTable indices of target sequences | kgTargetSeq |
extFileTable | Table in db that has .id, .path, and .size of target sequence files | kgTargetExtFile |
seqFile | Target sequence file path (typically /gbdb/db/targetDb/name.2bit) | /gbdb/hg18/targetDb/kgTargetSeq.2bit |
priority | Relative priority compared to other targets for same db (smaller numbers are higher priority) | 1.0 |
time | Time at which this record was updated -- should be newer than the db tables and seqFile (as should the blat server) | 2008-04-10 14:11:35 |
When the gfServer info for the target is added to blatServers, blatServers.db must be the same as targetDb.name (not targetDb.db!).
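To make the record layout and the blatServers rule concrete, here is a rough sketch of how a CGI might model a targetDb row and build the Target menu. The field names follow the table above; the `TargetDb` class, `menu_choices`, `blat_server_db`, the `"genome"` menu value, and the second example record (`hg18MrnaApr08` and its table names) are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TargetDb:
    """One row of the targetDb table (fields as in the table above)."""
    name: str
    description: str
    db: str
    pslTable: str
    seqTable: str
    extFileTable: str
    seqFile: str
    priority: float
    time: str


def menu_choices(db, targets):
    """Return (value, label) pairs for the Target select box of one genome db.

    "genome assembly" is always first; targets follow in priority order
    (smaller numbers first). The "genome" value is an assumed placeholder.
    """
    rows = sorted((t for t in targets if t.db == db), key=lambda t: t.priority)
    return [("genome", "genome assembly")] + [(t.name, t.description) for t in rows]


def blat_server_db(target):
    """blatServers.db must equal targetDb.name, not targetDb.db."""
    return target.name


kg = TargetDb("hg18KgApr08", "UCSC Genes", "hg18", "kgTargetAli", "kgTargetSeq",
              "kgTargetExtFile", "/gbdb/hg18/targetDb/kgTargetSeq.2bit",
              1.0, "2008-04-10 14:11:35")
# Hypothetical second record, invented for the example.
mrna = TargetDb("hg18MrnaApr08", "Human mRNAs", "hg18", "mrnaTargetAli",
                "mrnaTargetSeq", "mrnaTargetExtFile",
                "/gbdb/hg18/targetDb/mrnaTargetSeq.2bit",
                2.0, "2008-04-10 14:11:35")
```

With these two records, `menu_choices("hg18", [mrna, kg])` lists "genome assembly", then "UCSC Genes", then "Human mRNAs", and `blat_server_db(kg)` gives `hg18KgApr08` rather than `hg18`.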
pslTable, seqTable, and extFileTable are not necessary to describe the hgPcr enhancements described above -- they are for addition of a PCR results track. They could be used to enhance the hgPcr results page, too.
Come to think of it, would it be as efficient for a CGI to get a small sequence from the .2bit as from seqTable + extFileTable + /gbdb/.../*.fa? If so, I'll happily ditch those fields and db tables.
Blat servers
The number of new blat servers will depend on how many genome dbs and targets we want to support. It could be a lot. Fortunately a gfServer running on transcript sequences doesn't require much memory -- hg18 UCSC Genes and native mRNAs require about 50M and 100M respectively. However, if we add a lot of new servers, there will be a lot to keep track of, and I wonder if cluster-admin might be annoyed at all of the new start and stop requests.
/gbdb/ files
The location of these is flexible. Currently I'm using /gbdb/db/targetDb/.
In that directory are two files per target: name.2bit (for gfPcr searches) and name.fa (for seqTable).
It is possible that one target server and sequence file could be shared across several genome databases; for example, a human mRNA gfServer could be shared by hg* dbs. In that case, putting the sequence somewhere other than the /gbdb/db directories would make sense.
Genome database tables
pslTable maps the target sequence coordinates to genomic; seqTable and extFileTable enable easy CGI lookup of a target sequence accession. Again, these are not necessary for hgPcr now; they are for the addition of a PCR results track (and maybe some new stuff in the hgPcr target results page).
Potential for automation
All of this could be automated except for a few critical pieces: running the blatServers, QA, and release. :)
However, a script could certainly print out instructions (including template email to cluster-admin, template push request) and make sub-scripts for the automatable parts: creating the necessary tables/files and updating targetDb and blatServers in the central database.
Some automation is possible for QA as well. For example, a script could pick a sequence from the target .2bit, extract some primers and their coords, and show what the hgPcr result should be. The same approach would work for genomic PCR (with $db.2bit).
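The primer-extraction part of such a QA script might look roughly like this. It is a sketch under assumptions: the sequence is passed in as a plain string (reading it out of the .2bit is left to whatever tool is at hand), coordinates are 0-based half-open, and `make_qa_case` and its default primer length are made up for the example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T in either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def make_qa_case(name, seq, start, end, primer_len=20):
    """Derive a primer pair from seq[start:end] and the match it should produce.

    The forward primer is the first primer_len bases of the region; the
    reverse primer is the reverse complement of the last primer_len bases.
    """
    if end - start < 2 * primer_len:
        raise ValueError("region too short for two non-overlapping primers")
    amplicon = seq[start:end]
    return {
        "target": name,
        "start": start,
        "end": end,
        "forward": amplicon[:primer_len],
        "reverse": reverse_complement(amplicon[-primer_len:]),
        "product_size": end - start,
    }
```

Feeding the resulting primer pair to hgPcr should bring back a match on the same sequence at the same coordinates; any other result would be worth a closer look.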
Synchronization/Release
Synchronizing across the central db, blat servers, /gbdb and genome db tables will be a challenge, especially when rolling out changes from hgwdev/hgcentraltest to hgwbeta and the RR.
One safeguard in place is a timestamp check: targetDb.time must be newer than all of the target's genome db tables and /gbdb files, or that target will be ignored. That will prevent a table/file update from causing incorrect results from hgPcr, but it doesn't cover the blat server. (Could gfServer status be enhanced to give an uptime / start date? Maybe even the name and timestamp of the input file? :)
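In outline, the timestamp check amounts to the comparison below. This is only a sketch: how the table and file update times are actually collected is not shown, the function name is hypothetical, and treating an equal timestamp as "not newer" is an assumption.

```python
from datetime import datetime


def target_is_usable(target_time, dependency_times):
    """True only if targetDb.time is strictly newer than every dependency.

    dependency_times holds the update times of the target's genome db
    tables and /gbdb files; a single newer dependency disables the target.
    """
    return all(target_time > t for t in dependency_times)


record_time = datetime(2008, 4, 10, 14, 11, 35)
older_deps = [datetime(2008, 4, 9, 10, 0, 0), datetime(2008, 4, 10, 9, 30, 0)]
# Simulate a table that was reloaded after the targetDb record was written.
reloaded_deps = older_deps + [datetime(2008, 4, 11, 8, 0, 0)]
```

Here `target_is_usable(record_time, older_deps)` is true, while the reloaded table in `reloaded_deps` makes the target drop out of the menu until targetDb.time is refreshed.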
Rolling out an update of anything with both db tables and files, from hgwdev/hgcentraltest -> hgwbeta/hgcentralbeta + shared-with-RR /gbdb -> RR/hgcentral, is always more complicated than pushing out a brand new set. Here, even more moving parts are involved. So I think the only way that we can support updates is to use target names that include some kind of date (MmmYY like Apr08 should be enough). That allows us to add all components of a new target while leaving the old target in place, then switch over in targetDb once everything else is ready.
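A dated name in the style of hg18KgApr08 could be generated along these lines. The split into db, a short tag, and MmmYY is inferred from that one example, and the helper name is hypothetical; a fixed month list is used so the result does not depend on the locale.

```python
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def dated_target_name(db, tag, year, month):
    """Compose a target name as db + tag + MmmYY, e.g. hg18 + Kg + Apr08."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return "%s%s%s%02d" % (db, tag, MONTHS[month - 1], year % 100)
```

Because the old and new names differ (say hg18KgApr08 and hg18KgJul08), both sets of tables, files and blat servers can coexist until the targetDb rows are swapped.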
Prototype on genome-test
hg18 hgPcr now shows the Target menu with two non-genome choices, UCSC Genes and Human mRNAs. Currently the blat servers are running on kolossus instead of proper blat server machines.