CpG Islands


The CpG Islands track, cpgIslandExt, shows islands found by a program originally written by Gos Micklem. We got the program from WUSTL, where LaDeana Hillier had made some edits to it. Angie then edited it again to ensure that the ratio of observed to expected CpGs is calculated as stated in

Gardiner-Garden and Frommer, J. Mol. Biol. (1987) 196 (2), 261-282:
  observed/expected CpG = (number of CpG * length) / (number of C * number of G)
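
As a quick illustration of that ratio, here is a minimal Python sketch. The function name cpg_obs_exp is made up for this page, and returning 0.0 when the sequence has no C or no G is an assumption; this is not code from either program.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio of a sequence, per the formula above."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        # Assumption: report 0.0 when the ratio is undefined.
        return 0.0
    return n_cpg * len(seq) / (n_c * n_g)


if __name__ == "__main__":
    print(cpg_obs_exp("ACGT"))
    print(cpg_obs_exp("CGCG"))
```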

The "(AL)" track shows islands found by a program written by Andy Law of the Roslin Institute (again with small corrections by Angie). It is relatively new: developed in 2004, initially pushed only for chicken, but starting to appear as an alternative in other genomes, especially on hgwdev.

Andy's program performs a sliding window search on the locations of CG's in the genome (as opposed to a sliding window search over all bases). It simply finds all stretches of sequence that meet the parameters that both programs claim to use:

  length >= 200bp, %GC >= 50%, observed/expected CpG >= 0.6.
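
The three parameters can be written as a single check on the counts for a candidate stretch. This is only a sketch: the constant and function names are hypothetical, and treating a stretch with no C or no G as failing is an assumption.

```python
# Thresholds as stated on this page.
MIN_LENGTH = 200        # bp
MIN_GC_FRACTION = 0.5   # %GC >= 50%
MIN_OBS_EXP = 0.6       # observed/expected CpG


def meets_island_criteria(length: int, n_c: int, n_g: int, n_cpg: int) -> bool:
    """True if a stretch with these counts satisfies all three stated params."""
    if length < MIN_LENGTH or n_c == 0 or n_g == 0:
        # Assumption: O/E is undefined without both C and G, so reject.
        return False
    gc_fraction = (n_c + n_g) / length
    obs_exp = n_cpg * length / (n_c * n_g)
    return gc_fraction >= MIN_GC_FRACTION and obs_exp >= MIN_OBS_EXP
```

For example, a 200 bp stretch with 60 C's, 60 G's and 15 CpGs has 60% GC and O/E of about 0.83, so it qualifies.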

I don't know of a publication describing Andy's script but here is what it does:

  1. The location of each 'CG' in the genome sequence is identified, along with the number of nucleotides since the previous 'CG' and the numbers of C's and G's. The total number of nucleotides lets us compute the length of a putative island, and the C and G counts let us compute the GC% and O/E scores.
  2. An array of successive CpGs that might belong to an island is built up according to these rules:
  • We start with the first CpG in the genome and add subsequent CpGs to the array one by one (sliding the right edge forward).
  • Each time we add a CpG on the right:
    • If length is < 200, we keep going (add another to the right).
    • Otherwise the length criterion is met and we evaluate %GC and O/E.
    • If all three criteria are met, we make a note that our putative island looks good and keep going (try to build a bigger island).
    • If all three criteria were met before we added this CpG, but now %GC and O/E are no longer met, we use this CpG as the start of a new putative island and print out the island from the previous iteration.
    • If %GC and O/E weren't met before and still aren't met, we remove the first CpGs from the array (sliding the left edge forward) until the length is < 200.
  • If we reach the end of the sequence and our putative island meets the criteria, print it out.
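
The steps above can be sketched in Python as follows. This is a reconstruction from the description on this page, not Andy's actual script: the function name, the use of prefix sums for the C and G counts, and the 0-based half-open (start, end) coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

MIN_LENGTH, MIN_GC, MIN_OE = 200, 0.5, 0.6


def find_islands(seq: str) -> List[Tuple[int, int]]:
    """Return (start, end) islands; each starts at a CG's C and ends after a CG's G."""
    seq = seq.upper()
    # Prefix sums give C and G counts for any interval in constant time.
    c_pref, g_pref = [0], [0]
    for base in seq:
        c_pref.append(c_pref[-1] + (base == "C"))
        g_pref.append(g_pref[-1] + (base == "G"))
    cpgs = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

    def span(left: int, right: int) -> Tuple[int, int]:
        return cpgs[left], cpgs[right] + 2

    def passes(left: int, right: int) -> bool:
        start, end = span(left, right)
        length = end - start
        if length < MIN_LENGTH:
            return False
        # The window begins with C and ends with G, so both counts are >= 1.
        n_c = c_pref[end] - c_pref[start]
        n_g = g_pref[end] - g_pref[start]
        n_cpg = right - left + 1
        gc = (n_c + n_g) / length
        oe = n_cpg * length / (n_c * n_g)
        return gc >= MIN_GC and oe >= MIN_OE

    islands = []
    left = 0
    good_right = None  # right edge of the window that passed on the previous step
    for right in range(len(cpgs)):
        if passes(left, right):
            good_right = right  # looks good; try to build a bigger island
        elif good_right is not None:
            # Passed before this CpG was added: report it, restart at this CpG.
            islands.append(span(left, good_right))
            left, good_right = right, None
        else:
            # Never passed: slide the left edge forward until length < 200.
            while left < right and cpgs[right] + 2 - cpgs[left] >= MIN_LENGTH:
                left += 1
    if good_right is not None:
        islands.append(span(left, good_right))
    return islands
```

For instance, a sequence made of 150 CG repeats flanked by 500 A's on each side yields one island covering exactly the 300 bp CG block.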

So it's a window search where we slide the right and left edges to the right, evaluating criteria at each step -- but instead of sliding one genomic base at a time, we slide from one CpG to the next. (It would be suboptimal to evaluate a CpG island with its boundaries between CpGs.)


Gos's program is pickier about which islands it reports, for two reasons:

  1. it imposes an additional constraint, that a certain running score *must* remain above 0 for the entire length of an island, and
  2. it also chops up islands at their max-running-score point and evaluates the two halves separately.

The running score is computed as follows: it starts at 0; every time a CG is encountered, it's incremented by 17; at every other base, it's decremented by 1, but never allowed to fall below 0. The running score is used to identify stretches of sequence to evaluate according to length, %GC and O/E, but the running score constraint itself precludes a lot of stretches that would qualify by those 3 stated params.
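
Here is a small Python sketch of that running score. The description does not say exactly which base of a CG receives the +17, so this sketch assumes the bonus is applied at the G that completes a CG and that every other base (including the C) costs 1. The function name and that convention are assumptions, not taken from Gos's source.

```python
from typing import List


def running_scores(seq: str, cpg_bonus: int = 17, penalty: int = 1) -> List[int]:
    """Running score after each base: +17 per CG, -1 otherwise, floored at 0."""
    scores = []
    score = 0
    prev = ""
    for base in seq.upper():
        if prev == "C" and base == "G":
            # Assumption: the bonus lands on the G that completes the CG.
            score += cpg_bonus
        else:
            score = max(0, score - penalty)
        scores.append(score)
        prev = base
    return scores


if __name__ == "__main__":
    print(running_scores("ACGTT"))  # [0, 0, 17, 16, 15]
```

Under this convention a lone CG is used up after 17 further bases without another CG, which is why sparse stretches that still satisfy length, %GC and O/E can be excluded by the running score constraint.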

The following is mostly Angie's opinion...

We have used Gos's program to generate the CpG Islands track since ~2002; before that, WUSTL folks used the program to generate islands, which we loaded. Then Andy wrote his program as part of the chicken analysis project (2004), and it found about 3 times as many islands as Gos's. I found that kind of alarming, so I dug into the source code to find out how the two programs differed. I told Jim about the discrepancy in the number of islands found, and expected that he would want to show the track that identifies all stretches that meet the stated params. However, Jim still finds Gos's track more pleasing because its number of islands is closer to the number of genes and it gets a better enrichment score for upstream regions of known genes -- i.e. Gos's picky islands are more likely to intersect with promoters than Andy's comprehensive islands.

Terry Furey got some interesting results for an even simpler method of identifying CpG-rich regions during the 2005 ENCODE analysis fair, but I don't think that should go on a public wiki page before publication of the ENCODE analysis papers, so ask Terry or me if you're curious. We may eventually want to offer 3 versions of CpG islands at different points on the sensitivity/specificity curve.
