Fetch proteins from gene track

From genomewiki
Revision as of 18:39, 10 October 2012 by Hiram (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

Example procedure to construct protein sequences from a UCSC gene track. And a couple of sanity checks on the predicted proteins.

These commands can be placed into a single bash shell script to perform the entire sequence. To parameterize the script, establish two variables:

#!/bin/sh
# set bash options to exit at any error
set -beEu -o pipefail
export geneTbl="knownGene"
export db="mm10"

1. Obtain the CDS exons from the gene table with featureBits:

featureBits -fa=${db}.${geneTbl}.exons.fa ${db} ${geneTbl}:cds

2. For convenience, change the fasta file into single lines for each exon sequence, tab separated the name from the sequence:

faToTab -keepAccSuffix ${db}.${geneTbl}.exons.fa ${db}.${geneTbl}.exons.tab

3. Processing for forward and reverse strands needs to be separated:

grep -P "_f\t" ${db}.${geneTbl}.exons.tab > ${db}.${geneTbl}.forward.exons.tab

4. The reverse strand exons come out of featureBits in reverse order, but are already properly reverse complemented, the tac command reverses that order to get them in correct sequence order:

grep -P "_r\t" ${db}.${geneTbl}.exons.tab | tac > ${db}.${geneTbl}.reverse.exons.tab

5. Can now return those tab separated single line files to fasta files and remove the extra strings from the identifier line:

awk '{printf ">%s\n%s\n", $1, $2}' ${db}.${geneTbl}.forward.exons.tab \
    | sed -e "s/\t.*//" > ${db}.${geneTbl}.forward.exons.fa
awk '{printf ">%s\n%s\n", $1, $2}' ${db}.${geneTbl}.reverse.exons.tab \
    | sed -e "s/\t.*//" > ${db}.${geneTbl}.reverse.exons.fa

6. Using this perl script to process the individual exons into single fasta records for the complete CDS. This works because the exons for a single gene are congruent in the file, if the next name encountered is the same, the CDS continues:

#!/usr/bin/env perl

use strict;
use warnings;

my $file = shift;
die "usage: ./exonsToCds.pl <file>\n" if (!defined $file);

my $previousId = "";

open (FH, "$file") or die "can not read $file";
while (my $line = <FH>) {
    chomp $line;
    if ($line =~ m/^>/) {
        my $id = $line;
        $id =~ s/^>//;
        $id =~ s/_cds.*//;
        if ($id ne $previousId) {
            $previousId = $id;
            printf ">%s\n", $previousId;
        }
    } else {
        printf "%s\n", $line;
    }
}
close (FH);

7. That script constructs CDS records for each gene:

./exonsToCds.pl ${db}.${geneTbl}.forward.exons.fa \
       > ${db}.${geneTbl}.forward.cds.fa
./exonsToCds.pl ${db}.${geneTbl}.reverse.exons.fa \
       > ${db}.${geneTbl}.reverse.cds.fa

8. Can now translate the CDS into protein with the kent command 'faTrans', changed into single line tab separated with the final stop codon 'X' removed from the sequence:

faTrans ${db}.${geneTbl}.forward.cds.fa stdout \
        | faToTab -type=protein -keepAccSuffix stdin stdout \
        | sed -e "s/X$//" > ${db}.${geneTbl}.forward.protein.tab
faTrans ${db}.${geneTbl}.reverse.cds.fa stdout \
        | faToTab -type=protein -keepAccSuffix stdin stdout \
        | sed -e "s/X$//" > ${db}.${geneTbl}.reverse.protein.tab

9. Can now run some sanity checks on these protein sequences, for example, how many have stop codons in the middle of the protein sequence:

echo -n "stop codons in the middle of proteins, count on forward strand: "
grep X ${db}.${geneTbl}.forward.protein.tab | wc -l
echo -n "count on reverse strand: "
grep X ${db}.${geneTbl}.reverse.protein.tab | wc -l

10. Check the length of the CDS to see if it is a multiple of 3:

echo "CDS remainder frequency for length of three multiple, forward strand: "
faCount ${db}.${geneTbl}.forward.cds.fa | grep -v "#seq|total" \
       | awk '{d=int($2/3); rem=$2-(d*3); printf "%d\n", rem}' \
       | sort | uniq -c
echo "CDS remainder frequency for length of three multiple, reverse strand: "
faCount ${db}.${geneTbl}.reverse.cds.fa | grep -v "#seq|total" \
       | awk '{d=int($2/3); rem=$2-(d*3); printf "%d\n", rem}' \
       | sort | uniq -c

11. Example results, mm10 knownGene has a few proteins with stop codons in the middle of the protein sequence, but all the CDS is a multiple of 3 length:

stop codons in the middle of proteins, count on positive strand: 23
count on negative strand: 10
CDS remainder frequency for length of three multiple, positive strand: 
 22204 0
CDS remainder frequency for length of three multiple, negative strand: 
 21786 0

hg19 knownGene also has a similar profile:

stop codons in the middle of proteins, count on forward strand: 21
count on reverse strand: 20
CDS remainder frequency for length of three multiple, forward strand: 
 31454 0
CDS remainder frequency for length of three multiple, reverse strand: 
 30606 0

A gene track such as refGene has a number of anomalies since the mRNA for the gene definition that is mapped to the reference sequence may have indels in the CDS sequence. For example rn5 refGene:

stop codons in the middle of proteins, count on positive strand: 664
count on negative strand: 685
CDS remainder frequency for length of three multiple, positive strand: 
  8300 0
   256 1
   330 2
CDS remainder frequency for length of three multiple, negative strand: 
  8254 0
   277 1
   319 2