RepeatMasker
Example procedure to operate RepeatMasker on a typical genome assembly sequence.
Partition
The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/ cd /data/genomes/ricCom1/bed/repeatMasker simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart
Cluster run
The template file used here:
#LOOP ./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out #ENDLOOP
with the gensub2 command constructs the command list jobList to run on the cluster.
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster ln -s /data/genomes/ricCom1/RMPart ./RMPart gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList
This example constructs 880 commands to run in the jobList file, typically:
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out ... etc.
The RMRun.csh script performs the RepeatMasker operation on each partition:
#!/bin/csh -ef set finalOut = $1 set inLst = $finalOut:r set inLft = $inLst:r.lft set alignOut = $finalOut:r.align set catOut = $finalOut:r.cat # Use local disk for output, and move the final result to $outPsl # when done, to minimize I/O. set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX` pushd $tmpDir # Initialize local library RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null foreach spec (`cat $inLst`) # Remove path and .2bit filename to get just the seq:start-end spec: set base = `echo $spec | sed -r -e 's/^[^:]+://'` # If $spec is the whole sequence, twoBitToFa removes the :start-end part, # which causes liftUp to barf later. So tweak the header back to # seq:start-end for liftUp's sake: twoBitToFa $spec stdout \ | sed -e "s/^>.*/>$base/" > $base.fa RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa if (-e $base.fa.cat) then mv $base.fa.cat $catOut endif end # Lift up (leave the RepeatMasker header in place because we'll liftUp # again later): liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out set nonomatch set alignFiles = ( *.align ) if ( ${#alignFiles} && -e $alignFiles[1] ) then liftUp -type=.align stdout $inLft error *.align \ > tmpOut__align else touch tmpOut__align endif # Move final result into place: mv tmpOut__out $finalOut mv tmpOut__align $alignOut popd rm -rf $tmpDir
If you have a parasol cluster, this operation is performed.
para -ram=2g make jobList para check para time > run.time cat run.time Completed: 880 of 880 jobs CPU time in finished jobs: 202665s 3377.75m 56.30h 2.35d 0.006 y IO & Wait Time: 119279s 1987.98m 33.13h 1.38d 0.004 y Average job time: 366s 6.10m 0.10h 0.00d Longest finished job: 6789s 113.15m 1.89h 0.08d Submission to last job: 6938s 115.63m 1.93h 0.08d
Collect cluster run results
Gather the partition results into a single file:
cd /data/genomes/ricCom1/bed/repeatMasker liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align head -3 ricCom1.fa.out > ricCom1.sorted.fa.out tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out
In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:
# Use the ID column to join up fragments of interrupted repeats for the # Nested Repeats track. extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed
Masking sequence
The unmasked.2bit file is masked with the RepeatMasker results:
cd /data/genomes/ricCom1/bed/repeatMasker twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit
Measure the masking:
twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt
Results in the faSize.rmsk.txt file:
350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %2.96 masked total, %3.08 masked real
This example indicates a very low masking result from RepeatMasker. The WindowMasker result masks the sequence much better and will be used for the eventual final masking.