RepeatMasker
Example procedure to operate RepeatMasker on a typical genome assembly sequence.
Partition
The source tree script simplePartition.pl can be used to partition the genome. The output of the script is merely a listing of twoBit file references representing the partitions. A specific bit of sequence is obtained from the twoBit file at the time it is needed for the RepeatMasker operation. UCSC uses partitions of 500,000 bases when using the cross_match alignment engine with RepeatMasker. That size of partition often results in a single job run-time of about 1.5 hours. When using the rmblast alignment engine with RepeatMasker, the typical single job run-time is on the order of 10 minutes. The resulting masking is very similar. This example constructs partition lists in the specified RMPart directory hierarchy:
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/ cd /data/genomes/ricCom1/bed/repeatMasker simplePartition.pl ../../ricCom1.unmasked.2bit 500000 /data/genomes/ricCom1/RMPart
Cluster run
The template file used here:
#LOOP ./RMRun.csh /data/genomes/ricCom1/RMPart/$(path1).out #ENDLOOP
with the gensub2 command constructs the command list jobList to run on the cluster.
mkdir -p /data/genomes/ricCom1/bed/repeatMasker/run.cluster cd /data/genomes/ricCom1/bed/repeatMasker/run.cluster ln -s /data/genomes/ricCom1/RMPart ./RMPart gensub2/data/genomes/ricCom1/RMPart/partitions.lst single template jobList
This example constructs 880 commands to run in the jobList file, typically:
./RMRun.csh /data/genomes/ricCom1/RMPart/000/000.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/001/001.lst.out ./RMRun.csh /data/genomes/ricCom1/RMPart/002/002.lst.out ... etc.
The RMRun.csh script performs the RepeatMasker operation on each partition:
#!/bin/csh -ef
set finalOut = $1
set inLst = $finalOut:r
set inLft = $inLst:r.lft
set alignOut = $finalOut:r.align
set catOut = $finalOut:r.cat
# Use local disk for output, and move the final result to $outPsl
# when done, to minimize I/O.
set tmpDir = `mktemp -d -p /scratch/tmp doRepeatMasker.cluster.XXXXXX`
pushd $tmpDir
# Initialize local library
RepeatMasker -engine rmblast -pa 1 -species 'Ricinus communis' /dev/null
foreach spec (`cat $inLst`)
# Remove path and .2bit filename to get just the seq:start-end spec:
set base = `echo $spec | sed -r -e 's/^[^:]+://'`
# If $spec is the whole sequence, twoBitToFa removes the :start-end part,
# which causes liftUp to barf later. So tweak the header back to
# seq:start-end for liftUp's sake:
twoBitToFa $spec stdout \
| sed -e "s/^>.*/>$base/" > $base.fa
RepeatMasker -engine rmblast -pa 1 -align -species 'Ricinus communis' $base.fa
if (-e $base.fa.cat) then
mv $base.fa.cat $catOut
endif
end
# Lift up (leave the RepeatMasker header in place because we'll liftUp
# again later):
liftUp -type=.out stdout $inLft error *.fa.out > tmpOut__out
set nonomatch
set alignFiles = ( *.align )
if ( ${#alignFiles} && -e $alignFiles[1] ) then
liftUp -type=.align stdout $inLft error *.align \
> tmpOut__align
else
touch tmpOut__align
endif
# Move final result into place:
mv tmpOut__out $finalOut
mv tmpOut__align $alignOut
popd
rm -rf $tmpDir
If you have a parasol cluster, this operation is performed.
para -ram=2g make jobList para check para time > run.time cat run.time Completed: 880 of 880 jobs CPU time in finished jobs: 202665s 3377.75m 56.30h 2.35d 0.006 y IO & Wait Time: 119279s 1987.98m 33.13h 1.38d 0.004 y Average job time: 366s 6.10m 0.10h 0.00d Longest finished job: 6789s 113.15m 1.89h 0.08d Submission to last job: 6938s 115.63m 1.93h 0.08d
Collect cluster run results
Gather the partition results into a single file:
cd /data/genomes/ricCom1/bed/repeatMasker liftUp ricCom1.fa.out /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.out liftUp ricCom1.fa.align /dev/null carry /data/genomes/ricCom1/bed/repeatMasker/RMPart/???/*.align head -3 ricCom1.fa.out > ricCom1.sorted.fa.out tail -n +4 ricCom1.fa.out | sort -k5,5 -k6,6n >> ricCom1.sorted.fa.out
In addition, using the UCSC script: extractNestedRepeats.pl for the nestedRepeats track:
# Use the ID column to join up fragments of interrupted repeats for the # Nested Repeats track. extractNestedRepeats.pl ricCom1.fa.out | sort -k1,1 -k2,2n > ricCom1.nestedRepeats.bed
Masking sequence
The unmasked.2bit file is masked with the RepeatMasker results:
cd /data/genomes/ricCom1/bed/repeatMasker twoBitMask /data/genomes/ricCom1/ricCom1.unmasked.2bit ricCom1.sorted.fa.out ricCom1.rmsk.2bit
Measure the masking:
twoBitToFa ricCom1.rmsk.2bit stdout | faSize stdin > faSize.rmsk.txt
Results in the faSize.rmsk.txt file:
350621860 bases (13662715 N's 336959145 real 326566002 upper 10393143 lower) in 25763 sequences in 1 files Total size: mean 13609.5 sd 106411.1 min 202 (EQ999533) max 4693355 (EQ973772) median 1094 %2.96 masked total, %3.08 masked real
This example indicates a very low masking result from RepeatMasker. The WindowMasker result masks the sequence much better and will be used for the eventual final masking.