DoSameSpeciesLiftOver.pl: Difference between revisions

From genomewiki
Jump to navigationJump to search
(ooc file construction)
(→‎How does this process work: Adding note about -debug mode)
 
(10 intermediate revisions by 3 users not shown)
Line 3: Line 3:
For commercial use of these toolsets, please note the license considerations for the
For commercial use of these toolsets, please note the license considerations for the
kent source tools at the: [https://genome-store.ucsc.edu/ Genome Store]
kent source tools at the: [https://genome-store.ucsc.edu/ Genome Store]
This process also uses the '''blat''' command.  For commercial license please see:
[http://www.kentinformatics.com/ Kent Informatics]


==Prerequisites==
==Prerequisites==
Line 33: Line 36:
<pre>
<pre>
<nowiki>
<nowiki>
  mkdir -p /data/scripts /data/bin
  sudo mkdir -p /data/scripts /data/bin
  chmod 755 /data/scripts /data/bin
  sudo chmod 755 /data/scripts /data/bin


  rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
  rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
Line 46: Line 49:
       --exclude='kent/src/hg/utils/automation/lastz_D' \
       --exclude='kent/src/hg/utils/automation/lastz_D' \
       --exclude='kent/src/hg/utils/automation/openStack'
       --exclude='kent/src/hg/utils/automation/openStack'
   wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/utils/bedSingleCover.pl'
   wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl'
</nowiki>
</nowiki>
</pre>
</pre>
'''NOTE:''' A copy of the '''lastz''' binary is included in the rsync download
of binaries from hgdownload.  It is named '''lastz-1.04.00''' to identify the version.
Source for lastz can be obtained from [https://github.com/lastz/lastz lastz github.]


==PATH setup==
==PATH setup==
Line 62: Line 61:
  echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc
  echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc


Then, '''source''' that file to add that to this current shell:
'''Note:''' '''/data/bin/blat''' has been added to the '''PATH''' for access to
the '''blat''' command.  Then, '''source''' that file to add that to this current shell:


  . $HOME/.bashrc
  . $HOME/.bashrc
Line 69: Line 69:


  echo $PATH
  echo $PATH
  /data/bin:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
  /data/bin:/data/bin/blat:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin


==Obtain genome sequences==
==Obtain genome sequences==


This example is going to use two Soy bean assemblies, one from '''genbank''' and one from '''refseq'''
This example is going to use two Soy bean assemblies, one from '''genbank''' and one from '''refseq'''
assemblies at NCBI.
assemblies at [https://www.ncbi.nlm.nih.gov/assembly NCBI].


  mkdir -p /data/genomes/genbank /data/genomes/refseq
  mkdir -p /data/genomes/genbank /data/genomes/refseq
Line 107: Line 107:
  -rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
  -rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
  -rw-rw-r--. 1    19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
  -rw-rw-r--. 1    19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
  cd mkdir /data/genomes/GCF_000004515.3_V1.1
  cd /data/genomes/GCF_000004515.3_V1.1
  faToTwoBit /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
  faToTwoBit /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
  twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
  twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
Line 140: Line 140:
  Writing GCA_000004515.3_Glycine_max_v2.0.ooc
  Writing GCA_000004515.3_Glycine_max_v2.0.ooc
  Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc
  Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc
==doSameSpeciesLiftOver.pl==
This script is going to run the entire process.  With the '''.2bit''', '''chrom.sizes''' and '''ooc'''
files in place, the process is ready to run with this single command:
export target="GCA_000004515.3_Glycine_max_v2.0"
export query="GCF_000004515.3_V1.1"
cd /data/genomes/${target}
time (doSameSpeciesLiftOver.pl -verbose=2 -buildDir=`pwd` \
  -ooc=`pwd`/${target}.ooc -fileServer=localhost -localTmp="/dev/shm" \
    -bigClusterHub=localhost -dbHost=localhost -workhorse=localhost \
      -target2Bit=`pwd`/${target}.2bit -targetSizes=`pwd`/${target}.chrom.sizes \
        -query2Bit=/data/genomes/${query}/${query}.2bit \
          -querySizes=/data/genomes/${query}/${query}.chrom.sizes ${target} ${query}) > do.log 2>&1
==Result files==
The '''liftOver''' file result is in this working directory:
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
ls -og *.over.chain.gz
-rw-rw-r--. 1 1325174 Apr 14 17:27 GCA_000004515.3_Glycine_max_v2.0ToGCF_000004515.3_V1.1.over.chain.gz
This has also been converted to [http://genome.ucsc.edu/goldenPath/help/bigChain.html bigChain] files for display in an assembly hub:
ls -og *.bb
-rw-rw-r--. 1 1622857 Apr 14 17:27 chainGCF_000004515.3_V1.1.bb
-rw-rw-r--. 1 1900014 Apr 14 17:27 chainGCF_000004515.3_V1.1Link.bb
==How does this process work==
The '''doSameSpeciesLiftOver.pl''' script performs the processing in distinct '''steps'''.  Each '''step''' is
almost always performed with a C-shell or bash shell script.  Therefore, if there is a problem in
any '''step''', the commands performing the '''step''' can be dissected from the script in operation,
the problem identified and fixed, and the '''step''' completed manually by running the rest of the
commands in that script.  Once a step has been completed, the process can continue with the next
step using the argument '''-continue=nextStepName'''.  Check the usage message from the '''doSameSpeciesLiftOver.pl'''
script to see a listing of the '''steps''' and their sequence.  Specifically:
align, chain, net, load, cleanup
In this example, the various scripts are:
-rwxrwxr-x. 1 1485 Apr 14 00:06 run.blat/doAlign.csh
-rwxrwxr-x. 1 2337 Apr 14 00:08 run.blat/job.csh
-rwxrwxr-x. 1  499 Apr 14 04:43 run.chain/job.csh
-rwxrwxr-x. 1  641 Apr 14 04:43 run.chain/doChain.csh
-rwxrwxr-x. 1 1870 Apr 14 17:26 run.chain/doNet.csh
-rwxrwxr-x. 1 2317 Apr 14 17:27 doLoad.csh
-rwxrwxr-x. 1  540 Apr 14 17:27 doCleanup.csh
The cluster runs performed were in '''run.blat''' and '''run.chain''' with the following
typical processing times for this genome sequence:
cat run.blat/run.time
Completed: 2289 of 2289 jobs
CPU time in finished jobs:    658550s  10975.83m  182.93h    7.62d  0.021 y
IO & Wait Time:                  9734s    162.24m    2.70h    0.11d  0.000 y
Average job time:                292s      4.87m    0.08h    0.00d
Longest finished job:            3935s      65.58m    1.09h    0.05d
Submission to last job:        16493s    274.88m    4.58h    0.19d
cat run.chain/run.time
Completed: 63 of 63 jobs
CPU time in finished jobs:        49s      0.82m    0.01h    0.00d  0.000 y
IO & Wait Time:                  167s      2.78m    0.05h    0.00d  0.000 y
Average job time:                  3s      0.06m    0.00h    0.00d
Longest finished job:              17s      0.28m    0.00h    0.00d
Submission to last job:            21s      0.35m    0.01h    0.00d




The script can be used in a -debug mode where it does nothing but construct the required shell scripts.
[[Category:Cluster FAQ]]
[[Category:Cluster FAQ]]
[[Category:Technical FAQ]]
[[Category:Technical FAQ]]

Latest revision as of 17:24, 14 March 2019

Licensing

For commercial use of these toolsets, please note the license considerations for the kent source tools at the: Genome Store

This process also uses the blat command. For commercial license please see: Kent Informatics

Prerequisites

This discussion assumes you are familiar with Unix shell command line programming and scripting. You will be encountering and interacting with csh/tcsh, bash, perl, and python scripting languages. You will need at least one computer with several CPU cores, preferably a multiple compute cluster system or equivalent in a cloud computing environment.

This entire discussion assumes the bash shell is the user's unix shell.

Parasol Job Control System

The scripts and programs used here expect to find the Parasol_job_control_system in place and operational.

Install scripts and kent command line utilities

This is a bit of a kludge at this time (April 2018), we are working on a cleaner distribution of these scripts. As was mentioned in the Parasol_job_control_system setup, the kent command line binaries and these scripts are going to reside in /data/bin/ and /data/scripts/. This is merely a style custom to keep scripts separate from binaries, this is not strictly necessary to keep them separate.


 sudo mkdir -p /data/scripts /data/bin
 sudo chmod 755 /data/scripts /data/bin

 rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
 git archive --remote=git://genome-source.soe.ucsc.edu/kent.git \
  --prefix=kent/ HEAD src/hg/utils/automation \
     | tar vxf - -C /data/scripts --strip-components=5 \
        --exclude='kent/src/hg/utils/automation/incidentDb' \
      --exclude='kent/src/hg/utils/automation/configFiles' \
      --exclude='kent/src/hg/utils/automation/ensGene' \
      --exclude='kent/src/hg/utils/automation/genbank' \
      --exclude='kent/src/hg/utils/automation/lastz_D' \
      --exclude='kent/src/hg/utils/automation/openStack'
  wget -O /data/bin/bedSingleCover.pl 'http://genome-source.soe.ucsc.edu/gitlist/kent.git/raw/master/src/utils/bedSingleCover.pl'

PATH setup

Add or verify the two directories /data/bin and /data/scripts are added to the shell PATH environment. This can be added simply to the .bashrc file in your home directory:

echo 'export PATH=/data/bin:/data/bin/blat:/data/scripts:$PATH' >> $HOME/.bashrc

Note: /data/bin/blat has been added to the PATH for access to the blat command. Then, source that file to add that to this current shell:

. $HOME/.bashrc

Verify you see those pathnames on the PATH variable:

echo $PATH
/data/bin:/data/bin/blat:/data/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin

Obtain genome sequences

This example is going to use two Soy bean assemblies, one from genbank and one from refseq assemblies at NCBI.

mkdir -p /data/genomes/genbank /data/genomes/refseq
cd /data/genomes/genbank
rsync -L -a -P \
  rsync://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Glycine_max/all_assembly_versions/GCA_000004515.3_Glycine_max_v2.0/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz ./
cd /data/genomes/refseq
rsync -L -a -P \
rsync://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Glycine_max/all_assembly_versions/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz ./
cd /data/genomes
ls -og genbank/*.gz refseq/*.gz
-r--r--r--. 1 296231629 Jun 13  2016 genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz
-r--r--r--. 1 296228780 Jan  6  2015 refseq/GCF_000004515.3_V1.1_genomic.fna.gz

Working directories

Organize your work in a directory hierarchy for convenience of bookeeping and shell script automation for numerous sequences.

The target sequence name is GCA_000004515.3_Glycine_max_v2.0 and the query sequence name is GCF_000004515.3_V1.1.

Convert the fasta sequence to .2bit files and calculate chrom.sizes files.

mkdir /data/genomes/GCA_000004515.3_Glycine_max_v2.0
mkdir /data/genomes/GCF_000004515.3_V1.1
cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0 
faToTwoBit /data/genomes/genbank/GCA_000004515.3_Glycine_max_v2.0_genomic.fna.gz GCA_000004515.3_Glycine_max_v2.0.2bit
twoBitInfo GCA_000004515.3_Glycine_max_v2.0.2bit stdout | sort -k2,2nr > GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
ls -og
-rw-rw-r--. 1 287569208 Apr 13 17:30 GCA_000004515.3_Glycine_max_v2.0.2bit
-rw-rw-r--. 1     19439 Apr 13 17:31 GCA_000004515.3_Glycine_max_v2.0.chrom.sizes
cd /data/genomes/GCF_000004515.3_V1.1
faToTwoBit /data/genomes/GCF_000004515.3_V1.1/GCF_000004515.3_V1.1_genomic.fna.gz GCF_000004515.3_V1.1.2bit
twoBitInfo GCF_000004515.3_V1.1.2bit stdout | sort -k2,2nr > GCF_000004515.3_V1.1.chrom.sizes
ls -og
-rw-rw-r--. 1 286264183 Apr 13 17:30 GCF_000004515.3_V1.1.2bit
-rw-rw-r--. 1     23246 Apr 13 17:31 GCF_000004515.3_V1.1.chrom.sizes

Construct ooc file

The blat operation in this procedure works much more efficiently when a pre-computed ooc file is constructed to use for all blat comparisons. This is a file that counts up over used 11-mer tiles for blat to eliminate them from the initial consideration for alignment, thereby limiting the amount of alignment that has to take place. We base the repMatch parameter on the size of the genome compared to UCSC hg19 sequence. A genome of that size used -repMatch=1024. We want to adjust that parameter in proportion to that size. Measure the size of the target genome:

cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
twoBitToFa GCA_000004515.3_Glycine_max_v2.0.2bit stdout | faSize stdin
978416860 bases (23046524 N's 955370336 real 521703227 upper 433667109 lower) in 1189 sequences in 1 files

Note the number of real bases 955370336 to use in this proportion calculation:

calc \( 955370336 / 2861349177  \) \* 1024
( 955370336 / 2861349177 ) * 1024 = 341.901377

Round down the answer to the nearest 50 for the -repMatch=300 use in blat:

blat GCA_000004515.3_Glycine_max_v2.0.2bit /dev/null /dev/null -tileSize=11 \
  -makeOoc=GCA_000004515.3_Glycine_max_v2.0.ooc -repMatch=300
Loading GCA_000004515.3_Glycine_max_v2.0.2bit
Counting GCA_000004515.3_Glycine_max_v2.0.2bit
Writing GCA_000004515.3_Glycine_max_v2.0.ooc
Wrote 64902 overused 11-mers to GCA_000004515.3_Glycine_max_v2.0.ooc

doSameSpeciesLiftOver.pl

This script is going to run the entire process. With the .2bit, chrom.sizes and ooc files in place, the process is ready to run with this single command:

export target="GCA_000004515.3_Glycine_max_v2.0"
export query="GCF_000004515.3_V1.1"
cd /data/genomes/${target}
time (doSameSpeciesLiftOver.pl -verbose=2 -buildDir=`pwd` \
  -ooc=`pwd`/${target}.ooc -fileServer=localhost -localTmp="/dev/shm" \
    -bigClusterHub=localhost -dbHost=localhost -workhorse=localhost \
      -target2Bit=`pwd`/${target}.2bit -targetSizes=`pwd`/${target}.chrom.sizes \
        -query2Bit=/data/genomes/${query}/${query}.2bit \
          -querySizes=/data/genomes/${query}/${query}.chrom.sizes ${target} ${query}) > do.log 2>&1

Result files

The liftOver file result is in this working directory:

cd /data/genomes/GCA_000004515.3_Glycine_max_v2.0
ls -og *.over.chain.gz
-rw-rw-r--. 1 1325174 Apr 14 17:27 GCA_000004515.3_Glycine_max_v2.0ToGCF_000004515.3_V1.1.over.chain.gz

This has also been converted to bigChain files for display in an assembly hub:

ls -og *.bb
-rw-rw-r--. 1 1622857 Apr 14 17:27 chainGCF_000004515.3_V1.1.bb
-rw-rw-r--. 1 1900014 Apr 14 17:27 chainGCF_000004515.3_V1.1Link.bb

How does this process work

The doSameSpeciesLiftOver.pl script performs the processing in distinct steps. Each step is almost always performed with a C-shell or bash shell script. Therefore, if there is a problem in any step, the commands performing the step can be dissected from the script in operation, the problem identified and fixed, and the step completed manually by running the rest of the commands in that script. Once a step has been completed, the process can continue with the next step using the argument -continue=nextStepName. Check the usage message from the doSameSpeciesLiftOver.pl script to see a listing of the steps and their sequence. Specifically:

align, chain, net, load, cleanup

In this example, the various scripts are:

-rwxrwxr-x. 1 1485 Apr 14 00:06 run.blat/doAlign.csh
-rwxrwxr-x. 1 2337 Apr 14 00:08 run.blat/job.csh
-rwxrwxr-x. 1  499 Apr 14 04:43 run.chain/job.csh
-rwxrwxr-x. 1  641 Apr 14 04:43 run.chain/doChain.csh
-rwxrwxr-x. 1 1870 Apr 14 17:26 run.chain/doNet.csh
-rwxrwxr-x. 1 2317 Apr 14 17:27 doLoad.csh
-rwxrwxr-x. 1  540 Apr 14 17:27 doCleanup.csh

The cluster runs performed were in run.blat and run.chain with the following typical processing times for this genome sequence:

cat run.blat/run.time
Completed: 2289 of 2289 jobs
CPU time in finished jobs:     658550s   10975.83m   182.93h    7.62d  0.021 y
IO & Wait Time:                  9734s     162.24m     2.70h    0.11d  0.000 y
Average job time:                 292s       4.87m     0.08h    0.00d
Longest finished job:            3935s      65.58m     1.09h    0.05d
Submission to last job:         16493s     274.88m     4.58h    0.19d
cat run.chain/run.time
Completed: 63 of 63 jobs
CPU time in finished jobs:         49s       0.82m     0.01h    0.00d  0.000 y
IO & Wait Time:                   167s       2.78m     0.05h    0.00d  0.000 y
Average job time:                   3s       0.06m     0.00h    0.00d
Longest finished job:              17s       0.28m     0.00h    0.00d
Submission to last job:            21s       0.35m     0.01h    0.00d


The script can be used in a -debug mode where it does nothing but construct the required shell scripts.