Parasol job control system

From genomewiki

Introduction

The parasol job control system is used to manage a multiple CPU/core compute cluster. It can also be used on a single computer that has multiple CPUs/cores.

It is the cluster control system expected by the U.C. Santa Cruz Genomics Institute genomics toolsets and processing pipelines such as building new genome browsers, computing lastz/chain/net pair-wise alignments, multiple alignments of genome assemblies, and many other bioinformatics processing toolsets.

Computer organization

One computer (or one CPU of one computer) is allocated to the task of managing the compute jobs. This is referred to as the parasol hub machine.

The other computers in the cluster are allocated to the task of running the compute jobs. These computers are referred to as the parasol node machines.

Compute jobs submitted to the system are managed with parasol commands on the hub machine. The parasol processes running on the hub machine assign the compute jobs to the node machines.

A single machine can be used as both the hub controller and as a node task machine as long as two CPU cores are reserved, one for the operating system and one for the hub processes, with the extra CPU cores allocated to the node task resource pool.
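
As a rough illustration of the two-core reservation on a combined hub and node machine, the following sketch computes how many cores would remain for the node task pool. It assumes the GNU coreutils nproc command is available; the variable names are illustrative and are not part of parasol.

```shell
# Cores held back on a combined hub/node machine:
# one for the operating system, one for the hub processes.
reserved=2

# Total cores visible to this shell (assumes GNU coreutils nproc).
total=$(nproc)

# Cores left over for the node task pool, never below zero.
pool=$(( total - reserved ))
if [ "$pool" -lt 0 ]; then
  pool=0
fi

echo "total cores: $total, reserved: $reserved, available to node pool: $pool"
```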

SSH keys

The hub machine needs to communicate with the node machines via the UDP network protocol, and via ssh commands for setup tasks.

Assuming this is a new machine (such as a cloud machine instance) with no previous ssh operations, the ssh keys can be generated:

echo | ssh-keygen -N "" -t rsa -q

The -N "" option sets an empty passphrase. The echo provides an empty answer to the remaining question that ssh-keygen would normally ask, the file in which to save the key, so the default location is accepted. This creates two files in $HOME/.ssh/:

-rw-r--r--. 1  397 Mar 22 17:43 id_rsa.pub
-rw-------. 1 1675 Mar 22 17:43 id_rsa

The existing $HOME/.ssh/authorized_keys probably already has keys added from the cloud machine management system to allow login to the machine instance. To add this newly generated key to the authorized_keys file without disturbing existing contents:

 cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

These id_rsa key files will be copied to the other machines in this cluster to allow this hub machine to access them via ssh. The append just performed allows this hub machine to access itself via ssh in case it is also used as a node machine.
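
If the setup is likely to be repeated, the append to authorized_keys can be guarded so the same key is not added twice. The sketch below is one possible approach, not part of the parasol scripts; the function name add_key_once is hypothetical.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only when that exact line is not already present.
add_key_once() {
  local pub_file=$1
  local auth_file=$2
  # Create the file if needed and keep it private, as sshd expects.
  touch "$auth_file"
  chmod 600 "$auth_file"
  # -x matches whole lines, -F treats the key as a fixed string.
  if ! grep -qxF -- "$(cat "$pub_file")" "$auth_file"; then
    cat "$pub_file" >> "$auth_file"
  fi
}

# Intended use on the hub machine (left commented out in this sketch):
# add_key_once "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```

Running the helper a second time leaves the file unchanged, so it is safe to include in a startup script.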

Parasol installation

This example is going to install everything in a constructed directory hierarchy: /data/...

This directory will be exported as an NFS filesystem for access by the node machines in this cluster.

These commands need to be run as the root user, which the small bash script File:ParasolInstall.sh.txt verifies:


 #!/bin/bash
 # Create the /data/ hierarchy and install the parasol binaries and helper scripts.
 self=$(id -un)
 if [ "${self}" = "root" ]; then
   printf "# creating /data/ directory hierarchy and installing binaries\n" 1>&2
   # working directories are world-writable; /data/bin is writable by root only
   mkdir -p /data/bin /data/genomes /data/parasol/nodeInfo
   chmod 777 /data /data/genomes /data/parasol /data/parasol/nodeInfo
   chmod 755 /data/bin
   # fetch the UCSC linux.x86_64 binaries
   rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ /data/bin/
   # wget is used to fetch the two helper scripts
   yum -y install wget
   wget -qO /data/parasol/nodeInfo/nodeReport.sh 'http://genomewiki.ucsc.edu/images/e/e3/NodeReport.sh.txt'
   chmod 755 /data/parasol/nodeInfo/nodeReport.sh
   wget -qO /data/parasol/initParasol 'http://genomewiki.ucsc.edu/images/4/4f/InitParasol.sh.txt'
   chmod 755 /data/parasol/initParasol
 else
   printf "ERROR: need to run this script as sudo ./parasolInstall.sh\n" 1>&2
   exit 1
 fi

PATH setup

The user that will be using this parasol system will need to have the /data/bin directory in their shell PATH environment. This can be added simply to the .bashrc file in the user's home directory:

echo 'export PATH=/data/bin:$PATH' >> $HOME/.bashrc

Then source that file to apply the change to the current shell:

. $HOME/.bashrc

This entire discussion assumes the bash shell is the user's unix shell.
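
To confirm that the PATH change took effect, a small check such as the following can be used. This is a minimal sketch; the function name path_has_dir is hypothetical and not part of the parasol tools.

```shell
# Hypothetical helper: succeed when directory $1 is one of the
# colon-separated entries of the PATH-style string $2.
path_has_dir() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_has_dir /data/bin "$PATH"; then
  echo "/data/bin is in PATH"
else
  echo "/data/bin is missing from PATH; check \$HOME/.bashrc"
fi
```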

To root or not to root

If this parasol system is going to be used by the single user that is installing it, the system does not need to be run as the root user. If multiple users are going to use this parasol system, the daemons need to be run as root user. When the system is run as the root user, it will maintain correct ownership of each user's files created during parasol cluster operations. For a single user operation, all files used by the system will be owned by that single user identity. To use the commands as the root user, add the sudo command in front of each of the operations mentioned below.

Initialize node instances

Each node to be used in the system needs to report itself to the parasol hub. This is done by running the /data/parasol/nodeInfo/nodeReport.sh script while logged into each node. The single argument to the script is the number of CPUs on that node to reserve outside the parasol system. For example, a single machine used as both hub and node should reserve two CPUs, one for the operating system and one for the parasol hub processes:

/data/parasol/nodeInfo/nodeReport.sh 2

On a machine used only as a node, no CPUs need to be reserved:

/data/parasol/nodeInfo/nodeReport.sh 0

For cloud setup procedures, this "nodeReport.sh 0" command can be included in the cloud startup scripts for each node brought on-line.
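
The reservation rule described above can be written as a small helper for use in a startup script. The sketch below only prints the command that would be run on each machine; the function name reserve_count and the two IP addresses are hypothetical.

```shell
# Hypothetical helper: choose the nodeReport.sh argument for a machine.
# Prints 2 when the machine is also the hub, otherwise 0.
reserve_count() {
  local node=$1
  local hub=$2
  if [ "$node" = "$hub" ]; then
    echo 2
  else
    echo 0
  fi
}

# Illustrative addresses only (assumptions, not a real cluster).
hub=10.0.0.1
for node in 10.0.0.1 10.0.0.2; do
  echo "on $node run: /data/parasol/nodeInfo/nodeReport.sh $(reserve_count "$node" "$hub")"
done
```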

Initialize and verify ssh keys

On the parasol hub machine, use the initialize function once to prepare the ssh keys and verify that they function properly:

/data/parasol/initParasol initialize

Starting/Stopping the parasol system

After all the node machines have reported in via the nodeReport.sh script, the parasol system can be started:

cd /data/parasol
./initParasol start

That script will echo the command it is using to start the system:

Starting parasol:/usr/bin/ssh 10.109.0.54 /data/bin/paraNode start -cpu=6 log=/data/parasol/10.109.0.54.2018-03-28T17:33.log hub=10.109.0.54 umask=002 sysPath=. userPath=bin
Done.

And the status of the parasol system can be checked:

$ parasol status
CPUs total: 6
CPUs free: 6
CPUs busy: 0
Nodes total: 1
Nodes dead: 0
Nodes sick?: 0
Jobs running:  0
Jobs waiting:  0
Jobs finished: 0
Jobs crashed:  0
Spokes free: 30
Spokes busy: 0
Spokes dead: 0
Active batches: 0
Total batches: 0
Active users: 0
Total users: 0
Days up: 0.000046
Version: 12.18
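
Because the status report is plain "label: value" text, individual fields can be pulled out in a script. The following is a minimal sketch, assuming the output format shown above; the function name status_field is hypothetical, and the sample text stands in for a live hub.

```shell
# Hypothetical helper: print the value of one field from status text on stdin.
# $1 is the label exactly as printed, for example "Nodes dead".
status_field() {
  awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

# Sample lines copied from the status output above.
# On a live hub, use instead: parasol status | status_field "Nodes dead"
sample='CPUs total: 6
CPUs free: 6
Nodes dead: 0
Jobs running:  0'

dead=$(printf '%s\n' "$sample" | status_field "Nodes dead")
if [ "$dead" -gt 0 ]; then
  echo "warning: $dead dead node(s)"
else
  echo "all nodes alive"
fi
```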

To stop the system:

cd /data/parasol
./initParasol stop

That command responds:

Stopping parasol:Telling 10.109.0.54 to quit 
rudpSend timed out

Testing UDP network connections

Blocked UDP traffic is almost never a problem, but it is easy to test with the nc command from the nmap-ncat package:

sudo yum install nmap-ncat

Open a shell terminal to each machine to be tested. Find out what the IP address is for this machine:

$ ip addr show | egrep "inet.*eth0" | awk '{print $2}' | sed -e 's#/.*##;'
10.109.0.40

On this machine, start an nc listener process on some port number, port 6111 in this example:

 nc -ul 6111

In the shell terminal on the second machine, start a corresponding connection to the first machine's address and port:

 nc -u 10.109.0.40 6111

With those two processes connected, typing anything on one terminal will show up at the other terminal, after return is pressed at the end of a line.

To exit each listener process, type Ctrl-D