Cluster Jobs: Difference between revisions

Revision as of 20:38, 26 April 2006

Cluster Job Organization

Input/Output

The most critical factor in designing your cluster jobs is understanding exactly where your input data comes from, where temporary files are created during processing, and where your output goes. With several hundred CPUs reading and writing data at once, it is very easy to overload the underlying NFS fileservers. Ideally, your input data comes from one fileserver, your temporary files are written to local disk space in /scratch/tmp/, and your output goes to a different NFS server from the one that supplied the input.

Job Script

A properly constructed job is typically a small .csh shell script that begins:

#!/bin/csh -fe

The -e flag makes the script exit with an error as soon as any command fails, so it either runs to completion successfully or stops at the first failure. The -f flag skips reading your .cshrc, so the script starts quickly and does not depend on your personal settings. Parasol notices when a command exits with an error, so it knows the job has failed. You can see many script examples in the make*.doc files in the kent source tree, where we document all of our browser construction work.

Job Recovery

Parasol makes it very convenient to recover failed jobs, and there will almost always be some failed jobs, for a variety of reasons. The most important thing is to design your jobs so that the presence of a single file, created in one step, indicates successful completion. Typically a job does all of its work on the /scratch/tmp/ filesystem and creates its result file there. Once the work has completed successfully, the job makes a single copy of the result file back to a /cluster/storeN/ filesystem, which is outside of the cluster and thus more permanent.

Parasol can verify the existence of that result file to determine whether the job completed successfully. The gensub2 template syntax {check out line <result.file>} tells parasol which file to check. Parasol keeps track of which jobs succeeded and which failed. To re-run the failed jobs, simply do a 'para push' of the batch again, and the failed jobs will be retried. A job can be retried like this until it has failed four times.
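The scratch-then-copy pattern described above can be sketched as follows. This is only an illustration, written in Python rather than csh so the logic is easy to follow and test; a real job would be a csh script as described under Job Script. The names run_job and result_is_complete, the scratch_root parameter, and the use of sorting as a stand-in for the real work are all hypothetical. The completeness test (file exists, is non-empty, and ends with a newline) is an assumption about what a result check might look for, not a description of how parasol implements it.

```python
import os
import shutil
import tempfile


def run_job(input_path, result_path, scratch_root):
    """Do all work in a private scratch directory, then copy the result out once.

    scratch_root stands in for /scratch/tmp/ and result_path for a file on
    /cluster/storeN/; both are parameters so the sketch can run anywhere.
    """
    work_dir = tempfile.mkdtemp(prefix="job.", dir=scratch_root)
    try:
        scratch_result = os.path.join(work_dir, "result.out")
        # Stand-in for the real computation: write the input lines, sorted.
        with open(input_path) as src:
            lines = sorted(src.read().splitlines())
        with open(scratch_result, "w") as out:
            for line in lines:
                out.write(line + "\n")
        # Only after the work has fully succeeded: one copy to permanent storage.
        shutil.copyfile(scratch_result, result_path)
    finally:
        # Clean up local scratch space whether or not the job succeeded.
        shutil.rmtree(work_dir, ignore_errors=True)


def result_is_complete(result_path):
    """Assumed completion test: file exists, is non-empty, ends with a newline."""
    if not os.path.isfile(result_path) or os.path.getsize(result_path) == 0:
        return False
    with open(result_path, "rb") as f:
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b"\n"
```

Because the result file is written to its final location only after all the work has succeeded, a job that fails partway leaves nothing behind there, and a later retry of the same job starts from a clean state.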

Long Running Jobs

If you really must run jobs that occupy a lot of CPU time (and I would strongly recommend redesigning the job instead), use the cluster politely: use the para option -maxNode=N to limit the number of nodes your long-running jobs occupy. You must leave the cluster in a state where it can still do work for other users.
