Using custom track database

From genomewiki
Jump to navigationJump to search

Using the Custom Track Database

A new feature of the genome browser as of April 2007 (v156) is the ability to use a data base for custom tracks. Up to this date, custom track data has been kept in files in the /trash/ct/ directory. This article discusses the steps required to enable this function.

Summary configuration

  • database loader binaries hgLoadBed, hgLoadWiggle, hgLoadMaf and wigEncode are installed in /cgi-bin/loader/ - these are installed via the normal 'make cgi' in the source tree kent/src/hg/ directory.
  • an empty customTrash database has been created on the MySQL host - create this manually once, the MySQL host name is a configuration item, the database name customTrash is not a configuration item
  • temporary read-write data directory /data/tmp has been created with read/write/delete enabled for the Apache server effective user, this directory name is a configuration item
  • configuration items are specified in /cgi-bin/hg.conf/ - this will turn on the function
  • for command line access to the database, create a special ~/.hg.ct.conf to be used with the environment variable HGDB_CONF
  • create a cron job to run a cleaner script to expire and remove older tables from the database - dbTrash command is used for this purpose

Host and database name

For performance and security considerations, the MySQL host for the custom track database can be a separate machine from the ordinary MySQL host that usually serves up the assembly databases or the hgcentral database. It is not required that the custom track database be on a separate MySQL server. The specification of the host machine is placed in the /cgi-bin/hg.conf file, for example a host machine called "ctdbhost":

customTracks.host=ctdbHost

The database name used on this host is fixed at customTrash which is a define in the source tree file hg/inc/customTrack.h

/cgi-bin/hg.conf configuration items

The following items must be specified in /cgi-bin/hg.conf to enable this function:

customTracks.host=ctdbhost
customTracks.user=ctdbuser
customTracks.password=ctdbpasswd
customTracks.useAll=yes

Establish this user account and password in MySQL with db and user privileges:

Select, Insert, Update, Delete, Create, Drop, Alter
for example with your MySQL root user account:
hgsql -hctdbhost -uroot -p -e "GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,DROP,ALTER"
on customTrash TO ctdbuser@yourWebHost IDENTIFIED by 'ctdbpasswd';" mysql

Optionally, a temporary read-write directory used during database loading can be specified:

customTracks.tmpdir=/data/tmp

The default for this is /data/tmp and should be created with read/write/delete access for the Apache server effective user. It should be on a local filesystem for best access speed, not via NFS.

Database loaders

The database loaders used to load custom tracks are the standard loader commands found in the source tree, hgLoadBed, hgLoadWiggle, hgLoadMaf and wigEncode. They are installed into /cgi-bin/loader/ with a 'make cgi' from the source tree directory kent/src/hg/ These loaders are used by the cgi binaries hgCustom, hgTracks, and hgTables to load custom tracks into the database. They are operated in an exec'd pipeline fashion, the code details can be see in src/hg/lib/customFactory.c

NOTE: You will need to enable the "LOCAL" keyword usage in your MySQL system. Use the argument: --local-infile=1 on your mysqld_safe startup command. Earlier versions of MySQL had this on by default. Later versions of MySQL disable this due to security concerns. If you are building your MySQL system from source you will also have to enable the keyword during your configure step. See also: LOAD DATA LOCAL

Command line access

Since the MySQL host may be different than your ordinary MySQL host, you will need to create a unique $HOME/.ct.hg.conf file to be used in the case where you want to manipulate this separate database with the kent source tree command line tools. This unique .ct.hg.conf is merely a copy of your normal .hg.conf file but with a different host/username/password specified:

db.host=ctdbhost
db.user=ctdbuser
db.password=ctdbpasswd

Remember to set the priviledges on this hg.conf file at 600:

chmod 600 $HOME/.ct.hg.conf

To enable the use of this file for subsequent command line operations, set the environment variable HGDB_CONF to point to this file, for example in the bash shell:

export HGDB_CONF=$HOME/.ct.hg.conf

With that in place, you can examine the contents of the customTrash database:

hgsql -e "show tables;" customTrash

This unique hg.conf file will also be used by the cleaner command dbTrash

Cleaner script

The database and the temporary data directory /data/tmp need to be kept clean. This is similar to the current cleaner script you have running on your /trash filesystem. In this case there is a specific source tree utility used to access and clean the database. The temporary data directory /data/tmp would stay clean if each and every loaded custom track was successfully loaded. In the case of badly formatted or illegal data submitted for the custom track, the database loaders do not remove their temporary files from /data/tmp This /data/tmp directory can be kept clean with, for example, an hourly cron job that performs:

find /data/tmp -type f -amin +10 -exec rm -f {} \;

This would remove any file not accessed in the past 10 minutes.

The database cleaner command dbTrash should be run as a cron job encapsulated in a shell script something like this, which maintains a record of items cleaned to enable later analysis of custom track database usage statistics:

#!/bin/sh

DS=`date "+%Y-%m-%d"`
YYYY=`date "+%Y"`
MM=`date "+%m"`
export DS YYYY MM

mkdir -p /data/trashLog/ctdbhost/${YYYY}/${MM}
RESULT="/data/trashLog/ctdbhost/${YYYY}/${MM}/${DS}.txt"
export RESULT

/cluster/bin/x86_64/dbTrash -age=72 -drop -verbose=2 \
     > ${RESULT} 2>&1

Running this once a day will remove any tables not accessed within the past 72 hours. The dbTrash command is found in the source tree in kent/src/hg/dbTrash


The /trash directory can be kept clean with the following two commands, one to implement an 8 hour expiration time on most files, the second to implement a 72 hour expiration time on custom track files:

find /trash \! \( -regex "/trash/ct/.*" -or -regex "/trash/hgSs/.*" \) \
    -type f -amin +480 -exec rm -f {} \;
find /trash \( -regex "/trash/ct/.*" -or -regex "/trash/hgSs/.*" \) \
    -type f -amin +4320 -exec rm -f {} \;

metaInfo and history

You will note several special and persistent tables in the customTrash database: metaInfo and history. The metaInfo table records a time of last use for each custom track table and a useCount for statistics. The time of last use is used by the cleaner utility dbTrash to expire older tables. The history table is the same as the history table in the normal assembly databases. The loader commands, hgLoadBed, hgLoadWiggle and hgLoadMaf record into the history table each time they load a track. The cleaner command dbTrash also records in the history table statistics about what it is removing. Which begs the question, so who cleans the history table ? We will leave that as an exercise to the reader, your choice.

Turning On Considerations

Please note, if there are currently existing custom tracks in /trash/ct/ files, at the time of adding the configuration items to /cgi-bin/hg.conf/ those existing tracks will be converted to database versions upon their next use by the user. At the time the hg.conf file is changed, these conversions will take place. Don't do any fancy tricks with the date/time stamp on the hg.conf file, it needs to be the current date/time at the moment of edit. We switched our Round Robin WEB servers with an 'at' job so they would all have the same date/time.

Use of trash files with the database on

When the custom tracks database is in use, there are still small files kept in /trash/ct/ which become the reference pointers to the actual database tables belonging to that custom track. The standard trash cleaner script should still be kept running to clean these files.

Known difficulties

For the case of a custom track submission that contains more than one track set of data, in the case where one of the sets of data is illegal and causes a loading problem, even though some sets of data may have loaded successfully, the submitting user will see an error about the corrupted data, and they would need to correct their data submission to get all tracks successfully loaded.

Also, with existing tracks in place, attempting to load another track that has a load error will cause the loss of all existing tracks. It would be good to see if this could be fixed.

MySQL >5.5 / Ubuntu 12.04

If using newer versions of MySQL (>5.5), the LOAD DATA LOCAL command is disabled for security reasons. This is required for using custom tracks. You will have to compile MySQL from source (using cmake) with "-DENABLE_LOCAL_INFILE=ON" (or ../configure --enable-local-infile if using make).

Here are specific instructions for how to compile MySQL (5.5) on newer versions of Ubuntu (such as 12.04):

Download source, install dependencies

$ sudo -s
# apt-get source mysql-5.5
# apt-get build-dep mysql-5.5
# cd mysql-5.5

You must then edit the file debian/rules. Find the line that says

	    cmake -DCMAKE_INSTALL_PREFIX=/usr \

and add another line so that it reads

	    cmake -DCMAKE_INSTALL_PREFIX=/usr \
	          -DENABLED_LOCAL_INFILE=ON \

You can then save that file and (inside the mysql-5.5 directory) run

# env DEB_BUILD_OPTIONS="nocheck" fakeroot debian/rules binary

which will create a bunch of .deb files in the parent directory. If you leave off "nocheck", it will run a lot of tests and take a long time to compile (and may fail).

To install the packages and mark them as non-upgradeable, do

# cd ..
# dpkg -i *mysql*.deb
# for i in *mysql*.deb; do echo "$i hold" | dpkg --set-selections ; done

You will also need to re-compile/install the genome browser afterwards.