The APPLES Manual
APPLES: Analysis of Plant Promoter-Linked Elements
(c) copyright University of Warwick 2012
Contents:
1. Introduction
2. Getting Started with APPLES
3. Installation
3.1. Installation on the wsbc or IBM cluster
3.2. Installation on a Windows PC
4. APPLES Design Principles
5. APPLES Habits
6. APPLES coding notes
6.1. Location of ‘use’ statements
7. The standard APPLES prelude explained
8. APPLES Scheduler, Cache and Job Handling
8.1. APPLES Running Modes
8.2. Parallelising in APPLES
8.3. Adding a Method for Another Binary to the Job_Handler
8.4. Configuration of Caching and Scheduler (=Listeners)
8.5. Starting and Stopping Listeners
8.6. Diagnosing problems with the listeners
9. Perseverance
10. Sample Script Files
11. APPLES Glossary of Terms
12. Using a Development Environment to Develop APPLES
12.1. Introduction
12.2. Setting up
12.3. Running code
13. Miscellaneous
1. Introduction
This is the documentation for the APPLES software, intended for programmers
(developers) and users of the software. It is in the first instance written for users at
Warwick and therefore some of the installation instructions are written specifically for
two systems at Warwick where APPLES has been installed.
2. Getting Started with APPLES
This is the documentation for the initial installation of the APPLES distribution on the
wsbc or IBM cluster, or standalone on a PC.
The basic requirements for running APPLES are:
- Perl 5.10.0 (or later)
- Moose (http://www.iinteractive.com/moose/) (currently version 1.07)
- MooseX::Declare (http://www.iinteractive.com/moose/) (currently version 0.33)
- DBI
- DBD::mysql (plus MySQL client)
- SOAP::Lite
- BioPerl 1.6.0
- Ensembl modules and Ensembl Compara modules
- Statistics::R
- MEME
- R package heR.Misc.R
- NCBI BLAST
- Chart::Gnuplot version 0.14 and Gnuplot 4.4.0
- library “graph” for R
- DBIx::Simple
- Config::General
- GD
- Log::Log4perl
- Text::CSV
APPLES can be run on the wsbc cluster or the IBM cluster provided by CSC, or standalone on a Windows PC. In principle it should also be possible to run it standalone on an Apple PC, although this configuration has not been tested.
Some APPLES subroutines are written in C; the binaries are placed in common directories on the WSBC cluster and the IBM cluster. See the sample configuration files in the repository for the standard locations. Updated versions of the binaries must have a postfix of the APPLES revision in which they are first used, so that every APPLES revision can call its compatible binaries.
3. Installation
If running on the wsbc or IBM cluster, most dependencies are already installed, so the
simpler installation is covered in the first section.
If running on a Windows PC (or Apple PC) slightly more may have to be installed, and
there is some additional configuration to allow the PC to communicate with the central
cache database. This is covered in the second section.
3.1. Installation on the wsbc or IBM cluster
1. Perl 5.10:
APPLES requires Perl 5.10.0 or greater in order to run. (The IBM and wsbc servers use
5.8.6 by default, but we have added separate installations on both systems).
In order to use Perl 5.10.0:
Using the wsbc server, start your scripts with:
/common/perl-5.10.0/bin/perl
(instead of just ‘perl’), or put the /common/perl-5.10.0/bin directory in your path.
Using the IBM server (Francesca):
To use the new version of Perl you need the Perl 5.10.0 environment module loaded
(using the command line):
module load perl/5.10.0
which brings the perl 5.10.0 binary into your path.
(The actual path is: /software/perl/5.10.0/bin/perl).
On Francesca you also need to load other modules such as meme or gnuplot. After you
have checked out the code for APPLES (see below) you can change directory into
trunk/shell_scripts and type the following to load all modules for APPLES at once
(including Perl 5.10.0):
source load_modules_francesca
2. Installation directory
On the WSBC cluster you can do the installation within your home folder. On the IBM cluster, however, you need to do the installation not in /home/sysbio/…, but in /gpfs/sysbio/…, as only the latter directory is accessible from the compute nodes. Note that this directory is not backed up, so if you develop for a long time without committing code to the repository you must make back-up copies.
3. Ensembl modules
Some functions require the Ensembl API modules and Ensembl Compara API modules. These can be downloaded and installed using the following instructions (adapted from http://www.ensembl.org/info/docs/api/api_cvs.html; note that installation of BioPerl 1.2.3 is not necessary on the wsbc or IBM clusters, as it is already installed). For non-CVS installation methods, follow the instructions at http://www.ensembl.org/info/docs/api/api_installation.html.
CVS access is recommended, as it makes it easier to change between versions.
Create an installation directory in /gpfs/sysbio/username and go there, for example:
$ mkdir src
$ cd src
This will create a directory “src” and change into it.
Log into the Ensembl CVS server at Sanger (using a password of CVSUSER):
$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl login
Logging in to :pserver:cvsuser@cvs.sanger.ac.uk:2401/cvsroot/ensembl
CVS password: CVSUSER
PLEASE NOTE: At this point, the software version you download must match the Ensembl database version you wish to use. For example, the following was taken from “sample user scripts/core_promoters.pl”:
# Establish database parameters for core species
my $genome_sequence_database_parameters = Ensembl_Database_Parameters->new(
    alias => 'arabidopsis',
    location => 'ensemblgenomes',
    dbname => 'arabidopsis_thaliana_core_3_55_9');
For the Ensembl version to be consistent with the above, install the Ensembl Core Perl API using the following command:
$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl checkout -r
branch-ensembl-55 ensembl
Install the matching Ensembl Compara Perl API using the following command:
$ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl checkout -r
branch-ensembl-55 ensembl-compara
In the APPLES.dat file, put the correct path to your installation of the ensembl APIs in
“ensembl =” and “ensembl_compara =”. These will be read from the config file when
your user script starts up, pushing the paths into @INC and setting corresponding
environmental variables. (NOTE: it is NOT necessary to follow the environment set-up
instructions given in the website version of the Ensembl installation instructions because
the environmental variables necessary are set up at the beginning of an APPLES user
script.)
4. APPLES modules
Download a working copy of APPLES/trunk from the software repository. This uses the
svn+ssh protocol for authentication, using your server login account. For example:
> svn co svn+ssh://username@home.wsbc.warwick.ac.uk/Groups/TRegS/software/APPLES/trunk
These accounts use public/private keys that you have set up previously (see Paul Brown if
you haven’t done this). You will also be prompted to enter your server password.
Also, in your .ssh directory on the server there needs to be a file called ‘config’, which contains the following text. If there is one already then make sure it contains the ~/.ssh/home line:
IdentityFile ~/.ssh/id_dsa
IdentityFile ~/.ssh/home
This lists all of the private keys that are used to communicate with other servers. The ~/.ssh/home entry points to the file needed to communicate with the home server.
Create a file called home in the .ssh directory, and paste in your private key (id_rsa) which you use to access the home server (e.g. located at .ssh/systems/id_rsa on my mac). It should look something like:
-----BEGIN RSA PRIVATE KEY-----
MIIEoAIBAAKCAQEAu75Xa7rWawaXW0p6o9+R921KVMsprECh6DEMYMYFlGtX7Wz
...
t3y25CuJ6p5JDnYHjZNR9f+UjaWe0cYc3B3msAIbDhsN+Mud
-----END RSA PRIVATE KEY-----
This file should only have user read/write access (chmod 600 .ssh/home).
Change the permissions on your working copy directory to allow read and execute access to the group, so that jobs can be run by the Scheduler processes, which may be owned by another user (i.e. chmod 755 my_working_copy). Also check that the same permissions are given for the Ensembl modules if you are using a set of modules you have downloaded.
On both the WSBC-cluster and the IBM-cluster the recommended setting is “chmod 755
…”. Make sure your home-directory itself has the same privileges, too. At least on the
IBM-cluster the default setting seems to be that the user’s home-directory is inaccessible
to all other users.
Furthermore, ask Paul Brown to make you a member of the group “APPLES” on the
WSBC cluster – this is the group of all APPLES developers. This membership is
necessary as some developers may have restricted read-access to their scripts to this group
only, so if you want to be able to run their scripts you need to be a member of the group.
More importantly, if you were the last person to start the listeners (see below, Chapter
“APPLES Scheduler, Cache and Job Handling”) and you are not a member of this group,
then some developers will not be able to run jobs through this cluster.
If you want to restrict read-access to your scripts to the APPLES developers group (on the
WSBC machine), then you can ask Paul to make the APPLES-group your primary group
(in which case you can just use chmod to do so) or ask Paul for other ways.
5. Configuration files
Three configuration files are needed, make a copy of these files in a new directory and
amend these as appropriate (look inside the files). If you call the directory where you keep
your configuration files “configuration_files” and place this directory in “trunk”, then you
have your files at the same relative path as others. In this case you can run the sample user
scripts without any amendments.
The three configuration files are:
APPLES.dat
cache_main.dat
Job_Handler_Config.dat
On the IBM cluster, start from a copy of IBM_APPLES.dat and rename it to APPLES.dat. Similarly, start from a copy of IBM_Job_Handler_Config.dat instead of Job_Handler_Config.dat, but keep the file name Job_Handler_Config.dat.
Some details on the meaning of configuration parameters can be found in Section 8.4.
6. Extra Perl Modules on the CSC Cluster
If you are using the CSC cluster there are extra modules installed in /gpfs/smsdad/apples/lib. In order to access them do the following:
1. Log in to the CSC cluster
2. Type:
echo 'eval $(perl -I/gpfs/sysbio/smsdad/apples/lib/perl5 -Mlocal::lib=/gpfs/sysbio/smsdad/apples)' >> /gpfs/sysbio/<user name>/.bashrc
N.B. Replace <user name> with your actual user name
3. If you carry on working in the same session, also type: . ~/.bashrc
4. Next time you log in, your environment will be correctly set up.
Now you should be ready to use APPLES!
3.2. Installation on a Windows PC
Installation on a Windows PC is much the same as on the server, except that a significant amount of software that is preinstalled on the server is likely to have to be installed on the PC. In addition, there are some Windows-specific configuration settings that will be required for APPLES. More details can be found at:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/2007/nigel_dyer/phd/software/running_apples_on_windows
4. APPLES Design Principles
These Design Principles should be adhered to at all times when adding or modifying
anything within the APPLES software distribution. They are intended to keep the
development of the software clean and consistent. Please read these carefully and ask if
anything is unclear.
1). Clean termination: if an APPLES method results in regular termination, this behaviour
must only be determined by its parameters. These can only be the direct parameters
passed to the method, or properties of the object that the method belongs to.
2). Strict data typing: the input data types of all parameters passed to an APPLES method
must be strictly defined. This is enforced by the statement "use MooseX::Declare;" at the
beginning of all classes. A (single, invariant) output type for each function should be
provided. It's not currently possible to syntactically enforce this in Moose, but it should be
specified in a comment to the function for clarity.
3). Die statements: any APPLES method that is not yet fully implemented must exit
cleanly (if called) through the use of a statement such as "die 'method not implemented
yet!';" put at the beginning of the function.
4). Global variables: the use of global variables is limited to these cases only:
(a). Running mode parameter
(b). Verbosity
(c). Perseverance
(d). Memory cache
These variables do not affect the results of any computation.
5). Parameterisation: We distinguish parameters and configuration settings. The former
can affect the result of a method so that it terminates regularly but returns a different
result. The latter will either cause the method to throw an exception if the configuration
setting (such as a path to a binary) is incorrect, or, if set correctly, will allow the method
to function normally. Any parameter must be treated as such, by being explicitly defined
by the user and passed through the code. This is important for the caching solution to
work. Redundant parameters should be avoided.
6). Running mode: this can be "normal" (serial), "preparation" (setting up parallel jobs),
“statistics” or "statistics_and_retrieval" (to evaluate if a function call may result in heavy
computations). Some functions are dependent on the chosen running mode parameter, but
this does not affect what the function returns in case of regular termination. See section
8.1 for more details.
7). Randomisation: Any APPLES method involving randomisation (e.g. random_genomic_interval_picker, which uses the 'rand' function) must take a user-defined seed as a parameter, so that the consistency of the random result can be controlled. This ensures the user is aware of any randomised procedures, can control their seeds, and (if using seeds repeatedly) can apply our caching mechanism appropriately.
8). Public vs Private methods: public methods (ones which can be called legitimately by
the user) are distinguished from private methods (that are only to be used internally by the
object, $self) syntactically using the prefix "private_" for a private method.
9). Public vs Private attributes: to protect attributes, where possible set them to be read-only (is => 'ro'). If an attribute is required to be private, you can override the implicit read/write methods by stating an alternative named method for the writer and/or reader, e.g.:
has 'id' => (isa => 'Str', writer => 'private_write_id', reader => 'private_read_id');
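A minimal class combining principles 8 and 9 might look as follows (the class, attribute and method names here are hypothetical, not part of APPLES):

```perl
use MooseX::Declare;

class Example_Widget {
    # Read-only public attribute: users can read it, but not overwrite it
    has 'name' => (isa => 'Str', is => 'ro', required => 1);

    # Private attribute: the accessor names carry the "private_" prefix, so
    # it is clear they must only be used internally (via $self)
    has 'id' => (isa => 'Str',
                 writer => 'private_write_id',
                 reader => 'private_read_id');

    # Public method, callable by users of the class
    method describe {
        return $self->name;
    } # describe #

    # Private method, marked syntactically by the "private_" prefix
    method private_reset_id {
        $self->private_write_id('');
    } # private_reset_id #
} # Example_Widget #
```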
10). Instantiation: the method 'new' is only to be used if there is no associated "maker"
class to use.
11). Object production vs object rendering: output of a computation should not go directly
into a file, but first into an object which is passed back by a function, so that it is
cacheable. Subsequently generating a file to record that object (or aspects of it) should be
done as a (computationally light) method of that object.
12). Setting object attributes: always set object attributes using the writer method as in
$gi->five_prime_pos(13);
and do NOT use assignments as in
$gi->five_prime_pos = 13;
The latter way of setting an attribute does not lead to processing of triggers, predicates,
and the like.
13). Exception handling: never use the “exit” statement, but only “die” instructions, as the former disables exception handling. If you catch an exception then make sure to re-throw it for all those types of exceptions for which you do not provide treatment. Also do not use $@ after calls into Moose objects (see the explanation under “bad example” below).
A good example:
method private_check_existence_of_neighbouring_gene(Reg_Loc $reg_loc,
    Boolean $ignore_pseudogenes, GenomicDirection $direction) {
    my $result;
    eval {
        $self->get_distance_to_neighbouring_gene($reg_loc, $ignore_pseudogenes, $direction);
    };
    if ($@) {
        my $exception_content = $@;
        if (!$exception_content->can('isa')) {
            die $exception_content; # if exception can't 'isa', re-throw the exception
        }
        else {
            if ($exception_content->isa('No_Neighbouring_Gene_Exception')) {
                $GU->user_info(3,"encountered gene without a neighbour.\n");
                $result = FALSE;
            }
            else {
                die $exception_content;
            }
        }
    }
    else {
        $result = TRUE;
    }
    return $result;
} # private_check_existence_of_neighbouring_gene #
We normally deal with the negative case of !$exception_content->can('isa') first, as there was a case where this order was found to avoid a problem.
A bad example:
There is a big problem with this example; do not copy it. The problem: when the call into $GU is made, $@ becomes undefined. So the call into merge_object, even though syntactically correct, will fail (note: the method merge_object has been abolished since this paragraph was written). We have not fully investigated why the call into $GU has this bad side effect on $@. This type of problem can be confusing, as the user will only see the error message resulting from the problem within the exception-handling part; the original error cannot be seen.
5. APPLES Habits
These APPLES habits are intended to indicate the required ‘look and feel’ of the APPLES
code, and should be implemented when adding or modifying any of the software. They
also govern what state the code should be in before a new “commit” to the repository
occurs. Please read these carefully and ask if anything is unclear.
1. Class names format: Upper_Case_With_Underscores
2. Method names format: lower_case_with_underscores
3. Datatype format: UpperCaseNoUnderscores
4. Copyright note: see document "copyright note.pdf"
5. Method/class end labels: add the name of the method (or class) in a comment after the
closing bracket of the method (or class) as in:
} # get_distance_to_neighbouring_gene #
6. Method order: list public methods first, then private methods. This makes it easy for users of a class to get an overview of the available functionality. Some (big) classes may have further ordering; for example, the Genome_DB_Utilities private methods are sorted by: generic private, Ensembl-specific private, Genbank-specific private, ...
7. Print statements: do not use "print". Instead, create a General_Utilities object and use the method user_info. This takes 2 parameters: (i) an integer (1, 2, 3) to designate the print level, (ii) the string to print. Global print levels: 0 = silent; 1 = warnings; 2 = warnings + progress; 3 = warnings + progress + debugging.
For example:
my $GU = General_Utilities->new();
$GU->user_info(2,"COMPARING GI ".$counter." of ".$todo."\n" );
8. Volatile code (code that is written for testing only, and may create unreliable results): use a # VOLATILE (start/end) temporary mark for line(s) of code in your working copy that are to be removed/reverted to a 'safe' form before committing the code. Code containing this mark should NOT be committed to the repository. Any results obtained with code containing these comments should NOT be cached. The mark is necessary whenever you add volatile code, as a safeguard; the mark must match the search term "VOLATILE" (in capitals).
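For instance, a temporary override might be marked like this (the variable here is made up for illustration):

```perl
# VOLATILE (start)
my $number_of_genes = 2;   # temporarily restrict the run while debugging
# VOLATILE (end)
```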
9. Every Genome_Sequence_Database_Parameters instance should have a unique
'dbname' attribute. (Important for proper caching).
10. Changes to method names or datatypes: whenever altering a datatype or method name,
it is important to find and replace all occurrences in the code before committing to the
repository.
11. SVN commit: before committing code to the repository do an 'svn update' first to sort
out any conflicts locally and test that all changes put together make a working APPLES
version.
There is a set of regression tests in the sample_user_scripts_regression_test directory. These can be run using the ‘run_all_perl_tests.pl’ script, also in this directory, which runs each of the other Perl scripts in that directory in turn, gathering the exit status of each script and giving a summary of which scripts ran successfully and which failed. If all tests are successful, the user is advised it is safe to commit their code; if one or more tests failed, the user is advised to check their code.
12. System calls: check that the return code is zero. If it is not zero, one should normally throw an exception. Perl does not do this automatically, so if you do not check the return code then your script may run on even after the system call failed.
Example:
system($cmd) == 0 or die "System error!";
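If more detail is wanted in the error message, the sketch below (a suggestion, not a fixed APPLES convention) also reports the exit code, which Perl stores in the high byte of $?:

```perl
# Run an external command and die with the command and its exit code if it
# did not succeed. $? holds the raw wait status; shifting right by 8 bits
# extracts the exit code of the child process.
my $cmd = 'ls /tmp';   # hypothetical command
system($cmd) == 0
    or die "System call '$cmd' failed with exit code " . ($? >> 8);
```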
13. code formatting: We would like to give the APPLES code a consistent appearance
which should help when working with code from different modules.
a) for if-statements and the like we open the bracket in the same line. So, for example, we
make it
if () {
...
}
else {
...
}
and NOT
if ()
{
...
}
else
{
...
}
b) we avoid redundant spaces in method calls, so we make it
$self->do_something();
and NOT
$self -> do_something () ;
6. APPLES coding notes
The following notes provide more details on some other aspects of the coding of APPLES modules.
6.1. Location of ‘use’ statements
Use statements are positioned within the classes, except for "use MooseX::Declare;", which is at the top of the file. This convention is not always followed elsewhere, but is described in the MooseX documentation itself:
http://search.cpan.org/~flora/MooseX-Declare-0.33/lib/MooseX/Declare.pm#NOTE_ON_IMPORTS
7. The standard APPLES prelude explained
When using APPLES to do data analysis one normally sets up a “user script” which sets
relevant parameters, makes calls into APPLES classes, and triggers output functions to
render results as objects, text, files, pictures, etc. The example below shows the standard
“prelude” for any APPLES user script.
The BEGIN statements are used to call the method load_includes, which sets environmental variables and the Perl variable @INC, and makes some other settings. Using this method we can run APPLES scripts without having to configure the environment; we only need to set up the APPLES configuration files. Another advantage of this approach is that all APPLES developers can run scripts from others using their own settings, which is very useful when tracing bugs revealed by others. We use relative paths (see screenshot) to define the location of Main and APPLES.dat so that everybody can run this script in their own installation and with their own settings without having to change the code.
As explained in the design principles, only a small set of global variables is allowed in APPLES. Only specific classes are allowed to read these variables (and, therefore, are allowed to depend on them), and users should never alter them after the prelude. These globals do not affect the result of computations, but merely how these are organised.
The first of these ($global_print_level) has to be set in the BEGIN-statement as
load_includes depends on it. This variable is explained in the APPLES habits.
The includes for APPLES_library and APPLES_dependencies are there to include
anything that may be needed in APPLES user scripts. This is so that the script can be as
short as possible. APPLES_library is for APPLES components, APPLES_dependencies is
for non-APPLES-components.
The globals for perseverance and running mode are explained in following sections.
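A sketch of such a prelude is given below. Everything beyond $global_print_level, APPLES_library and APPLES_dependencies (which are named in this manual) is an assumption; copy a current sample user script for the real mechanics.

```perl
#!/usr/bin/perl
use strict;

BEGIN {
    # $global_print_level must be set here, because load_includes depends on it
    our $global_print_level = 1;

    # Assumed mechanics: locate the APPLES Main directory via a relative path
    # and call load_includes with the path to APPLES.dat, which pushes the
    # required paths into @INC and sets environmental variables. The file and
    # function names below are guesses, not the real APPLES identifiers.
    unshift(@INC, '../Main');
    require 'load_includes.pl';
    load_includes('../configuration_files/APPLES.dat');
}

use APPLES_library;        # all APPLES components
use APPLES_dependencies;   # all non-APPLES components
```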
8. APPLES Scheduler, Cache and Job Handling
When analysing genomic sequences and related data we frequently encounter
computationally heavy tasks. Therefore, we need to be able to organise parallel
computations. We have decided not to separate “production scripts”, which organise computations, from “analysis scripts”, which make use of large-scale results to extract relevant features. This is to a) avoid duplicating code, b) be able to flexibly change the level at which we parallelise, and c) maintain a full overview of all parameters.
Furthermore, we want to be able to store results of heavy computations without having to organise this manually, to have the software automatically retrieve results for computations that have been done before (without the user needing to know these exist), and to share computational results with others (within Warwick as well as across institutions). To make it likely that different users use the exact same parameters for a given task, we define standard values for every parameter in APPLES, so users only modify these when they have a particular reason.
The main APPLES components to implement these features are the Job_Handler and the
Cache-module. The Scheduler is a component that runs underneath the Job_Handler to
serve its requests for jobs to be run and to provide an interface to a cluster machine’s
queueing system. The caching also runs underneath the Job_Handler so normally users
only need to call the Job_Handler as needed. It will automatically avoid re-running
computations done previously and allows switching between different modes of running
(see below).
The Job_Handler can handle two kinds of jobs:
1). Binaries, such as alignments using the Ott algorithm or the Seaweed algorithm (“handle_alignment_job” method)
2). APPLES functions, run as Perl scripts (“handle_APPLES_function” method)
A main feature provided by the Job_Handler is the ability to switch between serial processing (running mode “normal”) and parallel processing (running mode “preparation”). When parallel processing is requested, the Job_Handler is forced to return control back to the calling script without being able to actually return the result of the computation (the computation job will be sent to the queue and the Job_Handler cannot wait for it to complete). This gives rise to “job information exceptions”, a type of APPLES exception that the Job_Handler uses to tell the calling script that the results could not be obtained (yet). These exceptions can then be caught and dealt with appropriately. For example, if heavy computations for a number of genes are requested, and these are independent among the genes, then one would want to catch the exception every time and call the Job_Handler for the next gene. Details on this type of exception are discussed in the section “Job Information Exception Handling”. Essentially this means that the programmer needs to hard-wire their decisions as to which function calls can be done independently in parallel.
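The gene-by-gene pattern described above can be sketched as follows. The exception class names are taken from this manual; the surrounding variables and the helper compute_result_for_gene are hypothetical.

```perl
# One result per gene; computations for different genes are independent.
my @results;
my $incomplete = 0;
foreach my $gene (@genes) {
    eval {
        # hypothetical call that may throw a job information exception
        push(@results, compute_result_for_gene($gene));
    };
    if ($@) {
        my $exception = $@;
        if (ref($exception)
            and ($exception->isa('Job_Information_Exception')
                 or $exception->isa('Aggregate_Exception'))) {
            # job(s) have been submitted to the queue;
            # carry on requesting jobs for the remaining genes
            $incomplete = 1;
        }
        else {
            die $exception;   # re-throw any exception we do not treat
        }
    }
}
if ($incomplete) {
    # the full set cannot be completed yet: report and stop
    print_statistics_and_die();   # named in the text; signature assumed
}
```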
8.1. APPLES Running Modes
This is a global parameter, set by the user in their script. The options are:
• “statistics”
• “statistics_and_retrieval”
• “preparation”
• “normal”
Statistics Mode: Statistics mode will not return any results for a request. However (if Job_Information_Exception handling has been correctly implemented for the job(s) you are requesting), it will return information about the number of jobs required to fulfil your request (how many have been computed, and how many will require computation). This mode can therefore be used to gauge how long your request will take. This is important for diagnosing the amount of computation when the CPU-time requirement is potentially substantial and you are not clear about how much of it may already be in the cache.
Statistics_and_Retrieval Mode: This is like statistics mode but the Job_Handler will
return results from the cache when available. So if all results needed exist already then in
this mode your script will run through just as it would do in normal mode.
Preparation Mode: This mode allows for parallel computations. If a result is available,
it is returned to you. If a job result is unavailable, it is automatically submitted for
computation. Your script will terminate once all the requests have been dealt with. You
may therefore not have received a full set of results, and will need to wait until all the jobs
have been computed before submitting the request again, at which time all the results will
be returned to you.
Normal Mode: This mode allows only for serial computations. As with Preparation
Mode, if a result is available it is returned to you. If a result is unavailable, it is computed
and your script will wait while this is done before moving on to the next part of your
script. Therefore it is only recommended for a small number of short jobs or where the
results already exist and you are just retrieving them from the cache.
8.2.
Parallelising in APPLES
The following screenshot shows the method rere_set_maker, which illustrates all aspects of handling job information exceptions. Typically this method is called to create ReRes for a set of genes. Computations for each gene can be performed independently, but the final ReRe_Set can only be completed once results for all genes are available. In the case of a Job_Information_Exception, a single job has been dealt with by the Job_Handler; in the case of an Aggregate_Exception, some other layer in the code has already received one or more Job_Information_Exceptions from the Job_Handler and merged these. In either case, receiving these exceptions in this method means that ReRes will not be available for all Reg_Locs. So jobs for all ReRes can be requested and added to the queue, but the ReRe_Set cannot be completed and, therefore, the method will call print_statistics_and_die instead of returning. The print_statistics_and_die function will print an overview of job types, job numbers, and overall CPU-time estimates before calling the die-method.
A method called “standard_exception_handling_for_parallelisation” has been added to
General_Utilities. This can be used to shorten code such as the code shown above. For an
example see the code of Conservation_Profiles_Maker. Use this function where
applicable to avoid code duplication.
The example user script “how_to_engage_job_handler.pl” demonstrates the correct
implementation of calling an APPLES function via a call to the Job_Handler.
8.3. Adding a Method for Another Binary to the Job_Handler
Binaries can be run in parallel and with caching in two ways: either directly by a method of the Job_Handler (as is done for the seaweed algorithm, for example), or from a method in another class which is then called through handle_APPLES_function. We do the latter when dealing with MEME jobs. The advantage of the former approach is that Perl does not need to be invoked when the job is run on the machine, so if a binary is to be called for a large number of small instances then this approach is preferable.
When adding a method for treating a binary to the Job_Handler the following points must
be taken into account:
1) A wrapper has to be written; this should be done outside the Job_Handler. The
wrapper will involve a) generating the command line, b) generating any input files
for the binary, and c) providing a parser for output files that returns results as an
APPLES object.
2) The method in the Job_Handler must produce a “keyfile”. The keyfile is a file that
contains all information that defines one instance of running the binary. This
includes all parameters and relevant data fed into the binary via input files. It only
includes information that can make a difference to the output of the binary, so, for
example, it will not include the name of the temporary directory that results are
written to. The keyfile must be turned into an MD5 sum which is then used to
identify the instance. (see handle_alignment_job for an example)
3) All Job_Handler modes described above must be supported.
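Point 2 above can be illustrated with core-module Perl (Digest::MD5). The helper below is a hypothetical illustration of the keyfile-to-MD5 idea, not the real keyfile code (see handle_alignment_job for that); the parameter names are made up.

```perl
#!/usr/bin/perl
# Hedged sketch: build a keyfile string from everything that can change the
# binary's output (parameters plus input data, but NOT incidental details
# like temp-directory names), then take its MD5 sum as the instance id.
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

sub make_key {
    my (%params) = @_;
    # Sort keys so the same inputs always produce the same keyfile text.
    my $keyfile = join "\n", map { "$_=$params{$_}" } sort keys %params;
    return md5_hex($keyfile);   # 32-hex-digit instance identifier
}

my $id = make_key(binary => 'seaweed', seq => 'ACGT', windowsize => 60);
print "instance id: $id\n";

# Same inputs, different argument order => same id, so a cached result
# for this instance can always be found again:
my $id2 = make_key(windowsize => 60, seq => 'ACGT', binary => 'seaweed');
print "deterministic: ", ($id eq $id2 ? "yes" : "no"), "\n";
```

Because the id depends only on output-relevant information, two requests for the same computation map to the same cache entry regardless of where or when they run.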
8.4. Configuration of Caching and Scheduler (=Listeners)
Caching
By default, all computations that are deemed to be successful are cached (currently in the apples_cache_dev MySQL database). For APPLES functions (handled as Perl scripts), caching can be turned off by passing $cache = FALSE to the handle_APPLES_function method.
Configuration
The central configuration file for APPLES is APPLES.dat. The following information is
required:
username = # your username on the server
password = # your password for the BiFa server, if that functionality is required
job_handler_config = # full path to Job_Handler_Config.dat file
cache_config = # full path to cache_main.dat
application = APPLES
queue = /cluster/data/webservices/apples/queue_1 # full path to queue directory
queue_id = 1
cluster_dir = /common/sge/bin/Darwin # directory path to qsub command - deprecated?
grid = 1 # 0 means FALSE, 1 means TRUE. On a cluster, use grid=1
job_parameters_file = job_parameters.txt # filename for file containing job parameters to be written
ensembl = /cluster/laurabaxter/ensembl/modules/ # ensembl installation
ensembl_compara = /cluster/laurabaxter/ensembl-compara/modules/
APPLES_main = /cluster/laurabaxter/APPLES_WORKING_COPY/Main/ # location of APPLES code
BiFa_Server_Name = wsbc.warwick.ac.uk
BiFa_Server_Port = 4338
input_dir = /cluster/data/webservices/apples/input/ # location of input directory for job handling
blast_db = // # location of local BLAST databases
data_dumps = /cluster/data/webservices/apples/data_dumps/
path_to_perl = /common/perl-5.10.0/bin/perl
job_check_delay = 1 # seconds to wait between checking for queued jobs, leave as 1
grid_type = SGE # SGE on wsbc cluster, PBS on IBM cluster
path_to_APPLES_C_binaries = /common/bin/apples_binaries/ # this is where we put C binaries belonging to the APPLES package
scheduler_cache_name = cache_save # to be documented
APPLES_management_home = /cluster/data/webservices/apples
transfac_to_uniprot_map = /cluster/howard/Public/transfacIDs_to_UniprotIDs_ONDEX.txt
path_to_APPLES_R_programs = /cluster/laurabaxter/APPLES_WORKING_COPY/R_Programs/
Job_Handler_Config.dat:
to be documented
cache_main.dat:
to be documented
Caching:
The user must specify the location of one correctly set up cache database (MySQL). This is used to store results. It is also searched each time a computation is requested, to see if the result already exists in the database, in which case the result is returned without the computation being run.
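The look-up-before-compute behaviour can be sketched as follows. This is an illustration of the idea only, not the APPLES implementation: a plain Perl hash stands in for the MySQL cache database.

```perl
#!/usr/bin/perl
# Hedged sketch of the cache-first flow: search the cache before running a
# computation, store the result on a miss, and never recompute on a hit.
use strict;
use warnings;

my %cache;   # stand-in for the MySQL cache database

sub cached_compute {
    my ($key, $compute) = @_;
    return $cache{$key} if exists $cache{$key};  # result already exists
    return $cache{$key} = $compute->();          # otherwise compute and store
}

my $runs = 0;
my $r1 = cached_compute('jobX', sub { $runs++; return 42 });
my $r2 = cached_compute('jobX', sub { $runs++; return 42 });  # cache hit
print "ran $runs time(s), results: $r1, $r2\n";
```

In APPLES the key is derived from everything that determines the result (compare the keyfile/MD5 mechanism in section 8.3), which is what makes this safe.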
Setting up a scheduler
Configuration:
There are two mode choices:
Grid: if you have qsub available on your system, and wish to use it, configure APPLES.dat with grid=1.
Non-grid: should be used if there is no qsub available on your system; configure APPLES.dat with grid=0.
Set up the correct directory architecture:
The following directories must exist, and their paths specified in APPLES.dat:
queue
input
tempdir
blast_db
data_dumps
These directories should have read/write/execute permissions for the user and group.
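One way to create the directories with the stated permissions is sketched below. The path is an example only; substitute the locations from your own APPLES.dat.

```shell
# Example only: replace apples_example with your real APPLES directory,
# e.g. /cluster/data/webservices/apples
APPLES_HOME=apples_example
for d in queue input tempdir blast_db data_dumps; do
    mkdir -p "$APPLES_HOME/$d"
    chmod ug+rwx "$APPLES_HOME/$d"   # read/write/execute for user and group
done
```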
8.5. Starting and Stopping Listeners:
When listeners do not seem to work it often helps to stop and re-start them. After start-up it can take a minute or two before the listeners start to submit jobs to the system queue if the APPLES queue is quite full (say >10,000 jobs). The listeners will read in all jobs in the queue directory and then process them in alphabetical order. For example, the request listener will check, for each job in the request state, whether it can be added to the queue. After going through all jobs, the listeners re-read the list of current jobs from the queue.
There is a “quick hack” for adding priorities to one’s jobs: you can make your jobs higher or lower priority than others’ jobs, as appropriate. Look for “JOB PRIORITISATION NOTE” in Scheduler.pm to find out about it.
Re-starting Listeners
To (re-)start the listener processes:
1) Log into the head node (not any of the compute nodes). Note, on wsbc, Listeners
should be started and stopped by a global user called applesuser.
Log in to wsbc from your account (login applesuser; the password is ‘apples’).
2) cd into your APPLES_job_management directory (within trunk)
3) see what listeners are running:
ps -A | grep start_listen (on WSBC)
or
ps -AF | grep start_listen (on IBM)
4) (only on IBM) module load perl/5.10.0
5) ./stop_listeners.sh (on WSBC) or ./IBM_stop_listeners.sh (on IBM)
(kills the 5 listener processes)
6) Confirm listeners are no longer running (as in Step 3). If a listener process refuses to be
stopped, kill the process using the process id listed by ps (kill -9 processid).
7) ./start_listeners.sh (on WSBC) or ./IBM_start_listeners.sh (on IBM)
8) Confirm all 5 listeners are running (as in Step 3). On the wsbc system, logout as
applesuser (logout).
This starts 5 Perl scripts which run in the background and update *.log and *.err files for each of the 5 listener types (request, running, complete, error, trash). The log files are in /cluster/data/webservices/apples on the WSBC cluster and in /gpfs/sysbio/smsdad/apples on the IBM cluster.
Note: at present (8/3/10) the listener processes will die if the shell from which you
started the listeners times out. So you must log out of the shell (or exit) yourself
before it times out. This is due to a known problem in the operating system. This
should only apply to the WSBC cluster.
Adjusting number of CPUs left for others:
The listeners have in-built “courtesy”: they will leave a fixed number of CPUs free for others. At times when nobody else is running jobs, these CPUs will be idle. This behaviour is controlled in the start_listeners.sh script by altering lines like this one:
.../common/perl-5.10.0/bin/perl start_listener.pl trash 20 64 10 1...
(leave 10 CPUs free)
to
.../common/perl-5.10.0/bin/perl start_listener.pl trash 20 64 0 1...
(take all CPUs, no courtesy)
Other arguments: (after listener type):
20 = max allowed jobs (i.e. maximum number of CPUs occupied by APPLES jobs)
64 = CPUs available on system
10 = minimum CPUs to leave free
1 = second delay for looping the listeners
Notes:
a) if you choose 0 as the setting for minimum CPUs free, then the listeners will send as
many jobs to the queue as they are allowed by the max allowed jobs parameter. So the
listeners will not wait for idle CPUs in this case.
b) on the IBM cluster we found that we have to set the minimum number of free CPUs to
0 in order to be able to get anything into the queue.
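The interaction of these arguments can be sketched as the following arithmetic. This is a hypothetical illustration of what the parameters imply, not the actual listener logic, which lives in start_listener.pl and may differ in detail.

```perl
#!/usr/bin/perl
# Hedged sketch: how many jobs may a listener put in the system queue,
# given the start_listener.pl arguments described above?
use strict;
use warnings;
use List::Util qw(min);

sub submittable {
    my ($max_allowed, $cpus_total, $cpus_busy_others, $min_free) = @_;
    # CPUs left after honouring the "courtesy" reservation:
    my $idle_after_courtesy = $cpus_total - $cpus_busy_others - $min_free;
    $idle_after_courtesy = 0 if $idle_after_courtesy < 0;
    # The max-allowed-jobs parameter caps submissions regardless of idle CPUs:
    return min($max_allowed, $idle_after_courtesy);
}

# 64 CPUs, 30 busy with others' jobs, leave 10 free, cap at 20 jobs:
print submittable(20, 64, 30, 10), "\n";   # 64-30-10 = 24, capped at 20
# With min_free = 0 only the cap applies (note b above):
print submittable(20, 64, 60, 0), "\n";    # 64-60-0 = 4
```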
Removing jobs from the APPLES queue directory
In case you created jobs accidentally you can delete these from the APPLES queue
directory (location specified in configuration files) by using a command like this one:
find . -user howard -exec rm -r '{}' \;
(make sure to cd into the right directory before doing this!)
8.6. Diagnosing problems with the listeners
After the introduction of the scheduling system there were some problems with the listener processes. These sometimes died and sometimes became unresponsive. This led to jobs not being put into the queue, or results not being picked up and inserted into the cache database.
Jay provided this list of points to check before reporting a listener bug:
1. Are the relevant listeners actually running? Check using (something like) ps -A | grep start_listen (on WSBC) or ps -AF | grep start_listen (on the IBM cluster).
2. What is in the listener stdout and stderr logs for the listener in question? Check in the
files ......
i.e. if the listener stopped is there an error message in .err indicating why it stopped
(note: the .err files are not overwritten after a restart of the listeners, but the listeners will
append future error messages)
3. What else is running on the cluster? Is there time and space for the job? Can you see it
in qstat?
4. If the listener is still running is it doing something else? Are there other jobs in the
same state?
5. If the job seems 'stuck' what are the permissions on all the relevant files, and who owns
the listener processes?
9. Perseverance
As we depend on servers to provide sequences (Ensembl) or weight matrices (BiFa server) etc., we frequently have to deal with instabilities, network outages, and the like. In
some cases we wish for our scripts to deal with the problem by trying several times, in
other cases it is more convenient to just see the crash. Also, we want to be able to adjust
the extent of perseverance, i.e. how many retries, at what time intervals, etc. To have a
general solution to this we have designed the Running_Perseverance class which provides
a way to make a bit of APPLES code perseverant. Every APPLES user script will provide
one Running_Perseverance object as a global parameter. The user can adjust perseverant
behaviour by setting up this object’s attributes. It allows different perseverance behaviour to be set for different unstable components. For example, one could set it up to be perseverant about Ensembl problems, but to crash in case of BiFa problems.
The following code shows an example of how to make a part of APPLES code perseverant. We assume that the error-handling code itself might also fail (when reconnecting to Ensembl), so it needs to be included in the eval statement. The code you have to copy for your applications is shown in red (it is important that you pick up all those lines), comments in green:
my $gene;
my $reconnect = FALSE;
while (!$main::global_perseverance->stop_trying) {
    eval {
        if ($reconnect eq TRUE) {
            # Attempt to refresh the connection with ensembl --- and also the gene_adaptor?
            $GU->user_info(1, "attempting to reconnect to ensembl\n");
            $self->private_reconnect_to_registry($parameters->location, $parameters->dbname);
            # can crash if ensembl is down
            $gene_adaptor = $self->private_get_ensembl_gene_adaptor($parameters);
            $GU->user_info(1, "successful reconnect?\n"); # debugging
            $reconnect = FALSE;
        }
        $gene = $gene_adaptor->fetch_by_stable_id($geneid); # this could crash if Ensembl is down
        …
    }; # end of eval statement
    my $error = $@;
    $main::global_perseverance->decide_on_rerun('Ensembl', TRUE, $error);
    if ($error) {
        # something has gone wrong...
        # specific-case error handling code here:
        $reconnect = TRUE;
    }
} # end of while loop - if there was an error we (may) try again
# and reset the try counter and stop_trying attribute
$main::global_perseverance->stop_trying(FALSE);
10. Sample Script Files
When new functionality is introduced an example script file should be placed in the
‘sample_user_scripts_regression_test’ directory. This serves two functions:
• It provides a guide as to how the functionality can be used.
• It is run as a regression test to ensure that the code still runs when APPLES software is updated. This requirement means that the script should not take too long to run, and should attempt to exercise most of the options associated with the new functionality.
In addition, scripts which demonstrate the use of the functions with large data sets, and which can be used to perform APPLES performance tests, may be placed in the ‘sample_user_scripts_performance’ directory.
There may be other scripts that provide other good examples of using APPLES to perform functions and which do not come into either category. These should be placed in ‘sample_user_scripts_misc’.
11. APPLES Glossary of Terms
The ideas behind a number of key APPLES classes explained:
Genomic_Interval: a genomic interval is a single contiguous piece of biological
sequence.
Genomic_Interval can be any one of 3 defined sequence types (SequenceTypes):
dna, rna or protein.
Genomic_Interval must have a defined genome_db: this is any valid
Genome_Sequence_Database_Parameters object.
This in turn can be either a FASTA_Sequence_Database_Parameters or an
Ensembl_Database_Parameters object. Therefore FASTA files and Ensembl
databases can be used to define genomic intervals.
Genomic_Interval must have a defined region, which is either e.g. a chromosome
number for an Ensembl sequence, or the header of a FASTA sequence in the file.
Genomic_Interval must have a defined start (five_prime_pos) and stop
(three_prime_pos) position and a strand of StrandType (positive or
negative).
A Genomic_Interval can be aggregated with other interval(s) from the same or
different species to form a Genomic_Interval_Set.
Genomic_Interval_Set: a set of genomic intervals. ‘Set’ pertains to the standard
mathematical definition, i.e. zero-to-many, therefore an empty set is valid. The set
members can be from the same or different species.
Reg_Loc: a regulated locus. This is a single point within a genome that is under the influence of transcriptional regulation.
Reg_Loc must be of a defined RegLocLocusType (ensembl_gene or mirna),
for example, a TSS (transcriptional start site) pertaining to an ensembl gene, or a micro
RNA. As with a genomic interval, it must have a specified genome_db, position
and strand.
ReCo: a regulatory complex. Describes the binding of a set of proteins to a set of
genomic intervals to exert some transcriptional regulation effect on a specified regulated
locus.
ReMo: a regulatory module. Derived from a Genomic_Interval, its purpose is to
distinguish regions to be considered as important for binding of a regulatory complex
from any other genomic sequence. A ReMo can be derived in several ways: from a core
promoter; from an orthologous core promoter; from a ChIP seq fragment set; from a
known binding site, etc. ReMos can be combined into a ReMo_Set.
ReMo_Set: a set of ReMos.
BiSi: a binding site. Based on a Genomic_Interval_Set, a BiSi represents the place where a transcription factor binds to the DNA sequence.
ReRe: collection of regulatory regions for one Reg_Loc. A ReRe has a group of
ReMo_Sets. This can include the core promoter, conserved regions (ReMos), regions
identified by ChIP-Seq data, and other regions. The purpose of this class is to be able to
collect all sequence regions for which there is some evidence of a regulatory function that
may influence the given Reg_Loc.
ReRe_Set: a set of ReRes.
12. Using a Development Environment to Develop APPLES
12.1. Introduction
Those who are familiar with development environments such as Eclipse will appreciate the advantages such systems provide for developing and testing software. Because the wsbc server is not based around a GUI environment, being geared up for a console and text-editor environment, it is not possible to use such an environment there.
It is, however, possible to develop and run the software within a GUI-based development environment on a PC or Mac, prior to release for use on the server.
The instructions here are for working on a Windows PC. Eclipse and Perl are both multi-platform, so it should be even easier to set up an equivalent environment on a Mac.
12.2. Setting up
APPLES should first be installed as described in the ‘Getting started with APPLES’
document. It is suggested that the input directory be created as a subdirectory of the
APPLES directory.
The starting point is to set up the Eclipse development environment, together with the
addins for working with Perl. Information on installing this can be found at:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/2007/nigel_dyer/phd/software/installingperl
To add the APPLES files to the workspace, right click in the Package Explorer and select
‘New/Other/Perl/Perl Project’. Set the project name to APPLES and browse for the
directory where APPLES has been placed using svn.
12.3. Running code
The Eclipse documentation and websites should be consulted for running and debugging
code.
One of the features of APPLES is that large chunks of a job are packaged up and run as a
separate APPLES job which makes it difficult to debug the jobs that have been created.
One solution to this is to
a) run the APPLES code through to the point where the sub job as been created,
which involves writing a set of files in the ‘Input’ directory,
b) Right click on the input directory within the APPLES package in the package
explorer and select ‘refresh’. Eclipse is now aware of the newly created
perlscript.pl file associated with this job.
c) Use the ‘debug configurations’ to add a configuration or modify an existing
configuration to run the newly created file ‘perlscript.pl’, file which will be in a
newly created subdirectory of input_dir which can then be debugged.
Note that over time a large number of redundant directories can accumulate in the input_dir directory. These should be deleted in order to simplify, at stage c), the job of finding the newly created file.
13. Miscellaneous
Our MySQL server only accepts a limited number of connections. APPLES
processes often open connections to our local Ensembl databases which are kept
open until the process terminates. The limit is currently set to 500 which means
that no more than 500 APPLES processes using local Ensembl databases can run
in parallel.