The APPLES Manual

APPLES: Analysis of Plant Promoter-Linked Elements
(c) copyright University of Warwick 2012

Contents:
1. Introduction
2. Getting Started with APPLES
3. Installation
   3.1. Installation on the wsbc or IBM cluster
   3.2. Installation on a Windows PC
4. APPLES Design Principles
5. APPLES Habits
6. APPLES coding notes
   6.1. Location of 'use' statements
7. The standard APPLES prelude explained
8. APPLES Scheduler, Cache and Job Handling
   8.1. APPLES Running Modes
   8.2. Parallelising in APPLES
   8.3. Adding a Method for Another Binary to the Job_Handler
   8.4. Configuration of Caching and Scheduler (=Listeners)
   8.5. Starting and Stopping Listeners
   8.6. Diagnosing problems with the listeners
9. Perseverance
10. Sample Script Files
11. APPLES Glossary of Terms
12. Using a Development Environment to Develop APPLES
   12.1. Introduction
   12.2. Setting up
   12.3. Running code
13. Miscellaneous

1. Introduction

This is the documentation for the APPLES software, intended for programmers (developers) and users of the software. It is in the first instance written for users at Warwick, and therefore some of the installation instructions are written specifically for two systems at Warwick where APPLES has been installed.

2. Getting Started with APPLES

This is the documentation for the initial installation of the APPLES distribution on the wsbc or IBM cluster, or standalone on a PC.
The basic requirements for running APPLES are:

- Perl 5.10.0 (or later)
- Moose (http://www.iinteractive.com/moose/) (currently version 1.07)
- MooseX::Declare (http://www.iinteractive.com/moose/) (currently version 0.33)
- DBI
- DBD::mysql (plus MySQL client)
- SOAP::Lite
- BioPerl 1.6.0
- Ensembl modules and Ensembl Compara modules
- Statistics::R
- MEME
- R package heR.Misc.R
- NCBI BLAST
- Chart::Gnuplot version 0.14 and Gnuplot 4.4.0
- library "graph" for R
- DBIx::Simple
- Config::General
- GD
- Log::Log4perl
- Text::CSV

APPLES can be run on the wsbc cluster or the IBM cluster provided by CSC, or standalone on a Windows PC. In principle it should also be possible to run it standalone on an Apple PC, although this configuration has not been tested.

Some APPLES subroutines are written in C; the binaries are placed in common directories on the WSBC cluster and the IBM cluster. See the sample configuration files in the repository for the standard locations. Updated versions of the binaries must have a postfix of the APPLES revision in which they are first used, so that every APPLES revision can call its compatible binaries.

3. Installation

If running on the wsbc or IBM cluster, most dependencies are already installed, so this simpler installation is covered in the first section. If running on a Windows PC (or Apple PC), slightly more may have to be installed, and there is some additional configuration to allow the PC to communicate with the central cache database. This is covered in the second section.

3.1. Installation on the wsbc or IBM cluster

1. Perl 5.10

APPLES requires Perl 5.10.0 or greater in order to run. (The IBM and wsbc servers use 5.8.6 by default, but we have added separate installations on both systems.) In order to use Perl 5.10.0 on the wsbc server, start your scripts with:

  /common/perl-5.10.0/bin/perl

(instead of just 'perl'), or put /common/perl-5.10.0/bin/perl in your path.
Using the IBM server (Francesca): to use the new version of Perl you need the Perl 5.10.0 environment module loaded (on the command line):

  module load perl/5.10.0

which brings the Perl 5.10.0 binary into your path. (The actual path is /software/perl/5.10.0/bin/perl.) On Francesca you also need to load other modules such as meme or gnuplot. After you have checked out the code for APPLES (see below), you can change directory into trunk/shell_scripts and type the following to load all modules for APPLES at once (including Perl 5.10.0):

  source load_modules_francesca

2. Installation directory

On the WSBC cluster you can do the installation within your home folder. On the IBM cluster, however, you need to do the installation not in /home/sysbio/…, but in /gpfs/sysbio/…, as only the latter directory is accessible from the compute nodes. Note that this directory is not backed up, so if you develop for a long time without committing code to the repository you must make back-up copies.

3. Ensembl modules

Some functions will require Ensembl API modules and ensembl-compara API modules. These can be downloaded and installed using the following instructions (adapted from http://www.ensembl.org/info/docs/api/api_cvs.html - note that installation of BioPerl 1.2.3 is not necessary on the wsbc or IBM clusters, as this is already installed. For non-CVS installation methods, follow the instructions at http://www.ensembl.org/info/docs/api/api_installation.html). CVS access is recommended, as this makes it easier to change between versions.

Create an installation directory in /gpfs/sysbio/username and go there, for example:

  $ mkdir src
  $ cd src

This will create a directory "src".
Log into the Ensembl CVS server at Sanger (using a password of CVSUSER):

  $ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl login
  Logging in to :pserver:cvsuser@cvs.sanger.ac.uk:2401/cvsroot/ensembl
  CVS password: CVSUSER

PLEASE NOTE: At this point, the software version you download must match the Ensembl database version you wish to use. For example, the following was taken from "sample user scripts/core_promoters.pl":

  # Establish database parameters for core species
  my $genome_sequence_database_parameters = Ensembl_Database_Parameters->new(
      alias    => 'arabidopsis',
      location => 'ensemblgenomes',
      dbname   => 'arabidopsis_thaliana_core_3_55_9');

For the Ensembl installation to be consistent with the above, install the Ensembl Core Perl API using the following command:

  $ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl checkout -r branch-ensembl-55 ensembl

Install the matching Ensembl Compara Perl API using the following command:

  $ cvs -d :pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl checkout -r branch-ensembl-55 ensembl-compara

In the APPLES.dat file, put the correct path to your installation of the Ensembl APIs in "ensembl =" and "ensembl_compara =". These will be read from the config file when your user script starts up, pushing the paths into @INC and setting the corresponding environment variables. (NOTE: it is NOT necessary to follow the environment set-up instructions given in the website version of the Ensembl installation instructions, because the necessary environment variables are set up at the beginning of an APPLES user script.)

4. APPLES modules

Download a working copy of APPLES/trunk from the software repository. This uses the svn+ssh protocol for authentication, using your server login account. For example:

  > svn co svn+ssh://username@home.wsbc.warwick.ac.uk/Groups/TRegS/software/APPLES/trunk

These accounts use public/private keys that you have set up previously (see Paul Brown if you haven't done this).
You will also be prompted to enter your server password. Also, in your .ssh directory on the server there needs to be a file called 'config' which contains the following text. If there is one already, then make sure it contains the ~/.ssh/home line:

  IdentityFile ~/.ssh/id_dsa
  IdentityFile ~/.ssh/home

This lists all of the private keys that are used to communicate with other servers. The ~/.ssh/home entry points to the file needed to communicate with the home server.

Create a file called home in the .ssh directory, and paste in your private key (id_rsa) which you use to access the home server (e.g. located at .ssh/systems/id_rsa on my Mac), and which should look something like:

  -----BEGIN RSA PRIVATE KEY-----
  MIIEoAIBAAKCAQEAu75Xa7rWawaXW0p6o9+R921KVMsprECh6DEMYMYFlGtX7Wz
  ...
  t3y25CuJ6p5JDnYHjZNR9f+UjaWe0cYc3B3msAIbDhsN+Mud
  -----END RSA PRIVATE KEY-----

This file should only have user read/write access (chmod 600 .ssh/home).

Change the permissions on your working copy directory to allow read and execute access to the group, in order for jobs to be run by the Scheduler processes, which may be owned by another user (i.e. chmod 755 my_working_copy). Also check that the same permissions are given for the Ensembl modules if you are using a set of modules you have downloaded. On both the WSBC cluster and the IBM cluster the recommended setting is "chmod 755 …". Make sure your home directory itself has the same privileges, too; at least on the IBM cluster the default setting seems to be that the user's home directory is inaccessible to all other users.

Furthermore, ask Paul Brown to make you a member of the group "APPLES" on the WSBC cluster – this is the group of all APPLES developers. This membership is necessary as some developers may have restricted read access on their scripts to this group only, so if you want to be able to run their scripts you need to be a member of the group.
More importantly, if you were the last person to start the listeners (see below, Chapter "APPLES Scheduler, Cache and Job Handling") and you are not a member of this group, then some developers will not be able to run jobs through this cluster. If you want to restrict read access on your scripts to the APPLES developers group (on the WSBC machine), then you can ask Paul to make the APPLES group your primary group (in which case you can just use chmod to do so) or ask Paul for other ways.

5. Configuration files

Three configuration files are needed. Make a copy of these files in a new directory and amend them as appropriate (look inside the files). If you call the directory where you keep your configuration files "configuration_files" and place this directory in "trunk", then you have your files at the same relative path as others. In this case you can run the sample user scripts without any amendments. The three configuration files are:

  APPLES.dat
  cache_main.dat
  Job_Handler_Config.dat

On the IBM cluster, start from a copy of IBM_APPLES.dat and rename it to APPLES.dat. Also use IBM_Job_Handler_Config.dat instead of Job_Handler_Config.dat, but do not change the file name. Some details on the meaning of configuration parameters can be found in Section 8.4.

6. Extra Perl Modules on CSC Cluster

If you are using the CSC cluster there are extra modules installed in /gpfs/smsdad/apples/lib. In order to access them do the following:

1. Log in to the CSC cluster.
2. Type:

  echo 'eval $(perl -I/gpfs/sysbio/smsdad/apples/lib/perl5 -Mlocal::lib=/gpfs/sysbio/smsdad/apples)' >> /gpfs/sysbio/<user name>/.bashrc

  N.B. Replace <user name> with your actual user name.
3. If you carry on working, also type: . ~/.bashrc
4. Next time you log in, your environment will be correctly set up.

Now you should be ready to use APPLES!
3.2. Installation on a Windows PC

Installation on a Windows PC is much the same as on the server, except that a significant amount of software that is preinstalled on the server is likely to have to be installed on the PC. In addition, there are some Windows-specific configuration settings that will be required in APPLES. More details can be found at:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/2007/nigel_dyer/phd/software/running_apples_on_windows

4. APPLES Design Principles

These Design Principles should be adhered to at all times when adding or modifying anything within the APPLES software distribution. They are intended to keep the development of the software clean and consistent. Please read these carefully and ask if anything is unclear.

1). Clean termination: if an APPLES method results in regular termination, this behaviour must only be determined by its parameters. These can only be the direct parameters passed to the method, or properties of the object that the method belongs to.

2). Strict data typing: the input data types of all parameters passed to an APPLES method must be strictly defined. This is enforced by the statement "use MooseX::Declare;" at the beginning of all classes. A (single, invariant) output type for each function should be provided. It is not currently possible to syntactically enforce this in Moose, but it should be specified in a comment to the function for clarity.

3). Die statements: any APPLES method that is not yet fully implemented must exit cleanly (if called) through the use of a statement such as "die 'method not implemented yet!';" put at the beginning of the function.

4). Global variables: the use of global variables is limited to these cases only:
(a). Running mode parameter
(b). Verbosity
(c). Perseverance
(d). Memory cache
These variables do not affect the results of any computation.

5). Parameterisation: we distinguish parameters and configuration settings.
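As a sketch of principle 2 (the class and method here are invented for illustration and are not part of APPLES), a typed method declared with MooseX::Declare might look like this; the input types are enforced at run time, while the output type is only documented in a comment:

```perl
use MooseX::Declare;

class Genomic_Interval_Example {
    # Input types are declared in the signature; Moose checks them at run time.
    # The (single, invariant) output type cannot be enforced syntactically,
    # so it is stated in a comment instead.
    method interval_length (Int $five_prime_pos, Int $three_prime_pos) {
        # returns: Int (length of the interval, inclusive of both ends)
        return abs($three_prime_pos - $five_prime_pos) + 1;
    }
} # Genomic_Interval_Example
```

Calling this method with, say, a string where an Int is expected makes Moose throw a type-constraint error rather than silently producing a wrong result.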
The former can affect the result of a method so that it terminates regularly but returns a different result. The latter will either cause the method to throw an exception if the configuration setting (such as a path to a binary) is incorrect, or, if set correctly, will allow the method to function normally. Any parameter must be treated as such, by being explicitly defined by the user and passed through the code. This is important for the caching solution to work. Redundant parameters should be avoided.

6). Running mode: this can be "normal" (serial), "preparation" (setting up parallel jobs), "statistics" or "statistics_and_retrieval" (to evaluate whether a function call may result in heavy computations). Some functions are dependent on the chosen running mode parameter, but this does not affect what the function returns in case of regular termination. See Section 8.1 for more details.

7). Randomisation: any APPLES method involving randomisation (e.g. random_genomic_interval_picker, which uses the 'rand' function) must take a user-defined seed as a parameter, so that the consistency of the random result can be controlled. This ensures the user is aware of any randomised procedures, can control their seeds, and (if using seeds repeatedly) can apply our caching mechanism appropriately.

8). Public vs private methods: public methods (ones which can be called legitimately by the user) are distinguished from private methods (which are only to be used internally by the object, $self) syntactically, using the prefix "private_" for a private method.

9). Public vs private attributes: to protect attributes, where possible set them to be read-only (is => 'ro'). If an attribute is required to be private, you can override the implicit read/write methods by stating an alternative named method for the writer and/or reader, e.g.:

  has 'id' => (isa => 'Str', writer => 'private_write_id', reader => 'private_read_id');

10).
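Principle 7 can be illustrated with a small sketch (the class and method names here are invented; only the use of a caller-supplied seed reflects the principle):

```perl
use MooseX::Declare;

class Random_Picker_Example {
    # The seed is an explicit parameter: the caller controls it, so repeated
    # calls with the same seed give the same "random" result, which also makes
    # the result cacheable under the APPLES caching mechanism.
    method pick_random_position (Int $seed, Int $sequence_length) {
        srand($seed);                        # seed Perl's 'rand' deterministically
        # returns: Int in the range [0, $sequence_length)
        return int(rand($sequence_length));
    }
} # Random_Picker_Example
```

A method that called rand without taking a seed parameter would violate the principle, because its result would differ between runs with identical parameters.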
Instantiation: the method 'new' is only to be used if there is no associated "maker" class to use.

11). Object production vs object rendering: the output of a computation should not go directly into a file, but first into an object which is passed back by a function, so that it is cacheable. Subsequently generating a file to record that object (or aspects of it) should be done as a (computationally light) method of that object.

12). Setting object attributes: always set object attributes using the writer method, as in

  $gi->five_prime_pos(13);

and do NOT use assignments, as in

  $gi->five_prime_pos = 13;

The latter way of setting an attribute does not lead to processing of triggers, predicates, and the like.

13). Exception handling: never use the "exit" statement, but only "die" instructions, as the former disables exception handling. If you catch an exception, then make sure to re-throw the exception for all those types of exceptions for which you do not provide treatment. Also, do not use $@ after calls into Moose objects (see the explanation under "bad example" below).

A good example:

  method private_check_existence_of_neighbouring_gene(Reg_Loc $reg_loc, Boolean $ignore_pseudogenes, GenomicDirection $direction) {
      my $result;
      eval {
          $self->get_distance_to_neighbouring_gene($reg_loc, $ignore_pseudogenes, $direction);
      };
      if ($@) {
          my $exception_content = $@;
          if (!$exception_content->can('isa')) {
              die $exception_content; # if exception can't 'isa', re-throw the exception
          } else {
              if ($exception_content->isa('No_Neighbouring_Gene_Exception')) {
                  $GU->user_info(3,"encountered gene without a neighbour.\n");
                  $result = FALSE;
              } else {
                  die $exception_content;
              }
          }
      } else {
          $result = TRUE;
      }
      return $result;
  } # private_check_existence_of_neighbouring_gene #

We normally deal with the negative case of !$exception_content->can('isa') first, as there was a case where this was found to avoid a problem.

A bad example:

There is a big problem with this example; do not copy this.
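Principle 11 (object production vs object rendering) can be sketched as follows; the class, attributes, and method names are invented for this illustration and do not exist in APPLES:

```perl
use MooseX::Declare;

class Alignment_Result_Example {
    # The computation's output lives in this object first, so it can be
    # passed back by the computing function and cached.
    has 'score'   => (isa => 'Num', is => 'ro', required => 1);
    has 'summary' => (isa => 'Str', is => 'ro', required => 1);

    # Rendering the object to a file is a separate, computationally light
    # method of the object itself.
    method render_to_file (Str $filename) {
        open(my $fh, '>', $filename) or die "cannot open $filename: $!";
        print $fh "score: " . $self->score . "\n" . $self->summary . "\n";
        close($fh);
    }
} # Alignment_Result_Example
```

The computing function returns an Alignment_Result_Example; the caller decides separately whether and where to write it out.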
The problem: when the call into $GU is made, $@ becomes undefined. So the call into merge_object, even though syntactically correct, will fail (note: the method merge_object has been abolished since this paragraph was written). We have not fully investigated why the call into $GU has this bad side effect on $@. This type of problem can be confusing, as the user will only see the error message resulting from the problem within the exception-handling part; the original error cannot be seen.

5. APPLES Habits

These APPLES habits are intended to indicate the required 'look and feel' of the APPLES code, and should be implemented when adding or modifying any of the software. They also govern what state the code should be in before a new "commit" to the repository occurs. Please read these carefully and ask if anything is unclear.

1. Class names format: Upper_Case_With_Underscores
2. Method names format: lower_case_with_underscores
3. Datatype format: UpperCaseNoUnderscores
4. Copyright note: see document "copyright note.pdf"
5. Method/class end labels: add the name of the method (or class) in a comment after the closing bracket of the method (or class), as in:

  } # get_distance_to_neighbouring_gene #

6. Method order: list public methods first, then private methods. This is to make it easy for users of a class to get an overview of the available functionality. Some (big) classes may have further ordering; for example, Genome_DB_Utilities private methods are sorted by: generic private, Ensembl-specific private, Genbank-specific private, ...
7. Print statements: do not use "print". Instead, create a General_Utilities object and use the method user_info. This takes two parameters: (i) an integer (1, 2, 3) to designate the print level, (ii) the string to print. Global print levels: 0 = silent; 1 = warnings; 2 = warnings + progress; 3 = warnings, progress + debugging. For example:

  my $GU = General_Utilities->new();
  $GU->user_info(2,"COMPARING GI ".$counter." of ".$todo."\n" );

8.
Volatile code (code that is written for testing only, and may create unreliable results): use a # VOLATILE (start/end) temporary mark for line(s) of code in your working copy that are to be removed/reverted to a 'safe' form before committing the code. Code containing this mark should NOT be committed to the repository, and any results obtained with code containing these comments should NOT be cached. The mark is necessary whenever you add volatile code, as a safeguard; the mark must match the search term "VOLATILE" (in capitals).
9. Every Genome_Sequence_Database_Parameters instance should have a unique 'dbname' attribute. (Important for proper caching.)
10. Changes to method names or datatypes: whenever altering a datatype or method name, it is important to find and replace all occurrences in the code before committing to the repository.
11. SVN commit: before committing code to the repository, do an 'svn update' first to sort out any conflicts locally, and test that all changes put together make a working APPLES version. There is a set of regression tests in the sample_user_scripts_regression_test directory, which can be run using the 'run_all_perl_tests.pl' script in the same directory. This runs each of the other Perl scripts in that directory in turn, gathering the exit status of each script, and gives a summary of which scripts ran successfully and which failed. If all tests are successful, the user is advised that it is safe to commit their code; if one or more tests failed, the user is advised to check their code.
12. System calls: check that the return code is zero. If it is not zero, one should normally throw an exception. Perl does not do this automatically, so if you do not check the return code then your script may run on even after the system call failed. Example:

  system($cmd) == 0 or die "System error!";

13. Code formatting: we would like to give the APPLES code a consistent appearance, which should help when working with code from different modules.
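Habit 12 can be expanded into a slightly more informative check that also reports how the command failed, using only core Perl (the command itself is a placeholder; in APPLES it would be a call to an external binary such as BLAST or MEME):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $cmd = "true";   # placeholder command for illustration

system($cmd);
if ($? != 0) {
    # $? packs the child's status: the high byte is the exit code,
    # the low 7 bits record any signal that killed the process.
    my $exit_code = $? >> 8;
    my $signal    = $? & 127;
    die "system call '$cmd' failed (exit code $exit_code, signal $signal)!";
}
```

Including the exit code and signal in the die message makes failed system calls much easier to diagnose from the logs than a bare "System error!".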
a) For if-statements and the like, we open the bracket on the same line. So, for example, we write:

  if () {
      ...
  } else {
      ...
  }

and NOT:

  if ()
  {
      ...
  }
  else
  {
      ...
  }

b) We avoid redundant spaces in method calls, so we write:

  $self->do_something();

and NOT:

  $self -> do_something () ;

6. APPLES coding notes

The following provide more details on some other aspects of the coding of APPLES modules.

6.1. Location of 'use' statements

Use statements are positioned within the classes, except for "use MooseX::Declare", which is at the top of the file. This convention is not always followed elsewhere, but is described in the MooseX documentation itself:
http://search.cpan.org/~flora/MooseX-Declare-0.33/lib/MooseX/Declare.pm#NOTE_ON_IMPORTS

7. The standard APPLES prelude explained

When using APPLES to do data analysis, one normally sets up a "user script" which sets relevant parameters, makes calls into APPLES classes, and triggers output functions to render results as objects, text, files, pictures, etc. The example below shows the standard "prelude" for any APPLES user script.

The BEGIN statements are used in order to call the method load_includes, which sets environment variables and the Perl variable @INC, and makes some other settings. Using this method we can run APPLES scripts without having to configure the environment; we only need to set up the APPLES configuration files. Another advantage of this approach is that all APPLES developers can run scripts from others using their own settings, which is very useful when tracing bugs revealed by others. We use relative paths (see screenshot) to define the location of Main and APPLES.dat, so that everybody can run this script in their installation and with their own settings without having to change the code. As explained in the design principles, only three global variables are allowed in APPLES.
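The prelude itself appears as a screenshot in the original document and is not reproduced in this text. The following is therefore only a rough sketch of the pattern described above; the relative paths, the helper package name, and the arguments to load_includes are assumptions, not verified against the APPLES sources:

```perl
#!/usr/bin/perl
use strict;

my $global_print_level;   # one of the allowed APPLES globals (see APPLES Habits, item 7)

BEGIN {
    $global_print_level = 2;         # must be set here, as load_includes depends on it
    # Relative paths, so the script runs in anybody's installation without edits
    # ("../Main" and the helper name Prelude_Helper are hypothetical):
    unshift(@INC, "../Main");
    require Prelude_Helper;
    # load_includes reads APPLES.dat, pushes paths onto @INC and sets the
    # environment variables, so no manual environment set-up is needed.
    Prelude_Helper->load_includes("../configuration_files/APPLES.dat");
}
```

After this prelude, the script can use any APPLES class, and other developers can run it against their own configuration files without changing the code.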
Only specific classes are allowed to read these variables (and, therefore, are allowed to depend on them), and users should never alter them after the prelude. These globals do not affect the result of computations, but merely how the computations are organised. The first of these ($global_print_level) has to be set in the BEGIN statement, as load_includes depends on it. This variable is explained in the APPLES habits. The includes for APPLES_library and APPLES_dependencies are there to include anything that may be needed in APPLES user scripts, so that the script can be as short as possible: APPLES_library is for APPLES components, APPLES_dependencies is for non-APPLES components. The globals for perseverance and running mode are explained in the following sections.

8. APPLES Scheduler, Cache and Job Handling

When analysing genomic sequences and related data we frequently encounter computationally heavy tasks. Therefore, we need to be able to organise parallel computations. We have decided not to separate "production scripts", which organise computations, from "analysis scripts", which make use of large-scale results to extract relevant features. This is to a) avoid duplicating code, b) be able to flexibly change the level at which we parallelise, and c) maintain a full overview over all parameters. Furthermore, we want to be able to store results of heavy computations without having to organise this manually, have the software automatically retrieve results for computations that have been done before (without the user needing to know these exist), and share computational results with others (within Warwick as well as across institutions). To make it likely that different users use the exact same parameters for a given task, we define standard values for every parameter in APPLES, so users only modify these when they have a particular reason. The main APPLES components implementing these features are the Job_Handler and the Cache module.
The Scheduler is a component that runs underneath the Job_Handler to serve its requests for jobs to be run and to provide an interface to a cluster machine's queueing system. The caching also runs underneath the Job_Handler, so normally users only need to call the Job_Handler as needed. It will automatically avoid re-running computations done previously, and allows switching between different modes of running (see below). The Job_Handler can handle two kinds of jobs:

1). Binaries, such as alignments using the Ott algorithm or Seaweed algorithm ("handle_alignment_job" method)
2). APPLES functions, as Perl scripts ("handle_APPLES_function" method)

A main feature provided by the Job_Handler is the ability to switch between serial processing (running mode "normal") and parallel processing (running mode "preparation"). When parallel processing is requested, the Job_Handler is forced to return control back to the calling script without being able to actually return the result of the computation (the computation job will be sent to the queue, and the Job_Handler cannot wait for it to complete). This gives rise to "job information exceptions", a type of APPLES exception that the Job_Handler uses to tell the calling script that the results could not be obtained (yet). These exceptions can then be caught and dealt with appropriately. For example, if heavy computations for a number of genes are requested, and these are independent among the genes, then one would want to catch the exception every time and call the Job_Handler for the next gene. Details on this type of exception are discussed in the section "Job Information Exception Handling". Essentially this means that the programmer needs to hard-wire their decisions as to which function calls can be done independently in parallel.

8.1. APPLES Running Modes

This is a global parameter, set by the user in their script.
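The per-gene pattern described above can be sketched as follows. The exception class name Job_Information_Exception is taken from this manual; the loop structure and the wrapper function are assumptions for illustration only:

```perl
# For each gene, request the heavy computation; in "preparation" mode the
# Job_Handler throws a job information exception instead of returning a
# result, and we simply move on to the next gene.
my @results;
foreach my $gene (@genes) {
    my $result;
    eval {
        $result = compute_heavy_result_for($gene);  # hypothetical wrapper around the Job_Handler
    };
    if ($@) {
        my $exception = $@;
        if (ref($exception) && $exception->isa('Job_Information_Exception')) {
            next;            # job has been queued; the result will be there on a later run
        } else {
            die $exception;  # re-throw anything we do not treat (design principle 13)
        }
    }
    push(@results, $result);
}
```

Because the computations are independent among the genes, catching the exception and continuing lets a single script run submit jobs for all genes at once.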
The options are:

• "statistics"
• "statistics_and_retrieval"
• "preparation"
• "normal"

Statistics mode: statistics mode will not return any results for a request. However (if Job_Information_Exception handling has been correctly implemented for the job(s) you are requesting), it will return information about the number of jobs required to fulfil your request: how many have been computed, and how many will require computation. This mode can therefore be used to gauge how long your request will take. This is important for diagnosing the amount of computation when the CPU-time requirement is potentially substantial and you are not clear about how much of it may already be in the cache.

Statistics_and_retrieval mode: this is like statistics mode, but the Job_Handler will return results from the cache when available. So if all the results needed exist already, then in this mode your script will run through just as it would do in normal mode.

Preparation mode: this mode allows for parallel computations. If a result is available, it is returned to you. If a job result is unavailable, it is automatically submitted for computation. Your script will terminate once all the requests have been dealt with. You may therefore not have received a full set of results, and will need to wait until all the jobs have been computed before submitting the request again, at which time all the results will be returned to you.

Normal mode: this mode allows only for serial computations. As with preparation mode, if a result is available it is returned to you. If a result is unavailable, it is computed, and your script will wait while this is done before moving on to the next part of your script. It is therefore only recommended for a small number of short jobs, or where the results already exist and you are just retrieving them from the cache.

8.2. Parallelising in APPLES

The following screenshot shows the method rere_set_maker, which shows all aspects of handling job information exceptions.
Typically this method is called to create ReRes for a set of genes. Computations for each gene can be performed independently, but the final ReRe_Set can only be completed once results for all genes are available. In the case of a Job_Information_Exception, a single job has been dealt with by the Job_Handler; in the case of an Aggregate_Exception, some other layer in the code has already received one or more Job_Information_Exceptions from the Job_Handler and merged them. In either case, receiving these exceptions in this method means that ReRes will not be available for all Reg_Locs. So jobs for all ReRes can be requested and added to the queue, but the ReRe_Set cannot be completed, and, therefore, the method will call print_statistics_and_die instead of returning. The print_statistics_and_die function will print an overview of job types, job numbers, and overall CPU-time estimates before calling the die method.

A method called "standard_exception_handling_for_parallelisation" has been added to General_Utilities. This can be used to shorten code such as the code shown above. For an example, see the code of Conservation_Profiles_Maker. Use this function where applicable to avoid code duplication. The example user script "how_to_engage_job_handler.pl" demonstrates the correct implementation of calling an APPLES function via a call to the Job_Handler.

8.3. Adding a Method for Another Binary to the Job_Handler

Binaries can be run in parallel and with caching in two ways: either directly by a method of the Job_Handler (as is done for the seaweed algorithm, for example), or they can be run from a method in another class which is then called through handle_APPLES_function. We do the latter when dealing with MEME jobs. The advantage of the former approach is that Perl does not need to be invoked when the job is run on the machine, so if a binary is to be called for a large number of small instances then this approach is preferable.
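The method rere_set_maker is shown as a screenshot in the original document and is not reproduced here; the following sketch only illustrates the control flow described above. The names Job_Information_Exception, Aggregate_Exception and print_statistics_and_die come from this manual, while the loop structure and the per-gene helper are invented:

```perl
my $incomplete = 0;   # set if any ReRe could not be obtained yet
my @reres;
foreach my $reg_loc (@reg_locs) {
    eval {
        push(@reres, make_rere_for($reg_loc));   # hypothetical per-gene computation
    };
    if ($@) {
        my $exception = $@;
        if (ref($exception)
            && ($exception->isa('Job_Information_Exception')
                || $exception->isa('Aggregate_Exception'))) {
            # Job(s) have been queued; keep going so jobs for ALL ReRes
            # are requested in this one run.
            $incomplete = 1;
        } else {
            die $exception;   # re-throw untreated exception types
        }
    }
}
if ($incomplete) {
    # The ReRe_Set cannot be completed: report job types, job numbers and
    # CPU-time estimates, then stop instead of returning a partial set.
    print_statistics_and_die();
}
# otherwise: assemble and return the ReRe_Set from @reres
```

The key point is that the loop does not stop at the first exception; it requests jobs for every gene before deciding that the set as a whole cannot yet be completed.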
When adding a method for treating a binary to the Job_Handler, the following points must be taken into account:
1) A wrapper has to be written, which should be done outside the Job_Handler. The wrapper will involve a) generating the command line, b) generating any input files for the binary, and c) providing a parser for output files that returns results as an APPLES object.
2) The method in the Job_Handler must produce a “keyfile”. The keyfile is a file that contains all the information that defines one instance of running the binary. This includes all parameters and relevant data fed into the binary via input files. It only includes information that can make a difference to the output of the binary; it will not, for example, include the name of the temporary directory that results are written to. The keyfile must be turned into an MD5 sum, which is then used to identify the instance (see handle_alignment_job for an example).
3) All Job_Handler modes described above must be supported.

8.4. Configuration of Caching and Scheduler (=Listeners)

Caching
By default, all computations that are deemed successful are cached (currently in the apples_cache_dev MySQL database). For APPLES functions (handled as Perl scripts), caching can be turned off by passing $cache = FALSE to the handle_APPLES_function method.

Configuration
The central configuration file for APPLES is APPLES.dat. The following information is required:

username = # your username on the server
password = # your password for the BiFa server, if that functionality is required
job_handler_config = # full path to Job_Handler_Config.dat file
cache_config = # full path to cache_main.dat
application = APPLES
queue = /cluster/data/webservices/apples/queue_1 # full path to queue directory
queue_id = 1
cluster_dir = /common/sge/bin/Darwin # directory path to qsub command – deprecated?
grid = 1 # 0 means FALSE, 1 means TRUE.
On a cluster, use grid=1.

job_parameters_file = job_parameters.txt # filename for file containing job parameters to be written
ensembl = /cluster/laurabaxter/ensembl/modules/ # ensembl installation
ensembl_compara = /cluster/laurabaxter/ensembl-compara/modules/
APPLES_main = /cluster/laurabaxter/APPLES_WORKING_COPY/Main/ # location of APPLES code
BiFa_Server_Name = wsbc.warwick.ac.uk
BiFa_Server_Port = 4338
input_dir = /cluster/data/webservices/apples/input/ # location of input directory for job handling
blast_db = // # location of local BLAST databases
data_dumps = /cluster/data/webservices/apples/data_dumps/
path_to_perl = /common/perl-5.10.0/bin/perl
job_check_delay = 1 # seconds to wait between checking for queued jobs, leave as 1
grid_type = SGE # SGE on wsbc cluster, PBS on IBM cluster
path_to_APPLES_C_binaries = /common/bin/apples_binaries/ # this is where we put C binaries belonging to the APPLES package
scheduler_cache_name = cache_save # to be documented
APPLES_management_home = /cluster/data/webservices/apples
transfac_to_uniprot_map = /cluster/howard/Public/transfacIDs_to_UniprotIDs_ONDEX.txt
path_to_APPLES_R_programs = /cluster/laurabaxter/APPLES_WORKING_COPY/R_Programs/

Job_Handler_Config.dat: to be documented
cache_main.dat: to be documented

Caching: The user must specify the location of one correctly set up cache database (MySQL). This is used to store results. It is also searched each time a computation is requested, to see if the result already exists in the database, in which case the result is returned without the computation being run.
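As an aside, the keyfile-to-MD5 step from section 8.3 (turning everything that defines a job into one identifier that the cache can be searched for) can be sketched in a few lines of self-contained Perl; the field names here are invented for illustration:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Build a keyfile-style string: every parameter that can influence the
# binary's output, in a fixed order. (Field names are illustrative only;
# temporary paths etc. are deliberately left out, as section 8.3 requires.)
my %job = (
    binary_version => 'seaweed-1.0',
    sequence_a     => 'ACGTACGT',
    sequence_b     => 'ACGTTTGT',
    window_length  => 60,
);
my $keyfile_content = join "\n", map { "$_=$job{$_}" } sort keys %job;

# The MD5 sum of the keyfile identifies this job instance in the cache.
my $job_id = md5_hex($keyfile_content);
print "$job_id\n";
```

Because the keys are sorted, the same parameters always yield the same MD5 sum, which is what allows a later request to find the earlier result in the cache database.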
Setting up a scheduler

Configuration: there are two mode choices:
Grid: if you have qsub available on your system and wish to use it, configure APPLES.dat with grid=1.
Non-grid: should be used if there is no qsub available on your system; configure APPLES.dat with grid=0.

Set up the correct directory architecture. The following directories must exist, and their paths must be specified in APPLES.dat:
queue
input
tempdir
blast_db
data_dumps
These directories should have read/write/execute permissions for the user and group.

8.5. Starting and Stopping Listeners:

When listeners do not seem to work, it often helps to stop and re-start them. After start-up it can take a minute or two before the listeners start to submit jobs to the system queue if the APPLES queue is quite full (say >10,000 jobs). The listeners read in all jobs in the queue directory and then process them in alphabetical order. For example, the request listener checks, for each job in the request state, whether it can be added to the queue. After going through all jobs, the listeners re-read the list of current jobs from the queue.

There is a “quick hack” for adding priorities to one’s jobs, so you can make your jobs higher or lower priority than others’ jobs, as appropriate. Look for “JOB PRIORITISATION NOTE” in Scheduler.pm to find out about it.

Re-starting Listeners
To (re-)start the listener processes:
1) Log into the head node (not any of the compute nodes). Note: on wsbc, listeners should be started and stopped by a global user called applesuser. Log in to wsbc from your account (login applesuser; the password is ‘apples’).
2) cd into your APPLES_job_management directory (within trunk).
3) See what listeners are running: ps -A | grep start_listen (on WSBC) or ps -AF | grep start_listen (on IBM).
4) (Only on IBM) module load perl/5.10.0
5) ./stop_listeners.sh (on WSBC) or ./IBM_stop_listeners.sh (on IBM); this kills the 5 listener processes.
6) Confirm the listeners are no longer running (as in Step 3).
If a listener process refuses to be stopped, kill the process using the process id listed by ps (kill -9 processid).
7) ./start_listeners.sh (on WSBC) or ./IBM_start_listeners.sh (on IBM).
8) Confirm all 5 listeners are running (as in Step 3). On the wsbc system, log out as applesuser (logout).

Starting the listeners initiates 5 Perl scripts which run in the background and update *.log and *.err files for each of the 5 listener types (request, running, complete, error, trash). The log-files are in /cluster/data/webservices/apples on the WSBC cluster and in /gpfs/sysbio/smsdad/apples on the IBM cluster.

Note: at present (8/3/10) the listener processes will die if the shell from which you started the listeners times out, so you must log out of the shell (or exit) yourself before it times out. This is due to a known problem in the operating system and should only apply to the WSBC cluster.

Adjusting the number of CPUs left for others: The listeners have in-built “courtesy”: they will leave a fixed number of CPUs free for others. At times when nobody else runs jobs, these CPUs will be idle. This behaviour is controlled in the start_listeners.sh script by altering lines like this one:
.../common/perl-5.10.0/bin/perl start_listener.pl trash 20 64 10 1... (leave 10 CPUs free)
to
.../common/perl-5.10.0/bin/perl start_listener.pl trash 20 64 0 1... (take all CPUs, no courtesy)

The arguments (after the listener type) are:
20 = max allowed jobs (i.e. the maximum number of CPUs occupied by APPLES jobs)
64 = CPUs available on the system
10 = minimum CPUs to leave free
1 = delay in seconds for looping the listeners

Notes:
a) If you choose 0 as the setting for minimum CPUs free, the listeners will send as many jobs to the queue as the max allowed jobs parameter permits; they will not wait for idle CPUs in this case.
b) On the IBM cluster we found that we have to set the minimum number of free CPUs to 0 in order to get anything into the queue.
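The way these arguments interact can be illustrated with a small Perl sketch of the decision the listeners make; the real logic lives in start_listener.pl and Scheduler.pm and may differ in detail, so treat this purely as a reading aid:

```perl
use strict;
use warnings;

# Arguments as passed to start_listener.pl (after the listener type).
my ($max_allowed_jobs, $cpus_on_system, $min_free_cpus) = (20, 64, 10);

# Hypothetical sketch: how many further jobs may be submitted, given the
# number of APPLES jobs already running and the CPUs busy with other work.
sub submittable_jobs {
    my ($own_running, $others_running) = @_;
    my $room = $max_allowed_jobs - $own_running;
    # With min_free_cpus = 0 the listeners ignore idle CPUs entirely and
    # submit up to max allowed jobs (see notes a and b above).
    return $room if $min_free_cpus == 0;
    my $idle      = $cpus_on_system - $own_running - $others_running;
    my $courteous = $idle - $min_free_cpus;   # leave CPUs free for others
    $courteous = 0 if $courteous < 0;
    return $courteous < $room ? $courteous : $room;
}

print submittable_jobs(5, 40), "\n";   # busy system: prints 9
```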
Removing jobs from the APPLES queue directory
In case you created jobs accidentally, you can delete them from the APPLES queue directory (location specified in the configuration files) using a command like this one:
find . -user howard -exec rm -r '{}' \;
(Make sure to cd into the right directory before doing this!)

8.6. Diagnosing problems with the listeners

After the introduction of the scheduling system there were some problems with the listener processes: they sometimes died and sometimes became unresponsive. This leads to jobs not being put into the queue, or to results not being picked up and inserted into the cache-database. Jay provided this list of points to check before reporting a listener bug:
1. Are the relevant listeners actually running? Check using (something like) ps -A | grep start_listen (on WSBC) or ps -AF | grep start_listen (on the IBM cluster).
2. What is in the stdout and stderr logs for the listener in question? Check in the files ...... (i.e. if the listener stopped, is there an error message in .err indicating why it stopped?). Note: the .err files are not overwritten after a restart of the listeners, but the listeners will append future error messages.
3. What else is running on the cluster? Is there time and space for the job? Can you see it in qstat?
4. If the listener is still running, is it doing something else? Are there other jobs in the same state?
5. If the job seems 'stuck', what are the permissions on all the relevant files, and who owns the listener processes?

9. Perseverance

As we depend on servers to provide sequences (Ensembl), weight matrices (BiFa server), etc., we frequently have to deal with instabilities, network outages, and the like. In some cases we wish our scripts to deal with the problem by retrying several times; in other cases it is more convenient to just see the crash. We also want to be able to adjust the extent of perseverance, i.e. how many retries, at what time intervals, etc.
As a general solution to this we have designed the Running_Perseverance class, which provides a way to make a piece of APPLES code perseverant. Every APPLES user script provides one Running_Perseverance object as a global parameter. The user can adjust perseverant behaviour by setting this object’s attributes; it allows different perseverance behaviour to be set for different unstable components. For example, one could set it up to be perseverant on Ensembl problems, but to crash in case of BiFa problems.

The following code shows an example of how to make a part of APPLES code perseverant. We assume that the error-handling code itself might also fail (reconnecting to Ensembl), so it needs to be included in the eval-statement. In the original document, the code you have to copy for your applications is shown in red (it is important you pick up all of those lines) and comments are in green:

my $gene;
my $reconnect = FALSE;
while (!$main::global_perseverance->stop_trying) {
  eval {
    if ($reconnect eq TRUE) {
      # Attempt to refresh the connection with Ensembl --- and also the gene_adaptor?
      $GU->user_info(1, "attempting to reconnect to ensembl\n");
      $self->private_reconnect_to_registry($parameters->location, $parameters->dbname); # can crash if Ensembl is down
      $gene_adaptor = $self->private_get_ensembl_gene_adaptor($parameters);
      $GU->user_info(1, "successful reconnect?\n"); # debugging
      $reconnect = FALSE;
    }
    $gene = $gene_adaptor->fetch_by_stable_id($geneid); # this could crash if Ensembl is down
    ...
  }; # end of eval statement
  my $error = $@;
  $main::global_perseverance->decide_on_rerun('Ensembl', TRUE, $error);
  if ($error) {
    # something has gone wrong...
    # specific-case error handling code here:
    $reconnect = TRUE;
  }
} # end of while loop - if there was an error we (may) try again
# and reset the try counter and stop_trying attribute
$main::global_perseverance->stop_trying(FALSE);

10.
Sample Script Files

When new functionality is introduced, an example script file should be placed in the ‘sample_user_scripts_regression_test’ directory. This serves two functions:
• It provides a guide as to how the functionality can be used.
• It is run as a regression test to ensure that the code still runs when the APPLES software is updated.
This requirement means that the script should not take too long to run, and it should attempt to exercise most of the options associated with the new functionality.

In addition, scripts which demonstrate the use of the functions with large data sets, and which can be used to perform APPLES performance tests, may be placed in ‘sample_user_scripts_performance’.

There may be other scripts that provide good examples of using APPLES and which do not come into either category. These should be placed in ‘sample_user_scripts_misc’.

11. APPLES Glossary of Terms

The ideas behind a number of key APPLES classes, explained:

Genomic_Interval: a genomic interval is a single contiguous piece of biological sequence. A Genomic_Interval can be any one of 3 defined sequence types (SequenceTypes): dna, rna or protein. A Genomic_Interval must have a defined genome_db: this is any valid Genome_Sequence_Database_Parameters object, which in turn can be either a FASTA_Sequence_Database_Parameters or an Ensembl_Database_Parameters object; therefore both FASTA files and Ensembl databases can be used to define genomic intervals. A Genomic_Interval must have a defined region, which is, for example, a chromosome number for an Ensembl sequence, or the header of a FASTA sequence in the file. A Genomic_Interval must have a defined start (five_prime_pos) and stop (three_prime_pos) position, and a strand of StrandType (positive or negative). A Genomic_Interval can be aggregated with other interval(s) from the same or different species to form a Genomic_Interval_Set.

Genomic_Interval_Set: a set of genomic intervals.
‘Set’ pertains to the standard mathematical definition, i.e. zero-to-many; an empty set is therefore valid. The set members can be from the same or different species.

Reg_Loc: a regulated locus. This is a single point within a genome that is under the influence of transcriptional regulation. A Reg_Loc must be of a defined RegLocLocusType (ensembl_gene or mirna): for example, a TSS (transcriptional start site) pertaining to an Ensembl gene, or a micro RNA. As with a genomic interval, it must have a specified genome_db, position and strand.

ReCo: a regulatory complex. Describes the binding of a set of proteins to a set of genomic intervals to exert some transcriptional regulation effect on a specified regulated locus.

ReMo: a regulatory module. Derived from a Genomic_Interval, its purpose is to distinguish regions to be considered important for the binding of a regulatory complex from any other genomic sequence. A ReMo can be derived in several ways: from a core promoter; from an orthologous core promoter; from a ChIP-seq fragment set; from a known binding site; etc. ReMos can be combined into a ReMo_Set.

ReMo_Set: a set of ReMos.

BiSi: a binding site. Based on a Genomic_Interval_Set, a BiSi represents the place where a transcription factor binds to the DNA sequence.

ReRe: a collection of regulatory regions for one Reg_Loc. A ReRe has a group of ReMo_Sets. This can include the core promoter, conserved regions (ReMos), regions identified by ChIP-Seq data, and other regions. The purpose of this class is to collect all sequence regions for which there is some evidence of a regulatory function that may influence the given Reg_Loc.

ReRe_Set: a set of ReRes.

12. Using a Development Environment to Develop APPLES

12.1. Introduction

Those who are familiar with development environments such as Eclipse will appreciate the advantages such systems provide for developing and testing software.
Because the wsbc server is geared towards a console and text editor environment rather than a GUI, it is not possible to use such a development environment there. It is, however, possible to develop and run the software within a GUI-based development environment on a PC or Mac prior to release for use on the server. The instructions here are for working on a Windows PC; Eclipse and Perl are both multi-platform, so it should be straightforward to set up an equivalent environment on a Mac.

12.2. Setting up

APPLES should first be installed as described in the ‘Getting started with APPLES’ document. It is suggested that the input directory be created as a subdirectory of the APPLES directory. The starting point is to set up the Eclipse development environment, together with the add-ins for working with Perl. Information on installing this can be found at:
http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/2007/nigel_dyer/phd/software/installingperl

To add the APPLES files to the workspace, right-click in the Package Explorer and select ‘New/Other/Perl/Perl Project’. Set the project name to APPLES and browse for the directory where APPLES has been placed using svn.

12.3. Running code

The Eclipse documentation and websites should be consulted for running and debugging code. One of the features of APPLES is that large chunks of a job are packaged up and run as a separate APPLES job, which makes it difficult to debug the jobs that have been created. One solution to this is to:
a) run the APPLES code through to the point where the sub-job has been created, which involves writing a set of files in the ‘Input’ directory;
b) right-click on the input directory within the APPLES package in the Package Explorer and select ‘refresh’; Eclipse is now aware of the newly created perlscript.pl file associated with this job.
c) use ‘debug configurations’ to add a new configuration, or modify an existing one, to run the newly created perlscript.pl file, which will be in a newly created subdirectory of input_dir; it can then be debugged.

Note that over time a large number of redundant directories can accumulate in the input_dir directory. These should be deleted periodically to simplify the job, in stage c), of finding the newly created file.

13. Miscellaneous

Our MySQL server only accepts a limited number of connections. APPLES processes often open connections to our local Ensembl databases which are kept open until the process terminates. The limit is currently set to 500, which means that no more than 500 APPLES processes using local Ensembl databases can run in parallel.
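When running many processes in parallel, it can therefore help to release Ensembl connections explicitly as soon as they are no longer needed, instead of holding them until the process exits. A minimal sketch using the Ensembl Perl API (disconnect_all is a standard Bio::EnsEMBL::Registry method; whether it is safe to call at a given point in an APPLES script depends on what still holds adaptors):

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# ... fetch whatever sequences are needed via the registry here ...

# Release this process's Ensembl database connections, freeing slots
# within the 500-connection limit for other APPLES processes.
Bio::EnsEMBL::Registry->disconnect_all();
```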