Version 1.2 (Dec. 2011)

User guide of software ALOHA

Hsin-Chou Yang, Hsin-Chi Lin and Mei-Chu Huang
Institute of Statistical Science, Academia Sinica

Correspondence: Hsin-Chou Yang (hsinchou@stat.sinica.edu.tw), Institute of Statistical Science, Academia Sinica, 128, Academia Road, Section 2, Nankang, Taipei 115, Taiwan.

Table of Contents:
1. ALOHA LICENSE
2. INTRODUCTION
3. SOFTWARE DOWNLOAD AND INSTALLATION
4. ALOHA INITIALIZATION
5. ALOHA INTERFACE AND FUNCTIONS
6. DATA INPUT FORMAT
7. TWO TEST EXAMPLES
8. ALOHA VERSION UPGRADE

1. ALOHA LICENSE

All copyrights are reserved by the authors of ALOHA. We welcome any noncommercial use of ALOHA for your own research. Please do NOT modify or distribute the ALOHA program in any form without the permission of the authors. Inquiries about commercial use of ALOHA should be directed to hsinchou@stat.sinica.edu.tw. ALOHA is free software; we provide no warranty and assume no responsibility for the results of analyses. If publications are based on results obtained with ALOHA, please cite the following reference:

Hsin-Chou Yang, Hsin-Chi Lin, Mei-Chu Huang, Ling-Hui Li, Wen-Harn Pan, Jer-Yuarn Wu, and Yuan-Tsong Chen (2010). A new analysis tool for individual-level allele frequency for genomic studies. BMC Genomics 11: 415.

2. INTRODUCTION

ALOHA (Allele-frequency/Loss-of-heterozygosity/Allele-imbalance; AF/LOH/AI), written in R with an R GUI, provides genome-wide analysis of allele frequency and detection of loss of heterozygosity (LOH) and allelic imbalance (AI). An allele frequency biplot is also provided for sample classification, outlier detection, and SNP clustering.

3. SOFTWARE DOWNLOAD AND INSTALLATION

Running ALOHA requires installation of both ALOHA and R.

1. Download software ALOHA: ALOHA is available at the ALOHA website.
The zipped file “ALOHA.zip” can be downloaded and unzipped to obtain a directory “ALOHA” containing the ALOHA programs and two illustrative data examples. The directory “ALOHA” can be saved as a working directory, such as “C:\ALOHA”.

2. Download program R: Users can download the R installer “R-2.14.1-win.exe” from the ALOHA website. Alternatively, users can download R from the website of “The R Project for Statistical Computing” at http://www.r-project.org/. Click “CRAN” (Comprehensive R Archive Network) on the left of the page and select a suitable mirror site. Select the platform (Linux, MacOS X, or Windows (95 and later)) on which R will run. Click the hyperlink “base” and select “R-2.14.1-win.exe”. Then execute the file to install R to “C:\Program Files\R\R-2.14.1”.

After the installation is finished, double-click the icon “R-2.14.1” to start R. A window “RGui” with a sub-window “R Console” opens and waits for subsequent commands.

Users are advised to update their R packages: select “Packages” in the tool bar, click “Update packages”, and then select a suitable mirror site. When the window “CRAN mirror” opens, click “OK” to update the packages.

Note that the analyses provided by ALOHA require three additional R packages: tcltk, gtools and pspline. These packages will be downloaded automatically if a recent version of R, e.g., R-2.14.1, is used. Users are advised to run ALOHA with R-2.14.1 or a newer version of R.

4. ALOHA INITIALIZATION

Once the packages mentioned in the previous section have been installed, ALOHA can be initialized by the following procedure. In this user guide, we suppose that the ALOHA programs are saved in the destination directory “C:\ALOHA”.

1. Start R by double-clicking the icon “R-2.14.1”.
2. 
Key in the following command in the command line of the window “R Console” and press the Enter key:
    ALOHA.gui=paste("C://ALOHA//PROGRAM//ALOHA_interface.r",sep="")
3. Type the following command in the command line and press the Enter key to initialize ALOHA:
    source(ALOHA.gui)
The ALOHA interface (see Figure 1) opens after the Enter key is pressed and waits for data entry.

Figure 1. Interface of ALOHA

5. ALOHA INTERFACE AND FUNCTIONS

ALOHA has a user-friendly interface developed with R GUI (Figure 1). The interface contains a preface that gives a short introduction to ALOHA. The directory structure of ALOHA is shown in Figure 2. Four main items collect the required and optional information for an ALOHA data analysis.

Item 1: Input/output path:
  Study group: Users should select “one group” for a one-population or multiple-population analysis, or “two groups” for a case-control analysis.
  Directory of data input: Users should provide the working directory where their data are saved.
  Directory of results output: Users should provide the working directory where their output should be saved. Note that the output directory must exist before executing ALOHA.

Item 2: Allele frequency (AF) reference:
  ALOHA database – (a) Users should select a population matched to their study, and (b) users should select the type of SNP chip.
  User provided – Users can provide their own AF reference.

Item 3: AI/LOH calculation:
  Confidence level – Users should key in a value X between 0.9 and 1 to construct a 100X% confidence interval for the AI and LOH indices of study patients.
  Window size – Users should key in a value of at least 10.
  Upper bound (quantile) of reference – Users should key in a value X between 0.9 and 1 to specify the 100X% quantile of the AI and LOH indices of normal controls.

Item 4: AF biplot:
  Alpha scaling – 0 or 1 should be entered.

Once all items are completed, click the “Run” button to submit the computational job.
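The constraints stated for Items 3 and 4 can be checked by hand before clicking “Run”. The following is a minimal R sketch and is not part of ALOHA: the function name check_aloha_settings and its argument names are hypothetical, and the guide does not say whether the bounds 0.9 and 1 are themselves accepted, so the sketch assumes they are inclusive.

```r
# Hypothetical helper (not part of ALOHA): returns a character vector listing
# the problems found in the Item 3 and Item 4 settings; empty if none.
check_aloha_settings <- function(conf_level, window_size, ref_quantile, alpha_scaling) {
  problems <- character(0)

  # Assumption: the bounds 0.9 and 1 are treated as inclusive.
  in_range <- function(x) {
    is.numeric(x) && length(x) == 1 && !is.na(x) && x >= 0.9 && x <= 1
  }

  if (!in_range(conf_level)) {
    problems <- c(problems, "Confidence level must be between 0.9 and 1.")
  }
  if (!(is.numeric(window_size) && length(window_size) == 1 &&
        !is.na(window_size) && window_size >= 10)) {
    problems <- c(problems, "Window size must be at least 10.")
  }
  if (!in_range(ref_quantile)) {
    problems <- c(problems, "Upper bound (quantile) of reference must be between 0.9 and 1.")
  }
  if (!(length(alpha_scaling) == 1 && alpha_scaling %in% c(0, 1))) {
    problems <- c(problems, "Alpha scaling must be 0 or 1.")
  }
  problems
}

# Values chosen for illustration only; they are not ALOHA defaults.
check_aloha_settings(conf_level = 0.95, window_size = 50, ref_quantile = 0.99, alpha_scaling = 1)
check_aloha_settings(conf_level = 0.5, window_size = 5, ref_quantile = 0.99, alpha_scaling = 2)
```

The first call returns an empty vector; the second lists one message for each setting that violates its stated constraint.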
After the job is submitted, ALOHA checks the input information and data files. If the information is invalid or the data files are ill-formatted, ALOHA shows warning or error messages to help users make corrections. If the input passes the check, ALOHA starts the analysis and the message “Please wait a while, ALOHA is running…” is shown in the command line. A prompt sign appears immediately, but the computation is still in progress. Please wait until a new window with the message “Computation of ALOHA is finished.” opens to signal that the computation is complete. Note that users can interrupt the execution of ALOHA at any time by pressing ESC in the window “R Console”.

Once the execution of ALOHA is finished, the numerical results and graphical outputs are saved automatically in the output directory that the user provided. We suggest removing figure files from a previous analysis before starting a new analysis, to avoid confusing figure files from old and new analyses.

Figure 2. Directory structure of ALOHA

6. DATA INPUT FORMAT

This section introduces the data structure and format. Two test examples, which illustrate a one-group and a two-group analysis, are also provided (see Section 7).

6.1 Nomenclature rule for the directory/file name and data format

Two-group analysis: As mentioned in Item 1 in Section 5, users can specify a working directory where allele frequency data are saved, e.g., “C:\ALOHA\Work”. Under the working directory, users MUST create the subdirectories “C:\ALOHA\Work\Case” and “C:\ALOHA\Work\Control” to hold the allele frequency data of patient samples and control samples, respectively. Note that the directory names are case sensitive.
In each of these directories, the allele frequency data of each sample should be saved in its own directory, named according to the rule “X_Y”, where Y is the sample ID, e.g., “01”, “02”, …, and X is an arbitrary string describing the attribute of the case or control, e.g., “ALL” for acute lymphoblastic leukaemia and “NC” for normal control. Under the directory of each sample, the allele frequency data of each chromosome should be saved in a separate file, named according to the rule “Chr_W.txt”, where W is the chromosome number, i.e., “01”, “02”, …, “23”. The data file for each chromosome of each sample contains six columns with the headers “Probe_set”, “Chr”, “Phy_position”, “Genotype”, “Chiptype”, and “AF”. Please refer to Example 1 in Section 7.1 for details.

One-group analysis: As mentioned in Item 1 in Section 5, users can specify a working directory where allele frequency data are saved, e.g., “C:\ALOHA\Work”. In the working directory, the data of each population should be saved in its own directory, named according to the rule “X_Y”, where Y is an arbitrary string identifying the study population, e.g., “Asian”, “CEU”, “YRI”, and X is an arbitrary string describing the attribute of the study, e.g., “Normal”. Under the directory of each population, the allele frequency data of each sample should be saved in its own directory, named according to the rule “X_Y”, where Y is the sample ID, e.g., “01”, “02”, …, and X is an arbitrary string describing the attribute of the study. Under the directory of each sample, the allele frequency data of each chromosome should be saved in a separate file, named according to the rule “Chr_W.txt”, where W is the chromosome number, i.e., “01”, “02”, …, “23”.
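The naming rules for both layouts can be written out as file paths. The sketch below is illustrative only and is not part of ALOHA: the helper chromosome_files is hypothetical, and the sample directory names “ALL_01”, “NC_01”, and “Normal_01” are made-up instances of the “X_Y” rule.

```r
# Hypothetical helper (not part of ALOHA): builds the expected data file paths
# for one sample directory located under a given parent directory.
chromosome_files <- function(parent_dir, sample_dir, n_chr = 23) {
  # Chromosome numbers are zero-padded to two digits: "Chr_01.txt", ..., "Chr_23.txt".
  file.path(parent_dir, sample_dir, sprintf("Chr_%02d.txt", seq_len(n_chr)))
}

work <- "C:/ALOHA/Work"  # the example working directory used in this guide

# Two-group layout: <work>/Case/<X_Y>/Chr_W.txt and <work>/Control/<X_Y>/Chr_W.txt
case_files    <- chromosome_files(file.path(work, "Case"), "ALL_01")
control_files <- chromosome_files(file.path(work, "Control"), "NC_01")

# One-group layout: <work>/<X_Y population>/<X_Y sample>/Chr_W.txt
asian_files <- chromosome_files(file.path(work, "Normal_Asian"), "Normal_01")

# Paths that do not exist yet on disk (all of them on a machine without the data).
missing_case_files <- case_files[!file.exists(case_files)]

head(case_files, 2)
```

Forward slashes are used because R accepts them in Windows paths; the directories themselves must still match the case-sensitive names “Case” and “Control”.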
The data file for each chromosome of each sample contains six columns with the headers “Probe_set”, “Chr”, “Phy_position”, “Genotype”, “Chiptype”, and “AF”. Please refer to Example 2 in Section 7.2 for details.

7. TWO TEST EXAMPLES

ALOHA provides two test examples generated by Monte Carlo procedures. The first example demonstrates an analysis of two groups, i.e., a case group and a control group. The second example demonstrates an analysis of one group with three populations. The data of these two examples are provided in the directory “C:\ALOHA\EXAMPLE”.

7.1 Example 1: A two-group (case-control) analysis

This is an example of two groups (case vs. control). It consists of two cancer patients and ten normal controls, and the data are provided in the directory “C:\ALOHA\EXAMPLE\Test1”. Allele frequency data of the two cancer patients are saved in the directory “C:\ALOHA\EXAMPLE\Test1\Case” and data of the ten normal controls are saved in the directory “C:\ALOHA\EXAMPLE\Test1\Control”. Directory names for the two patients are “Abnorm_01_F” and “Abnorm_02_M”; directory names for the ten controls are “Norm_001_F”, …, “Norm_010_M”. Each of these directories provides allele frequency data for 23 chromosomes.

This example is the default example of ALOHA and can be run simply by pressing the “Run” button (with Test1 keyed in as the directory of data input in Item 1). Note that commands, file names, and directory names are case sensitive. ALOHA then starts the analysis and the message “Please wait a while, ALOHA is running…” is shown in the command line. When the computation is finished, the message “Computation of ALOHA is finished.” is shown to signal completion. The computation takes about 15 minutes on a machine with an Intel Core2 Duo E8400 3.00GHz CPU and 3.25G of DDR2 RAM.
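Because a run takes several minutes, it can be worthwhile to confirm beforehand that an input file carries the six expected column headers. The following is a minimal sketch and not part of ALOHA: has_expected_header is a hypothetical helper, and since this guide does not state the column delimiter, the sketch assumes the columns are separated by whitespace (tabs or spaces).

```r
expected_header <- c("Probe_set", "Chr", "Phy_position", "Genotype", "Chiptype", "AF")

# Hypothetical helper (not part of ALOHA): TRUE if the first line of the file
# holds exactly the six expected column names, in order.
# Assumption: columns are separated by whitespace (tabs or spaces).
has_expected_header <- function(path) {
  first_line <- scan(path, what = "", nlines = 1, quiet = TRUE)
  identical(first_line, expected_header)
}

# Self-contained demonstration on a temporary file that holds only a header line.
demo_file <- tempfile(fileext = ".txt")
writeLines(paste(expected_header, collapse = "\t"), demo_file)
has_expected_header(demo_file)
```

For the default example, the same function could be applied to each “Chr_W.txt” file under “C:\ALOHA\EXAMPLE\Test1\Case” and “C:\ALOHA\EXAMPLE\Test1\Control”.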
Results of the analysis are saved automatically in the output directory “C:\ALOHA\OUTPUT\Test_Example_Output\Test1”, which contains three subdirectories, “Graphical result”, “Numerical result”, and “Sample list”, and a file “Log.txt”. In addition, a file “Data description.txt” describing the study data and the parameter settings of the analysis is provided in the directory “Numerical result”. In this illustrative example, the graphical results are shown in Figures 3 to 6. For an explanation of these figures, please refer to the ALOHA paper (Yang et al., BMC Genomics, 2010).

Figure 3. Chromosomal aberration plots of the first cancer patient Abnorm_01_F.

Figure 4. Chromosomal aberration plots of the second cancer patient Abnorm_02_M.

Figure 5. Allele frequency biplots of the two cancer patients and 10 normal controls.

Figure 6. Combined AI plot and combined LOH plot of two cancer patients. (A) Combined AI plot, and (B) Combined LOH plot. The upper-left subplot shows the status of AI/LOH in genomic regions for each study sample. Blue denotes no occurrence of AI/LOH and red denotes occurrence of AI/LOH. The lower-left subplot provides the proportion (%) of samples carrying AI/LOH aberrations in specific genomic regions. The upper-right subplot provides the proportion (%) of genomic regions carrying AI/LOH aberrations in a study sample. In this subplot, a male is indicated by sky blue and a female is indicated by pink. The area displayed by purple bars with left-slanting lines indicates chromosomal aberration. For example, the first sample is a female and ~3% of her genome carries chromosomal aberrations (note that ~97% of her genome is normal, but the subplot is truncated at an aberration proportion < 4%).

7.2 Example 2: A one-group analysis with three populations

This example provides an analysis of one group with three populations.
This example consists of 15 samples from African (YRI), Caucasian (CEU) and Asian populations, with five samples per population. The data are provided in the directory “C:\ALOHA\EXAMPLE\Test2”. Allele frequency data for the three populations are provided in the sub-directories “Normal_Asian”, “Normal_CEU”, and “Normal_YRI” under the directory “C:\ALOHA\EXAMPLE\Test2\”. The input data format is the same as in the previous example (Example 1).

This example can be run simply by checking “One group”, keying in Test2 as the directory of data input, and pressing the “Run” button. Note that commands, file names, and directory names are case sensitive. ALOHA then starts the analysis and the message “Please wait a while, ALOHA is running…” is shown in the command line. When the computation is finished, the message “Computation of ALOHA is finished.” is shown to signal completion. The computation takes about 2 minutes on a machine with an Intel Core2 Duo E8400 3.00GHz CPU and 3.25G of DDR2 RAM.

Results of the analysis are saved automatically in the output directory “C:\ALOHA\OUTPUT\Test_Example_Output\Test2”, which contains three subdirectories, “Graphical result”, “Numerical result”, and “Sample list”, and a file “Log.txt”. In addition, a file “Data description.txt” describing the data and the parameter settings of the analysis is provided in the directory “Numerical result”. The graphical result of this example is shown in Figure 7. For an explanation of this figure, please refer to the ALOHA paper (Yang et al., BMC Genomics, 2010).

Figure 7. Allele frequency biplots of the 15 samples from three populations: Asian, CEU, and YRI.

8. ALOHA VERSION UPGRADE

Versions:
ALOHA Version 1.0: Jun. 2010
ALOHA Version 1.1: Jul. 2010
ALOHA Version 1.2: Dec. 2011

What are the new features in ALOHA?
In version 1.2, a new function that provides a combined AI plot and a combined LOH plot has been added.