IGG: A Tool to Integrate Gene chips for Genetic Studies
User Manual
Version 2.0
Miao-Xin Li, Lin Jiang
Song You-Qiang and Sham Pak
Biochemistry Department
Genome Research Center
The University of Hong Kong
Pokfulam, Hong Kong SAR, China
AND
Hunan Business College
Changsha 410205, Hunan, China
CONTENT
1. INTRODUCTION
2. INSTALLATION
2.1 INSTALLATION OF JAVA RUNTIME ENVIRONMENT (JRE)
2.2 INSTALLATION OF IGG
3. PREPARATION
3.1 INPUT FILES
3.2 HAPMAP GENOTYPE DATA
3.3 ANNOTATION FILES OF GENECHIPS
4. FUNCTIONS
4.1 INTEGRATE GENOTYPES ACROSS VARIOUS GENECHIPS
4.1.1 Load Pedigrees or Subjects
4.1.2 Load Genotype Files
4.1.3 Integrate Loaded Genotypes
4.2 INTEGRATE HAPMAP GENOTYPES
4.3 EXPORT INTEGRATED DATASETS
5. ISSUES INVOLVING LARGE DATASETS
5.1 CONSERVATIVE ESTIMATE OF THE REQUIRED MAXIMUM JAVA HEAP SIZE
5.2 MERGE DATASETS
6. PROBLEMS AND SOLUTIONS
1. Introduction
IGG (Integration of Genotypes from Genechips) is a Java-based tool with a graphical
interface that integrates genotypes across the high-throughput genotyping platforms of
Affymetrix and Illumina and the HapMap Project. It also provides a series of
functions to control the quality of genotype integration and to flexibly export genotypes
for genetic studies.
2. Installation
2.1 Installation of Java Runtime Environment (JRE)
The JRE is required to run IGG on any operating system (OS). It can be
downloaded for free from http://java.sun.com/javase/downloads/index.jsp. IGG
requires JRE version 1.6 or later. Installing the JRE is straightforward on Windows
and Mac OS. On Linux, users may have to configure the environment variables
manually.
2.2 Installation of IGG
IGG does not yet have an installation wizard. After downloading the package from our
website and decompressing it, you can start IGG from the command prompt of your OS
with the command
java -jar -Xms256m -Xmx512m "./IGG.jar"
In this command, -Xms<size> and -Xmx<size> set the initial and maximum Java heap
sizes for IGG, respectively. A larger initial heap size can speed up the integration. A
higher maximum such as -Xmx512m is suggested when dealing with large amounts of
data. The value, however, should be less than the size of the physical memory. For
Microsoft Windows workstations, a batch file, run.bat, is provided. Double-clicking this
file launches IGG.
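The launch command above can also be assembled programmatically, for example in a small wrapper script. The following is only a sketch: the function name igg_command is hypothetical, and the argument order simply mirrors the command shown in this manual.

```python
def igg_command(initial_mb, maximum_mb, jar_path="./IGG.jar"):
    """Build the argument list that launches IGG with explicit heap sizes.

    initial_mb and maximum_mb are the -Xms and -Xmx values in megabytes.
    The helper name and defaults are illustrative, not part of IGG.
    """
    if initial_mb > maximum_mb:
        raise ValueError("initial heap size must not exceed the maximum heap size")
    return ["java", "-jar", f"-Xms{initial_mb}m", f"-Xmx{maximum_mb}m", jar_path]


if __name__ == "__main__":
    # Prints: java -jar -Xms256m -Xmx512m ./IGG.jar
    print(" ".join(igg_command(256, 512)))
```

The returned list can be passed to subprocess.run if the wrapper should actually start IGG.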
3. Preparation
3.1 Input Files
Two kinds of input files are required by IGG. One is the pedigree/subjects file,
which contains the pedigree structure and the attributes of subjects. The other is
the genotype files exported directly by Affymetrix GTYPE or Illumina BeadStudio.
The specific formats of these input files are given below. All input files are
text-based.
Format of Pedigree/subjects files:
Pedigree ID  Individual ID  Father ID  Mother ID  Gender  Disease  Individual Label
1            100            0          0          1       0        0
1            101            0          0          2       2        0
1            307            100        101        2       2        Tom
1            502            100        101        1       1        John
1            501            100        101        1       1        Kite
1            306            0          0          1       1        Kevin
The first five columns are required and follow the conventional definition used by
almost all popular genetic analysis tools. The column names in the first line indicate
the meaning of each column. For subjects without pedigrees in case-control
studies, the Father and Mother IDs are 0. The sixth column is optional and more
columns can be added. These columns describe phenotypes of subjects. The last
column is the unique identification label used in the genotype files. A blank or 0
denotes that an individual has no corresponding genechips. All columns are delimited
by tab characters.
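As an illustration of the layout described above, here is a minimal Python sketch that reads a pedigree/subjects file. It is not IGG's own parser. It assumes a header line, tab delimiters, the five required columns first and the individual label last; the function name read_pedigree and the dictionary keys are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = 5  # pedigree ID, individual ID, father ID, mother ID, gender


def read_pedigree(handle):
    """Parse a tab-delimited pedigree/subjects file that starts with a header line.

    Columns between the five required ones and the last one are kept as
    phenotypes. A blank or "0" label is stored as None (no genechip).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    subjects = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        if len(row) < REQUIRED_COLUMNS:
            raise ValueError(f"too few columns: {row}")
        # Pad short rows so that a missing trailing label counts as blank.
        row = row + [""] * (len(header) - len(row))
        label = row[len(header) - 1].strip()
        subjects.append({
            "pedigree_id": row[0],
            "individual_id": row[1],
            "father_id": row[2],
            "mother_id": row[3],
            "gender": row[4],
            "phenotypes": dict(zip(header[5:-1], row[5:len(header) - 1])),
            "label": None if label in ("", "0") else label,
        })
    return subjects


if __name__ == "__main__":
    text = (
        "Pedigree ID\tIndividual ID\tFather ID\tMother ID\tGender\tDisease\tIndividual Label\n"
        "1\t100\t0\t0\t1\t0\t0\n"
        "1\t307\t100\t101\t2\t2\tTom\n"
    )
    for subject in read_pedigree(io.StringIO(text)):
        print(subject["individual_id"], subject["label"])
```

Subjects whose label is None have no genotype file to match, which corresponds to the "blank or 0" rule above.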
Format of Affymetrix genotype files:
Sample Name: Tom
##Call Rate Filter Threshold=90.000000
SNP ID         Call
SNP_A-2188145  AA
SNP_A-1813205  BB
SNP_A-1880143  AB
SNP_A-4215517  AA
SNP_A-1828242  AA
SNP_A-2029913  AA
SNP_A-1929900  BB
SNP_A-1818663  AB
SNP_A-2192352  AA
SNP_A-4218271  AA
SNP_A-2253696  AB
SNP_A-2033171  AA
SNP_A-2300162  AB
This file is exported by GTYPE. The first line is the identification label/name of a
subject. This label corresponds to one of the Individual Labels in the pedigree/subjects
file. IGG automatically matches the two labels during integration. Inconsistent labels
will result in the loss of genotypes in the integration output. The second line is the
call rate filter threshold. Genotype calls start from the fourth line: the first column
is the Probe_Set_ID (SNP ID) of the Affymetrix genechip and the second is the
genotype.
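The line-by-line layout described above can be read with a few lines of Python. This is a sketch, not IGG's parser. It assumes whitespace-separated columns, and the function name read_affymetrix is hypothetical.

```python
import io


def read_affymetrix(handle):
    """Read a GTYPE export: sample label on line 1, calls from line 4 onward.

    Returns (label, calls), where calls maps SNP ID to a call such as "AB".
    """
    lines = [line.strip() for line in handle]
    if len(lines) < 3 or not lines[0].startswith("Sample Name:"):
        raise ValueError("first line must start with 'Sample Name:'")
    label = lines[0].split(":", 1)[1].strip()
    calls = {}
    # Line 2 is the call rate threshold and line 3 the column header.
    for line in lines[3:]:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        snp_id, call = fields
        calls[snp_id] = call
    return label, calls


if __name__ == "__main__":
    text = (
        "Sample Name: Tom\n"
        "##Call Rate Filter Threshold=90.000000\n"
        "SNP ID\tCall\n"
        "SNP_A-2188145\tAA\n"
        "SNP_A-1813205\tBB\n"
    )
    print(read_affymetrix(io.StringIO(text)))
```

The returned label is the value that has to match an Individual Label in the pedigree/subjects file.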
Format of Illumina genotype files:
Locus_Name  Individual Label  Allele1  Allele2  GC_Score
rs1867749   Tom               A        B        0.8935
rs1397354   Tom               A        B        0.9440
rs649593    Tom               B        B        0.8923
rs1517342   Tom               A        B        0.8211
rs1517343   Tom               A        B        0.8572
rs1868071   John              B        B        0.9222
rs761162    John              B        B        0.8210
rs911903    John              B        B        0.8966
rs753646    John              A        A        0.8689
rs558912    John              A        B        0.8825
rs357116    John              A        B        0.9335
rs715494    John              A        B        0.9085
rs223201    Kevin             B        B        0.8804
rs213006    Kevin             A        A        0.9480
rs520354    Kevin             B        B        0.9258
rs874515    Kevin             B        B        0.8661
This file is exported by BeadStudio. The first column is the RS ID in the
dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/). The second column,
Individual Label, corresponds to that in the pedigree/subjects file. The third and fourth
columns are the two alleles of the genotype. The last column is ignored by IGG.
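Because one BeadStudio file can hold several subjects, a reader has to group rows by Individual Label. The sketch below shows one way to do that. It assumes tab-delimited columns in the order shown above, and the function name read_illumina is hypothetical rather than part of IGG.

```python
import io
from collections import defaultdict


def read_illumina(handle):
    """Group BeadStudio genotype rows by individual label.

    Returns {label: {locus: genotype}}, where genotype is Allele1 + Allele2.
    The GC_Score column is ignored, as it is by IGG.
    """
    handle.readline()  # skip the header line
    genotypes = defaultdict(dict)
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        locus, label, allele1, allele2 = fields[:4]
        genotypes[label][locus] = allele1 + allele2
    return dict(genotypes)


if __name__ == "__main__":
    text = (
        "Locus_Name\tIndividual Label\tAllele1\tAllele2\tGC_Score\n"
        "rs1867749\tTom\tA\tB\t0.8935\n"
        "rs753646\tJohn\tA\tA\t0.8689\n"
    )
    print(read_illumina(io.StringIO(text)))
```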
3.2 HapMap Genotype Data
To integrate HapMap genotypes, you have to download the HapMap genotype files
from the website of the HapMap Project (http://www.hapmap.org/genotypes/?N=D).
Downloaded genotype files can be imported into IGG through the menu item
Import Genotype in the Hapmap menu. Do not rename these files before importing
them, because IGG selectively retrieves SNPs from them according to the file
names defined by HapMap. You need not download the pedigree and subject
information of the HapMap samples since they are already built into the system.
Besides the genotype files, annotation data of the HapMap SNPs are also required
before you can integrate the HapMap genotypes into your own dataset. In the current
version, you can download the annotation data directly through IGG,
Tools->Download Hapmap Annotation.
Figure 1: Dialog box of Download HapMap SNP Annotation
3.3 Annotation Files of Genechips
To make IGG easier to use, IGG 2.0 adds a function to automatically
detect and download the annotation data from the IGG website
(http://bioinfo.hku.hk/iggweb/). Since you may not be interested in all kinds of
Illumina and Affymetrix genechips, you first need to specify your genechips in the
dialog “Download GeneChip Annotations” (Figure 2), Tools->Download Chip
Annotation.
Figure 2: Dialog box of Download GeneChip Annotation
Hint: You need not use third-party tools to download the HapMap and Genechip
annotation data from the IGG website, because IGG uses multi-task and
resumable downloading to speed up the download.
4. Functions
IGG 2.0 focuses on strengthening its integration function and simplifying its usage. It
has just three integration modules: Integrate Genechips according to genomes and
chromosomes; Integrate Genechips according to specific genes and SNPs; and Integrate
HapMap genotypes. After integration, you can export the integrated data into various
formats for statistical genetic analyses. Some basic but important functions of IGG 1.0
have been moved into these modules, such as checking Mendelian inheritance,
comparing allele frequencies between the annotation and the observed dataset, and
checking the consistency of a subject's genotypes across various genechips, if available.
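To make the Mendelian inheritance check concrete, the following Python sketch tests one parents-child trio at one autosomal SNP. It illustrates the general rule only and is not IGG's implementation. It assumes genotypes are two-character strings such as "AB" with no missing calls, and the function name is hypothetical.

```python
def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be inherited from both parents.

    Each genotype is a two-character string such as "AA", "AB" or "BB".
    One child allele must come from the father and the other from the mother.
    """
    first, second = child
    return (first in father and second in mother) or (
        second in father and first in mother
    )


if __name__ == "__main__":
    print(is_mendelian_consistent("AA", "BB", "AB"))  # True
    print(is_mendelian_consistent("AA", "AA", "AB"))  # False
```

A trio that fails this test is the situation in which a child's genotype would be flagged as a Mendelian violation.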
4.1 Integrate Genotypes across various GeneChips
4.1.1 Load Pedigrees or Subjects
Loading pedigrees or subjects is the first step of any integration and analysis in
IGG.
Several alternatives are provided to load a pedigree or subjects file. The first is
through the main menu File->Load Pedigree/Subjects. The second, which is more
convenient, is a popup menu: select the leaf Pedigree/Subjects in the Files tree and
right-click to display it (see Figure 3). Alternatively, you can click the corresponding
accelerator button on the toolbar. IGG can load and recognize only one
pedigree/subjects file.
To remove the loaded file from the tree, select it and either choose the Remove File(s)
item in the right-click popup menu or press the “Delete” key on the keyboard.
Clicking the File->Close All Files menu removes all loaded files, including the
pedigree/subjects file and the genotype files.
Figure 3: A Snapshot of Graphic Interface to Load Pedigree or Subject File
4.1.2 Load Genotype Files
After loading the pedigree or subject information, you need to load the genotype files
exported by the Affymetrix GTYPE and Illumina BeadStudio software. As with the
pedigree/subjects file, you can load and remove genotype files through the main menu
or the popup menu. However, there are two differences:
1. You have to specify the right type of genechip in a dialog (Figure 4) before loading
your genotype files. A mismatch between the genotype files and the genechip type can
cause the genotypes to be ignored during integration.
2. You can add multiple files for each node.
Figure 4: The dialog to specify types of genechips before loading the genotypes
4.1.3 Integrate Loaded Genotypes
After loading the pedigree/subject and genotype data, you can integrate them.
IGG 2.0 provides two ways to integrate: according to Genome/Chromosome and
according to Gene/SNPs. Both are introduced below. Figure 5 shows the dialog for
integration according to Genome/Chromosome. The dialog can be opened by clicking
the Integrate->Genome & Chromosome menu or the corresponding accelerator button
on the toolbar panel of the main frame.
Figure 5: Dialog Box of Integrating Loaded Genotypes according to Genome/Chromosome.
In the dialog box, eight parameters are offered to customize the integration and
exporting.
Chromosomes of Genome: This is the only required parameter for the integration.
Select the chromosomes to integrate.
Genetic Map Region: Set the region for the integration in terms of the genetic
map.
Physical Region: Set the region for the integration in terms of the physical
map of the reference genome.
Frequency Populations: Choose a reference population whose allele
frequencies are output in the final integration result.
Interval of Trimming: Set the interval length used to trim the dataset. The default
length is 0, indicating no trimming. The trimming algorithm is simple
but very useful for genome-wide linkage scans. IGG first splits the selected
chromosomes or regions into even segments of the specified interval length.
It then selects the one SNP with the maximum heterozygosity in each segment for
exporting (illustrated in Figure 6).
Figure 6: Algorithms to trim dataset



Remove Bad Mendel: Once chosen, IGG automatically removes the genotype
of a child when a Mendelian violation is detected within a parents-child trio.
Remove All Missing SNPs: If the genotypes of a SNP are all missing, this SNP
is removed from the final integrated dataset.
Correct Annotation Frequency (required for Linkage): Once selected, IGG
handles the following situation. If a SNP has a zero allele frequency in the
annotation file but a non-zero one in the loaded datasets, IGG retains its
genotypes and corrects the frequency from 0 to 0.000001 in the final integrated
dataset. This correction is necessary for subsequent linkage analysis. Otherwise,
the linkage analysis may abort in tools such as Merlin.
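The Interval of Trimming setting in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not IGG's code: heterozygosity is taken as the expected value 2p(1-p) for a biallelic SNP with allele frequency p, segments are counted from the first SNP position, and the function names are hypothetical.

```python
def heterozygosity(allele_freq):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2.0 * allele_freq * (1.0 - allele_freq)


def trim_snps(snps, interval):
    """Keep the SNP with the highest heterozygosity in each even segment.

    snps is a list of (snp_id, position, allele_freq) tuples; interval uses
    the same unit as position. An interval of 0 means no trimming.
    """
    if interval <= 0 or not snps:
        return list(snps)
    start = min(position for _, position, _ in snps)
    best = {}
    for snp in snps:
        _, position, freq = snp
        segment = int((position - start) // interval)
        current = best.get(segment)
        if current is None or heterozygosity(freq) > heterozygosity(current[2]):
            best[segment] = snp
    return [best[segment] for segment in sorted(best)]


if __name__ == "__main__":
    example = [("rs1", 0.0, 0.10), ("rs2", 0.4, 0.50), ("rs3", 1.2, 0.30)]
    print(trim_snps(example, 1.0))
```

With an interval of 1.0, rs2 is kept over rs1 in the first segment because its allele frequency of 0.5 gives the larger heterozygosity.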
If you are interested not in genotypes of the whole genome but in some specific
genes or SNPs, you can integrate them selectively in another way. Figure 7 shows the
dialog, which is opened through the Integrate->Gene & SNP menu or the corresponding
accelerator button on the toolbar panel of IGG's main frame. The settings on the
dialog are described below.
Figure 7: Dialog Box of Integrating Loaded Genotypes according to Gene/SNPs.

Genes->Type: The way genes are identified. You can choose either the HGNC official
gene symbol (http://www.genenames.org/) or the Entrez Gene ID of NCBI
(http://www.ncbi.nlm.nih.gov). Both are widely used in the biological community.
Genes->From File: Load the gene identifiers from a text file, one gene per line,
just as the example genes appear in the text box above.
Genes->Gene Extension: Set an extension range for each gene. In this
way, SNPs outside a gene but close to it are also included in the
integration.
Genes->Gene Extension->upstream: Distance upstream of the transcription
start site.
Genes->Gene Extension->downstream: Distance downstream of the
transcription end site.
SNPs: List the SNPs you are interested in in the text box.
SNPs->From File: Load the SNPs you are interested in from a text file, one SNP
per line, just as the example SNPs appear in the text box above.
Frequency Populations: The same as in Figure 5.
Interval of Trimming: The same as in Figure 5.
Remove Bad Mendel: The same as in Figure 5.
Correct Annotation Frequency (required for Linkage): The same as in Figure 5.
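The effect of the Gene Extension settings in the list above can be pictured with a short Python sketch. It is a guess at the behaviour, not IGG's code: genes are described by their left and right genomic coordinates plus a strand, upstream is taken to mean before the transcription start in the direction of transcription, and all names are hypothetical.

```python
def extended_region(left, right, strand, upstream, downstream):
    """Return the (low, high) coordinates of a gene after extension.

    left < right are genomic coordinates. On the "-" strand transcription
    starts at the right end, so the upstream extension is added there.
    """
    if strand == "+":
        return max(0, left - upstream), right + downstream
    return max(0, left - downstream), right + upstream


def snps_in_region(snps, low, high):
    """Select the IDs of SNPs whose position lies within [low, high]."""
    return [snp_id for snp_id, position in snps if low <= position <= high]


if __name__ == "__main__":
    low, high = extended_region(1000, 2000, "+", upstream=500, downstream=100)
    snps = [("rs1", 400), ("rs2", 600), ("rs3", 2100), ("rs4", 2200)]
    print(snps_in_region(snps, low, high))  # ['rs2', 'rs3']
```

Here rs2 and rs3 lie outside the gene itself but inside the extended region, so they would be included.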
Clicking the “Integrate” button on the dialog launches the integration. A progress
bar at the bottom right blinks to indicate that IGG is integrating the data. Brief and
detailed running information is displayed in real time on the bottom-right panels.
The integrated results are shown in the middle tree of the left panel of the
main frame as an integrated dataset. The dataset name is the time at which the
integration started, with the prefix “C”, which stands for integrated geneChip
dataset. Within the dataset, the integrated results are organized by chromosome.
4.2 Integrate HapMap Genotypes
Compared with the previous version, IGG 2.0 greatly simplifies the procedure
to integrate HapMap genotypes. The integration of HapMap genotypes is carried out
after the integration of GeneChip data.
Figure 8 shows the dialog to integrate HapMap genotypes. You can open the
dialog by clicking the main menu Integrate->HapMap Genotype or the corresponding
accelerator button. The settings on the dialog are explained below.
Figure 8: Dialog of Integrating HapMap Genotypes

Integrated GeneChip Datasets: Choose an integrated GeneChip dataset
for the integration. The purpose of this integration is to merge the HapMap
genotypes of the SNPs in the integrated GeneChip dataset into a new dataset.
Sample Population: Choose the HapMap sample population(s) to integrate.
Genotype Files: The HapMap genotype files you loaded for the integration.
These genotype files are downloaded from
http://ftp.hapmap.org/genotypes/?N=D. You have to unzip the downloaded
files before you can load them into IGG. As you will note, the file name defined
by HapMap contains chromosome and sample information. IGG can only recognize
the original file name as defined by HapMap, so please do not change the names
of these genotype files before you load them. Once you load the genotype files,
a successful parse of the file names is shown in the Filter of File Names panel
on the dialog. In addition, IGG can only recognize one pattern of file names at
a time, so please make sure the names of all imported files follow a
consistent pattern.
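The reason renamed files fail can be illustrated with a small Python sketch that extracts the chromosome and population from a file name and checks that a batch follows one pattern. The name pattern used here, for example genotypes_chr1_CEU_r23a_nr.b36_fwd.txt, is an assumption about how HapMap names its files and should be checked against the files you actually downloaded. The regular expression and function name are hypothetical and are not IGG's filter.

```python
import re

# Assumed layout: genotypes_chr<chromosome>_<population>_<release and build>
HAPMAP_NAME = re.compile(
    r"^genotypes_chr(?P<chrom>[0-9]{1,2}|X|Y|M)_(?P<population>[A-Za-z+]+)_(?P<rest>.+)$"
)


def check_batch(file_names):
    """Parse (chromosome, population) from each name and require one pattern.

    Raises ValueError for a name that does not match the assumed layout or
    when the files do not share the same release/build suffix.
    """
    parsed = []
    for name in file_names:
        match = HAPMAP_NAME.match(name)
        if match is None:
            raise ValueError(f"unrecognized file name: {name}")
        parsed.append(match.groupdict())
    if len({item["rest"] for item in parsed}) > 1:
        raise ValueError("file names do not follow one consistent pattern")
    return [(item["chrom"], item["population"]) for item in parsed]


if __name__ == "__main__":
    print(check_batch([
        "genotypes_chr1_CEU_r23a_nr.b36_fwd.txt",
        "genotypes_chr2_CEU_r23a_nr.b36_fwd.txt",
    ]))
```

A file renamed to something like my_data.txt carries no chromosome or population information, so a name-based filter has nothing to parse.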
The integrated results are again shown in the middle tree of the left panel of the
main frame as an integrated dataset. The dataset name is the time at which the
integration started, with the prefix “H” rather than “C”. Here “H” stands for
integrated HapMap dataset. Within the dataset, the integrated results are also
organized by chromosome.
4.3 Export Integrated Datasets
After the integration, you can export the integrated dataset in the format of a specific
statistical genetic analysis tool. The dialog “Export Integrated Data” can be opened by
clicking the menu Tools->Export For Genetic Tools or the corresponding accelerator
button. Figures 8 and 9 show the dialogs to export integrated genechip genotypes and
integrated HapMap genotypes, respectively.
The settings on the dialog are defined below.
Figure 8: Dialog of Exporting integrated genechip data

Integrated GeneChip Datasets: Choose the integrated GeneChip dataset to export.
Phenotypes->Type: Choose the type of phenotypes in the exported output. There are
three types of phenotypes: Quantitative Trait, Affection Status and Covariate. Many
tools treat the three differently.
Phenotypes->Available: Phenotypes in the loaded pedigree file.
Phenotypes->Selected: Phenotypes selected for the output.
Format->Name of Tool: The output format. Each format corresponds to an analysis
tool.
Format->Missing Genotypes: The label that denotes missing genotypes in the output.
Each genetic analysis tool defines its own label, so refer to the documentation of
the tool to set it.
Files->Path: Set the output path of the exported files.
Files->Prefix of FileName: Set the prefix of the exported file names.
Figure 9: Dialog of exporting integrated HapMap Genotypes
For the export of integrated HapMap genotypes, there is only one additional
setting compared with the export of integrated genechip genotypes: the assumed
phenotype value of the HapMap subjects. For example, in a linkage analysis of a
given disease, we can assume the HapMap subjects to be normal and therefore
put 1 into the text field.
5. Issues involving large datasets
Random Access Memory (RAM) is critical for IGG to handle large datasets. Computers
with a large amount of RAM are preferable for integration in terms of both running time
and the size of dataset that can be handled. The size of a dataset is determined by the
number of SNPs on the genechips and the number of genechips analyzed. IGG divides the
SNPs by chromosome and processes them chromosome by chromosome. Although this
“Divide-and-Conquer” strategy strengthens IGG's ability to handle large datasets,
sufficient RAM is still essential for a successful integration.
5.1 Conservative Estimate of the Required Maximum Java Heap Size
We have not yet completed systematic testing of the ability of IGG 2.0 to deal with large
datasets. However, we are confident that IGG 2.0 is more powerful than its earlier
versions in this respect because it adopts more advanced Java data structures. For now,
we reproduce our previous testing results for IGG 1.0 for your reference. Until we update
them, you can regard them as a conservative guide for IGG 2.0.
Testing Results of IGG 1.0
To guide users in dealing with large datasets, we list some testing results of IGG for
large datasets in the following table. These results were generated on a desktop computer with an
Intel Pentium D 2.8 GHz processor, 2.00 GB RAM and the Microsoft Windows XP operating system.
Chromosome 2 was chosen for the test since it carries the largest number of SNPs in almost all of the
genechips. (Hint: IGG can report the number of integrated SNPs.) For other chromosomes, IGG can
handle more chips and needs less time. Based on these results, we give an equation to
CONSERVATIVELY estimate the required maximum Java heap size (-Xmx) for IGG. For instance,
the value of -Xmx for 500 chips and 20,608 SNPs is about 850.7 MB, indicating that a computer with
1 GB RAM is required to handle the dataset as a whole.
Maximum_Heap_Size_required = 47 MB + (0.078 * Number_of_Chip * Number_of_SNP / 1000) MB
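The conservative estimate (47 MB plus 0.078 times the number of chips times the number of SNPs, divided by 1000) is easy to evaluate in a few lines of Python. The sketch below is only a convenience: the function names are hypothetical, and rounding up to a whole megabyte is a choice made here, not something IGG prescribes.

```python
import math


def required_heap_mb(n_chips, n_snps):
    """Conservative maximum heap size in MB for a given dataset size."""
    return 47.0 + 0.078 * n_chips * n_snps / 1000.0


def xmx_flag(n_chips, n_snps):
    """Format the estimate, rounded up to a whole megabyte, as a -Xmx option."""
    return f"-Xmx{math.ceil(required_heap_mb(n_chips, n_snps))}m"


if __name__ == "__main__":
    # 500 chips and 20,608 SNPs give roughly 850.7 MB, as in the text.
    print(round(required_heap_mb(500, 20608), 1))
    print(xmx_flag(500, 20608))
```

For the worked example in the text, the rounded-up option is -Xmx851m, which fits on a computer with 1 GB RAM.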
Results of testing large datasets:
Chip Type            Maximum Number of Chips  Elapsed Time  Number of Retrieved SNPs  Size of Total File
java -jar -Xms1024m -Xmx1024m "./IGG01.jar" (1024 Megabytes RAM)
Affymetrix 250K Nsp  611                      ~48 min       20,608                    2.49 Gigabytes
Affymetrix 50k Xba   2707                     ~108 min      4,787                     2.41 Gigabytes
java -jar -Xms768m -Xmx768m "./IGG01.jar" (768 Megabytes RAM)
Affymetrix 250K Nsp  452                      ~36 min       20,608                    1.83 Gigabytes
Affymetrix 50k Xba   2024                     ~85 min       4,787                     1.80 Gigabytes
java -jar -Xms512m -Xmx512m "./IGG01.jar" (512 Megabytes RAM)
Affymetrix 250K Nsp  291                      ~19 min       20,608                    1.17 Gigabytes
Affymetrix 50k Xba   1335                     ~38 min       4,787                     1.19 Gigabytes
java -jar -Xms256m -Xmx256m "./IGG01.jar" (256 Megabytes RAM)
Affymetrix 250K Nsp  131                      ~6 min        20,608                    547 Megabytes
Affymetrix 50k Xba   652                      ~13 min       4,787                     598 Megabytes
5.2 Merge Datasets
IGG also provides an alternative for computers with relatively small RAM and/or for
huge datasets in which thousands of genechips with high-density SNPs are available. In
these situations, users are advised to split the data into sub-datasets, integrate each
sub-dataset first, and then merge them (Tools->Merge Dataset) in the merge dataset dialog
(Figure 10). For example, the integration of 1000 Affymetrix 250K Nsp and 1000 Illumina
Human Linkage IVb Panel chips can be completed on a computer with 1 GB RAM by
integrating two separate datasets and then merging the results.
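One way to decide how to split a dataset before merging is to invert the conservative heap estimate of Section 5.1 (47 MB + 0.078 * chips * SNPs / 1000). The Python sketch below does this for a single chip type. It is a planning aid under that assumption only: the function names are hypothetical, and IGG itself does not compute batches this way as far as this manual states.

```python
import math


def max_chips_for_heap(heap_mb, n_snps):
    """Largest chip count whose conservative heap estimate fits in heap_mb."""
    return int((heap_mb - 47.0) * 1000.0 / (0.078 * n_snps))


def plan_batches(total_chips, heap_mb, n_snps):
    """Split total_chips into near-equal batches that each fit in heap_mb."""
    per_batch = max_chips_for_heap(heap_mb, n_snps)
    if per_batch < 1:
        raise ValueError("heap size too small for even one chip")
    n_batches = math.ceil(total_chips / per_batch)
    size = math.ceil(total_chips / n_batches)
    return [min(size, total_chips - i * size) for i in range(n_batches)]


if __name__ == "__main__":
    # 1000 chips with 20,608 SNPs on the largest chromosome, 1024 MB heap
    print(plan_batches(1000, 1024, 20608))  # [500, 500]
```

Each batch would be integrated separately and the resulting datasets merged afterwards through Tools->Merge Dataset.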
Figure 10: Merge Dataset Dialog
6. Problems and solutions
If you encounter problems when using IGG, please try the following solutions before
reporting errors to us.
1. Close and reopen IGG.
2. Reboot the computer. This solves many problems.
3. Make sure that a JRE is installed and running.
4. If you cannot see genotypes in the output of IGG, please check whether you
have specified the right genechip type for the genotype files.