April 1, 2014
Yang Li
Lin Liu
STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology
Major Projects for Human Genetics
Human Variation
• HapMap (1,2,3)
• 1,000 Genomes
• 1KGP
Genotype-Phenotype
• dbGaP
• CGEMS
• TCGA
Tests for Association
• χ² Test
• Logistic Regression
χ² Test – Testing the effect
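A minimal sketch of the allelic χ² test on a 2×2 table of allele counts (cases vs. controls). The counts and the function name chi2_2x2 are illustrative assumptions, not values from the lecture. The p-value uses the 1-degree-of-freedom identity p = erfc(sqrt(χ²/2)).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction) for
    the table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula, equivalent to sum((O - E)^2 / E) over the 4 cells.
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical allele counts: rows = cases/controls, columns = allele A/a.
stat, p = chi2_2x2(60, 40, 40, 60)
print(f"chi2 = {stat:.3f}, p = {p:.4g}")
```

A small p-value only says that allele frequencies differ between cases and controls; it does not say how large the difference is.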
Odds Ratio – How strong is the effect
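A minimal sketch of the odds ratio for a 2×2 table, with a Wald (Woolf) confidence interval computed on the log scale. The counts and the function name odds_ratio are illustrative assumptions; the interval method is one common choice, not necessarily the one used in the lecture.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio for the table [[a, b], [c, d]] with an approximate
    95% confidence interval; all four cells must be non-zero."""
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Hypothetical allele counts: rows = cases/controls, columns = allele A/a.
or_hat, lower, upper = odds_ratio(60, 40, 40, 60)
print(f"OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 points the same way as a significant χ² test on the same table.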
Logistic Regression – More flexible testing
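A minimal sketch of logistic regression with an intercept and one predictor, fitted by Newton-Raphson in plain Python. The function name fit_logistic and the data are illustrative assumptions. In a GWAS setting x would typically be a genotype coding (e.g. minor allele count 0/1/2) and further covariates could be added; here x is binary so the fitted odds ratio can be checked against a 2×2 table.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient g and negated Hessian H of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g for the 2x2 matrix H.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: x = 1 for carriers, y = 1 for cases.
x = [1] * 100 + [0] * 100
y = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 60
b0, b1 = fit_logistic(x, y)
print(f"odds ratio per unit of x = {math.exp(b1):.3f}")
```

With a single binary predictor, exp(b1) equals the odds ratio of the corresponding 2×2 table; the flexibility comes from replacing x with other codings or adding covariates.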
PLINK
• http://pngu.mgh.harvard.edu/~purcell/plink/
• The most widely used software for GWAS
• Includes both data preprocessing and
statistical analyses
Data Format for PLINK
• PED files. The PED file is a white-space
(space or tab) delimited file; the first six
columns are mandatory:
– Family ID
– Individual ID
– Paternal ID
– Maternal ID
– Sex (1=male; 2=female; other=unknown)
– Phenotype
• Affection status, by default, should be
coded:
– -9 missing
– 0 missing
– 1 unaffected
– 2 affected
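A minimal sketch of reading one PED line using the six leading columns and the default affection coding listed above. The names PedRecord and parse_ped_line and the example line are illustrative assumptions. Genotype columns (two alleles per marker after column 6) are paired up, and quantitative phenotypes are out of scope for this sketch.

```python
from typing import NamedTuple

class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str
    genotypes: list

SEX = {"1": "male", "2": "female"}            # anything else: unknown
AFFECTION = {"1": "unaffected", "2": "affected"}  # -9 and 0: missing

def parse_ped_line(line):
    """Split a white-space delimited PED line into a PedRecord."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a PED line needs at least six columns")
    fid, iid, pat, mat, sex, pheno = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in allele pairs")
    # Pair consecutive alleles: one (allele1, allele2) tuple per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(fid, iid, pat, mat,
                     SEX.get(sex, "unknown"),
                     AFFECTION.get(pheno, "missing"),
                     genotypes)

# Hypothetical individual with two markers.
rec = parse_ped_line("FAM001 1 0 0 1 2 A A G T")
print(rec.sex, rec.phenotype, rec.genotypes)
```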
• Map Files. By default, each line of the
MAP file describes a single marker and
must contain exactly 4 columns:
– chromosome (1-22, X, Y or 0 if unplaced)
– rs# or SNP identifier
– Genetic distance (morgans)
– Base-pair position (bp units)
• Genetic distance can be specified in
centimorgans with the --cm flag.
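A minimal sketch of reading one MAP line with the four columns described above. The function name parse_map_line and the example marker are illustrative assumptions; the genetic distance is read as morgans, the default.

```python
def parse_map_line(line):
    """Parse one MAP line: chromosome, SNP id, genetic distance, bp position."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("a MAP line must contain exactly 4 columns")
    chrom, snp_id, genetic_dist, bp = fields
    return {
        "chrom": chrom,                   # 1-22, X, Y, or 0 if unplaced
        "snp": snp_id,
        "morgans": float(genetic_dist),
        "bp": int(bp),
    }

# Hypothetical marker on chromosome 1.
marker = parse_map_line("1 snp1 0 1000")
print(marker)
```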
• To save space and time, you can make a
binary PED file (*.bed). This stores the
pedigree/phenotype information in a
separate file (*.fam) and creates an
extended MAP file (*.bim), which contains
the allele names that would otherwise be
lost in the BED file.
To create these files use the command:
• plink --file mydata --make-bed
which creates (by default):
plink.bed (binary file, genotype information)
plink.fam (first six columns of mydata.ped)
plink.bim (extended MAP file: two extra columns = allele names)
Identify/remove highly related
samples
• # Step 1: calculate pairwise genomewide
identity-by-state (IBS)
• plink --bfile v1gwas --genome --geno 0 --mind 1 --out ibsibd
• # Step 2: find the pairs of subjects with
excessive relatedness
• gawk ' $8 > 0.8 ' ibsibd.genome
Creating the GWAS dataset –
filtering bad SNPs.
• # Command to filter samples and SNPs by
missingness, MAF, and HWE.
• plink --bfile v1gwas --mind 0.05 --geno 0.01 --maf 0.01 --hwe 1e-3 --out v2gwas --make-bed
Missing data
• # PLINK commands for analyzing missing
data.
• plink --bfile v2gwas --out v2gwasmiss --missing
Descriptive analyses of the
case/control GWAS data
• # PLINK commands for testing
missingness between cases and controls.
• plink --bfile v2gwas --out v2gwasmisstest --test-missing
• # PLINK commands for generating allele
frequencies.
• plink --bfile v2gwas --out v2gwasfreq --freq
• Allele frequency data (v2gwasfreq.frq)
CHR  SNP        A1  A2  MAF     NCHROBS
1    rs3934834  T   C   0.1679  1084
1    rs3737728  T   C   0.2706  1094
1    rs6687776  T   C   0.1344  1094
1    rs9651273  A   G   0.2866  1092
• A1 and A2 refer to the minor and major
alleles, respectively
• MAF (minor allele frequency) is the
number of occurrences of the minor allele
divided by the number of non-missing
allele observations (NCHROBS)
• NCHROBS is the number of non-missing
allele observations; its maximum value is
two times the sample size.
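A minimal sketch of how MAF and NCHROBS relate to genotype counts at one autosomal SNP. The function name minor_allele_freq is an assumption, and the genotype counts are hypothetical, chosen so that the result matches the MAF and NCHROBS shown for rs3934834; the real genotype counts are not given in the slides.

```python
def minor_allele_freq(n_hom_a1, n_het, n_hom_a2):
    """Return (frequency of allele A1, NCHROBS) from non-missing
    genotype counts, assuming A1 is the minor allele."""
    # Each genotyped individual contributes two allele observations.
    nchrobs = 2 * (n_hom_a1 + n_het + n_hom_a2)
    a1_count = 2 * n_hom_a1 + n_het
    return a1_count / nchrobs, nchrobs

# Hypothetical counts: 15 A1/A1, 152 A1/A2, 375 A2/A2 (542 genotyped).
maf, nchrobs = minor_allele_freq(15, 152, 375)
print(f"MAF = {maf:.4f}, NCHROBS = {nchrobs}")
```

Individuals with a missing genotype are left out of the counts, which is why NCHROBS can fall below two times the sample size.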
• # PLINK commands for generating
genotype frequencies and HWE estimates.
• plink --bfile v2gwas --out v2gwashwe --hardy
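A minimal sketch of a Hardy-Weinberg check from genotype counts, using a χ² goodness-of-fit test with 1 degree of freedom. This is a swapped-in textbook method for illustration: PLINK's own HWE p-values may be computed differently (e.g. with an exact test), so it is not a reimplementation of the command above. The function name hwe_chi2 and the counts are assumptions.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.
    Assumes both alleles are observed (no monomorphic SNPs)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    # Expected genotype counts under HWE: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts with too few heterozygotes.
stat, p_value = hwe_chi2(40, 20, 40)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")
```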
Association Analysis of GWAS
data
• # PLINK commands for association
analyses.
• plink --bfile v2gwas --out v2gwasassoc --assoc
• plink --bfile v2gwas --out v2gwasmodel --model --model-trend --adjust
Output
• Adjustment for multiple comparisons
(v2gwasmodel.model.trend.adjusted)
CHR  SNP        UNADJ       GC          BONF    HOLM    SIDAK_SS  SIDAK_SD  FDR_BH  FDR_BY
10   rs4363506  4.568e-007  4.791e-007  0.2156  0.2156  0.1939    0.1939    0.2156  1
5    rs5014235  7.823e-006  8.124e-006  1       1       0.9751    0.9751    0.9625  1
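A minimal sketch of two of the adjustments in the table above, Bonferroni (BONF) and Benjamini-Hochberg (FDR_BH), written from their textbook definitions. The function names and the p-values are illustrative assumptions; this is not PLINK's code and does not reproduce the output above, which depends on the full set of tested SNPs.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values for four SNPs.
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

With hundreds of thousands of SNPs the Bonferroni factor is large, which is why even a p-value near 1e-7 can end up with an adjusted value well above 0.05.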
MAF and VCF file formats
• MAF (Mutation Annotation Format, not minor
allele frequency) and VCF (Variant Call Format)
are frequently used in cancer genome projects
and the 1000 Genomes Project
• https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification
• http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variantcall-format-version-41