GWAS_incomplete.ppt

Genome wide association studies
(A Brief Start)
Source:
PLoS Comput Biol. 2012 Dec; 8(12): e1002822.
Published online 2012 Dec 27. doi: 10.1371/journal.pcbi.1002822
PMCID: PMC3531285
Chapter 11: Genome-Wide Association Studies
William S. Bush and Jason H. Moore
Fran Lewitter, Editor and Maricel Kann, Editor
And Zhiwu Zhang's lecture and labs
GWAS
• The idea is an epidemiological study of common diseases using the
genome.
• Essentially, GWAS searches the genome for small variations,
called single nucleotide polymorphisms or SNPs, that occur
more frequently in people with a particular disease than in
people without the disease, or vice versa.
• It then uses significance testing to see whether there is any
association between the disease and the location of the
genetic variation.
• First we need to understand what a SNP is.
SNP
• Most human genomes are very similar, but there are locations
on the genome where differences between people are common.
• SNPs are single base-pair changes in the DNA sequence that occur with
high frequency in the human genome.
• SNPs are typically used as markers of a genomic region, with the large
majority of them having a minimal impact on biological systems.
• SNPs can have functional consequences, causing amino acid changes,
changes to mRNA transcript stability, and changes to transcription factor
binding affinity.
• SNPs are by far the most abundant form of genetic variation in the
human genome.
• SNPs typically have two alleles, meaning within a population there are two
commonly occurring base-pair possibilities for a SNP location.
SNP versus Mutation
• The frequency of a SNP is given in terms of the minor allele frequency, or the
frequency of the less common allele.
• A SNP with a minor allele (G) frequency of 0.35 implies that 35% of the allele
copies in a population are G, versus 65% for the more common allele (the major
allele).
• Mutations: rare single-gene disorders are largely caused by extremely rare genetic
variants that ultimately induce a detrimental change to protein function, which
leads to the disease state. Variants with such low frequency in the population are
sometimes referred to as mutations, though they can be structurally equivalent to
SNPs: single base-pair changes in the DNA sequence.
• In the genetics literature, the term SNP is generally applied to common single
base-pair changes, and the term mutation is applied to rare genetic variants.
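The minor allele frequency can be sketched in a few lines. This is only an illustration: the two-letter genotype strings, the toy sample, and the function name `minor_allele_frequency` are assumptions made for the example, not a standard file format or library call.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for a biallelic SNP.

    genotypes: list of two-character strings such as "AG", one per person
    (an illustrative encoding chosen for this sketch).
    """
    # Count every allele copy; each person contributes two.
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    total = sum(counts.values())
    # The minor allele is the one with the smaller count.
    minor, minor_count = min(counts.items(), key=lambda kv: kv[1])
    return minor, minor_count / total

# Toy sample of 10 people (20 allele copies): 13 copies of A, 7 copies of G.
sample = ["AA"] * 4 + ["AG"] * 5 + ["GG"] * 1
allele, maf = minor_allele_frequency(sample)
print(allele, maf)  # G is the minor allele with frequency 7/20
```

Note that the frequency is counted over allele copies (two per person), not over people.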
SNP and GWAS
• Because GWAS examine SNPs across the genome, they represent
a promising way to study complex, common diseases in
which many genetic variations contribute to a person’s
risk.
• This approach has already identified SNPs related to
several complex conditions including diabetes, heart
abnormalities, Parkinson disease, and Crohn disease.
• There is hope that as we do more studies we will
understand more common diseases.
CV/CD hypothesis
• The common disease/common variant hypothesis states that common
disorders are likely influenced by common genetic variation.
• If common genetic variants influence disease, the effect size for any
one variant must be small relative to that found for rare disorders.
• If common disorders show heritability (inheritance in families), then
multiple common alleles must influence disease susceptibility. As
such, the total genetic risk due to common genetic variation must
be spread across multiple genetic factors.
• These two points suggest that traditional family-based genetic
studies are not likely to be successful for complex diseases,
prompting a shift toward population-based studies.
The HapMap Project
• We need to know where the SNPs occur and with what density.
• We also need to figure out which SNPs are related to racial
phenotypes.
• Hence, the International HapMap Project was launched to
understand the SNPs related to race.
• It identified 500,000 SNPs for people of European descent.
LD: Linkage Disequilibrium
• LD: the property of an allele at one SNP being correlated with
an allele at another SNP along a contiguous stretch of the
genome.
• When all alleles are independent we have linkage
equilibrium; when they are dependent, we call it LD.
• Common measures are D′ and r², both defined from allele and
haplotype proportions.
• The idea is that causality is almost impossible to prove in these
studies because of the small effect sizes and indirect
associations. Hence, large-scale studies are required.
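LD between two biallelic SNPs is often summarized by D′ and r², both computed from haplotype and allele proportions. The sketch below assumes those proportions are already known (in practice they are estimated from data); the function name `ld_measures` and the example frequencies are made up for illustration.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic SNPs.

    p_ab: proportion of haplotypes carrying allele A at SNP 1 and B at SNP 2
    p_a:  proportion of allele A at SNP 1
    p_b:  proportion of allele B at SNP 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both SNPs must be polymorphic")
    # D is the departure from independence (linkage equilibrium gives D = 0).
    d = p_ab - p_a * p_b
    # D' rescales |D| by the largest value it could take given the allele proportions.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two alleles.
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared

# Independent alleles: linkage equilibrium.
d_eq, dprime_eq, r2_eq = ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5)
# B only ever occurs together with A: D' is 1 but r^2 is well below 1.
d_ld, dprime_ld, r2_ld = ld_measures(p_ab=0.3, p_a=0.6, p_b=0.3)
print(d_eq, dprime_eq, r2_eq)
print(d_ld, dprime_ld, r2_ld)
```

The second call shows why both measures are reported: D′ can be at its maximum while r² stays modest when the two SNPs have different allele frequencies.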
Genotyping Technology
• Two primary platforms have been used for most GWAS. These include
products from Illumina (San Diego, CA) and Affymetrix (Santa Clara, CA).
• Affymetrix platform prints short DNA sequences as a spot on the chip that
recognizes a specific SNP allele. Alleles (i.e. nucleotides) are detected by
differential hybridization of the sample DNA.
• Illumina on the other hand uses a bead-based technology with slightly
longer DNA sequences to detect alleles. The Illumina chips are more
expensive to make but provide better specificity.
• A study of Africans needs a chip with more SNPs and better overall
genomic coverage than a study of Europeans. This is because African genomes
have had more time to recombine and therefore have less LD between alleles at
different SNPs. More SNPs are needed to capture the variation across the
African genome.
• Next-generation sequencing methods will provide all the DNA
sequence variation in the genome. It is time now to retool for this new
onslaught of data.
Design
• The most common designs are:
• Case-control (binary response)
• Quantitative (continuous response)
• Quantitative traits are easier: ANOVA-like methods are applied to the presence
or absence of each SNP allele (the response is anything that is measured, such
as HDL or LDL).
• For yes/no phenotypes we can use 2 by 2 tables with a chi-square test, or
logistic regression. This study type asks if the allele of a genetic variant is
found more often than expected in individuals with the phenotype of interest
(e.g. with the disease being studied).
• Early calculations on statistical power indicated that this approach could
be better than linkage studies at detecting weak genetic effects.
Common Data
• The most common approach of GWA studies is the case-control setup, which
compares two large groups of individuals, one healthy control group and one case
group affected by a disease.
• For each genotyped SNP, it is then investigated whether the allele frequency is
significantly altered between the case and the control group.
• In such setups, the fundamental unit for reporting effect sizes is the odds ratio.
• If the allele frequency in the case group is much higher than in the control group,
the odds ratio is higher than 1, and vice versa for lower allele frequency.
• Additionally, a P-value for the significance of the odds ratio is typically calculated
using a simple chi-squared test.
• Finding odds ratios that are significantly different from 1 is the objective of the
GWA study because this shows that a SNP is associated with disease.
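The odds ratio described above can be sketched from four allele counts. The counts below are invented for illustration, the function name `allelic_odds_ratio` is hypothetical, and the 95% interval uses the usual large-sample approximation on the log odds ratio, which is one common choice rather than the only one.

```python
import math

def allelic_odds_ratio(case_allele, case_other, control_allele, control_other):
    """Return (odds_ratio, ci_low, ci_high) from allele counts in cases and controls."""
    # Odds of carrying the allele among cases divided by the odds among controls.
    odds_ratio = (case_allele * control_other) / (case_other * control_allele)
    # Large-sample standard error of log(odds ratio).
    se_log = math.sqrt(1 / case_allele + 1 / case_other
                       + 1 / control_allele + 1 / control_other)
    log_or = math.log(odds_ratio)
    ci_low = math.exp(log_or - 1.96 * se_log)
    ci_high = math.exp(log_or + 1.96 * se_log)
    return odds_ratio, ci_low, ci_high

# Made-up counts: the allele is seen 300 times in cases and 200 times in controls,
# out of 1000 allele copies in each group.
or_value, or_low, or_high = allelic_odds_ratio(300, 700, 200, 800)
print(or_value, or_low, or_high)
```

An interval that excludes 1 corresponds to an odds ratio that is significantly different from 1 at roughly the 5% level.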
Data
• The most common type of data appears to be in the form of 2
by 2 tables.
• Let's say we have two groups, disease and not disease, and we
are focusing on the presence and absence of the G allele.
• Essentially, calculate the chi-square test for all the SNPs.
              Disease    Not disease
G               2000        8000
Not G           8000        2000
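As a rough sketch, the chi-square test for the table above can be computed directly. The code assumes the Pearson statistic without continuity correction and uses the fact that, for one degree of freedom, the upper-tail probability is erfc(sqrt(statistic / 2)); the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: G, Not G. Columns: Disease, Not disease (counts from the table above).
chi2, p = chi_square_2x2(2000, 8000, 8000, 2000)
print(chi2, p)
```

With counts this lopsided the statistic is 7200 and the P-value is too small to represent as a floating-point number. In a real GWAS the same test is repeated for every SNP.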
Other types of Data
• Instead of being Disease or Not Disease, the phenotype could
be a measure of a trait, like height or biomass.
• In that case we model the data as a linear model:
• Y = SNP effect + error
• And perform an ANOVA-type analysis.
• However, there are other contributing factors to the model,
and Dr. Zhiwu Zhang talked to us about these kinds of models.
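A minimal sketch of the model Y = SNP effect + error for one SNP follows. It assumes the genotype is coded additively as 0, 1, or 2 copies of an allele (an assumption, since the slides do not fix a coding), and the toy data and the function name `snp_linear_fit` are invented for illustration.

```python
import math

def snp_linear_fit(genotypes, phenotypes):
    """Least-squares fit of phenotype = intercept + slope * genotype.

    Returns (slope, intercept, t_statistic) for the SNP effect.
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares estimates the error variance (n - 2 degrees of freedom).
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(genotypes, phenotypes))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / std_err

# Toy data: 9 individuals, trait value rises with the number of allele copies.
geno = [0, 0, 0, 1, 1, 1, 2, 2, 2]
trait = [1.0, 1.2, 0.8, 1.5, 1.4, 1.6, 2.1, 1.9, 2.0]
slope, intercept, t_stat = snp_linear_fit(geno, trait)
print(slope, intercept, t_stat)
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom to obtain a P-value; that step is left out here to keep the sketch dependency-free.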
Fixed and Random Effect Models
• GLM for GWAS:
Y = SNP + Q (or PCs) + e
(SNP and Q (or PCs) are fixed effects)
• MLM for GWAS:
Y = SNP + Q (or PCs) + Kinship + e
(SNP and Q (or PCs) are fixed effects; Kinship is a random effect)
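To see why the Q term matters, here is a small sketch of the fixed-effect model Y = SNP + Q + e with a single covariate. It assumes Q is a 0/1 subpopulation label and obtains the adjusted SNP effect by residualization (regressing Q out of both the trait and the genotype, which gives the same SNP coefficient as fitting both terms jointly). The data and helper names are invented, and the kinship random effect of the MLM is not modeled.

```python
def residualize(values, covariate):
    """Return residuals of values after a least-squares fit on covariate (with intercept)."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    scc = sum((c - mean_c) ** 2 for c in covariate)
    scv = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    beta = scv / scc
    return [v - mean_v - beta * (c - mean_c) for c, v in zip(covariate, values)]

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx

# Toy data: the trait differs between two subpopulations, and the allele is
# simply more common in the second one. The SNP itself has no effect.
q = [0, 0, 0, 0, 1, 1, 1, 1]                       # subpopulation label (Q)
geno = [0, 0, 0, 1, 1, 2, 2, 2]                    # allele copies
trait = [10.1, 9.9, 10.0, 10.0, 15.0, 15.1, 14.9, 15.0]

naive_effect = slope(geno, trait)                  # Y = SNP + e
adjusted_effect = slope(residualize(geno, q),      # Y = SNP + Q + e
                        residualize(trait, q))
print(naive_effect, adjusted_effect)
```

Without Q the SNP appears to have a large effect; once population structure is accounted for, the estimated effect is essentially zero.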
GLM to GLiM
• The mixed model is obviously a better approach, as we can
model the systematic variation in the data better.
• However, it has been looked at in depth only for a continuous
response and not so much for a binary or categorical
response.
• Hence, the direction is going from general linear mixed
models to generalized linear mixed models, using logistic
regression.
• logit P(Y=1 | X’s) = SNP + Q + K
• (where we incorporate a fixed and a random effect in the
model).
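The logistic part of this model can be sketched for a single SNP with plain Newton-Raphson updates. This is a fixed-effects-only illustration: the Q covariates and the random kinship effect K are left out, the carrier coding (0 or 1) and the counts are invented, and the function name `fit_logistic` is hypothetical.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data: 50 carriers (30 cases, 20 controls) and
# 50 non-carriers (20 cases, 30 controls).
carrier = [1] * 50 + [0] * 50
disease = [1] * 30 + [0] * 20 + [1] * 20 + [0] * 30
b0, b1 = fit_logistic(carrier, disease)
print(b0, b1, math.exp(b1))
```

For a 0/1 predictor, exp(b1) matches the odds ratio from the corresponding 2 by 2 table, here (30 × 30) / (20 × 20) = 2.25, which ties the logistic model back to the case-control analysis.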