genome browser seminar

Introduction to genomes & genome browsers
Content
• Introduction to genomes
• The human genome
• Human genetic variation
  – SNPs
  – CNVs
  – Alternative splicing
• Browsing the human genome
Celia van Gelder
CMBI
UMC Radboud
December 2014
Celia.vanGelder@radboudumc.nl
Exponential Growth in Genomic Sequence Data
[Figure: number of sequenced genomes over time. Milestones marked: first 2 bacterial genomes complete; first eukaryote complete (yeast); first metazoan complete (roundworm, C. elegans).]
http://www.genomesonline.org/
Ebola
The human genome
• Genome: the entire sequence of DNA in a cell
• 3 billion basepairs (3Gb)
• 22 chromosome pairs + X and Y chromosomes
• Chromosome length varies from ~50Mb to ~250Mb
• About 20000 protein-coding genes
(average gene length 3000 bases, but largest known gene is 2.4 Mb (dystrophin))
• The human genome is 99.9% identical among individuals
This means that any two people differ at about 3 million nucleotides.
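The 3 million figure follows from the two numbers above. A quick check, assuming a round 3 billion bp genome and exactly 99.9% identity (both are approximations taken from the slide; the function name is only illustrative):

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 Gb, rounded as on the slide
IDENTITY = 0.999                # 99.9% identical between two individuals


def expected_differences(genome_size_bp: int, identity: float) -> int:
    """Number of positions at which two genomes are expected to differ."""
    return round(genome_size_bp * (1.0 - identity))


print(expected_differences(GENOME_SIZE_BP, IDENTITY))
```

The result is an order-of-magnitude estimate only, since both inputs are rounded.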
Eukaryotic Genomes: more than collections of genes
• Genes & regulatory sequences make up 5% of the genome
– Protein coding genes
– RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
– Structural DNA (centromeres, telomeres)
– Regulation-related sequences (promoters, enhancers, silencers, insulators)
– Parasite sequences (transposons)
– Pseudogenes (non-functional gene-like sequences)
– Simple sequence repeats
The human genome (continued)
• Only 1.2% codes for proteins
• Long introns, short exons
• Large spaces between genes
• More than half consists of repetitive DNA
Example: Alu repeat (~300 bp, > 1 million copies)
From: Molecular Biology of the Cell
(4th edition) (Alberts et al., 2002)
Non coding DNA
Human Genetic Variation
• Genetic variation explains some of the differences among people, such as:
– Blood group
– Eye color, skin color, hair color
– Height
– Higher or lower risk of getting particular diseases
  • Cystic fibrosis, sickle cell disease, diabetes, cancer, arthritis, asthma, etc.
Variations in the Genome
Common sequence variations:
– Polymorphism
– Deletions
– Insertions
– Chromosome translocations
Today’s focus
1. Single Nucleotide Polymorphisms (SNPs)
2. Copy number variations (CNV)
3. Alternative transcripts
Single Nucleotide Polymorphisms (SNPs)
• SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.
• For a variation to be considered a SNP, it must occur in at least 1% of the population.
• SNPs make up about 90% of all human genetic variation and occur every 100 to 300 bases.
• SNPs can occur in coding (gene) and non-coding regions of the genome; <1% alter the protein sequence
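The 1% threshold and the "every 100 to 300 bases" density can be turned into a small sketch. It assumes that "occur in at least 1% of the population" means the fraction of sampled individuals carrying the variant (in practice allele frequency is often used instead), and the function names are illustrative only:

```python
def is_snp(carriers: int, population_size: int) -> bool:
    """True if the variant is seen in at least 1% of the sampled population."""
    if population_size <= 0 or not 0 <= carriers <= population_size:
        raise ValueError("need 0 <= carriers <= population_size, population_size > 0")
    # Integer comparison avoids floating-point rounding at exactly 1%.
    return carriers * 100 >= population_size


def expected_snp_count(genome_size_bp: int) -> tuple:
    """(low, high) number of SNPs if one occurs every 300 or every 100 bases."""
    return genome_size_bp // 300, genome_size_bp // 100


print(is_snp(5, 1000))                    # below the 1% threshold
print(is_snp(10, 1000))                   # exactly 1%
print(expected_snp_count(3_000_000_000))  # range for a ~3 Gb genome
```

For a 3 Gb genome this gives roughly 10 to 30 million SNP positions.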
SNPs
• determine properties like eye color, hair (curly or straight), or if
you can taste bitter or not.
• are used for identification and forensics
• are used for estimating predisposition to disease
• can cause drug side effects and/or non-responsiveness to the drug
• have impact on how humans respond to environmental factors like
bacteria, viruses, toxins and chemicals
• are used to predict specific genetic traits
• are used for classifying patients in clinical trials
• are used for mapping and genome-wide association studies of
complex diseases
SNP - Bitter tasting, TAS2R38
SNP & disease, Alzheimer
Alzheimer's disease (AD) & apolipoprotein E (APOE)
• Apolipoprotein E is a cholesterol carrier that is found in the brain and other
organs. APOE is suspected to be involved in amyloid beta aggregation and
clearance, influencing the onset of amyloid beta deposition.
• APOE contains 2 SNPs that result in 3 possible alleles: E2, E3, E4.

Variant   rs429358   rs7412
E2        T          T
E3        T          C
E4        C          C

• A person who inherits at least one E4 allele will have
a greater chance of developing AD.
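The mapping from the two SNPs to the APOE allele can be written as a lookup. This is a minimal sketch that assumes phased data (the two SNP bases are known per chromosome copy) reported on the same strand as on the slide; the function names are illustrative, and any combination outside E2, E3 and E4 is rejected rather than interpreted:

```python
# (rs429358 base, rs7412 base) -> APOE allele, as listed on the slide
APOE_ALLELES = {
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
}


def apoe_allele(rs429358: str, rs7412: str) -> str:
    """Return the APOE allele for one chromosome copy."""
    key = (rs429358.upper(), rs7412.upper())
    if key not in APOE_ALLELES:
        raise ValueError(f"combination {key} is not one of E2, E3, E4")
    return APOE_ALLELES[key]


def carries_e4(copy_a: tuple, copy_b: tuple) -> bool:
    """True if at least one of the two chromosome copies is E4."""
    return "E4" in (apoe_allele(*copy_a), apoe_allele(*copy_b))


print(apoe_allele("T", "C"))
print(carries_e4(("T", "C"), ("C", "C")))
```

An E3/E4 individual, as in the second call, falls in the group with a greater chance of developing AD.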
Copy Number Variation
• Copy Number Variations (CNVs):
segments of DNA (> 1 kb) that are present at variable copy number
in two or more genomes
• When there are genes in the CNV areas, this can lead to variations
in the number of gene copies between individuals
• CNVs contribute to our uniqueness.
• CNVs can also influence the susceptibility to disease.
• CNVs may either be inherited or caused by de novo mutation
Copy Number Variation
Normal cell: CN=2
Deletion: CN=0, CN=1
Amplification: CN=3, CN=4
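The copy number states in the figure can be expressed as a simple classification. This sketch assumes an autosomal region where the normal diploid copy number is 2 (it does not handle X or Y in males), and the function name is illustrative:

```python
NORMAL_COPY_NUMBER = 2  # diploid autosomal region


def classify_copy_number(cn: int) -> str:
    """Classify an observed copy number relative to the normal diploid state."""
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if cn < NORMAL_COPY_NUMBER:
        return "deletion"
    if cn == NORMAL_COPY_NUMBER:
        return "normal"
    return "amplification"


for cn in range(5):
    print(cn, classify_copy_number(cn))
```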
CNVs & disease
• Many inherited genetic diseases result from CNVs:
– Gene copy number can be elevated in cancer cells
– Autism
– Schizophrenia (dept. human genetics)
– Mental retardation (dept. human genetics)
– Parkinson's disease
• There are CNVs that protect against HIV infection and malaria.
• The contribution of CNV to the common, complex diseases, such as
diabetes and heart disease, is currently less well understood
Alternative splicing
• Defects in alternative splicing have been implicated in many
diseases, including:
– neuropathological conditions such as Alzheimer disease
– cystic fibrosis, and conditions involving growth and developmental defects
– many human cancers, e.g. BRCA1 in breast cancer
– beta-globin in beta-thalassemia
– Parkinson's disease
Annotating & Browsing the Human Genome
Annotating the genome
Annotation: attaching biological information to sequences.
Two main steps:
• identifying elements on the genome
• attaching biological information to these elements.
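The two steps can be pictured as a small record type: first an element is located on the genome, then information is attached to it. This is only a sketch; the class name, fields and the example coordinates are hypothetical and do not follow any particular browser's data model:

```python
from dataclasses import dataclass, field


@dataclass
class GenomicElement:
    """Step 1: an element identified on the genome (where it is, what it is)."""
    chromosome: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" (forward) or "-" (reverse)
    kind: str    # e.g. "gene", "exon", "promoter"
    info: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.start < 1 or self.end < self.start:
            raise ValueError("need 1 <= start <= end")
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")

    @property
    def length(self) -> int:
        return self.end - self.start + 1

    def annotate(self, key: str, value: str) -> None:
        """Step 2: attach a piece of biological information."""
        self.info[key] = value


# Hypothetical coordinates, for illustration only
element = GenomicElement("1", 1_000, 3_999, "+", "gene")
element.annotate("description", "hypothetical example gene")
print(element.length, element.info)
```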
Basic & Advanced Genome Annotation
• Basic:
– Genomic location
– Gene features: exons, introns, UTRs
– Transcript(s)
– Pseudogenes, non-coding RNA
– Protein(s)
– Links to other sources of information
• Advanced:
– Cytogenetic bands
– Polymorphic markers
– Genetic variation, including SNPs & CNVs
– Repetitive sequences
– cDNAs or mRNAs from related species
– Genomic sequence variation
– Regulatory sequences (enhancers, silencers, insulators)
[Human] Genome Browsers
(Not limited to only human data)
• EBI: Ensembl
• NCBI: Map Viewer
• UCSC Genome Browser
Ensembl
©EMBL-EBI
Other Ensembl Installations
©EMBL-EBI (2013)
Organized Data Based on Chromosome Location
Tracks:
– genes & predictions
– variations & repeats
– cross-species comparative data
– & many more types of data, from expression & regulation to mRNA and ESTs
Gene X:
– Description
– Transcript data
– Structure
– Gene Ontology
– Pathway data
– Homologous genes
– Expression data
– etc.
ENSG### Ensembl Gene ID
ENST### Ensembl Transcript ID
ENSP### Ensembl Peptide ID
ENSE### Ensembl Exon ID
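The prefix of an Ensembl ID tells you what kind of object it refers to. A minimal sketch of reading that prefix, assuming human-style IDs of the form ENSG/ENST/ENSP/ENSE followed by digits (IDs from other species carry an extra species code and are rejected here; the function name is illustrative):

```python
import re

ENSEMBL_ID_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}
_ID_PATTERN = re.compile(r"^ENS([GTPE])(\d+)$")


def ensembl_id_type(stable_id: str) -> str:
    """Return 'gene', 'transcript', 'peptide' or 'exon' for a human Ensembl ID."""
    match = _ID_PATTERN.match(stable_id.strip())
    if match is None:
        raise ValueError(f"not a human-style Ensembl ID: {stable_id!r}")
    return ENSEMBL_ID_TYPES[match.group(1)]


print(ensembl_id_type("ENSG00000139618"))  # a gene ID
print(ensembl_id_type("ENST00000000001"))  # made-up digits, transcript prefix
```

The digit part is not validated beyond being numeric, so the function checks the form of an ID, not whether it exists in Ensembl.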
HGNC – a unique name and
symbol for every gene in human
http://www.genenames.org/
Ensembl: An Example
Click for more details
Direction of transcription:
– Above blue line: forward strand
– Below blue line: reverse strand
Ensembl Transcripts
A red transcript comes from Ensembl or VEGA/Havana.
A transcript from the Ensembl annotation pipeline has a number starting with 2 (e.g. MYO6-201).
A transcript with Vega/Havana manual curation has a number starting with 0 (e.g. MYO6-001).
A gold, or merged, transcript is identical between Ensembl automated annotation and
VEGA/Havana manual curation. Only human, mouse, and zebrafish will have gold
transcripts. This transcript can be thought of as stable (unlikely to change), and is coloured
gold. It is assigned a number beginning with 0.
A blue, pink or grey transcript is non-coding. See the 'NON-CODING TRANSCRIPTS' section
below for more.
©EMBL-EBI
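The numbering convention above can be read off a transcript name with a few lines of code. This sketch encodes only the rule as stated on this slide (numbers starting with 2 versus 0); it is an assumption that the rule still applies to other releases, and the function name is illustrative:

```python
def transcript_source(name: str) -> str:
    """Guess the annotation source from a transcript name such as 'MYO6-201'."""
    symbol, sep, number = name.strip().rpartition("-")
    if not sep or not symbol or not number.isdigit():
        raise ValueError(f"expected SYMBOL-NUMBER, got {name!r}")
    if number.startswith("2"):
        return "Ensembl automated annotation"
    if number.startswith("0"):
        # Covers both manually curated and merged (gold) transcripts
        return "Vega/Havana manual curation or merged"
    return "unknown"


print(transcript_source("MYO6-201"))
print(transcript_source("MYO6-001"))
```

Splitting on the last hyphen keeps gene symbols that themselves contain a hyphen intact.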
Synopsis: What can I do with Ensembl?
• View, examine & explore annotated information for any chromosomal
region:
– Genes
– ESTs, mRNAs, alternative transcripts
– Proteins
– SNPs, and SNPs across strains (rat, mouse), populations (human), or
even breeds (dog)
– Homologues and phylogenetic trees across more than 40 species
– Whole genome alignments
– Conserved regions across species
– Gene expression profiles
• Upload your own data and use BLAST/BLAT against any Ensembl genome
• Export sequence, or create a table of gene information
Help
• Glossary
• FAQ
• Help & Documentation -> Tutorials
• Save configuration
• Share this link functionality
• Share this image functionality
• Download