生物信息学基础讲座
第2讲 生物信息学中的生物学
生物学基础知识
生物学基本概念biological concepts
生物学实验技术biological techniques
测序技术 sequencing:大分子序列
NMR/X-ray技术:大分子结构
Microarray技术:基因表达
酵母双杂交Y2H:蛋白-蛋白相互作用
生物学数据 biological data
生物大分子 macromolecules
生物过程 biological processes
序列数据:DNA/RNA/蛋白质序列
结构数据:蛋白质/复合物二维/三维结构
表达数据:microarray/Chip-chip数据
生物学数据库 biological databases
文献数据库:Pubmed
核酸数据库:Genbank/EMBL/DDBJ
基因组数据库:UCSC Genome Browser
蛋白数据库:Genprot/Swissprot/
结构数据库:PDB
表达数据库:GEO/SMD
基因通路数据库:KEGG
其他专业数据库:Flybase/MGI/Wormbase
Biological Nomenclature
Need to know the meaning of:
Species, organism, cell, nucleus, chromosome, DNA
Genome, gene, base, residue, protein, amino acid
Transcription, translation, messenger RNA
Codons, genetic code, evolution, mutation, crossover
Polymer, genotype, phenotype, conformation
Inheritance, homology, phylogenetic trees
Substructure and Effect
Species
Organism
Cell
Affects the
Behaviour of
Affects the
Function of
Nucleus
Protein
Chromosome
Amino Acid
DNA strand
Gene
Base
Prescribes
Folds
into
Cells
Basic unit of life
Different types of cell:
Cells produced by cells
Skin, brain, red/white blood
Different biological function
Cell division (mitosis)
2 daughter cells
Eukaryotic cells
Have a nucleus
Nucleus and Chromosomes
Each cell has nucleus
Rod-shaped particles inside
Different number for species
Are chromosomes
Which we think of in pairs
Human(46),tobacco(48)
Goldfish(94),chimp(48)
Usually paired up
X & Y Chromosomes
Humans: Male(xy), Female(xx)
Birds: Male(xx), Female(xy)
DNA Strands
Chromosomes are same in every cell of organism
Supercoiled DNA (Deoxyribonucleic acid)
Take a human, take one cell
Determine the structure of all chromosonal DNA
You’ve just read the human genome (for 1 person)
Human genome project
13 years, 3.2 billion chemicals (bases) in human genome
Other genomes being/been decoded:
Pufferfish, fruit fly, mouse, chicken, yeast, bacteria
DNA Structure
Double Helix (Crick & Watson)
Nitrogenous Base Pairs
2 coiled matching strands
Backbone of sugar phosphate pairs
Roughly 20 atoms in a base
Adenine Thymine [A,T]
Cytosine Guanine [C,G]
Weak bonds (can be broken)
Form long chains called polymers
Read the sequence on 1 strand
GATTCATCATGGATCATACTAAC
Differences in DNA
DNA differentiates:
We share DNA with
Species/race/gender
Individuals
Primates,mammals
Fish, plants, bacteria
Genotype
DNA of an individual
Genetic constitution
Phenotype
Characteristics of the
resulting organism
Nature and nurture
Genes
Chunks of DNA sequence
Large percentage of human genome
Is “junk”: does not code for proteins
“Simpler” organisms such as bacteria
Between 600 and 1200 bases long
32,000 human genes, 100,000 genes in tulips
Are much more evolved (have hardly any junk)
Viruses have overlapping genes (zipped/compressed)
Often the active part of a gene is split into exons
Seperated by introns
The Synthesis of Proteins
Instructions for generating Amino Acid sequences
DNA double helix is unzipped
One strand is transcribed to messenger RNA
RNA acts as a template
Ribosome translate the RNA into the sequence of amino acids
Amino acid sequences fold into a 3d molecule
Gene expression
Every cell has every gene in it (has all chromosomes)
Which ones produce proteins (are expressed) & when?
Transcription
Take one strand of DNA
Write out the counterparts to each base
G becomes C (and vice versa)
A becomes T (and vice versa)
Change Thymine [T] to Uracil [U]
You have transcribed DNA into messenger RNA
Example:
Start:
GGATGCCAATG
Intermediate: CCTACGGTTAC
Transcribed: CCUACGGUUAC
Genetic Code
How the translation occurs
Think of this as a function:
Input: triples of three base letters (Codons)
Output: amino acid
Example: ACC becomes Threonine (T)
Gene sequences end with:
TAA, TAG or TGA
Genetic Code
A=Ala=Alanine
C=Cys=Cysteine
D=Asp=Aspartic acid
E=Glu=Glutamic acid
F=Phe=Phenylalanine
G=Gly=Glycine
H=His=Histidine
I=Ile=Isoleucine
K=Lys=Lysine
L=Leu=Leucine
M=Met=Methionine
N=Asn=Asparagine
P=Pro=Proline
Q=Gln=Glutamine
R=Arg=Arginine
S=Ser=Serine
T=Thr=Threonine
V=Val=Valine
W=Trp=Tryptophan
Y=Tyr=Tyrosine
Example Synthesis
TCGGTGAATCTGTTTGAT
Transcribed to:
AGCCACUUAGACAAACUA
Translated to:
SHLDKL
Proteins
DNA codes for
Amino acids strings
Fold up into complex 3d molecule
3d structures:conformations
Between 200 & 400 “residues”
Folds are proteins
Residue sequences
strings of amino acids
Always fold to same conformation
Proteins play a part
In almost every biological process
Evolution of Genes: Inheritance
Evolution of species
But actually, it is the genotype which evolves
Caused by reproduction and survival of the fittest
Organism has to live with it (or die before reproduction)
Three mechanisms: inheritance, mutation and crossover
Inheritance: properties from parents
Embryo has cells with 23 pairs of chromosomes
Each pair: 1 chromosome from father, 1 from mother
Most important factor in offspring’s genetic makeup
Evolution of Genes: Mutation
Genes alter (slightly) during reproduction
Caused by errors, from radiation, from toxicity
3 possibilities: deletion, insertion, alteration
Deletion: ACGTTGACTC ACGTGACTC
Insertion: ACGTTGACTC AGCGTTGACTC
Substitution: ACGTTGACTC ACGATGACTT
Mutations are almost always deleterious
A single change has a massive effect on translation
Causes a different protein conformation
Evolution of Genes:
Crossover (Recombination)
DNA sections are swapped
From male and female genetic input to offspring
DNA
Phylogenetic trees
Understand our evolution
Genes are homologous
By looking at DNA seqs
If they share a common ancestor
For particular genes
See who evolved from who
Example:
Mammoth most related to
African or Indian Elephants?
LUCA:
Last Universal Common Ancestor
Roughly 4 billion years ago
Genetic Disorders
Disorders have fuelled much genetics research
Remember that genes have evolved to function
Not to malfunction
Different types of genetic problems
Downs syndrome: three chromosome 21s
Cystic fibrosis:
Single base-pair mutation disables a protein
Restricts the flow of ions into certain lung cells
Lung is less able to expel fluids
Predicting Protein Structure
Proteins fold to set up an active site
Small, but highly effective (sub)structure
Active site(s) determine the activity of the protein
Remember that translation is a function
Always same structure given same set of codons
Is there a set of rules governing how proteins fold?
No one has found one yet
“Holy Grail” of bioinformatics
Protein Structure Knowledge
Both protein sequence and structure
1.3+ Million protein sequences known
Found with projects like Human Genome Project
20,000+ protein structures known
Are being determined at an exponential rate
Found using techniques like X-ray crystallography
Takes between 1 month and 3 years
To determine the structure of a protein
Process is getting quicker
Sequence versus Structure
500000
Protein sequence
Number
400000
300000
200000
100000
Protein structure
0
85
90
95
Year
00
Database Approaches
Slow(er) rate of finding protein structure
Structure is much more conservative than sequence
Still a good idea to pursue the Holy Grail
1.3m genes, but only 2,000 – 10,000 different conformations
First approach to sequence prediction:
Store [sequence,structure] pairs in a database
Find ways to score similarity of residue sequences
Given a new sequence, find closest matches
A good match will possibly mean similar protein shape
E.g., sequence identity > 35% will give a good match
Rest of the first half of the course about these issues
Potential (Big) Payoffs
of Protein Structure Prediction
Protein function prediction
Rational drug design
Protein interactions and docking
Inhibit or stimulate protein activity with a drug
Systems biology
Putting it all together: “E-cell” and “E-organism”
In-silico modelling of biological entities and process
Further Reading
Human Genome Project at Sanger Centre
Talking glossary of genetic terms
http://www.sanger.ac.uk/HGP/
http://www.genome.gov/glossary.cfm
Primer on molecular genetics
http://www.ornl.gov/TechResources/Human_Genome/publicat/
primer/toc.html