生物信息学基础讲座 第2讲 生物信息学中的生物学 生物学基础知识 生物学基本概念biological concepts 生物学实验技术biological techniques 测序技术 sequencing:大分子序列 NMR/X-ray技术:大分子结构 Microarray技术:基因表达 酵母双杂交Y2H:蛋白-蛋白相互作用 生物学数据 biological data 生物大分子 macromolecules 生物过程 biological processes 序列数据:DNA/RNA/蛋白质序列 结构数据:蛋白质/复合物二维/三维结构 表达数据:microarray/Chip-chip数据 生物学数据库 biological databases 文献数据库:Pubmed 核酸数据库:Genbank/EMBL/DDBJ 基因组数据库:UCSC Genome Browser 蛋白数据库:Genprot/Swissprot/ 结构数据库:PDB 表达数据库:GEO/SMD 基因通路数据库:KEGG 其他专业数据库:Flybase/MGI/Wormbase Biological Nomenclature Need to know the meaning of: Species, organism, cell, nucleus, chromosome, DNA Genome, gene, base, residue, protein, amino acid Transcription, translation, messenger RNA Codons, genetic code, evolution, mutation, crossover Polymer, genotype, phenotype, conformation Inheritance, homology, phylogenetic trees Substructure and Effect Species Organism Cell Affects the Behaviour of Affects the Function of Nucleus Protein Chromosome Amino Acid DNA strand Gene Base Prescribes Folds into Cells Basic unit of life Different types of cell: Cells produced by cells Skin, brain, red/white blood Different biological function Cell division (mitosis) 2 daughter cells Eukaryotic cells Have a nucleus Nucleus and Chromosomes Each cell has nucleus Rod-shaped particles inside Different number for species Are chromosomes Which we think of in pairs Human(46),tobacco(48) Goldfish(94),chimp(48) Usually paired up X & Y Chromosomes Humans: Male(xy), Female(xx) Birds: Male(xx), Female(xy) DNA Strands Chromosomes are same in every cell of organism Supercoiled DNA (Deoxyribonucleic acid) Take a human, take one cell Determine the structure of all chromosonal DNA You’ve just read the human genome (for 1 person) Human genome project 13 years, 3.2 billion chemicals (bases) in human genome Other genomes being/been decoded: Pufferfish, fruit fly, mouse, chicken, yeast, bacteria DNA Structure Double Helix (Crick & Watson) Nitrogenous Base Pairs 2 coiled matching strands Backbone of sugar phosphate pairs Roughly 20 atoms in a base Adenine Thymine [A,T] Cytosine Guanine [C,G] Weak bonds (can be broken) Form long chains called polymers Read the sequence on 1 strand GATTCATCATGGATCATACTAAC Differences in DNA DNA differentiates: We share DNA with Species/race/gender Individuals Primates,mammals Fish, plants, bacteria Genotype DNA of an individual Genetic constitution Phenotype Characteristics of the resulting organism Nature and nurture Genes Chunks of DNA sequence Large percentage of human genome Is “junk”: does not code for proteins “Simpler” organisms such as bacteria Between 600 and 1200 bases long 32,000 human genes, 100,000 genes in tulips Are much more evolved (have hardly any junk) Viruses have overlapping genes (zipped/compressed) Often the active part of a gene is split into exons Seperated by introns The Synthesis of Proteins Instructions for generating Amino Acid sequences DNA double helix is unzipped One strand is transcribed to messenger RNA RNA acts as a template Ribosome translate the RNA into the sequence of amino acids Amino acid sequences fold into a 3d molecule Gene expression Every cell has every gene in it (has all chromosomes) Which ones produce proteins (are expressed) & when? Transcription Take one strand of DNA Write out the counterparts to each base G becomes C (and vice versa) A becomes T (and vice versa) Change Thymine [T] to Uracil [U] You have transcribed DNA into messenger RNA Example: Start: GGATGCCAATG Intermediate: CCTACGGTTAC Transcribed: CCUACGGUUAC Genetic Code How the translation occurs Think of this as a function: Input: triples of three base letters (Codons) Output: amino acid Example: ACC becomes Threonine (T) Gene sequences end with: TAA, TAG or TGA Genetic Code A=Ala=Alanine C=Cys=Cysteine D=Asp=Aspartic acid E=Glu=Glutamic acid F=Phe=Phenylalanine G=Gly=Glycine H=His=Histidine I=Ile=Isoleucine K=Lys=Lysine L=Leu=Leucine M=Met=Methionine N=Asn=Asparagine P=Pro=Proline Q=Gln=Glutamine R=Arg=Arginine S=Ser=Serine T=Thr=Threonine V=Val=Valine W=Trp=Tryptophan Y=Tyr=Tyrosine Example Synthesis TCGGTGAATCTGTTTGAT Transcribed to: AGCCACUUAGACAAACUA Translated to: SHLDKL Proteins DNA codes for Amino acids strings Fold up into complex 3d molecule 3d structures:conformations Between 200 & 400 “residues” Folds are proteins Residue sequences strings of amino acids Always fold to same conformation Proteins play a part In almost every biological process Evolution of Genes: Inheritance Evolution of species But actually, it is the genotype which evolves Caused by reproduction and survival of the fittest Organism has to live with it (or die before reproduction) Three mechanisms: inheritance, mutation and crossover Inheritance: properties from parents Embryo has cells with 23 pairs of chromosomes Each pair: 1 chromosome from father, 1 from mother Most important factor in offspring’s genetic makeup Evolution of Genes: Mutation Genes alter (slightly) during reproduction Caused by errors, from radiation, from toxicity 3 possibilities: deletion, insertion, alteration Deletion: ACGTTGACTC ACGTGACTC Insertion: ACGTTGACTC AGCGTTGACTC Substitution: ACGTTGACTC ACGATGACTT Mutations are almost always deleterious A single change has a massive effect on translation Causes a different protein conformation Evolution of Genes: Crossover (Recombination) DNA sections are swapped From male and female genetic input to offspring DNA Phylogenetic trees Understand our evolution Genes are homologous By looking at DNA seqs If they share a common ancestor For particular genes See who evolved from who Example: Mammoth most related to African or Indian Elephants? LUCA: Last Universal Common Ancestor Roughly 4 billion years ago Genetic Disorders Disorders have fuelled much genetics research Remember that genes have evolved to function Not to malfunction Different types of genetic problems Downs syndrome: three chromosome 21s Cystic fibrosis: Single base-pair mutation disables a protein Restricts the flow of ions into certain lung cells Lung is less able to expel fluids Predicting Protein Structure Proteins fold to set up an active site Small, but highly effective (sub)structure Active site(s) determine the activity of the protein Remember that translation is a function Always same structure given same set of codons Is there a set of rules governing how proteins fold? No one has found one yet “Holy Grail” of bioinformatics Protein Structure Knowledge Both protein sequence and structure 1.3+ Million protein sequences known Found with projects like Human Genome Project 20,000+ protein structures known Are being determined at an exponential rate Found using techniques like X-ray crystallography Takes between 1 month and 3 years To determine the structure of a protein Process is getting quicker Sequence versus Structure 500000 Protein sequence Number 400000 300000 200000 100000 Protein structure 0 85 90 95 Year 00 Database Approaches Slow(er) rate of finding protein structure Structure is much more conservative than sequence Still a good idea to pursue the Holy Grail 1.3m genes, but only 2,000 – 10,000 different conformations First approach to sequence prediction: Store [sequence,structure] pairs in a database Find ways to score similarity of residue sequences Given a new sequence, find closest matches A good match will possibly mean similar protein shape E.g., sequence identity > 35% will give a good match Rest of the first half of the course about these issues Potential (Big) Payoffs of Protein Structure Prediction Protein function prediction Rational drug design Protein interactions and docking Inhibit or stimulate protein activity with a drug Systems biology Putting it all together: “E-cell” and “E-organism” In-silico modelling of biological entities and process Further Reading Human Genome Project at Sanger Centre Talking glossary of genetic terms http://www.sanger.ac.uk/HGP/ http://www.genome.gov/glossary.cfm Primer on molecular genetics http://www.ornl.gov/TechResources/Human_Genome/publicat/ primer/toc.html