• 3 main parts
  – 1) Cell, DNA, genes, transcription, translation
  – 2) The genome, genes, differences between people
  – 3) Genome browsers (bridge to Martin's part)

Life, Cells, Proteins
• The study of life → the study of cells
• Cells are born, do their job, duplicate, die
• All these processes are controlled by proteins

Molecular Biology Background
• Cells – general structure/organization
• Molecules – that make up cells
• Cellular processes – what makes the cell alive

Cells
• The cell is the fundamental working unit of every living organism.
• Humans: trillions of cells (multicellular, metazoa); other organisms such as yeast: one cell (unicellular).
• Different types of cell:
  – Skin, brain, red/white blood
  – Different biological function
• Cells are produced by cells
  – Cell division (mitosis)
  – 2 daughter cells
• Eukaryotic cells
  – Have a nucleus

Two Cell Organizations
• Prokaryotes – lack a nucleus, simpler internal structure, generally much smaller
• Eukaryotes – with a nucleus (containing DNA) and various organelles

Selected organelles…
• Nucleus – contains chromosomes/DNA
• Mitochondria – generate energy for the cell, contain mitochondrial DNA
• Ribosomes – where translation from mRNA to protein takes place (protein synthesis machinery)
• Lysosomes – where protein degradation takes place

Cells can become specialized…

Three domains of life
• Prokaryotes: Bacteria, Archaea
• Eukarya: Eukaryotes
Universal phylogenetic tree. Fig. 1 from: N.R. Pace, Science 276 (1997) 734-740.
Nucleus and Chromosomes
• Each (eukaryotic) cell has a nucleus
• Rod-shaped particles inside
  – Are chromosomes
  – Which we think of in pairs
• Different number for different species
  – Human (46), tobacco (48)
  – Goldfish (94), chimp (48)
  – Usually paired up
• Sex chromosomes
  – Humans (X & Y): male (XY), female (XX)
  – Birds (Z & W): male (ZZ), female (ZW), so here the female carries the two different sex chromosomes

Chromosomes and DNA

DNA
• DNA is a molecule: deoxyribonucleic acid
• Double helical structure (discovered by Watson, Crick & Franklin)
• Chromosomes are densely coiled and packed DNA
[Figure labels: Chromosome, DNA. SOURCE: http://www.microbe.org/espanol/news/human_genome.asp]

DNA Strands
• Chromosomes are the same in every cell of an organism
  – Supercoiled DNA (deoxyribonucleic acid)
• Take a human, take one cell
  – Determine the structure of all chromosomal DNA
  – You’ve just read the human genome (for 1 person)
  – Human Genome Project
• 13 years, 3.2 billion bases in human DNA

• A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides.
• Each nucleotide comprises a phosphate group, a deoxyribose sugar, and one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
• The two chains are held together by hydrogen bonds between the bases.
• Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.

Genes
• The human genome is distributed along 23 pairs of chromosomes.
  – 22 autosomal pairs;
  – the sex chromosome pair, XX for females and XY for males.
• In each pair, one chromosome is paternally inherited, the other maternally inherited.
• Chromosomes are made of compressed and entwined DNA.
• A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.

Central dogma
• The expression of the genetic information stored in the DNA molecule occurs in two stages:
  (i) transcription, during which DNA is transcribed into mRNA;
  (ii) translation, during which mRNA is translated to produce a protein.
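The base-pairing rule and the two stages of the central dogma can be sketched in a few lines of Python. This is a minimal illustration rather than a model of real gene expression: the function names (`complement`, `transcribe`, `translate`) and the toy sequence are assumptions made for this example, the input is taken to be the coding strand, and splicing and regulation are ignored. The codon table is the standard genetic code.

```python
from itertools import product

# Base-pairing rule: G pairs with C, and A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Standard genetic code, with codons enumerated in the base order U, C, A, G
# ('*' marks a stop codon).
RNA_BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}


def complement(dna: str) -> str:
    """Return the base-paired partner strand, position by position."""
    return "".join(PAIRS[base] for base in dna)


def transcribe(coding_strand: str) -> str:
    """Stage (i): the mRNA matches the coding strand, with U instead of T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Stage (ii): look up each codon in turn until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGCCTGA"  # hypothetical toy sequence: start codon, one codon, stop codon
    print(complement(gene))             # TACCGGACT
    print(transcribe(gene))             # AUGGCCUGA
    print(translate(transcribe(gene)))  # MA
```

The translation step is a plain dictionary lookup per triplet; one-letter amino acid codes are used (M = methionine, A = alanine).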
DNA → mRNA → protein
• Other important aspects of regulation: methylation, alternative splicing, etc.
• The correspondence between DNA's four-letter alphabet and a protein's twenty-letter alphabet is specified by the genetic code, which relates nucleotide triplets to amino acids.

[Figure slides: Genetic and physical maps; DNA under electron microscope; 3D model of a section of the DNA molecule; Genetic code; Replication of DNA]

Transcription
• Process of making a single-stranded mRNA using double-stranded DNA as template
• Only genes are transcribed, not all DNA
• A gene has a transcription “start site” and a transcription “stop site”
• (Author note) I also want to include that genes can lie in either direction on the DNA,
• and something about coding strand, template strand, etc. LOOK UP!!
• This also comes back in the exercises.

Gene structure
• Exons and introns
  – Introns are “spliced” out, and are not part of the mature mRNA
• Promoter (upstream) of the gene

Gene expression
• Process of making a protein from a gene as template
• Transcription, then translation
• Can be regulated

Gene Regulation
• Chromosomal activation/deactivation
• Transcriptional regulation
• Splicing regulation
• mRNA degradation
• mRNA transport regulation
• Control of translation initiation
• Post-translational modification

Transcriptional regulation
[Figure labels: TRANSCRIPTION FACTOR, GENE, ACAGTGA, PROTEIN]

Introduction to Bioinformatics LECTURE 2: Section 2.3 Gene annotation: gene finding
READING FRAMES
The DNA is translated per codon (= nucleotide triplet).
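Because translation proceeds three bases at a time, the same strand can be cut into codons in three different ways, depending on where reading starts. A minimal Python sketch, assuming a plain string of bases; the helper name `reading_frames` and the toy sequence are made up for illustration, and incomplete triplets at the end are simply dropped:

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Split seq into codons for each of the three forward reading frames."""
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time, starting at offset.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


if __name__ == "__main__":
    for offset, codons in enumerate(reading_frames("ATGGCCTGATT")):
        print(offset, "-".join(codons))
    # 0 ATG-GCC-TGA
    # 1 TGG-CCT-GAT
    # 2 GGC-CTG-ATT
```

The complementary strand, read in the opposite direction, provides three more frames, so a double-stranded sequence has six possible reading frames in total.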
The sequence:
  …ACGTACGTACGTACGTACGT…
can thus be read as:
  …-ACG-TAC-GTA-CGT-ACG-TAC-GT…
or:
  …A-CGT-ACG-TAC-GTA-CGT-ACG-T…
or:
  …AC-GTA-CGT-ACG-TAC-GTA-CGT-…

OPEN READING FRAMES: ORF
An open reading frame or ORF is a portion of an organism's genome which contains a sequence of bases that could potentially encode a protein.
In a gene, ORFs are located between the start-code sequence (initiation codon) and the stop-code sequence (termination codon).

[Figure slides: OPEN READING FRAMES: ORF; Genetic code: exons/introns; GENE FINDING: intron - exon]

Translation
• Process of making an amino acid sequence from (single-stranded) mRNA
• Each triplet of bases translates into one amino acid
• Each such triplet is called a “codon”
• The translation is basically a table lookup

Genetic code: TRANSLATION RNA → protein
SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm

Differences in DNA
• DNA differentiates:
  – Species/race/gender
  – Individuals
• We share DNA with
  – Primates, mammals
  – Fish, plants, bacteria
• Genotype
  – DNA of an individual
  – Genetic constitution
• Phenotype
  – Characteristics of the resulting organism
  – Nature and nurture

Evolution of Genes: Inheritance
• Evolution of species
  – Caused by reproduction and survival of the fittest
• But actually, it is the genotype which evolves
  – Organism has to live with it (or die before reproduction)
  – Three mechanisms: inheritance, mutation and crossover
• Inheritance: properties from parents
  – Embryo has cells with 23 pairs of chromosomes

Evolution of Genes: Mutation
• Genes alter (slightly) during reproduction
  – Caused by copying errors, radiation, or toxic substances
  – 3 possibilities: deletion, insertion, substitution (alteration)
• Deletion: ACGTTGACTC → ACGTGACTC
• Insertion: ACGTTGACTC → AGCGTTGACTC
• Substitution: ACGTTGACTC → ACGATGACTT

Evolution of Genes: Crossover (Recombination)
• DNA sections are swapped
  – From male and female genetic input to offspring DNA

The Genome
[Figure labels: DNase I sensitive site, Histone modification, Gene, Conserved sequence, SNP]

Genome
• The entire sequence of DNA in a cell
• All cells have the same genome
  – All cells came from repeated duplications starting from the initial cell (zygote)
• Human genome is 99.9% identical among individuals
• Human genome is 3 billion base pairs (bp) long

Genome features
• Genes
• Regulatory sequences
• The above two make up 5% of the human genome
• What’s the rest doing?
  – We don’t know for sure
• “Annotating” the genome
  – Task of bioinformatics

Some genome sizes
Organism                              Genome size (base pairs)
Virus, Phage Φ-X174                   5387 (first sequenced genome)
Virus, Phage λ                        5×10^4
Bacterium, Escherichia coli           4×10^6
Plant, Fritillaria assyriaca          13×10^10 (among the largest known genomes)
Fungus, Saccharomyces cerevisiae      2×10^7
Nematode, Caenorhabditis elegans      8×10^7
Insect, Drosophila melanogaster       2×10^8
Mammal, Homo sapiens                  3×10^9
Note: The DNA from a single human cell has a length of ~1.8 m.

A Bit of History: Sequenced genomes
• Years: 1995, 1996, 1998, 1999, 2000, 2001, 2002, 2004
• Organisms: Haemophilus influenzae, Yeast, C. elegans, Fruit fly, Arabidopsis, Human (draft), Mouse, Human (“finished”)
• Genome sizes: 1.8 Mb, 12 Mb, 100 Mb, 125 Mb, 115 Mb, 2.6 Gb, 3 Gb

A Bit of History
http://www.genomesonline.org/

Annotation
Wikipedia: Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called gene finding, and
2. attaching biological information to these elements.
Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation, which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.

Genome browsing: why present the whole genome?
• Browse genes in their genomic context
• See features in and around a specific gene
• Explore larger chromosome regions
• Search & retrieve information on a gene- and genome-scale
• Investigate genome organization
• Compare genomes

What can we learn about genomes?
• Within one genome: regulatory elements, gene order, chromatin structure…
• Through comparative studies: evolution, conserved regions, rearrangements…
Gene quality and prediction.

Basic Genome Annotation
• Genomic location
• Gene model structures
  – Exons
  – Introns
  – UTRs
• Transcript(s)
  – Pseudogenes
  – Non-coding RNA
• Protein(s)
• Links to other sources of information

Advanced Genome Annotation
• Cytogenetic bands
• Polymorphic markers
• Genetic variation
• Repetitive sequences
• Expressed Sequence Tags (ESTs)
• cDNAs or mRNAs from related species
• Regions of sequence homology

Eukaryotic Genomes: Not only collections of genes
• Protein-coding genes
• RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
• Structural DNA (centromeres, telomeres)
• Regulation-related sequences (promoters, enhancers, silencers, insulators)
• Parasitic sequences (transposons)
• Pseudogenes (non-functional gene-like sequences)
• Simple sequence repeats

Challenges of genome browsers
• Increasing sequence information: 198,879,188,987 nt (Aug 2007)

Eukaryotic Genomes: High fraction non-coding DNA
Source: Mattick, NRG, 2004
• Blue: Prokaryotes
• Black: Unicellular eukaryotes
• Other colors: Multicellular eukaryotes (red = vertebrates)

The Human Genome Project
The idea for the project arose in 1988; it was estimated that the project would take about 20 years to complete.
In 2003 the 3,000,000,000 base pairs had been sequenced.
Only 2% of the genome carries information about proteins. We do not yet know what the remaining 98% is for: is this useless DNA?
We have about 20,000 genes in our genome. That is very few when you consider that a flatworm, with its 350 brain cells, has barely fewer genes.
The question then is: how many proteins can we actually encode with those 20,000 genes?
Half of the genes encode proteins whose function is still unknown.
www.bioinformatica-in-de-klas.nl

Human Genome
• 3 billion base pairs (3 Gb)
• 22 chromosome pairs + X and Y chromosomes
• Chromosome length varies from ~50 Mb to ~250 Mb
• About 22,000 protein-coding genes
  – compare with ~14,000 for the fruit fly and ~19,000 for the nematode C. elegans

Human genome
Source: Molecular Biology of the Cell (4th edition) (Alberts et al., 2002)
• Only 1.2% codes for proteins, 3.5-5% is under selection
• Long introns, short exons
• Large spaces between genes
• More than half consists of repetitive DNA

Variation Along Genome sequence
• Nucleotide usage varies along chromosomes
  – Protein-coding regions tend to have high GC levels
• Genes are not equally distributed across the chromosomes
  – Housekeeping genes are generally in gene-dense areas
  – Gene-poor areas tend to have many tissue-specific genes
Source: Ensembl

Chromosome organisation
Source: Lodish (4th edition)
• DNA packed in chromatin
• Active genes in less dense chromatin (beads-on-a-string)
• Non-active genes often in densely packed chromatin (30-nm fiber)
• Gene regulation by changing chromatin density, methylation/acetylation of the histones
• Limited availability of chromatin information in genome browsers (post-translational histone modifications are currently under investigation with ChIP-on-chip experiments)

Genomic Sequence Conservation
• Not only protein-coding parts are conserved in evolution
• Conserved non-coding genomic sequences can be involved in gene regulation (enhancers, silencers, insulators)
• With the UCSC browser one can examine genomic conservation

Copy Number Variation
• People do not only vary at the nucleotide level (SNPs); short pieces of the genome can be present in a varying number of copies (Copy Number Polymorphisms (CNPs) or Copy Number Variants (CNVs))
• When there are genes in the CNV areas, this can lead to variations in the number of gene copies
between individuals
• With the UCSC browser CNVs can be examined
• (Author note) Work out an example
• (Author note) Possibly also mention that this is used in forensics

Single Nucleotide Polymorphisms (SNPs)
• Sequence variations within a species
• Similar to mutations, but simultaneously present in the population, and generally with little effect
• Used as genetic markers (e.g. a genetic disease is associated with a SNP)
• The Ensembl browser offers a nice SNP view
• (Author notes)
  – How many SNPs are there?
  – Differences between people?
  – Work out a few examples
  – Difference from a mutation
  – SNPdb?

SNPs and mutations

Alternative Transcripts
Source: Wikipedia (http://www.wikipedia.org/)
• (Author note) Work out an example, perhaps the example from the exercise they will do afterwards?

Evolution
• A model/theory to explain the diversity of life forms
• Some aspects known, some not
  – An active field of research in itself
• Bioinformatics deals with genomes, which are end-products of evolution. Hence bioinformatics cannot ignore the study of evolution.

Homology: genome analysis

Proefstuderen MLW: What is bioinformatics?
Homology and evolution
Humans and apes differ in only 1% of their DNA sequence,
and humans and dogs in only 7.5%.
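Percentages like these come from comparing aligned sequences position by position. A minimal Python sketch, assuming two already-aligned sequences of equal length and counting only substitutions (real genome comparisons also have to deal with insertions, deletions and rearrangements); the function name `percent_difference` and the toy sequences are made up for illustration:

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count positions where the two bases are not identical.
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


if __name__ == "__main__":
    # Toy example: two substitutions in ten positions.
    print(percent_difference("ACGTTGACTC", "ACGATGACTT"))  # 20.0
```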
“… endless forms most beautiful and most wonderful …” - Charles Darwin

Evolution
• All organisms share the genetic code
• Similar genes across species
• Probably had a common ancestor
• Genomes are a wonderful resource to trace back the history of life
• Got to be careful though: the inferences may require clever techniques

Genome browsers
• UCSC: http://genome.ucsc.edu/
• NCBI
• Ensembl: http://www.ensembl.org/

Genome browsers can be used to examine many kinds of data
  – Genomic sequence conservation
  – Duplications and deletions of pieces of chromosome (Copy Number Variations, CNVs)
  – Single Nucleotide Polymorphisms (SNPs)
  – Alternative splicing

The Ensembl gene set
• All Ensembl genes start from a known protein or mRNA
• An initial alignment of protein and mRNA to the genome begins the ‘Genebuild’.
[Figure labels: Sequence Assembly, Ensembl gene set, mRNAs, protein]

Ensembl Genes – biological basis
All Ensembl gene predictions are based on proteins and mRNAs in:
• UniProt/Swiss-Prot (manually curated)
• UniProt/TrEMBL
• NCBI RefSeq (manually curated)
[Figure labels: Protein/mRNA, Sequence Assembly, Ensembl Genes]