Last lecture summary

Sequencing strategies
• Hierarchical genome shotgun (HGS) – Human Genome Project
  – “map first, sequence second”
  – clone-by-clone … cloning is performed twice (BAC, plasmid)
• Whole genome shotgun (WGS) – Celera
  – shotgun, no mapping
• Coverage – the average number of reads representing a given nucleotide in the reconstructed sequence. HGS: 8, WGS: 20

Human genome
• 3 billion bp, ~20 000 – 25 000 genes
• Only 1.1 – 1.4 % of the genome sequence codes for proteins.
• State of completion:
  – best estimate – 92.3% is complete
  – problematic unfinished regions: centromeres, telomeres (both contain highly repetitive sequences), some unclosed gaps
  – It is likely that the centromeres and telomeres will remain unsequenced until new technology is developed.
• The genome is stored in databases
  – Primary database – GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide)
  – Additional data and annotation, tools for visualizing and searching
    – UCSC (http://genome.ucsc.edu)
    – Ensembl (http://www.ensembl.org)

New stuff

Personal human genomes
• Personal genomes were not sequenced in the Human Genome Project, to protect the identity of the volunteers who provided DNA samples.
• The following personal genomes were available by July 2011:
  – Japanese male (2010, PMID: 20972442)
  – Korean male (2009, PMID: 19470904)
  – Chinese male (2008, PMID: 18987735)
  – Nigerian male (2008, PMID: 18987734)
  – J. D. Watson (2008, PMID: 18421352)
  – J. C. Venter (2007, PMID: 17803354)
• The HGP sequence is haploid; the sequence maps of Venter and Watson, however, are diploid.

Next generation sequencing (NGS)
• The completion of the human genome was just the start of the modern DNA sequencing era – “high-throughput next generation sequencing” (NGS).
• New approaches reduce time and cost.
• Holy Grail of sequencing – a complete human genome for less than $1000.
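
A side note on the coverage figures quoted in the summary (HGS: 8, WGS: 20): under the idealized assumption that reads are spread uniformly over the genome, the average coverage is simply the total number of sequenced bases divided by the genome length. The sketch below is only an illustration of that arithmetic. The function names are hypothetical, and the 600 bp read length is an assumed value chosen for the example, not a figure tied to either project.

```python
import math


def coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average coverage = total sequenced bases / genome length (uniform reads assumed)."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


GENOME_LENGTH = 3_000_000_000  # ~3 billion bp, as stated in the summary
READ_LENGTH = 600              # ASSUMPTION: illustrative read length in bp

hgs_reads = reads_needed(8, READ_LENGTH, GENOME_LENGTH)   # coverage 8 (HGS)
wgs_reads = reads_needed(20, READ_LENGTH, GENOME_LENGTH)  # coverage 20 (WGS)
print(hgs_reads, wgs_reads)
print(coverage(hgs_reads, READ_LENGTH, GENOME_LENGTH))
```

Real projects need more reads than this estimate suggests, because reads are not placed uniformly and repetitive regions are hard to cover.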
1st and 2nd generation of sequencers
• 1st generation – ABI Prism 3700 (Sanger, fluorescence, 96 capillaries), used in the HGP and at Celera
• The Sanger method still surpasses NGS in read length (600 bp)
• 2nd generation – birth of HT-NGS in 2005. 454 Life Sciences developed the GS 20 sequencer, which combines PCR with pyrosequencing.
• Pyrosequencing – sequencing-by-synthesis
  – Relies on detection of pyrophosphate release on nucleotide incorporation rather than chain termination with ddNTPs.
  – The release of pyrophosphate is detected by a flash of light (chemiluminescence).
  – Average read length: 400 bp
• Roche GS-FLX 454 (successor of GS 20) was used to sequence J. Watson’s genome.

3rd generation
• The 2nd generation still uses PCR amplification, which may introduce base sequence errors or favor certain sequences over others.
• To overcome this, the emerging 3rd generation of sequencers performs single-molecule sequencing (i.e. the sequence is determined directly from one DNA molecule, with no amplification or cloning).
• Compared to the 2nd generation, these instruments offer higher throughput, longer reads (~1000 bp), higher accuracy, a smaller amount of starting material, and lower cost.

Cost per genome
[Figure: cost per genome compared with Moore’s law; value marked on the slide: $1,363. Source: http://www.genome.gov/27541954]

Cost per megabase
[Figure: cost per megabase; values marked on the slide: $5,000 and 1.5 cents. Source: http://www.genome.gov/27541954]

Illumina HiSeq X Ten
• On 14 January 2014 Illumina announced the new HiSeq X Ten Sequencing System.
• Illumina claims they are enabling the $1,000 genome.
• Uses Illumina SBS technology (sequencing-by-synthesis).
• It sells for at least $10 million.

Human Longevity
• 4 March 2014 – Human Longevity was founded by Craig Venter
• Its main aim: to slow down the process of ageing
• The largest human DNA sequencing operation in the world, capable of processing 40,000 human genomes a year.
• DNA data will be combined with other data on the health and body composition of the people whose DNA is sequenced, in the hope of gleaning insights into the molecular causes of aging and age-related illnesses like cancer and heart disease.
• Equipment: 2x Illumina HiSeq X Ten

Which genomes were sequenced?
• http://www.ncbi.nlm.nih.gov/sites/genome
• GOLD – Genomes OnLine Database (http://www.genomesonline.org/)
  – information regarding complete and ongoing genome projects

Important genomics projects
• The analysis of personal genomes has demonstrated how difficult it is to draw medically or biologically relevant conclusions from individual sequences.
• More genomes need to be sequenced to learn how genotype correlates with phenotype.
• 1000 Genomes project (http://www.1000genomes.org/), 2009-2012. Sequence the genomes of at least 1000 people from around the world to create a detailed and medically useful picture of human genetic variation. 2nd generation sequencers are used in 1000 Genomes.
• 10 000 Genomes (UK10K), 2010-2013.
• 100 000 Genomes, started 2012, should be finished in 2017.

Sequence Alignment

What is a sequence alignment?

Two fragments:

CTTTTCAAGGCTTA
GGCTTATTATTGC

Fragment overlaps:

CTTTTCAAGGCTTA
        GGCTTATTATTGC

A fragment that does not overlap exactly:

CTTTTCAAGGCTTA
        GGCTATTATTGC

The same fragment with a gap inserted:

CTTTTCAAGGCTTA
        GGCT-ATTATTGC

What is a sequence alignment?

Reads:

CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG

Reads placed against the consensus:

   CCCCATGGTGGCGGCAGGTGACAG
      CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
           TGGCGGCTCGGGACAGTCGCGCATAAT
     CCATGGTGGTGGCTGGGGATAGTA
                   TGAGGCAGTCGCGCATAATTCCG
TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG   consensus

Why align sequences
• The draft human genome is available
• Automated gene finding is possible
• Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
• One approach: Is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match

Sequence alignment
• Procedure of comparing sequences
• Point mutations – easy

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
gapless alignment

• More difficult example

ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

• However, gaps can be inserted to get something like this (insertion × deletion: indel)

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
gapped alignment

Sequence alphabet

Amino acids:

Side chain                Name            3 letters  1 letter
Positively charged (*)    Arginine        Arg        R
                          Histidine       His        H
                          Lysine          Lys        K
Negatively charged (*)    Aspartic Acid   Asp        D
                          Glutamic Acid   Glu        E
Polar uncharged           Serine          Ser        S
                          Threonine       Thr        T
                          Asparagine      Asn        N
                          Glutamine       Gln        Q
Special                   Cysteine        Cys        C
                          Selenocysteine  Sec        U
                          Glycine         Gly        G
                          Proline         Pro        P
Hydrophobic               Alanine         Ala        A
                          Leucine         Leu        L
                          Isoleucine      Ile        I
                          Methionine      Met        M
                          Phenylalanine   Phe        F
                          Tryptophan      Trp        W
                          Tyrosine        Tyr        Y
                          Valine          Val        V
(*) side chain charge at physiological pH 7.4

Nucleotides:

Adenine   A
Thymine   T
Cytosine  C
Guanine   G

Flavors of sequence alignment
• pair-wise alignment × multiple sequence alignment
• global alignment × local alignment
  – global: align the entire sequence
  – local: stretches of sequence with the highest density of matches are aligned, generating islands of matches or subalignments in the aligned sequences

Evolution
[Figure: common ancestors. Source: wikipedia.org]

Evolution of sequences
• The sequences are the products of molecular evolution.
• When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions.
[Diagram: DNA1 and DNA2, Protein1 and Protein2; sequence similarity, similar 3D structure, similar function]
• Similar sequences produce similar proteins.
• However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5). PMID: 11178260
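
To make the gapless and gapped examples from the "Sequence alignment" slide concrete, the following minimal sketch walks through an alignment column by column and counts matches, mismatches and gap columns. The function name `score_alignment` and the use of "-" as the gap symbol are assumptions made for this illustration. It only scores an alignment that is already given and does not search for the best one.

```python
def score_alignment(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count matches, mismatches and gap columns in a given pairwise alignment.

    Both rows must already be aligned, i.e. have the same length,
    with the gap character marking an indel.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gap"] += 1       # insertion or deletion (indel)
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1  # point mutation
    return counts


# Sequences taken from the slide examples.
reference = "ACGTCTGATACGCCGTATAGTCTATCT"
gapless = score_alignment(reference, "ACGTCTGATTCGCCCTATCGTCTATCT")
gapped = score_alignment(reference, "----CTGATTCGC---ATCGTCTATCT")
print(gapless)  # {'match': 24, 'mismatch': 3, 'gap': 0}
print(gapped)   # {'match': 18, 'mismatch': 2, 'gap': 7}
```

The gapless example has three point mutations. In the gapped example, inserting seven gap columns leaves only two mismatches among the aligned positions.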