This text is meant to serve as a concise outline of the project, describing its background, aims, obstacles, current status, and history.

Healthy Genomes. Outline of the project.

Emanuel Yakobson, Thomas Bettecken, Edward N. Trifonov

Contents

Current campaigning group
  Directorate
  Sympathizers, involved in various ways
The aim of the project
Ethnic dimension
Lack of proper standard genome. Novelty
Genomic signatures of hereditary diseases
Stages of the project
Stage 1
Technical aspects and difficulties
Complete genomes available
References

Current campaigning group

Directorate

Emanuel Yakobson
Professor of Latvian University. Molecular genetic studies. Racial differences in melanoma and obesity.
emanuel.yakobson@gmail.com

Edward N. Trifonov
Professor, University of Haifa and Masaryk University (Czechia). Molecular biophysics. Genetic codes. Genome origin and evolution.
trifonov@research.haifa.ac.il

Indrikis Muiznieks
Professor, Prorector of Latvian University. Molecular Biology, Nucleic Acids Research.
indrikis.muiznieks@lu.lv

Thomas Bettecken
Doctor of Medicine, MPI for Psychiatry, Muenchen. Medical Genetics, Genotyping, DNA sequencing. Chromatin structure.
bettecken@mpipsykl.mpg.de

Sergejs Zinovatnijs
Board Member of Latvian Sport Club Maccabi. Business promotion, Jewish life in Latvia.
sergejs.zinovatnijs@maccabi.lv

Sympathizers, involved in various ways

Jon Entine
Science writer. Genetics. Author of “Abraham's Children”

Zakharia Frenkel
Bioinformatician, University of Haifa.
zfrenkel@yahoo.com

Janis Klovins
Genome Database of Latvian Population. Biobanking, GPCR signaling, Human Genetics, Type 2 Diabetes, Acromegaly
klovins@biomed.lu.lv

Abraham Korol
Professor. Director of Institute of Evolution, University of Haifa. Molecular evolution, recombination.
korol@research.haifa.ac.il

Eugene Levich
Professor, physics, computer science, business promotion
eugenelevich@gmail.com

Paul Pumpens
Professor, Scientific Director, Latvian Biomedical Research and Study Center. Drug research.
paul@biomed.lu.lv

The aim of the project

Any two complete individual genome sequences, ~3×10^9 bases each, are largely identical, except for a few million point changes and other differences (2, 23). The total number of known genetic variants is estimated at tens of millions (34, 35, 45). There is a whole spectrum of differences (41, 45), including single nucleotide polymorphisms (SNPs), deletions, insertions, inversions (17), translocations, and variations in the copy numbers of genes (4) and repeats (26). These differences reflect the ethnicity and ancestry of the individuals, pathological patterns, and individual “healthy” phenotypic traits.

The main idea of the project is to derive from existing and forthcoming individual genome sequences a consensus, an invariable part without the individual changes, to serve as a standard for molecular medical and other studies.

Many of the individual variations are associated with inherited pathological conditions, from mild predispositions to serious life-threatening pathologies. Most of these, fortunately, are recessive, but the chance that they manifest in progeny makes them all highly undesirable. This connection of the changes with genetic diseases and dispositions suggests a natural term for the genome sequence consensus: Healthy Genome, since such a genome would contain none of those variations, whether innocent or pathological. Master Genome, Consensus Genome, Genome Standard, Universal Reference Genome, and PanGenome would be possible synonyms of the Healthy Genome.

Thus, the aim of the project is to construct the consensus Healthy Human Genome, with the primary purpose of having a balanced standard for medical genetic studies.
“A reference genome sequence is clearly needed for research. Without a point of reference and common coordinate, or naming system, research and clinical assay results cannot be reported in ways that allow for inter-lab comparisons and independent validation of research results. ... A basic coordinate system needs to be developed that can accommodate any indel and rearrangements” (34).

Ethnic dimension

Ethnic variations of sequence polymorphisms are well documented (13,14,16-20,24,27,45). A whole new discipline, pharmacogenomics, is developing, based on the different susceptibilities of various ethnicities to drugs, as also reflected in the differential occurrence of disease-associated SNPs (19,20). Many genetic abnormalities are known which are typical of specific ethnic or geographic groups (6,39,40). To name a few: Tay-Sachs disease, characteristic of the Ashkenazi Jewish population (7); cystic fibrosis in Caucasians, especially among Danes (8); diabetes in Puerto Ricans (1); and hypoglycemia in the Faroe Islands (38). The specific ethnic SNPs spread from the geographic location of the respective ethnic group (21), but their higher local occurrence persists.

“There are demonstrated differences in people’s genetic makeup that predisposes some groups to different diseases based solely on their ethnicity” (31).

Thus, there is every reason to expect that the healthy consensus genome sequences derived for specific ethnic groups would contain sequence features, of ethnic pathology (and normality), that differ from the respective general Healthy Genome consensus (2). In other words, in order to study one or another ethnically linked genetic abnormality, one has to have separate genome standards for the respective ethnic groups, such as a Jewish Healthy Genome or a Danish Healthy Genome. These, of course, will be very close to the general standard, though perhaps carrying quite a few ethnically specific differences (2, 32, 45).
This implies that, for reliable detection and characterization of the differences, one has to have an unbiased general consensus for Homo sapiens in which many genome sequences of, desirably, all major distinct ethnic groups are equally represented. These would be, first of all, Han (China), Bengalis (Bangladesh), Germans, Russians, Italians, Yamato (Japan), Punjabi (Pakistan), French, English, Mestizos (Mexico), ..., Yoruba (Nigeria), etc., in descending order of population size (e.g., as listed in (3)). A truly representative, unbiased consensus genome may be derived only by sequence analysis at all consecutive stages of the construction of the consensus. For example, the white race or mongoloid Han and Yamato genomes may show some distinct common features, so that additional care would be needed to avoid biases in the general consensus.

The construction of the Healthy Genomes for various ethnic groups is well justified. It is only fair that every ethnicity be treated with equal medical attention, taking into account all the differences, on the basis of the very latest achievements in genome studies.

Lack of proper standard genomes. Novelty.

Today, on the order of 500-1000 genomes of various healthy and sick individuals are available, at various stages of completion. About 70 of them may so far qualify as complete, fully assembled and mapped genomes (42, and listed below) suitable for the derivation of the standard. Notably, the individual sperm genomes (25) are not suitable for the task, since each of them underwent multiple natural recombinations, and their fertility (and normalcy) status is uncertain. Each of the above 70 or so can be used, of course, as a (temporary) reference for analyzing differences between genomes (2), especially those associated with genetic diseases. However, differences between individual genomes do not fully reflect those between the genomes and the (non-existing) standard.
Currently, apart from a few individual personal genomes (e.g., of C. Venter and of J. Watson), the most frequently used standard is the NCBI human reference genome (11), which is derived from DNA samples of a small number of anonymous donors (12). Comparisons of individual Chinese, Korean and Yoruba genomes with the NCBI standard reveal hundreds of thousands of differences common to these three individuals (2). It appears that these common sites rather represent the (non-existing) standard, while the NCBI reference is the outlier.

Potentially, the sequences from the 1000 Genomes Project (10, 45) could be used for derivation of the standard. This collection, however, is not intended to be ethnically balanced, being geared rather to the best overlap with available databases of SNPs. For example, its sample list consists of genomes of only 4 races (Whites, Blacks, Amerindians and East Asians). It does not even include representatives of the second largest ethnic group in the world, Bengalis (3).

Moreover, since most of the efforts of the genome sequencing projects are geared to the desirably complete collection of SNPs and other structural variants, this task, too, suffers from the lack of a good standard: “The current reference sequence, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete” (34). Also, from a recent account of the 1000 Genomes team: “the interpretation of rare variants in individuals with a particular disease should be within the context of the local (either geographic or ancestry-based) genetic background,” (45), which alludes to the ethnic (geographic) genome standards.

Very much in line with our proposal, one could imagine a sort of logo, a consensus genome: “If the Human Genomes sequence was portrayed in this way, we might replace our arbitrary typespecimen with more natural, biologically accurate” (44).
Thus, no systematic effort to construct a balanced genome standard of Homo sapiens has been made so far. The novelty of the project is in its very target: the Healthy Genomes standards.

Genomic signatures of hereditary diseases

There are over 9000 known genetic disorders and diseases. Although rather detailed haplotypes are known today for many of these diseases, a full genomic sequence characterization of the diseases is not available, and the realization is growing that such full characterization is vitally needed (29). “That there is currently no comprehensive, accurate, and openly accessible database of human disease-causing mutations "is the single greatest failure of modern human genetics," Massachusetts General Hospital's Daniel MacArthur says” (as cited in 29). It is also clear that any such standard can be developed only after the Healthy Genomes become available.

The genomic disease standards can be derived in a way similar to the ethnic genome standards. The consensus of many individual genomes of carriers of the disease has to be compared with the Healthy Genome. The differences revealed may then serve as the genomic signature of the disease, for further studies and analyses.

The personalized complete lists of the abnormalities of individual patients, compared with the standards, will serve as a guide for preventive treatments and genetic consultations.

Stages of the project

1. Derivation of the general Healthy Genome standard. The initial effort will include 10-15 available genomes. There is no necessity to sample and sequence individual genomes at this stage. The exponentially growing number of available sequenced genomes will be sufficient for derivation of the Healthy Genome standard to any level of completeness.

2. Ethnic genome standards. Jewish and Latvian genomes will perhaps be the first to be worked on within the framework of the project, by collecting blood samples, subcontracting genome sequencing centers (at a cost of ~$1000 or less per genome) and constructing the ethnic standards. A large collection of Latvian samples has already been created at the Biomedical Research and Study Centre, University of Latvia. Construction of Korean and Chinese genome standards is in progress in the respective countries. Many nations are likely to follow these examples soon. E.g., a human genome variation map of the major ethnic groups in Malaysia is under construction (30). A similar effort in Singapore is under way (33). And, again, all these national genome standards will make sense only in comparison with the general Homo sapiens genome standard.

3. Genomic signatures of hereditary diseases. The pathological genomic structural differences specific to Jewish populations (like Tay-Sachs, diabetes and others) will be a major focus. With the expected progress of the project and the evaluation of diseases and anomalies specific to or most problematic for Latvians, the respective disease genomic standards will be taken care of as well.

4. Ethnic and medical characterization of individual genomes. This will become possible after the general Healthy Genome standard, the Ethnic Healthy Genomes, and the signatures of pathologies have been derived.

Potentially, these products of the project may become a major set of Genome Standards in high demand. Every nation and every ethnic group would need these standards for the highest efficiency of medical care and personalized medicine. The genomic standards are needed as well in studies of human genomic history and evolution (e.g., 36, 37), forensic analyses, and legacy cases. One can also envisage the analysis and prospective use of genomic sequence signatures of specific talents and professional inclinations.
The quick progress in whole-genome studies and applications makes it rather hard to predict future developments. It is clear, however, that all these developments will require a whole spectrum of genome sequence standards, starting with the Healthy Genome of Homo sapiens.

Stage 1

For derivation of the first version of the Healthy Genome, 10-20 suitable complete genome sequences will be selected from the list of currently available individual genomes (below), in keeping with the rule “largest ethnic groups first”. Further additions would include more ethnicities, in the order of appearance of new suitable sequences in publicly available sources.

The first and subsequent versions of the Healthy Genome will be offered as a commercial product to various users, at a price to be established, together with a package of programs and instructions for use. The package will include the procedure for comparing any genome of interest with the standard, as well as the listing and categorization of the differences.

Technical aspects and difficulties

Derivation of the standard genomes and signatures is a task as formidable as it is important. Technically, this is a multiple alignment problem. The task is to align selected subsets of the sequenced genomes to each other and to derive the reference genomes, or standard hereditary disease signatures, all of them products of the multiple alignments. Major difficulties would be the incompleteness of some of the genomes, sequencing and mapping errors, multiple locations of the same or almost identical genes, numerous tandem repeats with variable copy numbers, sequence inversions, variable copy numbers of genes (4), and possibly other hurdles that may appear in forthcoming studies of genome structure. The construction of such a consensus reference genome has not been attempted so far, and one may expect many surprises.
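As an illustration only, the last step of the task described above (deriving a consensus from sequences that are already aligned, and listing the differences of one genome from it) might be sketched as follows. This is a toy sketch under stated assumptions, not the project's method: the input is assumed to be short, equal-length, pre-aligned strings; the function names (column_consensus, consensus, signature) are hypothetical; and the rule of giving every ethnic group one equal vote is only one possible reading of "equally represented". The multiple alignment itself, and the repeats, inversions and copy number variations listed above, are not addressed.

```python
from collections import Counter, defaultdict


def column_consensus(column, groups):
    """Return the balanced majority base for one alignment column.

    Assumption: each group has a total weight of 1, split equally among
    its members, so a heavily sampled group cannot dominate the vote.
    """
    group_sizes = Counter(groups)
    votes = defaultdict(float)
    for base, group in zip(column, groups):
        votes[base] += 1.0 / group_sizes[group]
    # Highest vote wins; ties are broken alphabetically for determinism.
    return min(votes, key=lambda b: (-votes[b], b))


def consensus(aligned, groups):
    """Build a consensus from equal-length, already aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    return "".join(column_consensus(col, groups) for col in zip(*aligned))


def signature(genome, standard):
    """List (position, standard_base, genome_base) where the two differ."""
    return [(i, s, g) for i, (s, g) in enumerate(zip(standard, genome)) if s != g]


# Toy data (hypothetical): three genomes from group "A", one each from "B" and "C".
aligned = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC", "ACGAAT"]
groups = ["A", "A", "A", "B", "C"]

standard = consensus(aligned, groups)
print(standard)                         # ACGAAC
print(signature(aligned[0], standard))  # [(3, 'A', 'T')]
```

In this toy example a plain majority vote would put T at position 3, because group A is sampled three times; with one vote per group the consensus base is A, and the over-sampled group's variant shows up as a difference instead.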
A package of computer programs has to be developed to build the standard genome from any number of individual genomes, to compare any given genome with the standard, to derive complete lists of differences (signatures) for the individual genomes, to cluster the genomes with similar signatures, and to develop specific full signatures for various genetic diseases. Some programs with similar functions are publicly available and routinely used in the genomic community (2,9).

Complete genomes available (refs as indicated and in Google)

HuRef – C. Venter
J. Watson
G. Church
NA18507 – Yoruba
Desmond Tutu, Bantu (24)
!Gubi, Khoisan (24)
Seong-Jin Kim, SJK – Korean (2)
AK1 – Korean
YH – Chinese
Gordon Moore
Stephen Quake works on sperm genomes
Marjolein Kriek
Hermann Hauser
14 others sequenced by Complete Genomics
Unknown number sequenced by Knome
7 genomes sequenced at high depth by the 1000 Genomes Project (28):
  Han Chinese South (CHS)
  African Caribbean in Barbados (ACB)
  Puerto Rican in Puerto Rico (PUR)
  Peruvian in Lima, Peru (PEL)
  Punjabi in Lahore, Pakistan (PJL)
  Sri Lankan Tamil in the UK (STU)
  Indian Telugu in the UK (ITU)
10 genomes from various labs (23) (10 more – from sick individuals)
20 Korean genomes (27)
Steve Jobs (43)
First Irish Genome
Sitting Bull (Chief)
First Russian Genome (NCBI)
Glenn Close
Five Southern African Genomes (with Tutu)
(to be continued)

References

1. National Diabetes Information Clearinghouse. National Diabetes Statistics, 2011. USA
2. Ahn S.-M. et al. (2009). The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res. 19(9): 1622–1629
3. CIA, The World Factbook.
4. Li J. et al. (2009). Whole Genome Distribution and Ethnic Differentiation of Copy Number Variation in Caucasian and Asian Populations. PLoS ONE 4(11): e7958
5. Macrae F., 12 July 2012. Gene test could soon see if future lovers are compatible.
http://www.dailymail.co.uk/health/article-2172870/Genetest-soon-future-lovers-compatible.html?ito=feeds-newsxml
6. Milunsky A. Ethnicity and Genes. http://www.babyzone.com/pregnancy/fetal_development/genetics_gender/article/ethnicity-genes-disease
7. Sachs B. (1887) On arrested cerebral development with special reference to cortical pathology. Journal of Nervous and Mental Disease 14(9): 541–554
8. Wennberg C, Kucinskas V (1994) Low frequency of the delta F508 mutation in Finno-Ugrian and Baltic populations. Hum. Hered. 44(3): 169–71
9. Axelrod N. et al. (2009) The HuRef Browser: a web resource for individual human genomics. Nucleic Acids Res. 37(Database issue): D1018–D1024
10. Spencer G. (2008) International Consortium Announces the 1000 Genomes Project, EMBARGOED. http://www.1000genomes.org/files/1000GenomesNewsRelease.pdf
11. Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61–65
12. Dewey FE et al. (2011) Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence. PLoS Genetics, Sept. 2011
13. Mori M. et al. (2005) Ethnic differences in allele frequency of autoimmune-disease-associated SNPs. Journal of Human Genetics 50: 264–266
14. Silverberg MS et al. (2007) Refined genomic localization and ethnic differences observed for the IBD5 association with Crohn's disease. European Journal of Human Genetics 15: 328–335
15. Human Genome Project Opens the Door to Ethnically Specific Bioweapons. Project Censored, Apr. 30, 2010 (web source)
16. Ghodke Y. et al. (2011) Profiling single nucleotide polymorphisms (SNPs) across intracellular folate metabolic pathway in healthy Indians. Indian J Med Res 133: 274-279
17. Steinberg KM et al. (2012) Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nature Genetics, published online 01 July 2012
18. Spielman RS et al. (2007) Common genetic variants account for differences in gene expression among ethnic groups. Nature Genetics 39: 226-231
19. Kimchi-Sarfaty C. et al. (2007) Ethnicity-related polymorphisms and haplotypes in the human ABCB1 gene. Pharmacogenomics 8(1): 29-39
20. O’Donnell PH, Dolan ME (2009) Cancer Pharmacoethnicity: Ethnic Differences in Susceptibility to the Effects of Chemotherapy. Clin Cancer Res 15(15): 4806–4814
21. Templeton A. www.faculty.biol.ttu.edu/strauss/Phylogenetics/Readings/Templeton1998
22. Hunt S. (2008) Pharmacogenetics, personalized medicine, and race. Nature Education 1(1)
23. Pelak K, Shianna KV, Ge D, Maia JM, Zhu M, et al. (2010) The Characterization of Twenty Sequenced Human Genomes. PLoS Genet 6(9): e1001111
24. Schuster SC et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463: 943-947
25. Wang J. et al. (2012) Genome-wide Single-Cell Analysis of Recombination Activity and De Novo Mutation Rates in Human Sperm. Cell 150(2): 402-412
26. Trifonov EN (2004) The tuning function of the tandemly repeating sequences: molecular device for fast adaptation. In: Evolutionary Theory and Processes: Modern Horizons, Wasser SP (Ed.), Kluwer Academic Publishers, pp 115-138
27. http://www.bio-itworld.com/2011/09/13/korean-genomeproject-finds-korea-SNPs.html
28. http://www.1000genomes.org/about , 2012
29. http://www.genomeweb.com/mdx/quest-clarity?page=show (T. Vance, A quest for clarity, July/August 2012)
30. http://1mhgvc.kk.usm.my/
31. http://www.examiner.com/article/ethnicity-specificreference-genome-improves-everyone-s-health; Dewey FE et al. (2011) Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence. PLoS Genet 7(9): e1002280
32. http://ethnicgenome.wordpress.com/
33. http://www.statgen.nus.edu.sg/~SGVP/
34. Rosenfeld JA, Mason CE, Smith TM (2012) Limitations of the Human Reference Genome for Personalized Genomics. PLoS ONE 7(7): e40294
35. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29: 308–311
36. Rocca RA, Magoon G, Reynolds DF, Krahn T, Tilroe VO, et al. (2012) Discovery of Western European R1b1a2 Y Chromosome Variants in 1000 Genomes Project Data: An Online Community Approach. PLoS ONE 7(7): e41634
37. Nature 485, Special issue: Peopling the planet (03 May 2012)
38. Akst J. Island disease. The Scientist, August 2012
39. Yakobson E et al. (2003) A single Mediterranean, possibly Jewish, origin for the Val59Gly CDKN2A mutation in four melanoma-prone families. European Journal of Human Genetics 11: 288-296
40. Leachman SA, ... Yakobson E, et al. (2009) Selection criteria for genetic assessment of patients with familial melanoma. Journal of the American Academy of Dermatology 61: 677-684
41. Levy S, ... Venter C. The Diploid Genome Sequence of an Individual Human. PLoS Biol 5(10): e254
42. http://www.completegenomics.com/news-events/pressreleases/Complete-Genomics-Adds-29-High-Coverage-CompleteHuman-Genome-Sequencing-Datasets-to-its-Public-GenomicRepository--119298369.html
43. Lohr S. (2011-10-20) New Book Details Jobs's Fight Against Cancer. The New York Times
44. Weiss KM. http://the-scientist.com/2012/08/17/opinion-whatis-the-human-genome/
45. The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65

Funding

Funding is expected from donations or from governmental support. Currently the project has no support. The Latvian group is concentrating on a possible stipend for a Ph.D. student, co-supervised by professors in Riga and in Haifa. Another prospective source of support is a large European grant (in progress, Latvian University).
All the people involved have to widen the circle of sympathizers, in the hope of eventually finding a sponsor or a few.

Updated November 2, 2012.