Alexei Fedorov, Ph.D.
Associate Professor
Head of Bioinformatics Lab, Department of Medicine
Vice Director, Program in Bioinformatics and Genomics/Proteomics
Tel: (419)-383-5270
Email: alexei.fedorov@utoledo.edu
http://bpg.utoledo.edu/~afedorov/lab/

1 May 2011

Bioinformatics Lab in 2013-2014
PhD students: Shuhao Qiu
Masters students: Ahmed Al-Khudair
Current grants: NSF Career Development 2007-2012, “Investigation of intron cellular roles”

MAJOR GOAL: Bioinformatics Investigation of the Human Genome

Education in Bioinformatics (TWO TYPES OF STUDENTS)
• Students with a computer/math background gain experience in biology (Sam, Andy)
• Students with a biological background gain experience in programming (Dave, Maryam)
• Examples of computational projects:
  Binary-abstracted Markov models and their application to sequence classification
  http://etd.ohiolink.edu/view.cgi?acc_num=mco1271271172
  http://bpg.utoledo.edu/~sshepard/defense/ video
  Genomic MRI
  http://bpg.utoledo.edu/gmri/
  http://www.jove.com/Details.php?ID=2663

Job prospects (example: Ashwin Prakash)
PhD: November 2011, HSC UT
PhD research fellow: from January 2011, Johns Hopkins School of Medicine
Declined offers:
• Cold Spring Harbor Laboratory
• Baylor College of Medicine

The PI’s students received the following awards:
• Jason Bechtel, Outstanding MSBS student in 2008 at HSC UT.
• Theodor Rais, Second/Third Poster award by Ohio Bioinformatics Consortium, 2009.
• Samuel Shepard, Outstanding PhD student in 2010 at HSC UT.
• Lorraine Walters, Undergraduate Research Recognition Award, UT, May 2012.
• Arnab Saha-Mandal, 1) Outstanding MSBS student in 2013 at HSC UT; and 2) Canadian Institutes of Health Research fellowship support ($20,000).
• Jasmine Serpen, 1) Ohio Governor's Thomas Edison Award for Excellence in Biotechnology & Biomedical Technologies, 1st place; and 2) OSERA Biomedical Research/Bioengineering Award, 1st place (for high school students).
Program in Bioinformatics and Genomics/Proteomics (BPG)
• http://hsc.utoledo.edu/depts/bioinfo/
• BPG offers a Certificate in association with the degrees of Doctor of Philosophy (Ph.D.) or Doctor of Medicine (M.D.). BPG also offers a Master of Science in Biomedical Sciences (MSBS).

Two courses in Spring semester:
• Application of Bioinformatics, Proteomics, and Genomics (BIPG 640), or “Advanced Bioinformatics” (should be taken after Dr. Trumbly's “Fundamental Bioinformatics”)
• Introduction to Bioinformatic Computation (BIPG 610)
The main goal of this course is to provide basic programming skills to biological and medical students who may lack a background in computer science. Programming is taught through important biological examples, focusing in particular on the Perl language. No programming skills are required!

In the “Introduction to Bioinformatic Computation” course, rather than doing “cookbook” lab exercises, students work on challenging real-world problems whose resolution advances the field of genome biology. In addition to learning programming and other bioinformatic skills, the students learn how to present the final product of bioinformatic research and how to write a scientific paper on the subject.
• In 2005 the class developed a program to identify novel genes for non-coding RNAs in humans and other mammals. This work resulted in an article in Nucleic Acids Research (publication 44 in the list of publications with IBC students), coauthored by the group of students who actively worked on the project.
• In 2006 the course students created a novel public database (ASMD) and a novel computational resource, “Splicing Potential”. Ten students were coauthors of two manuscripts (publications 51 and 52).
• In 2007 the class participated in the “Genomic MRI” project. Seven of these students are co-authors of a paper in BMC Genomics, 2008 (publication 53).
• The 2008 class continued the “Genomic MRI” project.
They performed whole-genome comparisons for human, chimpanzee, and macaque, and also analyzed the distribution of 4 million SNPs inside and outside MRI regions. The results are in preparation for publication in Genome Research with 6 students among the authors.

Publications with IBC students
54. Prakash A., Shepard S., Mileyeva-Biebesheimer O., He J., Hart B., Chen M., Amarachiniha S., Bechtel J., Fedorov A. Molecular forces shaping human genomic sequence at midrange scales. BMC Genomics 2009, 10:513.
53. Bechtel J.M., Wittenschlaeger T., Dwyer T., Song J., Arunachalam S., Ramakrishnan S.K., Shepard S., Fedorov A. Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structures. BMC Genomics 2008, 9:284.
52. Bechtel J.M., Rajesh P., Ilikchyan I., Deng Y., Mishra P.K., Wang G., Wu X., Afonin K., Grose W., Wang Y., Khuder S., Fedorov A. Calculation of Splicing Potential from the Alternative Splicing Mutation Database. BMC Research Notes 2008, 1:4.
51. Bechtel J.M., Rajesh P., Ilikchyan I., Deng Y., Mishra P.K., Wang G., Wu X., Afonin K., Grose W., Wang Y., Khuder S., Fedorov A. The Alternative Splicing Mutation Database: a hub for investigations of alternative splicing using mutational evidence. BMC Research Notes 2008, 1:3.
44. Fedorov A., Stombaugh J., Harr M.W., Yu S., Nasalean L., Shepelev V. Computer identification of snoRNA genes using a Mammalian Orthologous Intron Database. Nucl. Acids Res. 2005, 33, 4578-4583.
http://www.utoledo.edu/centers/brim/index.html

COURSE: Bioinformatics of Biomarkers and Individualized Medicine, Spring 2012
• Course time line: 14 weeks
• No prerequisites; recommended: Introduction to bioinformatics and molecular biology
• Reserve materials: None
• Unit 1: Biomarker discovery and validation
• Unit 2: Individualized Medicine

Investigation of the human genome

BASE COUNT 846302 a 578512 c 575805 g 843114 t 1703 others
ORIGIN
   1 gaattcaaaa aagaaagaca atgacttgta gctgaagcta tgatcaggaa aagatggggt
  61 ggacggcatt tgagaaaatc aggacagtgg tgtacttatc aaataagaag atctgggcag
 121 aagattgttg aaaaagcaga cacagcactg agtagcagca tggagcagaa aagcataagg
 181 aacaagtagt gcagtgtgcc tgaacatagg atgggaaatt aggaaagata aatggaggct
 241 gactgtggga agccttacat tccaggctta gtggaataag taaatattta aatctcatga
 301 gttcttttct ctctgctttc tatttttcac gacctgaact cacctcccag tgaggagatg
 361 tttccaccta gcactaaaca gtaactagtt cagactatat atttaaaaaa aaaaaaaaaa
 421 aaaaaaaaaa gcagaacagc tcagatcatc cagtgaagtg gtgctactat tatactatta
 481 acggggagat gaaagccaga taagatggag aagtaggaaa tttacgaaac attttaaaag
 541 aaaatttatt tattcatcaa tatttacata aatgtttatt aattctaagt actatagtag
 601 gcacccattt attactttca aaaattgaca atatacaagt taataaaatc atattagttt
 661 cctcttctaa taaaattatc tcactcaaat tcatataact aaaaatacat ttaataaatt
 721 ttatttttaa aatataggcc acttctactc tattcatttt tgcacttaac attctcttgc
 781 tttcaaaaat gtatgaaaaa tttcagttta gtccccacca aatctcaatt tagaccccgg
 841 ataaagagta aataaattaa agagctgtca gaattaaaac actactacag gtctccttca
 901 ctttatggca tagatgaagg caggaaatac tggctgaaaa ttttgtttat gtcaaagatt
 961 ttgatgatta ccatcagaga tctgatatct cagggaagaa aagcctttca tataccactt
1021 aaaaaattct gccaggcgcg gtggctcacg cctgtaatcc cagcactttg ggaggctgag
1081 gtgggcagat cacctgaggt cagaagttcg agaccagcct gaccaacatg gagaaaccct
1141 gtctctacta aaaatacaaa atcagccggg cgtggtggcg catgcctgta atcccagcta
1201 cttgggaggc tgaggcagga gaatcacttg aacccaggag gcagaggttg cggtgagccg
1261 agatcacacc attgcactcc agcctgggca acaagggcga aactctgtct caaaaaaaaa
1321 aaaacttctg gggaaatggt ggcctgcctt gtaacatcta tgtgtcttag agggccatgg
1381 tatgacaccc ttgggcagtc atttatagag tccttccctg accagggaat catcctgcca

... after the first 50 pages ...
[Sequence excerpt: positions 141601 to 143341]

... after next 200 pages ...
[Sequence excerpt: positions 683041 to 684901]

Human chromosome 1: 4,814,628 lines = 100,000 pages = 100 books (1000 pages each)

Nature 2012, Sept 6th, v. 489, p. 46

Lab 2013
THE 1000 GENOMES PROJECT: A GUIDE TO YOUR ANCESTRY

The pattern of human genetic variation is believed to be a key to revealing much about human population history and diversity. The 1000 Genomes Project has sequenced 1092 genomes from different populations. By identifying the sequences that correspond to LWK, GBR, JPT and FIN, we aim to learn more about population genetic patterns and to get a picture of the genetic diversity that exists within these populations. This project uses the 1000 Genomes Project's catalogue of human genetic variation to calculate and compare the genetic differences between 14 populations. I am presenting the results that our bioinformatics lab's team has obtained so far; we are working on writing them up as a paper.
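The pairwise comparison at the heart of this project (counting genetic differences between every two individuals of a population, then binning the counts for a frequency plot) can be sketched roughly as follows. This is only an illustration under stated assumptions: the lab's programs were written in Perl, whereas this sketch uses Python; the genotype coding (0/1/2 per variant site), the definition of a "difference" as a site where two genotypes disagree, the sample IDs, and all function names are hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def count_differences(genotypes_a, genotypes_b):
    """Count sites where two genotype vectors disagree (assumed definition)."""
    return sum(1 for a, b in zip(genotypes_a, genotypes_b) if a != b)


def pairwise_differences(population):
    """Map every unordered pair of sample IDs to its difference count."""
    return {
        (id1, id2): count_differences(population[id1], population[id2])
        for id1, id2 in combinations(sorted(population), 2)
    }


def histogram(values, bin_size):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter((value // bin_size) * bin_size for value in values)


# Toy data: hypothetical sample IDs, genotypes coded 0/1/2 at six sites.
toy_population = {
    "S1": [0, 1, 2, 0, 1, 0],
    "S2": [0, 1, 2, 0, 1, 1],
    "S3": [2, 0, 2, 1, 1, 0],
    "S4": [2, 0, 0, 1, 0, 1],
}

diffs = pairwise_differences(toy_population)
n = len(toy_population)
# n individuals always give n * (n - 1) / 2 unordered pairs.
assert len(diffs) == n * (n - 1) // 2

# Tiny bin size for the toy data; the slides use bins of 0.01 million.
bins = histogram(diffs.values(), bin_size=2)
```

The same pair-count formula gives 89 * 88 / 2 = 3916 pairs for the 89 GBR individuals. With real data, a bin size of 10,000 would correspond to the 0.01 million bins used in the frequency plots.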
Perl programs were used to compute the differences between the genomes of every two individuals from the 1000 Genomes Project, for the 14 populations:
• ASW: HapMap African ancestry individuals from SW US
• CEU: CEPH individuals
• CHB: Han Chinese in Beijing
• CHS: Han Chinese South
• CLM: Colombian in Medellin, Colombia
• FIN: HapMap Finnish individuals from Finland
• GBR: British individuals from England and Scotland
• IBS: Iberian populations in Spain
• JPT: Japanese individuals
• LWK: Luhya individuals
• MXL: HapMap Mexican individuals from LA, California
• PUR: Puerto Rican in Puerto Rico
• TSI: Toscani individuals
• YRI: Yoruba individuals

Figure 1: Frequency distribution of differences in 14 populations (GBR, JPT, LWK, FIN, ASW, PUR, CLM, CHS, MXL, CEU, IBS, YRI, TSI and CHB). Each peak represents one population; the differences for each population were calculated between every two individuals within that population (example: GBR has 89 individuals, which makes 3916 pairs of individuals). The differences ranged from 2.75 million to 5.28 million and are plotted in bins 0.01 million in size.

[Figure 1 chart: x axis, number of differences (millions), 2.7 to 5.5; y axis, Number Of Pairs, 0 to 1000; legend (population and number of individuals): ASW-60, CEU-84, CHB-96, CHS-99, CLM-59, FIN-93, GBR-89, IBS-13, JPT-88, LWK-96, MXL-65, PUR-54, TSI-96, YRI-87]

The graph above illustrates the distribution of the genetic differences among the 14 populations. The X axis shows the range in the number of differences (2.7 million to 5.5 million). The Y axis represents the number of pairs (two individuals compared by calculating the number of genetic differences between their genomes).

Figure 2: The graph below shows that the 14 populations fall into 4 distinct origins, which we call 4 ancestries: 1 African, 2 Hybrid, 3 European, 4 Asian.
[Figure 2 chart: the same distribution (x axis 2.7 to 5.5 million differences; y axis Number Of Pairs, 0 to 1000; legend ASW-60, CEU-84, CHB-96, CHS-99, CLM-59, FIN-93, GBR-89, IBS-13, JPT-88, LWK-96, MXL-65, PUR-54, TSI-96, YRI-87), with the four ancestry groups labeled 1 to 4]

Figure 3: The three populations of African origin; their total differences are distributed close to each other. The LWK population (Luhya individuals) showed some pairs of individuals with far fewer differences, down to almost half the usual number (2.7 million - 4.8 million). Almost all of these pairs have been declared as siblings or other relatives. Some of them are not declared as relatives by the 1000 Genomes Project, so our results suggest that there might be some undeclared relatives in the 1000 Genomes Project.

We further examined some populations for declared relationships between individuals; relatives showed the smallest differences in their genetic variation. For example, in the LWK population, as shown in the table below, the relatives fall at the top of the list when the total differences are sorted from lowest to highest. The green-highlighted cells on the slide mark individuals declared as related in the 1000 Genomes appendix. We suggest that the pairs that are not highlighted are also related in some way, although the 1000 Genomes Project has not declared them as relatives.

No.  ID1_LWK  ID2_LWK  Total_LWK differences  Relationship
1    NA19374  NA19373  2756691                Siblings
2    NA19352  NA19347  2777456                Siblings
3    NA19470  NA19443  2848500                Aunt/Uncle
4    NA19397  NA19396  2871776                Siblings
5    NA19444  NA19434  3004459                Siblings
6    NA19334  NA19331  3007478                ?
7    NA19382  NA19381  3070661                uncertain parent/child relationship
8    NA19453  NA19445  3077137                ?
9    NA19470  NA19469  3111728                Niece/Nephew
10   NA19331  NA19313  3119208                ?
11   NA19382  NA19380  3970915                Half Siblings
12   NA19453  NA19444  4106949                ?
13   NA19334  NA19313  4178970                Unknown relation
14   NA19469  NA19443  4236592                Niece/Nephew

Figure 4: CLM, PUR and MXL populations. They show a very wide distribution, ranging from 3.1 to 4.86 million. Our results indicate that these populations have a wide range of mixed ancestry. The PUR population has a second peak on the right side (between 4.74 and 4.9 million); we expect that these individuals have a different ancestry. Further investigation of these individuals is being conducted to determine where their ancestry comes from.

Figure 5: Populations FIN, GBR, TSI, CEU and IBS. All of these populations are of European origin. The IBS population shows as a very low curve because only 13 people from this population have been sequenced.

Figure 6: The populations of Asian origin are closely related, as shown by the very similar shapes of their distributions, which range between 3.4 million and 3.69 million.

We are further investigating the highest-difference pairs (the pairs of individuals with the highest differences), which we suggest may have a different origin. We examined the 40 highest pairs in some populations and found that some individuals showed a high difference with many other individuals and so appeared repeatedly. An example is shown in the figure below.

[Chart: CLM distribution; x axis 2.7 to 4.8 (million differences); y axis 0 to 80]

The list that follows gives the CLM individuals that showed the highest genetic differences with each other. Looking at them individually, we noticed that some of them are repeated significantly more often than others, as shown in the list of repeats on the right side of the slide. HG01551 and HG01342 each appear among the highest differences 20 times, while others, such as HG01136, appear only 2 or 3 times. We therefore decided to investigate the possibility that these individuals have a different origin.
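The repeat count described above (how often each individual turns up among the highest-difference pairs of a population) can be sketched along these lines. This is a minimal illustration only, written in Python rather than the lab's Perl; the toy IDs and difference values are invented, and the function names `top_pairs` and `recurrence_counts` are hypothetical.

```python
from collections import Counter


def top_pairs(pair_differences, n):
    """Return the n pairs with the largest difference counts, highest first."""
    ranked = sorted(pair_differences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def recurrence_counts(pairs):
    """Count how many of the given pairs each individual takes part in."""
    counts = Counter()
    for (id1, id2), _difference in pairs:
        counts[id1] += 1
        counts[id2] += 1
    return counts


# Hypothetical toy data: six pairs among four made-up individuals.
toy_differences = {
    ("A", "B"): 410,
    ("A", "C"): 455,
    ("A", "D"): 460,
    ("B", "C"): 400,
    ("B", "D"): 405,
    ("C", "D"): 395,
}

top3 = top_pairs(toy_differences, 3)
counts = recurrence_counts(top3)
# Individuals that recur far more often than the rest are the ones to follow up.
most_frequent_id, most_frequent_count = counts.most_common(1)[0]
```

In the toy data, individual "A" takes part in all three of the highest-difference pairs, which is the kind of pattern reported here for HG01551 and HG01342 among the 40 highest CLM pairs.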
The 40 highest-difference CLM pairs (columns as recovered from the slide)

First individual of each pair (40 entries):
HG01551, HG01365, HG01342, HG01551, HG01551, HG01551, HG01488, HG01366, HG01551, HG01342, HG01342, HG01377, HG01462, HG01551, HG01461, HG01342, HG01551, HG01551, HG01375, HG01551, HG01551, HG01389, HG01342, HG01551, HG01342, HG01551, HG01342, HG01440, HG01342, HG01342, HG01551, HG01551, HG01551, HG01551, HG01551, HG01390, HG01462, HG01551, HG01551, HG01551

Total differences (40 entries, lowest to highest):
4479513, 4480834, 4481529, 4481637, 4483529, 4485279, 4487693, 4488647, 4490996, 4493212, 4493218, 4494064, 4494414, 4496682, 4497146, 4498051, 4499694, 4499713, 4500523, 4501432, 4503181, 4506393, 4508562, 4510222, 4514486, 4519187, 4520380, 4527415, 4533004, 4535490, 4537772, 4541901, 4542804, 4558088, 4561600, 4562418, 4564478, 4577349, 4608288, 4678948

Second individual of each pair (39 entries recovered):
HG01342, HG01250, HG01250, HG01375, HG01125, HG01342, HG01342, HG01259, HG01271, HG01277, HG01342, HG01390, HG01365, HG01342, HG01125, HG01148, HG01345, HG01342, HG01134, HG01495, HG01342, HG01148, HG01377, HG01134, HG01389, HG01124, HG01342, HG01275, HG01272, HG01272, HG01488, HG01461, HG01462, HG01275, HG01342, HG01342, HG01440, HG01390, HG01342

The idea was to take the frequently repeated high-difference individuals together with 10 control individuals from the same population, chosen because they showed an average number of genetic differences within that population. We then randomly took individuals from other populations and calculated the genetic differences between our 10 controls plus 2 high-repeat individuals and the 1 control individual from each of the other populations.

In the comparison below, the 10 controls from CLM plus the 2 frequently repeated high-difference individuals (HG01551 and HG01342) were compared against one control individual from the YRI population (Yoruba individuals, “African ancestry”). HG01551 and HG01342 had the lowest differences, indicating that these two persons might be of African origin. We further compared the CLM group with an individual from another African population (LWK) and with an individual from an Asian population (CHS). The two individuals showed the lowest genetic differences against the LWK control and the highest differences against the CHS individual.
This suggests that these two individuals from the CLM population are of African origin.

[Figure panels: CLM - LWK; CLM - CHS]

Conclusions
• Total variants showed substantial geographic differentiation.
• The total number of differences distinguishes populations that are geographically and ancestrally more remote.
• Populations are grouped by the predominant component of ancestry: Europe (CEU, TSI, GBR, FIN and IBS), Africa (YRI, LWK and ASW), East Asia (CHB, JPT and CHS) and the Americas (MXL, CLM and PUR).
• Relatives within the same population have significantly fewer genotype differences (“almost half the number”) than non-relatives.
• The study of human genetic variation has evolutionary significance. It can help us to understand ancient human population migrations as well as how different human groups are biologically related to one another.