University of Belgrade
School of Electrical Engineering

Siniša Ivković, Goran Rakočević, Prof. Veljko Milutinovic
Siniša Ivković - sinisa.ivkovic@gmail.com

Introduction

- Sequence alignment
  • A way of arranging DNA, RNA, or protein sequences to identify regions of similarity
  • Similarity may reflect functional, structural, or evolutionary relationships between sequences
- How do we know that two genes, often in different organisms, are in fact two versions of the same gene? Similarity!

Introduction (continued)

- A number of algorithms solve the sequence alignment problem and guarantee the best solution
- As the amount of data to be processed grows, the execution time of these algorithms becomes unacceptable
- Therefore, we must turn to heuristic methods such as BLAST

BLAST - Basic Local Alignment Search Tool

- Fast local sequence alignment algorithm
- BLAST owes its efficiency to concentrating on regions of high similarity instead of trying to find and check every possible local alignment
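The idea of looking only at regions of high similarity can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual BLAST algorithm: it seeds on exact words only (real BLAST also scores neighboring words with a substitution matrix), it uses an assumed +1/-1 match/mismatch score, and the names `build_word_index`, `extend_seed`, `find_hsps` and the parameters `w`, `x_drop`, `min_score` are hypothetical.

```python
from collections import defaultdict


def build_word_index(subject, w):
    """Map every length-w word of the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend_seed(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit in both directions without gaps.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far (assumed toy scoring).
    Returns (score, query_start, subject_start, length).
    """
    score = best = w * match
    right = 0  # columns kept to the right of the seed
    k = 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > x_drop:
            break
    score = best  # restart from the best right-hand extension
    left = 0  # columns kept to the left of the seed
    k = 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > x_drop:
            break
    return best, qi - left, si - left, left + w + right


def find_hsps(query, subject, w=3, min_score=5):
    """Seed on shared words, extend each seed, keep segments scoring >= min_score."""
    index = build_word_index(subject, w)
    hsps = set()
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], ()):
            hsp = extend_seed(query, subject, qi, si, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    # Toy strings built around a short shared motif; not real database entries.
    print(find_hsps("THYPDVF", "GGHYPDVKK"))  # [(5, 1, 2, 5)]
```

Only positions that share a word with the query are ever examined, which is where the speed-up over exhaustive alignment comes from; the price is that alignments without any exact shared word are missed.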
KRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL

Parallel BLAST

- Most bioinformatics algorithms are designed as sequential programs
  • The very nature of bioinformatics processing
  • The rapid growth of knowledge in biology constantly produces new concepts and significantly changes established ones
- The falling price of genome sequencing demands faster execution of these algorithms
- Implementations of parallel BLAST
  • PThread
  • MPI

ETF Hadoop BLAST

- Big Data – a collection of data sets so large and complex that it becomes difficult to process using standard database tools or traditional data processing applications
- Parallel computing – a form of computation in which many calculations are carried out simultaneously; it raises problems of its own:
  • communication and synchronization between processes
  • hardware failure
- MapReduce – a programming model that frees programmers from thinking about these problems
- Apache Hadoop – a free implementation of the MapReduce paradigm

MapReduce

[Diagram: input values are processed by parallel MAP tasks, the intermediate values are sorted, and REDUCE tasks combine them into output values.]

ETF Hadoop BLAST - Implementation

[Diagram: the query {q1} from mySequence is paired with each database partition {db1}, {db2}, {db3}. One MAP task per partition emits the hits found there: {hit1} and {hit2} for {db1}, {hit3} and {hit4} for {db2}, {hit5} and {hit6} for {db3}. The REDUCE tasks then output {hit1} for {db1}, {hit3} for {db2}, and {hit6} for {db3}.]

ETF Hadoop BLAST

>GENSCAN00000000013 pep:genscan chromosome:GRCh37:18:4755977:4807982:1 transcript:GENSCAN00000000013 transcript_biotype:protein_coding
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLL
AASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFP
FG

TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLL
AASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFP
FG

HSP: 661
E-value: 0.001446314485823671

Conclusion

- Bioinformatics has become an important part of many areas of biology
  • Sequencing and annotating genomes and their observed mutations
  • Data mining of biological literature and the development of gene ontologies
  • Understanding the evolutionary aspects of molecular biology
- Personalized medicine
  • A medical model that proposes the customization of healthcare
  • We need to consider the whole spectrum of clinical information
  • Electronic health care records
  • Clinical trials
  • etc.

Conclusion (continued)

- We need to collect information from the real world
- Develop analytics that can actually extract causal relationships and generate predictive models
- Future steps:
  • Specialized hardware (FPGA)
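Returning to the map and reduce stages from the ETF Hadoop BLAST implementation slides, the data flow can be imitated in plain Python without a Hadoop cluster. This is only a sketch under stated assumptions: `toy_score` is a hypothetical stand-in for a real BLAST search, the reduce policy of keeping the single best hit per database partition is a guess based on the diagram, and all function names and sample sequences are made up for illustration.

```python
from collections import defaultdict


def toy_score(query, subject, w=3):
    """Hypothetical stand-in for BLAST: count subject words of length w that occur in the query."""
    words = {query[i:i + w] for i in range(len(query) - w + 1)}
    return sum(1 for i in range(len(subject) - w + 1) if subject[i:i + w] in words)


def map_phase(query, db_id, db_sequences, min_score=1):
    """Map: search one database partition and emit (db_id, (score, seq_id)) per hit."""
    for seq_id, seq in db_sequences.items():
        score = toy_score(query, seq)
        if score >= min_score:
            yield db_id, (score, seq_id)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(db_id, hits):
    """Reduce: keep the best-scoring hit of a partition (assumed policy)."""
    return db_id, max(hits)


def run_job(query, partitions):
    """Run map, shuffle, and reduce sequentially; Hadoop would run the map tasks in parallel."""
    mapped = [pair
              for db_id, seqs in partitions.items()
              for pair in map_phase(query, db_id, seqs)]
    return dict(reduce_phase(key, hits) for key, hits in sorted(shuffle(mapped).items()))


if __name__ == "__main__":
    # Made-up partitions; real partitions would be chunks of a sequence database.
    partitions = {
        "db1": {"s1": "GGHYPDVKK", "s2": "AHYPA"},
        "db2": {"s3": "CCCCCC"},
        "db3": {"s4": "KYPDVK"},
    }
    print(run_job("HYPDV", partitions))
```

Because each map call touches only its own partition, the partitions can be searched on different machines with no communication until the shuffle, which is the property the MapReduce model relies on.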