Hadoop Blast - Sinisa Ivkovic

University of Belgrade
School of Electrical Engineering
Siniša Ivković, Goran Rakočević, Prof. Veljko Milutinovic
Introduction
- Sequence alignment
• A way of arranging DNA, RNA, or protein sequences
to identify regions of similarity that may indicate:
• functional,
• structural, or
• evolutionary relationships between the sequences
- How do we know that two genes, often in different organisms,
are in fact two versions of the same gene?
Similarity!
Siniša Ivković - sinisa.ivkovic@gmail.com
Introduction
• A number of algorithms solve the sequence alignment problem
and guarantee an optimal solution
• As the amount of data to be processed grows,
the execution time of these algorithms becomes unacceptable
• Therefore, we must turn to heuristic methods such as BLAST
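To make the cost of exact methods concrete, here is a minimal sketch of one such algorithm, Smith-Waterman local alignment. It is chosen here only for illustration (the slides do not name a specific algorithm), and the scoring values are assumptions. The dynamic-programming table has len(a) x len(b) cells, so the work grows with the product of the sequence lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell is visited once, which is exactly the exhaustive checking that a heuristic such as BLAST tries to avoid.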
BLAST - Basic Local Alignment Search Tool
• A fast local sequence alignment algorithm
• BLAST is efficient because it looks for regions of high similarity
rather than trying to find and check every possible local alignment
KRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL
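The idea of searching only around regions of high similarity can be sketched as seed-and-extend. This is a simplified illustration, not the actual BLAST implementation: it seeds on exact k-letter words and scores by identity, whereas real BLAST also uses similar words and a substitution matrix. All function names, the word length, and the scoring parameters are assumptions.

```python
from collections import defaultdict

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]

def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=3):
    """Extend a seed without gaps in both directions; return the best score."""
    def walk(qi, sj, step):
        score = best = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            best = max(best, score)
            if best - score > x_drop:   # stop once the score has dropped too far
                break
            qi += step
            sj += step
        return best
    return k * match + walk(i + k, j + k, 1) + walk(i - 1, j - 1, -1)

def best_hsp_score(query, subject, k=3):
    """Score of the best ungapped segment pair found from any seed (0 if no seed)."""
    seeds = find_seeds(query, subject, k)
    return max((extend_seed(query, subject, i, j, k) for i, j in seeds), default=0)
```

Calling best_hsp_score on the two protein sequences shown on this slide only examines the neighbourhoods of shared words such as HYPDV, instead of filling a full alignment table.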
Parallel BLAST
- Most bioinformatics algorithms are designed as sequential programs
• The very nature of bioinformatics processing
• The rapid growth of knowledge in biology causes
new concepts to emerge constantly and
changes established ones significantly
- As the price of genome sequencing declines, more data is produced,
so these algorithms must execute faster
- Implementations of Parallel BLAST
• PThread
• MPI
ETF Hadoop BLAST
- Big Data – a collection of data sets so large and complex
that it becomes difficult to process with standard database tools or
traditional data processing applications
- Parallel computing – a form of computation
in which many calculations are carried out simultaneously; it raises problems such as:
• communication and synchronization between processes
• hardware failure
- MapReduce – a programming model that frees programmers from
thinking about these problems
- Apache Hadoop – a free implementation of the MapReduce
paradigm
MapReduce
[Figure: MapReduce data flow. Input VALUEs are processed by MAP tasks, the intermediate VALUEs pass through a SORT step, and REDUCE tasks produce the output VALUEs.]
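The MAP, SORT, and REDUCE stages can be imitated in memory with a few lines of Python. This is a single-process sketch of the programming model only, not Hadoop code; the function names and the residue-counting example are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Run mapper over records, sort by key, then reduce each key group."""
    pairs = [kv for record in records for kv in mapper(record)]   # MAP
    pairs.sort(key=itemgetter(0))                                  # SORT
    return {key: reducer(key, [v for _, v in group])               # REDUCE
            for key, group in groupby(pairs, key=itemgetter(0))}

def count_residues(sequence):
    """Mapper: emit (residue, 1) for every letter of a sequence."""
    return [(residue, 1) for residue in sequence]

def total(key, values):
    """Reducer: sum the counts collected for one residue."""
    return sum(values)

counts = map_reduce(["KRKL", "KKKH"], count_residues, total)
```

In a real framework the mapper and reducer calls run on different machines, and the framework handles the communication, synchronization, and failure recovery.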
ETF Hadoop BLAST - Implementation
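[Figure: map phase. The query mySequence {q1} is paired with each database part {db1}, {db2}, {db3}. Each MAP task takes {q1} and one database part and emits hits keyed by that part: {hit1} and {hit2} for {db1}, {hit3} and {hit4} for {db2}, {hit5} and {hit6} for {db3}.]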
ETF Hadoop BLAST - Implementation
[Figure: reduce phase. The query mySequence {q1} is mapped against {db1}, {db2}, {db3} as before, producing {hit1} to {hit6} keyed by database part. REDUCE tasks then keep one hit per part: {hit1} for {db1}, {hit3} for {db2}, {hit6} for {db3}.]
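The data flow in the two implementation diagrams can be sketched in plain Python. This is one reading of the figures, not the authors' code: the mapper aligns the query against one database part and emits hits keyed by that part, and the reducer keeps a single best hit per part. The scoring function toy_score is a hypothetical stand-in for BLAST, and all names and sample sequences are assumptions.

```python
from collections import defaultdict

def toy_score(query, subject, k=3):
    """Hypothetical stand-in for BLAST: count k-letter words of subject found in query."""
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    return sum(subject[j:j + k] in words for j in range(len(subject) - k + 1))

def map_chunk(chunk_id, chunk, query):
    """MAP: align the query against one database part, emit (chunk_id, hit)."""
    for name, subject in chunk.items():
        score = toy_score(query, subject)
        if score > 0:
            yield chunk_id, (score, name)

def reduce_hits(chunk_id, hits):
    """REDUCE: keep the highest-scoring hit for one database part."""
    return max(hits)

def hadoop_blast_sketch(query, chunks):
    grouped = defaultdict(list)          # stands in for the shuffle/sort step
    for chunk_id, chunk in chunks.items():
        for key, hit in map_chunk(chunk_id, chunk, query):
            grouped[key].append(hit)
    return {key: reduce_hits(key, hits) for key, hits in sorted(grouped.items())}

chunks = {
    "db1": {"seqA": "HYPDVFAR", "seqB": "HYPAAAAA"},
    "db2": {"seqC": "TTTTTTTT"},
    "db3": {"seqD": "QVWFSNRR"},
}
best = hadoop_blast_sketch("HYPDVQVWF", chunks)
```

Because each map call touches only one database part, the parts can be searched on different nodes without any communication until the reduce step.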
ETF Hadoop BLAST
>GENSCAN00000000013 pep:genscan chromosome:GRCh37:18:4755977:4807982:1 transcript:GENSCAN00000000013 transcript_biotype:protein_coding
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLL
AASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFP
FG
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLL
AASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFP
FG
HSP: 661
E-value: 0.001446314485823671
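For readers unfamiliar with E-values: BLAST-style tools commonly derive them from the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where S is the alignment score, m the query length, and n the database length. The sketch below shows only that formula; whether ETF Hadoop BLAST computes its E-value exactly this way is an assumption, and the default values of k and lam are illustrative placeholders that in practice depend on the scoring system.

```python
import math

def expected_hits(raw_score, query_len, db_len, k=0.13, lam=0.32):
    """Karlin-Altschul estimate of chance hits scoring at least raw_score.

    k and lam are placeholder values (assumptions), not parameters from the slides.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

A smaller E-value means the hit is less likely to be due to chance; the value grows linearly with database size and shrinks exponentially with the score.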
Conclusion
- Bioinformatics has become an important part of many areas of
biology
• Sequencing and annotating genomes and
their observed mutations
• Data mining of biological literature and
the development of gene ontologies
• Understanding the evolutionary aspects of molecular biology
- Personalized medicine
• A medical model that proposes the customization of healthcare
• We need to consider the whole spectrum of clinical information
• Electronic health care records
• Clinical trials
• etc.
Conclusion
- We need to collect information from the real world
- We need to develop analytics that can actually extract causal relationships
and generate predictive models
- Future steps:
- Specialized hardware (FPGA)