History of Bioinformatics

Bioinformatics
Bin Liu (刘滨), PhD, Associate Professor
Intelligent Computing Research Center
Homepage: http://bioinformatics.hitsz.edu.cn/
Email: bliu@insun.hit.edu.cn or binliu@hitsz.edu.cn
Before we start
 Course name: Bioinformatics
 Instructor: Bin Liu, PhD, Associate Professor
 Office hours: by appointment, Office: C303B;
 Evaluation: attendance and presentation (30%); projects
and report (30%); examination (40%)
 Class hours: 32;
Credits: 2
 Audience: master's students in Computer
Science and related majors.
 Note: Biology background is not required.
Before we start: Under the dome
Why should we study this course?
 To understand ourselves
 Most biologists don’t know computer science. Most
computer scientists don’t know biology.
 For study
 Very easy to find a position in top universities in the world.
 For jobs
 Jobs in academia
 Jobs in industry
References (not limited to)
Carlos Setubal, Joao Meidanis. Introduction to Computational Molecular Biology
Dan E. Krane, Michael L. Raymer. Fundamental Concepts of Bioinformatics
Marketa Zvelebil, Jeremy O. Baum. Understanding Bioinformatics
Definitions
 Biology easily has 500 years of exciting problems to work on.
-- Donald E. Knuth (高德纳), Professor Emeritus of The Art of
Computer Programming at Stanford University
Names:
1. Bioinformatics: an interdisciplinary field that develops and
improves on methods for storing, retrieving, organizing and
analyzing biological data. A major activity in bioinformatics is to
develop software tools to generate useful biological knowledge.
2. Computational Biology: involves the development and
application of data-analytical and theoretical methods,
mathematical modeling and computational simulation techniques
to the study of biological, behavioral, and social systems.
Participating fields:
1. Computer Science: (1) algorithms; (2) AI; (3) databases
2. Biological Science
3. Mathematics
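Definition in Baidu Baike (百度百科)
 Bioinformatics (生物信息学) is the science of using
computers as tools to store, retrieve, and analyze
biological information in life-science research. It is
one of the major frontier fields of today's life sciences
and natural sciences, and is expected to become one of
the core fields of natural science in the 21st century.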
History of bioinformatics
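 Dr. Hwa A. Lim is often credited with coining the
word “Bioinformatics” in 1987 (the term had appeared
earlier, in work by Paulien Hogeweg and Ben Hesper
around 1970).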
History of Bioinformatics
1950s, the first period
 The base ratios A=T and G=C in DNA (Chargaff's rules)
were discovered in 1949
 Pauling and Corey proposed the α-helix and β-sheet
structures of proteins in 1951
 Watson and Crick proposed the DNA
structure in 1953
 The first bioinformatics meeting was held in
the USA in 1956
History of bioinformatics
1960s and 1970s, the second period.
The basic concept of bioinformatics: sequence comparison.
 Margaret Dayhoff
 Collecting protein family data
 In the 1970s, the PAM matrices (Point Accepted Mutation, also
expanded as Percent Accepted Mutation) were proposed.
 Needleman & Wunsch: in 1970, a sequence comparison
algorithm (global alignment).
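The Needleman & Wunsch idea can be sketched as a small dynamic program. This is a minimal illustration, not the original 1970 formulation: the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, and only the optimal score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] aligned with a gap
                score[i][j - 1] + gap,       # b[j-1] aligned with a gap
            )
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "ACT"))  # three matches and one gap
```

The table has (len(a)+1) × (len(b)+1) cells and each cell is filled in constant time, which is what makes the comparison practical for long sequences.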
History of bioinformatics
1980s.
 EMBL, GenBank, DDBJ
 Smith & Waterman (local alignment algorithm)
 Pearson & Lipman: the FASTA tool.
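Local alignment in the style of Smith & Waterman differs from global alignment in one key step: a cell score is never allowed to drop below zero, so an alignment can start and end anywhere. The sketch below is a minimal illustration; the function name and the scoring values (match +2, mismatch -1, gap -1) are assumptions for the example, and it returns only the best local score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                       # start a fresh local alignment here
                prev[j - 1] + pair,      # extend diagonally
                prev[j] + gap,           # gap in b
                curr[j - 1] + gap,       # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))  # the shared ACGT core dominates
```

Only two rows are kept in memory because the score alone does not require a traceback.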
History of bioinformatics
1990s
 Human Genome Project, HGP
 Other genome projects: Mus musculus (house mouse),
C. elegans (nematode), …
 Lipman and colleagues developed the BLAST tool and later PSI-BLAST.
History of Bioinformatics
[Figure: growth of protein sequence data (Swiss-Prot) compared with protein structure data (PDB). x-axis: database update date, 1985 to 2010; y-axis: number of proteins, 0 to 400k. The growth is unbalanced: sequence data far outpaces structure data.]
preface
Bioinformatics
 Biologists: creators and ultimate users of the data
 Mathematicians and computer scientists: attracted by the
sheer size and complexity of the data.
Techniques
 Databases: new database models to record changes
 Pattern recognition: to understand molecular
sequences, AI, machine learning, etc.
 Algorithms
 Internet
preface
Can Biology Help Computing?
 Computational techniques inspired by biology:
 Neural network (artificial intelligence)
 Genetic algorithm
 A new driver of computer science:
 Better hardware (supercomputers)
 New data representation
 New driver for algorithm development
 Develop new theoretical frameworks:
 DNA computing
 Ant colony algorithm (communication between ants)
preface
This course:
 To present a representative sample of bioinformatics
problems in biology
 To present efficient algorithms for the above problems
algorithms
 Definition: a step-by-step procedure that solves
a certain well-defined problem within a bounded
amount of time
 Efficient algorithms: they should not take “too long”
to solve a problem, even a large one. E.g., sequence
comparison ⇒ Chapter 3
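To see why "too long" matters for sequence comparison, the following sketch compares two quantities: the number of distinct gapped alignments of two sequences (which a brute-force search would have to enumerate) and the number of table cells a dynamic-programming algorithm fills. The function names are hypothetical; the alignment count uses the standard recurrence in which the last column of an alignment is a match/mismatch, a gap in the first sequence, or a gap in the second.

```python
def count_alignments(m, n):
    """Number of distinct global alignments of sequences of lengths m and n."""
    counts = [[1] * (n + 1) for _ in range(m + 1)]  # one way to align with an empty prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j]        # last column: gap in second sequence
                + counts[i][j - 1]      # last column: gap in first sequence
                + counts[i - 1][j - 1]  # last column: two aligned characters
            )
    return counts[m][n]


def dp_table_cells(m, n):
    """Number of cells in the dynamic-programming table for the same problem."""
    return (m + 1) * (n + 1)


for length in (3, 10, 20):
    print(length, count_alignments(length, length), dp_table_cells(length, length))
```

Already at length 3 there are 63 alignments against 16 table cells, and the gap widens exponentially as the sequences grow.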

Why does computation work?
The digital computer
• Analog signals get degraded over time
• Digital information can be propagated unaltered
• The cell is a mixture of analog and digital components
The digital molecules of life
• DNA: carries inherited genetic information across generations
• RNA: carries temporary messages within the cell
• Protein: executes molecular processes as dictated by the code
Properties of each molecule tailored to its role
• DNA: highly stable, protected, self-complementary
• RNA: quickly degraded, single-stranded, mobile
• Protein: versatile code (n × 20), complex 3D structure
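Because these molecules are digital, they can be handled as plain strings. The sketch below illustrates two of the properties listed above: DNA's self-complementarity (one strand determines the other) and the DNA-to-RNA message step. The function names are hypothetical, and the input is assumed to be an uppercase coding strand over the alphabet A, C, G, T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return dna.translate(_COMPLEMENT)[::-1]


def transcribe(dna):
    """Return the RNA message for a coding strand (T is replaced by U)."""
    return dna.replace("T", "U")


strand = "ATGCGT"
print(reverse_complement(strand))                      # the partner strand
print(reverse_complement(reverse_complement(strand)))  # recovers the original
print(transcribe(strand))
```

Applying `reverse_complement` twice returns the input, which is the string-level counterpart of each strand serving as a template for the other.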
Bioinformatics in China
 Research started early, at the end of the 1960s
 The first bioinformatics center was established in the
life science department of Peking University in 1996
Bioinformatics websites
National Center for Biotechnology Information
(NCBI)
http://www.ncbi.nlm.nih.gov/
Databases, bioinformatics tools and software.
European Bioinformatics Institute (EBI)
http://www.ebi.ac.uk/
DDBJ (DNA Data Bank of Japan):
http://www.ddbj.nig.ac.jp/
Sanger: http://www.sanger.ac.uk
Tools
http://www.isb-sib.ch/
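Peking University Center for Bioinformatics: http://www.cbi.pku.edu.cn
It is the China node of EMBnet and of the Asia-Pacific Bioinformatics Network (APBioNet).
Bioinformation Center, Shanghai Institutes for Biological Sciences (上海生命科学研究院生物信息中心):
http://www.biosino.org/
Hong Kong Bioinformatics Centre (HKBIC), The Chinese University of Hong Kong:
http://www.hkbic.bch.cuhk.edu.hk/
Taiwan Molecular Informatics Center (台湾分子信息中心):
http://bioinfo.life.nctu.edu.tw/index.p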
http://www.chgc.sh.cn/
http://www.genomics.cn/index
Useful web sites
http://emuch.net/ (小木虫, Xiaomuchong)
http://www.dxy.cn/ (丁香园, DXY)
http://www.bioon.com/ (生物谷, Bioon)
http://www.bio-soft.net (生物软件, Bio-soft)
Course overview
Chapter 1
 fundamental concepts from biology:
 basic structure and function of proteins and nucleic
acids
 mechanisms of molecular genetics
 most important laboratory techniques for studying the
genome of organisms
 an overview of existing sequence databases.
Chapter 2
 strings: the most important mathematical objects used in
the course.
 Medical Literature retrieval.
Course overview
Chapter 3 sequence comparison
 two-sequence problem: classic dynamic programming algorithm
 more general cases of the problem, handled by extensions of the algorithm:
• multiple-sequence comparison problem
• programs used in database searches
• some other miscellaneous issues
Chapter 4 phylogenetic tree
 Proteins and nucleic acids also evolve over time; an
important tool for studying this ⇒ phylogenetic tree
 help understand protein function
 some of the mathematical problems related to phylogenetic tree
reconstruction
 simple algorithms: for certain special cases
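As an illustration of a special case where tree reconstruction becomes simple, consider ultrametric distances: if, for every three species, the two largest of the three pairwise distances are equal, the distances fit a rooted tree in which all leaves are equally far from the root. The slides do not name this case, so treat it as an assumed example; the function name and the sample matrices are hypothetical.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Check the three-point condition on a symmetric distance matrix."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances; the two largest must coincide.
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical distances between four species A, B, C, D.
clock_like = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]
not_clock_like = [
    [0, 2, 6],
    [2, 0, 5],
    [6, 5, 0],
]
print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```

When the check passes, simple clustering methods that repeatedly merge the closest pair can recover the tree; when it fails, the harder general problems apply.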
Course overview
Chapter 5 genome rearrangements
 An important new field: some organisms are genetically different, not
so much at the sequence level, but in the order in which large similar
chunks of their DNA appear in their respective genomes
 mathematical models
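One common mathematical model, assumed here for illustration, treats a genome as a permutation of numbered blocks and a rearrangement as the reversal of a contiguous segment. A breakpoint is a pair of adjacent blocks that are not consecutive numbers; since one reversal can remove at most two breakpoints, the breakpoint count gives a lower bound on the number of reversals needed. The function names below are hypothetical.

```python
def count_breakpoints(perm):
    """Breakpoints of an unsigned permutation of 1..n, framed by 0 and n+1."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(
        1
        for left, right in zip(extended, extended[1:])
        if abs(left - right) != 1  # neighbours are not consecutive blocks
    )


def reverse_segment(perm, start, end):
    """Return a copy of perm with positions start..end (inclusive) reversed."""
    perm = list(perm)
    perm[start:end + 1] = perm[start:end + 1][::-1]
    return perm


genome = [3, 2, 1, 4, 5]
print(count_breakpoints(genome))
fixed = reverse_segment(genome, 0, 2)
print(fixed, count_breakpoints(fixed))
```

In this example a single reversal of the first three blocks restores the identity order and removes both breakpoints.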
Chapter 6 molecular structure prediction
 methods that try to predict a molecule's structure based on its primary
sequence
 RNA structure prediction: dynamic programming algorithms
 protein structure prediction:
• difficulties
• protein threading: attempts to align a protein sequence
with a known structure
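For the RNA case mentioned above, a classic dynamic-programming formulation is base-pair maximization (the Nussinov-style recurrence). It is used here as an assumed example, since the slides do not say which algorithm the course covers. The sketch counts the maximum number of nested pairs, allowing A-U, G-C and G-U pairs and requiring at least `min_loop` unpaired bases inside any hairpin; the function name and the default of 3 are assumptions.

```python
_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA string."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = max pairs within rna[i..j]; short spans stay at 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in _PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    value = max(value, left + 1 + best[k + 1][j - 1])
            best[i][j] = value
    return best[0][n - 1]


print(nussinov_max_pairs("GGAAACC"))  # a small hairpin: two G-C pairs around AAA
```

Real predictors minimize free energy rather than count pairs, so this sketch shows the shape of the recurrence, not a practical predictor.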
Course overview
Chapter 7 Data Driven Machine Learning Approaches
for Bioinformatics