BioM 3

Basics of Systems
Bioinformatics
沈祖望 博士
Tsu-Wang (David) Shen, Ph.D.
Medical Informatics Department
Tzu Chi University, Taiwan
E-mail: tshen@mail.tcu.edu.tw
Web: http://www.biom3.tcu.edu.tw
Agenda
1. About me & Data mining
2. Introduction to Biological Signal Processing
3. Basics of Biology
4. Basics of Signals & Systems
5. Applications
6. Future Work
2016/3/18
Introduce myself
Education
• Ph.D., Biomedical Engineering, University of Wisconsin-Madison, WI, USA
• M.S., Electrical and Computer Engineering, University of Wisconsin-Madison, WI, USA
• M.S., Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA
Specialty
• Biomedical engineering, biomedical signal processing, biometric security and identification, neural networks, biostatistics, artificial intelligence, biomedical electronics
Introduce myself
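Experience
• Head of the Curriculum Section, Tzu Chi University (2005.8.1~2006.7.31)
• Adjunct Assistant Professor, Graduate Institute of Global Logistics Management, National Dong Hwa University (2007)
• Jointly Appointed Assistant Professor, Institute of Medical Sciences, Tzu Chi University (2006~)
• Ministry of Education seed instructor for high-tech patent acquisition and defense (2006)
• IEEE Member since 1996
• Member of the Taiwan Epilepsy Society since 2007
• Certified Biomedical Engineer, Republic of China (2007, I1090)
• Computer Project Assistant, Educational Psychology Department, University of Wisconsin-Madison, WI, USA (1999~2005)
• Full-time Lecturer, Ta Hwa Institute of Technology (1995~1996)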
Data mining
• Why mining?
• We need information from a big data warehouse, but there is too much of it!
• What is data mining?
• Data mining is the process of sorting through large amounts of data and picking out relevant information.
Traditional data mining skills
Today – New view of data mining
• Systems Bioinformatics: An Engineering Case-Based Approach, by Gil Alterovitz and Marco F. Ramoni (Computational Biology)
This trail-blazing work introduces a quantitative systems approach to bioinformatics research using powerful computational tools drawn from signal processing, circuit analysis, control systems, and communications. It presents the functionality of biological processes in an engineering context to facilitate the application of technical skills in solving the field's challenges, from the lab bench to data analysis and modeling, and to enable reverse engineering from biology in the development of synthetic biological devices.
Systems Bioinformatics
Today, we will focus on using signal processing and systems skills as our “mining” skills for solving bioinformatics and biology problems.
What are signals?
♦ Signals are everywhere! For example:
• Signals for detecting the presence of an object.
• Signals for making a medical diagnosis (e.g., electrocardiogram (ECG), electroencephalogram (EEG)).
• The trading volume in the stock market.
♦ Signals can be categorized in various ways:
• continuous-time (analog) signals
• discrete-time signals
• digital signals
• one-dimensional signals
• two-dimensional signals
What are signals?
• A signal is an abstract element of information or, more commonly, a flow of information (in one or more dimensions).
• One variable or several -> one-dimensional vs. multidimensional.
• Signal vs. noise -> depends on whether the component is wanted.
Continuous (Analog), Discrete,
and Digital signals
What is a system?
• A system is defined as an entity that manipulates one or more signals to accomplish a function, thereby yielding new signals.
[Figure: x[n] -> h[n] -> y[n]]
Specific systems
• Communication systems
• Control systems
• Microelectromechanical systems (MEMS)
• Remote sensing
• Biomedical signal systems
• Auditory system
• Bioinformatics systems
• Others …
Biological Signal Processing
The discipline is aimed at:
• Understanding and modeling the biological algorithms implemented by living systems using signal processing theory (biology as an endpoint).
• Using biology as a metaphor to formulate novel signal processing algorithms for engineering systems (signal processing as an endpoint).
History
• Claude Bernard, 1800s – concept of homeostasis
• Ludwig von Bertalanffy, 1968 – General System Theory: there appear to exist general system laws which apply to any system of a particular type, irrespective of the particular properties of the systems and the elements involved …
• Norbert Wiener, 1900s – system feedback
• Reiner, 1960s – systems work vs. DNA findings
• The focus was on a reductionist, component-level view.
Electrical vs. Biological Signal Processing
Integrating signal processing with biology
BSP at the cell level - Today’s coverage
Signal detection and estimation
• DNA sequencing
• Gene identification
• Protein hotspots identification
System identification and analysis
• Gene regulation systems
• Protein signal systems
Basics of Biology
DNA (gene) -> RNA -> amino acid -> protein -> cells
Introduction to the Basics of Biology
• In April 2003, sequencing of all three billion nucleotides in the human genome was declared complete.
• 1631 human genetic diseases are now associated with known DNA sequences.
• The human genome was not the only one targeted: by June 2004 the list also included 1557 viruses, 165 microbes, and 26 eukaryotes.
DNA and Gene Expression
• DNA is found in the nucleus and mitochondria of eukaryotic cells, including animal, plant, and fungal cells.
• DNA in the nucleus contains information from both parents.
• DNA is the chemical that directs protein synthesis and serves as a genetic blueprint; it is found in the nucleus of the cell.
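DNA
• Deoxyribonucleic acid (DNA) is the main chemical component of chromosomes and the material of which genes are made. It is sometimes called the "hereditary particle", because during reproduction parents copy part of their own DNA and pass it to their offspring, thereby transmitting traits.
• In prokaryotic cells (which have no nucleus), DNA resides in the cytoplasm, whereas in eukaryotes it resides in the nucleus. Strictly speaking, DNA consists of two single strands coiled around each other like grapevines to form a double helix. Depending on the helix, it is classified as A-DNA, B-DNA, or Z-DNA. The double helix discovered by James Watson and Francis Crick is the hydrated B form, which is the most common in cells.
• This nucleic acid polymer is a sequence of linked nucleotides, each made of one deoxyribose, one phosphate, and one base. DNA has four different nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G). In double-helix DNA, the two strands are made of complementary paired nucleotides and are held together by hydrogen bonds. Because of constraints on the number of hydrogen bonds, the bases can pair only as A with T or C with G. The base sequence of one strand therefore determines that of the other, since the bases on the two strands must be complementary. DNA replication follows the same complementary-pairing principle: when the double helix is unwound, each strand serves as a template, and the other strand is filled in by complementarity.
• The beginning of a strand is called the 5' end and the end is called the 3' end; these numbers refer to the numbering of the carbon atoms in deoxyribose.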
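The complementary-pairing rule (A with T, C with G) can be illustrated with a minimal sketch. The function names `complement` and `reverse_complement` are hypothetical names chosen for this example, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Because one strand determines the other, applying `reverse_complement` twice returns the original sequence.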
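Base pairs
• Base pairs are the chemical structures that form the monomers of the nucleic acids DNA and RNA and that encode genetic information. The bases involved are A, G, T, C, and U. Strictly speaking, a base pair is a pair of matching bases (A:T, G:C, or A:U) joined by hydrogen bonds. However, the term is often used as a unit of length for DNA and RNA (even though RNA is single-stranded). It is also used interchangeably with "nucleotide", although a nucleotide consists of a five-carbon sugar, a phosphate, and a base.
• Base pair is usually abbreviated bp; a kilobase pair is kbp, or simply kb (for double-stranded nucleic acids; for single-stranded nucleic acids, kb means kilobases).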
Base pairs, DNA
DNA and Gene
The human genome has been estimated to contain about 30,000 genes.
DNA-> Gene
Each gene has a particular location on a specific chromosome and contains the “code” for producing one of three forms of RNA: ribosomal RNA (rRNA), messenger RNA (mRNA), or transfer RNA (tRNA).
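Genomes
• In biology, the genome of an organism is all the genetic information contained in its DNA (or RNA, for some viruses). The genome includes both genes and non-coding sequences. The term was first used in 1920 by Hans Winkler, professor of botany at the University of Hamburg, Germany.
• More precisely, the genome of an organism is the complete DNA sequence of one set of chromosomes. For example, a diploid somatic cell carries two sets of chromosomes, and the DNA sequence of one of those sets is a genome.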
The Central Dogma of biology
• DNA is copied into RNA (transcription); the RNA is used to make proteins (translation); and the proteins perform functions such as copying the DNA (replication). (www.biologyforengineers.org)
Replication
• Replication of DNA by DNA polymerase. After the two strands of DNA separate (top), DNA polymerase uses nucleotides to synthesize a new strand complementary to the existing one (bottom). Images from the online tutorial “Biological Information Handling: Essentials for Engineers” (www.biologyforengineers.org).
Replication
• During replication, some enzymes check for accuracy.
• Error rate: approximately one error per billion nucleotides.
Transcription
• Transcription of DNA by RNA polymerase. Note that RNA contains U’s instead of T’s. Image from the online tutorial “Biological Information Handling: Essentials for Engineers” (www.biologyforengineers.org).
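As a rough illustration of the U-for-T substitution, the sketch below builds the mRNA that pairs with a DNA template strand. The function name `transcribe` is hypothetical, and the template is assumed to be written in the 3'->5' direction so that the returned mRNA reads 5'->3'.

```python
# Pairing used during transcription: the RNA base that pairs with each DNA base.
# RNA uses uracil (U) where DNA would use thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') complementary to a DNA template strand (3'->5')."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


print(transcribe("TACCGGATT"))  # AUGGCCUAA
```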
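Transcription
mRNA is short for messenger RNA. It carries the information transcribed from DNA that is needed for translation into protein, bringing it from the DNA to the ribosomes, the sites of protein synthesis in the cell. During its short lifetime, an mRNA passes through several steps. In transcription, an enzyme called RNA polymerase copies a gene from the DNA into mRNA as needed. In prokaryotes, the mRNA is not processed further (with rare exceptions), and translation often proceeds while transcription is still under way. In eukaryotes, transcription and translation take place in different parts of the cell: transcription in the nucleus, where the DNA is stored, and translation in the cytoplasm, where the ribosomes are.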
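Transcription
[Figure: interactions of mRNA in a eukaryotic cell. RNA is produced by transcription; after splicing and the addition of a poly(A) tail, it is transported to the cytoplasm and translated at the ribosome.]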
Transcription vs. replication
• Only a certain stretch of DNA acts as the template, not the whole strand.
• Different enzymes are used.
• Only a single strand is produced.
Translation
• In the process of translation, each mRNA codon attracts a tRNA molecule containing a complementary anticodon. (www.biologyforengineers.org)
Anticodon
• The anticodon is located on the tRNA. It determines which amino acid the tRNA carries, and it pairs with the codon on the mRNA.
Transcription & Translation
DNA to RNA to Protein
• The genetic information is stored in DNA, copied to RNA, and then interpreted from the RNA copy to form a functional protein.
• mRNA <-> rRNA <-> tRNA
• Proteins are chains of amino acids. Cells use 20 kinds of amino acids, which can be strung into proteins of different lengths, shapes, and functions.
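Proteins are synthesized according to the genetic code
DNA is the genetic material. Three adjacent bases on the DNA form one codon. When a gene is to make a protein, the DNA code is first transcribed into RNA; the RNA leaves the nucleus, enters the cytoplasm, and binds to a ribosome, where the protein is synthesized according to the codons on the RNA.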
DNA to RNA to Protein
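How do we know a codon is three adjacent bases?
Twenty amino acids must be encoded using four bases:
• One base: $4^1 = 4$ combinations (too few)
• Two bases: $4^2 = 16$ combinations (still fewer than 20)
• Three bases: $4^3 = 64$ combinations (enough)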
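The counting argument can be checked by enumerating every possible word of a given length over the four bases. This is only an illustrative sketch; the helper name `words` is hypothetical.

```python
from itertools import product

BASES = "ACGU"     # the four RNA bases
AMINO_ACIDS = 20   # number of amino acids that must be encoded


def words(length: int) -> list[str]:
    """Enumerate every base sequence of the given length."""
    return ["".join(p) for p in product(BASES, repeat=length)]


for length in (1, 2, 3):
    count = len(words(length))
    verdict = "enough" if count >= AMINO_ACIDS else "too few"
    print(length, count, verdict)
```

Three is the smallest word length whose count (64) reaches 20, which is why the surplus codons leave room for synonyms and stop signals.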
64 codons and the amino acid for each
http://en.wikipedia.org/
http://www.medigenomix.de/
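A minimal sketch of how the 64-codon table is used: the table below is the standard genetic code written as a 64-letter string in the conventional U, C, A, G ordering, with `*` marking stop codons. The names `CODON_TABLE` and `translate` are hypothetical, and the reading frame is assumed to start at the first base of the input.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon is reached."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))        # 64
print(translate("AUGGCCUAA"))  # MA
```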
• Complexity at the protein level exceeds complexity at the gene and transcript levels. Individual genes in the genome are transcribed into RNA. In eukaryotes, the RNA may be further processed to remove intervening sequences (RNA splicing) and result in a mature transcript that encodes a protein. Different proteins may be encoded by differently spliced transcripts (alternative splicing products). Moreover, once proteins are produced, they can be processed (e.g. cleaved by a protein-cutting protease) or modified (e.g. by addition of a sugar or lipid molecule). Moreover, proteins may have non-covalent interaction with other proteins (and/or with other biomolecules such as lipids or nucleotides). Each of these can have tissue-, stage- and cell-type specific effects on the abundance, function and/or stability of proteins produced from a single gene.
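Transcription
In eukaryotes, moreover, the mRNA must go through several processing steps before it is ready for translation:
1. Adding a 5' cap: a modified guanine is added to the 5' end of the mRNA. The 5' cap is important for recognition by, and attachment to, the appropriate ribosome.
2. Splicing: the pre-mRNA (mRNA that has not yet been modified, or only partly; also called heterogeneous nuclear RNA, hnRNA) is processed to remove introns, the segments that are not translated. The remaining sequences, which can be translated into protein, are called exons. A single pre-mRNA can often be spliced in several different ways to yield different mature mRNAs, so that one gene (in different tissues or organs) can give rise to different proteins with different functions. This is called alternative splicing. Most mRNA splicing is carried out by enzymes, but some RNA molecules (ribozymes) can catalyze their own splicing.
3. Polyadenylation: the enzyme poly(A) polymerase adds a run of adenines (usually several hundred) to the 3' end of the pre-mRNA. (This modification does not occur in prokaryotes.) The site where the tail is added is marked by a specific signal sequence in the transcript, AAUAAA.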
Gel electrophoresis
• A method for separating DNA, RNA, or protein molecules by using an electric field to drive them through a three-dimensional matrix. The larger the DNA fragment, the more slowly it moves through the matrix. DNA isolated on a gel can be recovered and purified away from the matrix.
Polymerase Chain Reaction (PCR)
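• A method for amplifying a specific DNA fragment, in which paired DNA strands are separated (by high temperature) and each is then used as a template for production of a complementary strand by an enzyme (a DNA polymerase).
• The development of PCR can be traced back to the discovery of DNA polymerase. That enzyme, however, is easily destroyed by heat, so it is not suited to a chain reaction that cycles repeatedly through high temperatures.
• The enzyme used today (Taq polymerase) was isolated in 1976 from a hot-spring bacterium (Thermus aquaticus). Its distinguishing feature is heat tolerance, which makes it an ideal enzyme for PCR.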
Gene clone
• Flow chart of the major steps involved in gene clone set production and use.
[Figure: Annotated Genome, Computational Analysis, Experimental Data, and Literature Mining feed into a Unique Set of Target ORFs]
• Bioinformatic approaches to select target genes for a cloning project. For bacterial genomes, target selection primarily draws from genome sequence, where introns are not a consideration and genome-scale projects are feasible. For eukaryotes, researchers commonly use one or more informatics-based methods to identify sub-groups of target genes that share a common feature, such as function, localization, expression or disease association. As noted, these information sources draw significantly on one another (for example, experimental data is used in genome annotation).
Basics of Signals and Systems
Signal processing Basics
• Time-domain representation: a signal that changes over time.
[Figure: a signal plotted against time]
• Frequency-domain representation: Fourier discovered that essentially any signal can be expressed as a combination of sine and cosine functions (surprise!).
[Figure: a signal and its frequency-domain representation]
Time Operations
Time scaling
$y(t) = x(at)$
$y[n] = x[kn], \quad k > 0$
If $a > 1$, $y(t)$ is a compressed version of $x(t)$.
If $0 < a < 1$, $y(t)$ is an expanded version of $x(t)$.
Reflection
$y(t) = x(-t)$
The signal $y(t)$ is a reflected version of $x(t)$ about $t = 0$.
Time shifting
$y(t) = x(t - t_0)$
If $t_0 > 0$, the signal is shifted toward the right.
If $t_0 < 0$, the signal is shifted toward the left.
Example: Precedence Rule
$y(t) = x(at - b)$
• First shift: $v(t) = x(t - b)$
• Then scale: $y(t) = v(at) = x(at - b)$
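The order of the two operations matters, as the following sketch suggests. The test signal `x` (a unit pulse on 0 <= t < 1) and the helper names `shift` and `scale` are assumptions made for this example only.

```python
def x(t: float) -> float:
    """Hypothetical test signal: a unit pulse on 0 <= t < 1."""
    return 1.0 if 0.0 <= t < 1.0 else 0.0


def shift(signal, b):
    """Return v(t) = signal(t - b)."""
    return lambda t: signal(t - b)


def scale(signal, a):
    """Return y(t) = signal(a * t)."""
    return lambda t: signal(a * t)


a, b = 2.0, 3.0
y_right = scale(shift(x, b), a)  # shift first, then scale: x(at - b)
y_wrong = shift(scale(x, a), b)  # scale first, then shift: x(a(t - b)) = x(at - ab)

print(y_right(1.75), y_wrong(1.75))  # 1.0 0.0
```

With these values the correctly ordered signal is a pulse on 1.5 <= t < 2, while the reversed order places it on 3 <= t < 3.5.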
Time signal, Fourier transform, Z-transform
"People of the blue planet" = "Earthlings": only the point of view differs, so the wording differs!
Relationship between the Time Properties of a Signal and the Appropriate Fourier Representation
Time property     Periodic                               Non-periodic
Continuous (t)    Fourier Series (FS)                    Fourier Transform (FT)
Discrete [n]      Discrete-time Fourier Series (DTFS)    Discrete-time Fourier Transform (DTFT)
Relations Among Fourier Methods
Transform domains
• Time
• Fourier
• Z
• S (Laplace)
Filters
Advantages of digital filters over analog filters
• Highly immune to noise because of the way they are implemented (software/digital circuits)
• Accuracy dependent only on round-off error, which is directly determined by the number of bits
• Easy and inexpensive to change a filter’s operating characteristics (e.g., cutoff frequency)
• Performance not a function of component aging, temperature variation, or power supply voltage
Signal conversion
[Figure: continuous signal x(t) -> band-limiter and sampler -> sampled signal x(kT) -> digital filter (processor) -> sampled signal y(kT) -> reconstruction filter -> continuous signal y(t)]
Sampling as multiplication by a train of impulses
[Figure: (a) a multiplier forming xp(t) from x(t) and p(t); (b) the continuous signal x(t); (c) the impulse train p(t), with unit impulses at t = 0, Ts, 2Ts, …, 7Ts; (d) the sampled signal xp(t), with samples xq(0Ts), xq(1Ts), xq(2Ts), …, xq(6Ts)]
Sampling theorem
• Must sample at a rate at least twice the highest frequency present in the signal (including noise).
• If a signal contains no frequencies higher than $f_c$, the original signal can be completely recovered from samples taken at a rate of at least $2 f_c$ samples/s.
• The sampling frequency $f_s$ must be at least twice the highest frequency present in the signal; this minimum rate is called the Nyquist rate.
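A short sketch of what goes wrong below that rate: with a hypothetical sampling frequency of 10 Hz, a 7 Hz cosine produces exactly the same samples as a 3 Hz cosine, so the two cannot be told apart after sampling. The frequencies and the helper name `sample_cosine` are assumptions chosen for illustration.

```python
import math


def sample_cosine(freq_hz: float, fs_hz: float, n_samples: int) -> list[float]:
    """Sample cos(2*pi*f*t) at the instants t = n / fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]


fs = 10.0                           # sampling frequency; only f <= 5 Hz is safe
low = sample_cosine(3.0, fs, 20)    # 3 Hz satisfies the sampling theorem
high = sample_cosine(7.0, fs, 20)   # 7 Hz violates it and aliases to 10 - 7 = 3 Hz

max_diff = max(abs(p - q) for p, q in zip(low, high))
print(max_diff < 1e-9)  # True
```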
LTI system
• What is LTI (Linear Time-Invariant)? Two conditions:
Linearity: $a y_1(n) + b y_2(n) = L[a x_1(n) + b x_2(n)]$
Time invariance: $y(n, k) = T[x(n - k)] = y(n - k)$
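Both conditions can be checked numerically for a simple system. The sketch below uses a three-point moving average as a stand-in LTI system; the system, the test signals, and the helper names `moving_average` and `delay` are all assumptions made for illustration.

```python
def moving_average(x: list[float]) -> list[float]:
    """y[n] = (x[n] + x[n-1] + x[n-2]) / 3, taking x[n] = 0 for n < 0."""
    padded = [0.0, 0.0] + list(x)
    return [(padded[n] + padded[n + 1] + padded[n + 2]) / 3 for n in range(len(x))]


def delay(x: list[float], k: int) -> list[float]:
    """Shift a finite signal right by k samples, keeping the same length."""
    return [0.0] * k + list(x[:len(x) - k])


x1 = [1.0, 2.0, 0.0, -1.0, 0.0, 0.0]
x2 = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
a, b = 2.0, -3.0

# Linearity: L[a*x1 + b*x2] should equal a*y1 + b*y2.
lhs = moving_average([a * p + b * q for p, q in zip(x1, x2)])
rhs = [a * p + b * q for p, q in zip(moving_average(x1), moving_average(x2))]
linear = all(abs(p - q) < 1e-12 for p, q in zip(lhs, rhs))

# Time invariance: delaying the input should simply delay the output.
shifted_in = moving_average(delay(x1, 2))
shifted_out = delay(moving_average(x1), 2)
time_invariant = all(abs(p - q) < 1e-12 for p, q in zip(shifted_in, shifted_out))

print(linear, time_invariant)  # True True
```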
System operation
Convolution in the time domain corresponds to multiplication in the frequency domain:
$y(n) = \sum_{k=-\infty}^{\infty} x(k)\,h(n-k) = \sum_{k=-\infty}^{\infty} h(k)\,x(n-k)$
$Y(\omega) = X(\omega) \cdot H(\omega)$
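A minimal numerical check of the convolution-multiplication pair for finite sequences. The DFT stands in for the frequency response here, with zero-padding to the output length so that the product corresponds to linear convolution; the helper names `convolve` and `dft` and the example sequences are assumptions.

```python
import cmath


def convolve(x: list[float], h: list[float]) -> list[float]:
    """Direct evaluation of y[n] = sum_k x[k] h[n - k] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm
    return y


def dft(x: list[float], n_points: int) -> list[complex]:
    """N-point DFT of x after zero-padding to n_points samples."""
    padded = list(x) + [0.0] * (n_points - len(x))
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * n / n_points) for n, v in enumerate(padded))
        for k in range(n_points)
    ]


x = [1.0, 2.0, 3.0]
h = [1.0, -1.0]
y = convolve(x, h)
print(y)  # [1.0, 1.0, 1.0, -3.0]

n_points = len(y)
X, H, Y = dft(x, n_points), dft(h, n_points), dft(y, n_points)
max_err = max(abs(Y[k] - X[k] * H[k]) for k in range(n_points))
print(max_err < 1e-9)  # True
```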
Autocorrelation and PSD
$R_{xx}[m] = E\{x[n]\,x[n+m]\}$
$R_{yy}[m] = \sum_{k=-\infty}^{\infty} R_{xx}[m-k]\,R_{hh}[k]$
The power spectral density (PSD) of a signal is defined as the Fourier transform of its autocorrelation function.
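The relation between autocorrelation and PSD can be illustrated on a finite record. In this sketch the expectation is replaced by a circular (periodic) sample average, which is an assumption of the example rather than part of the definition; under it, the DFT of the autocorrelation estimate equals the periodogram $|X[k]|^2 / N$. The data values and helper names are hypothetical.

```python
import cmath


def circular_autocorrelation(x: list[float]) -> list[float]:
    """Sample estimate r[m] = (1/N) * sum_n x[n] x[(n + m) mod N]."""
    n = len(x)
    return [sum(x[i] * x[(i + m) % n] for i in range(n)) / n for m in range(n)]


def dft(x: list[float]) -> list[complex]:
    """Plain N-point DFT."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))
        for k in range(n)
    ]


x = [1.0, 0.0, -1.0, 2.0, 0.5, -0.5, 1.5, -2.0]
r = circular_autocorrelation(x)

psd_from_autocorrelation = [v.real for v in dft(r)]
periodogram = [abs(v) ** 2 / len(x) for v in dft(x)]

max_err = max(abs(p - q) for p, q in zip(psd_from_autocorrelation, periodogram))
print(max_err < 1e-9)  # True
```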
Example
Digital or analog?
Signal Detection and Estimation
Estimation and detection
Signal detection and estimation
• Estimation: $\hat{X}[n]$
• Detection: hypothesis testing
• Five steps for analyzing genomic and proteomic data:
1. Describe and identify the measurement system $S$.
2. Define the signal of interest $x[n]$: map the biological space into a numerical space.
3. Formulate the problem (estimation, detection, or analysis).
4. Solve the problem and compute the output signals.
5. Interpret the results in a biological context.
DNA sequencing
The DNA sequencing process
• DNA sample preparation
• Electrophoresis
• Processing
Processing the electropherogram data to identify the DNA sequence
• Conditioning the signal and increasing the S/N ratio
• Identifying the underlying DNA sequence
DNA sequencing - electropherogram
http://www.genome.uab.edu/
Model of DNA sequencing
Blurring: for example, diffusion effects and instrument noise.
Wiener filter
http://en.wikipedia.org/wiki/Wiener_filter
DNA sequence estimation using LTI filtering
$\sum_k h[k]\,R_{\tilde{x}\tilde{x}}[m-k] = R_{x\tilde{x}}[m]$
$H(e^{j\omega}) = \dfrac{S_{x\tilde{x}}(e^{j\omega})}{S_{\tilde{x}\tilde{x}}(e^{j\omega})}$
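To make the ratio of spectra concrete, the sketch below evaluates the filter gain at a single frequency under an additional assumption that is not stated on the slide: the measured signal is the true signal blurred by a channel with response G plus uncorrelated noise with PSD Sww. In that case the ratio reduces to the familiar Wiener deconvolution gain. The function name `wiener_gain` and all numbers are hypothetical.

```python
def wiener_gain(G: complex, Sxx: float, Sww: float) -> complex:
    """Wiener gain at one frequency, assuming measured = G * true + noise.

    G   : channel (blurring) frequency response at this frequency
    Sxx : PSD of the true signal at this frequency
    Sww : PSD of the additive noise, assumed uncorrelated with the signal
    """
    return G.conjugate() * Sxx / (abs(G) ** 2 * Sxx + Sww)


print(wiener_gain(0.5, 4.0, 0.0))  # 2.0 (no noise: plain inverse filter 1/G)
print(wiener_gain(0.5, 4.0, 1.0))  # 1.0 (noise present: the gain is reduced)
```

With no noise the filter simply inverts the blur; as the noise grows relative to the signal, the gain falls toward zero instead of amplifying the noise.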
Homomorphic Blind Deconvolution
• The Wiener filter can lead to significant errors in the presence of diffusion effects.
• A high-pass filter is used to reduce blurring (low frequency).
Results
An error rate of 1.06% is better than that reported for fluorescence-based sequencing instruments!
Model-based estimation techniques
There are four main distortions introduced by sequencing:
1. Loading artifacts
2. Diffusion effects
3. Fluorescence interference
4. Additive instrument noise
The underlying sequence is inferred by working backwards through a model of these distortions.
Gene identification
Once the DNA sequence has been identified, it needs to be analyzed to identify genes and coding sequences.
DNA signal properties
Bacterium Aquifex aeolicus
• Consider the autocorrelation function of $x_a[n]$.
• The non-flat shape of the spectrum reveals correlations at low frequencies, indicating that base pairs that are far apart seem to be correlated.
DNA signal properties
• At the thin peak at $2\pi/3$, the increased correlation corresponds to the tendency of nucleotides to be repeated along the DNA sequence with period 3 and is indicative of coding regions. Candidate explanations:
• The triplet nature of the codon
• Potentially, codon bias (unequal usage of codons)
• The biased usage of nucleotide triples in genomic DNA (triplet bias)
• Yin and Yau showed that the period-3 property is not affected by codon bias.
• The period-3 property of coding regions seems to be generated by unbalanced nucleotide distributions in the three codon positions.
DNA signal processing for Gene identification
• Fickett: the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes, i.e., to predict the amino acid sequence of the protein and so provide insight into its function.
• The premise of all methods is to exploit the period-3 property of coding regions by processing the DNA signal to identify regions with strong period-3 correlation.
DNA signal processing for Gene identification
• DNA spectrum:
$S_x[k] = |X_a[k]|^2 + |X_t[k]|^2 + |X_c[k]|^2 + |X_g[k]|^2$
• Signal-to-noise ratio:
$P_x(N/3) = \dfrac{S_x[N/3]}{\bar{S}_x}$
• Tiwari et al. observed that $P_x$ is large for most coding sequences in a variety of organisms, but not in non-coding regions.
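The spectrum and its period-3 ratio can be sketched as follows. Each base gets a 0/1 indicator sequence, the squared DFT magnitudes are summed over the four bases, and the value at k = N/3 is divided by the average of the spectrum. Taking that average over all k (including k = 0) is an assumption of this sketch, as are the function names and the two toy sequences, which are synthetic rather than real genes.

```python
import cmath


def dft_at(x: list[float], k: int) -> complex:
    """Single DFT coefficient X[k] of a finite sequence."""
    n_total = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * n / n_total) for n, v in enumerate(x))


def dna_spectrum(seq: str, k: int) -> float:
    """S_x[k]: sum over the four bases of |X_b[k]|^2 for the indicator sequences."""
    return sum(
        abs(dft_at([1.0 if base == b else 0.0 for base in seq], k)) ** 2
        for b in "ATCG"
    )


def period3_snr(seq: str) -> float:
    """P_x(N/3) = S_x[N/3] / mean(S_x); the length is assumed to be a multiple of 3."""
    n_total = len(seq)
    mean_spectrum = sum(dna_spectrum(seq, k) for k in range(n_total)) / n_total
    return dna_spectrum(seq, n_total // 3) / mean_spectrum


print(round(period3_snr("GCA" * 12), 6))  # 12.0 (strong period-3 structure)
print(round(period3_snr("ACGT" * 9), 6))  # 0.0  (period 4, nothing at N/3)
```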
Fourier spectra
[Figure: Fourier spectra of a coding stretch of DNA and of a non-coding stretch of DNA]
Filtering methods applied
Predicting the five exons of gene F56F11.4 on C. elegans chromosome III.
[Figure: exon predictions obtained with a window, an IIR antinotch filter, and multistage filters]
Protein Hotspots identification
Once coding regions have been identified, the corresponding protein sequence can be determined by mapping the coding region to the amino acid sequence using the genetic code.
Protein Signal Definition
The new physicomathematical approach presented here is called the Resonant Recognition Model (RRM). The RRM is based on representing the protein primary structure as a numerical series by assigning to each amino acid a physical parameter value relevant to the protein’s biological activity.
Protein Signal Definition
• The RRM is a physical and mathematical model that interprets the linear information in a protein sequence using signal analysis methods.
• It comprises two stages:
1. The amino acid sequence is transformed into a numerical sequence. Each amino acid is represented by the value of its electron-ion interaction potential (EIIP), which describes the average energy states of all valence electrons in that amino acid.
2. The numerical series obtained this way is then analyzed by digital signal analysis methods in order to extract information pertinent to the biological function.
EIIP
RESONANT RECOGNITION MODEL (RRM)
Prediction of protein hotspot
cytochrome C proteins
System identification and Analysis
Signal processing view of the cell
Signal coordination at cell level
System view
Gene expression
• A DNA microarray (also commonly known as a gene chip, DNA chip, or biochip) is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array.
• Types of DNA microarray include cDNA microarrays and oligonucleotide microarrays.
• In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA template. cDNA is often used to clone eukaryotic genes in prokaryotes.
Gene expression
• One sample from a tumor; one sample from normal tissue.
• White: highly expressed under treatment A
• Gray: no difference
• Dark: highly expressed under treatment B
Changes over time: drug effects measured over 6 hours.
cDNA microarrays
The procedure begins by attaching the DNA sequences of thousands of genes onto a microscope slide in a pattern of spots, with each spot containing only DNA sequences of a single gene.
Oligonucleotide microarrays
• In oligonucleotide microarrays (or single-channel microarrays), the probes are designed to match parts of the sequence of known or predicted mRNAs.
• Instead of attaching full-length DNAs, oligonucleotide microarrays make use of short oligonucleotides chosen to be specific to individual genes.
Oligonucleotide microarrays
These microarrays give estimates of the absolute level of gene expression; comparing two conditions therefore requires two separate microarrays.
DNA microarray
$X = \begin{bmatrix} x_0[m] \\ x_1[m] \\ \vdots \\ x_{N-1}[m] \end{bmatrix}$
Principal component analysis
• Principal component analysis transforms the original set of variables into a smaller set of linear combinations that account for most of the variance of the original set. Its purpose is to determine factors (i.e., principal components) that explain as much of the total variation in the data as possible with as few factors as possible.
$PC(1) = w_{11} X_1 + w_{12} X_2 + \dots + w_{1p} X_p$
$\vdots$
$PC(m) = w_{m1} X_1 + w_{m2} X_2 + \dots + w_{mp} X_p$
The principal components are those uncorrelated linear combinations PC(1), PC(2), …, PC(m) whose variances are as large as possible.
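A minimal sketch of how the weights of PC(1) might be obtained: compute the sample covariance matrix, then find its dominant eigenvector by power iteration. Power iteration is used here only to keep the example dependency-free; the small data set and the function names are hypothetical.

```python
import math


def covariance_matrix(rows: list[list[float]]) -> list[list[float]]:
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_principal_component(rows: list[list[float]], iterations: int = 200):
    """Return (weights of PC(1), fraction of total variance it explains)."""
    cov = covariance_matrix(rows)
    p = len(cov)
    w = [1.0] * p  # starting vector for power iteration
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    variance = sum(w[i] * cov[i][j] * w[j] for i in range(p) for j in range(p))
    total = sum(cov[i][i] for i in range(p))
    return w, variance / total


# Hypothetical data: the second variable is roughly twice the first.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
weights, explained = first_principal_component(data)
print(weights, explained)
```

For this data the weights come out close to the direction (1, 2), and a single component accounts for nearly all of the variance.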
Eigenvalues and Eigenvectors
Definition
Let $A$ be an $n \times n$ matrix. A scalar $\lambda$ is called an eigenvalue of $A$ if there exists a nonzero vector $x$ in $R^n$ such that
$Ax = \lambda x.$
The vector $x$ is called an eigenvector corresponding to $\lambda$.
Computation of Eigenvalues and Eigenvectors
Let $A$ be an $n \times n$ matrix with eigenvalue $\lambda$ and corresponding eigenvector $x$. Thus $Ax = \lambda x$. This equation may be rewritten
$Ax - \lambda x = 0$
giving
$(A - \lambda I_n)x = 0$
Solving the equation $|A - \lambda I_n| = 0$ for $\lambda$ leads to all the eigenvalues of $A$.
On expanding the determinant $|A - \lambda I_n|$, we get a polynomial in $\lambda$. This polynomial is called the characteristic polynomial of $A$.
The equation $|A - \lambda I_n| = 0$ is called the characteristic equation of $A$.
Example 1
Find the eigenvalues and eigenvectors of the matrix
$A = \begin{bmatrix} -4 & -6 \\ 3 & 5 \end{bmatrix}$
Solution
Let us first derive the characteristic polynomial of $A$. We get
$A - \lambda I_2 = \begin{bmatrix} -4 & -6 \\ 3 & 5 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -4-\lambda & -6 \\ 3 & 5-\lambda \end{bmatrix}$
$|A - \lambda I_2| = (-4-\lambda)(5-\lambda) + 18 = \lambda^2 - \lambda - 2$
We now solve the characteristic equation of $A$.
$\lambda^2 - \lambda - 2 = 0 \Rightarrow (\lambda - 2)(\lambda + 1) = 0 \Rightarrow \lambda = 2 \text{ or } -1$
The eigenvalues of $A$ are 2 and –1.
For $\lambda = 2$:
$(A - 2I_2)x = \begin{bmatrix} -6 & -6 \\ 3 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0$
This leads to the system of equations
$-6x_1 - 6x_2 = 0$
$3x_1 + 3x_2 = 0$
giving $x_1 = -x_2$. The solutions to this system of equations are $x_1 = -r$, $x_2 = r$, where $r$ is a scalar. Thus the eigenvectors of $A$ corresponding to $\lambda = 2$ are nonzero vectors of the form
$r \begin{bmatrix} -1 \\ 1 \end{bmatrix}$
For $\lambda = -1$:
$(A + 1I_2)x = \begin{bmatrix} -3 & -6 \\ 3 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0$
Thus $x_1 = -2x_2$. The eigenvectors of $A$ corresponding to $\lambda = -1$ are nonzero vectors of the form $s \begin{bmatrix} -2 \\ 1 \end{bmatrix}$.
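The result of this example can be checked numerically. The sketch solves the characteristic equation of a 2 x 2 matrix with the quadratic formula and then confirms that multiplying each eigenvector by the matrix only rescales it. The helper names are hypothetical, and the eigenvalue routine assumes the eigenvalues are real.

```python
import math

A = [[-4.0, -6.0], [3.0, 5.0]]


def eigenvalues_2x2(m: list[list[float]]) -> tuple[float, float]:
    """Roots of lambda^2 - trace*lambda + det = 0 (real eigenvalues assumed)."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    """Matrix-vector product for a 2 x 2 matrix."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


print(eigenvalues_2x2(A))      # (2.0, -1.0)
print(matvec(A, [-1.0, 1.0]))  # [-2.0, 2.0], i.e. 2 * [-1, 1]
print(matvec(A, [-2.0, 1.0]))  # [2.0, -1.0], i.e. -1 * [-2, 1]
```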
Singular value decomposition (SVD)
• Suppose $M$ is an m-by-n matrix whose entries come from the field $K$, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form
$M = U \Sigma V^{*}$
• The matrix $V$ thus contains a set of orthonormal "input" or "analysing" basis vector directions for $M$.
• The matrix $U$ contains a set of orthonormal "output" basis vector directions for $M$.
• The matrix $\Sigma$ contains the singular values, which can be thought of as scalar "gain controls" by which each corresponding input is multiplied to give a corresponding output.
SVD example
• A non-negative real number $\sigma$ is a singular value for $M$ if and only if there exist unit-length vectors $u$ in $K^m$ and $v$ in $K^n$ such that
$Mv = \sigma u$ and $M^{*}u = \sigma v.$
Eigengenes from applying SVD
Project CLB2 and CLN3
Apoptosis System Identification
PCA results
Summary
• We tried to link signals, systems, and biology.
• Filtering is necessary for removing artifacts.
• We learned “Signal detection and estimation”:
1. DNA sequencing
2. Gene identification
3. Protein hotspots identification
• We learned “System identification and analysis”:
1. Gene regulation systems
2. Protein signal systems
• We offered a new way of thinking about data mining.
Thanks