Computational Biology
Dr. Jens Allmer
Lecture Slides Week 3
MBG404 Overview
Processing
Pipelining
Generation
Data
Storage
Mining
Sample preparation for mass spectrometry
Centrifugation of crude
cell extracts in
sucrose gradient
0.5 M
1.3 M
Thylakoids
1.8 M
Starch, etc.
Sucrose
gradient
Separation of the
thylakoid fractions via
SDS PAGE
Cutting of
interesting bands
from the gel
Proteolytic (trypsin)
digestion in gel
Liquid chromatography
of resulting peptides
Mass Spectrometry (MS)
1D SDS PAGE of
thylakoid
fraction from crude cell
extracts
of Chlamydomonas
reinhardtii.
Mass
spectrometric
methods for
protein
identification
Schematic depiction of an ion trap mass spectrometer
Peng, J. and Gygi, S.P. (2001) Proteomics: the move to mixtures. J. Mass Spectrom., 36,
1083-1091.
Example tandem MS spectrum
479.4
100
Scan 4502
626.1
100
90
95
90
+ d Z ms [ 622.30-632.30]
85
85
80
80
75
70
Scan 4501
626.6
2
65
70
65
55
Relative Abundance
50
100
45
40
35
95
30
90
627.1
25
20
85
15
1
80
627.7
10
0
622
623
624
625
626
70
627
m/z
628
629
630
631
Relative Abundance
828.2
55
50
957.3
45
40
30
632
715.2
25
65
958.2
20
60
15
55
+ c Full ms [ 400.00-2000.00]
50
10
835.5
602.2
400
600
0
40
200
982.4
35
30
610.2
1054.4
25
1156.2
852.2
20
503.9578.8
445.1
1157.5
703.2
765.9 885.0
1217.7
1259.8
10
1469.7
5
0
400
600
800
1070.3
406.2
5
45
15
60
35
5
75
3
75
60
626.3
535.8
95
1000
1200
m/z
1400
1600
1800
2000
800
1000
m/z
1200
1400
1600
1800
Mass spectrometric peptide fragmentation spectrum analysis
(Sequest or Mascot)
Ion trap mass spectrometry (ITMS)
Single peptide ions
Mass spectra
Collision-induced dissociation (CID)
Sequentially fragmented
peptide ions
Tandem mass spectra
Database
search
‚In silico‘ tryptic
digestion
Theoretical MS/MS
fragmentation pattern
Peptide
amino acid sequences
‚Hit‘
Significant match with
theoretical fragmentation pattern
of a database sequence
DATABASE
Translated DNAor protein sequences
Cross correlation
• Digesting the database with the enzyme in
question.
• Picking all fragments within a mass window
close to the precursor mass of the peptide in the
mass spectrum
• Calculating an artificial spectrum from all those
fragments
• Cross correlate spectra to original mass
spectrum
WLQYSEVIHAR
Theoretical spectrum in red (a,b,c,x,y,z ions) and measured spectrum in blue
Mass spectrometric peptide fragmentation spectrum analysis
(Sequest or Mascot)
Ion trap mass spectrometry (ITMS)
Single peptide ions
Mass spectra
Collision-induced dissociation (CID)
Sequentially fragmented
peptide ions
Tandem mass spectra
Database
search
‚In silico‘ tryptic
digestion
DATABASE
Translated DNAor protein sequences
Theoretical MS/MS
fragmentation pattern
Peptide
amino acid sequences
Limitation:
‚Hit‘
Significant match with
theoretical fragmentation pattern
of a database sequence
Identification is limited
to peptide sequences present in the database.
Database Search Software
• Many tools have been developed
– OMSSA (NCBI, discontinoued)
– X!Tandem (The global proteome machine)
X!Tandem
• http://www.thegpm.org/tandem/
X!Tandem Initalization Files
• X!Tandem
– Taxonomy.xml
– Default_Input.xml
– Input.xml
• Running X!Tandem
– ?>tandem.exe input.xml
• That was easy
– But behold, what about the input?
OMSSA
• Open Mass Spectrometry Search Algorithm
• Discontinued
– Due to problems?
• Still existing uses
– PeptideShaker
– SearchGUI
Sequence Alignment
• Exact
– simple
target
pattern
• Approximate
– More difficult
target
pattern
Sequence Alignment
• Exact pattern matching
–
–
–
–
Naive method aligns pattern with each location of the target
Boyer-Moore indexes the pattern to skip some alignments
Wu-Manber indexes many patterns and skips some alignments
Indexing
• Suffix tree indexes target and then quickly finds each pattern
• Many other methods
Sequence Alignment
• Approximate pattern matching
– Pairwise
• Local
– Smith Waterman
– BLAST
– FASTA
• Global
– Needlemann Wunsch
– Multiple
• T-Coffee
• ClustalW
• ...
Basic Local Alignment Seach Tool
• Input
– Pattern
– Target
– Search parameters and settings
• Output
– Alignments in various formats
• XML
• Help
– http://www.ncbi.nlm.nih.gov/books/NBK1763/
BLAST
• Target
– Needs to be indexed
– Cannot be FASTA
– Must fit to the pattern and BLAST variant
• protein target and protein pattern can be searched using blastp
• Target indexing
– makeblastdb, in the BLAST package can index FASTA files
– Needs sequence input (e.g. FASTA, asn.1)
– Needs sequence type to be provided e.g.: protein
BLAST
• blastp
– Needs indexed database
– Needs query sequence (can be unindexed FASTA)
– Produces alignments
22
Blast flavors
Query:
DB:
•
•
•
•
•
DNA
DNA
Protein
Protein
BlastN - nt versus nt database
BlastP - protein versus protein database
BlastX - translated nt (6 frames) versus protein database
tBlastN - protein versus translated nt database (6 frames)
tBlastX - translated nt versus translated nt database (both 6 frames)
BLAST Output
• XML
– -outfmt 5
• This switch leads to XML output
End Theory I
• 5 min mindmapping
• 10 min break
Practice I
Download Blast
• http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE
_TYPE=BlastDocs&DOC_TYPE=Download
– Get blastp and makeblastdb from mbg404 since you are not
allowed to install anything
• Download a Fasta file (protein, genome, collection of
sequences in fasta format)
– Database must consist of amino acids since we only have
access to blastp today
• Use makeblastdb from the Blast package to index the file
• Several files will be created when you do it right
MakeDB
• Example
– makeblastdb -in seq.fasta -dbtype prot -out seqBl –title
seqBlastDB
• More information?
– Go to the doc folder of BLAST
– Documentation is there
– http://www.ncbi.nlm.nih.gov/books/NBK1763/
BLAST
• Now that we have an indexed database try to run BLAST
• Read documentation and try to solve the simplest case
– You will need the indexed database and you will need a FASTA
file as query
– You could create queries from the database and slightly change
them
• Good luck
OMSSA
• Unzip folder and check
– Alternatively, download from NCBI
•
•
•
•
•
MS/MS mgf file
Database file as FASTA
makeblastdb.exe
omssacl.exe
usermods.xml
OMSSA
Before running OMSSA, database file must be converted to
BLAST-like format.
So let’s run makeblastdb.exe to create a hash-indexed
database
OMSSA
Here 2 different settings are used.
First one is with 0.05 product ion tolerance
Second one is with default product ion tolerance
For variable modifications (-mv) check usermods.xml
X!Tandem
• Unzip folder and check
•
•
•
•
Mgf formated spectra (file)
Database file (FASTA)
tandem-win32-10-12-01-1 folder
Used .xml configuration files (default_input.xml, input.xml
and taxonomy.xml)
• To get the same output given in zip folder;
– Replace configuration files in «tandem-win\bin» folder with ones
in «used» folder.
– Also copy database file to «fasta» folder and .mgf file to «bin» in
«tandem-win»
X!Tandem Console Application
X!Tandem Default Input
Parameters such as mass tolerances, enzyme type, number of
charged for search can be reset in default_input.xml
X!Tandem Input.xml
In input.xml file, you should specify path of:
• taxonomy.xml
• default_input.xml
• Spectra filename
• Output filename
NOTE: Here input.xml and all files above are in same folder(directory))
X!Tandem Taxonomy
In taxonomy file, you should specify «database file path». In this
example, database file is in «fasta» folder in «Xtandem\tandem-win3210-12-01-1» folder.
X!Tandem Output
End Practice I
• 15 min break
Theory II
Automation
• BLAST needed sequence file preprocessing
• OMSSA, X!Tandem, etc may need conversion of spectra
files
– A lot of manual processes
• Needed: an automation facility
• Solution: computational pipelines
Computational Pipeline
Spectra
mgf
mzXML
dta
mz2
...
Spectra
Format
Converter
DB
dta
PepNovo
Result
Converter
Analysis Network
Result
Converter
Spectra
mgf
mzXML
dta
mz2
...
2DB
Lutefisk
Spectra
Format
Converter
GPF
PepNovo
Result
Converter
General Pipeline Considerations
• Data cannot be connected to data
• Operations cannot be connected directly either
• Data needs to be transformed (operation)
Data
Store
DB
Data
Store
DB
Data Flow
Operation
Data
X
Operation
• In the example the data element cannot be
directly connected to the DB
• The data element is also not necessar it has
been added to clarify that the process
generates data which will go directly to the DB
Data
OpenMS
Pipeline Examples
• We will see a few examples
– OpenMS/TOPP
– Trans Proteomics Pipeline
– Proteomatics
– Ensembl
TOPP - The OpenMS Proteomics Pipeline
http://open-ms.sourceforge.net/
Trans-Proteomics Pipeline TPP
http://sourceforge.net/projects/sashimi/
Proteomatic
http://www.uni-muenster.de/hippler/proteomatic/
Proteomatic
http://www.uni-muenster.de/hippler/proteomatic/
Ensembl
http://genome.cshlp.org/content/14/5/934.full
http://www.ensembl.org
Standardization
• Some programs have the same aim
– Unfortunately, produce largely different output
– Depend on different input formats
– One need for pipelines arises from this
• Standardization can eleviate that problem
• Currently mostly XML
– Developments of controlled vocabularies are seen
• In ten years full transition to ontologies expected
Standardization (HUPO PSI)
Selfmade
• Windows
– Batch script
– Powershell
• Linux
– Bash script
– Shell script
• Common
– A file that contains instructions
– Usually found in the console
Delete Temp
• Batch script
– cd c:\
– cd Windows\Temp
– rm –r –s *.*
• Save file as
– DeleteTemp.bat
• Put the file into
– C:\Users\%USERNAME%\AppData\Roaming\Microsoft\Windows
\Start Menu\Programs\Startup
• Next startup
– The temporary files will be deleted
Pipelining
• The previous example performed pipelining
• You can use this for anything like
– First making a BLAST DB
– Second searching it
• Advantage
– You have a log of the settings etc.
– You can repeat it at any time
End Theory II
• 5 min mindmapping
• 10 min break
Practice II
Raw Data
• Screenshots
• Copy paste
• Unstructured
• Not integrated
• Unreflected
Information
• Structured Data
PepNovo
PEAKS
Lutefisk
OMSSA
• Sorted
• Integrated
Prediction Distance
1
0.8
0.6
0.4
0.2
0
0.22-0.43
• Properly graphed
• Figure
– Number
– Caption
– Reference
0.55-0.66
0.67-0.83
Spectral Quality
Figure 1: Spectral Quality (present fragment ions/
expected fragment ions) versus Prediction Distance to the
true sequence (normalized edit distance; 0:great and
1:poor). Predictions were done by PepNovo, PEAKS, and
Lutefisk while identification was done with OMSSA. All
MS/MS spectra were of charge 1.
Even Better
PepNovo
COMAS
Lutefisk
OMSSA
PEAKS
1
Prediction Quality
0.8
0.6
0.4
0.2
0
0.22-0.43
0.55-0.66
0.67-0.83
Spectral Quality
Figure 5: Spectral Quality (present fragment ions / expected fragment ions)
versus Prediction Quality (normalized edit distance). The box-and-whisker
plot presents three groups at different spectral quality. Note there were no
measurements between 0.43 and 0.55, before 0.22 and after 0.83.
Presenting Data
• When presenting data in your manuscripts:
• Raw data (not acceptable)
• Information (minimum)
• Knowledge (strive for this)
Whiteboardmaths.com
Stand SW 100
© 2004 - 2008 All rights reserved
Click when ready 
In addition to the demos/free presentations in this area there are at
least 8 complete (and FREE) presentations waiting for download under
the My Account button.
Simply register to download immediately. 63
www.similima.com
Median, Quartiles, Inter-Quartile Range and Box Plots.
Measures of Spread
Remember: The range is the measure of spread
that goes with the mean.
Example 1. Two dice were thrown 10 times and their
scores were added together and recorded. Find the mean
and range for this data.
7, 5, 2, 7, 6, 12, 10, 4, 8, 9
Mean = 7 + 5 + 2 + 7 + 6 + 12 + 10 + 4 + 8 + 9
10
= 70
=7
10
Range = 12 – 2 = 10
www.similima.com
64
Median, Quartiles, Inter-Quartile Range and Box Plots.
Measures of Spread
The range is not a good measure of spread because one
extreme, (very high or very low value) can have a big
affect. The measure of spread that goes with the
median is called the inter-quartile range and is
generally a better measure of spread because it is not
affected by extreme values.
A reminder about
the median
www.similima.com
65
Averages (The Median)
The median is the middle value of a set of data once
the data has been ordered.
Example 1. Robert hit 11 balls at Grimsby driving
range. The recorded distances of his drives, measured
in yards, are given below. Find the median distance for
his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70
50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140
Single middle value
Ordered data
Median drive = 85 yards
www.similima.com
66
Averages (The Median)
The median is the middle value of a set of data once
the data has been ordered.
Example 1. Robert hit 12 balls at Grimsby driving range.
The recorded distances of his drives, measured in yards,
are given below. Find the median distance for his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 135, 95, 70
50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 135, 140
Two middle values so
take the mean.
Ordered data
Median drive = 90 yards
www.similima.com
67
Finding the median, quartiles and inter-quartile range.
Example 1: Find the median and quartiles for the data below.
12,
6,
4,
9,
8,
4,
9,
8,
5,
9,
8,
10
10,
12
Order the data
Q2
Q1
4,
4,
5,
6,
Lower
Quartile
= 5½
8,
8,
Q3
8,
Median
= 8
9,
9,
9,
Upper
Quartile
= 9
Inter-Quartile Range = 9 - 5½ = 3½
www.similima.com
68
Finding the median, quartiles and inter-quartile range.
Example 2: Find the median and quartiles for the data below.
6,
3,
9,
8,
4,
10,
8,
4,
15,
8,
10
Order the data
Q2
Q1
3,
4,
4,
6,
Lower
Quartile
= 4
8,
8,
Median
= 8
Q3
8,
9,
10,
10,
15,
Upper
Quartile
= 10
Inter-Quartile Range = 10 - 4 = 6
www.similima.com
69
Discuss the calculations below.
Battery Life:
The life of 12 batteries recorded in hours is:
2, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 15
Mean = 93/12 = 7.75 hours and the range = 15 – 2 = 13 hours.
2, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 15
Median = 8 hours and the inter-quartile range = 9 – 6 = 3 hours.
The averages are similar but the measures of spread are
significantly different since the extreme values of 2 and 15 are
not included in the inter-quartile range.
www.similima.com
70
Box and Whisker Diagrams.
Box plots are useful for comparing two or more sets of data like
that shown below for heights of boys and girls in a class.
Anatomy of a Box and Whisker Diagram.
Lower
Lowest
Quartile
Value
Whisker
4
5
Median
Upper
Quartile
Whisker
Box
6
7
Highest
Value
8
9
10
11
12
Boys
130
140
150
160
www.similima.com
170
180
cm
190
Girls
Box Plots
71
Drawing a Box Plot.
Example 1: Draw a Box plot for the data below
Q2
Q1
4,
4,
5,
6,
8,
8,
Lower
Quartile
= 5½
4
5
Q3
8,
Median
= 8
6
7
8
www.similima.com
9
9,
9,
9,
10,
12
Upper
Quartile
= 9
10 11
12
72
Drawing a Box Plot.
Example 2: Draw a Box plot for the data below
Q2
Q1
3,
4,
4,
6,
8,
Lower
Quartile
= 4
3
4
5
6
Q3
8,
8,
Median
= 8
7
8
9
www.similima.com
9,
10,
10,
15,
Upper
Quartile
= 10
10 11
12 13
14 15
73
Drawing a Box Plot.
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
Q2
QL
Qu
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower
Quartile
= 158
130
140
Upper
Quartile
= 180
Median
= 171
150
160
www.similima.com
170
180
cm
190
74
Drawing a Box Plot.
Question: Gemma recorded the heights in cm of girls in the same class and
constructed a box plot from the data. The box plots for both boys and girls
are shown below. Use the box plots to choose some correct statements
comparing heights of boys and girls in the class. Justify your answers.
Boys
130
140
150
160
170
180
cm
Girls
1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
5. The smallest person is a girl.
www.similima.com
4. The boys show less variability
in height.
75
6. The tallest person is a boy.
190
Konstanz Information Miner
• We will use the Workflow Management and Data
Analytics Platform
• First we need to find out how to get our data into
KNIME
Create Data
• Use Excel to create two colums
– Girls, boys
• Make a few hundred random numbers (randbetween)
– 140 -170 for girls
– 150 - 180 for boys
• Copy the table
• Paste into Notepad++
• Save as Distribution.txt
KNIME Data Import
• Open Knime
• Select the folder containing the data as workspace
• Right click LOCAL
– Select new workflow
– Name it HeightAnalysis
• Drag and Drop Distribution.txt into the workflow
Box Plot
• Type box to find box plot
node
• Double click
• Right click Box Plot node
– Select Execute and open
views
• Done
Workflow