
Parallel Detection of Regulatory
Elements with gMP
Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso
Damayanti Gupta
CMSC 838 Presentation
Motivation

Fundamental question






How are expression levels of thousands of genes regulated?
Very important
Understanding of gene function
 Response to environment
Understand genetic causes of diseases
 Evaluate effects of drugs
Detect mutations
Remember


Sets of genes -> Pathways -> Genetic Networks
Gene regulation
 Control decisions turn genes on/off
 Gene Regulation Network
CMSC 838T – Presentation
Talk Overview

Overview of talk

Motivation

Technique

Experiment

Related work

Conclusions
Technique

Motifs upstream of genes regulate gene expression



Motifs are sites of regulatory activity
Identify regulatory motifs by combining

Gene expression data

Detect common motifs occurring upstream of genes
Huge datasets

Utilise parallel computing
Technique

gRNA

Java development framework
gMP

Java communication library

REDUCE

Algorithm to identify regulatory motifs
REDUCE parallelised with gMP



Increase computing power
Get motifs ranked in statistical significance
gRNA framework

Consists of APIs
gRNA - APIs

Interact with data sources

Provide functionality from biology

Pipelines tasks into unified process

Repository of resources

Distributed programming
gRNA environment

gRNA Grid


Clustered computing environment
Application written for gRNA

Multiple-tier application

Applications operate from client computer

Communicates with cluster through single computer

Hosts EJB server
Server identifies processing nodes


each of these perform tasks
gRNA Grid
gMP

Java based message passing tool

Built on top of sockets

Manages virtual processors to run on available
machines

Scalable

Machines added/removed easily
gMP

Processes are grouped

Communication primitives provided for sending and
receiving data

Collective communication to several nodes enabled
modularly and efficiently

Enables functions to be implemented on data
REDUCE algorithm

Based on model

Upstream motifs contribute additively to expression level of
each gene

Quantify the extent to which these motifs contribute to
expression data
Fit log of expression ratio to sum of activating and
inhibitory terms
Find statistically most significant motifs



Plots of fitting parameters suggest biological function
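
The additive model above can be written as a linear fit. The slides state it only in words, so the symbol names below are assumptions chosen for illustration: A_g is the log expression ratio of gene g, N_{\mu g} is the number of occurrences of motif \mu upstream of gene g, F_\mu is the fitted contribution of motif \mu (positive for activating, negative for inhibitory), and C is a constant baseline.

```latex
A_g \;=\; \log\frac{E_g}{E_g^{\mathrm{ref}}}
\;\approx\; C + \sum_{\mu} F_\mu \, N_{\mu g}
```

Under this reading, the fitting parameters plotted on the slide correspond to the F_\mu values.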
REDUCE algorithm

Terms

Occurrence vector

Measure of how often a motif is found
Expression vector


Measure of gene expression
REDUCE method
Consists of
1) Motif frequency counter

counts occurrences of DNA motifs upstream of each ORF

motifs are about 7 to 11 nucleotides in length

yields the occurrence vectors
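
The counting step can be sketched as follows. This is a minimal illustration, not the gRNA implementation: the function names and the toy sequences are hypothetical, overlapping occurrences are counted (an assumption), and a motif length of 3 is used only to keep the example readable (the slides work with motifs of about 7 to 11 nucleotides).

```python
from collections import Counter


def count_motifs(upstream, k):
    """Count every length-k substring (overlaps included) in one upstream region."""
    return Counter(upstream[i:i + k] for i in range(len(upstream) - k + 1))


def occurrence_vectors(upstream_by_orf, k):
    """Return the ORF order and {motif: [count in ORF 0, count in ORF 1, ...]}."""
    orfs = list(upstream_by_orf)
    per_orf = [count_motifs(upstream_by_orf[orf], k) for orf in orfs]
    motifs = sorted(set().union(*per_orf))
    # A Counter returns 0 for motifs that never occur in that ORF.
    return orfs, {m: [counts[m] for counts in per_orf] for m in motifs}


# Hypothetical toy input: two ORFs with short upstream regions.
toy_upstream = {"ORF1": "ACGTACGT", "ORF2": "TTACGTT"}
toy_orfs, toy_vectors = occurrence_vectors(toy_upstream, k=3)
```

Each entry of `toy_vectors` is one occurrence vector: one count per ORF for a single motif.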
REDUCE algorithm
2) Significant motif finder

Use

i) Normalised occurrence vector nμ made for each motif μ
ii) Normalised vector a of logs of gene expression ratios
Take the dot product of these, (a · nμ), and square it.




Can be considered as frequency of occurrence × expressive
power of regulatory motif
It is squared to get rid of negatives
Correlates gene expression with occurrence of motif
Motif with the largest squared dot product is the most significant
....

a is modified to remove effect of this motif

residual gene expression vector
Process repeated until motifs are ranked
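
The scoring and residual steps can be sketched as a greedy loop. This is a sketch under stated assumptions, not the authors' code: "normalised" is taken to mean zero mean and unit length, the residual is formed by subtracting the projection of a onto the winning nμ, and motifs with constant counts are skipped. The expression vector is the one quoted on the table slide; the motif names and occurrence counts are hypothetical.

```python
import math


def normalise(v):
    """Centre a vector on zero mean and scale it to unit length."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def rank_motifs(expression, occurrences, top=None):
    """Greedy ranking: pick the motif with the largest (a . n)^2,
    remove its contribution from a, and repeat on the residual."""
    a = normalise(expression)
    # Constant occurrence vectors carry no signal and cannot be normalised.
    n = {m: normalise(v) for m, v in occurrences.items() if len(set(v)) > 1}
    remaining = set(n)
    ranked = []
    while remaining and (top is None or len(ranked) < top):
        best = max(sorted(remaining), key=lambda m: dot(a, n[m]) ** 2)
        f = dot(a, n[best])          # signed fit: > 0 activating, < 0 inhibitory
        a = [x - f * y for x, y in zip(a, n[best])]   # residual expression vector
        remaining.remove(best)
        ranked.append((best, f))
    return ranked


# Expression vector from the table slide; occurrence counts are made up.
a_slide = [0.5816, 0.2522, 0.2886, -0.5947, -0.1595, -0.3683]
toy_occurrences = {
    "ACG": [2, 1, 1, 0, 0, 0],
    "TTT": [0, 0, 0, 1, 0, 1],
    "GGC": [1, 0, 1, 0, 1, 0],
}
toy_ranking = rank_motifs(a_slide, toy_occurrences)
```

The sign of each fitted value is kept alongside the rank because squaring is used only to choose the winner.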

Table: Finding significant motifs

Uses a = (.5816, .2522, .2886, -.5947, -.1595, -.3683)
REDUCE parallelised with gMP...


Parallel motif frequency counter

Split set of ORFs equally

Distribute across available nodes

Each node calculates in parallel to get occurrence vectors
Matrix transposition

Occurrence vectors scattered across nodes

Advantageous to store each vector in single node



Transpose motif frequency matrix
Each node holds only its own ORFs, so it can calculate only a
fraction of each motif's occurrence vector
But the entire occurrence vector of a motif is needed
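
The split, count, and transpose steps can be sketched as below. This is an illustration only: Python threads stand in for gMP processing nodes, none of the names correspond to the gMP or gRNA APIs, and assigning motifs to nodes round-robin is an assumption.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_chunk(chunk, k):
    """Work of one node: k-mer counts for each ORF it was given (ORF-major rows)."""
    return [(orf, Counter(seq[i:i + k] for i in range(len(seq) - k + 1)))
            for orf, seq in chunk]


def transpose(rows_per_node, nodes):
    """Regroup ORF-major rows into motif-major vectors, each owned by one node."""
    rows = [row for part in rows_per_node for row in part]
    motifs = sorted(set().union(*(counts for _, counts in rows)))
    owned = [dict() for _ in range(nodes)]
    for j, motif in enumerate(motifs):
        owned[j % nodes][motif] = [counts[motif] for _, counts in rows]
    return owned


def parallel_occurrence_vectors(upstream, k, nodes):
    chunks = split_evenly(list(upstream.items()), nodes)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        rows_per_node = list(pool.map(count_chunk, chunks, [k] * nodes))
    return transpose(rows_per_node, nodes)


# Hypothetical toy input: three ORFs spread over two "nodes".
toy_parallel_input = {"ORF1": "ACGTACGT", "ORF2": "TTACGTT", "ORF3": "ACGACG"}
toy_owned = parallel_occurrence_vectors(toy_parallel_input, k=3, nodes=2)
```

After the transpose, every motif's complete occurrence vector lives on exactly one node, which is what the significant motif finder needs.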
...

Parallel significant motif finder

Normalises occurrence vectors within each node

At each node, most significant motif calculated

Global most significant motif calculated

Process iterated to rank occurrence vectors

Interface in gRNA allows ease of implementation
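
The local and global selection steps can be sketched as a reduction over nodes. As before, this is a hedged illustration: threads stand in for gMP nodes, a plain `max` stands in for a collective reduction, and the function names, normalisation, and residual update are assumptions rather than the gRNA interface. The expression vector is the one quoted on the table slide; the motif data is hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def normalise(v):
    """Centre a vector on zero mean and scale it to unit length."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def local_best(a, vectors, done):
    """One node: best (score, motif, fit) among the unranked motifs it owns."""
    best = None
    for motif in sorted(vectors):
        if motif in done:
            continue
        f = dot(a, vectors[motif])
        if best is None or f * f > best[0]:
            best = (f * f, motif, f)
    return best


def parallel_rank(expression, owned, top):
    a = normalise(expression)
    # Each node normalises only the vectors it owns; constant vectors are dropped.
    owned = [{m: normalise(v) for m, v in part.items() if len(set(v)) > 1}
             for part in owned]
    ranked, done = [], set()
    with ThreadPoolExecutor(max_workers=len(owned)) as pool:
        while len(ranked) < top:
            results = pool.map(local_best, [a] * len(owned), owned,
                               [done] * len(owned))
            candidates = [c for c in results if c is not None]
            if not candidates:
                break
            # Global reduction: the best of the per-node winners.
            _, motif, f = max(candidates, key=lambda c: c[0])
            # The owning node would broadcast the winning vector to all nodes.
            n = next(part[motif] for part in owned if motif in part)
            a = [x - f * y for x, y in zip(a, n)]
            ranked.append((motif, f))
            done.add(motif)
    return ranked


toy_nodes = [
    {"ACG": [2, 1, 1, 0, 0, 0], "GGC": [1, 0, 1, 0, 1, 0]},
    {"TTT": [0, 0, 0, 1, 0, 1]},
]
toy_expression = [0.5816, 0.2522, 0.2886, -0.5947, -0.1595, -0.3683]
toy_parallel_ranking = parallel_rank(toy_expression, toy_nodes, top=2)
```

Only one candidate per node and one winning vector per iteration cross node boundaries in this sketch, so communication per iteration is small compared with the local scoring work.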
Experiment

Use Compaq Alpha system

Consists of cluster of 8 AlphaServer SC/ES45

Connected by high-speed Alpha SC 16-Port switch and
ELAN PCI adapter cards.

Each server contains 4 Alpha EV68 processors
Results

Use 7090 gene expressions of yeast

ORFs of length 600

Motifs up to length 7

Throughput (in MBytes/s) also shown

20 most significant motifs computed.
Analysis

Runtime scales well with number of processing nodes

Frequency counter scales perfectly

Motif finder also scales

Cannot achieve perfect scaling because of communication
overhead.
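
The scaling claims above can be stated with the usual definitions of speedup and efficiency on p processing nodes. The runtime decomposition in the last expression is an assumed simple model for illustration, not a measurement from the experiment.

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
T(p) \approx \frac{T_{\mathrm{comp}}}{p} + T_{\mathrm{comm}}(p)
```

Perfect scaling means E(p) = 1. Under this model that would require T_comm(p) = 0, so any communication overhead pushes E(p) below 1.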
Related work

DiscoveryLink


Provides configurable wrappers as interfaces to multiple data
sources
Kleisli system



Systematically manages and integrates external databases
Uses functional query language to perform correlation
across databases
Toolkits designed with functionality for specialised
areas

BioJava, BioPerl, PAL


Sequence Analysis
Ensembl initiative, DAS

provide extensible approach to issue of annotating genomic
data
Related work
Previous approaches using Java for high performance
computing

Bindings into native message-passing APIs (e.g. MPI)


Do not allow easy integration into larger Java
applications
Pure Java message passing interfaces

JMPI, CCJ



Both implemented on top of Java RMI
– Slower than using raw sockets
CCJ tries to overcome this
– uses an optimised RMI implementation
– not portable
Neither can handle integration
Comparison
According to authors ...

gRNA distinguishes itself

Addresses the whole range of requirements for applications in
computational biology

Provides decoupled, yet inter-related subsystems

Ease of 3rd party implementation
Observations

REDUCE surpasses traditional clustering approach

REDUCE algorithm has high runtime


Complexity depends on the product of the number of possible motifs
and the number of genes.

Number of possible motifs grows exponentially with motif length

So length of motif is restricted
REDUCE algorithm is greedy


suboptimal
REDUCE is simplistic

lacks parameters for interactions between motifs

does not consider impact of other biological knowledge
...

Not clear that results of REDUCE are biologically
significant

Experiment does not effectively show how higher
computation power helps results

Analysis covers only 9 to 16 processors; is this sufficient to
determine ‘good scaling’?
Conclusions
Finally...

gRNA demonstrates efficient mechanism for
development of genome-centric applications
Further...

Extensions to REDUCE have been proposed

require higher computing power

more specialised programming interfaces required

Identifying communication patterns

Use of data structures e.g. sequences, trees, matrices