ResponseNet

advertisement
ResponseNet
Esti Yeger-Lotem, Laura Riva, Linhui Julie Su, Aaron D Gitler, Anil G Cashikar, Oliver D
King, Pavan K Auluck, Melissa L Geddie, Julie S Valastyan, David R Karger, Susan
Lindquist & Ernest Fraenkel
Bridging high-throughput genetic and transcriptional
data reveals cellular responses to alpha-synuclein
toxicity, Nature Genetics 2009
Presented by Shahar Zini
June 2013
Introduction
• Gene expression profiling alone reveals very little about the inner
workings of the cell and the mechanism involved.
• In the last semester we’ve seen many efforts to combine data from
gene expression and protein-protein/protein-DNA knowledge in order
to reveal more about the processes that occur within the cell.
• ResponseNet is a method which adds another high-throughput
experimental data into the mix: genetic screening.
Genetic screening
• A method which involves alteration or deletion of an organism’s gene
products and the examination of the phenotype of the resulting
organism.
• Phenotype: the composite of an organism’s observable characteristics
• The process:
• Pick-up an healthy organism
• Delete/alter/overexpress a gene or a group of genes
• Test the resulting organism’s phenotype against a desirable characteristic
(called “scoring”)
• For example: delete the MEP1 gene in a yeast cell and check for
impaired growth.
Genetic screening
• This process can be used as a high-throughput procedure, meaning
one could test alterations of 1000x of genes for numerous desired
characteristics.
• The set of genes found to affect a process is called “genetic hit”.
• Can reveal an answer to some questions like: “What genes, if
removed, will cause a yeast cell to die?”
• Just like gene expression: lots of data, yet, does not reveal a lot about
the procedures the took place inside the cell to cause the observed
phenotype.
Motivation
• Cells respond to stimuli by changes in various processes.
• Efforts to identify components of these responses using both methods
(gene expression and genetic screening) yields different sets of genes.
• Genetic screens tend to identify response regulators.
• Gene expression tend to identify metabolic responses.
• Combining the data together may reveal something deeper about the
processes of the cell.
• It is biologically plausible that some of the reported response regulators
will be connected through regulatory pathways to differently expressed
genes.
ResponseNet
• Identify molecular interaction paths connecting genetic hits and
differentially expressed genes.
• Create a graph containing 2 types of nodes: genes and proteins.
• Genetic hits represented as their corresponding protein
• Fill the graph with (mostly bi-directional) edges connecting protein to
another protein if they have a known interaction
• Fill the graph with directional edges connecting a protein node to a
gene node if the protein is probable transcriptional regulator of the
gene.
Genetic hit
Differentially expressed gene
Protein
Stimulusrelated
proteins
stimulusrelated
genes
intermediary
proteins
TFs
The signaling and regulatory sub-network, by which stimulus-related proteins
detected by genetic screens may lead to the measured transcriptomic response.
Min-cost flow
• Represent the problem as a flow problem.
• Each edge’s weight set to the negative log of it’s probability
• Try to pass flow from the genetic hits to the differently expressed
genes using regulatory proteins.
• Strong flow => Highly regulated path => Important component in the
process
• Rank nodes based on amount of flow to distinguish core component
of the cellular response and peripherals.
Source
Genetic hit
Differentially expressed gene
Protein
Sink
Weighting scheme – protein-protein
• There are K types of evidence for a protein-protein interaction
• For example, “Two-hybrid” connection can be tested using direct experiment
and using high-throughput experiment
• We trust the direct experiment more
• We are interested in finding regulatory pathways, and thus we would
like to emphasis interactions between proteins that function together
in a common response pathway.
Weighting scheme – protein-protein
• For each pair of proteins A and B we construct a vector X = (x1,x2…,xk)
• Xi is 1  There is evidence of type i that A and B are connected
• wab = P(a and b are connected | X)
• By bayes’ rule:
• = P(X | a and b are connected) * P(a and b are connected) / P(X)
• P(X) = P(X | a and b are connected) * P(a and b are connected)
+ P(X | a and b are not connected) * P(a and b are not connected)
• Assume independence between types of evidence:
• P(X | a and b are [not] connected) =
P(Xi| a and b are [not] connected)
Weighting scheme – protein-protein
• In order to calculate values using the equations derived we need to
calculate the prior distributions P(Xk | a and b are [not] connected)
• They used 3 sets of data:
• A set of response pathways
• A positive set containing all interacting protein pairs in a common response
pathway based on reliable GO annotation (direct assays or expert knowledge)
• A negative set of interacting protein pairs known not to be in a common
response pathway
• They capped the weight to 0.7 because some interactions received a
score close to 1 (probably well studied proteins)
Weighting scheme – protein-gene
• They used data obtained using “Chip-chip” and “Chip-chip motif”
methods.
• Those experiments are used to detect transcription factors for genes.
• The weighting scheme is irrelevant, but it was set so that the weight
of each edge would reflect the interaction’s reliability.
Flow problem details
• For each genetic hit, the weight and capacity of the edge from the source to the
node is set to:
Source
• Strength is determined by results of the experiment
• For each edge connecting a differentially expressed gene to the sink, set the
weight and capacity to:
Sink
• Strength is determined by fold-change or p-value of the experiment’s results
• All other capacities are set to 1 (and weight using the weighting scheme)
Linear programming
• Min-cost flow optimization can be solved efficiently using linear
programming
• The objective function used included a nice twist, so we would
investigate it further:
• The first term is what you would expect from min-cost-flow problem
• Reduce the sum of cost*flow in all edges
• The second term is used to control the size of the input connected by
flow
• Smaller values will result in an empty flow (any flow will increase the
cost)
• A value of 0 means you will only allow for perfect edges (weight=1)
• As the value increases, you give room to more and more
imperfection, and allow for lower-probabilities edges to be added
• Adding more edges => better coverage
• Meaning, we can adjust the coverage of the solution and so the
conflict between coverage and accuracy continues.
Results
• Validation of algorithm results on well-studied response pathways
• STE5 deletion
• DNA Damage
• Broader validation on less-studied response pathways
• Alpha-synuclein toxicity
STE5 deletion
• A protein that coordinates a MAPK cascade activated by pheromone
• Multi-tier pathway activated by the presence of pheromone
• Simple and well-studied response pathway
• Used a set of genetic hits associated with STE5 and differentially
expressed genes from a yeast strain lacking STE5
DNA Damage
• There are many mechanisms in the cell which identifies and fixes
damage created to the DNA.
• A much more complex pathway, but still highly-studied
• The input is very large:
• 91 genetic hits
• 9 differentially expressed genes (exposed to mutagen)
• Large output
Results – DNA Damage
• Contains 361 edges between 166 intermediate proteins
• Highly enriched for DNA damage stimulus (p<10-14) and DNA repair
(p<10-14)
Some of highly ranked parts of the network are proteins that
Have been uncovered by years of intense research. (MEC1,
RAD53, RFC2,RFC3,RFC4,RFC5, and RFX1)
MEC1, RAD53,RFC2,RFC3,RFC4 and RFC5 are essential genes
And thus could not have been detected via genetic screening
Broader experiments
• In order to evaluate the algorithm’s results better a broader
evaluation took place
• The writers used data from 101 different perturbation, each involved
the inactivation of a single gene
• In most cases, the response pathways are poorly studied
• The genetic hits contained the genetic interactors of the inactive gene
• The differentially expressed genes are based on microchip profiling of the
inactivated strain
Broader experiments
• In order to validate the results, the identity of the inactivated gene was
hidden from the algorithm
• The result of each test were considered successful if:
• They included the inactivated gene of the specific test and it was ranked significantly
well
• They were significantly enriched for a specific biological process attributed to the
specific gene
• Significance was tested by creating a randomized inputs for each test:
• 50 times the genetic hits inputs were randomized
• 50 times the differentially expressed genes were randomized
• A significant solution was one which was better (in ranking / enrichment level) than
95% of the randomized solutions
Broader experiments – results
• Finding the inactivated gene was successful in 25 of the tests
• In 8 of the tests the results contained the inactivated gene but it was not scored high
enough
• Finding enriched results where successful in 28 of the tests
• The success rate for cases based on incomplete genetic hits was 40% and
under full genetic screens it was 47% => An indication that the algorithm
works well even under very limited data sets
• A typical result covered only 1% of the yeast genes
• We can infer that both criteria were important
• Overall: 41% success (considered high because of the low quality data)
Alpha-synuclein toxicity
• A protein known to be a key component to Parkinson disease.
• Despite intense research, the cellular pathways leading to cell death are
just beginning to get uncovered.
• Genetic screen experiment detected 77 proteins
• No significant overlap between genetic hits and transcriptional profiling
results were found, meaning, we have lots of data, but no of to explain it
• Applying ResponseNet revealed a sub-network connecting 34 genetic hits
and 166 differentially expressed genes shading light on the inner processes
effecting alpha-synyclein.
• Some of the conclusions were tested in the lab, and confirmed.
• Real contribution to the understanding of the Parkinson disease!
Freedom of data
• The authors complained that many experiments are taking place
world-wide but only a few release source codes and complete data
sets on the internet
• So, they decided to open their algorithm and method to the world by
creating a web page that anyone can upload data into and receive
results in minutes
• One can use yeast data (as used in the article) or can upload it’s own
data representing the protein-protein and protein-gene relations in a
different organism
Any Questions?
Questions
• What is the motivation to connect genetic hits and differentially
expressed genes?
• Why do the authors capped the weight of each edge? What would
happen they would have skipped this stage?
• What would happen if we set the gamma parameter to value which is
too high?
Please send answers to my mail (shshzi at gmail)
Download