Data Quality is Bad? Deal With It

Dennis Shasha
New York University
Data Quality Problem – challenges
• Two companies merge or two divisions want
to share data. Problem: identify common
customers even though their names are
spelled differently (work with
Bellcore/Telcordia colleagues: Munir
Cochinwala, Verghese Kurien, and Gail Lalk)
• Real-time sensor network. Problem: sensors
fail; want to avoid false alarms (work with
physicist Alan Mincer and student Yunyue
Zhu)
My Approach
• Let’s look at fields that have dealt with data
quality problems for years, though they
consider these problems part of business as
usual.
• We will ask: what do these fields do and how
might that help us?
Data Quality Problem – biology
• Take two genetically identical plants, treat
them in the same way, and measure the RNA
expression levels. Get vastly different results.
• Differences increase if experiments are done in
different labs or by different people in the
same lab.
• Even breathing can be dangerous…
• Goal: find causal relationships among genes.
What Can One Do?
• One way to tease out causality is to perform a
time series experiment on closely spaced time
points.
• Want close spacing to be able to say gene
expression level at time t depends on gene
expression levels at t-1.
• Start with noise-free model.
Noise-free Modeling of Transcriptome Time Series Data
[Figure: gene expression of genes zi and zk over time points t, t+1, t+2, t+3, t+4. Red squares represent a transition function f to be learned, taking the TFs at one time point (e.g. t, t+3) to the TFs + targets at the next (t+1, t+4).]
Explain target gene expression as a function of up to 4 input TFs.
Krouk et al 2010 submitted [19]
Modeling Noise (poor quality)
• There is reason to believe that Gaussian noise
is a decent model of the inconsistencies in
biological replicates.
• So model the relationship between
observations and “true” value by a Gaussian
noise component.
• We’ll see whether this is a good idea or not.
(A) Transcriptome Data Set – time series
[Figure: time points at 0, 3, 6, 9, 12, 15 and 20 min.]
(B) Noisy model (black box is Gaussian noise)
[Figure: observation model g and dynamic model f relating the series Y(t), Y(t+1), …, Y(t+5) and Z(t), Z(t+1), …, Z(t+5).]
(C) Tests
• “Leave-out-last” test: training set 0, 3, 6, 9, 12, 15 min; predict direction of change of each gene @ 20 min: 71% correct.
• Naive “trend-forecast” test: training set 12, 15 min; predict direction of change of each gene @ 20 min: 51% correct.
Krouk et al 2010 submitted [19]
Test and Adaptation
• Test the model by predicting values at a time
point not used in the training.
• Predictions are not generally perfect, so
adaptation is to figure out which other time
points to test.
• One way to do this is to perform the training
and testing process with one fewer
experiment. If the most critical experiment is
at time t, then gather more data at time t.
Lessons from Network Inference
• The objective is predictive power.
• Use the training set to train the noise model and
learn the causal relationships among the genes.
• If predictions work out, then good.
• Modeling data quality is part of the learning
problem.
Physics -- supernovas
• Look at sky and observe showers of gamma
particles.
• Model the background as a Poisson process.
• Look for exceptionally high bursts (these can
last seconds, minutes, hours, up to days).
• Aim telescopes in the appropriate part of the
sky.
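The burst filter described above can be sketched in a few lines. This is only an illustration, not the detector used in the work cited: the function names, the per-step background `rate`, and the false-alarm level `alpha` are all assumptions made for the example. The idea is to pick the smallest count that the Poisson background would reach with probability at most `alpha`, then flag every sliding window whose total reaches it.

```python
import math

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), computed as 1 - P(X <= k - 1)."""
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= mean / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)

def burst_threshold(mean, alpha):
    """Smallest count k with P(X >= k) <= alpha under the background model."""
    k = 0
    while poisson_tail(k, mean) > alpha:
        k += 1
    return k

def find_bursts(counts, window, rate, alpha=1e-3):
    """Start indices of windows whose total count reaches the threshold.

    counts: events observed per time step (hypothetical input format).
    rate:   assumed background events per time step.
    """
    threshold = burst_threshold(rate * window, alpha)
    total = sum(counts[:window])
    hits = []
    for start in range(len(counts) - window + 1):
        if start > 0:
            # Slide the window: add the new step, drop the oldest one.
            total += counts[start + window - 1] - counts[start - 1]
        if total >= threshold:
            hits.append(start)
    return hits
```

Running `find_bursts` once per window length gives one filter per burst duration; a lower `alpha` trades missed bursts for fewer false alarms.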
Astrophysical Application
Motivation:
In astrophysics, the sky is constantly observed for high-energy particles.
When a particular astrophysical event happens, a shower of high-energy
particles arrives in addition to the background noise. An unusual event burst
may signal an event interesting to physicists.
Technical Overview:
1. The sky is partitioned into 1800*900 buckets.
2. 14 sliding window lengths are monitored, from 0.1s to 39.81s.
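The slide gives only the number of windows and the two endpoints. Those endpoints happen to be consistent with geometric spacing at ratio 10^0.2, since 0.1 * 10^(13/5) is about 39.81. That spacing is an inference from the numbers, not something the slide states, so the sketch below should be read as one plausible reconstruction.

```python
# Assumption: the 14 window lengths are geometrically spaced with ratio 10**0.2.
window_lengths = [0.1 * 10 ** (k / 5) for k in range(14)]

buckets = 1800 * 900                            # sky partition from the slide
monitored = buckets * len(window_lengths)       # (bucket, window) pairs to track

print([round(w, 2) for w in window_lengths])
print(buckets, monitored)
```

Under that assumption each window is about 1.58 times longer than the previous one, and every one of the 1,620,000 buckets is monitored at all 14 lengths.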
Physics -- adaptation
• A burst is only the first filter for detecting a
supernova.
• If certain kinds of bursts (e.g. 10 second long
bursts) lead to false positives often, then
adjust the thresholds.
Physics -- lessons
• Once again the noise model is an integral part
of the problem setting.
• Adaptation is ongoing (no fixed training set).
• Because physicists are looking for a single
piece of information, e.g. there is a supernova
at location X,Y, redundancy can overcome
noise.
Drug Testing
• Give N patients a drug and N patients a
placebo.
• This is a classic “data quality”/“biological
variation” situation. Different patients will
react differently to a drug and almost all
patients will benefit from a placebo.
• Two questions: is the drug better than the
placebo and how much?
Drug Testing -- Resampling
• Suppose you arrange the results in a table
(patient id, drug/placebo, improvement).
• Compute the average improvement for the
drug population and for the placebo population.
• Evaluate significance using a permutation test.
• Estimate the size of the benefit using
confidence intervals.
• Neither method requires an assumption about
the distribution.
Typical table
Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Drug
4                      Placebo
Drug improvement: 38/3; Placebo: 13/3
One Permutation of table
Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Placebo
4                      Drug
Drug improvement: 22/3; Placebo: 29/3
Significance Test – is the drug’s apparent effect due to luck?
• count = 0
• do 10,000 times
    permute the drug/placebo column
    recompute improvement under permutation
    if recomputed improvement >= measured improvement in real test then count += 1
• P-value = count/10,000: the estimated probability of seeing an
improvement at least as large as the measured one if the
drug/placebo labels made no difference.
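The loop above translates almost directly into Python. The sketch below is a minimal version under two assumptions that are not on the slide: "improvement" is taken to be the drug group's mean minus the placebo group's mean, and the function names are made up for the example. The data are the six rows of the typical table.

```python
import random

def mean(values):
    return sum(values) / len(values)

def improvement(scores, labels):
    """Mean score of the drug group minus mean score of the placebo group."""
    drug = [s for s, label in zip(scores, labels) if label == "drug"]
    placebo = [s for s, label in zip(scores, labels) if label == "placebo"]
    return mean(drug) - mean(placebo)

def permutation_p_value(scores, labels, trials=10_000, seed=0):
    """Fraction of label shuffles that do at least as well as the real labels."""
    rng = random.Random(seed)
    observed = improvement(scores, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # permute the drug/placebo column
        if improvement(scores, shuffled) >= observed:
            count += 1
    return count / trials

scores = [10, 12, 8, -3, 20, 4]
labels = ["drug", "placebo", "drug", "placebo", "drug", "placebo"]
p_value = permutation_p_value(scores, labels)
```

With only six patients there are 20 ways to choose the drug group, and 3 of them give the drug group a total of at least 38, so the estimate should land near 0.15. On this toy table the apparent benefit would not count as significant.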
Confidence interval – what’s a good estimate of the drug’s benefit?
• do 10,000 times
    take 2N elements from the original table with replacement
    compute improvement
• Sort the 10,000 improvement scores and take the 95% confidence
interval as the 250th score to the 9,750th score.
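A bootstrap along these lines might look as follows. As before, "improvement" is assumed to mean drug mean minus placebo mean, and the names are hypothetical. One detail the slide does not address is a resample that happens to contain no drug rows or no placebo rows; this sketch simply redraws in that case, which is a choice made for the example rather than part of the original recipe.

```python
import random

def mean(values):
    return sum(values) / len(values)

def bootstrap_interval(rows, trials=10_000, seed=0):
    """rows: list of (improvement, label) pairs. Returns a 95% interval (low, high)."""
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < trials:
        sample = rng.choices(rows, k=len(rows))  # 2N rows drawn with replacement
        drug = [s for s, label in sample if label == "drug"]
        placebo = [s for s, label in sample if label == "placebo"]
        if not drug or not placebo:
            continue  # assumption: redraw when one group is missing
        estimates.append(mean(drug) - mean(placebo))
    estimates.sort()
    # With 10,000 trials these are the 250th and 9,750th scores (1-based).
    low = estimates[max(round(0.025 * trials) - 1, 0)]
    high = estimates[max(round(0.975 * trials) - 1, 0)]
    return low, high

rows = [(10, "drug"), (12, "placebo"), (8, "drug"),
        (-3, "placebo"), (20, "drug"), (4, "placebo")]
low, high = bootstrap_interval(rows)
```

On a table this small the interval is wide, which is itself useful information: six patients are not enough to pin down the size of the benefit.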
Lessons from Drug Testing
• Assume different patients can react
differently.
• Is the drug benefit significant?
• How much of a benefit does it have?
• Lesson: questions are simple; individual noise
is overcome with redundancy.
Data Quality Problem – adversaries
• A farmer in the developing world wants to do
a banking transaction.
• The bank has appointed the shopkeeper the
bank agent. The shopkeeper will call the bank
over an insecure phone line.
• The farmer doesn’t know whether the
shopkeeper is truly honest and even whether
messages can be intercepted and mangled
(poor quality due to adversary).
Basic Solution
• Bank provides a collection of (essentially) one-time
nonces and one-time pads to both the farmer and
the shopkeeper ahead of time.
• Per transaction: the farmer and the shopkeeper
each send a one-time nonce and a message to the
bank listing the amount of the transaction.
• The bank verifies their identities via the
nonces, and the farmer and shopkeeper verify the
amounts via the one-time pads.
“Quality Issues” this Solves
• Replay is impossible because nonces are one-time.
• Mangling will be detected because of one-time pads.
• False confederates and hacking of telephone
network will be detected thanks to one-time
pads.
• Even a determined adversary can be
overcome. Never mind a little random noise.
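The replay claim is the easiest part to illustrate. The sketch below covers only the bank's bookkeeping for one party's nonces; it does not model the one-time-pad verification of amounts, and the class name, nonce format and batch size are all assumptions made for the example.

```python
import secrets

class NonceBook:
    """Bank-side record of the one-time nonces issued to one party."""

    def __init__(self, count=5):
        # Random, unguessable values generated ahead of time.
        self.unused = {secrets.token_hex(8) for _ in range(count)}

    def issue(self):
        """Copy handed to the farmer or shopkeeper before any transaction."""
        return list(self.unused)

    def accept(self, nonce):
        """True exactly once per issued nonce; replays and guesses are rejected."""
        if nonce in self.unused:
            self.unused.remove(nonce)
            return True
        return False

book = NonceBook()
farmer_nonces = book.issue()
first = book.accept(farmer_nonces[0])    # genuine transaction
replayed = book.accept(farmer_nonces[0]) # same message sent again
```

Because the bank discards each nonce on first use, a recorded message is worthless to an eavesdropper, and the same check also catches an accidental duplicate delivery.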
Application – record matching
• Develop a noise model: how are sounds
misheard and how are symbols mistyped?
• Develop training set having correct outcomes
but also metadata properties (e.g. who took
the information and when was it taken) in
case noise characteristics/probabilities
depend on that.
• Model cost of errors vs. cost to clean.
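As a starting point for the "symbols are mistyped" part of the noise model, plain edit distance is a common stand-in. It is not the method used in the record-matching work mentioned earlier, and it treats every typo as equally likely; a noise model learned from a training set would replace the unit costs with per-symbol probabilities. The threshold of 2 and the sample names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def likely_same(name_a, name_b, max_distance=2):
    """Crude match rule: case-insensitive names within a few typos of each other."""
    return edit_distance(name_a.lower(), name_b.lower()) <= max_distance

match = likely_same("Jon Smith", "John Smyth")
```

Raising `max_distance` finds more true matches but also merges more distinct customers, which is where the cost of errors versus the cost to clean comes in.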
Application – sensor reading
• Be conscious of what the goals of the sensor
are, e.g. fire/no fire; earthquake/no
earthquake.
• Use burst detection to locate possibly
troublesome sensors in quiet times.
• Error model is key: could there be an
adversary? Can you use non-parametric stats?
Lessons
• Data quality problems (i.e. noise or adversarial
attacks) are an everyday occurrence in many
fields.
• First lesson: model the amount of noise and
design the system to answer the critical question
(e.g. what is the causal network, is the drug
effective, where is the supernova) in spite of noise.
More Lessons
• Second lesson: If you can design for an
adversary, then you get noise correction for free.
• Third lesson: Use the metadata to localize
bursts of errors so that you can shut down
the source of the noise.