Data Science:
More Than
Just Algorithms
A discussion with
Alfred Spector,
Peter Norvig,
Chris Wiggins,
Jeannette Wing,
Ben Fried, and
Michael Tingley
Dramatic advances in the ability to gather, store, and process data have led to the rapid growth of data science and its mushrooming impact on nearly all aspects of the economy and society. Data science has also had a huge effect on academic disciplines, with new research agendas, new degrees, and new organizational entities.
Recognizing the complexity and impact of the field,
Alfred Spector, Peter Norvig, Chris Wiggins, and Jeannette
Wing have completed a new textbook on data science,
Data Science in Context: Foundations, Challenges,
Opportunities, published in October 2022.6 With deep and
diverse experience in both research and practice, across
academia, government, and industry, the authors present a
holistic view of what is needed to apply data science well.
Ben Fried, a venture partner at Rally Ventures and
formerly Google’s CIO for 14 years, and Michael Tingley,
a software engineering manager at Meta, gathered the
authors together as they were finishing up the manuscript
to discuss the motivation for their work and some of its key
points.
Norvig is a Distinguished Education Fellow at Stanford
HAI (Human-centered Artificial Intelligence) and a
research director at Google; Spector is a visiting scholar
at MIT with previous positions leading engineering and
research organizations; Wiggins is an associate professor
of applied mathematics at Columbia University; and
Jeannette Wing is executive vice president for research
and professor of computer science at Columbia University.
(More biographic detail on the panelists is available at the
conclusion of this article.)
Ben Fried You’ve come to data science from very different
backgrounds. Was there a shared inspiration to write the
book?
Alfred Spector In one way or another, I think we all saw a
deep and growing polarity in data science. On the one hand,
it has enormous, unprecedented power for positive impact,
which we’d each been lucky enough to contribute to; on
the other hand, we had seen serious downsides emerge
even with the best of intentions, often for reasons having
little to do with the technical skills of the practitioner.
There are many excellent texts and courses on the science and engineering of the field, but something in the headlines nearly every day demonstrates an urgent need to educate practitioners on what you, Ben, have called the “extrinsics” of the field.
Peter Norvig Throughout the rapid growth in applications
of data science, there have been serious issues to
confront: click-fraud, the early Google bombs, data leaks,
abusive manipulation of applications, amplification of
misinformation, overinterpretation of correlations, and so
many more—all things we read about daily. Some problems
are more serious than others, but we feel education
will help us to lessen their frequency and severity,
while simultaneously allowing us to understand their
significance.
BF Why the word Context in the title of your book?
Chris Wiggins It was our primary motivator. In a nutshell,
we wanted to provide some inclusive “context” for the
data-science discipline. We felt the term data science is
often used too narrowly.
AS We think of context in three ways.
It refers to the topics beyond just the data and the
model. These include dependability, clarity of objectives,
interpretability, and other things I’m sure we’ll get into.
It also refers to the domain in which data science is
being applied. What is crucial for certain applications isn’t
needed for others. Teams practicing data science must be
particularly sensitive to the uses to which their work will
be put.
Finally, context refers to the societal views and norms
that govern the acceptance of data-science results. Just as
we have seen changing views and norms regarding privacy
and fairness, data science will increasingly be expected
to solve challenging problems, where societal views vary
by region and over time. Some of these problems are
“wicked,” in C. West Churchman’s2 language, and they are
so very different from the problems that computing first
addressed.
Jeannette Wing While data science draws from the
disciplines of computer science, statistics, and operations
research to provide methods, tools, and techniques we can
apply, what we do will vary according to whether we’re
working on a healthcare issue, something related to
autonomous driving, or perhaps exploring some particular
aspect of climate change. Just as each discipline comes
with its own constraints, the same might be said of each
of these different problem domains. Which is why the
application of data science is largely defined by the nature
of the problem we’re looking to solve or the task we’re
trying to complete.
PN Beyond this, I personally wanted to reach a broader
audience than I had with my more mathematical and
algorithmic textbook. To do data science, we need to know
many techniques, but we also need to be conversant with
larger, societal issues. We all shared this motivation.
BF All this leads to the question of how you define data
science.
JW By the time Alfred and I first started talking about
working on a book, I was already writing papers and
giving talks where I defined data science as the study
of “extracting value from data.” But we agreed that this
definition was too high level and insufficiently operational.
AS So, we started with “extracting value from data,”
then added prose to address the two personalities of the
field—one where data is used to provide insight to people
(as in many uses of statistics) and the other having to do
with data science’s ability to enable programs to reach
conclusions.
CW We also recognized we needed a capacious definition
[see sidebar] to respect what people are doing in the name
of data science within industry and academia, as well as
the rapidity of change in the field.
Definition of Data Science
Data science is the study of extracting value from data—value in the form of insights or conclusions.
A data-derived insight could be:
- An hypothesis, testable with more data.
- An “aha!” that comes from a succinct statistic or an apt visual chart.
- A plausible relationship among variables of interest, uncovered by examining the data and the implications of different scenarios.
A conclusion could be in an analyst’s head or in a computer program. To be useful, a conclusion should lead us to make good decisions about how to act in the world, with those actions taken either automatically by a program or by a human who consults with the program. A conclusion may be in the form of a:
- Prediction of a consequence.
- Recommendation of a useful action.
- Clustering that groups similar elements.
- Classification that labels elements in groupings.
- Transformation that converts data to a more useful form.
- Optimization that moves a system to a better state.
Taken from Data Science in Context: Foundations, Challenges, Opportunities.6

BF It’s a very fluid definition. Not only does data science mean different things to different people, it also has fuzzy boundaries.
CW Exactly! We’re at that
time in the creation of a new
field where it does have
fuzzy boundaries. It touches
on many different subjects:
privacy/security, resilience,
public policy, ethics, etc.
But it’s also clearly taking
form with the creation
of job titles, degrees, and
departments. We saw an
opportunity to take a stab
at defining its breadth—
starting with the diverse
challenges its practitioners
must overcome.
Michael Tingley Do you
make a distinction between
data science and machine
learning?
AS As a domain, data science
is broader than machine
learning, in that machine
learning is only one of the
techniques it employs. Data
science encompasses many
techniques from statistics,
operations research,
visualization, and many
more areas: in fact, all the things needed to bring insights
and conclusions to a worthwhile end. That being said, the
revolutionary growth in machine learning has absolutely
catalyzed the most change: incredible successes but some
challenges too.
PN One difference is that, in the machine-learning arena,
a researcher’s focus might be to write a paper that
touts some new algorithm or some tweak to an existing
algorithm. Whereas, in the data-science sphere, research
is more likely to talk about a new dataset and how to apply
a collection of techniques to use it.
BF So, you were motivated by the breadth of challenges we
face. Where did you end up? Are there approaches that can
help?
PN After lots of give and take, we came up with something
we call an analysis rubric, where we enumerate the
elements a data scientist needs to take into account.
As Atul Gawande writes in The Checklist Manifesto,3
checklists such as our rubric make for better solutions,
and we hope ours might help people avoid some of the
mistakes we have made in past projects. But
because each project is different, it’s hard to come up with one checklist that will work across all of them, so we’ll see how well it holds up to the test of time.

Analysis Rubric
- Tractable data
- Technical approach
- Dependability
- Understandability
- Clear objectives
- Tolerance of failures
- Ethical, legal, societal implications

AS Let’s be specific. The analysis rubric addresses the challenges in seven categories. Some relate more to how we implement or apply data science. The others relate more to the requirements we are trying to satisfy.
PN The rubric starts with data: getting and
storing it, wrangling it into a useful form, ensuring privacy,
ensuring integrity and consistency, managing sharing and
deletion, etc. In some ways, this may be the hardest part of
a data-science project.
For me, the first big revelation of data science was that
data can be a key asset that offers real value.4 But, the
second revelation was that data can be a liability if you’re
not a good shepherd for it.
BF Are there hidden costs to holding onto data?
PN I’ve learned something in this regard from all the
efforts that have been made in recent years to advance
federated learning. In earlier days, if a team wanted to build a better speech-recognition system, it would import all the data into one location and then run and optimize a model there until it had something it could launch to users. But then that would have meant holding onto all these people’s private conversations, with concomitant risks. As a field, we decided it would be best not to hold onto that information but instead to optimize on each person’s data privately, while figuring out some clever way to share those individually made optimizations across multiple people in a federated-learning framework. This federated
approach seems to be working out pretty well. The privacy
concerns have ended up leading to a pretty good scientific
advancement.
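To make the mechanics concrete, here is a minimal sketch of federated averaging in the spirit Norvig describes. It is illustrative only: the linear model, synthetic client data, and single local gradient step per round are stand-ins, not the design of any production speech system.

```python
# Minimal federated-averaging sketch (illustrative assumptions throughout):
# each client computes a model update on its own data locally; only the
# updated parameters -- never the raw data -- are shared and averaged.
import numpy as np

def local_update(global_weights, x, y, lr=0.1):
    """One local gradient step on a client's private (x, y) data."""
    preds = x @ global_weights
    grad = x.T @ (preds - y) / len(y)
    return global_weights - lr * grad

def federated_round(global_weights, clients):
    """Average locally updated weights; raw client data never leaves the client."""
    local = [local_update(global_weights, x, y) for x, y in clients]
    return np.mean(local, axis=0)

# Toy usage: three "devices," each holding private data drawn from the same model.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):
    x = rng.normal(size=(20, 3))
    clients.append((x, x @ true_w + 0.1 * rng.normal(size=20)))

weights = np.zeros(3)
for _ in range(50):
    weights = federated_round(weights, clients)
print(weights)  # close to true_w, without ever pooling the clients' raw data
```

The point to notice is that federated_round sees only model parameters, so the privacy-sensitive inputs never need to be centralized.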
AS Our second rubric element is the most obvious. There
needs to be a technical approach, which can come from
machine learning, statistics, operations research, or
visualization. This offers a way to provide valuable insight
and conclusions, whether prediction, recommendation, or
the others.
It isn’t easy to find a model in some situations.
Sometimes there is just too much inherent uncertainty, and
other times the world may continually change and make
modeling efforts ineffective. Some situations are game-theoretic, and a model’s conclusions themselves generate
feedback that makes the world less predictable.
One example of the limitations of modeling has been the effort to predict what might happen due to Covid-19. For many
reasons relating to limitations of data, rapidly changing
policy, variations in human behavior, and virus mutations,
the ability to make long-term predictions of mortality has
been poor.
BF Are you saying data science didn’t help at all in the war
on Covid?
PN I was involved in a project with an intern and some
statisticians at UC Berkeley where we were trying to give
hospitals advance notice of how many staffers they would
need to bring in three days ahead of time. We couldn’t give
them accurate predictions 30 days in advance, but we
could do useful short-term predictions.
JW And for sure, data science was applied successfully
in many other areas, most obviously in the vaccine and
therapeutics trials.
BF We could devote our whole time to models, but given
the topic’s broad coverage, let’s move to the next rubric
element: dependability.
JW With data science being used in ever more important
ways, dependability is of increasing importance, and we
include four subtopics under it: Are the privacy implications
of data collection, storage, and use acceptable? Are the
security ramifications for the application acceptable,
given the likelihood that attacks may release data or
impair an application’s correctness or availability? Is a
system resilient in the face of a world that is continually
changing and with modeling techniques we may not fully
understand? Finally, is the resulting system sufficiently
resistant to the abuse that has savaged so many
applications?
CW We should note the tensions within the dependability
components. The push for privacy versus the need to
provide security is an example. End-to-end encryption
would reduce risks to privacy and keep providers from
seeing private messages, but it would also limit platforms’
abilities both to respond to law enforcement requests and
to perform content moderation. There definitely are some
unresolved tensions here.
MT Getting privacy, security, resilience, and abuse
resistance right is a good start and a formidable challenge
in itself. Is that enough to allow people to trust the
applications of data science?
AS It’s probably not enough. Developers, scientists, and users must have sufficient understanding of data-science applications, particularly in increasingly sensitive
situations. The general public and policymakers also need
to have more understanding, given the pervasive impact.
This leads to the rubric topic of understandability,
which has three categories: Must a model’s conclusions
be interpretable—that is, should the application be able
to explain “why?” Must conclusions prove causality, or is
correlation sufficient? And must data-science applications,
particularly in the realms of science and policy, make their
data and models available to others so they can test for
reproducibility?
Where data science is employed in research, the
tradition is that others must be able to reproduce work so
they can test and validate it. This is very hard to accomplish
when we’re dealing with massive volumes of data and
complex models.
PN Understandability has been particularly hard with
machine learning, but contemporary research is making
progress—for example, with visualization and what-if
analysis tools. While causality is difficult to show with only
retrospective data, the causal inference work from the
statistics community can reduce the amount of additional
experimentation needed to demonstrate it.
AS Here’s a real-world example from about 10 years ago
when I was at Google. Some argued it might be better
for societies to measure and then maximize happiness
rather than, say, per capita GDP (gross domestic product).
Catalyzing this interest, perhaps, was Bhutan’s then-recently introduced gross national happiness metric.
Some believed that Google could glean a happiness score
from the collective searches of a population. Before we
proceeded too far, we realized there was a big gotcha:
The score would be so influential that Google would
need to explain to the public how it was calculated. If the
mechanism were fully explained, however, people would
want to abuse it—and render it invalid. While there was
data and (likely) a model, understandability—and then
dependability—concerns eventually torpedoed the effort.
MT This naturally leads to the question of setting precise
goals. Are the objectives of the system an immutable,
external property, or is there also some emergent
property in how the system or its context evolves?
AS The next rubric element relates to having clear
objectives. Do we really know what we’re trying to
achieve? Requirements analysis has always been needed
in complex systems, but many uses of data science are
extremely challenging. They require the balancing of
near- and long-term objectives, the needs of different
stakeholders, etc. There may not even be societal
consensus on what we should achieve. For example, how
much fun—or how addictive—should a video game be?
Which recommendations to a user are beneficial versus
which might prove distracting in the wrong situations? Are
some downright harmful?
As already mentioned, a society’s norms may change
over time. It’s hard to anticipate everything, but we should
try to think about the downside risks posed by aspects of a
particular design. We advocate that these risks be made as
explicit as possible.
CW Beyond that, we need to be prepared to monitor the
way a data product is used and to mitigate its harms. A
video-game maker years ago may not have anticipated
that some people now would consider their product to be
addictive for young children. Mitigating harms, in this case,
may mean design changes that prevent or lessen extended
play or other signs of addictive behavior. Even then, not
everyone in the company that made the game might agree
this is a problem. A company committed to ethical data
products, however, takes this seriously.
AS An objectives-related topic unto itself is the incentive
structure that data science makes feasible. Given the
ability to measure and optimize almost anything, are we
optimizing the right things? Which incentives should be
built into systems to guide individuals, organizations, and
governments in the best way?
BF Where does fairness come into this? It’s critically
important and very complex. Is there even agreement on
what’s fair and what isn’t? Won’t those opinions change
over time?
AS Fairness is addressed in two ways in our rubric. First,
it’s an implementation-oriented topic: Data collection and
models need to be built and indeed tested to be sure they
work well, not just on average but for subpopulations.
Societal priorities proscribe conclusions that are reached
based on subgroups’ protected attributes.
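As one concrete illustration of testing beyond the average case, the following sketch (toy data, hypothetical group labels) reports a model’s error rate per subgroup rather than only in aggregate.

```python
# Illustrative only: a model's aggregate error can look acceptable while one
# subgroup is served badly; evaluating per group makes the gap visible.
import pandas as pd

df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,        # hypothetical subgroup labels
    "label": [1, 0, 1, 0, 1, 1, 1, 1],     # ground truth
})
preds = [1, 0, 1, 0, 0, 0, 1, 1]           # model predictions

overall_error = (df["label"] != preds).mean()                      # 0.25 overall
per_group_error = (df.assign(err=(df["label"] != preds))
                     .groupby("group")["err"].mean())              # a: 0.0, b: 0.5
print(overall_error)
print(per_group_error)
```

A large gap between subgroups is a signal to revisit the data and the model even when the aggregate number looks fine.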
JW On top of the typical software engineering challenge
of making sure the model is working properly, we need to
pay great attention to training data. This is pretty new for
software engineers.
AS I like to say that when systems learn from data, “the
past may imprison the future,” thereby perpetuating
unwanted behaviors.
Beyond these data and implementation challenges,
the second fairness challenge is in goal-setting. There are
complex ethical, political, and economic considerations
about what constitutes fairness.
CW Ultimately, this comes down to the objective of trying
to gain value, which is a key word in our data-science
definition, since it comes with both an objective meaning
and a subjective meaning. That is, beyond whatever
mathematical value we’re trying to calculate or optimize,
there’s what we or our society may value. In part, I think
this speaks to the fact we’re now making data-science
applications that have more and more impact on society.
Going back to context, you have to think about what
constitutes a success, and that can be complicated.
As Alfred has observed, this involves deciding on the
goal or objective function we’re trying to optimize while
acknowledging what we are omitting. It’s very hard to
consider all the possible edge cases and human impacts of
some data-science applications.
JW On a related topic, in our next rubric element we
examine whether the data-science application is innately
failure-tolerant, given that the objectives a system meets
may not be perfectly defined, and they may be achieved
only with some stochastic probability. Self-driving
cars, for example, aren’t particularly failure-tolerant,
whereas advertising would seem more so. But even some
advertising applications of data science can be intolerant
of failures; for example, it’s important to identify foreign
sources of election advertising revenue and to abide by
regulations governing certain products.
BF What about the last rubric element?
CW With data-science applications affecting individuals
and societies, they must take into account ethics, as well
as a growing body of regulations. These are covered in the
ethical, legal, and societal implications element
(shown in table 1).
AS Indeed, the body of laws governing many data-science
uses is already quite large. Furthermore, there are broad
societal implications; for example, data science almost
certainly is altering the employment landscape and having
effects on societal governance.
MT As a practitioner, I think it’s wonderful to have some
guiding principles like the rubric to think about. In practice,
however, it’s sometimes difficult to anticipate these issues
up front and perform risk assessments or even guess at
some of the longer-term outcomes. For example, thinking
about all the potential ethical implications of something
before you even know where your investigation might lead
is really challenging.

TABLE 1: Illustration of the Analysis Rubric Elements
Implementation-Oriented Elements
- Tractable Data
- Technical Approach
- Dependability: Privacy, Security, Abuse-resistance, Resilience
Requirements-Oriented Elements
- Understandability: Explanation, Causality, Reproducibility
- Clear Objectives
- Toleration of Failures
- Ethical, Legal, Societal Considerations: Legal, Societal, Ethical
Taken from Data Science in Context: Foundations, Challenges, Opportunities.6
My question is: To what extent do we as practitioners
bear responsibility for exhaustively analyzing and
estimating these sorts of issues in advance? Isn’t it
inevitable that much of this work is going to end up being
guided by retrospective analysis once we’ve figured out
where we’ve landed?
AS Compounding the challenge you raise, the world might
change just because of the launch, meaning the very
existence of a data-science application changes the ground
rules that guided its development. As an example, the
world may become dependent on some application, which
would result in increased dependability requirements.
CW Then there’s also the matter of maintaining and
monitoring a data product. It’s not possible to know in
advance what all the possible failure modes are before a
launch, but there are plenty of opportunities to maintain
and monitor a product as the world changes and potential
harms are made clear.
JW We hope practitioners will end up using the analysis
rubric as a checklist during many stages of a project. Some
things ought to be easy enough to consider before building
a model, but then further assessment will also be required
after the model is built. With data science, it’s even less
likely that you’ll be able to anticipate everything in advance
than it is with more traditional software.
AS This emphasizes the role for product managers, who
are tasked with looking at a project broadly. Their role
becomes all the more critical as projects come to be
less dominated by technology. In fact, if you talk to many
product managers today, you’ll hear them say things
like, “Our engineers started on this effort, particularly
the machine learning, and they did a lot of work without
pausing to think about all the other challenges they
were likely to encounter. And I really wish they’d talked
about that earlier because it would have saved us a lot
of rework.” That being said, as Chris intimated, we don’t
think everything should be approached with a waterfall
methodology. There’s plenty of interaction and adaptation
required.
BF Let’s spend some more time on your work on ethics.
JW While we could have kept the discussion of ethics
implicit in the other rubric elements, such as our
discussions of how to set good and fair objectives, Chris
and I, in particular, wanted to focus on ethics explicitly. We
decided to start with the Belmont principles5 as a basis
and see how far they would take us. I’d say they’ve actually
stood up pretty well so far.
BF What are the Belmont principles, and how do you apply
them?
CW The Belmont principles were effectively an attempt
to create a U.S. government specification for ethics.
In response to serious ethical breaches in taxpayer-funded research, Congress in the 1970s created a diverse
commission of philosophers, lawyers, policymakers,
and researchers to figure out what qualifies as ethical
research on human subjects. After years of discussion,
the commission announced that its focus would turn to
articulating a set of principles that would at least provide
a common vocabulary for people who attempt to make
a good-faith adjudication as to what qualifies as ethical
behavior. The principles themselves are:
- Respect for persons, ensuring the freedom of individuals to act autonomously based on their own considered deliberation and judgments.
- Beneficence, that researchers should maximize benefits and balance them against risks.
- Justice, the consideration of how risks and benefits are distributed, including the notion of a fair distribution.
These principles were ultimately released by the
U.S. government in 1978, and they’ve since been used
as a requirement in some federal funding decisions.
One exploration in our book is how these principles
remain useful for thinking through ethical decisions that
researchers and organizations must make in data-science
research and in developing data products.
BF Are there any contemporary examples of how the
Belmont principles are being applied?
AS Perhaps the intense discussion of Covid-19 vaccination
for young children is illustrative of the give and take. While
it’s currently believed that vaccinating a young child may
be of only modest benefit to the child, we have hoped that
having fewer infectious children may reduce Covid-19 in
elders with whom the child comes in contact.
This pretty explicitly shows the trade-offs: Respect
for persons might argue we would not seek to vaccinate
the child, since the vaccine is of unclear benefit and the
child may be too young to provide informed consent. On
the other hand, the principle of beneficence might win
the day, given the potential for saving the lives of many
grandparents. In a perfect world, this would be informed by
good statistics.
In any case, it illustrates the sorts of challenges
policymakers and parents face. We all believe that
the explicit give and take of the Belmont principles in
such situations ultimately provides better, and more
transparent, decisions.
BF Do you have an example more related to data science?
AS Earlier in the discussion, Jeannette noted that self-driving cars are not naturally failure-tolerant. Interesting
ethical questions—as well as some practical ones—come
up around this since it’s unlikely a self-driving car will ever
be 100 percent safe in all circumstances. We’ll face the
question of what constitutes an acceptable failure rate
as the technology gets closer to mass adoption. That is,
how much risk are we willing to accept? Auto accidents
currently account for around 40,000 deaths per year in
the U.S. alone, but if perfection is required, we probably
won’t ever be able to deploy the technology.
PN We’re quite inconsistent as a society when it comes
to what we’ll accept and what we won’t accept. While the
debate over self-driving cars continues to rage on, I happen
to know some people who are working on self-flying cars.
I find it perplexing that as a society, we have apparently
decided that having the 40,000 road deaths a year is OK,
while the number of air-travel deaths ought to be zero.
Accordingly, the legal requirements imposed by the FAA
are far more stringent than those applied to road travel.
And we need to wonder if that’s really a rational choice for
how to run our society or whether we should instead be
looking to make some different tradeoffs.
BF The sphere of ethics is inherently qualitative, whereas
computing is a highly quantitative practice. I’ve witnessed
discussions that diminish qualitative standards because
they can’t be measured and have no objective function.
Given that, are you worried about uptake of these
principles?
CW In my experience, software engineers love to talk
about design principles. In fact, Alfred mentioned the
waterfall model, yet design methodologies are pretty
qualitative. Engineers are already dealing with principles
that get debated regularly—and changed with some
frequency.
BF Are the Belmont principles sufficient for any ethical
question?
AS While we focus on the Belmont principles, we also
acknowledge that individual and organizational decision-making will take other frameworks into account. I call out
three:
First, there are professional ethics, like the ACM code
of ethics.1 Truthfulness, capability, and integrity must be a
given as we apply data science.
Second, certain situations have different ethical
standards. The war in Ukraine has made stark for us the
laws of war, so-called jus in bello, and their implications.
Third, decisions are made in an economic framework,
where the economic system exists to channel energy,
competition, and self-interest into benefits for individuals
and society.
CW We want to remind everyone that it’s not enough to
have principles. Each individual and organization applying
data science needs to come up with organizational
structures and approaches to incorporate them into their
process.
JW The academic community is taking this seriously. We
saw an opportunity to put a stake in the ground by telling
students, “If you want to be a data scientist, you’re going
to learn about ethics along with all this quantitative
stuff.” The Academic Data Science Alliance, which began
a few years ago, emphasizes ethics sufficiently that I
believe ethics courses are now integral to most academic
programs in the discipline. I’m very encouraged by this: even as data science is only beginning to emerge in academia, we’re already incorporating these qualitative ethical principles as an integral part of the field.
PN This is just part of being in a field that’s finally growing
up. When the work you’re doing is only theoretical or
academic, then you go ahead and publish your papers and
it really doesn’t matter. But once that field starts to make
a genuine impact on the world, you suddenly find you have
some serious ethical responsibilities.
BF Looking at the other side of the coin, should an
understanding of data science inform a liberal arts
education that includes some exposure to ethics?
CW Having taught a class on the history and ethics
of data,7 I can tell you that humanities students show
a tremendous interest in learning about it. And our
engineering students even demand that we focus on the
ethical aspects. You can imagine people who would like
the topic to be taught as if it dwelled solely in the Platonic
realm of pure thought. You can also imagine there are
other people who would want us to focus more on the
very applied and perhaps even product-driven aspects of
the topic. I’ve found it useful to teach things historically to
provide a structure to these different interests.
PN While it’s important to raise these issues and to
have general principles, it’s also important to have case
law based on real-world examples. That is, in our legal
system we have laws that people take great care to write
as clearly as they can, but they can’t anticipate all the
possibilities that might surface later. We supplement the
laws with case law.
It’s one thing to say that privacy and personhood are
important rights. But then how does that apply to the use
of surveillance cameras? You can’t really answer that just
from general principles. You need to get more precise by
specifying the types of uses that are approved and those
that aren’t. Principles are a good starting point, but we
also need the specificity that examples offer.
BF Now I have an engineering question for you: Is scale
inherent to data science?
PN Yes. If it hadn’t been for big data, we wouldn’t today be
talking about data science as a separate field. Instead, it
would still be part of statistics. While the folks in statistics
were focused on whether you needed 30 or 40 samples
to achieve statistical significance, there were some other
people who were saying, “Well, we’ve got a billion samples,
so we’re not going to worry about that. Instead, we’ve got
a few other problems and we’re going to focus on them.”
Those issues became the focus of the new field.
JW However, we can do plenty of data science at a smaller
scale with what some people call “artisanal data” or
“precious data.” There are plenty of challenges to contend
with in that space since it often involves working with
combined datasets, which means dealing with all the issues
that go along with heterogeneous data. So, we still have
some fundamental scientific and mathematical questions
to address, whether we’re working with big data or
heterogeneous small data.
AS A side effect of all this data is that we all are regularly
confronted with both meaningful—and not-so-meaningful—
details that are hard to put into context. Considered within
our understandability rubric element, the sheer volume
of data and conclusions we get every day is difficult for
even experts to understand. In particular, we are often
presented with correlations whose meanings are not as
far-reaching or conclusive as we are often led to believe.
All of the technology for capturing, storing, and locating
data makes it far easier to cherry-pick data and use it out
of context to advance erroneous points of view.
PN Also, whenever data is derived from human interactions
with various systems, there is a challenge to determine
how much of it is trustworthy. For example, if you’re
working with a lot of data that comes from observations
of what people are clicking on, it might be tempting to
assume they’re clicking on things they’re truly interested
in. We humans have our frailties and biases—meaning our
actions don’t always reflect our own best interests. We
also have lapses, in the sense that people click on things without
meaning to. It’s important to understand those limitations
in order to interpret the data better.
BF Given all of this, what concerns should we have about
how data science allows us to derive answers and benefits
based on user interactions, especially given how they can
change over time without the creator of the model being
aware?
PN This certainly presents a big challenge. We need to
recognize we’re in a game-theory situation where, when
you make a move, other people are going to make a move
in response, whether they’re spammers or legitimate
participants in the ecosystem. This sort of runs counter
to big data since, even if you’ve got millions of clicks, you
won’t have any clicks for what happens after you reach and
disseminate a conclusion.
You don’t know how people are going to change their
strategies. You have no data on that whatsoever. There’s
this tension between the things for which you can measure
everything and know exactly what’s going on and the
things in the future that may end up messing with your
normal business model in unknown ways. Then there’s also
the possibility you’ll have changed the ecosystem in ways
you don’t understand.
AS This applies to finance as well, of course. If you’re
applying algorithmic approaches to buying and selling and
your activities are having an impact on the market, you
can’t be certain exactly what effect your purchase or sale
might have.
BF Which is why analyses based on historical data have
flaws. “Past performance may not be indicative of future
results,” as all the brokerage houses are quick to remind
you.
CW If I can inject one broader aspect of scale, it also has an
ethical valence. Big systems that operate at scale can have
a far-ranging, global impact.
JW From the engineering perspective, scientists have
their own concerns. Often, they are working with massive
amounts of data from sophisticated instruments such as
the IceCube Neutrino Observatory in Antarctica or the
James Webb Space Telescope. And, from what my scientist
colleagues tell me, they need new techniques for storing,
preserving, and analyzing data.
MT What about the software engineering of data science?
AS It’s hard to build quality software under even the very
best of circumstances. Data science adds a new level
of challenge, because we are now using modules that
are learning from data, and they may work well in some
contexts and not in others. We may have confidence they
are likely to work well for an average case, but we don’t
know exactly how well they work for certain inputs, and,
again, we don’t know how well they will work over time.
JW Having once been involved in the formal verification
community, let me restate what Alfred said more
formally. To show that a program was doing the right
thing, we would use a very strong theorem—for all x, P(x)—to overprove the point. Then, once that had been demonstrated, we could be certain the computer would do exactly what we had intended for any valid input.
But for machine-learned models, universal quantification is too strong and unrealistic. We wouldn’t say for all x, P(x), since we do not intend that a machine-learned model should work for all possible data distributions. Instead of proving for all x, P(x), we could
instead focus on proving for all data distributions within a
certain class, but then we would need to characterize the
class.
For robustness, we might say for all norm-bounded
perturbations to characterize the class of data
distributions for which a model is robust. But what about
a property such as fairness? This soon becomes very
tricky to formalize. A practical consequence is that we
need to increase testing, recognizing—as in traditional
software engineering—we’re never going to be able to
test everything that’s likely to crop up in real life. This
illustrates why trustworthiness is an important research
frontier.
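A rough formalization of the contrast Wing draws, using generic notation rather than anything taken from the book: classical verification targets a universally quantified property, while a robustness property for a learned model f is quantified only over a restricted class, such as norm-bounded perturbations of in-distribution inputs.

```latex
% Classical verification: the property holds for every valid input.
\forall x.\ P(x)

% A typical robustness property for a learned model f: quantify only over inputs x
% drawn from the data distribution D and perturbations \delta within a norm bound \epsilon.
\forall x \sim D,\ \forall \delta \;\text{with}\; \lVert \delta \rVert \le \epsilon :\quad f(x+\delta) = f(x)
```

Fairness properties are harder still, because both the class of distributions and the property P itself are contested, which is Wing’s point about why trustworthiness remains a research frontier.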
CW Another point has to do with Ops—generalizing beyond
just keeping a website up, to making sure a data-science
application is continuing to work well. I mean, inputs can
fail, abuse can occur, and models may be more brittle
than thought. As I alluded to earlier, we need to continue
monitoring the model as if it were a living thing. This also
means thinking through how you’re going to monitor
impacts on users, as well as your statistical metrics. There
are some real engineering challenges to think about here
in terms of how you’re going to maintain observability for
a data-science model that’s deployed, particularly since it
will be retrained and refreshed regularly.
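One minimal example of the kind of ongoing observability Wiggins describes, with a synthetic feature and a hypothetical alerting threshold: compare a live input distribution against a training-time reference and flag drift for human review.

```python
# Illustrative drift check, not a production monitoring stack: a two-sample
# Kolmogorov-Smirnov test comparing a model input's live distribution against
# the reference window it was trained (or last validated) on.
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the live feature distribution differs detectably from the reference."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Toy usage: the live feature has shifted, so an alert -- and a look at the
# model's recent behavior -- is warranted.
rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=5_000)
live = rng.normal(0.5, 1.0, size=5_000)
print(drifted(reference, live))  # True
```

In practice the same idea extends to output metrics and user-impact measures, re-baselined each time the model is retrained.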
BF We’ve covered a lot of ground today. Any final thoughts
you’d like to leave people with?
AS We hope the analysis rubric shows a path toward
providing useful structure to data science.
JW All four of us definitely believe in harnessing data for
good, whether for a university, a business, or society at
large. But there’s no escaping the breadth of topics that
need consideration. The breadth certainly complicates
data-science education.
CW I would emphasize that we are often solving very hard
problems—these are sometimes wicked problems—and we
need due consideration of many underlying principles. We
then need to act on them and do the very best we can to
balance sometimes-conflicting goals.
PN As I said earlier, our field is growing up. We are having
a genuine impact on the world, and we find that we have
to think hard along many dimensions to achieve the best
possible goals.
References
1. ACM Code of Ethics and Professional Conduct; https://www.acm.org/code-of-ethics.
2. Churchman, C. W. 1967. Wicked problems. Management Science 14(4), B141–B142; https://www.jstor.org/stable/2628678.
3. Gawande, A. 2010. The Checklist Manifesto. Penguin Books India.
4. Halevy, A., Norvig, P., Pereira, F. 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), 8–12; https://ieeexplore.ieee.org/document/4804817.
5. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1978. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research; https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html.
6. Spector, A. Z., Norvig, P., Wiggins, C., Wing, J. M. 2022. Data Science in Context: Foundations, Challenges, Opportunities. Cambridge, England: Cambridge University Press.
7. Wiggins, C., Jones, M. L. 2023. How Data Happened: A History from the Age of Reason to the Age of Algorithms. New York, NY: W. W. Norton and Co.
Peter Norvig is a Fellow at Stanford’s Human-Centered
AI Institute and a researcher at Google Inc. Previously he
directed the core search algorithms group and the research
group at Google. He is coauthor of Artificial Intelligence:
A Modern Approach, the leading textbook in the field, and
co-teacher of a class on artificial intelligence that signed
up 160,000 students, helping to kick off the current round
of MOOCs (massive open online courses). He is a Fellow of
the AAAI, ACM, California Academy of Science, and American
Academy of Arts & Sciences.
Dr. Alfred Spector is a visiting scholar at MIT. His career
began with innovation in large-scale, networked computing
systems (at Stanford, as a professor at CMU, and as founder
of Transarc) and then transitioned to research leadership
(as global VP of IBM Software Research, Google Research,
and then as CTO of Two Sigma Investments). Dr. Spector has
lectured widely on the growing importance of computer
science across all disciplines (CS+X), and he just completed
Data Science in Context: Foundations, Challenges,
Opportunities. He is a Fellow of the ACM, IEEE, the National
Academy of Engineering, and the American Academy of Arts
& Sciences, where he serves on its council. Dr. Spector was a
Hertz Fellow, won the 2001 IEEE Kanai Award for Distributed
Computing, was co-awarded the 2016 ACM Software Systems
Award, and was a Phi Beta Kappa visiting scholar. He received
a Ph.D. from Stanford and an A.B. from Harvard.
Chris Wiggins is an associate professor of applied
mathematics at Columbia University and the
chief data scientist at the New York Times. At Columbia he is
a founding member of the executive committee of the Data
Science Institute and the Department of Systems Biology,
and he is affiliated faculty in statistics. He is a cofounder and
co-organizer of the nonprofit hackNY (http://hackNY.org),
which since 2010 has organized once-a-semester student
hackathons; and the hackNY Fellows Program, a structured
summer internship at New York City startups. Prior to joining
the faculty at Columbia, he was a Courant instructor at NYU
(1998-2001) and earned his Ph.D. at Princeton University
(1993-1998) in theoretical physics. He is a Fellow of the
American Physical Society and is a recipient of Columbia’s
Avanessians Diversity Award.
Jeannette M. Wing is executive vice president for
research and professor of computer science at Columbia
University. Her current research interests are in trustworthy
AI. Wing came to Columbia from Microsoft, where she served
as corporate vice president of Microsoft Research, overseeing
research labs worldwide. Before joining Microsoft, she was on
the faculty at Carnegie Mellon University, where she served
as head of the department of computer science and associate
dean for academic affairs of the School of Computer Science.
She is a Fellow of the American Academy of Arts and Sciences,
American Association for the Advancement of Science, ACM,
and IEEE. She holds bachelor’s, master’s, and doctoral degrees
from MIT.
Copyright © 2023 held by owner/author. Publication rights licensed to ACM.