Big Data - Center for Curriculum Redesign

To arrive at the edge of the world's knowledge, seek out the most complex and sophisticated
minds, put them in a room together, and have them ask each other the questions they are asking
themselves.
REINVENTING SOCIETY IN THE WAKE OF BIG DATA
A Conversation with Alex (Sandy) Pentland [8.30.12]
With Big Data we can now begin to actually look at the details of social interaction and how those play
out, and are no longer limited to averages like market indices or election results. This is an astounding
change. The ability to see the details of the market, of political revolutions, and to be able to predict
and control them is definitely a case of Promethean fire: it could be used for good or for ill, and so
Big Data brings us to interesting times. We're going to end up reinventing what it means to have a
human society.
ALEX 'SANDY' PENTLAND is a pioneer in big data, computational social science, mobile and health
systems, and technology for developing countries. He is one of the most-cited computer scientists in
the world and was named by Forbes as one of the world's seven most powerful data scientists. He
currently directs the
REINVENTING SOCIETY IN THE WAKE OF BIG DATA
[SANDY PENTLAND:] Recently I seem to have become MIT's Big Data guy, with people like Tim
O'Reilly and "Forbes" calling me one of the seven most powerful data scientists in the world. I'm not
sure what all of that means, but I have a distinctive view about Big Data, so maybe it is something
that people want to hear.
I believe that the power of Big Data is that it is information about people's behavior instead of
information about their beliefs. It's about the behavior of customers, employees, and prospects for
your new business. It's not about the things you post on Facebook, and it's not about your searches
on Google, which is what most people think about, and it's not data from internal company processes
and RFIDs. This sort of Big Data comes from things like location data off of your cell phone or credit
card, it's the little data breadcrumbs that you leave behind you as you move around in the world.
What those breadcrumbs tell is the story of your life. It tells what you've chosen to do. That's very
different than what you put on Facebook. What you put on Facebook is what you would like to tell
people, edited according to the standards of the day. Who you actually are is determined by where
you spend time, and which things you buy. Big data is increasingly about real behavior, and by
analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether
you are the sort of person who will pay back loans. They can tell you if you're likely to get diabetes.
They can do this because the sort of person you are is largely determined by your social context, so if
I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your
crowd. You can tell all sorts of things about a person, even though it's not explicitly in the data,
because people are so enmeshed in the surrounding social fabric that it determines the sorts of things
that they think are normal, and what behaviors they will learn from each other.
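The idea of inferring an unobserved trait by comparing a person to their crowd can be sketched as a similarity vote. This is a toy illustration rather than the method used in the research described here: the records are made up, the names jaccard and infer_from_crowd are hypothetical, and set overlap stands in for whatever similarity measure a real system would use.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two sets of observed behaviours (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def infer_from_crowd(observed, crowd, k=3):
    """Guess an unobserved trait by majority vote among the k most similar people."""
    ranked = sorted(
        crowd,
        key=lambda person: jaccard(observed, person["behaviours"]),
        reverse=True,
    )
    votes = Counter(person["trait"] for person in ranked[:k])
    return votes.most_common(1)[0][0]


# Entirely made-up toy records, for illustration only.
crowd = [
    {"behaviours": {"gym", "farmers_market", "bike_commute"}, "trait": "pays_on_time"},
    {"behaviours": {"gym", "bike_commute", "library"}, "trait": "pays_on_time"},
    {"behaviours": {"farmers_market", "library", "gym"}, "trait": "pays_on_time"},
    {"behaviours": {"drive_commute", "night_shift", "diner"}, "trait": "pays_late"},
    {"behaviours": {"drive_commute", "diner", "gym"}, "trait": "pays_late"},
]

# Only two behaviours are observed; the trait itself is never seen directly.
guess = infer_from_crowd({"gym", "library"}, crowd)
print(guess)
```

The guess comes entirely from the neighbours' records, which is the sense in which a trait can be inferred even though it is not explicitly in the data.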
As a consequence, analysis of Big Data is increasingly about finding connections: connections with the
people around you, and connections between people's behavior and outcomes. You can see this in all
sorts of places. For instance, one type of Big Data and connection analysis concerns financial data. Not
just the flash crash or the Great Recession, but also all the other sorts of bubbles that occur. These
are systems of people, communications, and decisions that go badly awry. Big Data
shows us the connections that cause these events. Big Data gives us the possibility of understanding
how these systems of people and machines work, and whether they're stable.
The notion that it is connections between people that are really important is key, because researchers
have mostly been trying to understand things like financial bubbles using what is called Complexity
Science or Web Science. But these older ways of thinking about Big Data leave the humans out of the
equation. What actually matters is how the people are connected together by the machines and how,
as a whole, they create a financial market, a government, a company, and other social structures.
Because it is so important to understand these connections, Asu Ozdaglar and I have recently created
the MIT Center for Connection Science and Engineering, which spans all of the different MIT
departments and schools. It's one of the very first MIT-wide Centers, because people from all sorts of
specialties are coming to understand that it is the connections between people that is actually the core
problem in making transportation systems work well, in making energy grids work efficiently, and in
making financial systems stable. Markets are not just about rules or algorithms; they're about people
and algorithms together.
Understanding these human-machine systems is what's going to make our future social systems
stable and safe. We are getting beyond complexity, data science and web science, because we are
including people as a key part of these systems. That's the promise of Big Data, to really understand
the systems that make our technological society. As you begin to understand them, then you can build
systems that are better. The promise is for financial systems that don't melt down, governments that
don't get mired in inaction, health systems that actually work, and so on, and so forth.
The barriers to better societal systems are not about the size or speed of data. They're not about most
of the things that people are focusing on when they talk about Big Data. Instead, the challenge is to
figure out how to analyze the connections in this deluge of data and come to a new way of building
systems based on understanding these connections.
Changing The Way We Design Systems
With Big Data, traditional methods of system building are of limited use. The data is so big that any
question you ask about it will usually have a statistically significant answer. This means, strangely,
that the scientific method as we normally use it no longer works, because almost everything is
significant! As a consequence the normal laboratory-based question-and-answering process, the
method that we have used to build systems for centuries, begins to fall apart.
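The "almost everything is significant" effect can be sketched with synthetic data. The example below is an illustration, not something from the conversation: it assumes two groups whose true means differ by a negligible 0.02 standard deviations, and it uses a hypothetical helper, two_sample_z_test, built on the normal approximation. As the sample grows, the p-value collapses even though the difference stays practically meaningless.

```python
import math
import random


def two_sample_z_test(a, b):
    """Return (difference in means, two-sided p-value) via a normal approximation."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability of |Z| >= |z|
    return mean_a - mean_b, p


def simulate(n, shift=0.02, seed=0):
    """Two synthetic groups whose true means differ by a tiny `shift` (an assumption)."""
    rng = random.Random(seed)
    a = [rng.gauss(shift, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return two_sample_z_test(a, b)


# Same tiny effect, increasingly large samples.
results = {n: simulate(n) for n in (100, 10_000, 1_000_000)}
for n, (diff, p) in results.items():
    print(f"n={n:>9,}  difference={diff:+.4f}  p-value={p:.3g}")
```

With a million records per group the test flags the gap as highly significant, so a small p-value on its own says little about whether a pattern matters.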
Big Data and the notion of Connection Science are outside of our normal way of managing things. We
live in an era that builds on centuries of science, and our methods of building systems,
governments, organizations, and so on are pretty well defined. There are not a lot of things that are
really novel. But with the coming of Big Data, we are going to be operating very much out of our old,
familiar ballpark.
With Big Data you can easily get false correlations, for instance, "On Mondays, people who drive to
work are more likely to get the flu." If you look at the data using traditional methods, that may
actually be true, but the problem is why is it true? Is it causal? Is it just an accident? You don't know.
Normal analysis methods won't suffice to answer those questions. What we have to come up with is
new ways to test the causality of connections in the real world far more than we have ever had to do
before. We can no longer rely on laboratory experiments; we need to actually do the experiments
in the real world.
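How easily such false correlations arise can be sketched by testing many unrelated variables against one outcome. Everything here is synthetic and the function names are hypothetical: 1,000 random features are compared with a random outcome, so any "significant" correlation is an accident by construction.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_people=200, n_features=1000, alpha=0.05, seed=1):
    """Count pure-noise features that look 'significantly' correlated with the outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_people)]
        r = pearson_r(feature, outcome)
        # Fisher z-transform: with no true association, this is roughly N(0, 1).
        z = math.atanh(r) * math.sqrt(n_people - 3)
        p = math.erfc(abs(z) / math.sqrt(2))
        if p < alpha:
            hits += 1
    return hits


hits = count_spurious()
print(f"{hits} of 1000 unrelated features passed p < 0.05")
```

At a 5 percent threshold, roughly 5 percent of the pure-noise features pass the test, which is why an observational hit alone cannot distinguish a causal link from an accident.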
The other problem with Big Data is human understanding. When you find a connection that works,
you'd like to be able to use it to build new systems, and that requires having human understanding of
the connection. The managers and the owners have to understand what this new connection means.
There needs to be a dialogue between our human intuition and the Big Data statistics, and that's not
something that's built into most of our management systems today. Our managers have little concept
of how to use big data analytics, what they mean, and what to believe.
In fact, the data scientists themselves don't have much intuition either…and that is a problem. I
saw an estimate recently that said 70 to 80 percent of the results that are found in the machine
learning literature, which is a key Big Data scientific field, are probably wrong because the researchers
didn't understand that they were overfitting the data. They didn't have that dialogue between intuition
and causal processes that generated the data. They just fit the model and got a good number and
published it, and the reviewers didn't catch it either. That's pretty bad because if we start building our
world on results like that, we're going to end up with trains that crash into walls and other bad things.
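The overfitting failure described here can be sketched in a few lines. This is an illustrative assumption, not a reconstruction of any published result: the labels are random coin flips, and a 1-nearest-neighbour classifier (chosen only because it memorizes its training data) is scored both on the data it was fit to and on fresh data.

```python
import random


def nearest_neighbour_predict(train, point):
    """Label of the training example closest to `point` (1-nearest-neighbour)."""
    closest = min(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
    )
    return closest[1]


def accuracy(train, examples):
    """Fraction of `examples` whose label the memorized training set predicts correctly."""
    correct = sum(nearest_neighbour_predict(train, x) == y for x, y in examples)
    return correct / len(examples)


def make_noise(n, n_features, rng):
    """Features and labels are independent, so there is nothing real to learn."""
    return [
        ([rng.random() for _ in range(n_features)], rng.choice([0, 1]))
        for _ in range(n)
    ]


rng = random.Random(42)
train = make_noise(300, 5, rng)
held_out = make_noise(300, 5, rng)

train_acc = accuracy(train, train)      # scored on the data the model memorized
test_acc = accuracy(train, held_out)    # scored on data it has never seen
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The model scores perfectly on its own training set and close to chance on held-out data; reporting only the first number is the mistake described above.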
Management using Big Data is actually a radically new thing.
This last year at Davos I ran several sessions around Big Data with the CEOs of leading companies in
this area, and it was very clear that there's a whole new way of doing things that's just now
developing. Some of them, like Palantir and TIBCO, are making progress at this, but to most of the
people in the room this was brand new, and they had not gotten up to speed about it at all.
Another important issue with Big Data is that since this data is mostly about people, there are
enormous issues about privacy, data ownership, and data control. You can imagine using Big Data to
make a world that is incredibly invasive, incredibly 'Big Brother'… George Orwell was not nearly
creative enough when he wrote 1984.
For the last several years I've been helping to run sessions at the World Economic Forum around
sourcing personal data and ownership of the data, and that's ended pretty successfully with what I call
the New Deal on Data. The Chairman of the Federal Trade Commission, who's been part of the group,
put forward the U.S. "Consumer Data Bill of Rights," and in the EU, the Justice Commissioner declared
a version of this New Deal to be a basic human right.
Both of these regulatory declarations put the individual much more in charge of data that's about
them. This is a major step to making Big Data safer and more transparent, as well as more liquid and
available, because people can now choose to share data. It is a vast improvement over having the
data being locked away in industry silos where nobody even knows it's there.
Adam Smith And Karl Marx Were Wrong
These Big Data issues are important, but there are bigger things afoot. As you move into a society
driven by Big Data most of the ways we think about the world change in a rather dramatic way. For
instance, Adam Smith and Karl Marx were wrong, or at least had only half the answers. Why? Because
they talked about markets and classes, but those are aggregates. They're averages.
While it may be useful to reason about the averages, social phenomena are really made up of millions
of small transactions between individuals. There are patterns in those individual transactions that are
not just averages, they're the things that are responsible for the flash crash and the Arab Spring. You
need to get down into these new patterns, these micro-patterns, because they don't just average out
to the classical way of understanding society. We're entering a new era of social physics, where it's
the details of all the particles—the you and me—that actually determine the outcome.
Reasoning about markets and classes may get you half of the way there, but it's this new capability of
looking at the details, which is only possible through Big Data, that will give us the other 50 percent of
the story. We can potentially design companies, organizations, and societies that are more fair, stable
and efficient as we get to really understand human physics at this fine-grain scale. This new
computational social science offers incredible possibilities.
This is the first time in human history that we have the ability to see enough about ourselves that we
can hope to actually build social systems that work qualitatively better than the systems we've always
had. That's a remarkable change. It's like the phase transition that happened when writing was
developed or when education became ubiquitous, or perhaps when people began being tied together
via the Internet.
The fact that we can now begin to actually look at the dynamics of social interactions and how they
play out, and are not just limited to reasoning about averages like market indices, is for me simply
astonishing. To be able to see the details of variations in the market and the beginnings of political
revolutions, to predict them, and even control them, is definitely a case of Promethean fire. Big Data
can be used for good or bad, but either way it brings us to interesting times. We're going to reinvent
what it means to have a human society.
Creating A Data-Driven Society
One of the great questions is: who is this new data-driven world going to be for, and what is it going to
look like? People ask if this is just for the Davos attendees or for everybody. That's a question of values
and ethics, and that's why people have to be debating this now, and why I'm talking about this: to start
the conversation. But I will say, however, that all the conversations I've
been at in Davos have had an extremely strong egalitarian element. Most people are advocates for the
poor. Many are people from developing countries—an enormous number, not just a token scattering.
There's a real focus on building a sustainable future, which means one in which there aren't large
chunks of the population left out in the cold. Obviously not everybody is 100 percent devoted to that
agenda, but most are.
A key insight is that your data is worth more if you share it because it enables systems like public
health. Data about the way you behave and where you go can be used to stop the
spread of infectious disease. If you have children, you don't want to see them die of an H1N1
pandemic. How are you going to stop that? Well, it turns out that if you can actually watch people's
behavior in real time (something that is quite possible today), you can tell when each individual
person is getting sick. This means you can actually see the spread of influenza from person to person
on an individual level. And if you can see it, you can stop it. You can begin to build a world where
infectious pandemics cease to be as much of a threat.
Similarly, if you're worried about global warming, we now know how patterns of mobility relate to
productivity (and I just showed some examples of those; we are doing a lot of really amazing science
around this). This means you can design cities that are far more efficient, far more human, and burn
an awful lot less energy. But you need to be able to see the people moving around in order to be able
to get these results. That's another instance where sharing your data is invaluable to you personally.
It's everybody contributing his or her data that's going to make a greener world, and that is worth far
more than the simple cash value of the data.
However, today the data is siloed off and unavailable, and that was one of the core reasons I
proposed the New Deal on Data to the World Economic Forum. Since then the idea has run through
various discussions and turned into the Consumer Data Bill of Rights in the United States, and the
declaration on Data Rights in the EU. The core idea is that when data is in silos you can't make use of
it either for evil or for the public good, and we need the public good. We need to stop pandemics. We
need to make a greener world. We need to make a fairer world.
Who Owns The Data In A Data-Driven Society?
How do you get the data out of those silos? The first step is you have to figure out who owns that
data. Does the telephone company own it, just because it happened to be collected while you were
walking around with your phone? Maybe they have some right to use it. But where the discussions
among all the participants, including the telephone companies, have come out is that you're the only one that has
final disposal of it. They would have the ability to keep copies to offer services that you've requested,
but you, the individual, have to have the final say.
Some situations are, of course, more complex. What about if the data is a transaction with a
merchant? Well, they have a right to the data too. But by assigning rights of ownership to people
(which is not exactly the same as legal ownership) what you do is you make it possible to break data
out of the silos. You've turned it into a personal asset that can then be shared for value in return. You
can make it a liquid asset that can be used to build government systems, social systems, or for-profit
systems. That's the world we're moving towards.
Is there opposition to this? Surprisingly little. The incumbents in the Internet are probably the major
opposition because (and I don't mean to pick on them) Facebook and Google grew up in a completely
unregulated environment. It is natural for them to think that they have control over the data, but now
they're slowly, slowly coming around to the idea that they're going to have to compromise on that.
However the people who have the most valuable data are the banks, the telephone companies, the
medical companies, and they're very highly regulated industries. As a consequence they can't really
leverage that data the way they'd like to unless they get buy-in from both the consumer and the
regulators. The deal that they've been willing to cut is that they will give consumers control over their
data in return for being able to make them offers about using their data.
That gets these companies out of the regulator's pocket. It gives them a white hat, because they
explicitly asked you if you wanted to opt in, and it lets them make money, which is what they
desperately want. And it appears that if you treat people's data in this sort of responsible manner,
people will willingly share their data. It is a win-win-win solution to the privacy problem, and it's the
companies that grew up in an unregulated environment, or the companies that are in gray markets
that are likely to dry up, that are most strongly opposed.
We are beginning to see services that leverage personal data in this sort of respectful manner.
Services such as really personal recommendations, identity certification without passwords, and
personal public services for transportation, health, and so forth. All these areas are undergoing
tectonic changes, and the more that we can use specific data about specific people, the better we can
make the system work.
These dramatic improvements in societies' systems go back to what I was saying earlier. Today
societies' systems are built on big averages and indices, e.g., this class of people do this and this
market's moving that way. But really, it's all made up of millions and millions of small interactions,
and with Big Data we can get down and design things that really work for us on a personal level,
rather than just being treated as another type A4 consumer.
Organizations With Hard Information Boundaries Will Tend To Dissolve
I got to these issues through a long and varied history. I started off doing a lot of signal processing and
machine vision. I have a background in psychology as well, and am concerned with how data and
people come together in social systems. For instance, we developed some of the first wearable
computing devices. The Google Glass project comes out of my group…the guys that are building it are
my former students. But as a result of these sorts of projects it became obvious to me that the most
important thing was not the user interface or the device, it was the data about people. Later, as cell
phones became more ubiquitous, it was clear that they were going to be the biggest source of
data in the world.
If you could see everybody in the world all the time, where they were, what they were doing, who
they spent time with, then you could create an entirely different world. You could engineer
transportation, energy, and health systems that would be dramatically better. It's this history of
thinking about signals and people together, and how people work via these computer systems, and
what data about human behavior can do, that led me to the realization that we're at a phase
transition. We are moving from the reasoning of the Enlightenment about classes and about markets
to a fine-grained understanding of individual interactions and systems built on fine-grained data sharing.
This new world could make George Orwell look like an unimaginative third stringer. It became really
clear you had to think hard about the privacy and data ownership issues. What George Orwell
didn't realize is that if you can watch the patterns of people interacting, then you can figure
out things like who they're going to vote for and how they're going to react to various situations like
changes of regulation, and so forth. You could build something that, to a first approximation, would be
the real evil empire. And, of course, some people are going to try and do that.
At the same time, there are some elements of this new data-driven world that are really promising.
For instance, the most efficient and robust architectures tend to be ones that have no central points.
It means that there's no single place for a dictator to grab control. They have to actually go to every
house to really control the data. In addition, I see government policies going in the right directions, to
minimize these sorts of dangers.
Also there is inherent in a society built on data sharing a certain level of transparency and choice for
individuals that I believe will tend to militate against central control. It tends to dissolve the power of
the state and big organizations because you can build things that are far more efficient and robust if
they're distributed and without the hard information boundaries that you see today.
That means that the service-oriented government, as it were, or the service-oriented organization will
tend to have better offerings for a lower price, as opposed to the ones that try to own the customer or
control the citizen. As a consequence I expect to see that organizations with hard information
boundaries will tend to dissolve, because there will be competition from things that are better that
don't have the hard boundaries and don't try to own your data.