An Introduction to Big Data
Harry E. Pence
TLTC Faculty Fellow for Emerging Technologies
Harry E. Pence 2013
Why Big Data Now?
Big Data is produced by the collision of cheaper,
faster computing with very large social networks.
In 1980 a terabyte of disk storage cost $14 million;
now it costs about $30.
Amazon or Google will “rent” a cloud-based
supercomputer cluster for only a few hundred
dollars an hour.
[Image: Sebastien Pierre's Facebook map]
The annual growth rate of big data is 60%. Twitter generates more than 7 terabytes (TB) a day, Facebook more than 10 TB, and some enterprises already store data in the petabyte range.
1 megabyte = 1E6 bytes
1 gigabyte = 1E9 bytes
1 terabyte = 1E12 bytes
1 petabyte = 1E15 bytes, or about 250,000 DVDs
Facebook currently stores more than 100 petabytes of data. (Source: http://tinyurl.com/a8zwman)
Or Garbage?
The amount of data available is
growing much faster than the
ability of most companies to
process and understand it.
Moore’s Law 2: Data expands to fill
available storage.
According to researchers at UC San Diego, Americans consumed about 3.6 zettabytes of information in 2008.
David Weinberger (p. 7) says a digital copy of War and Peace is about 2 megabytes, so one zettabyte equals 5E14 copies of War and Peace. It would take light 2.9 days to travel from the top to the bottom of this stack.
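Both equivalences above are easy to sanity-check in a few lines of Python (assuming a 4 GB DVD and Weinberger's 2 MB novel):

```python
# Quick sanity checks on the storage equivalences above.
PETABYTE = 1e15       # bytes
ZETTABYTE = 1e21      # bytes
DVD = 4e9             # bytes; assumes a 4 GB disc
WAR_AND_PEACE = 2e6   # bytes, per Weinberger

print(PETABYTE / DVD)             # 250000.0 DVDs per petabyte
print(ZETTABYTE / WAR_AND_PEACE)  # 5e+14 copies of War and Peace
```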
One definition is, “Big Data is when the size
of the data itself becomes a problem.”
But Big Data is important not just because of size but also
because of how it connects data, people, and
information structures. (http://tinyurl.com/ato2hbu) It
enables us to see patterns that weren’t visible before.
Researchers at the Harvard School of Public Health combined location data from over 14 million cell phone subscribers in Kenya with a malaria prevalence map to identify the main sources of malaria transmission in Kenya.
Anyone can use powerful, open, online databases, like http://data.worldbank.org/ and https://explore.data.gov/. Google's BigQuery (https://developers.google.com/bigquery/) allows anyone to query datasets such as all of Wikipedia, the works of Shakespeare, and weather-station records from within a browser for less than $0.035 per GB.
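A minimal sketch of such a query using the google-cloud-bigquery Python client against Google's public Shakespeare sample table; the client library and dataset path are today's conventions rather than anything from this deck, and running it requires a Google Cloud project with credentials configured:

```python
# Count the most frequent words across Shakespeare's works using
# BigQuery's public sample table. pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client()  # picks up your project credentials
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.word, row.total)
```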
[Image credit: Gerd Leonhard]
John Wanamaker once said, “I know that half of
my advertising doesn’t work. The problem is I
don’t know which half.”
Did Big Data help to
determine the 2012 election?
NY Times, 2/14/13: http://tinyurl.com/acj8hk5
Romney raised slightly more money from his online ads than he spent on them; Obama's team more than doubled the return on its online-ad investment.
Romney's get-out-the-vote digital tool, Orca, crashed on Election Day; Obama's Narwhal gave every member of the campaign instant access to continuously updated voter information.
Obama was the very first candidate to appear on Reddit,
and the photo of the Obamas became the most popular
image ever seen on Twitter or Facebook.
Romney’s senior strategist, Stuart Stevens, may well be
the last guy to run a presidential campaign who never
tweeted.
Commercial users seem more concerned with correlation than causality.
By comparing the 50 million most common search terms with CDC data on the flu, Google found a combination of search terms that strongly correlated with the spread and intensity of the 2009 flu season (see the sketch after these examples).
Both Amazon and Netflix use correlation-based recommendations to boost sales.
Target assumes that if a twenty-something female shopper purchases unscented lotion, supplements such as zinc and calcium, and a large purse, she is probably pregnant.
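A minimal sketch of the Google-Flu-style correlation step, with hypothetical weekly counts standing in for real search and CDC data:

```python
# Correlate weekly counts of one search term with CDC flu counts;
# terms with a high Pearson r become candidate flu predictors.
import numpy as np

searches = np.array([120, 340, 560, 800, 950, 700, 400])   # hypothetical
cdc_cases = np.array([100, 300, 520, 790, 900, 650, 380])  # hypothetical

r = np.corrcoef(searches, cdc_cases)[0, 1]
print(f"Pearson r = {r:.2f}")
```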
Cheap sensors constantly monitor personal data to allow for personalized support. Inexpensive ($100-200) devices like Fitbit already track your daily physical activity and upload it to a web page.
General Motors (OnStar)
and several insurance
companies offer driving
monitors.
Inexpensive sensors are rapidly moving
us towards an Internet of Things.
[Diagram: the physical world and the Internet of Things; "We are here!" http://tinyurl.com/aoev3x9]
Google's Knowledge Graph is a step towards the Semantic Web.
Companies ranging from the NY Times to the UK
Ordnance Survey are creating linked data to make the
web more interconnected.
Each entity is defined by a Uniform Resource Identifier (URI), which is machine readable.
The hope is to attach metadata to each entity to show how entities relate to each other (employees to companies, actors to motion pictures, etc.), as in the sketch below.
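A small sketch of one such machine-readable relation using the Python rdflib package; the URIs and the schema.org vocabulary here are illustrative assumptions, not from the original deck:

```python
# State "Alice works for Acme" as a linked-data triple and print it
# in Turtle syntax. pip install rdflib
from rdflib import Graph, Namespace, URIRef

SCHEMA = Namespace("https://schema.org/")
g = Graph()
g.add((URIRef("http://example.org/person/alice"),
       SCHEMA.worksFor,
       URIRef("http://example.org/company/acme")))
print(g.serialize(format="turtle"))
```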
Some predict that the Internet of Things will soon
produce a massive volume and variety of data at
unprecedented velocity. http://tinyurl.com/ahytzdf
Welcome to the new information age
Big Data and Big Science
“The Fourth Paradigm”
Massive Sensor Arrays are
important in many areas of
science, like meteorology,
environmental monitoring,
astronomy, and climate
measurements.
The Square Kilometre Array (SKA) under development
in Australia and South Africa will collect one exabyte of
data per day from 36 small antennas spread over more
than 3000 km to simulate a single giant radio telescope.
Bradley Voytek (Big Data, location 1186) argues that Big Data analysis allows researchers to identify patterns that were previously invisible; it may even be possible to automate critical aspects of the scientific method itself.
The Large Hadron Collider at CERN, a 17-mile ring, produces so much data that scientists must discard most of it, hoping they haven't thrown away anything useful.
Weather prediction combines
data from multiple earth
satellites with massive
computing power.
Most of the satellites belong to
the U.S., but the Europeans
have a more powerful
computer.
Our weather satellites are old.
http://tinyurl.com/cvpz5qe
Big Data and the Future
of Health Care
A patient can use a smartphone to do many tests now
done in the laboratory, such as an EKG or glucose test.
The Quest for the $1,000 Genome.
In May 2007, the first personal genome cost $1 million. Recently, several companies have come close to $1,000, and some predict a $100 cost soon.
Knowing an individual’s genome should allow treatment
to be customized to the individual.
As the cost plummets, the bottleneck will increasingly be
the cost of interpreting the genomic data.
The Three V's
IBM defines Big Data as the combination of velocity, variety, and volume.
The sheer volume of stored data is exploding.
IBM predicts that there will be 35 zettabytes
stored by 2020.
This data comes in a bewildering variety of
structured and unstructured formats.
The velocity of data depends on not just the
speed at which the data is flowing but also the
pace at which it must be collected, analyzed,
and retrieved.
IBM claims that its software will adjust to
changes in any of these three factors.
Search engines, like Yahoo! and Google, were the first companies to work with datasets that were too large for conventional methods. (According to Big Data, location 177, Google has over a million servers.)
In order to power its searches, Google developed a strategy called MapReduce: you map a task onto a multitude of processors and then combine the results, as in the toy sketch below.
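A toy, single-machine illustration of that idea, using the classic word-count example; a real cluster would spread the map step across many machines rather than local processes:

```python
# Map a counting task over many documents, then reduce (merge) the
# partial results into one answer.
from collections import Counter
from multiprocessing import Pool

def mapper(document):
    """Map step: count the words in a single document."""
    return Counter(document.split())

if __name__ == "__main__":
    docs = ["big data big ideas", "big data big storage"]
    with Pool() as pool:
        partials = pool.map(mapper, docs)  # run mappers in parallel
    totals = sum(partials, Counter())      # reduce step: merge the counts
    print(totals.most_common(3))
```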
Traditional data warehouses use a relational database (think rows and columns); search engines need to handle non-relational databases, sometimes called NoSQL (a toy contrast follows).
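A toy contrast between the two models, with an invented user record: a relational row has a fixed schema, while a NoSQL document can carry whatever fields each record needs.

```python
# The same user record in two storage models (illustrative only).
relational_row = ("u123", "Alice", 29)  # fixed columns: (id, name, age)

nosql_document = {  # schema-less: fields can vary from record to record
    "_id": "u123",
    "name": "Alice",
    "recent_searches": ["flu symptoms", "malaria map"],
}
```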
The most popular software for searching NoSQL databases is called Hadoop, and several different versions are freeware. (It is named after its creator's son's toy elephant.)
Hadoop is designed to collect data even if it doesn't fit nicely into tables, distribute a query across a large number of separate processors, and then combine the results into a single answer set, delivering results in almost real time (a minimal streaming sketch follows).
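One concrete way to run such a job is Hadoop Streaming, which feeds each script text on stdin and collects tab-separated key/value pairs from stdout, so the mapper and reducer can be plain scripts in any language. A minimal word-count pair might look like this, with file names chosen for illustration:

```python
# mapper.py: emit "word<TAB>1" for every word Hadoop pipes in on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: Hadoop sorts mapper output by key, so equal words arrive
# together; sum their counts with groupby.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(n) for _, n in group)}")
```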
This is often paired with machine learning applications, such as recommendation engines, classification, error detection, or facial recognition.
Personal Aside: I suggest that Google Analytics might
be the best way to introduce students to crafting a
query for a Big Data exercise.
We are in the Golden Age of Data Visualization
[Image: a streamgraph of the conversation around a brand. http://tinyurl.com/beqxuyl]
Classify each of these as pro or con fracking (a naive classifier sketch follows the list):
"Don't appoint a #fracking proponent to lead the Dept. of Energy." (often RT)
"Broome County Executive took $82,428 in pro-#fracking campaign contributions."
"The only politician I know that has backed up his promise to address #fracking is Gov. Cuomo."
"Natural gas is neither perfect nor perfectly evil."
"Businesses surprised to see their names on #fracking petition."
"Mmmm @fracking fluid bit.ly/ZmQVQj"
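A deliberately naive keyword classifier, to show why this exercise is hard; the cue words are invented for illustration, and sarcasm like the last tweet defeats them entirely:

```python
# Score a tweet by counting pro vs. con cue words (hypothetical lists).
PRO = {"backed", "promise"}           # illustrative only
CON = {"proponent", "don't", "took"}  # illustrative only

def classify(tweet: str) -> str:
    words = set(tweet.lower().split())
    score = len(words & PRO) - len(words & CON)
    return "pro" if score > 0 else "con" if score < 0 else "unclear"

print(classify("Mmmm @fracking fluid"))  # "unclear": keywords can't see sarcasm
```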
Amazon’s Mechanical Turk is one way to
manually create a standard data template.
https://www.mturk.com/mturk/welcome
"Smart Data" (http://tinyurl.com/atcanjw)
Moving from Big Data to Smart Data is a multistep process. "The issue is not about the volume of data but the ability to analyze and act on data in real time."
Scraping, processing,
and buying Big Data
One can only copyright a specific arrangement of the
data, but the metadata is often extremely important and
may not follow the scrape.
Companies, like Infochimps, are scraping and cleaning selected data from Twitter and then selling access to these datasets.
Pete Warden reports that it cost him only $120 to gather, analyze, and visualize 220 million public Facebook profiles (http://tinyurl.com/yb2q3dv), and 80legs allowed him to download a million web pages for about $2.20.
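A minimal public-page scrape in Python, assuming the requests and beautifulsoup4 packages and a placeholder URL; real projects should respect robots.txt and each site's terms of service:

```python
# Fetch one page and pull out its title and outbound links.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.string)
for link in soup.find_all("a"):
    print(link.get("href"))
```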
Problems with Big Data –
Selection bias
(http://tinyurl.com/bdb7wgy)
Selection bias occurs when the individuals or groups taking part in a study don't represent the general population.
According to a recent Pew survey, Twitter users are
younger than the general public and more likely to lean
toward the Democratic Party. Twitter reactions are
often at odds (six out of eight times) with overall public
opinion. http://tinyurl.com/cvqq5hz
A recent article in EPJ Data Science says Twitter is actually composed of modern-day tribes, groups of people who use a distinct language and are connected to a character, occupation, or interest.
http://preview.tinyurl.com/bc8gecu
[Image: the partition of English-speaking Twitter users into communities, annotated with words often used by members of each community. http://tinyurl.com/bmb9r9e]
Problems with Big Data –
Misclassification bias
(http://tinyurl.com/bdb7wgy)
Misclassification occurs when either the cause or the effect is not accurately recognized.
What if a response is not correctly identified as the participant intended? This is especially likely when subjective interpretation is required to classify an answer. http://tinyurl.com/cvqq5hz
Remember: correlation does not imply causation. But some projects, like targeting building inspections in New York City (http://tinyurl.com/clhmv3t), seem to work well on the basis of correlation alone.
Comments on Big Data for education
The information in LMSs like Moodle or Angel (e.g., time on task, number of logins, number of list posts) is well structured and is already being analyzed.
In MOOCs, students interact entirely online, leaving
behind a record of every page they visited.
The Gates Foundation recently gave $100 million to InBloom to improve ways to transfer information among the many technology information silos where student records are currently stored. Nine states (including New York) are participating in this pilot project and plan to offer third-party vendors access to student data (without student or parental consent). http://tinyurl.com/b8e4whh
Unanswered questions include security, who owns learner-produced data (TurnItIn?), who owns the data analysis, and what will be shared with the students?
Dwell time is how long a student spends on a given activity.
Chico State University (CA) has been using learning analytics to study how student achievement is related to LMS use and student characteristics. http://tinyurl.com/acz5f7o
The project merges LMS data with student characteristics and course performance from the campus database. The study reports a direct positive relationship between LMS usage and the student's final grade (a hypothetical sketch of the analysis follows).
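A hypothetical sketch of that kind of analysis in Python with pandas; the file and column names are invented, not taken from the Chico State study:

```python
# Merge LMS activity with campus records, then correlate usage with grade.
import pandas as pd

lms = pd.read_csv("lms_activity.csv")        # student_id, dwell_time, logins
records = pd.read_csv("campus_records.csv")  # student_id, final_grade

merged = lms.merge(records, on="student_id")
print(merged["dwell_time"].corr(merged["final_grade"]))  # Pearson r
```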
Voytek’s Third Law: Any sufficiently advanced statistics
can trick people into believing the results reflect truth.
Problems with Big Data – Confounding bias
(http://tinyurl.com/bdb7wgy)
Confounding occurs when there is a failure to control
for some other factor that is affecting the outcome.
How do other factors, like attendance, reading the
textbook, attending extra help sessions, etc. relate to
the time spent on the LMS (and so the course grade)?
“Low-income students spend more time on the LMS.”
The article notes that, “No individually identifiable
information is included in the data files.”
Really???
Problems with Big Data –
Privacy
A sensible back-up strategy may create more than 100 copies of a record. How can organizations protect the privacy of all this data from hackers?
Recent experience suggests that there are now so many public datasets available for cross-referencing that it is difficult to ensure that any Big Data records can be kept private.
In a number of cases, information from “anonymous
studies” has been tracked back to identify individuals
and even their families.
"Just as Ford changed the way we make cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community."
danah boyd (http://tinyurl.com/bdb7wgy)
“And finally, how will the harvesting of Big Data change
the meaning of learning and what new possibilities and
limitations may come from these systems of knowing?”
Thank you for listening.
That was Vienna Teng singing "The Hymn of Acxiom," sacred music for the age of Big Data. http://soundcloud.com/tracks/search?q=acxiom
Any questions?