An Introduction to Big Data
Harry E. Pence
TLTC Faculty Fellow for Emerging Technologies
Harry E. Pence 2013
John Wanamaker once said, “I know that half of
my advertising doesn’t work. The problem is I
don’t know which half.”
What’s new about Big Data?
IBM describes the problem in terms of the three V’s: volume, variety, and velocity.
The sheer volume of stored data is exploding; IBM predicts that there will be 35 zettabytes stored by 2020.
This data comes in a bewildering variety of structured and unstructured formats.
The velocity of data depends not just on the speed at which the data is flowing but also on the pace at which it must be collected, analyzed, and retrieved.
Although most businesses already collect terabytes of information about customers, employees, and their enterprise, a recent survey found that 62% of business leaders couldn’t access their information fast enough, and 83% believed it didn’t give them what they needed to know.
The annual growth rate of big data is 60%.
Many business schools call it Business Analytics rather than Big Data.
Computing resources are becoming much cheaper and more powerful.
In 1980 a terabyte of disk storage cost $14 million; now it costs about $30.
Amazon or Google will “rent” a cloud-based supercomputer cluster for only a few hundred dollars an hour.
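As a quick sanity check, the price drop quoted above works out to roughly a factor of half a million (a two-line sketch using only the slide's own figures):

```python
# How far has the price of disk storage fallen? (figures from the slide)
cost_1980 = 14_000_000  # dollars per terabyte of disk in 1980
cost_now = 30           # dollars per terabyte today

factor = cost_1980 / cost_now
print(f"about {factor:,.0f}x cheaper")  # about 466,667x cheaper
```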
Social networks, like Facebook and Twitter,
are spanning the globe.
Twitter generates more than 7 terabytes (TB) a day, Facebook more than 10 TB, and some enterprises already store data in the petabyte range.
Sebastien Pierre’s Facebook Map
1 megabyte = 1E6 bytes
1 gigabyte = 1E9 bytes
1 terabyte = 1E12 bytes
1 petabyte = 1E15 bytes, or 250,000 DVDs
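A quick arithmetic check on the DVD comparison, assuming a DVD holds roughly 4 GB:

```python
# Sanity check on the storage units above.
UNITS = {
    "megabyte": 1e6,
    "gigabyte": 1e9,
    "terabyte": 1e12,
    "petabyte": 1e15,
}
DVD_BYTES = 4e9  # assumed capacity of a single-layer DVD

dvds_per_petabyte = UNITS["petabyte"] / DVD_BYTES
print(f"1 petabyte = {dvds_per_petabyte:,.0f} DVDs")  # 1 petabyte = 250,000 DVDs
```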
Facebook currently stores more than
100 petabytes of data.
Source: http://tinyurl.com/a8zwman
The amount of data available is
growing much faster than the
ability of most companies to
process and understand it.
Moore’s Law 2: Data expands to fill
available storage.
According to researchers at UC San Diego, Americans consumed about 3.6 zettabytes of information in 2008.
David Weinberger (p. 7) notes that a digital copy of War and Peace (1,296 pages) is about 2 megabytes, so one zettabyte equals 5E14 copies of War and Peace.
It would take light 2.9 days to go from the
top to the bottom of this stack.
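Weinberger's arithmetic checks out; a two-line sketch:

```python
ZETTABYTE = 1e21     # bytes
WAR_AND_PEACE = 2e6  # bytes, Weinberger's estimate for the digital text

copies = ZETTABYTE / WAR_AND_PEACE
print(f"{copies:.0e} copies of War and Peace per zettabyte")  # 5e+14 copies of War and Peace per zettabyte
```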
Some applications of Big Data
Topological data analysis is being used to rethink basketball, suggesting thirteen playing positions rather than the traditional five.
http://tinyurl.com/c5ajwm3
In March 2012, the Obama Administration announced $200 million in R&D investments for Big Data. http://tinyurl.com/85oytkj
Google combined search terms with CDC data to identify queries that correlated with the spread of flu during the 2009 season.
Both Amazon and Netflix use correlation-based
suggestions to boost sales.
Target assumes that if a twenty-something female shopper purchases unscented lotion, supplements such as zinc and calcium, and a large purse, she is pregnant.
Did Big Data help to
determine the 2012 election?
NY Times, 2/14/13 http://tinyurl.com/acj8hk5
While Romney raised only slightly more money from his online ads than he spent on them, Obama’s team more than doubled the return on its online-ad investment.
Romney’s get-out-the-vote digital tool, Orca, crashed on Election Day; Obama’s Narwhal gave every member of the campaign instant access to continuously updated voter information.
Obama was the first presidential candidate to appear on Reddit, and the photo of the Obamas became the most popular image ever posted on Twitter or Facebook.
Romney’s senior strategist, Stuart Stevens, may well be
the last guy to run a presidential campaign who never
tweeted.
One definition is, “Big Data is when the size
of the data itself becomes a problem.”
Often so much data is collected that much of it must be discarded, a process known as the Donner Party Effect.
Open, online databases, like http://data.worldbank.org/ and
https://explore.data.gov/, are now available.
Google Analytics allows us to query search patterns, and Google’s BigQuery (https://developers.google.com/bigquery/) allows anyone to query public datasets such as Wikipedia, Shakespeare, and weather-station records for less than $0.035 per GB.
http://tinyurl.com/bvd2yve
But Big Data is important not just because of size but also
because of how it connects data, people, and
information structures. (http://tinyurl.com/ato2hbu) It
enables us to see patterns that weren’t visible before.
We are moving towards the Semantic Web.
Companies ranging from the NY Times to the UK
Ordnance Survey are creating linked data to make the
web more interconnected.
Each entity is defined by a machine-readable Uniform Resource Identifier (URI).
The hope is to attach metadata to each entity to show how entities relate to each other: employees to companies, actors to motion pictures, etc.
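A minimal sketch of the linked-data idea, representing each statement as a (subject, predicate, object) triple; the example.org URIs here are purely illustrative, not real identifiers:

```python
# Each fact is a (subject, predicate, object) triple; subjects and objects
# are URIs, so that machines can follow the links between entities.
triples = [
    ("http://example.org/person/alice", "worksFor", "http://example.org/company/acme"),
    ("http://example.org/actor/bob", "appearsIn", "http://example.org/film/noir"),
    ("http://example.org/company/acme", "basedIn", "http://example.org/place/nyc"),
]

def describe(entity, graph):
    """Return every (predicate, object) pair recorded for an entity."""
    return [(p, o) for s, p, o in graph if s == entity]

print(describe("http://example.org/person/alice", triples))
```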
Inexpensive sensors are rapidly moving
us towards an Internet of Things.
[Figure: “Physical World: We are here!” http://tinyurl.com/aoev3x9]
Some predict that the Internet of Things will soon
produce a massive volume and variety of data at
unprecedented velocity. http://tinyurl.com/ahytzdf
Welcome to the new information age
Morgan Stanley predicts the following data applications will grow fastest in 2013:
1. Healthcare
2. Entertainment
3. Communications/Media
4. Manufacturing
5. Financial
Inexpensive ($100-200) devices like Fitbit already track your daily physical activity and post it to a web page.
American Society of Clinical Oncology is creating a
database, CancerLinQ, to centralize cancer records so that
Big Data methods can evaluate the effectiveness of
treatments and hasten development of new medicines.
http://tinyurl.com/cnv6wfw
Big Data and the Future of
Health Care
A patient can use a smartphone to do many tests now
done in the laboratory, such as an EKG or glucose test.
The cost of a personal genome is dropping rapidly, and some are predicting a $100 cost soon. Knowing an individual’s genome should allow treatment to be customized to the individual.
A recent report says that Big Data could save as much as $450 billion in health care costs, but the AMA says that current electronic health record systems lack the sophistication to manage the storage and retrieval of big data.
http://tinyurl.com/cln8vf9
Ever since the 1970s, the
roughly 25,000 families who
create the Nielsen ratings
have determined what TV
shows survive and what ad
rates will apply.
In Nov. 2012, Nielsen purchased SocialGuide, which measures the “social impact” of TV, and announced it was partnering with Twitter.
Now Twitter has
purchased Bluefin
Labs, a social-TV
analytics company.
As more and more people TiVoed, Nielsen created the C3 rating in 2007 and recently added the C7 rating, which measures how many people viewed a show within seven days after it originally aired.
Now advertisers want to know
not just if people watched, but
if they were “engaged.”
Wired, April 2013, 92-94
Big Data and Big Science
“The Fourth Paradigm”
The University of California,
San Diego, is building a "big
data freeway system" for
science projects in "genomic
sequencing, climate science,
electron microscopy,
oceanography and physics."
The Square Kilometre Array (SKA) under development
in Australia and South Africa will collect one exabyte of
data per day from 36 small antennas spread over more
than 3000 km to simulate a single giant radio telescope.
Bradley Voytek (Big Data, location 1186) argues that Big Data analysis allows researchers to identify patterns that were previously invisible, making it possible to automate critical aspects of the scientific method itself.
The 17-mile-long Large Hadron Collider at CERN produces so much data that scientists must discard most of it, hoping they haven’t thrown away anything useful.
Weather prediction combines
data from multiple earth
satellites with massive
computing power.
Most of the satellites belong to
the U.S., but the Europeans
have more powerful
computers.
Our weather satellites are old.
http://tinyurl.com/cvpz5qe
Search engines, like Yahoo!
and Google, were the first
companies to work with
datasets that were too large
for conventional methods.
(According to Big Data 2 location 177, Google
has over a million servers.)
To power its searches, Google developed a strategy called MapReduce: a task is mapped onto a multitude of processors, and the partial results are then reduced (combined) into a final answer.
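A toy sketch of the map/reduce pattern using the classic word-count example (in a real deployment, each map call would run on a different machine):

```python
from collections import defaultdict

def map_phase(document):
    # The map step emits a (word, 1) pair for every word; it is trivially parallel.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # The reduce step combines all partial counts for each word into one total.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["big data big ideas", "big queries"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 1, 'ideas': 1, 'queries': 1}
```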
Traditional data warehouses use relational databases (think rows and columns, as in Excel); search engines need to handle non-relational databases, sometimes called NoSQL.
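To illustrate the difference, here is a hypothetical record in both styles: a relational row has a fixed set of columns, while a NoSQL-style document can vary and nest freely:

```python
# One relational row: fixed columns, identical shape for every record.
row = ("u042", "Ada", "ada@example.com")

# One NoSQL-style document: fields vary per record and can nest.
doc = {
    "id": "u042",
    "name": "Ada",
    "likes": ["chess", "data"],        # a nested list is awkward in a flat row
    "last_search": {"q": "big data"},  # so is a nested object
}
print(doc["likes"])  # ['chess', 'data']
```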
The most popular software for querying NoSQL databases is called Hadoop, and several different versions are available as open source.
(Hadoop is named after the toy elephant of creator Doug Cutting’s son.)
Hadoop is designed to collect data, even if it doesn’t fit
nicely into tables, distribute a query across a large
number of separate processors, and then combine the
results into a single answer set in order to deliver
results in almost real time.
Hadoop is often paired with machine learning applications, such as recommendation engines, classification, error detection, or facial recognition.
Personal Aside: I suggest that Google Analytics might
be the best way to introduce students to crafting a
query for a Big Data exercise.
We are in the Golden Age of Data
Visualization
http://tinyurl.com/beqxuyl
A streamgraph of the conversation around a brand.
It can be difficult to classify a tweet as pro or con.
Don’t appoint a #fracking proponent to lead the
Dept. of Energy. (often RT)
Broome County Executive took $82,428 in pro-#fracking campaign contributions.
The only politician I know that has backed up
his promise to address #fracking is Gov.
Cuomo.
Natural gas is neither perfect nor perfectly evil.
Businesses surprised to see their names on
#fracking petition.
Mmmm @fracking fluid bit.ly/ZmQVQj
Amazon’s Mechanical Turk is one way to
manually create a standard data template.
https://www.mturk.com/mturk/welcome
Scraping, processing,
and buying Big Data
One can only copyright a specific arrangement of the
data, but the metadata is often extremely important and
may not follow the scrape.
Companies like Infochimps are scraping and cleaning selected data from Twitter and then selling access to these datasets.
Pete Warden reports that it cost him only $120 to gather, analyze, and visualize 220 million public Facebook profiles (http://tinyurl.com/yb2q3dv), and that 80legs let him download a million web pages for about $2.20.
Problems with Big Data –
Selection bias
(http://tinyurl.com/bdb7wgy)
Selection bias occurs when the individuals or groups who take part in a study don’t represent the general population.
According to a recent Pew survey, Twitter users are
younger than the general public and more likely to lean
toward the Democratic Party. Twitter reactions are
often at odds (six out of eight times) with overall public
opinion. http://tinyurl.com/cvqq5hz
A recent article in EPJ Data Science says Twitter is actually made up of modern-day tribes: groups of people who share a distinctive vocabulary and are connected by a character, occupation, or interest.
http://preview.tinyurl.com/bc8gecu
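A tiny simulation with entirely made-up support rates shows how a young-skewed sample shifts an estimate away from the population average:

```python
# Hypothetical support rates for some policy, by age group: (rate, count).
population = {"young": (0.70, 300), "older": (0.40, 700)}
skewed_sample = {"young": (0.70, 800), "older": (0.40, 200)}  # Twitter-like skew

def mean_support(groups):
    # Weighted average of each group's support rate.
    total = sum(n for _, n in groups.values())
    return sum(rate * n for rate, n in groups.values()) / total

print(round(mean_support(population), 2))     # 0.49  (true population average)
print(round(mean_support(skewed_sample), 2))  # 0.64  (young-skewed estimate)
```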
[Figure: The partition of English-speaking Twitter users into communities, annotated with words typical of those often used by members of each community. http://tinyurl.com/bmb9r9e]
Problems with Big Data –
Misclassification bias
(http://tinyurl.com/bdb7wgy)
Misclassification occurs when either the cause or the
effect is not accurately recognized.
What if a response is not correctly identified as the customer intended? This is especially likely when subjective interpretation is required to classify an answer. http://tinyurl.com/cvqq5hz
Remember: Correlation does not imply causation.
Comments on Big Data for education
The information in LMSs like Moodle or ANGEL (e.g., time on task, number of logins, number of discussion posts) is well structured and is already being analyzed.
In MOOCs, students interact entirely online, leaving
behind a record of every page they visited.
The Gates Foundation recently gave $100 million to inBloom to improve ways to transfer information among the many technology silos where student records are currently stored. Nine states (including New York) are participating in this pilot project and plan to offer third-party vendors access to student data (without student or parental consent).
http://tinyurl.com/b8e4whh
Unanswered questions include security, who owns
learner produced data (TurnItIn?), who owns the data
analysis, and what will be shared with the students?
E-Textbooks can now
track student use patterns.
Image from the blog Electric Venom:
Motherhood, mid-life crisis, martinis.
Instructors who use an e-text from CourseSmart receive information about each student showing how much of the book she is reading, what pages she skips, how much she highlights, and whether she is taking notes. See NY Times, April 9, 2013, http://tinyurl.com/d7dfob4
Students who take notes with pen
and paper may be penalized,
even if they are doing well in the
course.
If it can be measured, some
teachers will grade it!
Some Learning Management Systems already display a “dashboard” of a student’s performance:
Attendance: Danger
Homework: Needs improvement
Time on LMS: Good
Math SAT score: 390 (You need extra work in math-intensive courses)
Avg. grade of students like you who took this course: C
Avg. grade of students like you who had this instructor: D
Your grade in prerequisite courses: Precalculus: D
Dwell time is how long a
student spent on a given
activity.
Chico State
University (CA) has
been using learning
analytics to study
how student
achievement is
related to (LMS) use
and student
characteristics.
http://tinyurl.com/acz5f7o
The project merges LMS data with student characteristics and course performance from the campus database.
The study reports a direct positive relationship between LMS usage and the student’s final grade.
Voytek’s Third Law: Any sufficiently advanced statistics
can trick people into believing the results reflect truth.
Problems with Big
Data – Confounding
bias
(http://tinyurl.com/bdb7wgy)
Confounding occurs when there is a failure to control
for some other factor that is affecting the outcome.
How do other factors, like attendance, reading the
textbook, attending extra help sessions, etc. relate to
the time spent on the LMS (and so the course grade)?
Do the Dwell Times really make sense?
The article notes that, “No individually identifiable
information is included in the data files.”
Really???
Problems with Big Data –
Privacy
A sensible back-up strategy may create more than 100 copies of the data. How can organizations protect all of these copies from hackers?
This can make it hard to protect individual privacy, and
recent experiences suggest that there are now so many
public datasets available for cross-referencing that it is
difficult to assure that any Big Data records can be kept
private.
In a number of cases, information from “anonymous
studies” has been tracked back to identify individuals
and even their families.
“Just as Ford changed the way we make cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community.”
danah boyd (http://tinyurl.com/bdb7wgy)
“And finally, how will the harvesting of Big Data change
the meaning of learning and what new possibilities and
limitations may come from these systems of knowing?”
Ira "Gus" Hunt, chief technology officer for the Central
Intelligence Agency, said, "The value of any piece of
information is only known when you can connect it with
something else that arrives at a future point in time."
Thank you for
listening.
Any questions?