• Today’s class
– Big data
– Data mining, machine learning
– With applications
The era of big data
• 2.5 exabytes of information are generated every day
– 1 exabyte = a billion gigabytes = 10^18 bytes
• This amounts to nearly 1 zettabyte (1000 exabytes) per year
• This number is doubling every 40 months
• Tools are continually being developed to use this data
Some examples of data that is out there
• Locations of all Starbucks
• What can we observe?
A residential building at 73 Payson
Avenue in Inwood is as far from
Starbucks as they can possibly be in
Manhattan (1.36 miles)
The farthest residential address
from Starbucks south of 96th St. is
at 140 Baruch Place on the Lower
East Side (0.93 miles)
Some other examples
A situation of concern
Excerpts from the article
Dr. Janese Trimaldi, 40, recently completed her residency in Tampa.
On a terrifying evening in July 2011, she says, she locked herself in
her bedroom to hide from a drunken, belligerent boyfriend. He went
into the kitchen, retrieved a steak knife and jimmied open the door.
The screams and commotion caused a neighbor to call the police. The
boyfriend contended that a bleeding scratch on his chest had been
inflicted by Dr. Trimaldi with the knife. (It was from one of her
fingernails, she says.) She was arrested and charged with aggravated
assault with a deadly weapon and battery domestic violence.
The state dropped the charges. A few months later, her booking
photograph turned up on a Florida mug-shot Web site and with it
another mug shot from a 1996 arrest on an accusation of possession of
marijuana and steroids. The authorities had raided her apartment on
suspicion that a different boyfriend — this one a bodybuilder — was
illegally selling the steroids. Records show that she was quickly
released, and a certificate of disposition from the 13th Judicial Circuit
of Florida shows that she was not prosecuted for either charge.
She paid $30 to have the images taken down, but they soon appeared
on other sites, one of which wanted $400 to pull the picture.
Publishing mug shots
This information was available about
Janese Trimaldi
This information may be more
What is the real story?
Details in the article
In March last year, a college freshman named Maxwell
Birnbaum was riding in a van filled with friends from
Austin, Tex., to a spring-break rental house in Gulf Shores,
Ala. As they neared their destination, the police pulled the
van over, citing a faulty taillight. When an officer asked if
he could search the vehicle, the driver — a fraternity
brother of Mr. Birnbaum’s who quickly regretted his
decision — said yes.
Six Ecstasy pills were found in Mr. Birnbaum’s knapsack,
and he was handcuffed and placed under arrest. Mr.
Birnbaum later agreed to enter a multiyear, pretrial
diversion program that has involved counseling and drug
tests, as well as visits to Alabama every six months to
update a judge on his progress.
But once he is done, Mr. Birnbaum’s record will be clean.
Which means that by the time he graduates from the
University of Texas at Austin, he can start his working life
without taint.
Has it worked for him?
Ad placement
Who does the tracking?
Drawbridge is one of several start-ups that have figured out how to
follow people without cookies, and to determine that a cellphone,
work computer, home computer and tablet belong to the same
person, even if the devices are in no way connected. Before, logging
onto a new device presented advertisers with a clean slate.
Let’s visit Drawbridge, which is at http://www.drawbrid.ge/
How do they do it?
.ge is the country code top-level domain (ccTLD) for Georgia.
.ge top-level domain names are available for registration for
residents of Georgia (unlimited) or for foreign companies via
representation of any local legal person (limited to one domain name
per registrant). Second-level domain names are also available for
registration for several specific types of registrants: (from
How can we use the data out there?
• In 2013, the National Security Agency confirmed it had collected
data from cellphone towers in 2010 and 2011 to locate
Americans’ cellphones, though it said it never used the
information. (further from the NYTimes article)
• Testifying before a hearing of the U.S. Senate Judiciary
Committee on Tuesday, Oct. 1, Ed Felten said that searching for
patterns in large collections of data, called metadata, "can now
reveal startling insights about the behavior of individuals or
…[he] said that merely by combining their analysis of phone
records with call times and durations, investigators can learn
about people's work, social habits, religion and political
• Want to do it yourself?
Interesting displays of data
The Selection of Majors by the Classes of 2000-2003
(Net Movement of More Than 20 Students)
[Figure: diagram of intended versus actual majors. Each department group is
labeled with its Intended, Actual and Unchanged student counts, for example
Mechanical, Civil & Chemical Engineering (Intended: 376, Actual: 378,
Unchanged: 167 (44%)) and Languages and Literatures (Intended: 223,
Actual: 256, Unchanged: 77 (35%)). A legend keys net movement of 40, 80, 120
and 160 students. Department groups shown: Mechanical, Civil & Chemical
Engineering; Languages and Literatures; Music / Art & Arch.; Religion /
Philosophy; Computer Science / Electrical Engineering; Operations Research;
Anthropology / Sociology; Physical Sciences / Math; History / Politics;
Biology / Chemistry; Woodrow Wilson School.]
Example applications of big data
• Searches for X usually lead to website Y
• Purchasers of X often browse for Y
• X is rated as a good seller
• When stock X goes up, so does stock Y
• Symptoms X and Y typically indicate disease Z
• Mutations on Chromosome Y indicate a tendency towards Z
• Those who buy X and Y are likely to perform act Z
Application areas
• Searches for X usually lead to website Y
– Search engines (Google)
• Purchasers of X often browse for Y
– Recommendation systems (Amazon, Netflix)
• X is rated as a good seller
– Rating systems (eBay, Amazon)
• When stock X goes up, so does stock Y
– Quantitative stock trading (Renaissance Technologies, D. E.
Shaw, …)
• Symptoms X and Y typically indicate disease Z
– Disease diagnosis
• Mutations on Chromosome Y indicate a tendency towards Z
– Human Genome Project and genomics
• Those who buy X and Y are likely to perform act Z
– Total Information Awareness (DoD)
• Who knows who
– People near each other have similar characteristics
• How important are the people you know
– This tells your importance
• How can we cluster data observations
– Decimation
• How can we separate the YES’s from the NO’s
– Linear decision trees
• Merging information from multiple sources
– Consistency checking
– Building complete profiles
Who knows who (sidebar)
• Some experiments
How does Google work?
• 3 components
– A web crawler
– An indexer
– A query processor
How Google works (cont.)
• Crawl the web starting from a web page
– Foreach web page reached
Follow its hyperlinks to reach more web pages
Retrieve the contents of each page
Give each page a number
• Index the pages reached
– Foreach page reached
Record the words that occur there
– Reverse index so that every word lists the pages on
which it occurs
• A search goes through this index
– Foreach word
Find its pages
– Merge page lists to find those pages that list all words
• List pages by page rank
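The three components above can be sketched on a toy scale. This is a minimal illustration, not how Google is implemented: the in-memory `TOY_WEB` dictionary, its `.example` URLs and the function names are all assumptions made up for the example.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("big data tools", ["b.example", "c.example"]),
    "b.example": ("data mining tools", ["a.example"]),
    "c.example": ("machine learning with big data", ["b.example"]),
    "d.example": ("unreachable page", []),
}

def crawl(web, start):
    """Follow hyperlinks breadth-first; return {page number: (url, text)}."""
    numbered, seen, queue = {}, {start}, deque([start])
    while queue:
        url = queue.popleft()
        text, links = web[url]                  # retrieve the page contents
        numbered[len(numbered)] = (url, text)   # give each page a number
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return numbered

def build_index(pages):
    """Record the words on each page, then reverse: word -> page numbers."""
    forward = {num: set(text.split()) for num, (_, text) in pages.items()}
    inverted = defaultdict(set)
    for num, words in forward.items():
        for word in words:
            inverted[word].add(num)
    return inverted

def search(inverted, query):
    """Merge the page lists: keep only pages that contain every query word."""
    page_lists = [inverted.get(word, set()) for word in query.split()]
    return set.intersection(*page_lists) if page_lists else set()

pages = crawl(TOY_WEB, "a.example")
index = build_index(pages)
print(search(index, "big data"))
```

The final step on the slide, ordering the matching pages by page rank, is left out here; the sketch only returns the unordered set of matching page numbers.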
Making Google work efficiently
• Crawl the web starting from a web page
– Foreach web page reached (a separate computer for each page)
Follow its hyperlinks to reach more web pages
Retrieve the contents of each page
Give each page a number
• Index the pages reached
– Foreach page reached
Record the words that occur there (several words per computer)
– Reverse index so that every word lists the pages on which it occurs
• A search goes through this index
– Foreach word
Find its pages (one (or more) computer per word)
– Merge lists to find pages with all words (one or more computers)
• List pages by page rank
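One way to picture "one (or more) computer per word" is to split the reverse index into shards and look up the query words concurrently. The sketch below assumes a hypothetical `NUM_SHARDS` of 4 and uses threads on one machine as a stand-in for separate computers; it is not a description of Google's actual infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SHARDS = 4  # hypothetical number of index machines

def shard_of(word):
    """Stable assignment of a word to a shard."""
    return crc32(word.encode("utf-8")) % NUM_SHARDS

def build_shards(inverted_index):
    """Split a word -> pages index so each shard holds several words."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for word, page_numbers in inverted_index.items():
        shards[shard_of(word)][word] = set(page_numbers)
    return shards

def lookup(shards, word):
    """Ask the one shard responsible for this word for its page list."""
    return shards[shard_of(word)].get(word, set())

def parallel_search(shards, query):
    """Look up every query word concurrently, then merge the page lists."""
    words = query.lower().split()
    if not words:
        return set()
    with ThreadPoolExecutor(max_workers=len(words)) as pool:
        page_lists = list(pool.map(lambda w: lookup(shards, w), words))
    return set.intersection(*page_lists)

# Made-up index for illustration only.
EXAMPLE_INDEX = {"big": {0, 1}, "data": {0, 2}, "mining": {2}}
EXAMPLE_SHARDS = build_shards(EXAMPLE_INDEX)
print(parallel_search(EXAMPLE_SHARDS, "big data"))
```

The merge step is the only place where results from different shards meet, which is why the slide assigns it its own computer or computers.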
Page Rank
• Factors
Number of links
Importance of links
Proximity of words
Position of words (e.g. in title)
Frequency of words on page
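The first two factors, number of links and importance of links, are what the published PageRank idea captures: a page is important if important pages link to it. The sketch below is the textbook power-iteration form with an assumed damping factor of 0.85 on a made-up four-page link graph; it ignores the word-based factors and should not be read as Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: nobody links to "d", three pages link to "c".
EXAMPLE_LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(EXAMPLE_LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this graph "c" ends up ranked highest because it has the most incoming links, and "a" comes second even though only one page links to it, because that one page is the important "c".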
The actual search done by Google
From http://www.googleguide.com/google_works.html
Processing by Google
• More than a million CPUs are in Google data centers
• They are in data centers spread around the globe
– It costs $500M-$1B to build a data center
• Where are the Google processors located?
How does Google decide where to locate data centers?
The availability of large volumes of cheap electricity to power the data centers.
Google’s commitment to carbon neutrality, which has sharpened its focus on
renewable power sources such as wind power and hydro power. The Dalles
was chosen primarily for the availability of hydro power from the Columbia
River, while the local utility’s wind power program influenced the selection of
Council Bluffs, Iowa.
The presence of a large supply of water to support the chillers and water
towers used to cool Google’s data centers. A number of recent Google data
center sites have been next to rivers or lakes.
Large parcels of land, which allow for large buffer zones between the data
center and nearby roads. This makes the facilities easier to secure, and is
consistent with Google’s focus on data center secrecy. Google purchased 215
acres in Lenoir, 520 acres for the Goose Creek project, 800 acres of land in
Pryor, and more than 1,200 acres in Council Bluffs. The extra land may also
be used for building windmill farms to provide supplemental power at some
Distance to other Google data centers. Google needs lightning-fast response
time for its searches, and prizes fast connections between its data centers.
While big pipes can help address this requirement, some observers believe
Google carefully spaces its data centers to preserve low latency in
connections between facilities.
Tax incentives. Legislators in North Carolina, South Carolina, Oklahoma and
Iowa have all passed measures to provide tax relief to Google.
From http://www.datacenterknowledge.com/google-data-center-faq-part-2/
A common technique
• We want a machine to learn data (machine learning)
• and make decisions (classification theory)
• Often, our methods of doing so have no connection to previous
ways of thinking about the problem
A sample data set
• Wisconsin breast cancer data set
1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: 2 for benign, 4 for malignant
• The data set consists of 10-tuples (the numeric values of
attributes 1-10 above), each with a class label (B or M,
attribute 11 above)
• We can then ask questions like
– Is 3*#2 - 4*#3 + 6*#5 > 12? (Such a test may bear no relation to reality)
– Various mathematical techniques are used to design tests
– We study tests to see which classify best
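The example test above can be evaluated mechanically on records in this format. In the sketch below the four records are made up for illustration and are NOT real patient data, and the helper names are assumptions. The point is only that a linear test is a few multiplications and one comparison, and that its quality is measured by how often it agrees with the class label.

```python
def attribute(record, number):
    """Attribute numbers on the slide are 1-based; tuples are 0-based."""
    return record[number - 1]

def linear_test(record):
    """The slide's example test: is 3*#2 - 4*#3 + 6*#5 > 12?"""
    return (3 * attribute(record, 2)
            - 4 * attribute(record, 3)
            + 6 * attribute(record, 5)) > 12

def accuracy(records, test, positive_class=4):
    """Fraction of records where the test agrees with 'malignant' (class 4)."""
    hits = sum(test(r) == (attribute(r, 11) == positive_class) for r in records)
    return hits / len(records)

# Made-up records in the data set's 11-field format (illustration only).
SAMPLE_RECORDS = [
    (1, 5, 1, 1, 1, 2, 1, 3, 1, 1, 2),
    (2, 8, 10, 10, 8, 7, 10, 9, 7, 1, 4),
    (3, 3, 1, 1, 1, 2, 2, 3, 1, 1, 2),
    (4, 7, 4, 6, 4, 6, 1, 4, 3, 1, 4),
]

print(accuracy(SAMPLE_RECORDS, linear_test))
```

On these made-up records the test misclassifies the first (benign) record, which is the sense in which an arbitrary linear test may "bear no relation to reality" and why many candidate tests are compared.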
How we might do a classification task
Goal: Devise a set of linear
tests that separate the
circles from the squares
Find those points above the solid line and also
to the left of the dotted line.
Add those points above the dashed line.
A new point can be classified as circle or square by these tests
Alternative classifier
Find those points above the solid line
Add those points above the dashed line and
also to the left of the dotted line.
A new point can be classified as circle or square by these tests
Green point is
classified as square
Green point is
classified as circle
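The two classifiers can be written as combinations of the same three linear tests. The figure's line equations are not given in the text, so the lines below (solid y = x, dotted x = 3, dashed y = 10 - x) and the choice that selected points are called "circle" are assumptions made only to show how two reasonable rule sets can disagree on a new point.

```python
# Hypothetical line positions; the slide's figure does not give equations.
def above_solid(point):
    x, y = point
    return y > x            # solid line: y = x

def left_of_dotted(point):
    x, _ = point
    return x < 3            # dotted line: x = 3

def above_dashed(point):
    x, y = point
    return y > 10 - x       # dashed line: y = 10 - x

def classifier_a(point):
    """(Above solid AND left of dotted), plus everything above dashed."""
    selected = (above_solid(point) and left_of_dotted(point)) or above_dashed(point)
    return "circle" if selected else "square"

def classifier_b(point):
    """Above solid, plus (above dashed AND left of dotted)."""
    selected = above_solid(point) or (above_dashed(point) and left_of_dotted(point))
    return "circle" if selected else "square"

green_point = (4, 5)  # hypothetical new point
print(classifier_a(green_point), classifier_b(green_point))
```

Both classifiers agree on points such as (1, 2) and (5, 1), yet they label the hypothetical green point differently, which is the discrepancy the next slide sets out to resolve.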
Resolving discrepancies
An algorithm for building a better classifier
1. Find many classifiers which work for a portion of your test set
2. For each new point in your test set, determine which classifiers
work and which don’t
3. Combine the results to determine the vitality of each classifier
and weight them accordingly.
The algorithm is said to learn the weights by many
repetitions of steps 2 and 3.
Algorithms of this type are said to use machine
learning and emulate how humans learn
Find a hypothesis (or set of hypotheses)
Use observations to refine the hypotheses and match them to
special circumstances
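The slides do not name a specific weighting scheme for the classifier-combining algorithm. One simple, well-known choice is a weighted majority vote with multiplicative updates: every classifier starts with weight 1 and is multiplied by a penalty each time it misclassifies a point. The sketch below uses that scheme as an assumption, with made-up one-dimensional classifiers and data.

```python
def learn_weights(classifiers, labeled_points, penalty=0.5, rounds=3):
    """Repeatedly check each classifier on each point and shrink the
    weight of any classifier that gets the point wrong."""
    weights = [1.0] * len(classifiers)
    for _ in range(rounds):
        for point, label in labeled_points:
            for i, classify in enumerate(classifiers):
                if classify(point) != label:
                    weights[i] *= penalty
    return weights

def weighted_vote(classifiers, weights, point):
    """Each classifier votes yes (+weight) or no (-weight)."""
    score = sum(w if classify(point) else -w
                for classify, w in zip(classifiers, weights))
    return score > 0

# Hypothetical classifiers on a single number; the true rule is x > 0.
CLASSIFIERS = [
    lambda x: x > 0,   # always right on the data below
    lambda x: x > 5,   # wrong for small positive values
    lambda x: True,    # wrong for every negative value
]
LABELED = [(-2, False), (-1, False), (1, True), (3, True), (7, True)]

WEIGHTS = learn_weights(CLASSIFIERS, LABELED)
print(WEIGHTS, weighted_vote(CLASSIFIERS, WEIGHTS, 2))
```

After a few rounds the unreliable classifiers carry almost no weight, so the combined vote follows the classifier that has matched the observations.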
The Netflix challenge (begun in 2006)
• A $1M prize offered to the research team that could
best predict the grades users would give to movies
• Netflix provided a training data set of 100,480,507
ratings that 480,189 users gave to 17,770 movies (as
<user, movie, date of grade, grade>).
• The qualifying data set contained 2,817,131
entries of the form <user, movie, date of grade>,
with grades known only to the jury.
• To win the grand prize, a team had to beat Netflix’s own
algorithm, Cinematch, by 10% (measured by RMSE, the
root-mean-square error of the predicted grades). Each year, a
progress prize would be awarded if no team had achieved that goal.
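The 10% target is easier to follow with the error measure written out. The sketch below computes RMSE and the relative improvement over a baseline. The baseline value 0.9514 is the widely reported Cinematch score on the quiz set; it does not appear in these slides, so treat it as an assumption.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))

def improvement(baseline_rmse, new_rmse):
    """Relative reduction in RMSE; 0.10 means a 10% improvement."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# Assumption: widely reported Cinematch RMSE, not taken from the slides.
ASSUMED_CINEMATCH_RMSE = 0.9514

print(round(improvement(ASSUMED_CINEMATCH_RMSE, 0.8712), 4))
```

Under that assumed baseline, an RMSE of 0.8712 is an improvement of roughly 8.4%, short of the 10% needed for the grand prize.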
The winners
• BellKor
– Robert M. Bell, Yehuda Koren and Chris Volinsky from AT&T Labs
• How they did it
– Predictive accuracy is substantially improved when blending
multiple predictors. Our experience is that most efforts should
be concentrated in deriving substantially different approaches,
rather than refining a single technique. Consequently, our solution
is an ensemble of many methods.
– Below, we list all 107 results that were blended to deliver
RMSE=0.8712, with their weights within the ensemble.
From http://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf
Did they really need 107 results?
• For completeness, we listed all 107 results that were blended in
our RMSE=0.8712 submission. It is important to note that the
major reason for using all these results was convenience, as we
anyway accumulated them during the year. This is the nature of
an ongoing competition. However, in hindsight, we would probably
drop many of these results, and recompute some in a different
way. We believe that far fewer results are really necessary. For
example, based on just three results one can breach the
RMSE=0.8800 barrier: A blend of #8, #38, and #92, with
weights 0.1893, 0.4225, and 0.4441, respectively, would already
achieve a solution with an RMSE of 0.8793.
• Similarly, combining #8, #38, and #64 yields RMSE=0.8798.
Notice that these combinations touch the main approaches (kNN, factorization, RBMs and asymmetric factor models). In
addition, we have found that at most 11 results suffice for
achieving a blend with above 8% improvement over Cinematch
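A blend in this sense is a weighted sum of the individual predictors' outputs. The sketch below applies the three weights quoted above to made-up predictions for two ratings; the prediction values are assumptions, and how the team fit the weights is not shown. The quoted weights sum to about 1.056 rather than 1, so the sketch does not normalize them.

```python
def blend(predictions, weights):
    """Weighted sum of several predictors' outputs, rating by rating.

    predictions holds one list of predicted grades per predictor.
    """
    if len(predictions) != len(weights):
        raise ValueError("need one weight per predictor")
    return [sum(w * p for w, p in zip(weights, per_rating))
            for per_rating in zip(*predictions)]

# Weights quoted on the slide for results #8, #38 and #92.
QUOTED_WEIGHTS = [0.1893, 0.4225, 0.4441]

# Made-up predictions from three hypothetical predictors for two ratings.
EXAMPLE_PREDICTIONS = [
    [3.0, 4.0],
    [3.5, 4.5],
    [2.5, 4.0],
]

BLENDED = blend(EXAMPLE_PREDICTIONS, QUOTED_WEIGHTS)
print(BLENDED)
```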
Encore – the next Netflix challenge
• From Neil Hunt, Chief Product Officer for Netflix. About
five months ago we announced that Netflix would sponsor a
sequel to the Netflix Prize. (March 12, 2010)
• In the past few months, the Federal Trade Commission
(FTC) asked us how a Netflix Prize sequel might affect
Netflix members' privacy, and a lawsuit was filed by
KamberLaw LLC pertaining to the sequel. With both the
FTC and the plaintiffs' lawyers, we've had very productive
discussions centered on our commitment to protecting our
members' privacy
• In light of all this, we have decided to not pursue the
Netflix Prize sequel that we announced on August 6, 2009.
From: http://blog.netflix.com/2010/03/this-is-neil-hunt-chief-product-officer.html