More Trumps Better


Critical Thinking and Argumentation in Software Engineering, CT60A7000

Big Data

Chapter 3: Messy

Behnaz Norouzi

Francis Matheri

Increasing the volume of data opens the door to inexactitude.

One of the fundamental shifts in moving from small data to big data:

Considering inexactitude unavoidable and learning to live with it, instead of treating errors as problems and trying to get rid of them.

In the world of small data, reducing errors and ensuring high-quality data are essential.

In the world of sampling, the obsession with exactitude was even more critical.

In the middle of the nineteenth century, the quest for exactitude began in Europe.

If one could measure a phenomenon, the implicit belief was, one could understand it.

Later, measurement was tied to the scientific method of observation and explanation:

Lord Kelvin: “To measure is to know.”

Francis Bacon: “Knowledge is power.”

In the nineteenth century, France developed a precise system to capture space, time, and more.

Half a century later, the discovery of quantum mechanics shattered forever the dream of comprehensive and perfect measurement.

In many new situations, allowing for messiness may be a positive feature, not a shortcoming.

A tradeoff: in return for allowing errors, one can get ahold of much more data.

It isn’t just that “more trumps some”; sometimes “more trumps better”.

The likelihood of errors increases as you add more data points.

Messiness itself is messy. It can arise when we extract or process data, since in doing so we are transforming it, turning it into something else, such as when we perform sentiment analysis on Twitter messages to predict Hollywood box-office receipts.

Example → measuring the temperature in a vineyard:

If we have only one temperature sensor for the whole plot of land, we must make sure it is accurate: no messiness allowed.

If we have a sensor for every hundred vines, we can use cheaper sensors: messiness is allowed.

It is again a tradeoff: we sacrifice the accuracy of each data point for breadth, and in return we receive details that we otherwise could not have seen.
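A minimal sketch (in Python, with made-up temperatures) of the statistical intuition behind this tradeoff: individually noisy readings from many cheap sensors average out to an accurate picture, and, unlike a single precise sensor, they also reveal variation across the plot.

```python
import random

random.seed(42)

# Hypothetical setup: 100 blocks of vines, each with its own true temperature.
true_temps = [20.0 + 0.03 * i for i in range(100)]   # a gentle gradient across the plot

# One accurate sensor (error ~0.1 deg C) placed in the middle of the vineyard:
single_reading = true_temps[50] + random.gauss(0, 0.1)

# One cheap, noisy sensor (error ~1.0 deg C) per block of a hundred vines:
noisy_readings = [t + random.gauss(0, 1.0) for t in true_temps]

# The average of many noisy readings is close to the true plot-wide average,
# because independent errors tend to cancel as the number of sensors grows...
plot_avg_true = sum(true_temps) / len(true_temps)
plot_avg_noisy = sum(noisy_readings) / len(noisy_readings)
print(f"true plot average:      {plot_avg_true:.2f}")
print(f"noisy sensors' average: {plot_avg_noisy:.2f}")
print(f"single accurate sensor: {single_reading:.2f}")

# ...and, unlike the single sensor, the noisy mesh still shows the gradient:
cool_end = sum(noisy_readings[:10]) / 10
warm_end = sum(noisy_readings[-10:]) / 10
print(f"cool end vs warm end (noisy mesh): {cool_end:.2f} vs {warm_end:.2f}")
```

Each individual reading is worse, but the aggregate picture is both accurate and far more detailed.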

Big data transforms figures into something more probabilistic than precise.

More data → improvements in computing.

Example → chess algorithms, using N = all (every position, rather than a sample).

Banko and Brill: “We want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development.”

What is the story behind this claim?

The result → more data, better performance: simple algorithms fed with more data outperformed more sophisticated algorithms trained on less.
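To see why, here is a hedged toy reconstruction of the kind of experiment Banko and Brill describe, not their actual setup: a deliberately naive learner that simply memorizes which word fits each context, evaluated on a synthetic confusion-set task. Its accuracy climbs sharply as the training corpus grows, with no change to the algorithm.

```python
import random

random.seed(1)

# Synthetic stand-in for the Banko & Brill task: pick the right word from a
# confusion set given a single context word. The mapping from context word to
# correct choice is fixed but unknown to the learner; labels carry 5% noise.
CHOICES = ["to", "two", "too"]
VOCAB = [f"w{i}" for i in range(2000)]          # 2,000 possible context words
TRUE_RULE = {w: random.choice(CHOICES) for w in VOCAB}

def make_corpus(n):
    """Generate n (context_word, correct_choice) pairs."""
    corpus = []
    for _ in range(n):
        w = random.choice(VOCAB)
        c = TRUE_RULE[w] if random.random() > 0.05 else random.choice(CHOICES)
        corpus.append((w, c))
    return corpus

def train(corpus):
    """A deliberately naive learner: memorize the most frequent choice per context word."""
    counts = {}
    for w, c in corpus:
        counts.setdefault(w, {}).setdefault(c, 0)
        counts[w][c] += 1
    return {w: max(cs, key=cs.get) for w, cs in counts.items()}

def accuracy(model, test):
    # Back off to a default guess for context words never seen in training.
    return sum(model.get(w, CHOICES[0]) == c for w, c in test) / len(test)

test = make_corpus(20_000)
for n in (500, 5_000, 50_000, 500_000):
    model = train(make_corpus(n))
    print(f"corpus size {n:>7}: accuracy {accuracy(model, test):.3f}")
```

With a small corpus most contexts are unseen and the learner guesses; with a large one, brute memorization approaches the noise ceiling.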

Google’s idea → language translation.

The history → in 1954, an IBM computer translated sixty Russian phrases into English.

The problem posed by a committee of machine-translation grandees → translation is not just about memorization and recall; it is about choosing the right words from many alternatives.

A novel idea from IBM researchers → instead of feeding the computer explicit linguistic rules, let it use statistical probabilities to calculate which word or phrase in one language is the most appropriate counterpart in another.
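A toy sketch of that statistical idea, using an invented four-sentence parallel corpus: instead of rules, each English word is paired with the foreign word it is most strongly associated with in aligned sentences. The association score here is a simple Dice coefficient over co-occurrence counts; the real IBM models learned alignment probabilities with far more sophisticated machinery.

```python
from collections import Counter

# A tiny, made-up parallel corpus (English / French), purely for illustration.
parallel = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the car", "la voiture"),
    ("the blue car", "la voiture bleue"),
]

en_count, fr_count, cooc = Counter(), Counter(), Counter()
for en, fr in parallel:
    for e in en.split():
        en_count[e] += 1
    for f in fr.split():
        fr_count[f] += 1
    for e in en.split():
        for f in fr.split():
            cooc[(e, f)] += 1          # count co-occurrence in aligned sentences

def best_translation(e):
    """Pick the foreign word most strongly associated with e (Dice score)."""
    scored = {
        f: 2 * cooc[(e, f)] / (en_count[e] + fr_count[f])
        for f in fr_count if cooc[(e, f)] > 0
    }
    f = max(scored, key=scored.get)
    return f, scored[f]

for word in ("house", "car", "blue"):
    f, score = best_translation(word)
    print(f"{word!r} -> {f!r}  (association {score:.2f})")
```

No grammar rules are written anywhere; the "translation knowledge" is entirely statistics extracted from example data, which is why more (even messier) data keeps improving such systems.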

Google’s mission in 2006 → organize the world’s information and make it universally accessible and useful.

The result → despite the messiness of the input, Google’s translation service works the best, and it is far, far richer.

Why does it work so well? → It was fed more data, not just higher-quality data.

More Trumps Better

Conventional sampling analysts →

Accepting messiness is difficult for them.

They use multiple error-reducing strategies.

The problem →

Such strategies are costly.

Exacting standards of collection are unlikely to be achieved consistently at such a scale.

Moving into a world of big data will require us to change our thinking about the merits of exactitude.

In dealing with ever more comprehensive datasets, we no longer need to worry so much about individual data points biasing the overall analysis.

Take the way sensors are making their way into factories.

Example → at a factory in Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time.

Moving to a large scale changes:

Expectations of precision

Practical ability to achieve exactitude

Technology is imperfect → messiness is a practical reality we must deal with.

To get the inflation number → the Bureau of Labor Statistics employs hundreds of staff to collect price data, at a cost of around $250 million a year.

The problem → by the time the numbers come out, they are already a few weeks old.

What is needed → quicker access to inflation numbers, which conventional sampling-based methods cannot deliver.

Two economists at the Massachusetts Institute of Technology → used software to crawl the web and collect half a million prices of products sold in the U.S. every single day.

The benefit → combining big-data collection with clever analysis detected a deflationary swing in prices immediately after Lehman Brothers filed for bankruptcy in September 2008.
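A hedged sketch of the index-building step only (not the crawling), assuming the crawler has already produced daily (day, product, price) records; the numbers below are invented. It chains a daily index from the geometric mean of day-over-day price relatives, simply skipping products missing on either day, which is part of the messiness this approach tolerates.

```python
import math
from collections import defaultdict

# Hypothetical input: (day, product_id, price) records, as a web crawler
# might produce after scraping online retailers. Figures are made up.
records = [
    (0, "milk", 1.00), (0, "tv", 300.0), (0, "shoes", 50.0),
    (1, "milk", 1.01), (1, "tv", 291.0), (1, "shoes", 49.5),
    (2, "milk", 0.99), (2, "tv", 282.0), (2, "shoes", 48.0),
]

# Group prices by day.
by_day = defaultdict(dict)
for day, product, price in records:
    by_day[day][product] = price

# Chain a daily index from geometric means of day-over-day price relatives,
# using only products observed on both days.
index = [100.0]
days = sorted(by_day)
for prev, cur in zip(days, days[1:]):
    common = by_day[prev].keys() & by_day[cur].keys()
    log_rel = [math.log(by_day[cur][p] / by_day[prev][p]) for p in common]
    index.append(index[-1] * math.exp(sum(log_rel) / len(log_rel)))

for day, value in zip(days, index):
    print(f"day {day}: index {value:.2f}")
```

Because the index is recomputed every day from whatever prices happen to be observable, a sudden swing shows up immediately rather than weeks later.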

Messiness In Action

More and messy over fewer and exact.

Categorizing content → hierarchical systems such as taxonomies and indexes are imperfect.

Photo-sharing site Flickr →

In 2011 it held more than six billion photos from more than 75 million users.

Trying to label each photo according to preset categories would have been impractical at that scale.

Instead, neat preset classifications are replaced by mechanisms that are messier but more flexible: tags.

The imprecision inherent in tagging is about accepting the natural messiness of the world.

Database design →

Traditional databases require highly structured and precise data.

They are good for a world in which data is sparse and can thus be curated carefully.

This view of storage is at odds with reality.

The big shift → NoSQL databases.

They accept data of varying type and size and allow it to be searched successfully.

They require more processing and storage resources in return for permitting structural messiness.
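A toy, in-memory illustration of the schemaless idea in plain Python (not the API of any real NoSQL engine such as MongoDB or CouchDB, which add persistence, indexing, and sharding): records of different shapes go into the same collection and can still be searched by whatever fields they happen to have.

```python
# A minimal document-style store: no fixed schema, searchable by field.
documents = []

def insert(doc):
    """Accept any dict: there is no fixed schema to validate against."""
    documents.append(doc)

def find(**criteria):
    """Return documents whose fields match all given criteria."""
    return [d for d in documents
            if all(d.get(k) == v for k, v in criteria.items())]

# Records of varying shape live happily in the same collection:
insert({"type": "photo", "user": "anna", "tags": ["vineyard", "summer"]})
insert({"type": "sensor", "site": "factory-wa", "temp_c": 21.4})
insert({"type": "photo", "user": "anna"})      # missing fields are fine too

print(find(type="photo", user="anna"))
```

The cost hinted at above is visible even here: every query scans and inspects each record's structure, where a rigid relational schema could rely on fixed columns and indexes.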

Pat Helland: “It’s OK if we have ‘lossy’ answers. That’s frequently what business needs.”

Hadoop → an open-source rival to Google’s MapReduce system.

Why is Hadoop so good at processing large quantities of data?

It takes for granted that the quantity of data is so breathtakingly enormous that it cannot be moved and must be analyzed where it is.
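A minimal map-shuffle-reduce word count in plain Python, illustrating the processing pattern Hadoop runs at cluster scale; the blocks below stand in for file chunks that would live on different nodes, and this is a sketch of the pattern, not the Hadoop API itself.

```python
from collections import defaultdict
from itertools import chain

# Each "block" stands in for a chunk of a huge file stored on a different node;
# in Hadoop the map code is shipped to wherever the block already lives.
blocks = [
    "more trumps better more data",
    "messy data more data",
    "better algorithms less data",
]

def map_phase(block):
    """Map: emit (word, 1) for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key across the map outputs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key (here, sum the counts)."""
    return {key: sum(values) for key, values in grouped.items()}

mapped = chain.from_iterable(map_phase(b) for b in blocks)
counts = reduce_phase(shuffle(mapped))
print(counts)   # e.g. {'more': 3, 'data': 4, ...}
```

Because each map task only ever sees its own block, the work scales out across machines without the data having to travel to a central processor.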

By allowing for imprecision, we open a window into an untapped universe of insights.

In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools.

As big data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before: a sort of N = all of the mind.

Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy.

Big data may require us to change: to become more comfortable with disorder and uncertainty.
