Roy Iacob
Professor Harlynn Ramsey
WRIT 340
4 December 2013
Bio: Roy Iacob is a junior computer science student at the University of Southern California.
After interning at the software startup SET Media, he became deeply interested in the
possibilities of distributed computing.
Keywords: Computer Science, Business, Big Data, Statistics
Suggested Multimedia: GIF animation of Figure 2: a set of ten small integers is divided into
smaller subsets, the largest number in each subset is returned, and the master computer compares
those results to produce the solution.
Abstract: Big Data is transforming the way businesses operate by revealing conclusive patterns
in customer habits and interests. This emerging field is now possible because of improved
hardware and software technologies that process information differently from previous
paradigms. Big Data has become so important that nearly all corporations are expected to adopt
some form of it within the next few years.
Big Data’s Big Effect on Business
If you've ever watched an episode of AMC's hit TV drama "Mad Men," you've seen the
Madison Avenue "ad men" use a well-known market research technique: the focus group. This
study of human behavior gives advertisers insight into their target market's needs and desires by
observing people as they use a product. While countless businesses still value this technique,
more companies now draw on the vast and growing sets of data that modern technology
produces. Big Data is changing the way professionals in nearly every field approach
questions: rather than formulating models and formulas to explain trends, this new field employs
advanced statistics and modern technology to rapidly run tests and analyses on huge stores of
information.
Data Acquisition:
Consider a music store predicting sales for an upcoming album's release. A brick-and-mortar
shop like this struggles to purchase inventory accurately because it must balance carrying
enough stock to meet customers' needs against buying no more than it can sell. With traditional
market analysis, the store might determine its order size by inspecting sales records,
implementing rewards programs to spot purchasing trends, and running surveys that ask
customers questions directly. These methods are costly and rarely produce enough information
to analyze the problem adequately. Big Data, on the other hand, could provide a competing
online store with detailed purchase records, a log of the other albums customers preview, and
statistics on how much time customers spend exploring the online store. This competitor can use
its abundant data to develop algorithms that fundamentally change its marketing and
management strategies [1]. Soon enough, the store can learn about its customers' habits and
generate suggestions that lead to more sales. Big Data gives the online seller an unparalleled
advantage: it helps the store purchase less inventory while providing customers with a more
personalized experience.
The increasing availability of information can be attributed to the pervasive nature of
networked electronics like smartphones. Data scientist Jake Porway asserts, "It used to be top
down, where companies would go out and conduct a survey and collect data. Now we are
walking around with devices that log everything we like, pictures we take, stores we visit. You
don't have to go out and find data, now it is coming to us" [2]. This shift follows from modern
technology becoming ever more embedded in our lives through smartphones, social networks,
and wearable gadgets like Google Glass and smart watches. The integration of devices into more
places than ever drives the continual growth in the amount of information produced.
Volume, Velocity, and Variety:
People commonly confuse Big Data with the basic analytics that business analysts have
long conducted. What makes Big Data different is commonly defined by the "three V's":
volume, velocity, and variety [3]. To truly say you work with Big Data, each of these traits needs
to be firmly rooted in your project.
Volume is the characteristic behind Big Data's name. According to ScienceDaily, 90% of
all online data was generated between 2011 and 2013, amounting to more than 2.7 zettabytes of
stored information. This enormous figure reflects the activity of billions of web-connected users
and embedded computer systems, all saved on vast server farms around the world. Figure 1
offers more comparisons to convey the sheer scale of a zettabyte:
Figure 1: Infographic displaying how much information can be stored in various quantities of data. Source:
Mozy.com
Recent research shows that five billion people are now connected via mobile phones. The
widespread use of these devices to make calls, post statuses, run apps, and surf the web
incidentally leaves behind an inordinate amount of data. To make use of this information,
businesses need better technological paradigms for sorting through it.
Velocity refers to the rapid influx of data needed to warrant the tools of Big Data. A
good portion of information arrives at speeds that outpace older technology's processing
abilities: Walmart, for example, reports handling more than one million customer transactions
per hour [4]. The challenge for businesses is to process this information in real time so that
relevant decisions can be made quickly. A mall on Black Friday could use parking statistics to
add and relocate resources like staff and security where needed. Big Data tools are effective
because they can operate at this high velocity of data transmission.
The final characteristic of Big Data is the variety of its information. Modern data is made up
of new and ever-changing content coming from widespread sources. Businesses can now track
measures like time spent on a web page and clicking habits, or even use facial-recognition
systems in mannequins to gauge customers' reactions. Moreover, new companies like CMO and
Communispace devote their time to capitalizing on alternative forms of information. These
companies quantify emotions and feelings by running language sentiment analysis over public
social media streams like Twitter, gauging consumers' opinions of brands in real time. A simple
version of this, sketched below, would assign a positive or negative value to common words and
use those scores to rate tweets that hashtag a product's name. In these examples, social media
reveals the diversity of new data sources that Big Data must make sense of.
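To make the word-scoring idea concrete, here is a minimal sketch in Python. The tweets, the
hashtag, and the tiny word lists are all hypothetical; a production system would pull live tweets
from Twitter's API and use a far richer sentiment lexicon.

# Toy word-list sentiment scoring for tweets that mention a product hashtag.
# All tweets, words, and the hashtag below are made up for illustration.
POSITIVE = {"love", "great", "awesome", "best"}
NEGATIVE = {"hate", "broken", "terrible", "worst"}

def score_tweet(text):
    # +1 for every positive word, -1 for every negative word in the tweet.
    words = [w.strip(".,!?#") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def brand_sentiment(tweets, hashtag):
    # Average sentiment across the tweets that mention the product's hashtag.
    relevant = [t for t in tweets if hashtag.lower() in t.lower()]
    return sum(score_tweet(t) for t in relevant) / len(relevant) if relevant else 0.0

tweets = [
    "I love the new #AcmeAlbum, best release this year",
    "#AcmeAlbum download is broken, terrible experience",
    "Listening to #AcmeAlbum on repeat",
]
print(brand_sentiment(tweets, "#AcmeAlbum"))  # positive on balance in this toy sample

Even this crude count is enough to show whether mentions of a brand are trending positive or
negative over the course of a day.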
New Technology:
The technology behind Big Data required two crucial advances: accessible cloud
computing and MapReduce with Hadoop. Both made working with Big Data cheap and simple,
and together they are the reason decoding the human genome went from taking ten years down
to seven days [1].
The idea of cloud computing has been around since the 1950s: it refers to the
interconnection of numerous computers over a network so they can work together as a cluster
[5]. A cluster uses many cheaper, mass-produced computers rather than one central computer of
equivalent capability. While the idea is relatively old, it only became ubiquitous recently, when
Amazon made it easily available to the public. In 2006, Amazon noticed a problem companies
had with acquiring computing power. Companies had the option of either paying flat-rate
subscriptions for un-scalable servers or buying in-house servers and hiring Information
Technology specialists to maintain the bulky hardware. Amazon responded by creating a
platform for renting out its servers as a utility, because it rarely used more than 10% of its own
computing power [6]; the company kept that massive spare capacity to stay prepared for demand
spikes such as the month before the winter holidays. This new platform, called Amazon Web
Services, gave people the freedom to pay for exactly as much processing power and storage as
they used, with the option to scale up or down according to their needs. Soon after, the
popularity of cloud computing pushed big companies like Microsoft to create their own ways of
renting out existing server farms. The paradigm shift can be compared to paying a flat rate on
your power bill versus paying for your exact power consumption. Cloud service providers
leveraged economies of scale to lower costs: a gigabyte of server storage dropped from $16 in
2002 to seven cents in 2013 [7]. For organizations like the CIA and Netflix, cloud computing
helped eliminate the overhead of traditional servers, making way for simpler and cheaper Big
Data processing by treating machine work as a utility rather than a subscription service.
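A back-of-the-envelope sketch makes the utility analogy concrete. The prices and the demand
profile below are invented purely for illustration; real cloud and hardware prices vary widely.

# Toy cost comparison: owning fixed servers vs. renting cloud capacity on demand.
# All prices and workload numbers are hypothetical.
OWNED_SERVER_MONTHLY = 1000.0   # assumed flat cost per owned server per month
CLOUD_HOURLY_RATE = 0.50        # assumed rental price per server-hour
HOURS_PER_MONTH = 730

# Hypothetical demand: 2 servers most of the year, 10 servers in the holiday rush.
monthly_demand = [2] * 11 + [10]

# Owning means provisioning for the peak all year long.
owned_cost = max(monthly_demand) * OWNED_SERVER_MONTHLY * 12

# Renting means paying only for the server-hours actually used each month.
cloud_cost = sum(servers * CLOUD_HOURLY_RATE * HOURS_PER_MONTH
                 for servers in monthly_demand)

print(f"own for peak:   ${owned_cost:,.0f}/year")
print(f"rent on demand: ${cloud_cost:,.0f}/year")

Because the owned hardware must be sized for the December peak, most of its capacity sits idle
for eleven months, which is exactly the waste the pay-per-use model removes.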
The other key technology deployed by Big Data is Hadoop with MapReduce. Nearly all
Big Data projects use these tools because of their ability to crunch numbers quickly by
distributing work in parallel. Since most data analysis can be broken down into similarly basic
tasks, a trivial example of what this technology can do is finding the largest number in a set of
billions of numbers (see Figure 2). A master machine can be thought of as a project leader who
maps out subtasks and schedules work for hundreds of worker machines in its cluster. The
worker machines, which are individual computers, then search for the largest number in their
own subsets concurrently, rather than iterating through the entire set and comparing numbers
one by one. Finally, the master reduces the set of workers' solutions to produce a final answer.
Figure 2: Flow chart of MapReduce problem of searching for the largest number in a data set. Source: original.
MapReduce could also be applied in business, for instance to find the average time consumers
spend on a website by mapping out collections of users' session data and reducing them to a
single average; both examples are sketched below. The technology distributes labor within a
cloud of computers the way a foreman assigns tasks to a team of construction workers.
Delegating work, whether in a virtual space or a physical one, dramatically reduces the time
needed to finish a job.
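The following sketch imitates both examples on a single machine, using a Python process pool
in place of the worker machines; the data is randomly generated and the chunking is simplified,
but the map and reduce roles mirror Figure 2. On a real cluster, a framework such as Hadoop
would ship each chunk to a different computer.

# MapReduce-style sketch: find the largest number and the average session time.
from multiprocessing import Pool
import random

def map_max(chunk):
    # Map step: each worker returns the largest number in its own subset.
    return max(chunk)

def map_time_stats(chunk):
    # Map step: each worker returns (total seconds, session count) for its subset.
    return sum(chunk), len(chunk)

if __name__ == "__main__":
    numbers = [random.randint(0, 10**9) for _ in range(1_000_000)]
    session_seconds = [random.uniform(10, 600) for _ in range(1_000_000)]

    # Split the data into subsets, one per worker.
    n_workers = 8
    number_chunks = [numbers[i::n_workers] for i in range(n_workers)]
    time_chunks = [session_seconds[i::n_workers] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        # Reduce step for the maximum: compare the workers' local maxima.
        largest = max(pool.map(map_max, number_chunks))

        # Reduce step for the average: combine per-worker totals and counts.
        totals = pool.map(map_time_stats, time_chunks)
        avg_time = sum(t for t, _ in totals) / sum(c for _, c in totals)

    print("largest number:", largest)
    print("average time on site (s):", round(avg_time, 1))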
The big issue plaguing MapReduce on its own is the initial barrier of setting up a cluster of
computers and distributing the subtasks manually, which is where Hadoop becomes
advantageous. Hadoop takes a problem and a large data input and handles the behind-the-scenes
work of distributing the jobs and assembling the solution. Released in 2005 and developed
heavily at Yahoo!, the framework was so revolutionary that it set benchmarks for large
processing jobs [9], and tasks like decoding the human genome mapped onto it almost perfectly.
Moreover, because the software's source code was released freely to the public for any use, all
businesses were free to use Hadoop to analyze Big Data easily. Open-sourcing Hadoop also
meant that developers outside of Yahoo! contributed to the project, further improving and
popularizing it. The pairing of these two technologies made Big Data's operation faster,
cheaper, and more feasible for businesses big and small.
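As a rough illustration of how little of that behind-the-scenes work a programmer sees, the
largest-number example can be written as two short scripts for Hadoop Streaming, Hadoop's
facility for running ordinary programs as the map and reduce steps. The file names and data
paths here are hypothetical; Hadoop splits the input, runs a copy of the mapper on each split,
and routes every mapper's output line to the reducer.

# mapper.py -- each map task reads its share of the numbers (one per line)
# from standard input and emits only the largest value it saw.
import sys

local_max = None
for line in sys.stdin:
    line = line.strip()
    if line:
        value = int(line)
        if local_max is None or value > local_max:
            local_max = value
if local_max is not None:
    print(f"max\t{local_max}")

# reducer.py -- every mapper used the same key, so one reducer sees all the
# local maxima and keeps the biggest as the global answer.
import sys

global_max = None
for line in sys.stdin:
    _, value = line.strip().split("\t")
    if global_max is None or int(value) > global_max:
        global_max = int(value)
print(f"largest\t{global_max}")

The job would then be submitted with Hadoop's streaming jar, pointing it at the input directory
and these two scripts; the exact command and jar location depend on the installation.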
Business and Big Data:
Big Data seems to be one of today's biggest buzzwords, and for good reason. This new
approach to statistical analysis is transforming fields like marketing, where decisions were once
made by gut feeling and intuition. Through consumers' use of mobile phones, computers, and
social media, huge amounts of data are created around their habits, behavior, and opinions.
This information is commonly used to make smarter management decisions. Big Data can
provide an online music store with confidence in how much inventory to order because
purchasing patterns begin to appear when there is enough information. By combining cloud
computing, massive data stores, and rapid processing with Hadoop, Big Data is becoming one of
the most important aspects of business development today.
References
[1] Enriquez, Juan. "The Glory of Big Data." Popular Science, 31 Oct. 2011. Web. 9 Nov. 2013.
<http://www.popsci.com/technology/article/2011-10/glory-big-data?page=0,1>.
[2] "Profiles in Data Science: Jake Porway." What's The Big Data?, 24 May 2012. Web. 15 Nov.
2013. <http://whatsthebigdata.com/2012/05/24/profiles-in-data-science-jake-porway/>.
[3] Shahid, Umair. "The Three 'V's of Big Data." OpenSCG, n.d. Web. 15 Nov. 2013.
<http://www.openscg.com/2013/07/the-three-vs-of-big-data/>.
[4] "Big Data Meets Big Data Analytics." SAS, n.d. Web. 15 Nov. 2013.
<http://www.sas.com/resources/whitepaper/wp_46345.pdf>.
[5] Knorr, Eric. "What Cloud Computing Really Means." InfoWorld, n.d. Web. 15 Nov. 2013.
<http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means-031>.
[6] "The History of Cloud Computing & Amazon Web Services." Newvem, n.d. Web. 16 Nov.
2013. <http://www.newvem.com/cloudpedia/the-history-of-cloud-computing/>.
[7] "Big Data, for Better or Worse: 90% of World's Data Generated Over Last Two Years."
ScienceDaily, 22 May 2013. Web. 13 Nov. 2013.
<http://www.sciencedaily.com/releases/2013/05/130522085217.htm>.
[8] "What Is a Petabyte?" SemanticCommunity, n.d. Web. 13 Nov. 2013.
<www.SemanticCommunity.info>.
[9] Graves, Thomas. "YDN Blog." Yahoo!, 3 July 2013. Web. 15 Nov. 2013.
<http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html>.
[10] "Wikibon Blog." Wikibon, 1 Aug. 2012. Web. 11 Nov. 2013.
<http://wikibon.org/blog/big-data-statistics/>.