Roy Iacob
Professor Harlynn Ramsey
WRIT 340
4 December 2013

Bio: Roy Iacob is a junior Computer Science student at the University of Southern California. After interning at the software startup SET Media, he became deeply interested in the possibilities of distributed computing.

Keywords: Computer Science, Business, Big Data, Statistics

Suggested Multimedia: GIF animation of Figure 2: one set of ten small integers is divided into small subsets, the biggest number in each subset is returned, and the master computer compares the results for the final solution.

Abstract: Big Data is transforming the way businesses operate by providing conclusive patterns about customer habits and interests. This emerging field is now possible because of improved hardware and software technologies that process information differently than previous paradigms. Big Data is so important that nearly all corporations will utilize some form of it in the next few years.

Big Data's Big Effect on Business

If you've ever watched an episode of AMC's hit TV drama "Mad Men," you've seen the Madison Avenue "Ad Men" use a well-known market research technique: the focus group. This study of human behavior gives advertisers insight into their target market's needs and desires by observing customers as they use a product. While countless businesses still value this technique, more companies now turn to the vast and growing set of data that modern technology produces. Big Data is changing the way professionals from nearly every field approach questions: rather than formulating models and formulas to explain trends, this new field employs advanced statistics and modern technology to rapidly run tests and analyses on huge stores of information.

Data Acquisition:

Consider a music store predicting sales for an upcoming album's release. A brick-and-mortar store like this music shop struggles to purchase inventory accurately because it must balance carrying enough stock to meet its customers' needs against buying no more than it can sell. With traditional market analysis, the store might determine its order size by inspecting sales records, implementing rewards programs to spot purchasing trends, and running surveys to ask customers questions directly. These methods are costly and rarely produce enough information to answer the underlying questions. Big Data, on the other hand, could provide a competing online store with detailed purchase records, data on which other albums customers preview, and statistics on how much time customers spend exploring the online store. This competitor can use its abundant data to develop algorithms that fundamentally change its marketing and management strategies [1]. Soon enough, the store can learn its customers' habits and generate suggestions that lead to more sales. Big Data gives the online seller an unparalleled advantage, helping it purchase less inventory while providing its customers with a more personal experience.
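The specific algorithms vary from store to store, but even a very simple one conveys the idea. The sketch below (in Python) builds suggestions from the toy rule "customers who bought this album also bought that one"; the purchase records, album names, and counting rule are invented for illustration, not taken from any real retailer.

# A minimal sketch of turning raw purchase records into suggestions.
# All data here is made up for the example.
from collections import Counter
from itertools import combinations

# Each entry is one customer's set of purchased album titles (toy data).
purchases = [
    {"Album A", "Album B"},
    {"Album A", "Album B", "Album C"},
    {"Album B", "Album C"},
    {"Album A", "Album C"},
]

# Count how often each pair of albums appears in the same purchase history.
pair_counts = Counter()
for basket in purchases:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def suggest(album, top_n=2):
    """Return the albums most often bought alongside the given album."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if album == a:
            scores[b] += count
        elif album == b:
            scores[a] += count
    return [title for title, _ in scores.most_common(top_n)]

print(suggest("Album A"))  # e.g. ['Album B', 'Album C']

With millions of real purchase records instead of four invented ones, the same counting idea begins to reveal the purchasing patterns the store needs.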
The increasing availability of information can be attributed to the pervasive nature of networked electronics like smartphones. Data scientist Jake Porway puts it this way: "It used to be top down, where companies would go out and conduct a survey and collect data. Now we are walking around with devices that log everything we like, pictures we take, stores we visit. You don't have to go out and find data, now it is coming to us" [2]. Modern technology has become embedded in our lives through smartphones, social networks, and wearable gadgets like Google Glass and smart watches, and the spread of connected devices into more places than ever keeps pushing the amount of available information upward.

Volume, Velocity, and Variety:

People commonly confuse Big Data with the basic analytics that business analysts have long performed. What makes Big Data different is commonly defined by the "three V's": volume, velocity, and variety [3]. To truly say you work with Big Data, each of these traits needs to be firmly rooted in your project.

Volume is the characteristic behind Big Data's name. According to ScienceDaily, 90% of the world's data was generated between 2011 and 2013 [7]. That amounts to more than 2.7 zettabytes of saved information, a figure that demonstrates the massive amount of data being created. It is produced by the activity of billions of web-connected users and systems with embedded computers, and it is saved on vast server farms all over the world. Figure 1 offers more comparisons to convey the sheer scale of a zettabyte:

Figure 1: Infographic displaying how much information can be stored in various quantities of data. Source: Mozy.com

Recent research shows that five billion people are now connected via mobile phones. The widespread use of these devices to make calls, post statuses, use apps, and surf the web incidentally leaves behind an inordinate amount of data, and to make use of it businesses need better technological paradigms for sorting through it all.

Velocity refers to the rapid influx of data that warrants Big Data tools in the first place. Much of this information arrives at speeds that outpace older technology's ability to process it. Walmart, for instance, reports handling transactions for some one million customers per hour [4]. The challenge for businesses is to handle this information in real time so that they can make relevant decisions quickly; a mall on Black Friday could use live parking statistics to add or relocate resources like staff and security where they are needed. Big Data is effective because it can keep up with this high velocity of incoming data.

The final characteristic of Big Data is the variety of its information. Modern data is made up of new and ever-changing content from widespread sources. Businesses can now track things like time spent on a web page and clicking habits, and can even gauge shoppers' reactions with facial recognition systems embedded in mannequins. Moreover, new companies like CMO and Communispace devote themselves to capitalizing on these alternative forms of information. They quantify emotions and feelings by running language sentiment analysis over public social media streams like Twitter, measuring consumers' opinions of brands in real time. A simple version of this would assign a positive or negative value to common words and then score tweets that hashtag a product's name. In these examples, social media reveals the diversity of new data sources that Big Data must make sense of.
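A minimal sketch of that word-scoring idea appears below (in Python). The word values, the tweets, and the "#AcmePlayer" product tag are all invented for illustration; real sentiment analysis systems are far more sophisticated.

# Toy word-level sentiment scoring: assign +1/-1 to a few common words
# and score tweets that hashtag a (hypothetical) product name.
WORD_SCORES = {"love": 1, "great": 1, "awesome": 1,
               "hate": -1, "broken": -1, "terrible": -1}

tweets = [
    "I love the new #AcmePlayer, the shuffle feature is great",
    "#AcmePlayer keeps crashing, terrible update",
]

def score(tweet):
    """Sum the sentiment values of the words in one tweet."""
    return sum(WORD_SCORES.get(word.strip("#,.!").lower(), 0)
               for word in tweet.split())

mentions = [t for t in tweets if "#acmeplayer" in t.lower()]
print("net sentiment for #AcmePlayer:", sum(score(t) for t in mentions))

Counting positive and negative words this way is crude, but applied to millions of tweets a day it gives a brand a rough, real-time pulse of public opinion.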
New Technology:

The research behind Big Data required two crucial advances: accessible cloud computing, and MapReduce paired with Hadoop. Both technologies made working with Big Data cheap and simple. These two advancements are the reason the Human Genome Project went from taking ten years to decode down to seven days [1].

The idea of cloud computing has been around since the 1950s: it refers to connecting numerous computers over a network so that they work together as a cluster [5]. A cluster uses cheaper, mass-produced computers rather than one central computer of equivalent capability. While the idea is relatively old, it became ubiquitous only recently, when Amazon made it easily available to the public. In 2006, Amazon noticed a problem companies had with acquiring computing power: they could either pay flat-rate subscriptions for servers that could not scale, or buy in-house servers and hire information technology specialists to maintain the bulky hardware. Amazon responded by creating a platform for renting out its own servers as a utility, since it rarely used more than 10% of its computing power [6]; it keeps that much capacity in reserve to stay prepared for surges such as the month before the winter holidays. The new platform, called Amazon Web Services, gave people the freedom to pay for exactly as much processing power and storage as they used, with the option to scale up or down according to their needs. Soon after, the popularity of cloud computing led big companies like Microsoft to rent out their existing server farms in the same way. The paradigm shift can be compared to paying a flat rate on your power bill versus paying for your exact consumption. Cloud service providers also leveraged economies of scale to drive prices down: a gigabyte of server storage dropped from $16 in 2002 to seven cents in 2013 [7]. For organizations like the CIA and Netflix, cloud computing eliminated the overhead of traditional servers and made way for simpler, cheaper Big Data processing by treating machine work as a utility rather than a subscription service.

The other key technology behind Big Data is Hadoop with MapReduce. Nearly all Big Data projects use these tools because they process numbers quickly by distributing work in parallel, and most data analysis can be broken down into many similar basic tasks. A trivial example of what this technology can do is finding the largest number in a set of billions of numbers (see Figure 2). A master machine acts like a project leader: it maps the problem into subtasks and schedules work for hundreds of worker machines in its cluster. The worker machines, which are individual computers, then search for the largest number in their own subsets concurrently rather than iterating through the entire set one value at a time. Finally, the master reduces the workers' answers to produce the final solution.

Figure 2: Flow chart of a MapReduce job searching for the largest number in a data set. Source: original.

In business, MapReduce could be applied to find the average time consumers spend on a website by mapping out collections of users' information and reducing them to a single overall average. The technology distributes labor within a cloud of computers the way a foreman assigns tasks to a team of construction workers, and that delegation of work, whether in a virtual space or a physical one, dramatically cuts the time needed to finish a job.
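The flow in Figure 2 can be sketched in a few lines of Python, with a process pool standing in for the cluster of worker machines. The data set and chunk size here are invented for illustration; a real deployment would spread the work across separate computers rather than local processes.

# A minimal map/reduce sketch: workers each find the largest number in
# their own subset, and the master combines their answers.
from multiprocessing import Pool

def map_task(chunk):
    """Worker step: find the largest number in one subset."""
    return max(chunk)

def reduce_task(partial_results):
    """Master step: combine the workers' answers into a final answer."""
    return max(partial_results)

if __name__ == "__main__":
    numbers = list(range(1_000_000))      # stand-in for billions of values
    chunks = [numbers[i:i + 100_000]      # the "map" of subtasks
              for i in range(0, len(numbers), 100_000)]

    with Pool() as workers:               # the cluster of worker machines
        partial = workers.map(map_task, chunks)

    print(reduce_task(partial))           # prints 999999

Replacing max with a per-chunk sum and count would turn the same skeleton into the average-time-on-site calculation described above.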
The big issue plaguing MapReduce is the initial barrier of setting up a cluster of computers and distributing the subtasks manually, and this is where Hadoop becomes advantageous. Hadoop takes a problem and a large data input and then handles the behind-the-scenes work of distributing the jobs and assembling the solution. Built at Yahoo! in 2005, the framework was so revolutionary that it set benchmarks for processing jobs like decoding the Human Genome Project, a problem that maps onto the model almost perfectly [9]. Moreover, because the software's source code was released freely to the public for any use, every business was free to use Hadoop to analyze Big Data easily, and developers outside Yahoo! contributed improvements that made the project better while popularizing it. The pairing of these two technologies made Big Data faster, cheaper, and more feasible for businesses big and small.

Business and Big Data:

Big Data seems to be one of today's biggest buzzwords, and for good reason. This new approach to statistical analysis is transforming fields like marketing, where decisions were once made on gut feeling and intuition. Through consumers' use of mobile phones, computers, and social media, huge amounts of data are created about their habits, behavior, and opinions, and this information is increasingly used to make smarter management decisions. Big Data can give an online music store confidence in how much inventory to order because purchasing patterns begin to appear once there is enough information. By combining cloud computing, massive data stores, and rapid processing with Hadoop, Big Data is becoming one of the most important aspects of business development today.

References

[1] Enriquez, Juan. "The Glory of Big Data." Popular Science, 31 Oct. 2011. Web. 9 Nov. 2013. <http://www.popsci.com/technology/article/2011-10/glory-big-data?page=0,1>.
[2] "Profiles in Data Science: Jake Porway." What's The Big Data?, 24 May 2012. Web. 15 Nov. 2013. <http://whatsthebigdata.com/2012/05/24/profiles-in-datascience-jake-porway/>.
[3] Shahid, Umair. "The Three 'V's of Big Data." OpenSCG, n.d. Web. 15 Nov. 2013. <http://www.openscg.com/2013/07/thethree-vs-of-big-data/>.
[4] "Big Data Meets Big Data Analytics." SAS, n.d. Web. 15 Nov. 2013. <http://www.sas.com/resources/whitepaper/wp_46345.pdf>.
[5] Knorr, Eric. "What Cloud Computing Really Means." InfoWorld, n.d. Web. 15 Nov. 2013. <http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means031>.
[6] "The History of Cloud Computing & Amazon Web Services." Newvem, n.d. Web. 16 Nov. 2013. <http://www.newvem.com/cloudpedia/the-history-of-cloud-computing/>.
[7] "Big Data, for Better or Worse: 90% of World's Data Generated Over Last Two Years." ScienceDaily, 22 May 2013. Web. 13 Nov. 2013. <http://www.sciencedaily.com/releases/2013/05/130522085217.htm>.
[8] "What Is a Petabyte?" SemanticCommunity, n.d. Web. 13 Nov. 2013. <www.SemanticCommunity.info>.
[9] Graves, Thomas. "YDN Blog." Yahoo!, 3 July 2013. Web. 15 Nov. 2013. <http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellowelephant-092704088.html>.
[10] "Wikibon Blog." Wikibon, 1 Aug. 2012. Web. 11 Nov. 2013. <http://wikibon.org/blog/big-data-statistics/>.