Why Not Store Everything in Main Memory? Why use disks?

Background:
There is a Twitter Archive at US Library of Congress
Certainly the record of what millions of Americans say, think, and feel each day (their tweets) is a treasure-trove for
historians. But is the technology feasible, and important, for a federal agency? Is it cost-effective to handle the three
V's that form the fingerprint of a Big Data project – volume, velocity, and variety?
U.S. Library of Congress said yes and agreed to archive all tweets sent since 2006 for posterity. But the task is daunting.
Volume? US LoC will archive 172 billion tweets in 2013 alone (~300 each from 500 million tweeters), so many trillions, 2006-2016?
Velocity? currently absorbing > 20 million tweets/hour, 24 hours/day, seven days/week, each stored in a way that can last.
Variety? tweets from a woman who may run for president in 2016 – and Lady Gaga. And they're different in other ways.
"Sure, a tweet is 140 characters" says Jim Gallagher, US Library of Congress Director of Strategic Initiatives, "There are 50
fields. We need to record who wrote it. Where. When." Because many tweets seem banal, the project has inspired
ridicule. When the library posted its announcement of the project, one reader wrote in the comments box: "I'm
guessing a good chunk ... came from the Kardashians." But isn't banality the point? Historians want to know not just
what happened in the past but how people lived. It is why they rejoice in finding a semiliterate diary kept by a
Confederate soldier, or pottery fragments in a colonial town. It's as if a historian today writing about Lincoln could
listen in on what millions of Americans were saying on the day he was shot. Youkel and Mandelbaum might seem like
an odd couple to carry out a Big Data project: One is a career Library of Congress researcher with an undergraduate
degree in history, the other a geologist who worked for years with oil companies. But they demonstrate something
Babson's Mr. Davenport has written about the emerging field of analytics: "hybrid specialization."
For organizations to use the new technology well, traditional skills, like computer science, aren't enough. Davenport points
out that just as Big Data combines many innovations, finding meaning in the world's welter of statistics means
combining many different disciplines. Mandelbaum and Youkel pool their knowledge to figure out how to archive the
tweets, how researchers can find what they want, and how to train librarians to guide them [but not how to data mine
them!]. Even before opening tweets to the public, the library has gotten more than 400 requests from doctoral
candidates, professors, and journalists. "This is a pioneering project," Mr. Dizard says. "It's helping us begin to
handle large digital data." For "America's library," at this moment, that means housing a Gutenberg Bible and Lady
Gaga tweets. What will it mean in 50 years? I ask Dizard. He laughs – and demurs. "I wouldn't look that far ahead."
[I would! It will mean that data mining will drive a sea-change move to vertically storing all big data!]
NSA programs: PRISM (email/Twitter/Facebook analytics?) and ? (phone-record analytics)
Two US surveillance programs – one scooping up records of Americans' phone calls and the other collecting information on Internet-based
activities (PRISM?) – came to public attention. The aim: data mining to help the NSA thwart terrorism. But not everyone is comfortable with it. In
the name of fighting terrorism, the US government has been mining data collected from phone companies such as Verizon for the past seven
years and from Google, Facebook, and other social media firms for at least four years, according to government documents leaked this week to news organizations.
The two surveillance programs, one that collects detailed records of telephone calls, the other that collects data on Internet-based activities
such as e-mail, instant messaging, and video conferencing [FaceTime, Skype?], were publicly revealed in "top secret" documents leaked to
the British newspaper the Guardian and the Washington Post. Both are run by the National Security Agency (NSA), the papers reported.
The existence of the telephone data-mining program was previously known, and civil libertarians have for years complained that it represents a
dangerous and overbroad incursion into the privacy of all Americans. What became clear this week were certain details about its
operation – such as that the government sweeps up data daily and that a special court has been renewing the program every 90 days since
about 2007. But the reports about the Internet-based data-mining program, called PRISM, represent a new revelation, to the public.
Data-mining can involve the use of automated algorithms to sift through a database for clues as to the existence of a terrorist plot. One member of
Congress claimed this week that the telephone data-mining program helped to thwart a significant terrorism incident in the United States
"within the last few years," but could not offer more specifics because the whole program is classified. Others in Congress, as well as
President Obama and the director of national intelligence, sought to allay concerns of critics that the surveillance programs represent Big
Government run amok. But it would be wrong to suggest that every member of Congress is on board with the sweep of such data mining
programs or with the amount of oversight such national-security endeavors get from other branches of government. Some have hinted for
years that they find such programs disturbing and an infringement of people's privacy. Here's an overview of these two data-mining
programs, and how much oversight they are known to have.
Phone-record data mining: On Thursday, the Guardian displayed on its website a top-secret court order authorizing the telephone data-collection
program. The order, signed by a federal judge on the mysterious Foreign Intelligence Surveillance Court, requires a subsidiary of Verizon to
send to the NSA “on an ongoing daily basis” through July its “telephony metadata,” or communications logs, “between the United States
and abroad” or “wholly within the United States, including local telephone calls.” Such metadata include the phone number calling and
the number called, telephone calling card numbers, and time and duration of calls. What's not included is permission for the NSA to
record or listen to a phone conversation. That would require a separate court order, federal officials said after the program's details were
made public. After the Guardian published the court's order, it became clear that the document merely renewed a data-collection that has
been under way since 2007 – and one that does not target Americans, federal officials said. “The judicial order that was disclosed in the
press is used to support a sensitive intelligence collection operation, on which members of Congress have been fully and repeatedly briefed,”
said James Clapper, director of national intelligence, in a statement about the phone surveillance program. “The classified program has
been authorized by all three branches of the Government.” That does not do much to assuage civil libertarians, who complain that the
government can use the program to piece together moment-by-moment movements of individuals throughout their day and to identify to
whom they speak most often. Such intelligence operations are permitted by law under Section 215 of the Patriot Act, so-called “business
records” provision. It compels businesses to provide information about their subscribers to the government. Some companies responded,
but obliquely, given that by law they cannot comment on the surveillance programs or even confirm their existence. Randy Milch,
general counsel for Verizon, said in an e-mail to employees that he had no comment on the accuracy of the Guardian article, the
Washington Post reported. The “alleged order,” he said, contains language that “compels Verizon to respond” to government requests and
“forbids Verizon from revealing [the order's] existence.”
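To make concrete what analysis of "telephony metadata" can reveal even without call content, here is a toy sketch in Python. The records, field names, and numbers below are entirely invented for illustration and do not reflect the NSA's or any carrier's actual systems; they simply mirror the metadata fields listed above (calling number, called number, time, duration) and show how calling patterns can be aggregated.

    from collections import Counter

    # Hypothetical call detail records; values are made up for illustration only.
    call_records = [
        {"caller": "+1-202-555-0100", "callee": "+1-301-555-0199", "start": "2013-06-01T09:12", "seconds": 340},
        {"caller": "+1-202-555-0100", "callee": "+1-301-555-0199", "start": "2013-06-02T18:40", "seconds": 60},
        {"caller": "+1-202-555-0100", "callee": "+1-410-555-0123", "start": "2013-06-03T07:05", "seconds": 15},
    ]

    def most_frequent_contacts(records, caller):
        """Count how often `caller` dials each number -- no call content needed."""
        counts = Counter(r["callee"] for r in records if r["caller"] == caller)
        return counts.most_common()

    print(most_frequent_contacts(call_records, "+1-202-555-0100"))
    # [('+1-301-555-0199', 2), ('+1-410-555-0123', 1)]

Even this trivial aggregation illustrates the civil libertarians' point: metadata alone identifies to whom a person speaks most often.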
Will NSA leaks wake us from our techno-utopian dream?
A vast surveillance state is being made possible by technologies that
we were told would liberate us. Christian Science Monitor Dan Murphy, Staff writer, 6/10/13
They work a few hundred yards from one of the Library of Congress's most prized possessions: a vellum copy of the Bible printed in 1455 by
Johann Gutenberg, inventor of movable type. But almost six centuries later, Jane Mandelbaum and Thomas Youkel have a task that
would confound Gutenberg. The researchers are leading a team that is archiving almost every tweet sent out since Twitter began in 2006.
A half-billion tweets stream into library computers each day. Their question: How can they store the tweets so they become a meaningful
tool for researchers – a sort of digital transcript providing insights into the daily flow of history?
Thousands of miles away, Arnold Lund has a different task. Mr. Lund manages a lab for General Electric, a company that still displays the desk of
its founder, Thomas Edison, at its research headquarters in Niskayuna, N.Y. But even Edison might need training before he'd grasp all the
dimensions of one of Lund's projects. Lund's question: How can power companies harness the power of data to predict which trees will
fall on power lines during a storm – thus allowing them to prevent blackouts before they happen?
The work of Richard Rothman, a professor at Johns Hopkins University in Baltimore, is more fundamental: to save lives. The Centers for Disease
Control and Prevention (CDC) in Atlanta predicts flu outbreaks, once it examines reports from hospitals. That takes weeks. In 2009, a
study seemed to suggest researchers could predict outbreaks much faster by analyzing millions of Google searches. Spikes in queries
like "My kid is sick" signaled a flu outbreak before the CDC knew there would be one. That posed a new question for Dr. Rothman and
his colleague Andrea Dugas: Could Google help predict influenza outbreaks in time to allow hospitals like the one at Johns Hopkins to
get ready? They ask different questions.
But all of these researchers form part of the new world of Big Data – a phenomenon that may revolutionize every facet of life, culture, and, well,
even the planet. From curbing urban crime to calculating the effectiveness of a tennis player's backhand, people are now gathering and
analyzing vast amounts of data to predict human behaviors, solve problems, identify shopping habits, thwart terrorists – everything but
foretell which Hollywood scripts might make blockbusters. Actually, there's a company poring through numbers to do that, too. Just four
years ago, someone wanted to do a Wikipedia entry on Big Data. Wikipedia said no; there was nothing special about the term – it just
combined two common words. Today, Big Data seems everywhere, ushering in what advocates consider the biggest changes since Euclid.
Want to get elected to public office? Put a bunch of computer geeks in a room and have them comb through databases to glean who might vote for
you – then target them with micro-tailored messages, as President Obama famously did in 2012.
Want to solve poverty in Africa? Analyze text messages and social media networks to detect early signs of joblessness, epidemics, and other
problems, as the United Nations is trying to do.
Eager to find the right mate? Use algorithms to analyze an infinite number of personality traits to determine who's the best match for you, as
many online dating sites now do.
What exactly is Big Data? What makes it new? Different? What's the downside? Such questions have evoked intense interest, especially since
June 5. On that day, former National Security Agency analyst Edward Snowden revealed that, like Ms. Mandelbaum or Rothman, the
NSA had also asked a question: Can we find terrorists using Big Data – like the phone records of hundreds of millions of ordinary
Americans? Could we get those records from, say, Verizon? Mr. Snowden's disclosures revealed that PRISM, the program the NSA
devised, secretly monitors calls, Web searches, and e-mails, in the United States and other countries.
Short Message Service (SMS) is a text messaging service component of phone, web, or mobile communication systems. It uses standardized communication
protocols that allow the exchange of short text messages between fixed-line and mobile phone devices.
SMS is the most widely used data application in the world, with 3.5 billion active users, or 78% of all mobile phone subscribers. The term "SMS" is used
for all types of short text messaging and the user activity itself in many parts of the world.
SMS as used on modern handsets originated from radio telegraphy in radio memo pagers using standardized phone protocols. These were defined in 1985, as
part of the Global System for Mobile Communications (GSM) series of standards as a means of sending messages of up to 160 characters to and
from GSM mobile handsets. Though most SMS messages are mobile-to-mobile text messages, support for the service has expanded to include other
mobile technologies, such as ANSI CDMA networks and Digital AMPS, as well as satellite and landline networks.
Message size: Transmission of short messages between the SMSC and the handset is done using the Mobile Application Part (MAP) of the SS7
protocol. Messages are sent with the MAP MO- and MT-ForwardSM operations, whose payload length is limited by the constraints of the signaling
protocol to precisely 140 octets (140 octets = 140 * 8 bits = 1120 bits). Short messages can be encoded using a variety of alphabets: the default GSM
7-bit alphabet, the 8-bit data alphabet, and the 16-bit UCS-2 alphabet. Larger content (concatenated SMS, multipart or segmented SMS, or "long
SMS") can be sent using multiple messages, in which case each message starts with a User Data Header (UDH) containing segmentation information.
Text messaging, or texting, is the act of typing and sending a brief, electronic message between two or more mobile phones or fixed or portable devices over a
phone network. The term originally referred to messages sent using the Short Message Service (SMS) only; it has grown to include messages
containing image, video, and sound content (known as MMS messages). The sender of a text message is known as a texter, while the service itself has
different colloquialisms depending on the region. It may simply be referred to as a text in North America, the United Kingdom, Australia and the
Philippines, an SMS in most of mainland Europe, and a TMS or SMS in the Middle East, Africa and Asia. Text messages can be used to interact with
automated systems to, for example, order products or services, or participate in contests. Advertisers and service providers use direct text marketing to
message mobile phone users about promotions, payment due dates, etc., instead of using mail, e-mail, or voicemail. Strictly defined, text messaging
by phones or mobile phones covers alphanumeric messages – all 26 letters of the alphabet and the 10 numerals – sent by a texter or received by a
textee.
Security concerns: Consumer SMS should not be used for confidential communication. The contents of common SMS messages are known to the network
operator's systems and personnel. Therefore, consumer SMS is not an appropriate technology for secure communications. To address this issue, many
companies use an SMS gateway provider based on SS7 connectivity to route the messages. The advantage of this international termination model is
the ability to route data directly through SS7, which gives the provider visibility of the complete path of the SMS. This means SMS messages can be
sent directly to and from recipients without having to go through the SMS-C of other mobile operators. This approach reduces the number of mobile
operators that handle the message; however, it should not be considered as an end-to-end secure communication, as the content of the message is
exposed to the SMS gateway provider. Failure rates without backward notification can be high between carriers (T-Mobile to Verizon is notorious in
the US). International texting can be extremely unreliable depending on the country of origin, destination and respective carriers.
Twitter is an online social networking and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as "tweets".
Twitter was created in March 2006 by Jack Dorsey, and the social networking site was launched that July. The service rapidly gained worldwide popularity, with over 500
million registered users as of 2012, generating over 340 million tweets daily and handling over 1.6 billion search queries per day. Since its launch, Twitter has
become one of the ten most visited websites on the Internet and has been described as "the SMS of the Internet." Unregistered users can read tweets, while registered
users can post tweets through the website, SMS, or a range of mobile apps. Twitter, Inc. is based in San Francisco, with servers and offices in New York City, Boston, and San Antonio.
Tweets are publicly visible by default, but senders can restrict message delivery to just their followers. Users can tweet via the Twitter website, compatible external
apps (such as smartphone apps), or by Short Message Service (SMS), which is available in certain countries. While the service is free, accessing it through SMS may incur phone service fees.
Users may subscribe to other users' tweets – this is known as following, and subscribers are known as followers or tweeps, a portmanteau of Twitter and peeps. Users
can also see who has unsubscribed from them on Twitter (unfollowing), and they can block those who have followed them. Twitter allows
users to update their profile via their mobile phone, either by text messaging or by apps released for certain smartphones and tablets. Twitter has been compared to a
web-based Internet Relay Chat (IRC) client. A 2009 Time essay described the basic mechanics of Twitter as "remarkably simple":
As a social network, Twitter revolves around the principle of followers. When you choose to follow another Twitter user, that user's tweets appear in reverse chronological
order on your main Twitter page. If you follow 20 people, you'll see a mix of tweets scrolling down the page: breakfast-cereal updates, interesting new links, music
recommendations, even musings on the future of education.
Pear Analytics analyzed 2,000 tweets (originating from the US, in English) over a two-week period in August 2009, from 11:00 am to 5:00 pm (CST), and separated them into six categories:
Pointless babble – 40%
Conversational – 38%
Pass-along value – 9%
Self-promotion – 6%
Spam – 4%
News – 4%
Social networking researcher Danah Boyd argues what Pear researchers labeled "pointless babble" is better characterized as "social grooming" and/or "peripheral awareness"
(which she explains as persons "want[ing] to know what the people around them are thinking and doing and feeling, even when co-presence isn’t viable").
Format: Users can group posts by topic or type with hashtags – words or phrases prefixed with a "#" sign. Similarly, the "@" sign followed by a username is used to mention or reply
to other users. To repost a message from another Twitter user and share it with one's own followers, users employ the retweet function, symbolized by "RT" in the message.
In late 2009, the "Twitter Lists" feature was added, making it possible for users to follow (as well as mention and reply to) ad hoc lists of authors instead of individual authors.
Through SMS, users can communicate with Twitter thru 5 gateway numbers: short codes for US, Canada, India, New Zealand, Isle of Man-based number for international use.
There is also a short code in the UK, accessible only to those on the Vodafone, O2, and Orange networks. In India, since Twitter only supports tweets from Bharti Airtel, an
alternative platform called smsTweet was set up by a user to work on all networks; GladlyCast exists for mobile phone users in Singapore, Malaysia, and the Philippines.
Tweets were set to a 140-character limit for compatibility with SMS messaging, introducing the shorthand notation and slang commonly used in SMS messages. The 140-character
limit has also increased the usage of URL shortening services such as bit.ly, goo.gl, and tr.im, and of content-hosting services such as Twitpic, memozu.com,
and NotePub, to accommodate multimedia content and text longer than 140 characters. Since June 2011, Twitter has used its own t.co domain for automatic
shortening of all URLs posted on its website.
Trending topics: A word, phrase, or topic that is tagged at a greater rate than other tags is said to be a trending topic. Trending topics become popular either through a concerted
effort by users or because an event prompts people to talk about one specific topic. These topics help Twitter and its users understand what is happening in the world.
Trending topics are sometimes the result of concerted efforts by fans of certain celebrities or cultural phenomena, particularly musicians like Lady Gaga (known as Little
Monsters), Justin Bieber (Beliebers), and One Direction (Directioners), and fans of the Twilight (Twihards) and Harry Potter (Potterheads) novels. Twitter has
altered the trend algorithm in the past to prevent manipulation of this type.
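"Tagged at a greater rate than other tags" can be made concrete with a toy calculation. The Python sketch below is purely illustrative (Twitter's real trend algorithm is proprietary and, as noted above, has been tuned against manipulation): it compares each hashtag's share of traffic in the current window against its share in a previous window and surfaces the biggest risers.

    from collections import Counter

    def trending(previous_tags, current_tags, top_n=3):
        """Rank hashtags by how much their share of traffic grew between two windows."""
        prev, curr = Counter(previous_tags), Counter(current_tags)
        prev_total = sum(prev.values()) or 1
        curr_total = sum(curr.values()) or 1
        growth = {tag: curr[tag] / curr_total - prev[tag] / prev_total for tag in curr}
        return sorted(growth, key=growth.get, reverse=True)[:top_n]

    print(trending(["#monday", "#coffee", "#coffee"],
                   ["#earthquake", "#earthquake", "#earthquake", "#coffee"]))
    # ['#earthquake', '#coffee']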
Twitter's March 30, 2010 blog post announced that the hottest Twitter trending topics would scroll across the Twitter homepage. Controversies abound over Twitter trending
topics: Twitter has censored hashtags that other users found offensive. Twitter censored the #Thatsafrican and #thingsdarkiessay hashtags after users complained they
found the hashtags offensive. There are also allegations that Twitter removed #NaMOinHyd from the trending list and added an Indian National Congress-sponsored hashtag.
Adding and following content: There are numerous tools for adding content and for monitoring content and conversations, including Telly (video sharing, formerly Twitvid),
TweetDeck, Salesforce.com, HootSuite, and Twitterfeed. As of 2009, fewer than half of tweets were posted using the web user interface, with most users using third-party applications (based on an analysis of 500 million tweets by Sysomos).
Verified accounts: In June 2008, Twitter launched a verification program, allowing celebrities to get their accounts verified. Originally intended to help users verify
which celebrity accounts were created by the celebrities themselves (and are therefore not fake), verified badges have since been used to verify accounts of businesses and
accounts for public figures who may not actually tweet but still wish to maintain control over the account that bears their name.
Mobile: Twitter has mobile apps for iPhone, iPad, Android, Windows Phone, BlackBerry, and Nokia. There is also a version of the website for mobile devices, as well as SMS and MMS
service. Twitter limits the use of third-party applications utilizing the service by implementing a 100,000-user limit.
Authentication: As of August 31, 2010, third-party Twitter applications are required to use OAuth, an authentication method that does not require users to enter their password
into the authenticating application. Previously optional, OAuth is now compulsory, and the username/password authentication
method has been deprecated and is no longer functional. Twitter stated that the move to OAuth would mean "increased security and a better experience".
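In practice, "using OAuth" means signing each API request with application and user tokens rather than sending a password. Here is a minimal Python sketch using the third-party requests-oauthlib library; the endpoint shown is the old REST API v1.1 home-timeline URL and the credential strings are placeholders you would obtain from Twitter's developer portal, so treat the details as illustrative assumptions rather than a definitive integration.

    import requests
    from requests_oauthlib import OAuth1  # pip install requests requests-oauthlib

    # Placeholder credentials, issued per application and per user.
    auth = OAuth1(
        client_key="CONSUMER_KEY",
        client_secret="CONSUMER_SECRET",
        resource_owner_key="ACCESS_TOKEN",
        resource_owner_secret="ACCESS_TOKEN_SECRET",
    )

    # Each request is signed; the user's password never passes through the app.
    resp = requests.get(
        "https://api.twitter.com/1.1/statuses/home_timeline.json",
        auth=auth,
        params={"count": 5},
    )
    print(resp.status_code)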
Related Headlines On August 19, 2013, Twitter announced Twitter Related Headlines.
Usage Rankings Twitter is ranked as one of the ten-most-visited websites worldwide by Alexa's web traffic analysis. Daily user estimates vary as the company does not
publish statistics on active accounts. A February 2009 Compete.com blog entry ranked Twitter as the third most used social network based on their count of
6 million unique monthly visitors and 55 million monthly visits. In March 2009, a Nielsen.com blog ranked Twitter as the fastest-growing website in the Member
Communities category for February 2009. Twitter had annual growth of 1,382 percent, increasing from 475,000 unique visitors in February 2008 to 7 million in
February 2009. In 2009, Twitter had a monthly user retention rate of forty percent.
Demographics: Twitter.com Top 5 Global Markets by Reach (%)
Country        Jun 2010   Dec 2010
Indonesia      20.8       19.0
Brazil         20.5       21.8
Venezuela      19.0       21.1
Netherlands    17.7       22.3
Japan          16.8       20.0
Note: Visitors age 15+, home and work locations. Excludes visits from public computers such as Internet cafes and access from mobile phones or PDAs.
In 2009, Twitter was mainly used by older adults who might not have used other social sites before Twitter, said Jeremiah Owyang, an industry analyst studying social media.
"Adults are just catching up to what teens have been doing for years," he said. According to comScore only eleven percent of Twitter's users are aged twelve to
seventeen. comScore attributed this to Twitter's "early adopter period" when the social network first gained popularity in business settings and news outlets attracting
primarily older users. However, comScore also stated in 2009 that Twitter had begun to "filter more into the mainstream", and "along with it came a culture of
celebrity as Shaq, Britney Spears and Ashton Kutcher joined the ranks of the Twitterati."
According to a study by Sysomos in June 2009, women make up a slightly larger Twitter demographic than men — fifty-three percent over forty-seven percent. It also stated
that five percent of users accounted for seventy-five percent of all activity, and that New York City has more Twitter users than other cities.
According to Quancast, twenty-seven million people in the US used Twitter as of September 3, 2009. Sixty-three percent of Twitter users are under thirty-five years old; sixty
percent of Twitter users are Caucasian, but a higher-than-average share (compared to other Internet properties) are African American/black (sixteen percent) and Hispanic
(eleven percent); fifty-eight percent of Twitter users have a total household income of at least US$60,000. The prevalence of African American Twitter usage and in
many popular hashtags has been the subject of research studies.
On September 7, 2011, Twitter announced that it has 100 million active users logging in at least once a month and 50 million active users every day.
In an article published on January 6, 2012, Twitter was confirmed to be the biggest social media network in Japan, with Facebook following closely in second. comScore
confirmed this, stating that Japan is the only country in the world where Twitter leads Facebook.
Finances and funding: Twitter (headquartered at 1355 Market St., San Francisco) raised over US$57 million from venture capitalist growth funding, although
exact numbers are not publicly disclosed. Twitter's first A round of funding was for an undisclosed amount that is rumored to have been between US$1 million and
US$5 million. Its second B round of funding in 2008 was for US$22 million and its third C round of funding in 2009 was for US$35 million from Institutional
Venture Partners and Benchmark Capital along with an undisclosed amount from other investors including Union Square Ventures, Spark Capital and Insight
Venture Partners. Twitter is backed by Union Square Ventures, Digital Garage, Spark Capital, and Bezos Expeditions.
In May 2008, The Industry Standard remarked that Twitter's long-term viability is limited by a lack of revenue. Twitter board member Todd Chaffee forecast that the company
could profit from e-commerce, noting that users may want to buy items directly from Twitter since it already provides product recommendations and promotions.
The company raised US$200 million in new venture capital in December 2010, at a valuation of approximately US$3.7 billion. In March 2011, 35,000 Twitter shares sold for
US$34.50 each on SharesPost, an implied valuation of US$7.8 billion. In August 2010, Twitter announced a "significant" investment led by Digital Sky Technologies that,
at US$800M, was reported to be the largest venture round in history. Twitter has been identified as a possible candidate for an initial public offering by 2013.
In December 2011, the Saudi prince Alwaleed bin Talal invested $300 million in Twitter. The company was valued at $8.4 billion at the time.
Revenue sources In July 2009, some of Twitter's revenue and user growth documents were published on TechCrunch after being illegally obtained by Hacker Croll. The
documents projected 2009 revenues of US$400,000 in the third quarter and US$4 million in the fourth quarter along with 25 million users by the end of the year.
The projections for the end of 2013 were US$1.54 billion in revenue, US$111 million in net earnings, and 1 billion users. No information about how Twitter planned
to achieve those numbers was published. In response, Twitter co-founder Biz Stone published a blog post suggesting the possibility of legal action against the hacker.
On April 13, 2010, Twitter announced plans to offer paid advertising for companies that would be able to purchase "promoted tweets" to appear in selective search results on
the Twitter website, similar to the Google AdWords advertising model. As of April 13, Twitter announced it had already signed up a number of companies wishing to
advertise including Sony Pictures, Red Bull, Best Buy, and Starbucks. To continue their advertising campaign, Twitter announced on March 20, 2012, that it would
be bringing its promoted tweets to mobile devices. Twitter generated US$139.5 million in advertising sales during 2011 and expects this number to grow 86.3% to
US$259.9 million in 2012. The company generated US$45 million in annual revenue in 2010, after beginning sales midway through that year. The company
operated at a loss through most of 2010. Revenues were forecast for US$100 million to US$110 million in 2011. Users' photos can generate royalty-free revenue for
Twitter, with an agreement with WENN being announced in May 2011. In June 2011, Twitter announced that it would offer small businesses a self-serve
advertising system. In April 2013, Twitter announced that its Twitter Ads self-service ads platform was available to all US users without an invite.
Technology and implementation: Great reliance is placed on open-source software. The Twitter Web interface uses the Ruby on Rails framework, deployed on a performance-
enhanced Ruby Enterprise Edition implementation of Ruby. As of April 6, 2011, Twitter engineers confirmed they had switched away from their Ruby on Rails
search-stack, to a Java server they call Blender. From spring 2007 to 2008 the messages were handled by a Ruby persistent queue server called Starling, but since
2009 implementation has been gradually replaced with software written in Scala. The service's application programming interface (API) allows other web services
and applications to integrate with Twitter. Individual tweets are registered under unique IDs using software called snowflake and geolocation data is added using
'Rockdove'. The URL shortener t.co then checks for a spam link and shortens the URL. The tweets are stored in a MySQL database using Gizzard and acknowledged
to users as having been sent. They are then sent to search engines via the Firehose API. The process itself is managed by FlockDB and takes an average of 350 ms.
On August 16, 2013, Twitter’s Vice President of Platform Engineering Raffi Krikorian shared in a blog post that the company's infrastructure handled almost
143,000 tweets per second during that week, setting a new record. Krikorian explained that Twitter achieved this record by blending its homegrown and open-source technologies.
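The Snowflake IDs mentioned above are what let Twitter mint unique, roughly time-ordered tweet IDs across many machines without a central counter. The Python sketch below decodes the commonly documented 64-bit layout (a 41-bit millisecond timestamp since Twitter's custom epoch, 10 bits of machine identity, and a 12-bit per-millisecond sequence); the exact field widths and epoch are assumptions taken from public descriptions of Snowflake, not from this text, and the example ID is made up.

    from datetime import datetime, timezone

    TWEPOCH_MS = 1288834974657  # Twitter's custom epoch (publicly documented value)

    def decode_snowflake(snowflake_id):
        """Split a 64-bit Snowflake ID into its timestamp, machine, and sequence fields."""
        timestamp_ms = (snowflake_id >> 22) + TWEPOCH_MS   # top 41 bits (after the sign bit)
        machine_id   = (snowflake_id >> 12) & 0x3FF        # next 10 bits
        sequence     = snowflake_id & 0xFFF                # low 12 bits
        return {
            "created_at": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
            "machine_id": machine_id,
            "sequence": sequence,
        }

    # Example with an arbitrary, made-up ID:
    print(decode_snowflake(367597094185779200))

Because the timestamp sits in the high bits, sorting tweets by ID also sorts them roughly by creation time, which is convenient for timelines and for the Firehose consumers mentioned above.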
Interface On April 30, 2009, Twitter adjusted its web interface, adding a search bar and a sidebar of "trending topics" — the most common phrases appearing in messages.
Biz Stone explains that all messages are instantly indexed and that "with this newly launched feature, Twitter has become something unexpectedly important – a
discovery engine for finding out what is happening right now." In March 2012, Twitter became available in Arabic, Farsi, Hebrew and Urdu, the first right-to-left
language versions of the site. About 13,000 volunteers helped with translating the menu options. It is now available in 33 different languages.
Outages When Twitter experiences an outage, users see the "fail whale" error message image created by Yiying Lu, illustrating eight orange birds using a net to hoist a whale
from the ocean captioned "Too many tweets! Please wait a moment and try again." Twitter had approximately ninety-eight percent uptime in 2007 (or about six full
days of downtime). The downtime was particularly noticeable during events popular with the technology industry, such as the 2008 Macworld Conference & Expo keynote.
Privacy and security Twitter messages are public but users can also send private messages. Twitter collects personally identifiable information about its users and shares it
with third parties. The service reserves the right to sell this information as an asset if the company changes hands. While Twitter displays no advertising, advertisers
can target users based on their history of tweets and may quote tweets in ads directed specifically to the user. A security vulnerability was reported on April 7, 2007,
by Nitesh Dhanjani and Rujith. Since Twitter used the phone number of the sender of an SMS message as authentication, malicious users could update someone
else's status page by using SMS spoofing. The vulnerability could be used if the spoofer knew the phone number registered to their victim's account. Within a few
weeks of this discovery Twitter introduced an optional personal identification number (PIN) that its users could use to authenticate their SMS-originating messages.
On January 5, 2009, 33 high-profile Twitter accounts were compromised after a Twitter administrator's password was guessed by a dictionary attack. Falsified tweets —
including sexually explicit and drug-related messages — were sent from these accounts. Twitter launched the beta version of their "Verified Accounts" service on
June 11, 2009, allowing famous or notable people to announce their Twitter account name. The home pages of these accounts display a badge indicating their status.
In May 2010, a bug discovered by İnci Sözlük users allowed Twitter users to force others to follow them without the other users' consent or knowledge. For
example, comedian Conan O'Brien's account, which had been set to follow only one person, was changed to receive nearly 200 malicious subscriptions.
In response to Twitter's security breaches, the US Federal Trade Commission brought charges against the service which were settled on June 24, 2010. This was the first time
the FTC had taken action against a social network for security lapses. The settlement requires Twitter to take a number of steps to secure users' private information,
including maintenance of a "comprehensive information security program" to be independently audited biannually.
On December 14, 2010, the US Department of Justice issued a subpoena directing Twitter to provide information for accounts registered to or associated with WikiLeaks. Twitter decided to notify its users
and said "...it's our policy to notify users about law enforcement and governmental requests for their information, unless we are prevented by law from doing so."
Open source Twitter has a history of both using and releasing open source software while overcoming technical challenges of their service. A page in their developer
documentation thanks dozens of open source projects which they have used, from revision control software like Git to programming languages such as Ruby and
Scala. Software released as open source by the company includes the Gizzard Scala framework for creating distributed datastores, the distributed graph database
FlockDB, the Finagle library for building asynchronous RPC servers and clients, the TwUI user interface framework for iOS, and the Bower client-side package
manager. The popular Twitter Bootstrap web design library was also started at Twitter and is the most popular repository on GitHub.
Innovators Patent Agreement: On April 17, 2012, Twitter announced that it would implement an “Innovators Patent Agreement,” which obligates Twitter to use its patents only for defensive purposes.
URL shortener t.co is a URL shortening service created by Twitter. It is only available for links posted to Twitter and not available for general use. All links posted to
Twitter use a t.co wrapper. Twitter hopes that the service will be able to protect users from malicious sites, and will use it to track clicks on links within tweets.
Having previously used the services of the third parties TinyURL and bit.ly, Twitter began experimenting with its own URL shortening service for private messages in March
2010 using the twt.tl domain, before it purchased the t.co domain. The service was tested on the main site using the accounts @TwitterAPI, @rsarver, and @raffi. On
September 2, 2010, an e-mail from Twitter to users said it would be expanding the rollout of the service. On June 7, 2011, Twitter began rolling out the feature.
Integrated photo-sharing service On June 1, 2011, Twitter announced its own integrated photo-sharing service that enables users to upload a photo and attach it to a Tweet
right from Twitter.com. Users now also have the ability to add pictures to Twitter's search by adding hashtags to the tweet. Twitter also plans to provide photo
galleries designed to gather and syndicate all photos that a user has uploaded on Twitter and third-party services such as TwitPic.
Use and social impact Dorsey said after a Twitter Town Hall with Barack Obama held in July 2011, that Twitter received over 110,000 #AskObama tweets.
Main article: Twitter usage: Twitter has been used for a variety of purposes in many industries and scenarios. For example, it has been used to organize protests, sometimes
referred to as "Twitter Revolutions", which include the Egyptian revolution, 2010–2011 Tunisian protests, 2009–2010 Iranian election protests, and 2009 Moldova
civil unrest. The governments of Iran and Egypt blocked the service in retaliation. The Hill on February 28, 2011 described Twitter and other social media as a
"strategic weapon ... which have the apparent ability to re-align the social order in real time, with little or no advanced warning." During the Arab Spring in early
2011, the number of hashtags mentioning the uprisings in Tunisia and Egypt increased. A study by the Dubai School of Government found that only 0.26% of the
Egyptian population, 0.1% of the Tunisian population and 0.04% of the Syrian population are active on Twitter.
The service is also used as a form of civil disobedience: in 2010, users expressed outrage over the Twitter Joke Trial by making obvious jokes about terrorism; and in the
British privacy injunction debate in the same country a year later, where several celebrities who had taken out anonymised injunctions, most notably the Manchester
United player Ryan Giggs, were identified by thousands of users in protest against the censoring of traditional journalism.
Another, more real-time and practical use for Twitter is as an effective de facto emergency communication system for breaking news. It was neither intended nor designed
for high performance communication, but the idea that it could be used for emergency communication certainly was not lost on the originators, who knew that the
service could have wide-reaching effects early on when the San Francisco, California company used it to communicate during earthquakes. The Boston Police
tweeted news of the arrest of the 2013 Boston Marathon Bombing suspect. A practical use being studied is Twitter's ability to track epidemics, how they spread.
Twitter has been adopted as a communication and learning tool in educational settings mostly in colleges and universities. It has been used as a backchannel to promote student
interactions, especially in large-lecture courses. Research has found that using Twitter in college courses helps students communicate with each other and faculty,
promotes informal learning, allows shy students a forum for increased participation, increases student engagement, and improves overall course grades.
In May 2008, The Wall Street Journal wrote that social networking services such as Twitter "elicit mixed feelings in the technology-savvy people who have been their early
adopters. Fans say they are a good way to keep in touch with busy friends. But some users are starting to feel 'too' connected, as they grapple with check-in messages
at odd hours, higher cellphone bills and the need to tell acquaintances to stop announcing what they're having for dinner."
Television, rating Twitter is also increasingly used for making TV more interactive and social. This effect is sometimes referred to as the "virtual watercooler" or social
television — the practice has been called "chatterboxing".
Statistics – Most popular accounts: As of July 28, 2013, the ten accounts with the most followers belonged to the following individuals and organizations:
1. Justin Bieber (42.2 million followers worldwide)
2. Katy Perry (39.9m)
3. Lady Gaga (39.2m)
4. Barack Obama (34.5m) - most-followed politician
5. Taylor Swift (31.4m)
6. Rihanna (30.7m)
7. YouTube (31m) - highest-ranked account not representing an individual
8. Britney Spears (29.7m)
9. Instagram (23.6m)
10. Justin Timberlake (23.3m)
Other selected accounts: 12. Twitter (21.6m); 16. Cristiano Ronaldo (20m) - highest-ranked athlete; 58. FC Barcelona (9.5m) - highest-ranked sports team
Oldest accounts: The 14 oldest accounts belonged to Twitter employees at the time, including @jack (Jack Dorsey), @biz (Biz Stone), and @noah (Noah Glass).
Record tweets On February 3, 2013, Twitter announced that a record 24.1 million tweets were sent the night of Super Bowl XLVII.
Future Twitter emphasized its news and information-network strategy in November 2009 by changing the question asked to users for status updates from "What are you
doing?" to "What's happening?" On November 22, 2010, Biz Stone, a cofounder of the company, expressed for the first time the idea of a Twitter news network, a
concept of a wire-like news service he has been working on for years.
The dark side of Big Data involves much more than Snowden's disclosure, or what the US does. And what made Big
Data possible did not happen overnight. The term has been around for at least 15 years, though it's only recently become popular.
"It will be quite transformational," says Thomas Davenport, an information technology expert at Babson College in Wellesley, Mass., who cowrote the widely used book "Competing on Analytics: The New Science of Winning." Going back to the beginning.
Big Data starts with ... a lot of data. Google executive chairman Eric Schmidt has said that we now uncover as much data in 48 hours – 1.8
zettabytes (that's 1,800,000,000,000,000,000,000 bytes) – as humans gathered from "the dawn of civilization to the year 2003." You read
that right. The head of a company receiving 50 billion search requests a day believes people now gather in a few days more data than
humans have done throughout almost all of history. Mr. Schmidt's claim has doubters. But similar assertions crop up from people not
prone to exaggeration, such as Massachusetts Institute of Technology researcher Andrew McAfee and MIT professor Erik Brynjolfsson,
authors of the new book "Race Against the Machine." "More data crosses the Internet every second," they write, "than were stored in the
entire Internet 20 years ago."
A key driver of the growth of data is the way we've digitized many of our everyday activities, such as shopping (increasingly done online) or
downloading music. Another factor: our dependence on electronic devices, all of which leave digital footprints every time we send an email, search online, post a message, text, or tweet. Virtually every institution in society, from government to the local utility, is churning
out its own torrent of electronic digits – about our billing records, our employment, our electricity use. Add in the huge array of sensors
that now exist, measuring everything from traffic flow to the spoilage of fruit during shipment, and the world is awash in information that
we had no way to uncover before – all aggregated and analyzed by increasingly powerful computers.
Most of this data doesn't affect us. Amassing information alone doesn't mean it's valuable. Yet the new ability to mine the right information,
discover patterns and relationships, already affects our everyday lives. Anyone, for instance, who has a navigation screen on a car
dashboard uses data streaming from 24 satellites 11,000 miles above Earth to pinpoint his or her exact location. People living in Los
Angeles and dozens of other cities now participate, knowingly or not, in the growing phenomenon of "predictive policing" – authorities'
use of algorithms to identify crime trends. Tennis fans use IBM SlamTracker, an online analytic tool, to find out exactly how many return
of serves Andy Murray needed to win Wimbledon. When we use sites like SlamTracker, companies take note of our browsing habits and,
through either the miracle or the meddling of Big Data, use that information to send us personal pitches. That's what happens when AOL
greets you with a pop-up ad (Slazenger tennis balls – 70 percent off!).
In their book, "Big Data: A Revolution That Will Transform How We Live, Work, and Think," Kenneth Cukier and Viktor Mayer-Schönberger
mention Wal-Mart's discovery, gleaned by mining sales data, that people preparing for a hurricane bought lots of Pop-Tarts. Now, when a
storm is on the way, Wal-Mart puts Pop-Tarts on the shelves next to the flashlights. But what excites and concerns people about Big Data
is more far-reaching than that. Seeing the bigger picture: taking a closer look at some of the people in the digital trenches.
I follow Mandelbaum and Mr. Youkel down a corridor of the Library of Congress, past exhibits redolent of history and what you might expect
from what we call "America's library," with its 38 million books on 838 miles of shelving.
They open a door. We pass behind people staring at huge computer screens and enter a room that doesn't look as if it belongs in a library at all. It's
the size of a gym, with fluorescent lights overhead and tall metal boxes rising from the floor.
"The tweets come here," Mandelbaum says. It's been three years since Twitter approached the library with a question. What the online networking
service started in 2006 had become a new way of communicating. Would there, Twitter asked, be historical value in archiving tweets?
"We saw the value right away," says Robert Dizard, deputy director of the library. "[Our] mission is, preserve the record of America."
Arnold Lund is looking ahead. Lund has a Ph.D. in experimental psychology. He holds 20 patents, has written a book on managing technology
design, and directs a variety of projects for General Electric. Last year, a tree fell on power lines behind my house. As the local utility
repaired things, an electrical surge crashed my computer, destroying all the contents. Lund's power line project has my attention. "For
power companies, one of the largest expenses is managing foliage," he says. "We lay out the entire geography of a state – and the overlay
of the power grid. We use satellite data to look at tree growth and cut back where there's most growth. Then [we] predict where the most
likely [problem] is. We have 50 different variabilities to see the probability of outage."
In that one compressed paragraph, I see three big changes Mr. Cukier and Mr. Mayer-Schönberger say Big Data brings to research. It's what we
might call the three "nots."
Size, not sample. For more than a century, statisticians have relied on small samples of data from which to generalize. They had to. They lacked
the ability to collect more. The new technology means we can "collect a lot of data rather than settle for ... samples."
Messy, not meticulous. Traditionally, researchers have insisted on "clean, curated data. When there was not that much data around, researchers
[had to be as] exact as possible." Now, that's no longer necessary. "Accept messiness," they write, arguing that the benefits of more data
outweigh our "obsession with precision."
Correlation, not cause. While knowing the causes behind things is desirable, we don't always need to understand how the world works "to get
things done," they note.
Lund's lab exemplifies all three. First, his "entire geography" and 50 variables involve massive sets of data – information streaming in from
sensors, satellites, and other sources about everything from forest density to prevailing wind direction to grid loads. Second, he looks for
"probability" not "obsessive precision." Correlation? Lund values cause, but the reason behind, say, tree growth interests him less than
spotting correlations that might spur action. "Ah – that tree," he exclaims, as if he is an engineer in the field. "Better get the trucks out
ahead of the storm!"
Cukier and Mayer-Schönberger cite the United Parcel Service to bolster their argument about correlation. UPS equips its trucks with sensors that
identify vibrations and other things associated with breakdowns. "The data do not tell UPS why the part is in trouble. They reveal enough
for the company to know what to do."
Lund's boss, GE chief executive officer Jeff Immelt, also talks about sensor data. The company is now investing $1 billion in software and
analytics, which includes putting sensors on its jet engines to help enhance fuel efficiency. Mr. Immelt has said that just a 1 percent
change in "fuel burn" can be worth hundreds of millions of dollars to an airline. "You save an oil guy 1 percent," Immelt said at a
conference this spring, "you're his friend for life." While Lund has talked glowingly about how much data his projects can collect, he
wants to make sure I know data isn't everything. "As a scientist," he says, "I know the biggest challenge is finding the right questions.
How do you find the questions important to business, society, and culture?"
Rothman has questions, too. "We work in emergency rooms," he says about himself and Dr. Dugas. "We're the boots on the ground."
Rothman's work has involved emergency medicine and the nexus between public health and epidemics, including influenza, which kills as many
as 500,000 people a year around the world and about 45,000 in the US.
The two researchers wanted to find out if the Google national study held lessons for Baltimore and their emergency room (ER). They studied
Google queries for the Baltimore area – queries about flu symptoms, or chest congestion, or where to buy a thermometer. If they could
spot spikes, that might help solve one crucial problem. "Crowding," Dugas says. "Huge issue."
When epidemics start, people rush to hospitals. Waiting rooms fill up. If Google trends showed a spike just as epidemics started, ERs could staff
up and reserve more space for the surge of patients.
The link between Google spikes and hospital visits in Baltimore turned out to be strong, especially for children. As soon as the first news reports
surfaced about the 2009 H1N1 virus, pediatric ER visits at Hopkins increased – at the peak by as much as 75 percent. But when the two
researchers looked closer, they found something unexpected. No flu. It turned out that news reports about H1N1 elsewhere fueled a rush
to ERs in Baltimore – what one researcher called "fear week." "If you just looked at correlation for flu, you'd say it was a false trend,"
says Dugas. Even so, she and Rothman found data important for ERs: No matter why people are coming, they need to staff up.
The Baltimore study also showed the importance of finding out what was behind all those medically related Google searches – in other words, not
just correlation but cause. Like GE's Lund, Rothman emphasizes the value of "the questions you're asking."
Evidence that Big Data promises enormous benefits is more than anecdotal. MIT's Mr. Brynjolfsson did a study in 2012 examining 179
companies. He found those whose decisions were "data-driven" had become 5 to 6 percent more productive in ways only the use of data
could explain. On the other hand, consider just this one data point: If you type "Big Data Dark Side" into Google, you'll get 40 million
results. Despite the potential, there's also peril.
The dark side of Big Data concerns Laura DeNardis, Internet scholar, author of three books, and professor at American University's school of
communication in Washington. She and others worry – not exclusively – about three questions. Does the new technology (1) erode
privacy, (2) promote inequality, and (3) turn government into Big Brother?
She points to public health data as one potential source of abuse. Her concern echoes that of critics who fear that supposedly anonymous patient
records are not anonymous at all. As far back as the 1990s, a Massachusetts state commission gave researchers health data about state
workers, believing this would help officials make better health-care decisions. William Weld, then governor of Massachusetts, assured
workers their files had been scrubbed of the data that could identify them.
A Harvard University computer science graduate student took this promise of privacy as a challenge.
Using just three bits of data, Latanya Sweeney showed how to identify everyone – including Weld, whose diagnoses, medications, and entire
medical history Ms. Sweeney, now a professor at Harvard, gleefully sent to his office.
Today there are powerful ways to identify people from records supposed to keep things private. And there are concerns other than our health
records. Dr. DeNardis worries about how much companies know about our social media habits.
"Take a look at the published privacy policies of Apple, Facebook, or Google," she says. "They know what you view, when you make a call,
where you are. People consent to that by 'I agree' to privacy terms. But how carefully are they read?"
She's not alone. Jay Stanley of the American Civil Liberties Union describes one example of what companies can do with what they know about
us: "credit-scoring." "Credit card companies," he wrote in a blog, "sometimes lower a customer's credit card limit based on the repayment
history of other customers at stores where a person shops." Do we want MasterCard to lower our credit-card limits, thinking we're a risk,
just because people who frequent the stores we do don't pay their bills?
In addition to individual privacy, critics worry about Big Data's impact in more expansive ways, such as the growing gap between rich and poor
nations. Large American companies can hire hundreds of data analysts. How can Bangladesh compete? Will this aggravate the global
digital divide? Perhaps most worrisome to people at the moment is the government's use of Big Data to monitor its own citizens, or
others, in the name of national security. "The American people," President Obama said after the NSA story broke, "don't have a Big Brother
who is snooping into their business."
Did Obama mean George Orwell's term doesn't include governments secretly monitoring calls, e-mails, audio, and video of citizens suspected of
nothing? Commandeering information from firms like Yahoo and Google? The questions that arose from Snowden's revelations in June
encompass issues of privacy, confidentiality, freedom, and, of course, security. The Obama administration argues that monitoring
personal info keeps the country safe, asserting that PRISM has helped foil 54 separate terrorist plots against the US.
Some lawmakers on Capitol Hill dispute that number, though, and in recent weeks momentum has been building in Washington to rein in the
NSA. Not only has support increased on the left and right to adopt more oversight of its surveillance program, polls show a hardening of
public opinion about snooping, too. Meanwhile, there is no doubt about the fury in other countries when the news broke – especially in
Germany, where critics have compared American monitoring of foreigners' phone calls and e-mails with that of Stasi, the former hated
East German secret police. In fact, some of those most upset about the NSA revelations include Americans alarmed about what the new
technology means outside US borders. Suzanne Nossel, head of the PEN American Center, which works to free writers and artists around
the world imprisoned for free speech, worries about the government use of data from private companies to stifle dissent. "It's not new,"
she says, citing the Chinese dissident Shi Tao, imprisoned by China in 2004 for posting political commentary on foreign websites, and
still locked up. "Yahoo China had assisted the Chinese government. They used [Yahoo data] to convict him." But then Ms. Nossel talks
about the recent unrest in Turkey, where the Turkish military shot and arrested dozens of protesters in Istanbul's Taksim Square. To find
more of what they called "looters," the Turkish government went to Twitter and Facebook for help – and announced that Facebook was
"responding positively," something Facebook has denied. And Nossel sees a difference between 2004 and now. Talking about the most
repressive governments in the world, she argues that "the government ability to sweep and search is [now] so great, it tips the scale. No
technology on the side of human rights advocates can confront it. That's new – and chilling."
What have we learned? There's a notable "Sesame Street" episode from years back in which Cookie Monster wanders into a library and drives the
librarian crazy by asking over and over for a cookie. "This is a LIBRARY!" the librarian finally screams, forgetting to whisper. "We have
books! Just books!" That's certainly been our image of what libraries do. "You can still find books here," Mandelbaum reminds me,
standing in a room full of processors. But figures over the past decade seem to show that books – those rectangular things with pages we
turn – are slowly on the way out in the Digital Age. That's less significant than it might seem, though. After all, we value books because
of the knowledge they hold. We've changed the way we convey knowledge many times. Big Data is another source of knowledge. Will it
become a more integral part of tomorrow's libraries? It is perhaps fitting that one of the "Sesame Street" characters most in tune with the
future is ... the Count. He counts everything. His role is to teach kids the importance of counting. Big Data allows us to count everything
– and analyze what we find. But are numbers enough? Brynjolfsson and Mr. McAfee compare Big Data to Leeuwenhoek's development
of the microscope in the 1670s. They are, after all, both tools. They let people see lots of things that have always been around. Of course,
the microscope also prompted us to ask questions we could never ask before. Big Data does that, too.
Still, while Big Data can predict a flu outbreak or where trees fall, it can't, by itself,
resolve the economic and moral dilemmas we have. Whether to keep power running, help patients faster, or preserve the record of
America, Big Data teaches us what's out there, not what's right.
There's nothing inherently wrong with Big Data. What matters, as it does for Arnold Lund in California or Richard Rothman in Baltimore, are the
questions – old and new, good and bad – this newest tool lets us ask. • Robert A. Lehrman is a novelist and former White House chief
speechwriter for Vice President Al Gore. Author of 'The Political Speechwriter's Companion,' he teaches at American University and co-runs a blog, PunditWire.
Hadoop/MapReduce
WPI, Mohamed Eltabakh
MapReduce computing paradigm (e.g., Hadoop) vs. Traditional DBMS
Many enterprises are turning to Hadoop, especially for applications generating big data: web applications, social networks, scientific applications
Hadoop is designed as a master-slave, shared-nothing architecture: one master node (single node) and many slave nodes
Why is Hadoop able to compete?
Hadoop:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Efficient and simple fault-tolerance mechanism
- Commodity, inexpensive hardware
Databases:
- Performance (lots of indexing, tuning, and data-organization techniques)
- Features such as provenance tracking and annotation management
Hadoop: software for distributed processing of large datasets across large clusters
- Large datasets: terabytes or petabytes of data
- Large clusters: hundreds or thousands of nodes
Hadoop is an open-source implementation of Google MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model; any data will fit
The Hadoop framework consists of two main layers: the distributed file system (HDFS) and the execution engine (MapReduce)
DESIGN PRINCIPLES
- Need to process big data
- Need to parallelize computation across ~1000 nodes
- Commodity hardware: many low-end, cheap machines work in parallel to solve the problem (in contrast to parallel DBs, which use a small number of high-end, expensive machines)
- Automatic parallelization & distribution: hidden from the end user
- Fault tolerance and automatic recovery: nodes/tasks fail and recover automatically
- Clean and simple programming abstraction: users only provide two functions, map and reduce
WHO USES IT?
- Google: inventors of the MapReduce computing paradigm
- Yahoo: developed Hadoop, the open-source implementation of MapReduce
- IBM, Microsoft, Oracle
- Facebook, Amazon, AOL, Netflix
Hadoop Architecture
Two main layers: the distributed file system (HDFS) and the execution engine (MapReduce)
Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
Replication: each data block is replicated many times (default is 3)
Failure: failure is the norm rather than the exception
Fault tolerance: detection of faults and quick, automatic recovery is a goal of HDFS; the Namenode is constantly checking the Datanodes
Hadoop Distributed File System (HDFS)
- Master node (single node): a centralized namenode that maintains metadata info about files
- Many slave nodes: thousands of datanodes that store the actual data; files are divided into blocks (64 MB, e.g., a file F split into blocks 1-5), and each block is replicated N times (a per-file sketch of these settings follows below)
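As a sketch of how the block size and replication factor described above can be set per file (not part of the original slides), the following uses the HDFS Java FileSystem API with an illustrative path and illustrative values:

// Sketch: creating a file with an explicit replication factor and block size
// through the HDFS Java API. The path and values are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // reads core-site/hdfs-site config
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/data.bin");    // hypothetical path
    short replication = 3;                          // default replication factor
    long blockSize = 64L * 1024 * 1024;             // 64 MB blocks, as above
    try (FSDataOutputStream out =
             fs.create(file, true, 4096, replication, blockSize)) {
      out.writeUTF("hello HDFS");                   // data is split into blocks,
    }                                               // each replicated by the NameNode
    fs.close();
  }
}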
Map-Reduce Execution Engine (Example: Color Count)
[Figure: input blocks on HDFS feed four map tasks; each map produces (k, v) pairs such as (color, 1); a parse-hash step routes keys and shuffle & sorting based on k groups them; each of three reduce tasks consumes (k, [v]) such as (color, [1,1,1,1,1,1,...]) and produces (k', v') such as (color, 100).]
Users only provide the "Map" and "Reduce" functions
Properties of MapReduce Engine
Job Tracker is the master node (runs with the namenode)
- Receives the user's job
- Decides how many tasks will run (number of mappers)
- Decides where to run each mapper (concept of locality); e.g., a file with 5 blocks → run 5 map tasks
Task Tracker is the slave node (runs on each datanode)
- Receives the task from the Job Tracker
- Runs the task until completion (either a map or a reduce task)
- Always in communication with the Job Tracker, reporting progress
In the figure above, one map-reduce job consists of 4 map tasks and 3 reduce tasks
KEY VALUE PAIRS
Mappers and Reducers are users’ code (provided functions)
Just need to obey the Key-Value pairs interface
Mappers:
Consume <key, value> pairs
Produce <key, value> pairs
Reducers:
Consume <key, <list of values>>
Produce <key, value>
Shuffling and Sorting:
Hidden phase between mappers and reducers
Groups all similar keys from all mappers, sorts and passes them to a
certain reducer in the form of <key, <list of values>>
Deciding what will be the key and what will be the value is the developer's responsibility
Example 1: Word Count
Job: Count the occurrences of each word in a data set
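Below is a minimal sketch of this Word Count job against the standard org.apache.hadoop.mapreduce API; the class names and input/output paths are illustrative, not part of the original slides.

// Minimal WordCount sketch: mapper emits (word, 1), reducer sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: consumes <offset, line of text>, produces <word, 1> pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);                 // emit (word, 1)
      }
    }
  }

  // Reducer: consumes <word, [1, 1, 1, ...]>, produces <word, total count>.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);                 // emit (word, count)
    }
  }

  // Driver: submits the job; the framework creates one map task per input block.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);    // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}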
Example 2: Color Count
Job: Count the number of each color in a data set
[Figure: the same pipeline as above — input blocks on HDFS, four map tasks producing (k, v) pairs such as (color, 1), parse-hash and shuffle & sorting based on k, and three reduce tasks consuming (k, [v]) such as (color, [1,1,1,1,1,1,...]) and producing (k', v') such as (color, 100) — with the output written as Part0001, Part0002, and Part0003.]
That's the output file; it has 3 parts, probably on 3 different machines
Example 3: Color Filter
Job: Select only the blue and the green colors
[Figure: four map tasks, each writing its output directly to HDFS as Part0001-Part0004.]
Each map task selects only the blue or green colors; there is no need for a reduce phase
That's the output file; it has 4 parts, probably on 4 different machines
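A minimal sketch of such a map-only filter job follows, assuming a one-color-per-line input format (an assumption for illustration): setting the number of reduce tasks to zero makes each map task write its output directly to HDFS, which is why the output has as many parts as there are map tasks.

// Map-only filter sketch: with zero reduce tasks, each map task writes its
// output straight to HDFS, so 4 map tasks yield Part0001..Part0004.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorFilter {

  public static class FilterMapper
      extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String color = value.toString().trim();      // assumed: one color per line
      if (color.equals("blue") || color.equals("green")) {
        context.write(value, NullWritable.get());  // keep the matching record as-is
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "color filter");
    job.setJarByClass(ColorFilter.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);                      // no reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}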
Bigger Picture: Hadoop vs. Other Systems
Distributed Databases vs. Hadoop:
- Computing model: distributed databases use the notion of transactions (the transaction is the unit of work, with ACID properties and concurrency control); Hadoop uses the notion of jobs (the job is the unit of work, with no concurrency control)
- Data model: distributed databases handle structured data with a known schema, in read/write mode; in Hadoop, any data will fit in any format, (un)(semi)structured, in read-only mode
- Cost model: distributed databases run on expensive servers; Hadoop runs on cheap commodity machines
- Fault tolerance: in distributed databases failures are rare and there are recovery mechanisms; in Hadoop failures are common over thousands of machines, with simple yet efficient fault tolerance
- Key characteristics: distributed databases emphasize efficiency, optimizations, and fine-tuning; Hadoop emphasizes scalability, flexibility, and fault tolerance
Cloud Computing:
- A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user's demand
- Example: Amazon EC2
HDFS (Hadoop Distributed File System) is a distributed file system for commodity hardware. Its differences from other distributed file systems are few but significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is part of Apache Hadoop Core: http://hadoop.apache.org/core/
2.1. Hardware Failure
Hardware failure is the norm. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data.
There are many components, and each has a non-trivial probability of failure, which means that some component of HDFS is always non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
2.2. Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general
purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data
access rather than low latency of data access. POSIX imposes many hard requirements not needed for applications that are targeted for HDFS.
POSIX semantics in a few key areas has been traded to increase data throughput rates.
2.3. Large Data Sets
Apps on HDFS have large data sets, typically gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It provides high
aggregate data bandwidth and scales to hundreds of nodes in a single cluster. It supports ~10 million files in a single instance.
2.4. Simple Coherency Model:
HDFS apps need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption
simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly
with this model. There is a plan to support appending writes to files in the future [write once, read many, at the file level].
2.5. “Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the
size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is
often better to migrate the computation closer to where the data is located rather than moving the data to where the app is running. HDFS provides
interfaces for applications to move themselves closer to where the data is located.
2.6. Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform
to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
3. NameNode and DataNodes: HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node
in the cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, which are stored in a set of DataNodes.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the
mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode
The NameNode and DataNode are pieces of software designed to run on commodity machines, typically run GNU/Linux operating system (OS).
HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly
portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs
only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not
preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case. The existence of a single NameNode
in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is
designed in such a way that user data never flows through the NameNode.
4. The File System Namespace: HDFS supports a traditional hierarchical file organization. A user or an application can create directories and
store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove
files, move a file from one directory to another, or rename a file.
HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture
does not preclude implementing these features. The NameNode maintains the file system namespace. Any change to the file system namespace
or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS.
The number of copies of a file is called the replication factor of that file. This info is stored by NameNode.
5. Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of
blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and
replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all
decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode
5.3. Safemode: On startup, the NameNode enters a special state called
Safemode. Replication of data blocks does not occur when the
NameNode is in the Safemode state. The NameNode receives Heartbeat
and Blockreport messages from the DataNodes. A Blockreport contains
the list of data blocks that a DataNode is hosting. Each block has a
specified minimum number of replicas. A block is considered safely
replicated when the minimum number of replicas of that data block has
checked in with the NameNode. After a configurable percentage of
safely replicated data blocks checks in with the NameNode (plus an
additional 30 seconds), the NameNode exits the Safemode state. It then
determines the list of data blocks (if any) that still have fewer than the
specified number of replicas. The NameNode then replicates these
blocks to other DataNodes.
5.1. Replica Placement: The First Baby Steps: The placement of replicas is critical to HDFS reliability and performance. Optimizing replica
placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of
a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation
for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production
systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies. Large HDFS instances run on a
cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches.
In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Rack Awareness. A simple but non-optimal policy
is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when
reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy
increases the cost of writes because a write needs to transfer blocks to multiple racks. For the common case, when the replication factor is three,
HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a
different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack
failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the
aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the
replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the
other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or
read performance. The current, default replica placement policy described here is a work in progress.
5.2. Replica Selection: To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is
closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an
HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
6. The Persistence of File System Metadata: The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called
the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the
NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted
into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the
mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s
local file system too. The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item
is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the
NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of
the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been
applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the
NameNode starts up. Work is in progress to support periodic checkpointing in the near future. The DataNode stores HDFS data in files in its
local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system.
The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and
creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able
to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a
list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.
7. The Communication Protocols: All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a
connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the
NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol.
By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.
8. Robustness: The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are
NameNode failures, DataNode failures and network partitions.
8.1. Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a Heartbeat message to the NameNode periodically. A network
partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a
Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any
data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks
to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever
necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted,
a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
8.2. Cluster Rebalancing: The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to
another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might
dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
8.3. Data Integrity: It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a
storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files.
When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the
same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored
in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
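Purely as a conceptual illustration of this write-time/read-time checksum idea (not the actual HDFS client code), a sketch using java.util.zip.CRC32:

// Conceptual illustration only, not HDFS internals: compute a checksum when a
// block is written and verify it when the block is read back.
import java.util.zip.CRC32;

public class ChecksumSketch {

  // Checksum computed at write time and stored alongside the data.
  static long checksumOf(byte[] block) {
    CRC32 crc = new CRC32();
    crc.update(block, 0, block.length);
    return crc.getValue();
  }

  // At read time, recompute and compare; a mismatch means this copy is
  // corrupted and another replica should be tried.
  static boolean verify(byte[] blockFromDataNode, long storedChecksum) {
    return checksumOf(blockFromDataNode) == storedChecksum;
  }

  public static void main(String[] args) {
    byte[] block = "example block contents".getBytes();
    long stored = checksumOf(block);            // analogous to the hidden checksum file
    System.out.println(verify(block, stored));  // true unless the block was corrupted
  }
}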
8.4. Metadata Disk Failure: The FsImage and EditLog are central data structures. A corruption of these files can cause the HDFS instance to be
non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update
to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple
copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this
degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a
NameNode restarts, it selects the latest consistent FsImage and EditLog to use. The NameNode machine is a single point of failure for an HDFS
cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to
another machine is not supported.
8.5. Snapshots: Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a
corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.
9. Data Organization
9.1. Data Blocks: HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data
sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64
MB chunks, and if possible, each chunk will reside on a different DataNode.
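As a back-of-the-envelope sketch, assuming a 1 GB file, the 64 MB block size, and a replication factor of 3 (illustrative values):

// Blocks and replicas for a 1 GB file with 64 MB blocks and replication 3.
public class BlockMath {
  public static void main(String[] args) {
    long fileBytes = 1L << 30;                                   // 1 GB file
    long blockBytes = 64L << 20;                                 // 64 MB block size
    long blocks = (fileBytes + blockBytes - 1) / blockBytes;     // ceiling division
    int replication = 3;
    System.out.println("blocks: " + blocks);                     // 16
    System.out.println("stored replicas: " + blocks * replication); // 48
  }
}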
9.2. Staging: A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data
into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth
over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a
data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client
flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the
temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode
commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost. The above approach has
been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client
writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput
considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve
performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
9.3. Replication Pipelining: When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous
section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a
list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data
block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and
transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that
portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository.
Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline.
Thus, the data is pipelined from one DataNode to the next.
10. Accessibility: HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use.
A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance.
Work is in progress to expose HDFS through the WebDAV protocol.
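For example, a minimal sketch of reading an HDFS file through that native Java API (the path is illustrative and mirrors the FS shell example in the next section):

// Sketch: open a file in HDFS via the Java API and stream its contents,
// roughly equivalent to bin/hadoop dfs -cat /foodir/myfile.txt.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up the cluster config
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/foodir/myfile.txt");        // illustrative path
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);                      // print each line of the file
      }
    }
    fs.close();
  }
}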
10.1. FS Shell: HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called FS shell
that lets a user interact with the data in HDFS. Sample action/command pairs: Create a directory named /foodir: bin/hadoop dfs -mkdir /foodir. View the contents of a file named /foodir/myfile.txt:
bin/hadoop dfs -cat /foodir/myfile.txt. FS shell is targeted for applications that need a scripting language to interact with the stored data.
10.2. DFSAdmin: The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator.
Here are some sample action/command pairs: Put the cluster in Safemode: bin/hadoop dfsadmin -safemode enter. Generate a list
of DataNodes: bin/hadoop dfsadmin -report. Decommission DataNode datanodename: bin/hadoop dfsadmin -decommission datanodename.
10.3. Browser Interface: A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This
allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
11. Space Reclamation
11.1. File Deletes and Undeletes: When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS
first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a
configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file
causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a
user and the time of the corresponding increase in free space in HDFS. If a user wants to undelete a file that he/she has deleted, he/she can navigate
the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just
like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current
default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.
11.2. Decrease Replication Factor: When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted.
The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free
space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of
free space in the cluster.
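A minimal sketch of that setReplication call through the HDFS Java API, with an illustrative path and target value:

// Sketch: lower a file's replication factor; the NameNode then instructs the
// affected DataNodes (via later Heartbeats) to drop the excess replicas.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DecreaseReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/foodir/myfile.txt");              // illustrative file
    boolean accepted = fs.setReplication(file, (short) 2);   // e.g., reduce from 3 to 2
    System.out.println("replication change accepted: " + accepted);
    // Free space appears in the cluster only after the DataNodes actually
    // remove the corresponding blocks, so there may be a delay.
    fs.close();
  }
}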
12. References
HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
HDFS source code: http://hadoop.apache.org/core/version_control.html