
Chapter 1
Overview of Big Data Technology
Abstract: We are entering a "big data" era. Because of bottlenecks in the traditional information technology stack, such as poor scalability, difficult installation and maintenance, weak fault tolerance and low performance, we need to leverage cloud computing techniques and solutions to deal with big data problems. Cloud computing and big data are complementary to each other and form a dialectical unity. Breakthroughs in big data techniques will not only resolve the current situation, but also promote the wide application of cloud computing and Internet of Things techniques. We focus on the development and pivotal techniques of big data, and provide a comprehensive description of big data from several perspectives, including the development of big data, the current data explosion, the relationship between big data and cloud computing, and big data techniques themselves. Finally, we introduce related research and our current work.
Key words: big data techniques; cloud computing; data acquisition; data storage; data computation
Chapter Index
Chapter 1 Overview of Big Data Technology
1 The Background and Definition of Big Data
2 Big data problems
  2.1 Problems of speed
  2.2 Type and architecture problems
  2.3 Volume and flexibility problems
  2.4 Cost problems
  2.5 Value mining problems
  2.6 Storage and security problems
  2.7 Interoperability and data sharing issues
3 The dialectical relationship between cloud computing and big data
4 Big Data technology
  4.1 Infrastructure Supports
  4.2 Data acquisition
  4.3 Data Storage
  4.4 Data computing
    4.4.1 Offline batch
    4.4.2 Real-time interactive computing
    4.4.3 Streaming computing
  4.5 Data presentation and interaction
5 Related research and our work
6 Summary
References
Chinese references
1 The Background and Definition of Big Data
Nowadays, information technology has opened the door for humanity to step into the smart society. It has led to the development of modern services such as Internet e-commerce, modern logistics and e-finance, and has promoted emerging industries such as telematics, the smart grid, new energy, intelligent transportation, the smart city and high-end equipment manufacturing. Modern information technology is becoming the engine of the operation and development of all walks of life, but this engine is facing the huge test of big data [57]. All kinds of business data are exploding geometrically [1], and the problems of collection, storage, retrieval, analysis and application can no longer be solved by traditional information processing technology; this has become a great obstacle on the way to a digital, networked and intelligent society. The New York Stock Exchange produces 1TB of trading data every day, Twitter generates more than 7TB of data every day, Facebook produces more than 10TB of data every day, and the Large Hadron Collider at the European Organization for Nuclear Research produces about 15PB of data every year. According to an investigation by the well-known research firm IDC, the global information volume in 2007 was about 165EB; even in 2009, the year of the global financial crisis, it reached 800EB, an increase of 62 percent over the previous year. In the future, the data volume of the whole world will double every 18 months, reaching 35ZB in 2020, about 230 times the 2007 figure, whereas the written records of 5,000 years of human history amount to only about 5EB. These statistics indicate that the era of TB, PB and EB has become the past, and that global data storage is formally entering the "zetta era".
Beginning in 2009, "big data" became a buzzword in the Internet information technology industry. Most early applications of big data were in the Internet industry, where data grows by 50% per year, doubling every two years, and global Internet companies became aware of the advent of the "big data" era and the great significance of data. In May 2011 the McKinsey Global Institute published a report entitled "Big data: The next frontier for innovation, competition, and productivity" [2]; since its release, "big data" has been a hot concept in the computer industry. In April 2012 the Obama administration in the US launched the "Big Data Research and Development Initiative" [3] with a special fund of $200 million, setting off a wave of big data all over the world. According to the big data report released by Wikibon in 2011 [4], the big data market is on the eve of a growth spurt, and the global market value of big data will reach $50 billion within five years. At the beginning of 2012, total revenue from big-data-related software, hardware and services was only about $5 billion, but as companies gradually realize that big data and the related analytics can form a new differentiated competitive advantage and improve operational efficiency, big data techniques and services will develop considerably; big data will gradually take hold and keep a 58% compound annual growth rate over the next five years. Greg McDowell, an analyst with JMP Securities, said that the market for big data tools is expected to grow from $9 billion to $86 billion in ten years, and that by 2020 investment in big data tools will account for 11% of overall corporate IT spending. At present the industry does not have a unified definition of big data; it is commonly defined as follows:
"Big Data refers to datasets whose size is beyond the ability of typical database
software tools to capture, store, manage, and analyze.” - McKinsey.
"Big Data usually includes data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage, and process the data within a tolerable
elapsed time.” - Wikipedia
"Big Data is high volume, high velocity, and/or high variety information assets that
require new forms of processing to enable enhanced decision making, insight discovery
and process optimization.” - Gartner
Big Data has four characteristics: Volume, Velocity, Variety and Value [47], referred to as the "4Vs": a huge volume of data, fast processing speed, various data types and low value density. Brief descriptions of each characteristic follow.
Volume: refers to the huge amount of data involved. The scale of data sets keeps increasing, from GB to TB and on to the PB level, and is now even counted in EB and ZB. For instance, the video monitors of a medium-sized city can produce tens of TB of data every day.
Variety: indicates that the types of big data are complex. In the past, the data we generated or processed were relatively simple and mostly structured. Now, with the emergence of new channels and technologies such as social networking, the Internet of Things, mobile computing and online advertising, plenty of semi-structured and unstructured data are produced, such as XML, email, blogs and instant messages, resulting in a surge of new data types. Companies need to integrate and analyze data from complex traditional and non-traditional sources of information, including data internal and external to the company. With the explosive growth of sensors, smart devices and social collaboration technologies, the types of data are almost uncountable, including text, microblogs, sensor data, audio, video, click streams, log files and so on.
Velocity: the speed of data generation, processing and analysis continues to accelerate, driven by the real-time nature of data creation and by the demand to incorporate streaming data into business and decision-making processes. The required processing speed is high, and processing capacity is shifting from batch processing to stream processing. The industry describes the processing capability of big data with the "one-second rule", which captures that capability well and marks the essential difference from traditional data mining.
Value: because of the ever-growing scale, the value density per unit of data is constantly decreasing, while the overall value of the data is increasing; some even equate big data with gold and oil, indicating that big data contains unlimited commercial value. According to an IDC research report, the market for big data technology and services will rise from $3.2 billion in 2010 to $16.9 billion in 2015, an annual growth rate of 40%, about seven times the growth rate of the whole IT and communications industry. By processing big data and finding its potential commercial value, enormous commercial profits can be made. In specific applications, big data processing technology can provide technical and platform support for the pillar enterprises of the national economy: analyzing, processing and mining data for enterprises, extracting important information and knowledge, and transforming them into useful models applied to the processes of research, production, operation and sales. Meanwhile, the state strongly advocates the construction of the "smart city". In the context of urbanization and informatization, the focus is on improving people's livelihood, enhancing the competitiveness of enterprises and promoting the sustainable development of cities: comprehensively using the Internet of Things, cloud computing and other information technologies, combined with the city's existing information base and advanced concepts of urban operation and service, to establish a widely covering and deeply linked information network; to perceive the many factors of the city, such as resources, environment, infrastructure and industry; and to build a synergistic, shared urban information platform that processes and utilizes information intelligently, so as to provide intelligent response and control for the city's operations and the allocation of resources, to give government social management and public services an intelligent basis and means for decision making, and to offer intelligent information resources and an open information platform to enterprises and individuals in the regional informatization process.
Data are undoubtedly the cornerstone of new IT services and scientific research, and big data processing technology has naturally become a hot spot of today's information technology development; its flourishing heralds the arrival of another IT revolution. On the other hand, with the deepening of national economic restructuring and industrial upgrading, the role of information processing technologies will become increasingly prominent, and big data processing technology will become the best breakthrough for overtaking on the curve in core technology, for following development with breakthroughs in application, and for reducing dependence on foreign vendors in the information construction of the pillars of the national economy [16].
2 Big data problems
Big data has become an invisible "gold mine" because of the potential value it contains. With the accumulation and growth of data on production, operation, management, monitoring, sales, customer service and other aspects, as well as the rise in the number of users, analyzing the correlation patterns and trends in large amounts of data makes efficient management and precision marketing possible, and this can become the key to opening this "gold mine". However, traditional IT infrastructure and methods of data management and analysis cannot cope with the rapid growth of big data. We summarize the problems of big data in the seven categories listed in Table 1.
Table 1 Classification of big data problems

    Problems of big data                 Description of big data problems
    Speed                                Import and export problems; statistical analysis problems; query and retrieval problems; real-time response problems
    Type and structure                   Multi-source problems; heterogeneous data problems
    Volume and flexibility               The original system's infrastructure problems; linear scaling problems; dynamic scheduling problems
    Cost                                 Cost comparison between mainframes and small servers; controlling the cost of modifying the original system
    Value mining                         Data analysis and mining; the actual benefit obtained after data mining
    Storage and security                 Structured and unstructured data storage; data security; privacy security
    Interoperability and data sharing    Data standards and interfaces; sharing protocols; access permissions
2.1 Problems of speed
Traditional relational database management systems (RDBMS) generally use centralized storage and processing rather than a distributed architecture. In many large enterprises, configurations are often based on IOE (IBM servers, Oracle databases, EMC storage). In this typical configuration, a single server's specification is usually very high: there can be dozens of CPU cores and hundreds of GB of memory, and databases are stored on high-speed, large-capacity disk arrays whose storage space can reach the TB level. Such a configuration can meet the demands of a traditional Management Information System (MIS), but in the face of ever-growing data volumes and dynamic data usage scenarios this centralized approach becomes the bottleneck, especially because of its limited response speed. When importing and exporting large amounts of data, performing statistical analysis, or retrieving and querying data, its dependence on centralized data storage and indexing makes performance decline sharply as data volume grows, let alone in statistics and query scenarios that require real-time response. For instance, in the Internet of Things, sensor data can amount to billions of items that need real-time storage, query and analysis, so a traditional RDBMS is no longer suitable for such application requirements.
2.2 Type and architecture problems
RDBMS have formed relatively mature storage, query, statistics and processing approaches for data that are structured and have fixed schemas. With the rapid development of the Internet of Things, the Internet and mobile communication networks, the formats and types of data are constantly changing and evolving. In the field of intelligent transportation, the data involved may contain text, logs, pictures, videos, vector maps and other kinds of data from different monitoring sources, and their formats are usually not fixed; it would be difficult to respond to changing needs with a purely structured storage model. We therefore need to use various modes of data processing and storage, integrating structured and unstructured data stores, to handle data whose types, sources and structures vary. The overall data management model and architecture also require new types of distributed file systems and distributed NoSQL database architectures to adapt to the huge amount of data and its changing structures.
2.3 Volume and flexibility problems
As noted earlier, because of the huge volume and the adoption of centralized storage, big data suffers from speed and response problems. As the amount of data and the number of concurrent reads and writes grow larger and larger, a centralized file system or a single database becomes a fatal performance bottleneck, since a single machine can only withstand limited pressure. By adopting linearly scalable frameworks and methods, the pressure can be spread across many machines to a level that each machine can bear, so that we can dynamically increase or decrease the number of file or database servers according to the amount of data and the level of concurrency, and thus achieve linear scalability.
In terms of data storage, we need to adopt distributed, scalable architectures, such as the well-known Hadoop distributed file system [25] and the HBase database [27]. Meanwhile, in processing the data we also need a distributed architecture that assigns data processing tasks to many computing nodes, and we need to consider the affinity between the data storage nodes and the compute nodes. In the computing field, the allocation of resources and tasks is essentially a task scheduling problem. Its main job is to make the best match between resources and jobs, or among tasks, on the basis of the resources (CPU, memory, storage, network, etc.) of each node of the current cluster and the quality-of-service requirements of each user's jobs. Since users' quality-of-service requirements are diverse and the state of resources keeps changing, finding the appropriate resources for distributed data processing is a dynamic scheduling problem.
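To make the matching step concrete, the following is a minimal, illustrative Java sketch (not taken from any particular scheduler) of a greedy assignment of tasks to the node with the most free memory that still has enough capacity; real schedulers such as Hadoop YARN additionally weigh data locality, queues and quality-of-service constraints. The node and task classes are invented for the example.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical, simplified models of cluster nodes and tasks, for illustration only.
    class Node {
        String name; int freeCpu; int freeMemMb;
        Node(String name, int freeCpu, int freeMemMb) {
            this.name = name; this.freeCpu = freeCpu; this.freeMemMb = freeMemMb;
        }
    }

    class Task {
        String id; int cpu; int memMb;
        Task(String id, int cpu, int memMb) { this.id = id; this.cpu = cpu; this.memMb = memMb; }
    }

    public class GreedyScheduler {
        // Assign each task to the node with the most free memory that still fits it.
        static void schedule(List<Task> tasks, List<Node> nodes) {
            for (Task t : tasks) {
                nodes.stream()
                     .filter(n -> n.freeCpu >= t.cpu && n.freeMemMb >= t.memMb)
                     .max(Comparator.comparingInt(n -> n.freeMemMb))
                     .ifPresentOrElse(n -> {
                         n.freeCpu -= t.cpu;          // reserve resources on the chosen node
                         n.freeMemMb -= t.memMb;
                         System.out.println(t.id + " -> " + n.name);
                     }, () -> System.out.println(t.id + " -> queued (no capacity)"));
            }
        }

        public static void main(String[] args) {
            List<Node> nodes = new ArrayList<>(List.of(
                    new Node("node-1", 8, 16384), new Node("node-2", 4, 8192)));
            List<Task> tasks = List.of(
                    new Task("t1", 2, 4096), new Task("t2", 4, 8192), new Task("t3", 8, 32768));
            schedule(tasks, nodes);
        }
    }

In a real cluster this assignment would be recomputed continuously as resource states and job requirements change, which is what makes the problem a dynamic scheduling problem rather than a one-off matching.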
2.4 Cost problems
For centralized data storage and processing, the basic approach when selecting hardware and software is to use mainframes or minicomputer servers with very high specifications, together with highly secure, high-speed disk arrays, to guarantee data processing performance. These hardware devices are very expensive, frequently costing up to several million dollars, and on the software side the products of large foreign vendors such as Oracle, IBM, SAP and Microsoft are often adopted; the maintenance of servers and databases also requires professional technical personnel, so investment and operating costs are high. Facing the challenge of massive data processing, these companies have also introduced monster "all-in-one" solutions such as Oracle's Exadata and SAP's HANA, which stack multiple servers, massive memory, flash memory, high-speed networks and other hardware to relieve the data pressure; however, the hardware cost is significantly higher, and ordinary enterprises can hardly bear it.
The new distributed storage architectures and distributed databases such as HDFS, HBase, Cassandra [28] and MongoDB [29] do not have the bottlenecks of centralized data processing and aggregation, since they use a decentralized, massively parallel processing (MPP) architecture; together with linear scalability, they can deal effectively with the storage and processing problems of big data. In their software architecture they also implement self-managing and self-healing mechanisms to cope with occasional failures among massive numbers of nodes and to protect the robustness of the overall system, so the hardware configuration of each node need not be high; even an ordinary PC can be used as a server, greatly reducing server costs, and on the software side open source software also gives a very large price advantage.
Of course, we cannot simply compare the cost of hardware and software. To migrate systems and applications to the new distributed architecture, a lot of adjustments have to be made from the underlying platform up to the applications; in particular, the database schema and application programming interfaces of NoSQL databases differ greatly from those of the original RDBMS, so enterprises need to assess the cost, time and risk of migration and development. They also need to consider the costs of services, training, operation and maintenance. The general trend, however, is that as these new data architectures and products become more mature and complete, and as commercial companies provide professional database development and consulting services based on open source, the new distributed, scalable database architectures are bound to win in the big data wave and to beat the traditional centralized mainframe model in both cost and performance.
2.5 Value mining problems
Because of its huge and growing volume, the value density per unit of big data is constantly decreasing while its overall value is steadily increasing; big data is often compared to oil and gold, and we can mine its huge business value [54]. To extract the hidden patterns from large amounts of data, deep data mining and analysis is needed. Big data mining is also quite different from the traditional data mining model: traditional data mining generally deals with moderate volumes of data, with relatively complex algorithms and slow convergence, while in the big data area the quantity of data is massive, and the processes of data storage, data cleaning and ETL (extraction, transformation, loading) have to meet the demands and challenges of massive data, which suggests the use of distributed parallel processing. For example, the search engines of Google and Microsoft need hundreds or even thousands of servers working in parallel to archive the search logs generated by the search behavior of billions of users worldwide. When mining the data, we also need to adapt traditional data mining algorithms and their underlying processing architectures, and introducing parallel processing mechanisms is a promising way to speed up massive data computation and analysis; Apache's Mahout project [30] provides a series of parallel implementations of data mining algorithms. In many application scenarios real-time feedback of results is even needed, which presents the system with a huge challenge, since data mining algorithms usually take a long time, especially on huge amounts of data. In such cases perhaps only a combination of real-time computation and large-scale offline processing can meet the demand.
The actual gain of data mining is an issue that needs to be carefully assessed before mining big data's value, for not every data mining program will achieve the desired result. Firstly, we need to guarantee the authenticity and completeness of the data; for example, if the collected information itself carries a lot of noise, or some key data are not included, the value that can be dug out will be undermined. Secondly, we also need to weigh the costs and benefits of the mining: if the investment in manpower and in the hardware and software platform is high and the project cycle is long, but the information extracted is of little value to the enterprise's production decisions or cost-effectiveness, then the data mining is impractical and not worth the candle.
2.6 Storage and security problems
In the storage and security of big data, its changeable formats and huge volume also bring a lot of challenges. For structured data, relational database management systems (RDBMS) have, after decades of development, become rather mature in storage, access, security and backup control. The huge volume of big data nevertheless has an impact on the traditional RDBMS; as mentioned, centralized data storage and processing are shifting to distributed parallel processing. In most cases big data are unstructured, which has given rise to many distributed file storage systems and distributed NoSQL databases for this kind of data; but these emerging systems still need to perfect their user management, data access privileges, backup mechanisms, security control and other aspects. In short, it is first necessary to prevent data loss, providing reasonable backup and redundancy mechanisms for the massive structured and unstructured data so that data will not be lost under any circumstances. Secondly, we should protect the data from unauthorized access, so that only authorized users can access it. Since the large amount of unstructured data may require different storage and access mechanisms, forming a unified security access control mechanism for multi-source data of multiple types is a big problem to be solved. Because big data gathers more sensitive data together, it is more attractive to potential attackers: an attacker who succeeds can obtain more information, so the "cost-performance ratio" of an attack is higher, all of which makes big data an easier target. In 2012 LinkedIn was accused of leaking 6.5 million user account passwords, and Yahoo! faced network attacks that resulted in the leak of 450,000 user IDs; in December 2011, CSDN's security system was breached and the login names, passwords and email addresses of 6 million users were leaked.
Privacy problems are also closely associated with big data. With the rapid development of Internet and Internet of Things technology, all kinds of information related to our work and lives are being collected and stored; we are constantly exposed to a "third eye". Whether we are surfing the Internet, making phone calls, posting microblogs, using WeChat, shopping or travelling, our actions are being monitored and analyzed. Deep analysis and modeling of user behavior can serve customers better and make precision marketing possible; however, if the information is leaked or abused, it directly violates users' privacy, harms their interests, and can even cause loss of life and property. In 2006, the U.S. DVD rental company Netflix organized an algorithm contest: the company released a million rental records from about 500,000 users and publicly offered a prize of one million dollars to whoever could improve the accuracy of its movie recommendation engine by 10%. Although the company had carefully anonymized the data, one user recognized herself in it, and a closeted lesbian mother from the conservative Midwest, proceeding under the name "Anonymous", sued Netflix. On Twitter, a popular site in the United States, many users are accustomed to publishing their location and status at any time, and a few sites, such as "PleaseRobMe.com" ("come rob me") and "WeKnowYourHouse.com" ("we know you're home"), use the information users post to infer when a user is not at home, to work out the user's exact home address, and even to find photos of the house. Their behavior is designed to remind us that we are always exposed to the public eye, and that if we do not develop an awareness of safety and privacy we may bring disaster upon ourselves. Nowadays many countries around the world, including China, are improving the laws related to data use and privacy, to protect private information from being abused.
2.7 Interoperability and data sharing issues
In the informatization of Chinese enterprises, fragmentation and information silos are common phenomena. Systems and data in different industries have almost no intersection, and even within the same industry, for example inside the transportation or social security systems, construction is divided along administrative regions, so information exchange and collaboration across regions are very difficult. Even more seriously, within a single organization, such as some hospitals, the information subsystems for medical record management, bed information, medicine management and so on are built separately, with no information sharing or interoperability. The "smart city" is the emphasis of the information construction in China's Twelfth Five-Year Plan, and its foundation is the interoperability and sharing of information, so as to achieve intelligent e-government, social management and improvement of people's livelihood based on integrated data. Therefore, on the foundation of the digital city, we also need to achieve interconnection and open the data interfaces of all walks of life, so as to achieve interoperability and, only then, intelligence. For example, emergency management in urban areas needs data and assistance from many departments, such as transportation, population, public security, fire and health care. At present, the data sharing platform constructed by the US federal government, www.data.gov, the data resources network of the Beijing Municipal Government (www.bjdata.gov.cn) and other platforms are forceful attempts at open data and data sharing.
To achieve cross-industry data integration, we need to establish uniform data standards, exchange interfaces and sharing protocols, so that data from different industries, different departments and different formats can be accessed, exchanged and shared on a uniform basis. For data access, we also need fine-grained access permissions defining which users can access which types of data under which circumstances. In the era of big data and cloud computing, data from different industries and enterprises may be stored on a unified platform and in shared data centers; we need to protect sensitive information, such as data related to enterprises' commercial secrets and transaction information, so that even though its processing relies on the platform, no one other than the enterprise's own authorized personnel, including the platform administrators and other companies, can access such data.
3 The dialectical relationship between cloud computing and big data
Cloud computing has developed rapidly since 2007. Its core model is large-scale distributed computing, providing computing, storage, networking and other resources to many users as services that are consumed on demand [5]. Cloud computing offers enterprises and users high scalability, high availability, high reliability and efficient use of resources; it can improve the efficiency of resource services and reduce the costs of enterprise informatization, investment and maintenance. As the public cloud services of Amazon, Google and Microsoft in the U.S. become more mature and complete, more and more companies are migrating toward cloud computing platforms.
Thanks to national strategic planning and positive guidance, cloud computing and its technology have made great progress in China in recent years. China has set up several cloud computing model cities, including Beijing, Shanghai, Shenzhen, Hangzhou and Wuxi; Beijing's "Lucky Cloud" plan, Shanghai's "Cloud Sea" plan, Shenzhen's Cloud Computing International Joint Laboratory, Wuxi's cloud computing project and Hangzhou's "West Lake" cloud computing public service platform have been launched, and other cities such as Tianjin, Guangzhou, Wuhan, Xi'an, Chongqing and Chengdu have also introduced corresponding cloud computing development plans or set up cloud computing alliances to carry out research, development and trials of cloud computing. But the popularity of cloud computing in China is still largely limited by the construction of infrastructure and the lack of large-scale industrial applications, so cloud computing has not really taken hold. The reason is that the ubiquity of the Internet of Things and cloud computing technology is a grand vision, under which information can be collected, processed and applied on a large scale, ubiquitously and collaboratively; the premise of such applications, however, is that most industries and enterprises already have a good foundation in and experience of informatization, together with an urgent need to transform their existing system architectures and improve the efficiency of existing systems, while the reality is that most of our small and medium-sized enterprises have only just started on informatization, and only a few large companies and national ministries have such a foundation.
The outbreak of big data is a thorny problem encountered in the development of society and informatization. Data traffic and data volume are growing, data formats are multi-source and heterogeneous, and real-time, accurate data processing is demanded, since it can help us discover the potential value in large amounts of data. The traditional IT architecture is unable to handle the big data problem: it suffers from poor scalability, poor fault tolerance, low performance, difficult installation, deployment and maintenance, and many other bottlenecks. The rapid development of the Internet of Things, the Internet and mobile communication networks in recent years has greatly accelerated the frequency and speed of data transmission, giving birth to the big data problem, and the secondary development and deep reuse of data have made it increasingly acute.
We believe that cloud computing and big data are complementary and stand in a dialectical relationship. The widespread use of cloud computing and the Internet of Things is our vision, while the outbreak of big data is a thorny problem encountered in development; the former is the dream pursued by human civilization, the latter the bottleneck of social development to be solved; cloud computing is a trend of technological development, big data an inevitable phenomenon of a rapidly developing modern information society. To solve the big data problem, we need modern means. Breakthroughs in big data technology will not only solve real problems, but also allow cloud computing and Internet of Things technologies to take hold and be promoted and applied in depth. From the development of IT technologies, we can summarize a few laws:
(1) The competition between mainframes and personal PCs ended in the PC's triumph; in the battle between Apple's iOS and Android, the open Android platform seized a third of the market within two to three years, while Nokia's Symbian operating system has come to the brink of elimination because it is not open. All of these indicate that modern IT needs open, crowdsourced approaches to achieve rapid development.
(2) The collision between existing conventional technology and cloud computing technology is similar. The advantage of cloud computing lies in crowdsourcing and open source: built on open-source platforms and new open-source distributed architectures, it can solve problems that the existing centralized approach handles poorly or cannot solve at all. Taobao, Tencent and other large Internet companies once relied on proprietary solutions provided by big companies such as Sun, Oracle and EMC, but later abandoned them because of the high cost and adopted open source technologies, and their own products have ultimately been contributed back to the open source community, reflecting the trend of information technology development.
(3) The traditional industry giants are tilting toward open source systems, and this is a historic opportunity for catching up. Traditional industry giants and large state-owned enterprises, such as the State Grid, telecommunications, banking and civil aviation, are for historical reasons over-reliant on sophisticated proprietary solutions provided by foreign companies, resulting in a pattern that lacks innovation and is locked in by foreign products. Seen from the path to cracking the problem, to solve big data we must gradually abandon the traditional IT architecture and use the new generation of information technologies marked by "cloud" technology. Although advanced cloud computing technology originated mainly in the United States, it is built on an open-source basis, so the gap between our technology and the most advanced is not large; applying cloud computing technology to large-scale industries to solve the urgent big data problem is also our historic opportunity to achieve breakthrough innovation, break the monopoly and catch up with internationally advanced technology.
4 Big Data technology
Big Data brings not only opportunities but also challenges. Traditional data processing means cannot meet the massive, real-time demands of big data; we need a new generation of information technology to deal with its outbreak. We summarize big data technology into five classifications, as shown in Table 2.
Table 2 Classification of big data technology

    Classification of big data technology    Big data technologies and tools
    Infrastructure supports                  Cloud computing platforms; cloud storage; virtualization technologies; network technology; resource monitoring technology
    Data acquisition                         Data bus; ETL tools
    Data storage                             Distributed file systems; relational databases; NoSQL technology; integration of relational and non-relational databases; in-memory databases
    Data computing                           Data query, statistics and analysis; data mining and prediction; spectrum processing; BI (business intelligence)
    Display and interaction                  Graphics and reports; visualization tools; augmented reality technology
Infrastructure supports: mainly include the infrastructure management centers that support big data processing, cloud computing platforms, cloud storage equipment and technology, network technology and resource monitoring technology. Big data processing needs the support of cloud data centers with large-scale physical resources and cloud computing platforms with efficient scheduling and management functions.
Data acquisition technology: data acquisition is a prerequisite for data processing; we first need means of acquiring and collecting the information before upper-layer data processing technology can be applied to it. Besides the various types of sensors and other hardware and software equipment, data acquisition involves the ETL (extract, transform, load) processing of data, which pre-processes the data by cleaning, filtering, checking and converting it, turning valid data into suitable formats and types. Meanwhile, to support multi-source, heterogeneous data acquisition and storage access, we also need to design an enterprise data bus to facilitate data exchange and sharing between the various enterprise applications and services.
Data storage technology: after collection and conversion, the data need to be stored and archived. Facing large amounts of data, we generally use distributed file systems and distributed databases to spread the data across multiple storage nodes, and we also need to provide mechanisms such as backup, security, access interfaces and protocols.
Data computing: data query, statistics, analysis, forecasting, mining, spectrum processing, BI (business intelligence) and other related technologies are collectively referred to as data computing technology. Data computing covers all aspects of data processing and is the core of big data technology.
Data presentation and interaction: the presentation of data and interaction with it are also essential in big data technology, since the data will eventually be used by people to provide decision support for production, operation and planning. Choosing an appropriate, vivid and visual way of presentation gives a better understanding of the data, its connotations and its relationships, and helps us interpret and use the data more effectively and develop its value. In addition to traditional reports and graphics, we can also combine modern visualization tools and human-computer interaction, and even use augmented reality technology such as Google Glass, to achieve a seamless interface between data and reality.
4.1 Infrastructure Supports
Big data processing needs the support of cloud data centers with large-scale physical resources and cloud computing platforms with efficient scheduling and management functions. A cloud computing management platform can provide a flexible and efficient deployment, operation and management environment for large data centers and enterprises; support heterogeneous underlying hardware and operating systems through virtualization technology; provide applications with cloud resource management solutions that are safe, high-performance, highly extensible, highly reliable and highly scalable; reduce the costs of application development, deployment, operation and maintenance; and improve the efficiency of resource use.
As a new computing model, cloud computing has gained great momentum in academia and industry. Governments, research institutions and industry leaders are actively trying to solve the growing computing and storage problems of the Internet age with cloud computing. In addition to Amazon's AWS, Google's App Engine, Microsoft's Windows Azure and other commercial cloud platforms, there are also open source cloud computing platforms such as OpenNebula [6][7], Eucalyptus [12], Nimbus [9] and OpenStack [8], each with its own significant features and a constantly evolving community.
Amazon's AWS is arguably the most popular cloud computing platform; in the first half of 2013 its platform and cloud computing services earned $1.7 billion, a year-on-year growth of 60%. The most important feature of its system architecture is that functions are exposed via Web Service interfaces, and the systems are loosely coupled through an SOA architecture. The web service stack provided by AWS can be divided into four layers (a brief usage sketch follows the list):
(1) The Access Layer: provides the management console, APIs and various command-line tools, etc.
(2) The Common Service Layer: includes authentication, monitoring, deployment and automation, etc.
(3) The PaaS Layer: includes parallel processing, content delivery and messaging services.
(4) The IaaS Layer: includes the cloud computing platform EC2, the cloud storage services S3/EBS, the network services VPC/ELB, database services, etc.
Eucalyptus is an open source cloud computing platform that attempts to clone AWS. It implements functions similar to Amazon EC2, achieving flexible and practical cloud computing on computing clusters or groups of workstations, and provides interface compatibility with EC2 and the S3 storage system, so applications written against those interfaces can interact directly with Eucalyptus. It supports the Xen [10] and KVM [11] virtualization technologies, as well as cloud management tools for system management and user accounting. Eucalyptus consists of five major components, namely the cloud controller (CLC), the cloud storage service Walrus, the cluster controller (CC), the storage controller (SC) and the node controller (NC). Eucalyptus manages computing resources through agents, and these components collaborate to provide the required cloud services.
OpenNebula is an open-source virtual infrastructure and cloud computing management project, initiated as a European research effort in 2005. It is an open source toolkit used to create IaaS private clouds, public clouds and hybrid clouds, and it is also a modular system that can implement different cloud architectures and interact with a variety of data center services. OpenNebula integrates storage, network, virtualization, monitoring and security technologies, and can deploy multi-layered services as virtual machines on distributed infrastructure according to allocation policies. OpenNebula is divided into three layers, namely the interface layer, the core layer and the driver layer:
(1) The interface layer provides native XML-RPC interfaces and implements various APIs such as EC2, OCCI (Open Cloud Computing Interface) and the OpenNebula Cloud API (OCA), giving users a variety of access options.
(2) The core layer provides unified plug-in management, request management, VM lifecycle management, hypervisor management, network resource management, storage resource management and other core functions.
(3) At the bottom, the driver layer (consisting of a variety of drivers) interacts with the virtualization software (KVM, Xen) and with the physical infrastructure.
OpenStack is an open source cloud computing infrastructure with which users can build and run their own cloud computing and storage infrastructure. Users can consume the cloud services provided by OpenStack through APIs compatible with Amazon EC2/S3, so client tools written for Amazon Web Services (AWS) can also be used with OpenStack. OpenStack has gone furthest in SOA and the decoupling of service-oriented components. Its overall structure is likewise divided into three layers: the first layer is the access layer, with applications, the management portal (Horizon) and APIs; the core layer comprises the compute service (Nova), the storage services (the object storage service Swift and the block storage service Cinder) and the network service (Quantum); the third layer contains shared services, currently the identity and access management service (Keystone) and the image service (Glance).
Nimbus is an open source system that provides interfaces compatible with Amazon EC2 and can create a virtual machine cluster quickly and easily, so that a cluster scheduling system can schedule tasks on it just as on an ordinary cluster. Nimbus also supports different virtualization implementations (Xen and KVM) and is mainly used in scientific computing.
4.2 Data acquisition
A sufficient amount of data is the basis of an enterprise's big data strategy, so data acquisition is the first step of big data analysis. Data acquisition is an important part of mining the value of big data, and the subsequent analysis and mining are based on it. The significance of big data lies not in grasping the sheer scale of the data, but in the intelligent processing of the data and in analyzing and mining valuable information from it; the premise, however, is having a lot of data. It is difficult for most enterprises to judge which data will become data assets in the future and by what means the data can be refined into real revenue, and even big data service providers find it hard to give a definitive answer. But one thing is sure: in the era of big data, those who have mastered enough data are likely to master the future, and the acquisition of big data now is the accumulation of assets for the future.
Data acquisition can be based on the sensors of the Internet of Things or on network information. For example, intelligent transportation involves several kinds of data acquisition: information collection based on GPS positioning, image collection at traffic crossings, induction-coil signal collection at intersections, and so on. Data acquisition on the Internet collects all kinds of page information and user access information from network media such as search engines, news sites, forums, microblogs, blogs and e-commerce sites, mainly text, URLs, access logs, dates and pictures. The collected data then need pre-processing such as cleaning, filtering and de-duplication, followed by categorized and summarized storage.
ETL tools are responsible for extracting data of different types and structures from distributed, heterogeneous data sources, such as text data and relational data as well as pictures, video and other unstructured data, moving it to a temporary middle layer for cleaning, conversion, classification and integration, and finally loading it into the corresponding data storage systems, such as data warehouses or data marts, as the basis for online analytical processing. ETL for big data differs from the traditional ETL process: on the one hand the volume of big data is huge, and on the other hand the data are produced very fast; for instance, the video cameras and smart meters of a city generate large amounts of data every second, so the pre-processing has to be real-time and fast, and in choosing ETL architectures and tools we also adopt modern technologies such as distributed in-memory databases and real-time stream processing systems.
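As a minimal illustration of such pre-processing (not the API of any particular ETL product), the sketch below reads raw log lines, drops malformed records, de-duplicates them and converts the rest into a normalized CSV form using only the standard Java library; the input format and file names are assumptions for the example, and a production pipeline would perform the same steps with a distributed stream-processing or ETL tool.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class MiniEtl {
        public static void main(String[] args) throws IOException {
            // Hypothetical input: lines in the form "timestamp|userId|url".
            Path in = Path.of("raw_access.log");
            Path out = Path.of("clean_access.csv");

            Set<String> cleaned = new LinkedHashSet<>();   // de-duplicates while keeping order
            for (String line : Files.readAllLines(in)) {
                String[] fields = line.trim().split("\\|");
                if (fields.length != 3 || fields[0].isEmpty()) {
                    continue;                              // filter out malformed records
                }
                // Transform: normalize the URL to lower case and re-emit as CSV.
                cleaned.add(fields[0] + "," + fields[1] + "," + fields[2].toLowerCase());
            }
            Files.write(out, cleaned);                     // load into the target store (a file here)
        }
    }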
Modern enterprises run a variety of applications with various data formats and storage requirements, but both between and within enterprises there are problems of fragmentation and information silos: enterprises cannot achieve controlled data exchange and sharing, and the limitations of development technologies and application environments erect barriers to data sharing between applications, hindering data exchange as well as the enterprise's needs for data control, data management and data security. To achieve cross-industry and cross-department data integration, especially in the construction of the smart city, we need to develop unified data standards, exchange interfaces and sharing protocols, so that data from different industries and departments and in different formats can be accessed, exchanged and shared on a unified basis. Through an enterprise data bus, we can provide access to all kinds of data and separate the integration of data access from the integration of business functions.
The enterprise data bus creates an abstraction layer for data access, so that corporate business functions are shielded from the details of data access. Business components then only need to contain service functional components (implementing the existing services) and data access components (which use the enterprise data bus). The enterprise data bus provides a unified data conversion interface between the enterprise's data management models and the data models of its application systems, effectively reducing the coupling between the various application services. In a big data scenario there are large numbers of concurrent data access requests on the enterprise data bus, and performance degradation of any module on the bus will greatly affect its function, so the enterprise data bus needs a large-scale, concurrent and highly scalable implementation.
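The following is a hedged Java sketch of what such a data-access abstraction might look like; the interface and class names are invented for illustration and do not correspond to any specific product. The point is only that business code depends on the bus interface, while the concrete store behind each dataset can change freely.

    import java.util.List;
    import java.util.Map;

    // Hypothetical abstraction layer: business components depend only on this interface,
    // not on the concrete store (RDBMS, HDFS, NoSQL) behind it.
    interface DataBus {
        List<Map<String, Object>> query(String dataset, Map<String, Object> criteria);
        void publish(String dataset, Map<String, Object> record);
    }

    // One implementation could route requests to an RDBMS, another to HBase;
    // business code is unaware of which backend serves a given dataset.
    class LoggingDataBus implements DataBus {
        @Override
        public List<Map<String, Object>> query(String dataset, Map<String, Object> criteria) {
            System.out.println("query " + dataset + " with " + criteria);
            return List.of();                  // a real bus would delegate to a backing store
        }

        @Override
        public void publish(String dataset, Map<String, Object> record) {
            System.out.println("publish to " + dataset + ": " + record);
        }
    }

    public class DataBusDemo {
        public static void main(String[] args) {
            DataBus bus = new LoggingDataBus();
            bus.publish("meter_readings", Map.of("meterId", "m-42", "kwh", 3.7));
            bus.query("meter_readings", Map.of("meterId", "m-42"));
        }
    }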
4.3 Data Storage
The amount of data increases rapidly every year and, together with the existing historical data, brings great opportunities and challenges to the data storage and data processing industry. In order to meet this rapidly growing storage demand, cloud storage requires high scalability, high reliability, high availability, low cost, automatic fault tolerance, decentralization and other characteristics. Common forms of cloud storage can be divided into distributed file systems and distributed databases: distributed file systems use large numbers of distributed storage nodes to meet the need of storing large amounts of files, while distributed NoSQL databases support the processing and analysis of massive unstructured data.
Google, as a pioneer, faced the problem of storing and analyzing massive numbers of web pages early on, and developed the Google File System (GFS) [13] and, on top of it, the MapReduce distributed computing and analysis model [15, 18, 31]. As some applications need to deal with large amounts of formatted and semi-formatted data, Google also built BigTable [14], a large-scale database system with weak consistency requirements that is capable of indexing, querying and analyzing massive amounts of data. This series of Google products opened the door to massive data storage, query and processing in the cloud computing era, became the de facto standard in this field, and has kept Google the leader in the related techniques.
Google's technology is not open source, so Yahoo and the open source community collaboratively developed the Hadoop system, an open source implementation of MapReduce and GFS. The design principles of its underlying file system, HDFS, are completely consistent with those of GFS, and it also provides an open source implementation of BigTable, a distributed database system named HBase. Since their launch, Hadoop and HBase have been widely applied all over the world; they are now managed by the Apache Foundation, and Yahoo's own search system runs on Hadoop clusters with tens of thousands of machines.
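To make the MapReduce programming model concrete, below is the classic word-count job written against Hadoop's MapReduce Java API (essentially the standard example shipped with Hadoop); the input and output paths are passed on the command line, and the mapper emits (word, 1) pairs that the reducer sums per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework handles splitting the input across the cluster, moving map output to the reducers and re-running failed tasks, which is exactly the fault-tolerant, data-local execution style described above.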
Google considered the harsh environment faced by a distributed file system running on a large-scale data cluster: 1) a large number of nodes may fail at any time, so fault tolerance and automatic recovery must be built into the system; 2) the file system has special workload parameters, with files usually gigabytes in size alongside large numbers of small files; 3) the characteristics of the applications must be considered, supporting file append operations and optimizing sequential read and write speeds; 4) some specific file system operations are no longer transparent and need the cooperation of application programs.
Fig 1 Architecture of the Google File System
Figure 1 depicts the architecture of the Google File System: a GFS cluster contains a master server (GFS Master) and several chunk servers (GFS chunkservers), and is accessed by multiple clients (GFS Clients). Large files are split into chunks of fixed size; chunk servers store the chunks on local hard drives as ordinary Linux files, and read and write chunk data according to a specified chunk handle and byte range. To guarantee reliability, each chunk has three replicas by default. The master server manages all the metadata of the file system, including namespaces, access control, the mapping of files to chunks, the physical locations of chunks and other relevant information. Through the joint design of server and client, GFS provides applications with optimal performance and availability support. GFS was designed for Google's own applications, and Google runs many GFS clusters; some clusters have more than a thousand storage nodes and over a PB of storage space, and are visited continuously and frequently by thousands of clients on different machines.
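HDFS, the Hadoop counterpart of GFS mentioned above, exposes this master/chunk-server architecture to applications through a simple client API; a small sketch follows. It assumes the Hadoop client libraries are on the classpath and a reachable NameNode (the address below is made up); the client asks the NameNode for metadata and block locations, then streams block data to and from the DataNodes.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address for the example.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/demo/hello.txt");

                // Write: the NameNode allocates blocks (replicated three times by default).
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back and copy it to standard output.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }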
In order to deal with the challenge of massive data, some commercial database systems attempt to combine traditional RDBMS technology with distributed, parallel computing technology to meet the needs of big data, and many of them accelerate data processing at the hardware level. Typical systems include IBM's Netezza, Oracle's Exadata, EMC's Greenplum, HP's Vertica and Teradata. From a functional perspective, these systems continue to support the operational semantics and analysis patterns of traditional databases and data warehouses; with respect to scalability, they can use massive cluster resources to process data concurrently, dramatically reducing the time needed to load, index and query data.
Exadata and Netezza adopt data warehouse appliance solutions that combine software and hardware, seamlessly integrating the database management system, servers, storage and networking. For users, such an appliance can be installed quickly and easily and satisfies their needs through standard interfaces and simple operations, but these all-in-one solutions have many shortcomings, including expensive hardware, large energy consumption, expensive system service fees, and the need to purchase a whole new system when an upgrade is required. The biggest problem of Oracle's Exadata is its shared-everything architecture, which limits its IO processing capacity and scalability. The storage nodes in Exadata cannot communicate with each other: any intermediate computing results have to be delivered from the storage layer to a RAC node, then delivered by the RAC node to the corresponding storage-layer node, and computed again. The large amount of data movement results in unnecessary IO and network resource consumption. Exadata's query performance is also unstable, and tuning it requires experience and in-depth knowledge.
A NoSQL database is, by definition, one that breaks the paradigm constraints of traditional relational
databases. From the storage perspective, many NoSQL databases are not relational databases at all but
hash databases with a key-value data format. By giving up the powerful SQL query language as well as the
transactional consistency and normal-form constraints of relational databases, NoSQL databases can, to a
great extent, solve many of the challenges faced by traditional relational databases. Their designs
focus on highly concurrent reads and writes and on massive data storage, and compared with relational
databases they have a great advantage in scalability, concurrency and fault tolerance. The mainstream
NoSQL databases include Google's BigTable, its open-source counterpart HBase, and Facebook's Cassandra.
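As an illustration of the key-value data model just described, the minimal in-memory store below (a sketch, not any particular NoSQL product) keeps schema-free values under hashed keys, with no joins, transactions or normal-form constraints.

```python
# Minimal in-memory key-value store sketching the NoSQL data model discussed above.
import json

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are schema-free; here any JSON-serializable object is accepted.
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else default

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1001", {"name": "Alice", "last_login": "2013-05-01"})
store.put("user:1002", {"name": "Bob"})          # records need not share columns
print(store.get("user:1001")["name"])            # -> Alice
```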
Because some Google applications need to process large amounts of formatted and semi-formatted data,
Google built a large-scale database system with weak consistency requirements, named BigTable. BigTable
applications include search logs, Maps, the Orkut online community, the RSS reader, and so on.
Fig 2 Data model in BigTable
Figure 2 illustrates the data model used by BigTable applications. The model consists of rows, columns
and corresponding timestamps, with all data stored in the table cells. BigTable's content is partitioned
by row: several consecutive rows are grouped into a small table that is saved on a single server node.
This small table is called a Tablet.
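The cell addressing scheme of Figure 2 can be sketched in a few lines of Python; this is only an illustration of the (row key, column, timestamp) model and of how sorted row ranges form Tablets, not the real BigTable or HBase API.

```python
# Toy sketch of the BigTable data model: each cell is addressed by
# (row key, column, timestamp); row keys are kept sorted so that contiguous
# row ranges can be split off into Tablets served by different nodes.
from bisect import insort

class Table:
    def __init__(self):
        self.cells = {}        # (row, column) -> {timestamp: value}
        self.row_keys = []     # sorted row keys; contiguous ranges form Tablets

    def put(self, row, column, timestamp, value):
        if row not in self.row_keys:
            insort(self.row_keys, row)
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None   # latest version by default

    def tablet(self, start_row, end_row):
        """A 'Tablet' is simply a contiguous, sorted range of rows."""
        return [r for r in self.row_keys if start_row <= r < end_row]

t = Table()
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.get("com.cnn.www", "contents:"))          # -> <html>v3</html>
print(t.tablet("com.a", "com.z"))                 # -> ['com.cnn.www']
```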
Like the systems described above, BigTable is a joint design of client and server, so that performance
can meet the needs of its applications as far as possible. The BigTable system relies on the underlying
cluster infrastructure: a distributed cluster task scheduler, the Google File System, and the
distributed lock service Chubby. Chubby is a very robust coarse-grained lock service, which BigTable
uses to store the pointer to the root data; a client therefore first obtains the root server's location
from Chubby and then accesses the data. BigTable uses one server as the primary server to store and
manipulate metadata. Besides metadata management, the primary server is also responsible for remote
management and load deployment of the tablet servers (the data servers in the general sense). Clients
use the programming interface for metadata communication with the primary server and for data
communication with the tablet servers.
Among large-scale distributed databases, mainstream NoSQL databases such as HBase and Cassandra mainly
provide high scalability support and make some sacrifices in consistency and availability, lacking the
ACID semantics and transaction support of a traditional RDBMS. Google's Megastore [32], however, strives
to integrate NoSQL with the traditional relational database and to provide strong guarantees of
consistency and high availability. Megastore uses synchronous replication to achieve high availability
and a consistent view of the data. In short, Megastore provides fully serializable ACID semantics over
low-latency data replicas in different regions to support interactive online services. Combining the
advantages of NoSQL and RDBMS, Megastore achieves high scalability, high fault tolerance and low latency
while preserving consistency, and it provides services for hundreds of Google's production applications.
4.4 Data computing
The need to query, count, analyze and mine big data has motivated different computing models, and we
divide big data computing into three parts: offline batch computing, real-time interactive computing and
streaming computing.
4.4.1 Offline batch
With the wide application and development of cloud computing techniques, open-source Hadoop distributed
storage and MapReduce data processing and analysis systems have also come into widespread use. Hadoop
supports PB-level distributed data storage through data partitioning and self-recovery mechanisms, and
analyzes and processes these data with the MapReduce distributed processing model. The MapReduce
programming model lets many general batch processing tasks and operations run in parallel on a
large-scale cluster, with automatic failover. Led by open-source software such as Hadoop, the MapReduce
programming model has been widely adopted and is applied to Web search, fraud detection and a variety of
other practical applications.
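The classic word-count example below shows the programming model in miniature: the map and reduce functions are the parts a developer writes, and the small driver merely simulates, in one process, the split, map, shuffle (group by key) and reduce phases that Hadoop would run in parallel across a cluster.

```python
# Word count in the MapReduce style: the same map()/reduce() logic would run in
# parallel on a Hadoop cluster; this driver only simulates the pipeline locally.
from collections import defaultdict

def map_fn(_, line):
    for word in line.split():
        yield word.lower(), 1                 # emit (key, value) pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                   # aggregate all values for one key

def run_mapreduce(records, map_fn, reduce_fn):
    shuffled = defaultdict(list)
    for key, value in records:                # map phase
        for k, v in map_fn(key, value):
            shuffled[k].append(v)             # shuffle: group intermediate values by key
    results = {}
    for k, vs in shuffled.items():            # reduce phase
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

lines = [(0, "big data needs big storage"), (1, "big clusters process data")]
print(run_mapreduce(lines, map_fn, reduce_fn))   # {'big': 3, 'data': 2, ...}
```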
Hadoop is a software framework for the distributed processing of large amounts of data in a reliable,
efficient and scalable way: it relies on horizontal expansion, improving computing and storage capacity
by adding low-cost commodity servers. Users can easily develop and run applications that deal with
massive amounts of data. Hadoop has the following advantages:
(1) High reliability: its ability to store and process data bit by bit is worthy of trust;
(2) High scalability: it allocates data and computing tasks across the available computer clusters,
which can easily be expanded to thousands of nodes;
(3) High efficiency: it can move data dynamically between nodes and keeps each node dynamically
balanced, so processing speed is very fast;
(4) High fault tolerance: it automatically saves multiple copies of the data and automatically
reassigns failed tasks.
Fig 3 The Hadoop ecosystem
The big data processing platform technologies [61] represented by the Hadoop platform include MapReduce,
HDFS, HBase, Hive, ZooKeeper, Avro [48], Pig and others, which together form the Hadoop ecosystem shown
in Figure 3.
(1) The MapReduce programming model is the heart of Hadoop and is used for the parallel computation of
massive data sets. It is this programming model that gives a Hadoop cluster massive scalability
across hundreds or thousands of servers;
(2) The distributed file system HDFS provides mass data storage for the Hadoop processing platform:
the NameNode provides metadata services, while DataNodes store the file blocks of the file system;
(3) HBase, built on HDFS, provides a database system with high reliability, high performance,
column-oriented storage, scalability and real-time reads and writes, and can store unstructured
and semi-structured loose data;
(4) Hive [17] is a data warehouse built on Hadoop that can be used for extraction, transformation and
loading (ETL) and for storing, querying and analyzing large-scale data stored in Hadoop;
(5) Pig [21] is a large-scale data analysis platform based on Hadoop that transforms SQL-like data
analysis requests into a series of optimized MapReduce operations, providing a simple operating
and programming interface for complex massively parallel data computations;
(6) ZooKeeper [19] is an efficient and reliable coordination system used to coordinate the various
services of distributed applications; with ZooKeeper one can build coordination services that
prevent single points of failure and handle load balancing effectively;
(7) Avro, a binary high-performance middleware, provides data serialization and RPC services across
the Hadoop platform.
The Hadoop platform is mainly intended for offline batch applications; the typical application schedules
batch tasks to operate on static data. The computing process is relatively slow, and some queries may
take hours or even longer to return results, so Hadoop is powerless for applications and services with
strict real-time requirements. MapReduce is a good cluster parallel programming model and meets the
needs of most applications, but although it is a good abstraction of distributed/parallel computing, it
is not necessarily suitable for every computing problem. For example, it cannot effectively handle
applications that require results in real time, such as pay-per-click advertisement placement based on
traffic, social recommendation based on real-time analysis of user behavior, or anti-cheating statistics
based on web search and clickstreams, because the application logic requires multiple passes over the
data or requires splitting the input into very fine-grained pieces. The MapReduce model has the
following limitations:
(1) Transfers of intermediate data are difficult to optimize fully;
(2) Restarting individual tasks is costly;
(3) Storing intermediate data incurs a large overhead;
(4) The master node can easily become a bottleneck;
(5) Only a uniform file fragment size is supported, which makes it difficult to deal with complex
collections of documents of varying sizes;
(6) Structured data are difficult to store and access directly.
In addition to the MapReduce computing model, workflow computing models represented by Swift [38, 39]
and graph computing models represented by Pregel [20] can handle application processes and graph
algorithms that contain large-scale computing tasks. As a bridge between scientific workflows and
parallel computing, the Swift system is a parallel programming tool for the fast and reliable
definition, execution and management of large-scale science and engineering workflows. Swift uses a
structured approach to manage workflow definition, scheduling and execution, including a simple
scripting language, SwiftScript, which can concisely describe complex parallel computations [40] over
data-set types and iterations and can dynamically map data sets for large-scale data in different
formats. At run time the system provides an efficient workflow engine for scheduling and load balancing,
and it can interact with resource management systems such as PBS and Condor to complete its tasks.
Pregel is a distributed programming framework for graph algorithms that can be used for graph traversal,
shortest paths and PageRank computation. It adopts an iterative computing model: in each round, every
vertex processes the messages received in the previous round, sends messages to other vertices, and
updates its own state and topology (outgoing and incoming edges).
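The vertex-centric style can be sketched as follows; this toy single-process loop (not Pregel's real API) computes single-source shortest paths, with each superstep delivering messages, updating vertex values and generating messages for the next round until none remain.

```python
# Vertex-centric supersteps in the spirit of Pregel: in each superstep every
# vertex processes incoming messages, possibly updates its value, and sends
# messages along its outgoing edges; computation halts when no messages remain.
def pregel_sssp(edges, source):
    # edges: {vertex: [(neighbor, weight), ...]}
    dist = {v: float("inf") for v in edges}
    messages = {source: [0]}                          # initial message to the source
    while messages:
        next_messages = {}
        for vertex, incoming in messages.items():     # one superstep
            best = min(incoming)
            if best < dist[vertex]:                   # value improved: update and notify
                dist[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    next_messages.setdefault(neighbor, []).append(best + weight)
        messages = next_messages                      # barrier between supersteps
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(pregel_sssp(graph, "a"))                        # {'a': 0, 'b': 1, 'c': 3}
```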
4.4.2 Real-time interactive computing
Nowadays real-time computing generally targets massive data. Besides meeting some of the requirements of
non-real-time computing (e.g., accurate results), its most important requirement is to return computed
results in real time, generally at the millisecond level. Real-time computing can generally be divided
into the following two application scenarios:
(1) The amount of data is huge and results cannot be computed in advance, while the response to users
still has to be real-time.
This scenario mainly concerns data analysis and processing. When the amount of data is large and listing
all possible combinations of query conditions is impossible, or exhaustively enumerating the
combinations is useless, real-time computing plays its role: it postpones the computation to the query
phase while still providing users with a real-time response. In this case part of the data can be
processed in advance and combined with the results of real-time computing to improve processing
efficiency.
(2) The data source is real-time and uninterrupted, and the response to users must also be real-time.
Here the data sources are real-time, in other words streaming data. So-called streaming data means
treating the data as a data stream. A data stream is an aggregation of a series of data records that is
unbounded in time and in number, and data records are the smallest units of a data stream. For example,
the data generated by the sensors of the Internet of Things may be continuous. We introduce stream
processing systems separately in the next section. Real-time data computing and analysis can count and
analyze data dynamically and in real time, which has important practical significance for monitoring the
state of a system and for scheduling management.
The real-time computing process for massive data can be divided into three phases, as shown in Figure 4:
data generation and collection, data analysis and processing, and service provision.
Fig 4 Process of real-time calculation
Real-time data acquisition: functionally, it must ensure that all the data are collected completely and
provide real-time data to real-time applications; in terms of response time, it must guarantee real-time
behavior and low latency; its configuration should be simple and deployment easy; and the system should
be stable and reliable. Currently, Internet companies' massive data acquisition tools include Facebook's
open-source Scribe [50], LinkedIn's open-source Kafka [34], Cloudera's open-source Flume [35], Taobao's
open-source TimeTunnel [36] and Hadoop's Chukwa [37], all of which can meet log data acquisition and
transmission requirements of hundreds of megabytes per second.
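As one concrete, hedged example of feeding such a collection pipeline, the sketch below uses the kafka-python client to push log lines into a topic; the broker address, topic name and log path are assumptions made for illustration, and the client API may differ between versions.

```python
# Sketch of pushing log lines into a Kafka topic with the kafka-python client.
# Assumes a broker at localhost:9092, a topic "app-logs" and the log path below;
# all three are illustrative placeholders for a real deployment.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("/var/log/app/access.log", "rb") as log:
    for line in log:
        producer.send("app-logs", line.rstrip())   # fire-and-forget, batched internally

producer.flush()    # make sure buffered records reach the broker before exiting
```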
Real-time data computing: in traditional data manipulation, the data are first collected and stored in a
database management system (DBMS), and users then interact with the DBMS through queries to obtain the
answers they want. In this whole process the user is active while the DBMS is passive. Today, however,
there is a great deal of real-time data with strong timeliness, huge volume and diverse formats, for
which the traditional relational database schema is not appropriate. New real-time computing
architectures generally adopt the distributed architecture of massively parallel processing (MPP),
assigning data storage and processing to a large number of nodes to meet real-time requirements; for
data storage they use large-scale distributed file systems, such as Hadoop's HDFS, or the new NoSQL
distributed databases.
Real-time query service: it can be implemented in three ways. 1) Fully in memory: provide data read
services directly from memory and dump to disks or databases periodically for persistence. 2)
Semi-in-memory: use Redis, Memcached, MongoDB, BerkeleyDB or other databases to provide a real-time
polling service, leaving persistence to those systems. 3) Fully on disk: use NoSQL databases built on a
distributed file system (HDFS), such as HBase; for a key-value engine, the key point is to design the
distribution of keys.
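A minimal sketch of the semi-in-memory approach, assuming a Redis server on localhost and using the redis-py client (the key names and expiry policy are illustrative choices, not a prescribed design): computed results are written into Redis for low-latency polling, and Redis itself handles persistence.

```python
# Semi-in-memory serving sketch with redis-py (assumes Redis on localhost:6379).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_result(metric, value):
    # Store the freshest computed value; expire after 60 s so stale data disappears.
    r.set("rt:" + metric, json.dumps(value), ex=60)

def poll_result(metric):
    raw = r.get("rt:" + metric)
    return json.loads(raw) if raw else None

publish_result("clicks_per_sec", {"ts": "2013-06-01T12:00:00", "value": 1842})
print(poll_result("clicks_per_sec"))
```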
Among real-time and interactive computing technologies, Google's Dremel [40] system is the most
prominent. Dremel is Google's "interactive" data analysis system: it can be built on clusters of
thousands of nodes and can process PB-level data. As the originator of MapReduce, Google developed
Dremel to shorten processing time to the second level, as a strong complement to MapReduce. As the
report engine of Google BigQuery, Dremel has achieved great success. Like MapReduce, Dremel runs
together with the data and moves the computation to the data; it requires a file system such as GFS as
its storage layer. Dremel supports a nested data model, similar to JSON. The traditional relational
model, with its inevitable large number of join operations, is often powerless when dealing with data at
such a scale. Dremel also uses column storage, so it scans only the part of the data that is needed,
reducing CPU and disk accesses; at the same time, column storage is compression-friendly, and
compression reduces storage volume and maximizes performance.
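The benefit of a columnar layout for scan-heavy queries can be seen in a plain-Python sketch (this shows only the general idea, not Dremel's nested column format): a query that touches one column reads a single homogeneous array instead of every field of every row, and such arrays also compress well.

```python
# Row store vs. column store for an aggregation over a single field.
rows = [
    {"url": "/a", "country": "CN", "latency_ms": 120},
    {"url": "/b", "country": "US", "latency_ms": 80},
    {"url": "/a", "country": "CN", "latency_ms": 95},
]

# Row store: computing the average latency still walks whole records.
avg_row = sum(r["latency_ms"] for r in rows) / len(rows)

# Column store: the same table kept as one array per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}
latencies = columns["latency_ms"]            # only this column is scanned; one type,
avg_col = sum(latencies) / len(latencies)    # similar values, so it compresses well

assert avg_row == avg_col
print(avg_col)
```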
Spark [41] is a real-time data analysis system developed by the AMP Lab at the University of California,
Berkeley. It adopts an open-source cluster computing environment similar to Hadoop, but Spark is
superior in the design and performance of task scheduling and workload optimization. Spark uses
in-memory distributed data sets; in addition to providing interactive queries, it can also optimize
iterative workloads [46]. Spark is implemented in Scala and is tightly integrated with it, so
distributed data sets can be manipulated as easily as local collection objects. Spark supports iterative
operations on distributed data sets and is an effective complement to Hadoop, supporting fast
statistical analysis of data. It can also run on the Hadoop file system alongside Hadoop, supported by a
third-party cluster framework named Mesos. Spark can be used to build large-scale, low-latency data
analysis applications.
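A minimal PySpark sketch of the in-memory data set idea is shown below, run in local mode with a placeholder input path: the data set is loaded once, cached in memory, and then reused by several operations without rereading it from disk.

```python
# Minimal PySpark sketch: load once, cache in memory, reuse across queries.
# Local mode; "access.log" is a placeholder path.
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-analysis")

logs = sc.textFile("access.log").cache()          # keep the data set in memory

errors = logs.filter(lambda line: "ERROR" in line)
print(errors.count())                             # first action materializes and caches

top_words = (logs.flatMap(lambda line: line.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b)  # word count over the same cached data
                 .takeOrdered(10, key=lambda kv: -kv[1]))
print(top_words)

sc.stop()
```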
Impala [42], recently released by Cloudera and similar to Google's Dremel system, is an effective tool
for real-time queries over big data. Impala offers fast, interactive SQL queries on data in HDFS or
HBase; besides sharing a unified storage platform, it uses the same Metastore and SQL syntax as Hive,
providing a unified platform for batch and real-time queries.
4.4.3 Streaming computing
In many real-time application scenarios, such as real-time trading systems, real-time fraud analysis,
real-time ad push [23], real-time monitoring and real-time analysis of social networks, there are large
amounts of existing data, real-time requirements are high, and the data sources are continuous. New data
must be processed immediately, or subsequent data will pile up and the processing will never end. We
often need sub-second or even sub-millisecond response times, which calls for a highly scalable
streaming computing solution.
Stream computing [24][26] targets real-time, continuous data. It analyzes the stream data in real time
as they move and change, captures information that may be useful to users, and sends the results out. In
this process the data analysis and processing system is active, while the users are passive receivers,
as shown in Figure 5.
Figure 5 Process of Streaming computing
Traditional streaming computing systems are generally based on an event mechanism, and the amount of
data they process is small. Newer stream processing techniques, such as Yahoo's S4 [22][26], mainly
address streaming workloads with high data rates and large data volumes.
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform on which
developers can easily build applications for processing unbounded, uninterrupted streams of data. Data
events are routed to Processing Elements (PEs), which consume the events and do the following (a toy
sketch follows this list):
(1) emit one or more events that may be processed by other PEs;
(2) publish results.
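The sketch below illustrates this processing-element idea in plain Python (it is not S4's actual API): a PE consumes one event at a time from an unbounded stream, keeps small per-key state in memory, and periodically publishes partial results.

```python
# Toy processing element (PE): consume an unbounded event stream one event at a
# time, keep per-key state in memory, and publish running results continuously.
import itertools
import random

def event_stream():
    """Stand-in for an unbounded stream of (user, action) events."""
    users = ["u1", "u2", "u3"]
    while True:
        yield random.choice(users), "click"

class ClickCountPE:
    def __init__(self, publish_every=5):
        self.counts = {}
        self.seen = 0
        self.publish_every = publish_every

    def process(self, event):
        user, _action = event
        self.counts[user] = self.counts.get(user, 0) + 1
        self.seen += 1
        if self.seen % self.publish_every == 0:     # publish partial results downstream
            print("running click counts:", dict(self.counts))

pe = ClickCountPE()
for event in itertools.islice(event_stream(), 20):  # bounded here only for the demo
    pe.process(event)
```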
S4's design is primarily driven by large-scale data acquisition and machine learning in production
environments. Its main features are:
(1) a simple programming interface for handling data streams;
(2) a high-availability cluster design that scales on commodity hardware;
(3) the use of local memory on every processing node, avoiding disk I/O bottlenecks and minimizing
latency;
(4) a decentralized, peer-to-peer architecture: all nodes provide the same functions and carry the
same responsibilities, and there is no central node with special responsibility, which greatly
simplifies deployment and maintenance;
(5) a pluggable architecture that keeps the design as general and customizable as possible;
(6) a friendly design philosophy that is easy to program and flexible.
S4's design shares many characteristics with IBM's stream processing core middleware SPC [53]. Both
systems are designed for large amounts of data, and both can apply user-defined operations to collect
information from continuous data streams. The main difference lies in the architectural design: SPC's
design derives from the publish/subscribe model, whereas S4's design comes from a combination of
MapReduce and the Actor model. Yahoo! believes that, because of its symmetric structure, S4 achieves a
very high degree of simplicity: all nodes in the cluster are identical, and there is no central control.
SPC is a distributed stream processing middleware that supports applications extracting information from
large-scale data streams. SPC contains a programming model and a development environment for building
distributed, dynamic, scalable applications; its programming model includes an API for declaring and
creating processing elements (PEs), as well as a toolset for assembling, testing, debugging and
deploying applications. Unlike other stream processing middleware, in addition to relational operators
it also supports non-relational operators and user-defined functions.
Storm [43] is a real-time data processing framework open-sourced by Twitter. Playing a role for streams
similar to that of Hadoop for batches, this kind of highly scalable streaming solution, able to process
high-frequency, large-scale data, is applied to real-time search, high-frequency trading and social
networks. Storm has three scopes of action:
(1) Stream processing
Storm can be used to process new data and update databases in real time, with both fault tolerance and
scalability.
(2) Continuous computation
Storm can run continuous queries and feed the results back to clients, for example sending Twitter's hot
topics to clients.
(3) Distributed RPC
Storm can be used to process intensive queries concurrently: a Storm topology is a distributed function
that waits for invocation messages, and when it receives one it computes the query and returns the
results.
4.5 Data presentation and interaction
The results need to be presented in a simple and intuitive way so that end users can understand and use
them, turning them into effective statistics, analyses, predictions and decisions applied to production
practice and business operations. Technologies for displaying big data, as well as technologies for
interacting with data, therefore also occupy an important position in the overall big data landscape.
Excel spreadsheets and charts are presentation forms that people have known and used for a long time,
and they provide great convenience for simple everyday data applications. Many Wall Street traders still
rely on Excel and years of accumulated, distilled formulas to carry out large stock trades, and
Microsoft and a number of entrepreneurs, seeing the market potential, are developing big data processing
platforms that use Excel for display and interaction on top of Hadoop and other technologies.
The human brain understands and processes graphics much faster than text, so presenting data visually
can reveal latent or complex patterns and relationships in depth. With the rise of big data, many new
forms of data presentation and interaction, and start-up companies focusing on this area, have emerged.
The new methods include interactive charts, which can be rendered on web pages, support interaction,
allow charts to be manipulated and controlled, and include animations and presentations. Interactive map
applications such as Google Maps can place dynamic markers, generate routes, superimpose panoramic and
aerial imagery, and so on; because its open API can be combined with many user maps and location-based
service applications, it has found extensive application. Google Chart Tools also offers a variety of
flexible approaches to website data visualization: from simple line charts, geocharts and gauges to
complex treemaps, it provides a large number of well-designed charting tools.
Tableau [44], a big data start-up born at Stanford, is becoming one of the outstanding data analysis
tools. Tableau joins data computation and aesthetically pleasing charts together perfectly, as shown in
Figure 6. Companies can use it to drag and drop large amounts of data onto a digital "canvas" and create
a variety of charts in a short time. Tableau's design and implementation philosophy is that the more
easily people can manipulate the data on the interface, the more thoroughly a company can understand
whether what it has done in its field is right or wrong. Fast processing and easy sharing are further
features of Tableau: in only a few seconds Tableau Server can publish an interactive dashboard on the
Internet, and users need only a browser to filter and select data easily and get answers to their
questions, which increases their enthusiasm for using data.
Another big data visualization start-up, Visual.ly [45], is known for its abundant infographic
resources. It is a social platform for creating and sharing infographics. We live in an era of data
acquisition and content creation, and Visual.ly is a product of this data age: a new visual infographics
platform. Many users are willing to upload their infographics to the website and share them with others;
infographics greatly stimulate visual expression and promote mutual learning and discussion among users,
and the platform offers visualization services for exploration, sharing and promotion. Using Visual.ly
to make an infographic is not complicated: it is an automated tool that makes inserting different types
of data quick and easy and expresses the data graphically.
Fig 6 Visualization examples of Tableau
In addition, 3D digital rendering technology has been widely applied in many fields, such as digital
cities, digital parks, modeling and simulation, and design and manufacturing, and offers highly
intuitive operability. Modern Augmented Reality (AR) technology applies virtual information to the real
world through computer technology, superimposing real environments and virtual objects in the same
picture or space, where they exist at the same time. Combining virtual 3D digital models with real-life
scenes, it provides a better sense of presence and interaction. With AR technology, users can interact
with virtual objects, for example trying on virtual glasses, trying on virtual clothes, or flying
simulated aircraft. In Germany, when engineering and technical personnel carry out mechanical
installation, maintenance or tuning, the internal structure of the machine and its associated
information, which could not be presented before, can be fully displayed through helmet-mounted
monitors.
Modern motion sensing technologies, such as Microsoft's Kinect and Leap's Leap Motion controller, can
detect and perceive body movements and gestures and convert these actions into the control of computers
and systems, freeing people from the constraints of keyboards, mice, remote controls and other
traditional interactive devices and letting us interact with computers and data directly through our
bodies and gestures. Today's hottest wearable technologies, such as Google Glass, combine big data
technology, augmented reality and motion sensing organically. As data and technologies improve, we can
perceive the reality around us in real time: through big data search and computation, we can identify
and capture data about the surrounding buildings, businesses, people and objects in real time and
project them onto our retinas, which helps us work, shop and relax and provides great convenience. Of
course, the drawbacks of such new equipment and technology are also obvious: we are monitored at all
times and exposed to privacy spying and violations, so the security issues brought by big data
techniques cannot be ignored.
5 Related research and our work
The scale effect of big data brings great challenges to data storage, management and analysis, and a
change in data management methods is brewing and taking place. Meng Xiaofeng and other scholars have
analyzed the basic concept of big data, briefly compared it with the major applications of big data,
explained and analyzed the basic framework of big data processing and the effect of cloud computing
technology on data management in the big data era, and summarized the new challenges we face in this era
[49]. Tao Xuejiao et al. [51] described and analyzed the related concepts and features of big data, the
development of big data techniques at home and abroad, especially in data mining, and the challenges we
face in the era of big data. Meanwhile, some scholars have pointed out that, facing the real-time and
validity requirements of data processing, we need a technological change in conventional data processing
that sets out from the characteristics of big data, in order to form techniques for big data collection,
storage, management, processing, analysis, sharing and visualization [52]. These survey papers pay more
attention to analyzing the characteristics and development trends of big data, and they are less
adequate on the problems big data faces and on a classified introduction and summary of its techniques.
Compared with traditional data warehouse applications, big data analysis involves larger volumes of data
and more complex queries and analysis. From the perspective of big data analysis and data warehouse
architecture design, the literature [33] first lists several important features that a big data analysis
platform needs to have, then analyzes and summarizes the current mainstream implementation platforms,
namely parallel databases, MapReduce, and hybrid architectures of the two, and points out their
strengths and weaknesses; HadoopDB [59][60] is one attempt to combine the two architectures. Some
scholars start from the competitive and symbiotic relationship between the RDBMS and MapReduce, analyze
the challenges each encounters in its development, and point out that relational and non-relational data
management technologies complement each other in constant competition and are finding their proper
places in the new big data analysis ecosystem [55][58]. In the study of NoSQL systems, scholars such as
Shen Derong [56] have systematically summarized the related research on NoSQL systems, including
architecture, data models, access methods, index techniques, transaction characteristics, system
elasticity, dynamic load balancing, replication policies, data consistency policies, flash-based
multi-level caching mechanisms, MapReduce-based data processing policies, and the new generation of data
management systems. These survey papers tend to introduce storage for massive data and analyze different
storage policies and their advantages and disadvantages, but they lack a comprehensive exposition of big
data techniques and ignore the synergy among different big data technologies and between big data
technology and cloud computing.
Modern science in the 21st century brings tremendous challenges to researchers. The scientific community
faces "data deluge" problems [1] arising from experimental data, simulation data, sensor data and
satellite data, and the size of the data and the complexity of scientific analysis and processing are
growing exponentially. Scientific Workflow Management Systems (SWFMS) provide the necessary support for
scientific computing, such as data management, task dependencies, job scheduling and execution, and
resource tracking. Workflow systems such as Taverna [65], Kepler [63], VisTrails [64], Pegasus [62],
Swift [39] and VIEW [66] are widely applied in fields such as physics, astronomy, bioinformatics,
neuroscience, earth science and social science. Meanwhile, the development of scientific instruments and
network computing challenges workflow systems in terms of data size and application complexity. We have
combined scientific workflow systems with cloud platforms, offered as a cloud computing service [67], to
deal with the growing volume of data and the complexity of analysis. A cloud computing system, with its
large-scale data center resource pool and on-demand resource allocation, can provide scientific workflow
systems with better services than the environments above, enabling workflow systems to handle PB-level
scientific problems.
6 Summary
Big data is a hot frontier of today's information technology development. The rapid development of the
Internet of Things, the Internet and mobile communication networks has spawned the big data problem and
brought problems in many respects, such as speed, structure, volume, cost, value, security and privacy,
and interoperability. Traditional IT processing methods are powerless in the face of the big data
problem because they lack scalability and efficiency. The big data problem needs cloud computing
techniques to be solved, while big data can also promote the practical landing and implementation of
cloud computing; the relationship between them is complementary. We have focused on infrastructure
support, data acquisition, data storage, data computing, data presentation and interaction, and other
aspects to describe the several kinds of techniques that big data covers, to describe the challenges and
opportunities of big data techniques from another angle for scholars in related fields, and to provide a
reference classification of big data technologies. Big data technology keeps growing with the surge in
data volume and processing requirements, affecting our habits and ways of life.
Acknowledgements: We express our gratitude to the colleagues who have given support and advice for this
article, especially the students and teachers of the Limit Network Computing and Service laboratory at
the School of Computer Science and Engineering, University of Electronic Science and Technology of
China.
References
1. Bell G, Hey T, Szalay A. Beyond the data deluge[J]. Science, 2009, 323(5919): 1297-1298.
2. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: The next frontier for innovation, competition, and productivity[J]. McKinsey Global Institute, May 2011.
3. Big Data Research and Development Initiative, http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
4. http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
5. Foster I, Zhao Y, Raicu I, Shiyong L. Cloud computing and grid computing 360-degree compared[C]//Grid Computing Environments Workshop, 2008. GCE'08. IEEE, 2008: 1-10.
6. OpenNebula, http://www.opennebula.org.
7. OpenNebula Architecture, http://www.opennebula.org/documentation:archives:rel2.2:architecture.
8. OpenStack, http://www.openstack.org.
9. Keahey K, Freeman T. Contextualization: Providing one-click virtual clusters[C]//eScience, 2008. eScience'08. IEEE Fourth International Conference on. IEEE, 2008: 301-308.
10. Barham P, Dragovic B, Fraser K, Hand S, Harris T, Ho A, Neugebauer R, Pratt I, Warfield A. Xen and the art of virtualization[J]. ACM SIGOPS Operating Systems Review, 2003, 37(5): 164-177.
11. KVM (Kernel Based Virtual Machine). http://www.linux-kvm.org/page/Main_Page.
12. Nurmi D, Wolski R, Grzegorczyk C, Obertelli G, Soman S, Youseff L, Zagorodnov D. The eucalyptus open-source cloud-computing system[C]//Cluster Computing and the Grid, 2009. CCGRID'09. 9th IEEE/ACM International Symposium on. IEEE, 2009: 124-131.
13. Ghemawat S, Gobioff H, Leung ST. The Google file system. In: Proc. of the 19th ACM Symp. on Operating Systems Principles. New York: ACM Press, 2003. 29-43.
14. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable: A distributed storage system for structured data. In: Proc. of the 7th USENIX Symp. on Operating Systems Design and Implementation. Berkeley: USENIX Association, 2006. 205-218.
15. Zheng QL, Fang M, Wang S, Wang XQ, Wu XW, Wang H. Scientific Parallel Computing Based on MapReduce Model. Micro Electronics & Computer, 2009, 26(8): 13-17 (in Chinese with English abstract).
16. Li GJ, Cheng XQ. Research Status and Scientific Thinking of Big Data[J]. Bulletin of Chinese Academy of Sciences, 2012, 27(6): 647-657 (in Chinese with English abstract).
17. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1626-1629.
18. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
19. Gopalakrishna K, Hu G, Seth P. Communication layer using ZooKeeper. Yahoo! Inc., Tech. Rep., 2009.
20. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.
21. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a not-so-foreign language for data processing[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 1099-1110.
22. Malkin J, Schroedl S, Nair A, Neumeyer L. Tuning Hyperparameters on Live Traffic with S4. In: TechPulse 2010: Internal Yahoo! Conference, 2010.
23. Schroedl S, Kesari A, Neumeyer L. Personalized ad placement in web search[C]//Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising (AdKDD), Washington, USA. 2010.
24. Stonebraker M, Çetintemel U, Zdonik S. The 8 requirements of real-time stream processing[J]. ACM SIGMOD Record, 2005, 34(4): 42-47.
25. Apache Hadoop. http://hadoop.apache.org/.
26. Neumeyer L, Robbins B, Nair A, Kesari A. S4: Distributed stream computing platform[C]//Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010: 170-177.
27. Khetrapal A, Ganesh V. HBase and Hypertable for large scale distributed storage systems[J]. Dept. of Computer Science, Purdue University, 2006.
28. http://cassandra.apache.org/
29. http://www.mongodb.org/
30. http://mahout.apache.org/
31. Li YL, Dong J. Study and Improvement of MapReduce based on Hadoop. Computer Engineering and Design, 2012, 33(8): 3110-3116 (in Chinese with English abstract).
32. Baker J, Bond C, Corbett JC, Furman JJ, Khorlin A, Larson J, Leon JM, Li YW, Lloyd A, Yushprakh V. Megastore: Providing Scalable, Highly Available Storage for Interactive Services[C]//CIDR. 2011, 11: 223-234.
33. Wang S, Wang HJ, Qin XP, Zhou X. Architecting Big Data: Challenges, Studies and Forecasts. Chinese Journal of Computers, 2011, 34(10): 1741-1752 (in Chinese with English abstract).
34. Kafka. http://kafka.apache.org/
35. Flume. https://github.com/cloudera/flume
36. TimeTunnel. http://code.taobao.org/p/TimeTunnel/src/
37. Rabkin A, Katz R. Chukwa: A system for reliable large-scale log collection[C]//Proceedings of the 24th international conference on Large installation system administration. USENIX Association, 2010: 1-15.
38. Swift Workflow System, http://www.ci.uchicago.edu/Swift/main/.
39. Zhao Y, Hategan M, Clifford B, Foster I, von Laszewski G, Nefedova V, Raicu I, Stef-Praun T, Wilde M. Swift: Fast, reliable, loosely coupled parallel computation[C]//Services, 2007 IEEE Congress on. IEEE, 2007: 199-206.
40. Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T. Dremel: interactive analysis of web-scale datasets[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2): 330-339.
41. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets[C]//Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. 2010: 10-10.
42. Kornacker M, Erickson J. Cloudera Impala: real-time queries in Apache Hadoop, for real. 2012. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
43. Storm, Distributed and fault-tolerant realtime computation, http://storm-project.net/.
44. Tableau, http://www.tableausoftware.com/
45. Visual.ly, http://visual.ly/
46. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012: 2-2.
47. Gupta R, Gupta H, Mohania M. Cloud Computing and Big Data Analytics: What Is New from Databases Perspective?[M]//Big Data Analytics. Springer Berlin Heidelberg, 2012: 42-61.
48. Avro, http://avro.apache.org/
49. Meng XF, Ci X. Big Data Management: Concept, Techniques and Challenges. Journal of Computer Research and Development, 2013, 50(1): 146-169 (in Chinese with English abstract).
50. Scribe. https://github.com/facebook/scribe
51. Tao XJ, Hu XF, Liu Y. Overview of Big Data Research. Journal of System Simulation, 2013, 25S: 142-146 (in Chinese with English abstract).
52. Yan XF, Zhang DX. Big Data Research. Computer Technology and Development, 2013, 23(4): 168-172 (in Chinese with English abstract).
53. Amini L, Andrade H, Bhagwan R, Eskesen F, King R, Selo P, Park Y, Venkatramani C. SPC: A distributed, scalable platform for data mining[C]//Proceedings of the 4th international workshop on Data mining standards, services and platforms. ACM, 2006: 27-37.
54. Labrinidis A, Jagadish HV. Challenges and opportunities with big data[J]. Proceedings of the VLDB Endowment, 2012, 5(12): 2032-2033.
55. Qin XP, Wang HJ, Du XY, Wang S. Big Data Analysis: Competition and Symbiosis of RDBMS and MapReduce. Journal of Software, 2012, 23(1): 32-45 (in Chinese with English abstract).
56. Shen DR, Yu G, Wang XT, Nie TZ, Kou Y. Survey on NoSQL for Management of Big Data. Journal of Software, 2013, 24(8): 1786-1803 (in Chinese with English abstract).
57. Zikopoulos PC, Eaton C, DeRoos D, Deutsch T, Lapis G. Understanding big data[M]. New York: McGraw-Hill, 2012.
58. Qin XP, Wang HJ, Li FR, Li CP, Chen H, Zhou X, Du XY, Wang S. New Landscape of Data Management Technologies. Journal of Software, 2013, 24(2): 175-197 (in Chinese with English abstract).
59. Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2009, 2(1): 922-933.
60. Abouzied A, Bajda-Pawlikowski K, Huang JW, Abadi DJ, Silberschatz A. HadoopDB in action: Building real world applications. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD 2010. Indianapolis: ACM Press, 2010. [doi: 10.1145/1807167.1807294]
61. Taylor RC. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics[J]. BMC Bioinformatics, 2010, 11(Suppl 12): S1.
62. Deelman E, Singh G, Su MH, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A, Jacob JC, Katz DS. Pegasus: A framework for mapping complex scientific workflows onto distributed systems[J]. Scientific Programming, 2005, 13(3): 219-237.
63. Ludäscher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y. Scientific workflow management and the Kepler system[J]. Concurrency and Computation: Practice and Experience, 2006, 18(10): 1039-1065.
64. Freire J, Silva CT, Callahan SP, Santos E, Scheidegger CE, Vo HT. Managing rapidly-evolving scientific workflows[M]//Provenance and Annotation of Data. Springer Berlin Heidelberg, 2006: 10-18.
65. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services[J]. Nucleic Acids Research, 2006, 34(suppl 2): W729-W732.
66. Lin C, Lu S, Lai Z, Chebotko A, Fei X, Hua J, Fotouhi F. Service-oriented architecture for VIEW: A visual scientific workflow management system[C]//Services Computing, 2008. SCC'08. IEEE International Conference on. IEEE, 2008, 1: 335-342.
67. Zhao Y, Li Y, Tian W, Xue R. Scientific-Workflow-Management-as-a-Service in the Cloud[C]//Cloud and Green Computing (CGC), 2012 Second International Conference on. IEEE, 2012: 97-104.
Chinese references
15. 郑启龙,房明,汪胜,王向前,吴晓伟,王昊.基于MapReduce模型的并行科学计算.微电子学与计算机,2009,26(8):13-17.
16. 李国杰,程学旗.大数据研究:未来科技及经济社会发展的重大战略领域——大数据的研究现状与科学思考[J].中国科学院院刊,2012,27(6):647-657.
31. 李玉林,董晶.基于Hadoop的MapReduce模型的研究与改进.计算机工程与设计,2012,33(8):3110-3116.
33. 王珊,王会举,覃雄派,周烜.架构大数据:挑战,现状与展望[J].计算机学报,2011,34(10):1741-1752.
49. 孟小峰,慈祥.大数据管理:概念,技术与挑战[J].计算机研究与发展,2013,50(1):146-169.
51. 陶雪娇,胡晓峰,刘洋.大数据研究综述[J].系统仿真学报,2013,25S:142-146.
52. 严霄凤,张德馨.大数据研究[J].计算机技术与发展,2013,23(4):168-172.
55. 覃雄派,王会举,杜小勇,王珊.大数据分析——RDBMS与MapReduce的竞争与共生[J].软件学报,2012,23(1):32-45.
56. 申德荣,于戈,王习特,聂铁铮,寇月.支持大数据管理的NoSQL系统研究综述[J].软件学报,2013,24(8):1786-1803.
58. 覃雄派,王会举,李芙蓉,李翠平,陈红,周烜,杜小勇,王珊.数据管理技术的新格局[J].软件学报,2013,24(2):175-197.