Chapter 1 Overview of Big Data Technology

Abstract: We are entering a "big data" era. Bottlenecks of the traditional information technology stack, such as poor scalability, difficult installation and maintenance, weak fault tolerance, and low performance, mean that we must turn to cloud computing techniques and solutions to deal with big data problems. Cloud computing and big data are complementary to each other and stand in a relationship of dialectical unity. Breakthroughs in big data techniques will not only resolve the current predicament but also promote the wide application of cloud computing and Internet of Things technologies. This chapter focuses on the development and the pivotal techniques of big data, and gives a comprehensive description of big data from several perspectives: its development, the current data explosion, the relationship between big data and cloud computing, and the main big data techniques. Finally, we introduce related research and our current work.

Key words: big data technology; cloud computing; data acquisition; data storage; data computation

Chapter Index
1 The Background and Definition of Big Data
2 Big Data Problems
  2.1 Problems of speed
  2.2 Type and architecture problems
  2.3 Volume and flexibility problems
  2.4 Cost problems
  2.5 Value mining problems
  2.6 Storage and security problems
  2.7 Interoperability and data sharing issues
3 Dialectical Relationship between Cloud Computing and Big Data
4 Big Data Technology
  4.1 Infrastructure Support
  4.2 Data acquisition
  4.3 Data Storage
  4.4 Data computing
    4.4.1 Offline batch computing
    4.4.2 Real-time interactive computing
    4.4.3 Streaming computing
  4.5 Data presentation and interaction
5 Related Research and Our Work
6 Summary
References
Chinese References

1 The Background and Definition of Big Data

Information technology is opening the door to the smart society. It has driven the development of modern services such as Internet e-commerce, modern logistics, and e-finance, and has spurred emerging industries such as telematics, the smart grid, new energy, intelligent transportation, smart cities, and high-end equipment manufacturing. Modern information technology is becoming the engine that powers the operation and development of every industry, but this engine is now facing the huge test of big data [57]. Business data of every kind is exploding geometrically [1], and problems of collection, storage, retrieval, analysis, and application can no longer be solved by traditional information processing technology; this has become a great obstacle on the way to a digital, networked, and intelligent society.

The New York Stock Exchange produces about 1 TB of trading data every day; Twitter generates more than 7 TB of data per day; Facebook produces more than 10 TB per day; and the Large Hadron Collider at the European Organization for Nuclear Research (CERN) produces about 15 PB of data per year. According to statistics compiled by the market research firm IDC, the total volume of global information in 2007 was about 165 EB. Even in 2009, the year of the global financial crisis, global information volume reached 800 EB, an increase of 62 percent over the previous year. In the future, the world's data volume will double roughly every 18 months, reaching 35 ZB in 2020, about 230 times the 2007 figure, whereas the written record of 5,000 years of human history amounts to only about 5 EB. These figures indicate that the era of TB, PB, and even EB is already history: global data storage is formally entering the "zetta era".

Beginning in 2009, "big data" became a buzzword of the Internet and information technology industry. Most early applications of big data were in the Internet industry, where data grows by about 50% per year, doubling every two years, and the global Internet companies quickly became aware of the advent of the "big data" era and the great significance of data. In May 2011, the McKinsey Global Institute published the report "Big data: The next frontier for innovation, competition, and productivity" [2]; since its release, "big data" has become a hot concept in the computer industry. In April 2012, the Obama administration in the US launched the "Big Data Research and Development Initiative" [3] and allocated a special fund of $200 million, setting off a wave of big data interest around the world. According to the big data report released by Wikibon in 2011 [4], the big data market is on the eve of a growth spurt, and the global market value of big data will reach $50 billion within the next five years. At the beginning of 2012, total revenue from big-data-related software, hardware, and services was only about $5 billion.
But as companies gradually realize that big data and the related analytics can create new differentiated competitive advantages and improve operational efficiency, big data techniques and services will develop considerably, gradually be put into practice, and maintain a compound annual growth rate of about 58% over the next five years. Greg McDowell, an analyst with JMP Securities, estimated that the market for big data tools will grow from $9 billion to $86 billion within ten years, and that by 2020 investment in big data tools will account for 11% of overall corporate IT spending.

At present the industry has no unified definition of big data; the following definitions are commonly cited:

"Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." -- McKinsey

"Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time." -- Wikipedia

"Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." -- Gartner

Big data has four characteristics: Volume, Velocity, Variety, and Value [47] (referred to as the "4Vs"), meaning huge data volume, fast processing speed, diverse data types, and low value density. Brief descriptions of each characteristic follow.

Volume: the sheer amount of data. Data sets keep growing, from GB to TB to PB, and are now even measured in EB and ZB. For instance, the video monitors of a medium-sized city can produce tens of TB of data every day.

Variety: the types of big data are complex. In the past, the data we generated or processed were simpler and mostly structured. Now, with the emergence of new channels and technologies such as social networking, the Internet of Things, mobile computing, and online advertising, large amounts of semi-structured and unstructured data are produced, such as XML, email, blogs, and instant messages, resulting in a surge of new data types. Companies need to integrate and analyze data from complex traditional and non-traditional sources of information, both inside and outside the company. With the explosive growth of sensors, smart devices, and social collaboration technologies, the types of data are almost uncountable: text, microblogs, sensor data, audio, video, click streams, log files, and so on.

Velocity: the speed at which data are generated, processed, and analyzed keeps accelerating, driven by the real-time nature of data creation and by the demand to feed streaming data into business processes and decision-making processes. Processing capability is shifting from batch processing to stream processing; the industry describes big data's processing capability as the "one-second rule", which captures the essential difference from traditional data mining.

Value: because of the ever-growing scale, the value density per unit of data keeps decreasing, while the overall value of the data keeps increasing. Big data is even equated with gold and oil, indicating that it contains virtually unlimited commercial value.
According to an IDC research report, the big data technology and services market will grow from $3.2 billion in 2010 to $16.9 billion in 2015, an annual growth rate of about 40%, roughly seven times the growth rate of the IT and communications industry as a whole. By processing big data and discovering its latent commercial value, we can generate enormous commercial profit. In concrete applications, big data processing technology can provide technical and platform support for the pillar enterprises of the national economy: analyzing, processing, and mining data for enterprises, extracting important information and knowledge, and transforming them into useful models applied to research, production, operations, and sales.

Meanwhile, the state strongly advocates the construction of "smart cities". In the context of urbanization and informatization, smart cities focus on improving people's livelihood, enhancing the competitiveness of enterprises, and promoting the sustainable development of cities. They comprehensively employ the Internet of Things, cloud computing, and other information technologies, combine them with the city's existing information infrastructure, and integrate advanced concepts of urban operation services to establish a widely covering and deeply interconnected information network. Such a network perceives the many elements of a city, such as resources, environment, infrastructure, and industry, and supports a synergetic, shared urban information platform on which information is processed and used intelligently. The platform can then provide intelligent response and control for city operations and resource allocation, give government social management and public services an intelligent basis and means for decision making, and offer enterprises and individuals intelligent information resources and an open information platform in the regional informatization process. Data are undoubtedly the cornerstone of these new IT services and of scientific research, and big data processing technology has naturally become a hot spot of today's information technology development; its flourishing heralds the arrival of another IT revolution. On the other hand, with the deepening of national economic restructuring and industrial upgrading, the role of information processing technologies will become increasingly prominent, and big data processing technology will become the best opportunity to overtake in core technologies, to break through in applications, and to reduce dependence on foreign products in the informatization of the pillar industries of the national economy [16].

2 Big Data Problems

Big data has become an invisible "gold mine" because of the potential value it contains. As data from production, operations, management, monitoring, sales, customer service, and other activities keep accumulating and growing, and as the number of users rises, analyzing the correlation patterns and trends in these large volumes of data makes efficient management and precision marketing possible; this is the key that opens the "gold mine". However, traditional IT infrastructure and traditional methods of data management and analysis cannot cope with the rapid growth of big data.
We summarize big data problems into the seven categories listed in Table 1.

Table 1 Classification of big data problems

Problem category                    Description
Speed                               Data import and export; statistical analysis; query and retrieval; real-time response
Type and structure                  Multi-source data; heterogeneous data; the original systems' infrastructure
Volume and flexibility              Linear scalability; dynamic scheduling
Cost                                Cost comparison between mainframes/minicomputers and commodity servers; controlling the cost of modifying existing systems
Value mining                        Data analysis and mining; the actual benefit obtained from mining
Storage and security                Structured and unstructured storage; data security; privacy protection
Interoperability and data sharing   Data standards and interfaces; sharing protocols; access permissions

2.1 Problems of speed

Traditional relational database management systems (RDBMS) generally use centralized storage and processing rather than a distributed architecture. In many large enterprises the configuration is typically "IOE" (IBM servers, Oracle databases, EMC storage). In such a configuration a single server is usually very powerful, with dozens of CPU cores and hundreds of GB of memory, and the database resides on a high-speed, large-capacity disk array whose storage space can reach the TB level. This configuration satisfies a traditional management information system (MIS), but when facing ever-growing data volumes and dynamic data usage scenarios, the centralized approach becomes the bottleneck, especially in response speed. For bulk import and export, statistical analysis, retrieval, and queries, performance declines sharply as data volume grows because of the dependence on centralized storage and indexing, let alone statistics and query scenarios that require real-time response. In the Internet of Things, for instance, sensor data can number in the billions of records and must be stored, queried, and analyzed in real time, so a traditional RDBMS is no longer suitable for the application requirements.

2.2 Type and architecture problems

For structured data with fixed schemas, RDBMSs have developed relatively mature approaches to storage, query, statistics, and processing. With the rapid development of the Internet of Things, the Internet, and mobile communication networks, the formats and types of data are constantly changing and evolving. In the intelligent transportation field, the data involved may include text, logs, pictures, videos, vector maps, and other kinds of data from different monitoring sources, and their formats are usually not fixed; rigid structured storage schemas make it difficult to respond to changing needs. We therefore need multiple modes of data processing and storage, integrating structured and unstructured storage to handle data whose types, sources, and structures vary. The overall data management model and architecture also require new distributed file systems and distributed NoSQL databases to accommodate large data volumes and shifting structures.

2.3 Volume and flexibility problems

As noted above, the combination of huge volume and centralized storage causes problems of speed and response.
When the amount of data grows and concurrent reads and writes become heavier and heavier, a centralized file system or a single database becomes the fatal performance bottleneck, since a single machine can withstand only limited pressure. We can spread the load across many machines by adopting linearly scalable frameworks and methods, dynamically adding or removing file or database servers according to the data volume and the degree of concurrency, thereby achieving linear scalability. For data storage, this means adopting a distributed, scalable architecture such as the well-known Hadoop Distributed File System [25] and the HBase database [27]. For data processing, we likewise need a distributed architecture that assigns processing tasks to many computing nodes, taking into account the affinity between storage nodes and compute nodes.

In computing, the allocation of resources and tasks is essentially a scheduling problem. Its main job is to find the best match between resources and jobs, or among tasks, given the resources (CPU, memory, storage, network, and so on) of each node in the current cluster and the quality-of-service requirements of each user's jobs. Because user requirements for service quality vary and resource states change constantly, finding appropriate resources for distributed data processing is a dynamic scheduling problem.

2.4 Cost problems

With centralized data storage and processing, the usual approach when selecting hardware and software is to use very highly configured mainframe or minicomputer servers together with fast, highly reliable disk arrays in order to guarantee data processing performance. Such hardware is very expensive, often costing several million dollars, while the software is typically supplied by large foreign vendors such as Oracle, IBM, SAP, and Microsoft; maintaining the servers and databases also requires professional technical staff, so both investment and operating costs are high. Facing the challenge of massive data, these vendors have introduced monolithic "all-in-one" appliances such as Oracle's Exadata and SAP's HANA, which relieve data pressure by stacking multiple servers, massive memory, flash storage, high-speed networks, and other hardware, but the hardware cost is significantly higher still, and ordinary enterprises can hardly afford it.

The new distributed storage architectures and distributed databases such as HDFS, HBase, Cassandra [28], and MongoDB [29] avoid the bottleneck of centralized data processing and aggregation because they use a decentralized, massively parallel processing (MPP) architecture; with linear scalability they can deal effectively with the storage and processing of big data. Their software architecture also includes self-management and self-healing mechanisms that cope with the occasional failures inevitable among massive numbers of nodes and keep the overall system robust, so the hardware configuration of each node need not be high; even an ordinary PC can serve as a server, which greatly reduces the server cost, and on the software side open source offers a very large price advantage. Of course, cost is not simply a comparison of hardware and software prices.
Migrating systems and applications to the new distributed architecture requires many adjustments, from the bottom platform to the upper applications, especially in database schemas and application programming interfaces, where NoSQL databases differ greatly from the original RDBMS; enterprises need to assess the cost, schedule, and risk of migration and redevelopment. They also need to consider the cost of services, training, operation, and maintenance. In the general trend, however, as these new data architectures and products mature and as commercial companies provide professional database development and consulting services based on open source, the new distributed, scalable database architecture is bound to win in the big data wave, beating the traditional centralized mainframe model in both cost and performance.

2.5 Value mining problems

Because of the huge and growing volume, the value density per unit of big data keeps falling while the overall value keeps rising; big data is likened to oil and gold, and we can mine huge business value from it [54]. Extracting the hidden patterns from such large amounts of data requires deep data mining and analysis. Big data mining differs considerably from the traditional data mining model: traditional data mining generally handles moderate volumes of data with relatively complex algorithms that converge slowly, whereas in big data scenarios the quantity of data is massive, and data storage, data cleaning, and ETL (extraction, transformation, loading) must cope with the demands of massive data, which calls for distributed parallel processing. For example, the search engines of Google and Microsoft need hundreds or even thousands of servers working in parallel just to archive the search logs generated by the search behavior of billions of users worldwide. When mining the data, we also need to adapt traditional data mining algorithms and their underlying processing architecture, and introducing parallel processing is a promising way to accelerate massive data computation and analysis; Apache's Mahout project [30], for instance, provides parallel implementations of a series of data mining algorithms. In many application scenarios results are even needed in real time, which poses a huge challenge, because data mining algorithms usually take a long time, especially on huge data volumes; in such cases only a combination of real-time computation and large-scale offline processing may meet the demand.

The actual gain of data mining needs careful assessment before mining the value of big data, and not every data mining project achieves the desired result. First, the authenticity and completeness of the data must be guaranteed: if the collected information itself contains heavy noise, or some key data are missing, the value that can be dug out will be undermined. Second, the costs and benefits of mining must be weighed: if the investment in manpower, hardware, and software platforms is high and the project cycle long, but the extracted information contributes little to the enterprise's production decisions or cost effectiveness, then the mining effort is impractical and not worth the candle.
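To make the mining step concrete, the following minimal sketch uses Mahout's single-machine Taste recommender API; the distributed Mahout implementations mentioned above run comparable algorithms as MapReduce jobs on a cluster. The input file ratings.csv (lines of the form userID,itemID,preference) is a hypothetical example, not part of the original text.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical preference file: "userID,itemID,preference"
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1, scored by estimated preference.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```

Whether such a recommender actually pays off still depends on the data quality and on the cost-benefit assessment discussed above.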
2.6 Storage and security problems

The changeable formats and huge volume of big data also bring many challenges to storage and security. For structured data, relational database management systems have developed fairly mature storage, access, security, and backup mechanisms over several decades. The sheer volume of big data nevertheless affects the traditional RDBMS: as mentioned above, centralized storage and processing are shifting to distributed parallel processing. In most cases big data are unstructured, which has given rise to many distributed file systems and distributed NoSQL databases, but these emerging systems still need to improve user management, data access privileges, backup mechanisms, security controls, and other aspects. In short, it is first necessary to prevent data loss by providing reasonable backup and redundancy mechanisms for massive structured and unstructured data, so that data are not lost under any circumstances. Second, data must be protected from unauthorized access: only users with the proper authority may access the data. Because large amounts of unstructured data may require different storage and access mechanisms, forming a unified security access control mechanism for multi-source data of multiple types remains a big problem to be solved.

Because big data concentrates more sensitive data in one place, it is more attractive to potential attackers: a successful attack yields more information, so the "cost-performance ratio" of attacking is higher. All of this makes big data a more likely target of attack. In 2012, LinkedIn was accused of leaking 6.5 million account passwords, and Yahoo! suffered network attacks that exposed 450,000 user IDs; in December 2011, CSDN's security system was breached and 6 million users' login names, passwords, and email addresses were leaked.

Privacy problems are also closely associated with big data. With the rapid development of Internet and Internet of Things technology, all kinds of information related to our work and lives are being collected and stored. We are constantly exposed to a "third eye": whether we are surfing the Internet, making phone calls, posting microblogs, using WeChat, shopping, or traveling, our actions are being monitored and analyzed. Deep analysis and modeling of user behavior can serve customers better and make precision marketing possible, but if the information is leaked or abused, it directly violates users' privacy, harms them, and can even cause loss of life and property. In 2006 the U.S. DVD rental company Netflix organized an algorithm contest: it released a million rental records from about 500,000 users and publicly offered a one-million-dollar prize in a software design contest to improve the accuracy of its movie recommendation system, the condition of victory being a 10% improvement of its recommendation engine. Although the company carefully anonymized the data, one user was re-identified from it, and a closeted lesbian mother from the conservative Midwest sued Netflix under the name "Anonymous".
On Twitter, a popular site in the United States, many users are accustomed to publishing their location and status at any moment. A few sites, such as "PleaseRobMe.com" ("come rob me") and "WeKnowYourHouse.com" ("we know you're home"), use the information users post to infer when they are away from home, work out their exact home addresses, and even find photos of their houses. These sites are designed to remind us that we are always exposed to the public eye; if we do not develop an awareness of safety and privacy, we will bring disaster upon ourselves. Nowadays many countries around the world, including China, are improving the laws related to data use and privacy to prevent private information from being abused.

2.7 Interoperability and data sharing issues

In the informatization of China's enterprises, fragmentation and information silos are common: systems and data in different industries have almost no intersection. Even within a single industry, such as transportation or social security, systems are divided and built along administrative regions, so information exchange and collaboration across regions are very difficult. Worse still, even within a single organization, for example in some hospitals' information systems, the subsystems for medical record management, bed information, and medicine management are built separately, with no information sharing or interoperability between them. The "smart city" is the emphasis of China's Twelfth Five-Year Plan for informatization. Its foundation is the interoperability and sharing of information, on which intelligent e-government, social management, and improvements to people's livelihood can be built through the combination of data. On the basis of the digital city, therefore, we must also achieve interconnection, open the data interfaces of all walks of life, and achieve interoperability before we can achieve intelligence. In urban emergency management, for example, we need data and assistance from many departments: transportation, population, public security, fire, health care, and so on. The data sharing platform built by the U.S. federal government (www.data.gov) and the data resources network of the Beijing municipal government (www.bjdata.gov.cn), among others, are forceful attempts at open data and data sharing.

To achieve cross-industry data integration, we need to formulate uniform data standards, exchange interfaces, and sharing protocols, so that data from different industries, different departments, and different formats can be accessed, exchanged, and shared on a uniform basis. For data access we also need fine-grained access authorization that defines which users can access which types of data under which circumstances. In the era of big data and cloud computing, data from different industries and enterprises may be stored on a unified platform and in shared data centers; sensitive information such as enterprises' trade secrets and transaction records must be protected, so that even though processing relies on the platform, nobody except the owner's authorized personnel, not even the platform administrators or other companies, can access such data.

3 Dialectical Relationship between Cloud Computing and Big Data

Cloud computing has developed rapidly since 2007.
The core model of cloud computing is large-scale distributed computing that provides computing, storage, networking, and other resources to many users as services, to be consumed on demand [5]. Cloud computing offers enterprises and users high scalability, high availability, and high reliability, uses resources efficiently, and can improve the efficiency of resource services while reducing the investment and maintenance costs of enterprise informatization. As the public cloud services of Amazon, Google, and Microsoft in the U.S. become more mature and complete, more and more companies are migrating to cloud computing platforms. Driven by national strategic planning and active guidance, cloud computing and its technologies have also made great progress in China in recent years. China has designated several cloud computing pilot cities, including Beijing, Shanghai, Shenzhen, Hangzhou, and Wuxi; Beijing's "Lucky Clouds" plan, Shanghai's "Sea of Clouds" plan, Shenzhen's Cloud Computing International Joint Laboratory, Wuxi's cloud computing projects, and Hangzhou's West Lake cloud computing platform for public service have all been launched, and other cities such as Tianjin, Guangzhou, Wuhan, Xi'an, Chongqing, and Chengdu have also introduced cloud computing development plans or set up cloud computing alliances to carry out research, development, and trials. But the adoption of cloud computing in China is still largely limited by infrastructure construction and the lack of large-scale industrial applications, so cloud computing has not really taken root.

The reason is that the ubiquity of the Internet of Things and cloud computing technology is a grand vision: it would enable large-scale, ubiquitous, and collaborative information collection, processing, and application. The premise of such applications, however, is that most industries and enterprises already have a good foundation and experience in informatization and an urgent need to transform their existing system architectures and improve the efficiency of their existing systems, whereas in reality most of our small and medium-sized enterprises have only just started on informatization, and only a few large companies and national ministries have such a foundation. The outbreak of big data, by contrast, is a thorny problem actually encountered in the development of society and informatization. Data traffic and data volume keep growing, data formats are multi-source and heterogeneous, and real-time, accurate processing is demanded in order to discover the potential value in large amounts of data. Traditional IT architecture cannot handle the big data problem: it suffers from poor scalability, poor fault tolerance, low performance, difficult installation, deployment, and maintenance, and many other bottlenecks. The rapid development of the Internet of Things, the Internet, and mobile communication networks in recent years has greatly accelerated the frequency and speed of data transmission, giving birth to the big data problem, while the secondary development and deep reuse of data have made it increasingly acute. We believe that cloud computing and big data are complementary and stand in a dialectical relationship to each other.
The widespread use of cloud computing and the Internet of Things is our vision, while the outbreak of big data is a thorny problem encountered along the way; the former is the dream pursued by human civilization, the latter a bottleneck of social development that must be resolved; cloud computing is a trend of technological development, big data an inevitable phenomenon of the rapid development of the modern information society. Solving the big data problem requires modern means. A breakthrough in big data technology will not only solve real problems, but will also allow cloud computing and Internet of Things technology to be put into practice and applied widely and deeply.

From the development of IT technologies we can summarize a few patterns. (1) The competition between mainframes and personal computers ended in the PC's triumph; in the battle between Apple's iOS and Android, the open Android platform seized a third of the market within two or three years, and Nokia's Symbian operating system came to the brink of elimination because it was not open. All of this indicates that modern IT needs openness and crowdsourcing to develop rapidly. (2) The collision between existing conventional technology and cloud computing technology is similar. The advantage of cloud computing lies in crowdsourcing and open source: built on open-source platforms and new open-source distributed architectures, it can solve problems that the existing centralized approach solves only with difficulty or cannot solve at all. Taobao, Tencent, and other large Internet companies once relied on proprietary solutions from vendors such as Sun, Oracle, and EMC, but abandoned them because of the high cost, adopted open source technologies, and ultimately contributed their own products back to the open source community, reflecting the direction in which information technology is developing. (3) The traditional industry giants are tilting toward open source systems, and this is a historic opportunity for catching up. For historical reasons, traditional industry giants and large central enterprises such as the State Grid, telecommunications, banking, and civil aviation are over-reliant on sophisticated proprietary solutions provided by foreign companies, resulting in a pattern that lacks innovation and is held hostage by foreign products.

Looking at the path to cracking the problem: to solve the big data problem we must gradually abandon the traditional IT architecture and adopt the new generation of information technologies marked by "cloud" technology. Although advanced cloud computing technology originated mainly in the United States, it is built on open source, so the gap between our technology and the most advanced is not large; the urgent need to apply cloud computing technology to big data problems in large-scale industries is also our historic opportunity to achieve breakthrough innovation, break monopolies, and catch up with international advanced technology.

4 Big Data Technology

Big data brings not only opportunities but also challenges. Traditional data processing means cannot meet big data's demand for massive, real-time processing; we need a new generation of information technology to deal with the outbreak of big data. We summarize big data technology in five categories, as shown in Table 2.
Table 2 Classification of big data technologies

Category                            Technologies and tools
Infrastructure support              Cloud computing platforms; cloud storage; virtualization; network technology; resource monitoring
Data acquisition                    Data bus; ETL tools
Data storage                        Distributed file systems; relational databases; NoSQL technology; integration of relational and non-relational databases; in-memory databases
Data computing                      Data query, statistics, and analysis; data mining and prediction; graph processing; BI (business intelligence)
Data presentation and interaction   Graphics and reports; visualization tools; augmented reality

Infrastructure support: mainly the infrastructure management centers that support big data processing, cloud computing platforms, cloud storage equipment and technology, network technology, and resource monitoring technology. Big data processing needs the support of cloud data centers with large-scale physical resources and cloud computing platforms with efficient scheduling and management functions.

Data acquisition technology: data acquisition is a prerequisite for data processing; we first need means of acquisition to collect the information before the upper-layer processing technologies can be applied to it. Besides the various types of sensors and other hardware and software equipment, data acquisition involves the ETL (extraction, transformation, loading) process, which pre-processes the data by cleaning, filtering, checking, and converting it, turning valid data into suitable formats and types. Moreover, to support the acquisition and storage of multi-source, heterogeneous data, we also need to design an enterprise data bus to facilitate data exchange and sharing among the various applications and services of the enterprise.

Data storage technology: after collection and conversion, data need to be stored and archived. For large amounts of data, we generally use distributed file systems and distributed databases to spread the data over multiple storage nodes, and we must also provide mechanisms such as backup, security, access interfaces, and protocols.

Data computing: data query, statistics, analysis, forecasting, mining, graph processing, BI (business intelligence), and related technologies are collectively referred to as data computing technology. Data computing covers all aspects of data processing and is the core of big data technology.

Data presentation and interaction: presentation and interaction are also essential in big data technology, since data are ultimately used by people to provide decision support for production, operations, and planning. Choosing an appropriate, vivid, and visual form of presentation gives us a better understanding of the data, their connotations, and their relationships, and helps us interpret and use the data more effectively to realize their value. Beyond traditional reports and graphics, we can combine modern visualization tools and human-computer interaction, and even use augmented reality technology such as Google Glass, to achieve a seamless interface between data and reality.

4.1 Infrastructure Support

Big data processing needs the support of cloud data centers with large-scale physical resources and cloud computing platforms with efficient scheduling and management functions.
Cloud computing management platforms provide a flexible and efficient environment for the deployment, operation, and management of large data centers and enterprises. They support heterogeneous underlying hardware and operating systems through virtualization, provide applications with secure, high-performance, highly extensible, highly reliable, and highly scalable cloud resource management solutions, reduce the costs of application development, deployment, operation, and maintenance, and improve resource utilization.

As a new computing model, cloud computing has gained great momentum in academia and industry. Governments, research institutions, and industry leaders are actively trying to solve the growing computing and storage problems of the Internet age with cloud computing. Besides commercial cloud platforms such as Amazon's AWS, Google's App Engine, and Microsoft's Windows Azure services, there are also open source cloud computing platforms such as OpenNebula [6][7], Eucalyptus [12], Nimbus [9], and OpenStack [8], each with its own notable features and a constantly evolving community.

Amazon's AWS is arguably the most popular cloud computing platform; in the first half of 2013 its platform and cloud services earned $1.7 billion, a year-on-year growth of 60%. The most important feature of its system architecture is that functions are exposed through Web Service interfaces and the systems are loosely coupled in an SOA architecture. The web service stack provided by AWS can be divided into four layers:
(1) the access layer, which provides the management console, APIs, and various command-line tools;
(2) the common service layer, including authentication, monitoring, deployment, and automation;
(3) the PaaS layer, including parallel processing, content delivery, and messaging services;
(4) the IaaS layer, including the EC2 computing platform, the S3/EBS storage services, the VPC/ELB network services, database services, and so on.

Eucalyptus is an open source cloud computing platform that attempts to clone AWS. It implements functions similar to Amazon EC2, providing flexible, practical cloud computing over computing clusters or groups of workstations, and it offers compatibility with the EC2 and S3 interfaces, so applications written against these interfaces can interact with Eucalyptus directly. It supports the Xen [10] and KVM [11] virtualization technologies and provides cloud management tools for system management and user accounting. Eucalyptus consists of five major components: the cloud controller (CLC), the cloud storage service (Walrus), the cluster controller (CC), the storage controller (SC), and the node controller (NC). Eucalyptus manages computing resources through agents, and these components cooperate to provide the required cloud services.

OpenNebula is an open source project for virtual infrastructure management and cloud computing, initiated as a European research effort in 2005. It is an open source toolkit used to build IaaS private, public, and hybrid clouds, and a modular system that can implement different cloud architectures and interact with a variety of data center services. OpenNebula integrates storage, network, virtualization, monitoring, and security technologies, and can deploy multi-tier services on the distributed infrastructure as virtual machines according to allocation policies.
OpenNebula is divided into three layers: the interface layer, the core layer, and the driver layer.
(1) The interface layer provides native XML-RPC interfaces and implements various APIs such as EC2, OCCI (Open Cloud Computing Interface), and the OpenNebula Cloud API (OCA), giving users a variety of access options.
(2) The core layer provides unified plug-in management, request management, VM life-cycle management, hypervisor management, network resource management, storage resource management, and other core functions.
(3) At the bottom, the driver layer, consisting of various drivers, interacts with the virtualization software (KVM, Xen) and with the physical infrastructure.

OpenStack is an open source cloud computing infrastructure with which users can build and run their own cloud computing and storage infrastructure. The cloud services provided by OpenStack can be consumed through APIs compatible with Amazon EC2/S3, so client tools written for Amazon Web Services (AWS) can also be used with OpenStack. OpenStack has gone furthest in SOA and the decoupling of service-oriented components. Its overall structure is likewise divided into three layers: the first layer is the access layer, comprising applications, the management portal (Horizon), and the APIs; the core layer comprises the computing service (Nova), the storage services (the object storage service Swift and the block storage service Cinder), and the network service (Quantum); the third layer consists of shared services, currently the identity and access management service (Keystone) and the image service (Glance).

Nimbus is an open source system that provides interfaces compatible with Amazon EC2 and can create virtual machine clusters quickly and easily, so that a cluster scheduling system can schedule tasks on them just as on an ordinary cluster. Nimbus also supports different virtualization implementations (Xen and KVM) and is mainly used in scientific computing.

4.2 Data acquisition

A sufficient scale of data is the basis of an enterprise's big data strategy, so data acquisition is the first step of big data analysis. Data acquisition is an important part of mining the value of big data, and all subsequent analysis and mining build on it. The significance of big data lies not in grasping the sheer scale of the data, but in processing these data intelligently, analyzing and mining valuable information from them; the premise, however, is having a lot of data. Most enterprises find it difficult to judge which data will become data assets in the future and by what means those data can be refined into real revenue; even big data service providers can hardly give a definitive answer. But one thing is certain: in the big data era, whoever masters enough data is likely to master the future, and the big data acquired now is the asset accumulated for the future.

Data can be acquired from the sensors of the Internet of Things, and also from network information. In intelligent transportation, for example, there are several kinds of acquisition: information collection based on GPS positioning, image collection at traffic intersections, inductive-loop signal collection at crossings, and so on.
Data acquisition on the Internet, on the other hand, collects page information and user access information from various online media such as search engines, news sites, forums, microblogs, blogs, and e-commerce sites; the main contents are text, URLs, access logs, dates, and pictures. The collected data then need pre-processing such as cleaning, filtering, and de-duplication, followed by categorized and summarized storage.

ETL tools are responsible for extracting data of different types and structures, such as text data and relational data as well as pictures, video, and other unstructured data, from distributed, heterogeneous data sources into a temporary middle layer where they are cleaned, converted, classified, and integrated, and finally for loading them into the corresponding data storage systems such as data warehouses or data marts, where they become the basis for online analytical processing. ETL for big data differs from the traditional ETL process: on the one hand the volume of data is huge, and on the other hand the data are produced very fast; for example, a video camera in a city or a smart meter generates large amounts of data every second, so pre-processing has to be real-time and fast. In choosing ETL architectures and tools, we therefore also adopt modern technologies such as distributed in-memory databases and real-time stream processing systems.

Modern enterprises run many applications with many data formats and storage requirements, yet between enterprises, and even within an enterprise, there are problems of fragmentation and information silos: controlled data exchange and sharing cannot be achieved, and differences in development technologies and application environments erect barriers to data sharing between applications, hindering data exchange as well as the enterprise's needs for data control, data management, and data security. To achieve cross-industry, cross-department data integration, especially in smart city construction, we need to develop unified data standards, exchange interfaces, and sharing protocols, so that data from different industries and departments, in different formats, can be accessed, exchanged, and shared on a unified basis. Through an enterprise data bus (EDS), we can provide access functions for all kinds of data and separate the enterprise's data access integration from its functional integration. The enterprise data bus creates an abstraction layer for data access, shielding business functions from the details of data access: a business component then only needs to contain service function components (implementing the business services) and data access components (using the enterprise data bus). Using the enterprise data bus, we provide a unified data conversion interface between the enterprise's data management models and the data models of the application systems, and effectively reduce the coupling among application services. In a big data scenario, the enterprise data bus receives a large number of concurrent data access requests, and performance degradation of any module on the bus greatly affects the whole bus, so the enterprise data bus needs a large-scale, concurrent, and highly scalable implementation.
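To make the cleaning and conversion step of ETL concrete, the following self-contained Java sketch filters malformed sensor records and normalizes timestamps before "loading" them; it is not tied to any particular ETL product, and the file names and record format (epochMillis,deviceId,value) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SimpleEtl {
    public static void main(String[] args) throws IOException {
        // Hypothetical raw input: lines of "epochMillis,deviceId,value"
        try (Stream<String> lines = Files.lines(Paths.get("raw_sensor.csv"))) {
            List<String> cleaned = lines
                .map(String::trim)
                .filter(line -> !line.isEmpty())                   // drop blank lines
                .map(line -> line.split(","))
                .filter(f -> f.length == 3)                        // drop malformed records
                .filter(f -> f[0].matches("\\d+"))                 // timestamp must be numeric
                .filter(f -> f[1].matches("[A-Za-z0-9_-]+"))       // basic device-id check
                .filter(f -> f[2].matches("-?\\d+(\\.\\d+)?"))     // numeric measurement only
                .map(f -> Instant.ofEpochMilli(Long.parseLong(f[0]))  // convert to ISO-8601
                        + "," + f[1] + "," + f[2])
                .collect(Collectors.toList());
            // "Load": written to a local file here; a real pipeline would load the
            // records into a data warehouse, HDFS, or a distributed database.
            Files.write(Paths.get("clean_sensor.csv"), cleaned);
        }
    }
}
```

In a production pipeline the same filter-and-convert logic would run in a distributed or streaming engine rather than on a single machine, as noted above.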
4.3 Data Storage

The amount of data increases rapidly every year and adds to the existing accumulation of historical data, bringing great opportunities and challenges to the data storage and processing industry. To meet this rapidly growing storage demand, cloud storage must offer high scalability, high reliability, high availability, low cost, automatic fault tolerance, decentralization, and other characteristics. Common forms of cloud storage can be divided into distributed file systems and distributed databases: distributed file systems use large numbers of distributed storage nodes to store massive numbers of files, while distributed NoSQL databases support the processing and analysis of massive unstructured data.

When Google faced the problem of storing and analyzing massive numbers of web pages early on, it pioneered the Google File System (GFS) [13] and the MapReduce distributed computing and analysis model [15, 18, 31] built on GFS. Because some applications need to handle large amounts of formatted and semi-formatted data, Google also built BigTable [14], a large-scale database system with weak consistency requirements that can index, query, and analyze massive data. This family of Google products opened the door to massive data storage, query, and processing in the cloud computing era and became the de facto standard in the field, with Google remaining the technical leader. Google's technology is not open source, so Yahoo and the open source community collaboratively developed the Hadoop system, an open source implementation of MapReduce and GFS. The design principles of its underlying file system, HDFS, are completely consistent with those of GFS, and Hadoop also includes an open source implementation of BigTable: the distributed database HBase. Since their launch, Hadoop and HBase have been widely applied all over the world; they are now managed by the Apache Foundation, and Yahoo's own search system runs on Hadoop clusters of tens of thousands of machines.

Google designed GFS for the harsh environment of running a distributed file system on a large-scale data cluster:
(1) node failures are frequent, so fault tolerance and automatic recovery must be built into the system;
(2) the file system parameters are unconventional: files are usually gigabytes in size, and the system must also hold a large number of small files;
(3) the characteristics of the applications are taken into account, supporting file append operations and optimizing sequential read and write speeds;
(4) some specific file system operations are no longer transparent and need the cooperation of application programs.

Fig 1 Architecture of the Google File System

Figure 1 depicts the architecture of the Google File System: a GFS cluster contains one primary server (GFS Master) and several chunk servers (GFS chunkservers), and is accessed by multiple clients (GFS Clients). Large files are split into chunks of fixed size; chunk servers store the chunks on local hard drives as ordinary Linux files and read or write chunk data according to the specified chunk handle and byte range. To guarantee reliability, each chunk has three replicas by default. The primary server manages all of the file system's metadata, including namespaces, access control, the mapping of files to chunks, and the physical locations of chunks. Through the joint design of server and client, GFS provides applications with optimal performance and availability support.
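GFS itself is proprietary, but its open source counterpart HDFS, mentioned above, follows the same master/chunk-server design and exposes it through a simple client API. The hedged sketch below (the NameNode address is a placeholder for what would normally come from core-site.xml) writes and reads one file: the client obtains metadata from the NameNode, while block data flow directly between the client and the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/events/sample.txt");
        // Write: metadata goes through the NameNode, block bytes stream to DataNodes
        // (three replicas by default, mirroring GFS).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("one sample record");
        }
        // Read: block locations come from the NameNode, bytes from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```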
GFS was designed for Google's own applications, and there are many GFS cluster deployments inside Google. Some clusters have more than a thousand storage nodes and over a PB of storage space, and are accessed continuously and frequently by thousands of clients on different machines.

To deal with massive data challenges, some commercial database systems attempt to combine traditional RDBMS technology with distributed, parallel computing, and many of them accelerate data processing at the hardware level. Typical systems include IBM's Netezza, Oracle's Exadata, EMC's Greenplum, HP's Vertica, and Teradata. Functionally, these systems can continue to support the operational semantics and analysis patterns of traditional databases and data warehouses; in terms of scalability, they can use massive cluster resources to process data in parallel, dramatically reducing the time needed for loading, indexing, and query processing. Exadata and Netezza adopt the data warehouse appliance approach: software and hardware are combined, with the database management system, servers, storage, and network seamlessly integrated. For users, a single appliance can be installed quickly and easily and used through standard interfaces and simple operations, but these appliance solutions also have many shortcomings, including expensive hardware, high energy consumption, expensive system service fees, and the need to purchase a whole new system when upgrading. The biggest problem of Oracle's Exadata is its shared-everything architecture, which limits its I/O processing capacity and scalability. The storage layer nodes in Exadata cannot communicate with each other, so any intermediate computation result has to be delivered from the storage layer to the RAC nodes, then sent by the RAC nodes to the corresponding storage layer node and computed there; the large amount of data movement causes unnecessary I/O and network consumption. Exadata's query performance is also unstable, and tuning it requires experience and in-depth knowledge.

NoSQL databases, as the name suggests, break the paradigm constraints of traditional relational databases. From the storage perspective, many NoSQL databases are not relational databases but key-value (hash) stores. By abandoning the powerful SQL query language, transactional consistency, and the normal-form constraints of relational databases, NoSQL databases can to a great extent overcome the challenges faced by traditional relational databases. They are designed for highly concurrent reads and writes and for massive data storage, and compared with relational databases they have great advantages in scalability, concurrency, and fault tolerance. The mainstream NoSQL databases include Google's BigTable, HBase (an open source system modeled on BigTable), Facebook's Cassandra, and others.

Because some Google applications need to process large amounts of formatted and semi-formatted data, Google built BigTable, a large-scale database system with weak consistency requirements. BigTable's applications include search logs, Maps, the Orkut online community, the RSS reader, and so on.

Fig 2 Data model in BigTable

Figure 2 describes the data model used by applications in BigTable. The data model includes rows, columns, and corresponding timestamps, and all data are stored in the table cells.
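BigTable is not publicly available, but HBase, its open source counterpart mentioned above, exposes essentially the same data model. The sketch below, which assumes a table named "webtable" with a "contents" column family already exists (a hypothetical example echoing the BigTable paper), writes and reads a single cell addressed by row key, column family, and qualifier; the timestamp is assigned automatically.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {
            byte[] row = Bytes.toBytes("com.example/index.html");
            byte[] family = Bytes.toBytes("contents");
            byte[] qualifier = Bytes.toBytes("html");

            // Write one cell; HBase stamps it with the current time and keeps
            // several timestamped versions per cell, as in BigTable.
            Put put = new Put(row);
            put.addColumn(family, qualifier, Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the latest version of the same cell back.
            Result result = table.get(new Get(row));
            System.out.println(Bytes.toString(result.getValue(family, qualifier)));
        }
    }
}
```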
BigTable content is partitioned by rows: several consecutive rows are grouped into a small table, called a tablet, which is stored on a single server node. Similar to the aforementioned systems, BigTable is also a joint design of client and server, so that performance can best meet the needs of the applications. The BigTable system relies on the underlying cluster infrastructure: a distributed cluster task scheduler, the Google File System, and a distributed lock service called Chubby. Chubby is a very robust coarse-grained lock service, which BigTable uses to store the pointer to the root data; a user first obtains the root server's location from the Chubby lock, and then accesses the data. BigTable uses one server as the primary server to store and manipulate metadata. Besides metadata management, the primary server is also responsible for remote management and load deployment of the tablet servers (the data servers in the general sense). Clients use the programming interfaces for metadata communication with the primary server and for data communication with the tablet servers.

Among large-scale distributed databases, mainstream NoSQL databases such as HBase and Cassandra mainly provide high scalability support and make some sacrifices of consistency and availability; they lack the ACID semantics and transaction support of traditional RDBMSs. Google Megastore [32], however, strives to integrate NoSQL with the traditional relational database, providing strong guarantees of consistency and high availability. Megastore uses synchronous replication to achieve high availability and a consistent view of the data. In short, Megastore provides fully serializable ACID semantics over low-latency data replicas in different regions to support interactive online services. Megastore combines the advantages of NoSQL and RDBMS: it achieves high scalability, high fault tolerance and low latency while preserving consistency, and it provides services for hundreds of Google production applications.

4.4 Data computing

The needs of big data processing, such as data queries, statistics, analysis and mining, have motivated different computing models for big data. We divide big data computing into three parts: offline batch computing, real-time interactive computing and streaming computing.

4.4.1 Offline batch

With the wide application and development of cloud computing techniques, Hadoop distributed storage systems and open-source MapReduce data processing and analysis systems have also been widely used. Hadoop can support PB-level distributed data storage through data partitioning and self-recovery mechanisms, and can analyze and process these data with the MapReduce distributed processing model. The MapReduce programming model allows many general data batch processing tasks and operations to run in parallel on a large-scale cluster, with automated failover. Led by open source software such as Hadoop, the MapReduce programming model has been widely adopted and is applied to web search, fraud detection and various other practical applications. Hadoop is a software framework for distributed processing of large amounts of data in a reliable, efficient and scalable way; it relies on horizontal expansion, improving computing and storage capacity by adding low-cost commodity servers.
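As a concrete illustration of the MapReduce programming model described above, the sketch below shows the classic word-count job written against Hadoop's standard Java MapReduce API (org.apache.hadoop.mapreduce). The job name and the input and output paths (taken from the command line) are placeholders; the structure is the canonical map, shuffle and reduce pattern.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Map tasks run in parallel over the input splits stored in HDFS, the framework shuffles and sorts the intermediate (word, count) pairs, and reduce tasks aggregate them; failed tasks are simply re-executed, which is the automated failover property mentioned above.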
Users can easily develop and run applications that deal with massive amounts of data on Hadoop. We conclude that Hadoop has the following advantages:
(1) High reliability: its ability to store and process data bit by bit is worthy of trust;
(2) High scalability: it allocates data and computing tasks across the available computer clusters, and these clusters can easily be expanded to thousands of nodes;
(3) High efficiency: it can move data dynamically between nodes and ensure the dynamic balance of each node, so the processing speed is very fast;
(4) High fault tolerance: it automatically keeps multiple copies of the data and automatically reassigns failed tasks.

Fig 3 The Hadoop ecosystem

The big data processing platform technologies [61] represented by the Hadoop platform include MapReduce, HDFS, HBase, Hive, Zookeeper, Avro [48], Pig and others, which together form the Hadoop ecosystem shown in Figure 3.
(1) The MapReduce programming model is the heart of Hadoop and is used for parallel computation over massive data sets; it is this programming model that achieves massive scalability across the hundreds or thousands of servers of a Hadoop cluster;
(2) The distributed file system HDFS provides mass data storage for the Hadoop processing platform; the NameNode provides metadata services and the DataNodes store the file blocks;
(3) HBase, built on HDFS, provides a database system with high reliability, high performance, column-oriented storage, scalability and real-time reads and writes, and can store unstructured and semi-structured loose data;
(4) Hive [17] is a big data warehouse based on Hadoop that can be used for data extraction, transformation and loading (ETL), and for storing, querying and analyzing large-scale data stored in Hadoop;
(5) Pig [21] is a large-scale data analysis platform based on Hadoop; it transforms SQL-like data analysis requests into a series of optimized MapReduce operations and provides a simple operation and programming interface for complex massive-data parallel computing;
(6) Zookeeper [19] is an efficient and reliable coordination system used to coordinate the various services of distributed applications; with Zookeeper we can build coordination services that prevent single points of failure and handle load balancing effectively;
(7) Avro is a binary, high-performance middleware that provides data serialization capabilities and RPC services across the Hadoop platform.

The Hadoop platform is mainly aimed at offline batch applications; the typical application schedules batch tasks that operate on static data. The computing process is relatively slow, and some queries may take hours or even longer to return results, so Hadoop is powerless in the face of applications and services with high real-time requirements. MapReduce is a good cluster parallel programming model and can meet the needs of most applications; however, although MapReduce is a good abstraction of distributed/parallel computing, it is not necessarily suitable for every computing problem. For example, for applications that require results in real time, such as pay-per-click advertisement placement based on traffic, social recommendations based on real-time analysis of user behavior, and anti-cheating statistics based on web search and clickstreams, MapReduce cannot provide effective treatment, because the processing of such application logic requires multiple passes or splitting the input data into a very fine granularity.
The MapReduce model has the following limitations:
(1) The intermediate data transfer is difficult to optimize fully;
(2) Restarting individual tasks is costly;
(3) Storing intermediate data incurs a large overhead;
(4) The master node can easily become a bottleneck;
(5) Only a unified file fragment size is supported, so it is difficult to deal with complex collections of documents of various sizes;
(6) It is difficult to store and access structured data directly.

In addition to the MapReduce computing model, workflow computing models represented by Swift [38, 39] and graph computing models represented by Pregel [20] can handle application processes and graph algorithms that contain large-scale computing tasks. As a bridge between scientific workflows and parallel computing, the Swift system is a parallel programming tool for the fast and reliable definition, execution and management of large-scale science and engineering workflows. Swift uses a structured approach to manage workflow definition, scheduling and execution, which includes a simple scripting language, SwiftScript. SwiftScript can concisely describe complex parallel computations [40] based on data set types and iteration, and it can dynamically map data sets for large-scale data in different formats. At run time, the system provides an efficient workflow engine for scheduling and load balancing, and it can interact with resource management systems such as PBS and Condor to finish the tasks. Pregel is a distributed programming framework for graph algorithms that can be used for graph traversal, shortest paths and PageRank computation. It adopts an iterative computing model: in each round, every vertex processes the messages received in the previous round, sends messages to other vertices, and updates its status and topology (outgoing and incoming edges), and so on.

4.4.2 Real-time interactive computing

Nowadays, real-time computing generally targets massive data. In addition to meeting some of the requirements of non-real-time computing (e.g., accurate results), the most important requirement of real-time computing is to return computed results in real time, generally at the millisecond level. Real-time computing can generally be categorized into the following two application scenarios:
(1) The amount of data is huge and the results cannot be computed in advance, while the response time for users has to be real-time. This is mainly used for data analysis and processing in certain settings. When the amount of data is large and listing all possible combinations of query conditions is impossible, or exhaustively enumerating the combinations is useless, real-time computing can play a role: it postpones the computing process to the query phase, but must provide users with a real-time response. In this case, part of the data can be processed in advance and combined with real-time computing results to improve processing efficiency.
(2) The data source is real-time and uninterrupted, and the user response time is required to be real-time. A real-time data source means, in other words, streaming data. So-called streaming data means viewing the data as a data stream to be processed. A data stream is an aggregation of a series of data records that are unlimited in temporal distribution and number; data records are the smallest units of a data stream. For example, the data generated by the sensors of the Internet of Things may be continuous. We will introduce stream processing systems separately in the next section.
Real-time data computing and analysis can analyze and count data dynamically and in real time, which has important practical significance for monitoring system state and for scheduling management. The real-time computing process for massive data can be divided into the following three phases: data generation and collection, data analysis and processing, and the provision of services, as shown in Figure 4.

Fig 4 Process of real-time calculation

Real-time data acquisition: functionally, it must ensure that all data are collected completely and provide real-time data for real-time applications; in terms of response time, it must ensure real-time behavior and low latency; configuration should be simple and deployment easy; and the system must be stable and reliable. Currently, Internet companies' massive data acquisition tools include Facebook's open-source Scribe [50], LinkedIn's open-source Kafka [34], Cloudera's open-source Flume [35], Taobao's open-source TimeTunnel [36] and Hadoop's Chukwa [37], all of which can meet log data acquisition and transmission requirements of hundreds of MB per second.

Real-time data computing: in traditional data manipulation, data are first collected and stored in a database management system (DBMS), and users then interact with the DBMS through queries to get the answers they want. In the whole process the users are active, while the DBMS is passive. However, there are now large amounts of real-time data with strong timeliness, huge volume and diverse formats, for which the traditional relational database schema is not appropriate. New real-time computing architectures generally adopt the distributed architecture of massively parallel processing (MPP): data storage and processing are assigned to large numbers of nodes to meet the real-time requirements, and for data storage they use large-scale distributed file systems, such as Hadoop's HDFS, or new NoSQL distributed databases.

Real-time query services can be implemented in three ways: 1) full memory: provide data read services directly from memory, and dump to disks or databases regularly for persistence; 2) semi-memory: use Redis, Memcache, MongoDB, BerkeleyDB and other databases to provide real-time polling services, with persistence carried out by these systems; 3) full disk: use NoSQL databases built on a distributed file system (HDFS), such as HBase; for a key-value engine, the key point is to design the distribution of the keys.

Among real-time and interactive computing technologies, Google's Dremel [40] system is the most prominent. Dremel is Google's "interactive" data analysis system. It can be built on clusters of thousands of nodes and process PB-level data. As the originator of MapReduce, Google developed the Dremel system to shorten processing time to the second level, as a strong complement to MapReduce. As the report engine of Google BigQuery, Dremel has achieved great success. Like MapReduce, Dremel needs to run together with the data and move the computation to the data; it requires a file system such as GFS as its storage layer. Dremel supports a nested data model, similar to JSON. The traditional relational model inevitably involves large numbers of join operations, so it is often powerless when dealing with data at such a scale. Dremel also uses column storage, so it scans only the part of the data that is needed, reducing CPU and disk accesses; meanwhile, column storage is compression friendly, and using compression reduces the amount of storage and maximizes performance.
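The following small sketch illustrates that last point by contrasting a row-oriented scan with a column-oriented scan. It is a schematic illustration of columnar layouts in general, not of Dremel's implementation; the PageView record and its fields are invented for the example.

```java
// Schematic contrast between a row-oriented scan and a column-oriented scan,
// showing why column storage lets a query touch only the columns it needs.
public class ColumnScanDemo {

    /** Row layout: every record carries every field. */
    record PageView(String url, String country, long durationMs) {}

    /** Row store: summing one field still walks through whole records. */
    static long totalDurationRowStore(PageView[] rowStore) {
        long total = 0;
        for (PageView r : rowStore) total += r.durationMs();
        return total;
    }

    /** Column store: the duration column is one contiguous array, scanned alone. */
    static long totalDurationColumnStore(long[] durationColumn) {
        long total = 0;
        for (long d : durationColumn) total += d;
        return total;
    }

    public static void main(String[] args) {
        PageView[] rowStore = {
            new PageView("/a", "US", 120),
            new PageView("/b", "CN", 300)
        };
        long[] durationColumn = {120, 300};
        // Both scans produce the same answer; the columnar one reads far fewer bytes.
        System.out.println(totalDurationRowStore(rowStore));
        System.out.println(totalDurationColumnStore(durationColumn));
    }
}
```

Because the columnar scan touches a single contiguous array, it reads far fewer bytes for the same aggregate, and a homogeneous column of values is also much easier to compress than interleaved records.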
Spark [41] is a real-time data analysis system developed by the AMP Lab at the University of California, Berkeley. It adopts an open-source cluster computing environment similar to Hadoop, but Spark is superior in the design and performance of task scheduling and workload optimization. Spark uses in-memory distributed data sets; in addition to providing interactive queries, it can also optimize iterative workloads [46]. Spark is implemented in Scala and is tightly integrated with it, so Scala can operate on distributed data sets as easily as on local collection objects. Spark supports iterative operations on distributed data sets and is an effective complement to Hadoop, supporting fast statistical analysis of data. It can also run concurrently on the Hadoop file system, supported by a third-party cluster framework named Mesos. Spark can be used to build large-scale, low-latency data analysis applications.

Impala [42], recently released by Cloudera and similar to Google's Dremel system, is an effective tool for real-time queries over big data. Impala offers fast, interactive SQL queries on HDFS or HBase; besides using a unified storage platform, it also uses the same metastore and SQL syntax as Hive, providing a unified platform for batch and real-time queries.

4.4.3 Streaming computing

In many real-time application scenarios, such as real-time trading systems, real-time fraud analysis, real-time ad push [23], real-time monitoring and real-time analysis of social networks, large amounts of data already exist, the real-time requirements are high, and the data source is continuous. New data must be processed immediately, or subsequent data will pile up and the processing will never end. We often need a sub-second or even sub-millisecond response time, which requires a highly scalable streaming computing solution. Stream computing [24][26] targets real-time, continuous data: it analyzes the changing stream data in real time while they are in motion, captures information that may be useful to the users, and sends the results out. In this process the data analysis and processing system is active, while the users are in a passive state of reception, as shown in Figure 5.

Fig 5 Process of streaming computing

Traditional streaming computing systems are generally based on an event mechanism, and the amount of data they process is small. New stream processing techniques, such as Yahoo's S4 [22][26], mainly solve streaming processing problems with high data rates and large amounts of data. S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform on which developers can easily develop applications for unbounded, uninterrupted streaming data processing. Data events are routed to processing elements (PEs), which consume the events and handle them as follows: (1) send out one or more events that may be processed by other PEs; (2) publish results. S4's design is primarily driven by large-scale data acquisition and machine learning in production environments. Its main features are:
(1) It provides a simple programming interface for handling data streams;
(2) It is designed as a high-availability cluster that is scalable on commodity hardware;
(3) It uses the local memory of every processing node to avoid disk I/O bottlenecks and minimize latency;
(4) It uses a decentralized, peer-to-peer architecture: all nodes provide the same functions and responsibilities, and there is no central node with special responsibility, which greatly simplifies deployment and maintenance;
(5) It uses a pluggable architecture to keep the design as general and customizable as possible;
(6) It follows a friendly design philosophy, is easy to program, and is flexible.
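Before comparing S4 with other systems, the minimal sketch below illustrates the processing-element idea just described: the same word-count computation as in the batch example, but expressed as a PE that updates a running count per key as events arrive. The Event and WordCountPE types are hypothetical, not S4's real API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of a stream processing element (PE): it consumes keyed events,
// updates local in-memory state, and emits derived events downstream.
public class WordCountPE {

    /** A keyed event flowing through the stream, e.g. (word, 1). */
    record Event(String key, long value) {}

    private final Map<String, Long> counts = new HashMap<>(); // local state only, no disk I/O
    private final Consumer<Event> downstream;                  // where emitted events are routed

    public WordCountPE(Consumer<Event> downstream) {
        this.downstream = downstream;
    }

    /** Called once per incoming event; never blocks on external storage. */
    public void processEvent(Event e) {
        long updated = counts.merge(e.key(), e.value(), Long::sum);
        // Emit a derived event so another PE (or a result publisher) can act on it.
        downstream.accept(new Event(e.key(), updated));
    }
}
```

In a platform like S4 the framework, not the application, routes each keyed event to the PE instance responsible for that key, so each node keeps its state local and the cluster stays peer-to-peer.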
S4's design shares many characteristics with IBM's stream processing core middleware SPC [53]. Both systems are designed for large amounts of data, and both can use user-defined operations to collect information from continuous data streams. The main difference lies in the architectural design: SPC's design derives from the publish/subscribe model, whereas S4's design comes from a combination of MapReduce and the Actor model. Yahoo! believes that, because of its symmetric structure, S4's design achieves a very high degree of simplicity: all nodes in the cluster are identical and there is no central control. SPC is a distributed stream processing middleware that supports applications extracting information from large-scale data streams. SPC contains programming models and development environments for building distributed, dynamic, scalable applications; its programming model includes an API for declaring and creating processing elements (PEs), as well as a toolset for assembling, testing, debugging and deploying applications. Unlike other stream processing middleware, in addition to relational operators it also supports non-relational operators and user-defined functions.

Storm [43] is a real-time data processing framework open-sourced by Twitter and similar in spirit to Hadoop. Streaming computing solutions of this kind, with high scalability and the ability to process high-frequency, large-scale data, are applied to real-time search, high-frequency trading and social networks. Storm has three scopes of application:
(1) Stream processing: Storm can be used to process new data and update databases in real time, with both fault tolerance and scalability;
(2) Continuous computation: Storm can carry out continuous queries and feed the results back to clients, for example sending hot Twitter topics to clients;
(3) Distributed RPC: Storm can be used to process intensive queries concurrently; Storm's topology is a distributed function waiting for invocation messages, and when it receives one, it computes the query and returns the results.

4.5 Data presentation and interaction

The results need to be shown in a simple and intuitive way so that end users can understand and use them, forming effective statistics, analyses, predictions and decisions that can be applied to production practice and business operations. Therefore the presentation of big data, as well as the technology for interacting with the data, also occupies an important position in the whole of big data. Excel spreadsheets and charts are the presentation forms that people have known and used for a long time, and they provide great convenience for simple everyday data applications. Many Wall Street traders still rely on Excel and years of accumulated formulas to carry out large stock trades, and Microsoft and a number of entrepreneurs, having seen the market potential, are developing big data processing platforms that use Excel for display and interaction on top of Hadoop and other technologies. The human brain understands and processes graphics much faster than text; therefore, presenting data visually can profoundly reveal potential or complex patterns and relationships.
With the rise of big data, many new forms of data presentation and interaction, and start-up companies focusing on this area, have emerged. These new methods include interactive charts, which can be rendered on web pages, support interaction, allow the charts to be manipulated and controlled, and offer animations and presentations. Besides, interactive map applications such as Google Maps can mark points dynamically, generate routes, superimpose panoramic and aerial maps, etc.; because its open API can be combined with many user maps and location-based service applications, it has been applied extensively. Google Chart Tools also offers a variety of flexible approaches to website data visualization: from simple line charts, geocharts and gauges to complex tree maps, Google Chart Tools provides a large number of well-designed charting tools.

Tableau [44], a big data start-up born at Stanford, is becoming one of the outstanding data analysis tools. Tableau joins data computation and aesthetic charts together perfectly, as shown in Figure 6. Companies can use it to drag and drop large amounts of data onto a digital "canvas" and create a variety of charts in a short time. Tableau's design and implementation philosophy is that the more easily users can manipulate the data on the interface, the more thoroughly companies can understand whether what they have done in their field is right or wrong. Fast processing and easy sharing is another feature of Tableau: in only a few seconds, Tableau Server can publish an interactive dashboard on the Internet, and users only need a browser to filter and select data easily and get answers to their questions, which increases their enthusiasm for using data.

Another big data visualization start-up, Visual.ly [45], is known for its abundant infographics resources; it is a platform for the social creation and sharing of infographics. We live in an era of data acquisition and content creation, and Visual.ly is a product of the data age, a new visual infographics platform. Many users are willing to upload infographics to the website and share them with others. Infographics greatly stimulate visual expression and promote mutual learning and discussion between users, and the platform offers visualization services for exploration, sharing and promotion. It is not complicated to use Visual.ly to make infographics: it is an automated tool that makes the insertion of different types of data quick and easy and expresses the data graphically.

Fig 6 Visualization examples of Tableau

In addition, 3D digital rendering technology has been widely applied in many fields, such as digital cities, digital parks, modeling and simulation, and design and manufacturing, with highly intuitive operability. Modern augmented reality (AR) technology applies virtual information to the real world through computer technology, so that real environments and virtual objects are superimposed in the same picture or space and exist at the same time. Combining virtual 3D digital models with real-life scenes, it provides a better sense of presence and interaction. Through AR technology, users can interact with virtual objects, for example trying on virtual glasses or virtual clothes, or driving simulated aircraft. In Germany, when engineering and technical personnel conduct mechanical installation, maintenance or tuning, the internal structure of the machine and its associated information, which could not be presented before, can be fully presented through a helmet-mounted display.
Modern motion sensing technologies, such as Microsoft's Kinect and Leap's Leap Motion somatosensory controller, are capable of detecting and perceiving body movements and gestures and converting those actions into controls for computers and systems. They free people from the constraints of the keyboard, mouse, remote control and other traditional interactive devices, and let us interact with computers and data directly with our bodies and gestures. Today's hottest wearable technologies, such as Google Glass, combine big data technology, augmented reality and somatosensory technology organically. With the improvement of data and technologies, we can perceive the reality around us in real time: through big data search and computation, we can identify and capture data about the surrounding buildings, businesses, people and objects in real time and project them onto our retina, which can help us work, shop and relax, providing great convenience. Of course, the drawbacks of such new equipment and technology are also obvious: we are in a state of being monitored at any time and are subject to privacy spying and violations, so the security issues brought by big data techniques cannot be ignored.

5 Related researches and our works

The scale effect of big data brings great challenges to data storage management and data analysis, and a change in data management methods is brewing and occurring. Meng Xiaofeng and other scholars have analyzed the basic concept of big data, briefly compared it with the major applications of big data, explained and analyzed the basic framework of big data processing and the effect of cloud computing technology on data management in the big data era, and summarized the new challenges we face in the era of big data [49]. Tao Xuejiao et al. [51] described and analyzed the related concepts and features of big data, the domestic and overseas development of big data techniques, especially in data mining, and the challenges we face in the era of big data. Meanwhile, some scholars have pointed out that, to meet the real-time and validity needs of data processing, we need a technological change that starts from the characteristics of big data and moves beyond conventional data processing techniques, forming techniques for big data collection, storage, management, processing, analysis, sharing and visualization [52]. The review papers above pay more attention to analyzing big data characteristics and development trends, and are inadequate on the problems big data faces and on a classified introduction and summary of the techniques. Compared with traditional data warehousing applications, big data analysis is characterized by large volumes of data and complex query and analysis. From the perspective of big data analysis and data warehouse architecture design, the literature [33] first lists several important features that a big data analysis platform needs to have, then analyzes and summarizes the current mainstream implementation platforms, namely parallel databases, MapReduce and hybrid architectures of the two, and points out their strengths and weaknesses. HadoopDB [59][60] is an attempt at combining the two architectures.
Some scholars start from the competitive and symbiotic relationship between RDBMSs and MapReduce, analyze the challenges each encounters in its development, and point out that relational and non-relational data management technologies complement each other in constant competition, each finding its proper position in the new big data analysis ecosystem [55][58]. In the study of NoSQL systems, scholars such as Shen Derong [56] have systematically summarized the related research on NoSQL systems, including architecture, data models, access methods, index techniques, transaction characteristics, system elasticity, dynamic load balancing, replication policies, data consistency policies, flash-based multi-level caching mechanisms, MapReduce-based data processing policies, and a new generation of data management systems. The review papers above tend to introduce data storage for massive data and analyze different storage policies and their advantages and disadvantages, but they lack a comprehensive exposition of big data techniques and ignore the synergy among different big data technologies and between big data technology and cloud computing.

Modern science in the 21st century brings tremendous challenges to scientific researchers. The scientific community is facing "data deluge" problems [1] arising from experimental data, simulation data, sensor data and satellite data; data size and the complexity of scientific analysis and processing are growing exponentially. A scientific workflow management system (SWFMS) provides the necessary support for scientific computing, such as data management, task dependencies, job scheduling and execution, and resource tracking. Workflow systems such as Taverna [65], Kepler [63], Vistrails [64], Pegasus [62], Swift [39] and VIEW [66] have a wide range of applications in many fields, such as physics, astronomy, bioinformatics, neuroscience, earth science and social science. Meanwhile, the development of scientific equipment and network computing has challenged the reliability of workflow systems in terms of data size and application complexity. We have combined scientific workflow systems with cloud platforms, offered as a service [67] of cloud computing, to deal with the growing amount of data and analysis complexity. A cloud computing system, with its large-scale data center resource pool and on-demand resource allocation, can provide scientific workflow systems with better services than the environments above, enabling workflow systems to handle PB-level scientific problems.

6 Summary

Big data is a hot frontier of today's information technology development. The rapid development of the Internet of Things, the Internet and mobile communication networks has spawned the big data problem and brought problems in various respects, such as speed, structure, volume, cost, value, security and privacy, and interoperability. Traditional IT processing methods are powerless in the face of the big data problem because of their lack of scalability and efficiency. The big data problem needs cloud computing techniques to be solved, while big data can also promote the real landing and implementation of cloud computing techniques; there is a complementary relationship between the two.
We have focused on infrastructure support, data acquisition, data storage, data computing, data presentation and interaction, and other aspects, to describe the several kinds of techniques covered by big data, to describe the challenges and opportunities of big data techniques from another angle for scholars in related fields, and to provide a reference classification of big data technology. Big data technology keeps growing with the surge of data volume and processing requirements, affecting our life habits and styles.

Acknowledgements: We express our gratitude to the colleagues who have given support and advice for this article, especially the students and teachers of the Limit Network Computing and Service laboratory at the School of Computer Science and Engineering, University of Electronic Science and Technology of China.

References
1. Bell G, Hey T, Szalay A. Beyond the data deluge[J]. Science, 2009, 323(5919): 1297-1298.
2. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: The next frontier for innovation, competition, and productivity. May 2011[J]. McKinsey Global Institute, 2011.
3. Big Data Research and Development Initiative, http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
4. http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
5. Foster I, Zhao Y, Raicu I, Shiyong L. Cloud computing and grid computing 360-degree compared[C]//Grid Computing Environments Workshop, 2008. GCE'08. IEEE, 2008: 1-10.
6. OpenNebula, http://www.opennebula.org.
7. OpenNebula Architecture, http://www.opennebula.org/documentation: archives:rel2.2:architecture.
8. Openstack, http://www.openstack.org.
9. Keahey K, Freeman T. Contextualization: Providing one-click virtual clusters[C]//eScience, 2008. eScience'08. IEEE Fourth International Conference on. IEEE, 2008: 301-308.
10. Barham P, Dragovic B, Fraser K, Hand S, Harris T, Ho A, Neugebauer R, Pratt I, Warfield A. Xen and the art of virtualization[J]. ACM SIGOPS Operating Systems Review, 2003, 37(5): 164-177.
11. KVM (Kernel Based Virtual Machine). http://www.linux-kvm.org/page/Main Page.
12. Nurmi D, Wolski R, Grzegorczyk C, Obertelli G, Soman S, Youseff L, Zagorodnov D. The eucalyptus open-source cloud-computing system[C]//Cluster Computing and the Grid, 2009. CCGRID'09. 9th IEEE/ACM International Symposium on. IEEE, 2009: 124-131.
13. Ghemawat S, Gobioff H, Leung ST. The Google file system. In: Proc. of the 19th ACM Symp. on Operating Systems Principles. New York: ACM Press, 2003. 29-43.
14. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable: A distributed storage system for structured data. In: Proc. of the 7th USENIX Symp. on Operating Systems Design and Implementation. Berkeley: USENIX Association, 2006. 205-218.
15. Zheng QL, Fang M, Wang S, Wang XQ, Wu XW, Wang H. Scientific Parallel Computing Based on MapReduce Model. Micro Electronics & Computer, 2009, 26(8): 13-17 (in Chinese with English abstract).
16. Li GJ, Cheng XQ. Research Status and Scientific Thinking of Big Data[J]. Bulletin of Chinese Academy of Sciences, 2012, 27(6): 647-657 (in Chinese with English abstract).
17. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1626-1629.
18. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
19. Gopalakrishna K, Hu G, Seth P. Communication layer using ZooKeeper. Yahoo! Inc., Tech. Rep., 2009.
20. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.
21. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a not-so-foreign language for data processing[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 1099-1110.
22. Malkin J, Schroedl S, Nair A, Neumeyer L. Tuning Hyperparameters on Live Traffic with S4. In TechPulse 2010: Internal Yahoo! Conference, 2010.
23. Schroedl S, Kesari A, Neumeyer L. Personalized ad placement in web search[C]//Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising (AdKDD), Washington USA. 2010.
24. Stonebraker M, Çetintemel U, Zdonik S. The 8 requirements of real-time stream processing[J]. ACM SIGMOD Record, 2005, 34(4): 42-47.
25. Apache Hadoop. http://hadoop.apache.org/.
26. Neumeyer L, Robbins B, Nair A, Kesari A. S4: Distributed stream computing platform[C]//Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010: 170-177.
27. Khetrapal A, Ganesh V. HBase and Hypertable for large scale distributed storage systems[J]. Dept. of Computer Science, Purdue University, 2006.
28. http://cassandra.apache.org/
29. http://www.mongodb.org/
30. http://mahout.apache.org/
31. Li YL, Dong J. Study and Improvement of MapReduce based on Hadoop. Computer Engineering and Design, 2012, 33(8): 3110-3116 (in Chinese with English abstract).
32. Baker J, Bond C, Corbett JC, Furman JJ, Khorlin A, Larson J, Leon JM, Li YW, Lloyd A, Yushprakh V. Megastore: Providing Scalable, Highly Available Storage for Interactive Services[C]//CIDR. 2011, 11: 223-234.
33. Wang S, Wang HJ, Qin XP, Zhou X. Architecting Big Data: Challenges, Studies and Forecasts. Chinese Journal of Computers, 2011, 34(10): 1741-1752 (in Chinese with English abstract).
34. Kafka. http://kafka.apache.org/
35. Flume. https://github.com/cloudera/flume
36. TimeTunnel. http://code.taobao.org/p/TimeTunnel/src/
37. Rabkin A, Katz R. Chukwa: A system for reliable large-scale log collection[C]//Proceedings of the 24th international conference on Large installation system administration. USENIX Association, 2010: 1-15.
38. Swift Workflow System, http://www.ci.uchicago.edu/Swift/main/.
39. Zhao Y, Hategan M, Clifford B, Foster I, von Laszewski G, Nefedova V, Raicu I, Stef-Praun T, Wilde M. Swift: Fast, reliable, loosely coupled parallel computation[C]//Services, 2007 IEEE Congress on. IEEE, 2007: 199-206.
40. Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, Vassilakis T. Dremel: interactive analysis of web-scale datasets[J]. Proceedings of the VLDB Endowment, 2010, 3(1-2): 330-339.
41. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets[C]//Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. 2010: 10-10.
42. Kornacker M, Erickson J. Cloudera Impala: real-time queries in Apache Hadoop, for real. 2012. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
43. Storm, Distributed and fault-tolerant realtime computation, http://storm-project.net/.
44. Tableau, http://www.tableausoftware.com/
45. Visual.ly, http://visual.ly/
46. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012: 2-2.
47. Gupta R, Gupta H, Mohania M. Cloud Computing and Big Data Analytics: What Is New from Databases Perspective?[M]//Big Data Analytics. Springer Berlin Heidelberg, 2012: 42-61.
48. Avro, http://avro.apache.org/
49. Meng XF, Ci X. Big Data Management: Concept, Techniques and Challenges. Journal of Computer Research and Development, 2013, 50(1): 146-169 (in Chinese with English abstract).
50. Scribe. https://github.com/facebook/scribe
51. Tao XJ, Hu XF, Liu Y. Overview of Big Data Research. Journal of System Simulation, 2013, 25S: 142-146 (in Chinese with English abstract).
52. Yan XF, Zhang DX. Big Data Research. Computer Technology and Development, 2013, 23(4): 168-172 (in Chinese with English abstract).
53. Amini L, Andrade H, Bhagwan R, Eskesen F, King R, Selo P, Park Y, Venkatramani C. SPC: A distributed, scalable platform for data mining[C]//Proceedings of the 4th international workshop on Data mining standards, services and platforms. ACM, 2006: 27-37.
54. Labrinidis A, Jagadish HV. Challenges and opportunities with big data[J]. Proceedings of the VLDB Endowment, 2012, 5(12): 2032-2033.
55. Qin XP, Wang HJ, Du XY, Wang S. Big Data Analysis: Competition and Symbiosis of RDBMS and MapReduce. Journal of Software, 2012, 23(1): 32-45 (in Chinese with English abstract).
56. Shen DR, Yu G, Wang XT, Nie TZ, Kou Y. Survey on NoSQL for Management of Big Data. Journal of Software, 2013, 24(8): 1786-1803 (in Chinese with English abstract).
57. Zikopoulos PC, Eaton C, DeRoos D, Deutsch T, Lapis G. Understanding big data[J]. New York et al: McGraw-Hill, 2012.
58. Qin XP, Wang HJ, Li FR, Li CP, Chen H, Zhou X, Du XY, Wang S. New Landscape of Data Management Technologies. Journal of Software, 2013, 24(2): 175-197 (in Chinese with English abstract).
59. Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. of the VLDB Endowment, 2009, 2(1): 922-933.
60. Abouzied A, Bajda-Pawlikowski K, Huang JW, Abadi DJ, Silberschatz A. HadoopDB in action: Building real world applications. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD 2010. Indianapolis: ACM Press, 2010. [doi: 10.1145/1807167.1807294]
61. Taylor RC. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics[J]. BMC bioinformatics, 2010, 11(Suppl 12): S1.
62. Deelman E, Singh G, Su MH, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A, Jacob JC, Katz DS. Pegasus: A framework for mapping complex scientific workflows onto distributed systems[J]. Scientific Programming, 2005, 13(3): 219-237.
63. Ludäscher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y. Scientific workflow management and the Kepler system[J]. Concurrency and Computation: Practice and Experience, 2006, 18(10): 1039-1065.
64. Freire J, Silva CT, Callahan SP, Santos E, Scheidegger CE, Vo HT. Managing rapidly-evolving scientific workflows[M]//Provenance and Annotation of Data. Springer Berlin Heidelberg, 2006: 10-18.
65. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T. Taverna: a tool for building and running workflows of services[J]. Nucleic acids research, 2006, 34(suppl 2): W729-W732.
66. Lin C, Lu S, Lai Z, Chebotko A, Fei X, Hua J, Fotouhi F. Service-oriented architecture for VIEW: A visual scientific workflow management system[C]//Services Computing, 2008. SCC'08. IEEE International Conference on. IEEE, 2008, 1: 335-342.
67. Zhao Y, Li Y, Tian W, Xue R. Scientific-Workflow-Management-as-a-Service in the Cloud[C]//Cloud and Green Computing (CGC), 2012 Second International Conference on. IEEE, 2012: 97-104.

Chinese references
15. 郑启龙,房明,汪胜,王向前,吴晓伟,王昊.基于 MapReduce 模型的并行科学计算.微电子学与计算机,2009,26(8):13-17.
16. 李国杰,程学旗.大数据研究:未来科技及经济社会发展的重大战略领域——大数据的研究现状与科学思考[J].中国科学院院刊,2012,27(6):647-657.
31. 李玉林,董晶.基于 Hadoop 的 MapReduce 模型的研究与改进.计算机工程与设计,2012,33(8):3110-3116.
33. 王珊,王会举,覃雄派,周烜.架构大数据:挑战,现状与展望[J].计算机学报,2011,34(10):1741-1752.
49. 孟小峰,慈祥.大数据管理:概念,技术与挑战[J].计算机研究与发展,2013,50(1):146-169.
51. 陶雪娇,胡晓峰,刘洋.大数据研究综述[J].系统仿真学报,2013,25S:142-146.
52. 严霄凤,张德馨.大数据研究[J].计算机技术与发展,2013,23(4):168-172.
55. 覃雄派,王会举,杜小勇,王珊.大数据分析——RDBMS 与 MapReduce 的竞争与共生[J].软件学报,2012,23(1):32-45.
56. 申德荣,于戈,王习特,聂铁铮,寇月.支持大数据管理的 NoSQL 系统研究综述[J].软件学报,2013,24(8):1786-1803.
58. 覃雄派,王会举,李芙蓉,李翠平,陈红,周烜,杜小勇,王珊.数据管理技术的新格局[J].软件学报,2013,24(2):175-197.