A Review on Cloud to Handle and Process Big Data
Nishu Arora1, Rajesh Kumar Bawa2
M.Tech Student1, Associate Professor2
Department of Computer Science, Punjabi University Patiala1, 2
Abstract: Cloud computing provides services over the internet by utilizing the resources of a shared computing infrastructure. It allows consumers and businesses to use applications without installation and to access their personal files from any computer with internet access. Bigtable is a distributed storage system for managing structured data at Google. It is designed to reliably scale to petabytes of data and thousands of machines, and it has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Personalized Search, Google Earth, and many more. In this paper, a review is presented to analyze cloud performance on data stored at data centers.
Keywords: Cloud data, data center, MapReduce, distributed file system.
INTRODUCTION
The idea behind the cloud is that users can access services anytime, anywhere over the Internet, typically through a browser. In cloud computing, data is stored in virtualized storage and reached through network services. Because this data travels over shared networks, its security is a primary concern.
a. Bigtable
Bigtable resembles a database in that it shares many implementation strategies with databases. However, Bigtable provides a different interface than parallel databases and main-memory databases. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk [1].
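As a rough illustration of this data model (a minimal sketch under our own assumptions, not Google's implementation or client API), the following fragment keeps values as uninterpreted byte strings addressed by row key, column key, and timestamp, with rows sorted so that range scans exploit locality:

```python
import bisect
import time

class MiniBigtable:
    """Toy sketch of a Bigtable-style data model: a sorted map from
    (row, column, timestamp) to an uninterpreted byte string, with
    row keys kept in lexicographic order."""

    def __init__(self):
        self._rows = []     # sorted row keys, enabling cheap range scans
        self._data = {}     # row -> {column -> [(timestamp, value), ...]}

    def put(self, row: str, column: str, value: bytes) -> None:
        if row not in self._data:
            bisect.insort(self._rows, row)   # keep row keys sorted
            self._data[row] = {}
        versions = self._data[row].setdefault(column, [])
        versions.insert(0, (time.time(), value))   # newest version first

    def get(self, row: str, column: str) -> bytes:
        return self._data[row][column][0][1]       # latest version only

    def scan(self, start: str, end: str):
        """Yield (row, column, latest value) for rows in [start, end)."""
        lo = bisect.bisect_left(self._rows, start)
        hi = bisect.bisect_left(self._rows, end)
        for row in self._rows[lo:hi]:
            for column, versions in self._data[row].items():
                yield row, column, versions[0][1]

# Careful row-key choice controls locality: storing URLs with their
# host names reversed keeps pages of one domain adjacent in a scan.
t = MiniBigtable()
t.put("com.example/about", "contents:", b"<html>about</html>")
t.put("com.example/index", "contents:", b"<html>home</html>")
for row, col, val in t.scan("com.example/", "com.example0"):
    print(row, col, val)
```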
b. Big data
Big data refers to large pools of data that can be captured, communicated, aggregated, stored, and analysed. Big data is not a technology, but rather a phenomenon resulting from the vast amount of raw information generated across society and collected by commercial and government organisations.
Big data generally refers to datasets that are not susceptible to analysis by the relational database tools, statistical analysis tools and visualisation aids that have become familiar over the past twenty years since the start of the rapid increase in digitised sensor data. Instead, it requires 'massively parallel software running on tens, hundreds, or even thousands of servers in some (currently extreme) cases'.
Important characteristics of big data include:
• Volume. The mass of data held by an enterprise.
• Variety. The complexity of multiple data types, including structured and unstructured data.
• Velocity. The speed at which data is disseminated, and also the speed at which it changes or is processed.
• Veracity. The level of reliability associated with certain types of data; normalisation is used to analyse data at vastly different orders of magnitude.
Big data can be broken into two areas:
1. Big Data Transaction Processing (a.k.a. big transactions)
2. Big Data Analytics
Big data transaction processing deals with extreme volumes of transactions that may update data in relational DBMSs or file systems. Typically, relational DBMSs are used, as the so-called ACID properties are often missing in NoSQL DBMSs. This is only a problem if it is unacceptable to lose a transaction, e.g. a banking deposit.
Big data analytics is about segregation and segmentation of data. It is about advanced analytics on traditional structured and multi-structured data. It is a term associated with the new types of workloads and underlying technologies needed to solve business problems that could not previously be supported due to technology limitations, prohibitive cost, or both.
c. Types of big data
The most popular new types of data that organisations want to analyse include:
• Web data - e.g. web logs, e-commerce logs and social network interaction data
• Industry-specific big transaction data - e.g. telco call data records (CDRs), geo-location data and retail transaction data
• Machine-generated/sensor data - to monitor everything from movement, temperature, light, vibration, location, airflow, liquid flow and pressure. RFIDs are another example.
• Text - e.g. from archived documents, external content sources or customer interaction data (including emails for sentiment analysis)
LITERATURE SURVEY
F. Chang et al. [1] suggested that Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. The authors describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and they describe the design and implementation of Bigtable.
A. Kumar et al. [2] highlighted that the cloud has been used as a metaphor for the internet. It is one of the most active application areas for enterprises, which have increasingly accepted it to take advantage of low cost, fast deployment, and elastic scaling. Due to the demand for large-volume data processing in enterprises, huge amounts of data are generated and dispersed on the internet, and there is no guarantee that data stored on the cloud is securely protected. The authors propose a method to build a trusted computing environment by providing a secure platform in a cloud computing system. The proposed method can store data safely and efficiently in the cloud, and it addresses many problems of handling big data and its security issues by applying encryption and compression techniques while uploading data to cloud storage.
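Kumar et al.'s exact algorithm is not reproduced here; purely as an illustration of the compress-then-encrypt idea, the sketch below (the library choices and function names are our assumptions, not from the paper) compresses data with zlib and encrypts it with a symmetric Fernet key before it leaves the client:

```python
import zlib
from cryptography.fernet import Fernet  # pip install cryptography

def prepare_for_upload(plaintext: bytes, key: bytes) -> bytes:
    """Compress first (ciphertext is incompressible), then encrypt."""
    compressed = zlib.compress(plaintext, level=9)
    return Fernet(key).encrypt(compressed)

def restore_after_download(blob: bytes, key: bytes) -> bytes:
    """Reverse the pipeline: decrypt, then decompress."""
    return zlib.decompress(Fernet(key).decrypt(blob))

key = Fernet.generate_key()          # kept client-side, never uploaded
blob = prepare_for_upload(b"big data record " * 1000, key)
assert restore_after_download(blob, key) == b"big data record " * 1000
# `blob` is what would be sent to cloud storage; the provider sees
# only compressed ciphertext, never the plaintext records.
```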
X. Zhang et al. [3] discussed that, with the development of cloud computing and the mobile internet, issues related to big data have drawn attention from both academia and industry. Based on an analysis of existing work, they elaborate the research progress on using distributed file systems to meet the challenge of storing big data, covering four key techniques: storage of small files, load balancing, replica consistency, and deduplication. These four distributed file system technologies were analyzed, new ideas were given, and key issues needing attention in future work were indicated.
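Deduplication, one of the four techniques surveyed, is commonly implemented by content-addressed chunking; the sketch below is a generic illustration of that idea (not an algorithm from the surveyed paper), storing each fixed-size chunk once, keyed by its SHA-256 digest:

```python
import hashlib

CHUNK_SIZE = 4096          # fixed-size chunking; real systems often
                           # use content-defined chunk boundaries

chunk_store: dict[str, bytes] = {}   # digest -> chunk, shared across files

def dedup_write(data: bytes) -> list[str]:
    """Split data into chunks, store each unique chunk once, and
    return the file's 'recipe' (the list of chunk digests)."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # skip if already stored
        recipe.append(digest)
    return recipe

def dedup_read(recipe: list[str]) -> bytes:
    """Reassemble a file from its chunk digests."""
    return b"".join(chunk_store[d] for d in recipe)

a = dedup_write(b"x" * 10_000)          # three chunks, two identical
b = dedup_write(b"x" * 10_000 + b"y")   # shares chunks with the first file
print(len(chunk_store))                 # 3 unique chunks, not 6
assert dedup_read(a) == b"x" * 10_000
```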
Z. Zheng et al. [4] discussed that, with the prevalence of service computing and cloud computing, more and more services are emerging on the Internet, generating huge volumes of data. The overwhelming service-generated data have become too large and complex to be effectively processed by traditional approaches; how to store, manage, and create value from service-oriented big data has become an important research problem. At the same time, with the increasingly large amount of data, a single infrastructure that provides common functionality for managing and analyzing different types of service-generated big data is urgently required. To address this challenge, they provide an overview of service-generated big data and Big Data-as-a-Service. First, three types of service-generated big data are exploited to enhance system performance. Then, Big Data-as-a-Service, including Big Data Infrastructure-as-a-Service, Big Data Platform-as-a-Service, and Big Data Analytics Software-as-a-Service, is employed to provide common big-data-related services to users to enhance efficiency and reduce cost. To provide common functionality for big data management and analysis, Big Data-as-a-Service is investigated as a way to offer APIs through which users access the service-generated big data and the big data analytics results.
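Zheng et al. do not define a concrete API; purely as an illustration of the Big Data-as-a-Service idea, the sketch below wraps a hypothetical REST endpoint (the URL, paths, and parameters are invented for this example) so that a user retrieves analytics results without touching the underlying infrastructure or platform layers:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Big Data Analytics Software-as-a-Service endpoint;
# this URL and its query interface are invented for illustration.
BDAAS_URL = "https://bdaas.example.com/api/v1/analytics"

def query_analytics(dataset: str, metric: str) -> dict:
    """Ask the analytics service for a pre-computed result over a
    service-generated dataset (e.g. QoS logs), hiding the storage
    and processing layers beneath the API."""
    params = urllib.parse.urlencode({"dataset": dataset, "metric": metric})
    with urllib.request.urlopen(f"{BDAAS_URL}?{params}") as resp:
        return json.load(resp)

# Example usage (requires the hypothetical service to exist):
# result = query_analytics("qos-logs-2013", "avg_response_time")
# print(result["value"])
```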
L. Zhang et al. [5] illustrated that cloud computing, rapidly emerging as a new computation paradigm, provides agile and scalable resource access in a utility-like fashion, especially for the processing of big data. In cloud computing, data must be moved efficiently between different locations around the world, and the de facto approach of shipping hard drives is neither flexible nor secure. This work studies the timely, cost-minimizing upload of massive, dynamically generated, geo-dispersed data into the cloud for processing with a MapReduce-like framework, targeting a cloud that encompasses disparate data centers. The authors model a cost-minimizing data migration problem and propose two online algorithms: an online lazy migration (OLM) algorithm and a randomized fixed horizon control (RFHC) algorithm, which optimize at any given time the choice of the data center for data aggregation and processing, as well as the routes for transmitting data there. Careful comparisons between these online algorithms and offline algorithms in realistic settings are conducted through extensive experiments, which demonstrate the close-to-offline-optimum performance of the online algorithms.
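The OLM algorithm defers costly migrations until staying put has become provably wasteful; the following is a minimal sketch of that lazy, rent-or-buy style decision rule (the cost model and threshold factor here are simplified assumptions, not the paper's exact formulation):

```python
def lazy_migration(routing_costs, migration_cost, beta=1.0):
    """Decide online, per time slot, whether to migrate data
    aggregation to a cheaper data center.

    routing_costs: list of (cost_stay, cost_best_alternative) per slot.
    Migrate only once the accumulated extra cost of staying exceeds
    beta * migration_cost, so a one-off price spike never triggers
    an (expensive) move.
    """
    accumulated_overhead = 0.0
    decisions = []
    for cost_stay, cost_alt in routing_costs:
        accumulated_overhead += max(0.0, cost_stay - cost_alt)
        if accumulated_overhead >= beta * migration_cost:
            decisions.append("migrate")
            accumulated_overhead = 0.0   # start fresh at the new site
        else:
            decisions.append("stay")
    return decisions

# A brief spike (slot 2) is absorbed; sustained overhead forces a move.
costs = [(5, 5), (9, 5), (5, 5), (9, 5), (9, 5), (9, 5)]
print(lazy_migration(costs, migration_cost=10))
# ['stay', 'stay', 'stay', 'stay', 'migrate', 'stay']
```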
W. Dou et al. [6] pointed out that cloud computing promises a scalable infrastructure for processing big data applications such as medical data analysis. Cross-cloud service composition provides a concrete approach capable of large-scale big data processing. However, the complexity of potential compositions of cloud services calls for new composition and aggregation methods, especially when some private clouds refuse to disclose all details of their service transaction records due to business privacy concerns in cross-cloud scenarios. If a cloud fails to deliver its services according to its "promised" quality, the credibility of cross-cloud and online service compositions becomes suspect. In view of these challenges, they proposed a privacy-aware cross-cloud service composition method named HireSome-II (History record-based Service optimization method), based on its previous basic version, HireSome-I. In this method, to enhance the credibility of a composition plan, the evaluation of a service is based on some of its QoS history records rather than its advertised QoS values. Besides, the k-means algorithm is introduced into the proposed method as a data-filtering tool to select representative history records. As a result, HireSome-II can protect cloud privacy, as a cloud is not required to unveil all its transaction records. Furthermore, it significantly reduces the time complexity of developing a cross-cloud service composition plan, as only representative records are recruited, which is demanded for big data processing. Simulation and analytical results demonstrate the validity of their method compared to a benchmark.
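To illustrate how k-means can filter a QoS history down to representative records (a generic sketch of the idea, not the authors' exact procedure; the QoS attributes and k here are chosen arbitrarily), the code below clusters (response time, throughput) pairs and keeps only the record nearest each centroid:

```python
import math
import random

def kmeans_representatives(records, k=3, iters=50, seed=0):
    """Cluster 2-D QoS records and return one representative
    (the record closest to each centroid) per cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(records, k)
    for _ in range(iters):
        # Assign each record to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for r in records:
            i = min(range(k), key=lambda j: math.dist(r, centroids[j]))
            clusters[i].append(r)
        # Recompute centroids as cluster means (keep empty ones as-is).
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return [min(c, key=lambda r: math.dist(r, centroids[i]))
            for i, c in enumerate(clusters) if c]

# QoS history as (response_time_ms, throughput_rps); only the
# representatives would be shared, keeping the full log private.
history = [(40, 210), (42, 205), (120, 90), (118, 95), (300, 20), (310, 18)]
print(kmeans_representatives(history, k=3))
```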
CONCLUSION
In this paper, we have discussed Bigtable, big data in the cloud, and the types of big data. The research findings of various authors have been studied and their results discussed. Big data, backed by cloud storage and processing, is a useful technology that makes large-scale data easier to handle.
REFERENCES
[1] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," OSDI, pp. 1-14, 2006.
[2] A. Kumar, H. Lee, and R. P. Singh, "Efficient and Secure Cloud Storage for Handling Big Data," pp. 1-5, 2012.
[3] X. Zhang and F. Xu, "Survey of Research on Big Data Storage," IEEE, pp. 1-5, 2013.
[4] Z. Zheng, J. Zhu, and M. R. Lyu, "Service-generated Big Data and Big Data-as-a-Service: An Overview," IEEE, pp. 1-8, 2013.
[5] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau, "Moving Big Data to The Cloud: An Online Cost-Minimizing Approach," IJCA, Vol. 31, No. 12, Dec. 2013.
[6] W. Dou, X. Zhang, J. Liu, and J. Chen, "HireSome-II: Towards Privacy-Aware Cross-Cloud Service Composition for Big Data Applications," IEEE TPDS-2013-08-0725, pp. 1-11.
[7] K. Kanagasabapathi and S. B. Akshaya, "Secure Sharing of Financial Records with Third Party Application Integration in Cloud Computing," IEEE, pp. 1-3, 2013.
[8] M. Ferguson, "Enterprise Information Protection - The Impact of Big Data," Intelligent Business Strategies, pp. 1-40, March 2013.
[9] N. Couch and B. Robins, "Big Data for Defence and Security," Occasional Paper, 2013.
[10] P. Amirian, A. Basiri, and A. Winstanley, "Efficient Online Sharing of Geospatial Big Data Using NoSQL XML Databases," IEEE, p. 1, 2013.