BIG DATA AND CLOUD SYSTEMS

Bhavika Arigala, Kshitij Dhar, Aishwary, M S Vishvesh
Department of Computer Science, Sir M Visvesvaraya Institute of Technology

Abstract- The term big data came to prominence with the enormous increase in global data, as a technology able to store, manage, and analyze large datasets, providing both enterprises and science with deep insight into their clients and experiments. Big data includes structured as well as unstructured data. Cloud computing is one of the most powerful technologies for performing massive-scale and complex computing: it eliminates the need to maintain expensive computing hardware, dedicated space, and the related software. In this paper we present an overview of both concepts, the related work, and the challenges each of them faces.

Keywords: Hadoop, MapReduce, HDFS, Big Data, Cloud computing.

1. INTRODUCTION

Big Data Analytics (BDA) is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offers greater statistical power, while data with higher complexity (many rows and columns) can make analysis cumbersome and may lead to a higher false discovery rate. Current usage of the term big data tends to refer to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. Big data is associated with five keywords, also known as the five V's of big data: volume, variety, velocity, value, and veracity.

1. Volume covers the many factors that contribute to the increase in the volume of data, such as the IoT.
2. Variety refers to the different types of data, from various sources, that big data frameworks have to deal with.
3. Velocity concerns the different rates at which data streams may enter or leave the system, and provides an abstraction layer so that big data systems can store data independently of the incoming or outgoing rate.
4. Value concerns the true value of data, i.e., the potential value of the data in terms of the information it contains. Huge amounts of data are worthless unless they provide value.
5. Veracity refers to the dependability of the data, addressing data confidentiality, integrity, and availability. Organizations need to ensure that the data, as well as the analyses performed on it, are correct.

Cloud computing is another paradigm, one which promises theoretically unlimited on-demand services to its users. Clouds may be limited to a single organization or be available to many organizations. Large clouds, predominant today, often have functions distributed over multiple locations from central servers; if the connection to the user is relatively close, it may be designated an edge server. A number of architectures and deployment models exist for cloud computing, and they can be combined with other technologies and design approaches. Owners of small to medium-sized businesses who cannot afford to adopt clustered NAS technology can consider a number of cloud computing models to meet their big data needs, and they need to choose the right cloud computing model in order to remain both competitive and profitable.
Cloud providers usually offer three basic services: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Since the cloud virtualizes resources in an on-demand fashion, it is the most suitable framework for big data processing: through hardware virtualization it creates a high-processing-power environment for big data.

2. RELATED WORK

There are several software products and technologies available to facilitate big data analytics with the help of analytical techniques. These tools help in big data processing and resource management. We describe the most common ones in this paper.

2.1 Hadoop

Hadoop is a free, Java-based programming framework that manages data processing and storage for big data applications in a distributed computing environment. Formally known as Apache Hadoop, it is developed as part of an open-source project within the Apache Software Foundation. Hadoop's ability to process and store different types of data makes it a particularly good fit for big data environments. [1] Hadoop systems can handle various forms of data, thereby providing more flexibility than relational databases and data warehouses do. A Hadoop cluster uses a master/slave structure. Using Hadoop, large datasets can be processed over a cluster of servers, and applications can be run on systems with thousands of nodes involving thousands of terabytes of information, as shown in Fig 1. [2]

Fig 1. Hadoop Framework Structure

The distributed file system in Hadoop enables rapid data transfer rates and allows the system to continue normal operation even when nodes fail. This approach lowers the risk of an entire system failure, enabling a computing solution that is scalable, cost-effective, flexible, and fault-tolerant. The Hadoop framework is used by giants such as IBM, Google, Yahoo, and Amazon to support their applications involving huge amounts of data. Some institutions manage their analytical data in the enterprise data warehouse (EDW) on their own, while others use a separate platform that helps relieve some of the burden that managing the data on the EDW places on the server. [3] Servers can be added to or removed from the cluster dynamically without causing any interruption to operations. Hadoop has two main subprojects: MapReduce and the Hadoop Distributed File System (HDFS).

2.2 MapReduce

MapReduce is a programming model suited to the processing of huge volumes of data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, and are thus very useful for performing large-scale data analysis using multiple machines in a cluster. [4] In simple terms, it is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). [5] The framework is reliable and fault-tolerant. It is a processing technique built on the divide-and-conquer principle: initially, the application is divided into individual chunks, which are processed by individual map jobs in parallel. The results of the maps, sorted by the framework, are then sent as input to the reduce tasks. Consider the following input to our MapReduce program:

Welcome to Hadoop Class | Hadoop is good | Hadoop is bad

Fig 2. Map Reduce Architecture using an example

The mapping step takes this set of data and converts it into another set of data by breaking the individual elements into key/value pairs called tuples. The second step, reducing, takes the output derived from the mapping process and combines the data tuples into a smaller set of tuples. [5] Scheduling, monitoring, and re-executing failed tasks are taken care of by the framework. [6]
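To make the two steps concrete, here is a minimal, self-contained Python sketch of the word-count pattern on the input above. It simulates the shuffle phase in memory with a dictionary; on a real Hadoop cluster the framework itself sorts and distributes the intermediate tuples across nodes.

```python
from collections import defaultdict

# The three input chunks from the example above.
chunks = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

def mapper(chunk):
    # Map step: emit a (key, value) tuple for every word.
    for word in chunk.split():
        yield (word, 1)

# Shuffle step (simulated): group the intermediate tuples by key.
groups = defaultdict(list)
for chunk in chunks:
    for word, count in mapper(chunk):
        groups[word].append(count)

def reducer(word, counts):
    # Reduce step: combine all tuples for one key into a summary value.
    return (word, sum(counts))

for word, total in (reducer(w, c) for w, c in sorted(groups.items())):
    print(word, total)   # e.g. "Hadoop 3", "is 2", ...
```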
2.3 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. [7] It spans all the nodes in a Hadoop cluster for data storage.

HDFS Master (NameNode) - The NameNode regulates file access for clients. It maintains and manages the slave nodes and assigns tasks to them, and it executes file system namespace operations such as opening, closing, and renaming files and directories.

HDFS Slave (DataNode) - There are n slaves (where n can be up to 1000), or DataNodes, in the Hadoop Distributed File System, which manage the storage of data. These slave nodes are the actual workers: they carry out the tasks and serve read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the NameNode. [8]

Fig 3. HDFS Architecture

3. ADVANTAGES

Big data analytics gives an organization the ability to make faster, more informed business decisions backed by facts; a deeper understanding of customer requirements, which in turn builds better business relationships; and an increased awareness of risk, enabling the implementation of preventative measures. Here are a few more benefits: [9]

● Knowing about errors within the organisation instantly.
● Implementing new strategies.
● Improving service dramatically.
● Detecting fraud the moment it happens.
● Cost savings.
● Better sales insights.
● Keeping up with customer trends.
● Increased efficiency.
● Improved pricing.
● The ability to compete with big businesses.
● The freedom to focus on local preferences.
● Increased sales and customer loyalty.

4. RISKS

4.1 Legal compliance

Where the data contains personal data, a business needs to comply with data protection laws and the upcoming requirements of the General Data Protection Regulation. Personal data is any data which can identify an individual (e.g. name, location data, IP address). A failure to comply with data protection legislation could result in a serious data breach, which can attract large fines and lead to reputational damage. [10] The regulatory burden and the risk of breaching data protection laws can be mitigated by considering whether all of the data is necessary, whether a time limit for retention could be imposed, or whether the personal data can be anonymised. Where personal data is used, businesses need appropriate practices and policies in place.
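As an illustration of the anonymisation option mentioned above, the following is a minimal Python sketch of one common first step, pseudonymisation, which replaces directly identifying fields with salted one-way hashes. The field names and salt value are hypothetical, and pseudonymisation alone is not full anonymisation: re-identification from combinations of the remaining fields must also be considered.

```python
import hashlib

# Hypothetical secret salt; in practice it must be stored
# separately from the dataset it protects.
SALT = "replace-with-a-secret-salt"

def pseudonymise(record, identifying_fields=("name", "ip_address")):
    """Return a copy of the record with identifying fields hashed."""
    out = dict(record)
    for field in identifying_fields:
        if field in out:
            digest = hashlib.sha256((SALT + str(out[field])).encode()).hexdigest()
            out[field] = digest[:12]   # truncated hash serves as a pseudonym
    return out

print(pseudonymise({"name": "Jane Doe", "ip_address": "10.0.0.7", "age": 34}))
```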
4.2 Accuracy

Big Data Analytics is predictive in nature, which sometimes means that it draws inaccurate conclusions. If the information that is input is biased, the results are also likely to be biased. For example, the City of Boston provides a Street Bump app for smartphones. On a car journey, the app uses the phone's accelerometer and GPS data to record movements caused by problems with the roads (e.g. potholes) and transmits the data to the council. Differing levels of smartphone ownership among socioeconomic groups mean that more data may be collected from the more affluent areas rather than from those with the worst roads.

4.3 Intellectual Property

Third parties may have rights over the databases, software or algorithms used to analyse data sets. Businesses must make sure they have adequate licences so as not to infringe another party's intellectual property rights.

4.4 Cyber security

If data is valuable to one business, it is likely to be valuable to others. Businesses must make sure they have adequate security measures in place to protect the data they own and collect. Where personal data is involved, safeguards may need to be higher.

4.5 Competition law

As data is a valuable commodity, competition authorities may prioritise investigating how companies use big data and the analytics associated with it. For example, the German competition authority found that Facebook abused a dominant market position through the use of targeted advertising on consumers.

Other risks to consider:

● Unorganized data
● Data storage and retention
● Cost management
● Incompetent analytics
● Data privacy

5. CHALLENGES

5.1 Building a private cloud

Although building a private cloud isn't a top priority for many organizations, for those who are likely to implement such a solution it quickly becomes one of the main challenges facing cloud computing, so private solutions should be carefully addressed. Creating an internal or private cloud brings a significant benefit: having all the data in-house. But IT managers and departments will need to build and glue it all together by themselves, which can make this one of the most difficult parts of moving to cloud computing. It is also important to keep in mind the steps that are needed to ensure the smooth operation of the cloud:

● Automating as many manual tasks as possible (which would require an inventory management system).
● Orchestrating tasks, which has to ensure that each of them is executed in the right order (see the sketch after this list).
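Ordering tasks correctly, as the second item above requires, is commonly modelled as a dependency graph. Here is a minimal Python sketch using the standard library's graphlib module (Python 3.9+); the task names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical private-cloud setup tasks, each mapped to the
# tasks that must finish before it can start.
tasks = {
    "provision_vms":     [],
    "configure_network": ["provision_vms"],
    "attach_storage":    ["provision_vms"],
    "deploy_services":   ["configure_network", "attach_storage"],
    "run_health_checks": ["deploy_services"],
}

# static_order() yields the tasks so that every prerequisite
# comes before any task that depends on it.
for task in TopologicalSorter(tasks).static_order():
    print("executing:", task)
```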
5.2 Managing multiple clouds

The challenges facing cloud computing are not confined to a single cloud: multi-cloud use has grown exponentially in recent years. Companies are shifting between or combining public and private clouds and, as mentioned earlier, tech giants like Alibaba and Amazon are leading the way. In the referred survey, 81 percent of enterprises have a multi-cloud strategy. Enterprises with a hybrid strategy (combining public and private clouds) fell from 58 percent in 2017 to 51 percent in 2018, while organizations with a strategy of multiple public clouds or multiple private clouds grew slightly.

5.3 Dealing with data growth

The most obvious challenge associated with big data is simply storing and analyzing all that information. In its Digital Universe report, IDC estimates that the amount of information stored in the world's IT systems is doubling about every two years. By 2020, the total amount will be enough to fill a stack of tablets that reaches from the earth to the moon 6.6 times, and enterprises have responsibility or liability for about 85 percent of that information. Much of that data is unstructured, meaning that it doesn't reside in a database; documents, photos, audio, videos and other unstructured data can be difficult to search and analyze.

When it comes to storage, converged and hyperconverged infrastructure and software-defined storage can make it easier for companies to scale their hardware, and technologies like compression, deduplication and tiering can reduce the amount of space and the costs associated with big data storage. On the management and analysis side, enterprises are using tools like NoSQL databases, Hadoop, Spark, big data analytics software, business intelligence applications, artificial intelligence and machine learning to help them comb through their big data stores to find the insights their companies need.

5.4 Generating insights in a timely manner

Organizations don't just want to store their big data, they want to use it to achieve business goals. According to the NewVantage Partners survey, the most common goals associated with big data projects include the following:

1. Decreasing expenses through operational cost efficiencies
2. Establishing a data-driven culture
3. Creating new avenues for innovation and disruption
4. Accelerating the speed with which new capabilities and services are deployed
5. Launching new product and service offerings

All of those goals can help organizations become more competitive. "Everyone wants decision-making to be faster, especially in banking, insurance, and healthcare." To achieve that speed, some organizations are looking to a new generation of ETL and analytics tools that dramatically reduce the time it takes to generate reports, and are investing in software with real-time analytics capabilities that allows them to respond to developments in the marketplace immediately.

5.5 Recruiting and retaining big data talent

In order to develop, manage and run the applications that generate insights, organizations need professionals with big data skills. That has driven up demand for big data experts, and big data salaries have increased dramatically as a result. The 2017 Robert Half Technology Salary Guide reported that big data engineers were earning between $135,000 and $196,000 on average, while data scientist salaries ranged from $116,000 to $163,500. Even business intelligence analysts were very well paid, making $118,000 to $138,750 per year. To deal with talent shortages, organizations have a few options. First, many are increasing their budgets and their recruitment and retention efforts. Second, they are offering more training opportunities to their current staff members in an attempt to develop the talent they need from within. Third, many organizations are looking to technology, buying analytics solutions with self-service and/or machine learning capabilities.

5.6 Integrating disparate data sources

The variety associated with big data leads to challenges in data integration. Big data comes from a lot of different places: enterprise applications, social media streams, email systems, employee-created documents, and so on. Combining all that data and reconciling it so that it can be used to create reports can be incredibly difficult. Vendors offer a variety of ETL and data integration tools designed to make the process easier, but many enterprises say that they have not solved the data integration problem yet.
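To show what reconciling records means in practice, here is a minimal Python sketch that normalises data about the same customer from two hypothetical sources into a common schema and merges it on a shared identifier; all field names and values are illustrative.

```python
# Records about the same customer arriving from two hypothetical
# sources with different field names and formats.
crm_rows = [{"customer_id": "C1", "full_name": "Jane Doe", "spend": "120.50"}]
web_rows = [{"uid": "C1", "name": "jane doe", "sessions": 42}]

def normalise_crm(row):
    # Map the CRM schema onto the common reporting schema.
    return {"id": row["customer_id"], "name": row["full_name"].title(),
            "spend": float(row["spend"])}

def normalise_web(row):
    # Map the web-analytics schema onto the same common schema.
    return {"id": row["uid"], "name": row["name"].title(),
            "sessions": row["sessions"]}

# Merge on the shared id so each customer yields one reporting row.
merged = {}
for row in [normalise_crm(r) for r in crm_rows] + [normalise_web(r) for r in web_rows]:
    merged.setdefault(row["id"], {}).update(row)

print(list(merged.values()))
# [{'id': 'C1', 'name': 'Jane Doe', 'spend': 120.5, 'sessions': 42}]
```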
In response, many enterprises are turning to new technology solutions. In the IDG report, 89 percent of those surveyed said that their companies planned to invest in new big data tools in the next 12 to 18 months. When asked which kind of tools they were planning to purchase, integration technology was second on the list, behind data analytics software.

5.7 Validating data

Organizations are getting similar pieces of data from different systems, and the data in those different systems doesn't always agree. For example, the ecommerce system may show daily sales at a certain level while the enterprise resource planning (ERP) system has a slightly different number. Or a hospital's electronic health record (EHR) system may have one address for a patient, while a partner pharmacy has a different address on record. The process of getting those records to agree, as well as making sure the records are accurate, usable and secure, is called data governance. In the AtScale 2016 Big Data Maturity Survey, the fastest-growing area of concern cited by respondents was data governance. [11] Solving data governance challenges is very complex and usually requires a combination of policy changes and technology. Organizations often set up a group of people to oversee data governance and write a set of policies and procedures. They may also invest in data management solutions designed to simplify data governance and help ensure the accuracy of big data stores, and of the insights derived from them.

5.8 Securing big data

Security is also a big concern for organizations with big data stores. After all, some big data stores can be attractive targets for hackers or advanced persistent threats (APTs). However, most organizations seem to believe that their existing data security methods are sufficient for their big data needs as well. In the IDG survey, less than half of those surveyed (39 percent) said that they were using additional security measures for their big data repositories or analyses. Among those who do use additional measures, the most popular include identity and access control (59 percent), data encryption (52 percent) and data segregation (42 percent).
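Of those measures, data encryption is the easiest to illustrate briefly. Below is a minimal sketch of encrypting a record at rest using the third-party Python cryptography package; the record contents are hypothetical, and in a real deployment the key would live in a key-management service, never alongside the data it protects.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Generate a symmetric key; in practice, fetch it from a KMS.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b'{"patient_id": "P123", "address": "221B Baker St"}'
token = fernet.encrypt(record)   # ciphertext, safe to store on disk
print(fernet.decrypt(token))     # original bytes, recoverable with the key
```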
5.9 Organizational resistance

It is not only the technological aspects of big data that can be challenging; people can be an issue too. In the NewVantage Partners survey, 85.5 percent of those surveyed said that their firms were committed to creating a data-driven culture, but only 37.1 percent said they had been successful with those efforts. When asked about the impediments to that culture shift, respondents pointed to three big obstacles within their organizations:

● Insufficient organizational alignment (4.6 percent)
● Lack of middle management adoption and understanding (41.0 percent)
● Business resistance or lack of understanding (41.0 percent)

6. FUTURE WORK

With the next generation of big data technology, it should be possible to manage both traditional and new data sets together on a single cloud platform. This allows us to use the data storage (the object store native to the cloud infrastructure) and the compute capabilities of the cloud infrastructure out of the box: no more setting up and managing Hadoop clusters, no more provisioning hardware. It is a paradigm shift in how we think about data management, because now the cloud is the data platform. It also enables any user to work with any kind of data quickly, securely and efficiently, in a way that fits immediate business needs.

7. CONCLUSION

There is a generational shift under way in how we think about big data processing. The big data platform of the future is highly performant, scalable, elastic, and in the cloud. We really don't need to stand up and maintain our own big data infrastructure anymore, since all the capabilities we need are available in the cloud today.

Many organizations have benefited from implementing the cloud in their business. The cloud is a cost-effective solution which caters to many needs of a company and is being adopted by enterprises in every industry. Cloud computing makes it easy for IT heads to store and analyze their data with optimum computing resources. It has significant advantages over a traditional system; however, in some instances, cloud platforms have to be integrated with traditional architectures.

Decision-makers are sometimes faced with a dilemma over whether the cloud really is the solution for their big data project. Big data exhibits unpredictable and immense computing and storage needs, and traditional systems have proved to be slower, since storing and managing the data on them is a time-consuming and tedious process, so it is vital to look for a solution that will do the trick. Since its adoption by many organizations, the cloud has provided all the resources needed to run multiple virtual servers and cloud databases seamlessly within a matter of minutes. The compute power and data storage capacity that the cloud provides result in great agility in business functions, and huge amounts of data can be processed within a minimal time frame thanks to flexible resource allocation.

It is not easy to switch to a different technology, because many factors need to be taken into consideration. Organizations have a certain budget when they wish to switch to a newer technology, and in this case the cloud is a blessing: an avant-garde technology within a budget. Companies are free to choose the services they need according to their business and budget requirements. The applications and resources needed to manage big data do not cost much and can be easily implemented by enterprises; we pay only for the amount of storage space we use, and no additional charges are incurred.

When there are high business demands, traditional solutions require extra physical servers in the cluster to deliver the necessary processing power and storage space, but the virtual nature of the cloud allows resources to be allocated on demand for the smooth functioning of the business. Scaling is a great option for obtaining the desired processing power and storage space whenever required. Big data requires a high-performance data processing platform for analytics, and there can be variations in demand that only the cloud environment can satisfy. The right management, monitoring and tools should be available to analyze big data on the cloud; there are various big data analytics platforms, such as Apache Hadoop, a Java-based programming framework which processes structured and unstructured data.

Big data and cloud computing have truly changed the way organizations process their data and use it in their business. These technologies have impacted businesses for the better, because every decision made through the analysis of big data contributes to the success of a business.
The future is bright, as we can see further growth ahead for cloud computing and big data analytics.

References

[1] https://searchdatamanagement.techtarget.com/definition/Hadoop
[2] https://beyondcorner.com/learn-apache-hadoop/hadoop-ecosystem-architecture-components/
[3] https://www.researchgate.net/publication/325011725_Review_Paper_on_Big_Data_Analytics_in_Cloud_Computing
[4] https://www.guru99.com/introduction-to-mapreduce.html
[5] https://en.wikipedia.org/wiki/MapReduce
[6] https://www.researchgate.net/publication/316051568_Big_Data_and_Cloud_Computing_Trends_and_Challenges
[7] https://searchdatamanagement.techtarget.com/definition/Hadoop-Distributed-File-System-HDFS
[8] https://data-flair.training/blogs/hadoop-hdfs-tutorial/
[9] https://blogs.oracle.com/bigdata/big-data-future-is-cloud
[10] https://yourstory.com/mystory/3ddbbf1fb6-big-data-and-cloud-com
[11] https://www.datapine.com/blog/cloud-computing-risks-and-challenges