Mining Big Data

Akshat Shah, Vidhi Shah, Sindhu Nair (Faculty)
Department of Computer Engg, Dwarkadas J. Sanghvi COE, Mumbai, India
akshatshah1710@gmail.com, vidhi240194@gmail.com, Sindhu.Nair@djsce.ac.in

ABSTRACT
Large and complex collections of datasets are generally called Big Data. Such datasets cannot be stored, managed or analyzed easily by current tools and methodologies because of their size and complexity. However, these datasets also provide various opportunities, such as modelling and predicting the future. This overwhelming growth of data is coupled with new challenges, and the massive rate of increase has created some of the most exciting opportunities for researchers in the coming years. In this paper we discuss the topic in detail: its current scenario, its characteristics, and the challenges it poses for the future. The paper also discusses the tools and technologies used to manage these large datasets, as well as essential issues such as security management and privacy of data.

Keywords
Big Data, Data mining, datasets, HACE theorem, Hadoop, MapReduce, 3V's

1. INTRODUCTION
The technological advancement of recent years has resulted in a meteoric increase in data, collected from various sensors and devices, in different formats, and from independent or connected applications [1][2][3]. This dramatic increase has surpassed our capability to process, analyze, store and understand these datasets [2]. According to Google, the number of indexed web pages was 1 million in 1998, reached 1 billion in 2000, and had already exceeded 1 trillion by 2008. The rise of social networking sites such as Facebook and Twitter allowed users to freely create content and drastically increased the Web volume. With the advent of Android and other mobile operating systems, mobile phones became a centre piece for data collection, allowing the collection of real-time data about people from different aspects of their lives and increasing CDR (call data record) based processing. The Internet of Things (IoT) can also be considered an important factor in this unprecedented increase in data [1][2][3].
Real-time analytics is required in order to manage this rapidly increasing data from networking applications, emails, blogs, tweets, posts and other sources [4]. Enterprises may use such data to understand general trends and problems and to predict the future, but this also increases security risks and can affect the privacy of sensitive information and the integrity of analytic results. These data occur in various forms, such as text, images and video, so establishing the reliability, trustworthiness and completeness of data from different sources becomes very difficult [5], which worsens the problem of ensuring the overall quality of the data. It also generates security and privacy risks, since such data can be used to reveal very sensitive and confidential information [5]. The advancement of network technologies and of mobile and cloud computing poses a further threat, since unfiltered data must be processed in motion. The increasing volume, velocity and variety of data present an unprecedented level of security and privacy challenges with respect to Big Data [5]. This has increased the number of threats appearing within short periods of time and has resulted in many hacking incidents and cyber crimes over the years. With these, the number of hacker tools has also increased; with the help of big data analytics tools, attackers can acquire computing resources and create security setbacks that never existed before [5]. Nowadays we are witnessing an ever increasing growth of data that poses a serious threat to the security and privacy of that data. It is necessary to safeguard the data from such threats, since we remain at high risk of privacy being compromised by the rise of new technologies [3]. We face various challenges in leveraging this vast amount of data, including (1) system capabilities, (2) algorithmic design, (3) business models, (4) security and (5) privacy. The impact that Big Data is having on the data mining community, for example, is considered one of the most exciting opportunities of the years to come [3]. Moreover, the secure management of Big Data under today's threat spectrum is also a highly challenging problem. This paper shows that significant research effort is needed, as future work, to build a generic architectural framework that addresses these security and privacy challenges in a holistic manner.

2. MINING BIG DATA
Big Data is a term used to identify datasets that, due to their large size and complexity, cannot be managed efficiently with current methodologies or mining software tools. With suitable techniques, data can be extracted from large data sets or streams of data, which was earlier not possible because of their volume, variability and velocity [3][11][12][13]. The Big Data challenge is becoming one of the most exciting opportunities for the years ahead. Big Data is generally of two types: structured and unstructured. Data that can be easily categorized and analyzed, such as data from smartphones and GPS devices, sales figures and account balances, is structured data. Information such as comments on social media, photos, reviews and tweets is unstructured data. Big Data mining of structured data is comparatively easier than mining of unstructured data [6][7], and new algorithms and new tools are needed to deal with the data being collected.

Figure 1. The 3 V's in Big Data management: Volume, Variety, Velocity [2][3]

Doug Laney [2] was the first to talk about the 3 V's of Big Data management:
Volume: the data is ever increasing, but the amount of data that present tools can analyse is not.
Variety: there are many different types of data, such as numerical data, images and videos.
Velocity: data arrives continuously as streams, and we are interested in obtaining useful information from it in real time.
Generally, data mining for Big Data analyzes the data and extracts useful information from it. Data mining comprises several algorithms, which fall into the following four categories:
Association: used to search for relationships between variables, and applied, for example, in searching for frequently visited items.
Clustering: discovers structures and groups in the data and classifies the data according to these groups.
Regression: finds a function that models the data.
Classification: associates an unknown structure with a known structure.
These algorithms can be converted into Big Data MapReduce algorithms, as the sketch below illustrates. Data clustering in particular has attracted attention during recent years; however, the ever enlarging data makes it a challenging task [10].
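To make this conversion concrete, the following minimal Python sketch (an illustration of the MapReduce model in general, not of any particular framework; all names are our own) recasts the association-style task of counting frequent items as a map phase, a grouping step, and a reduce phase:

from collections import defaultdict

def map_phase(transaction):
    # Map: emit an (item, 1) pair for every item in one transaction.
    return [(item, 1) for item in transaction]

def reduce_phase(item, counts):
    # Reduce: summarize all counts emitted for one item.
    return item, sum(counts)

transactions = [["milk", "bread"], ["bread", "butter"], ["milk", "bread", "butter"]]

# Shuffle: group intermediate pairs by key, as a MapReduce runtime would do
# across many nodes; here everything runs in a single process.
grouped = defaultdict(list)
for t in transactions:
    for item, count in map_phase(t):
        grouped[item].append(count)

print(dict(reduce_phase(item, counts) for item, counts in grouped.items()))
# {'milk': 2, 'bread': 3, 'butter': 2}

Because each map call touches only one transaction and each reduce call touches only one key, both phases parallelize naturally across nodes, which is what makes this conversion attractive for Big Data.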
3. BIG DATA CHARACTERISTICS: HACE THEOREM
Big Data starts with large-volume, heterogeneous, autonomous sources under distributed and decentralized control, and seeks to explore complex and evolving relationships among the data. Discovering useful information from such data therefore becomes a challenging task. The HACE theorem suggests the following key characteristics of Big Data [20].

3.1. Huge Data with Heterogeneous and Diverse Dimensionality
One important characteristic of Big Data is the heterogeneity and diverse dimensionality of its representations. This heterogeneity exists because different information is represented in different schemas according to the application's nature and the user's convenience, which results in diverse representations of the same data. For example, an individual in the biomedical world can be represented by demographic information such as sex, age, past medical history and allergies. For an X-ray examination of a patient, images or videos are used to represent the results. For a DNA test, microarray expression images are used to represent the genetic information. The different types of representation for the same individual are called heterogeneous features, and the variety of features involved in representing each single observation are called diverse features [14].
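As a small illustration of heterogeneous and diverse features (the field names below are our own hypothetical examples, not drawn from [14]), the same patient may arrive in three differently structured records that a mining pipeline must first reconcile:

# Three heterogeneous representations of the same individual:
demographic = {"patient_id": 17, "sex": "F", "age": 54, "allergies": ["penicillin"]}
xray_exam = {"patient_id": 17, "modality": "X-ray", "frames": ["img_001.png", "img_002.png"]}
dna_test = {"patient_id": 17, "microarray": [[0.8, 0.1], [0.3, 0.9]]}

def merge_views(*views):
    # Reconcile the diverse schemas into one record; real systems need
    # schema mapping and conflict resolution rather than a plain update.
    record = {}
    for view in views:
        record.update(view)
    return record

print(merge_views(demographic, xray_exam, dna_test))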
3.2. Autonomous Sources with Distributed and Decentralized Control
Autonomous sources generate and collect information without any centralized control. Moreover, since the data might be corrupted or is vulnerable to threats, these large volumes of data are generally distributed across many servers. Consider the World Wide Web (WWW), in which each web server provides a certain amount of information and is able to function fully without relying on the other servers [14].

3.3. Complex and Evolving Relationships
Complex data consists of multi-structured, multi-source data such as bills, documents, images and videos, and the relationships among these data evolve over time. This requires a correspondingly sophisticated approach to dealing with Big Data [14].

4. PROBLEMS THAT ARISE
The increase in data has generated enthusiasm about the possibilities of measuring and monitoring Big Data in order to explore it for useful purposes. A major problem, however, is that Big Data is usually not collected for statistical purposes, which increases the risk of yielding spurious figures. Extraction from Big Data sometimes results in poor analytics, because the data obtained from digital media, electronics, social media websites and sensors is largely unstructured and unfiltered, unlike traditional data sources, which are well structured and support good analytics but involve a fairly high cost of data collection [17][18]. Big Data mining therefore often risks drawing misleading conclusions from its large data sets. Several problems are associated with Big Data mining:
Bias in selection: data mining is mostly performed on observational data, which means there is no control over the treatments or conditions of the objects being studied, whereas in experimental studies such control is well exercised.
Data may be out of date: in business, data mining generally has to be performed on streaming data, and past data is of seldom use in mining applications.
Empirical rather than iconic models: iconic models are mathematical representations, whereas empirical models are based on finding convenient summaries of a data set. Most data mining models are empirical, and they yield comparatively less accurate results than iconic models.
Performance measurement: since different performance criteria are likely to yield different orders of merit, it is important to choose a criterion that closely matches the objectives of the analysis.

5. CHALLENGES IN BIG DATA
Big Data mining deals with collections of data sets whose sizes are beyond the ability of software tools, such as database management tools and other traditional tools, to capture, manage and analyze within a stipulated time. Since the data across the Internet increases continuously, meeting this challenge is quite difficult. Big Data provides extraordinary benefits to the world economy, in national security and in areas ranging from marketing and risk analysis to medical research and urban planning. However, concerns about privacy and data protection can outweigh these benefits. Big Data needs to be continuously verified, because data sources can easily be corrupted by adding malicious data. Information security is hence itself becoming a big data analytics problem, in which massive data sets must be explored. Any security control must meet the following requirements:
Basic functionalities should not be compromised.
Big Data characteristics should not be compromised.
Security threats should be addressed properly.
Security can be tightened using the following techniques (a small sketch of the encryption technique follows this list):
Authentication methods: a process that verifies the user before access to the system is granted, for example Kerberos.
Encryption: ensures the confidentiality and privacy of user information and secures vital data; it protects the data even if a malicious user gains access to the system.
Access control: specifies access control privileges for users.
Logging: keeps a record of activity in order to detect attacks, diagnose failures and investigate unusual behavior; it gives a place to look when something fails or the system is hacked.
Secure communication: can be established between nodes, and between nodes and applications, using SSL/TLS to protect all network communications.
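As a minimal sketch of the encryption technique listed above, the following Python fragment protects a record at rest with a symmetric key; it assumes the third-party cryptography package is available, and the record content is purely illustrative:

from cryptography.fernet import Fernet

# In practice the key would live in a key-management system, not in the script.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b"patient_id=17, diagnosis=confidential"  # hypothetical sensitive record
token = cipher.encrypt(record)          # ciphertext is safe to store on untrusted nodes
print(cipher.decrypt(token) == record)  # True: only key holders can read the data

Even if a malicious user copies the stored token, the record remains unreadable without the key, which is exactly the property the list item describes.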
6. BIG DATA FOR DEVELOPMENT: CHALLENGES AND OPPORTUNITIES
Big Data offers several opportunities for development work:
Real-time awareness: programs and policies based on a more refined representation of reality [15].
Early warning: timely response in a crisis by detecting anomalies in the usage of digital media [15].
Real-time feedback: monitoring programs in real time to check which of them fail, and making changes accordingly [15].
Big Data for Development sources share some of the following features:
Digitally generated: the data is produced digitally, is stored sequentially, and can be manipulated easily [16].
Passively produced: the data is generally a by-product of our daily lives or of our interactions with digital services [16].
Automatically collected: the data is generated and collected automatically [16].
Trackable: location can be tracked from the data [16].
Continuously analysed: the information is relevant to human well-being and development and can be analysed in real time [16].

7. TOOLS AND TECHNIQUES FOR BIG DATA MINING
Big Data provides extraordinary benefits to big firms by producing useful information that helps them manage their problems. Big Data analysis involves the automatic discovery of information and the analysis of underlying patterns and hidden rules. These data sets are too large and complex for humans to extract information from manually without computational tools, and various emerging technologies help to process and transform Big Data into meaningful knowledge.

7.1. Hadoop
Hadoop is a scalable, open-source, fault-tolerant virtual grid operating system architecture for data storage and processing. It runs on commodity hardware and uses a fault-tolerant, high-bandwidth clustered storage architecture called HDFS. For distributed data processing it runs MapReduce, and it works with both structured and unstructured data. Tools such as Hive, Pig and Mahout, which are part of the Hadoop and HDFS framework, are used to handle the velocity and heterogeneity of the data. Hadoop and HDFS (the Hadoop Distributed File System) by Apache are widely used for storing and managing big data [3][21][22]. Hadoop consists of HDFS, which stores the data, and MapReduce, which processes it at an efficient rate. Hadoop splits files into large blocks and distributes them among the nodes of the cluster; Hadoop MapReduce then transfers packaged code to the nodes so that they process the data in parallel. The advantage of this approach is data locality: nodes manipulate the data they have on hand, which makes data processing faster and more efficient. The Hadoop framework is composed of the following modules:
Hadoop Common: the libraries and utilities;
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines;
Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications;
Hadoop MapReduce: a programming model for large-scale data processing.
The Hadoop package thus consists of a MapReduce engine, the HDFS file system and OS-level abstractions. Hadoop clusters can be small or large. A small Hadoop cluster consists of a single master node, which runs the JobTracker, TaskTracker, NameNode and DataNode, and multiple worker nodes that each act as both a DataNode and a TaskTracker. In a large cluster, HDFS is managed through a dedicated NameNode server that hosts the file system index, and a secondary NameNode that generates snapshots of the NameNode's memory structures in order to prevent loss of data [24].

Figure 2. Hadoop Cluster [24]

Hadoop Distributed File System: a Hadoop cluster has a single NameNode plus a cluster of DataNodes. HDFS serves blocks of data using a block protocol over the network and uses TCP/IP sockets for communication. It stores large files across multiple machines, and replicating the data across multiple hosts improves reliability while still providing efficient storage.
JobTracker and TaskTracker: client applications submit MapReduce jobs to the JobTracker, which passes the work on to the TaskTracker nodes.
Scheduling: by default, the Hadoop framework uses a First-In First-Out (FIFO) approach to scheduling.
MapReduce: a programming model for processing large data sets with a parallel, distributed algorithm on a cluster, and a software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes [19][23]. It consists of two methods: map(), which performs filtering and sorting, and reduce(), which summarizes the results.
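As a concrete sketch of the map()/reduce() pair, the two small scripts below implement a word count in the style of Hadoop Streaming, which pipes text over standard input and output; the file names are our own, and this is an illustrative sketch rather than a reference implementation:

# mapper.py -- map(): filter the input and emit one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- reduce(): summarize the counts for each word
# (the framework delivers the mapper output sorted by key)
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(count) for _, count in group)}")

Run locally, the pipeline cat input.txt | python mapper.py | sort | python reducer.py mimics what the cluster does at scale, with the sort step standing in for Hadoop's shuffle.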
7.2. S4
S4 provides a platform for processing continuous data streams and is generally used to manage them. S4 applications are designed by combining streams and processing elements in real time [2][3].

7.3. Storm
Storm is software for streaming data-intensive distributed applications, similar to S4, and was developed by Nathan Marz at Twitter [1][3].

8. FORECAST TO THE FUTURE
Big Data mining will face important challenges in the future due to the nature of big data itself. Among the challenges researchers will have to deal with are:
Analytics architecture: an optimal architecture is needed to deal with historic data and with increasing streaming data at the same time. One example of such an architecture is Nathan Marz's Lambda architecture.
Time-evolving data: Big Data mining must deal with data that evolve over time, and in some cases must detect the change first in order to adapt.
Distributed mining: many data mining methods are not trivial to parallelize; research is needed to produce distributed versions of these methods [22].
Compression: ever increasing data means that the space needed to store it must also be considered, so the data needs to be compressed; compression may take more time, but less space is required, as the short sketch after this list illustrates.
Statistical significance: statistically significant results are required in order to understand the data properly.
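The space/time trade-off mentioned under Compression can be seen with nothing more than the standard library; the payload below is a hypothetical stand-in for redundant machine-generated data:

import zlib

# Highly redundant, log-like data compresses extremely well.
raw = b"user=42 action=click page=/home\n" * 10_000
packed = zlib.compress(raw, level=9)  # level 9: slowest, smallest output

print(len(raw), "->", len(packed), "bytes")
assert zlib.decompress(packed) == raw  # lossless: nothing is thrown away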
This section has dealt with the challenges we expect when forecasting the future; the next section concludes the paper in brief.

9. CONCLUSION
Due to the expansion of the Internet, data is increasing at a tremendous rate; Big Data will therefore continue to grow over the next years and will remain one of the most exciting opportunities of the near future. This paper has provided an insight into the topic, its characteristics and its challenges for the future. Big Data is becoming important for scientific data research and for business applications alike. Security management and privacy protection are among the many challenges Big Data faces, and they are becoming major issues because of the ever increasing growth of data in volume, velocity and variety; from a security point of view, the threats associated with this data are increasing at an unprecedented rate. Future research is therefore needed to find an optimal architecture for managing this overwhelming data and for addressing the security and privacy challenges in a holistic manner. We are now in a new era in which Big Data mining will help us discover knowledge that no one has discovered before.

REFERENCES
[1] http://big-data-mining.org/
[2] Neha A. Kandalkar, Avinash Wadhe, "Extracting Large Data using Big Data Mining", International Journal of Engineering Trends and Technology (IJETT), Volume 9, Number 11, March 2014.
[3] Wei Fan, Albert Bifet, "Mining Big Data: Current Status, and Forecast to the Future".
[4] J. Gama, Knowledge Discovery from Data Streams, Chapman & Hall/CRC, 2010.
[5] James Joshi, Balaji Palanisamy, "Towards Risk-aware Policy based Framework for Big Data Security and Privacy", 2014.
[6] Bharti Thakur, Manish Mann, "Data Mining for Big Data: A Review", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014, ISSN 2277-128X.
[7] http://www.bls.gov/careeroutlook/2013/fall/art01.pdf
[8] UN Global Pulse, http://www.unglobalpulse.org
[9] Puneet Singh Duggal, Sanchita Paul, "Big Data Analysis: Challenges and Solutions", Int. Conf. on Cloud, Big Data and Trust, RGPV, 2013.
[10] A.N. Nandhakumar, Nandita Yambem, "A Survey of Data Mining Algorithms on Apache Hadoop Platforms", IJETAC, Volume 4, Issue 1, January 2014.
[11] D. Pratiba, G. Shobha, "Educational BigData Mining Approach in Cloud: Reviewing the Trend", International Journal of Computer Applications (0975-8887), Volume 92, No. 13, April 2014.
[12] Deepali Kishor Jadhav, "Big Data: The New Challenges in Data Mining", IJIRCST, ISSN 2347-5552, Volume 1, Issue 2, September 2013.
[13] Harshawardhan S. Bhosale, Devendra P. Gadekar, "A Review Paper on Big Data and Hadoop", International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014, ISSN 2250-3153.
[14] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding, "Data Mining with Big Data", IEEE, Volume 26, Issue 1, January 2014.
[15] http://albertbifet.com/global-pulse-big-data-for-development/
[16] http://www.ciard.net/sites/default/files/bb40_backgroundnote.pdf
[17] http://paris21.org/newsletter/fall2013/big-data-dr-jose-ramonalbert
[18] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding, "Data Mining with Big Data".
[19] Richa Gupta, Sunny Gupta, Anuradha Singhal, "Big Data: Overview", IJCTT, 9(5), 2014.
[20] http://www.ijarcsse.com/docs/papers/Volume_4/5_May2014/V4I5-0328.pdf
[21] http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=7021746
[22] http://www.academia.edu/10083723/ISSUES_CHALLENGES_AND_SOLUTIONS_BIG_DATA_MINING
[23] http://www.cs.uml.edu/~jlu1/doc/source/report/MapReduce.html
[24] https://en.wikipedia.org/wiki/Apache_Hadoop#Architecture