Introduction: The past decade has seen an explosion of "data-intensive" or "data-centric" applications, in which the analysis of large volumes of heterogeneous data is the core of the problem. These are commonly known as "Big Data applications", and the systems that support the management and processing of this data are commonly referred to as "Big Data processing systems".

What is Big Data? Big Data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. There are many challenges in dealing with this type of data, such as analysis, capture, manipulation, search, sharing, storage, transfer, visualization, and information privacy.

How to describe Big Data? It can be described by the following characteristics:
1. Volume: the quantity of data matters in this context, since it determines whether a data set counts as Big Data or not.
2. Variety: the category to which the data belongs. This is an essential fact that the data analysts need to know.
3. Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie ahead on the path of growth and development.
4. Veracity: the data arriving at Big Data systems comes from different sources and may contain noise, bias, or inconsistencies; Big Data systems therefore need to clean the data and maintain its provenance in order to reason about its trustworthiness.
Some authors add further characteristics, such as Complexity: data management can become a very complex process when it deals with large volumes of data that come from different sources and carry much undesired information. The data needs to be linked, connected, and correlated in order to grasp the information it is supposed to convey.

What are the benefits and uses of Big Data? The ability to process Big Data brings multiple benefits, such as:
1. Businesses can utilize outside intelligence when taking decisions.
2. Universities use data science in their research and also to enhance the study experience of their students.
3. Improved customer service: traditional customer feedback systems are being replaced by new systems designed with Big Data technologies.
4. Early identification of risks to the product or service, if any.
5. Better operational efficiency: Big Data technologies can be used to create a staging area or landing zone for new data before deciding which data should be moved to the data warehouse.

What are the facets of data? In Big Data you will come across many different types of data, and each of them tends to require different tools and techniques. The main categories are: structured data, unstructured data, natural language, machine-generated data, graph-based data, audio, video, and images, and streaming data.

The following are examples of Big Data:
1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Social media: Facebook, Twitter, etc.
3. YouTube.
4. And more.
The difference between traditional data and Big Data:
Volume: traditional data ranges from gigabytes to terabytes, a manageable volume; Big Data ranges from petabytes to zettabytes or exabytes, a huge volume that becomes unmanageable.
Data: traditional data is structured and stable, with known relationships; Big Data comes in many different types, is unstable, and has unknown relationships.
Generation: traditional data is generated per hour, per day, or less often, and from specific sources; Big Data is generated far more frequently, often per second, and from a wide variety of sources.
Architecture: traditional data is centralized and managed in a centralized form; Big Data is distributed and managed in a distributed form.
Integration: easy for traditional data; difficult for Big Data.
Processing: a normal system is capable of processing traditional data; a high-end system configuration is needed to process Big Data.
Data model: traditional data follows a strict, static schema; Big Data has a dynamic schema.
Source: traditional data may be financial data, organizational data, and the like; Big Data comes from social media, devices, sensors, video, images, audio, etc.
Storage: traditional data uses block storage; Big Data is stored in files or objects distributed over nodes.

Big Data architecture:
1. Data sources govern the Big Data architecture. They include all the sources from which the data extraction pipeline is built.
2. Data storage is the receiving end of the architecture. It receives data of varying formats from multiple data sources and stores it.
3. Real-time message ingestion: the architecture needs a mechanism that captures and stores real-time data, which is then consumed by stream processing consumers.
4. Batch processing: the architecture requires a batch processing system for filtering, aggregating, and processing data that is huge in size, for advanced analytics. Such a job involves reading the data from data storage, processing it, and writing the output to new files (a minimal batch-processing sketch follows the challenges list below).
5. Stream processing: there is a small difference between stream processing and real-time message ingestion. Stream processing handles all streaming data, which arrives in windows or streams, and then writes the results to the output.
6. Analytical data store: after processing the data, we need to bring it together in one place so that we can analyze the entire data set.
7. Analytics and reporting: after ingesting and processing data from varying sources, we require a tool for analyzing the data. Many data analytics and visualization tools can analyze the data and generate reports or dashboards, which companies use to make data-driven decisions.
8. Orchestration: moving data through these systems requires orchestration in some form of automation. Ingesting data, transforming it, moving it through batch and stream processes, loading it into an analytical data store, and then analyzing it to derive insights must form a repeatable workflow, so that we can continuously gain insights from our Big Data.

Challenges in designing a Big Data architecture (quality, scalability, security, tools, and cost):
1. Data quality is a challenge when working with multiple data sources.
2. The architecture must be designed in such a way that it can scale up when the need arises.
3. Data security is the most crucial part and the biggest challenge when dealing with Big Data. Hackers and fraudsters may try to add their own fake data or skim companies' data for sensitive information. Cybercriminals can easily mine company data if companies do not encrypt it, secure the perimeters, and anonymize the data to remove sensitive information.
4. There are many tools and technologies for Big Data analytics, each with its pros and cons, such as Apache Hadoop, Spark, Cassandra, Hive, etc. Choosing the right technology set is difficult.
5. A Big Data architecture entails considerable expense. During architecture design, the company must account for hardware costs, hiring costs, electricity costs, whether the needed framework is open source or not, and more.
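To make the batch processing and analytical data store steps more concrete, here is a minimal Python sketch of a batch job, as referenced in item 4 above. It reads raw CSV records from a hypothetical landing_zone directory, filters out bad rows, aggregates amounts per product, and writes a summary file that an analytical store could pick up. The directory names, column names, and aggregation logic are illustrative assumptions, not part of any specific product or reference architecture.

    # A minimal sketch of a batch job: read raw records from a landing zone,
    # filter and aggregate them, and write a summary for an analytical store.
    # All names (directories, columns) are hypothetical.
    import csv
    import glob
    import os
    from collections import defaultdict

    LANDING_ZONE = "landing_zone"        # hypothetical directory of raw CSV files
    ANALYTICS_STORE = "analytics_store"  # hypothetical output directory

    def run_batch_job():
        totals = defaultdict(float)      # aggregate: total amount per product
        for path in glob.glob(os.path.join(LANDING_ZONE, "*.csv")):
            with open(path, newline="") as f:
                for row in csv.DictReader(f):      # expects columns: product, amount
                    if not row.get("amount"):      # filtering step: skip bad records
                        continue
                    totals[row["product"]] += float(row["amount"])
        os.makedirs(ANALYTICS_STORE, exist_ok=True)
        out_path = os.path.join(ANALYTICS_STORE, "totals_by_product.csv")
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["product", "total_amount"])
            for product, total in sorted(totals.items()):
                writer.writerow([product, total])

    if __name__ == "__main__":
        run_batch_job()

In a full architecture, an orchestration layer would schedule a job like this to run repeatedly as new data lands, forming the repeatable workflow described in item 8.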
Big Data Management System: a Big Data management system uses a different software stack with the following layers:
1. Distributed storage systems: Big Data management relies on a distributed storage layer, in which data is typically stored in files or objects distributed over the nodes. This is one major difference from the software stack of current DBMSs, which relies on block storage. The distributed storage layer typically provides two solutions for storing data, as objects or as files, distributed over the cluster nodes. These two solutions are complementary, since they have different purposes and can be combined in order to support both high numbers of objects and large files. One of the most influential distributed file systems is the Google File System (GFS).
2. Big Data processing frameworks: an important class of Big Data applications requires data management without the overhead of full database management, and cloud services require scalability for applications that are easy to partition into a number of parallel but smaller tasks, the so-called embarrassingly parallelizable applications. For these cases, where scalability matters most, a parallel processing platform called MapReduce has been proposed; it is discussed below.
3. Stream data management: a data stream management system (DSMS) is a software system for managing continuous data streams. It offers flexible query processing, so that the information needed can be expressed using queries. Since most DSMSs are data-driven, a continuous query produces new results as new data arrives at the system. One of the biggest challenges for a DSMS is to handle potentially infinite data streams using a fixed amount of memory and no random access to the data. There are different approaches to limiting the amount of data processed in one pass, which can be divided into two classes: (1) compression techniques, which try to summarize the data, and (2) window techniques, which try to partition the data into finite parts (a small windowing sketch appears after this layer list). The following are some differences between a DBMS and a DSMS (DBMS property first, DSMS property second):
Persistent data relations vs. volatile data streams.
Random access vs. sequential access.
One-time queries vs. continuous queries.
Unlimited secondary storage vs. limited main memory.
Only the current state is relevant vs. consideration of the order of the input.
Relatively low update rate vs. potentially extremely high update rate.
Little or no real-time requirements vs. real-time requirements.
Assumes exact data vs. assumes outdated or inaccurate data.
Plannable query processing vs. variable data arrival and data characteristics.
4. Data analysis platforms: the simplicity of the graph model allows large volumes of data from many sources to be rapidly absorbed and connected. Big Data analytics systems should provide a platform that supports different analytics techniques, which can be adapted in ways that help solve a variety of challenging problems. The biggest advantage of using graphs is that they can be analyzed directly and used to analyze complex data sets, which is needed to derive quick insights from massive data sets. Clearly, the graph analysis model and the relational model have to be weighed against each other when we need to analyze Big Data.
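To illustrate the window techniques mentioned in the Stream Data Management layer above, here is a minimal Python sketch of a tumbling (fixed-size, non-overlapping) window that counts events per key. Only the current window is kept in memory, so a potentially infinite stream can be processed with a fixed amount of memory. The event format and window size are assumptions made for the example; this is not the API of any particular DSMS.

    # A minimal sketch of a tumbling-window count over an unbounded stream.
    # Only the counts of the current window are held in memory, which is how
    # a DSMS can process a potentially infinite stream with bounded memory.
    from collections import Counter

    def tumbling_window_counts(events, window_size=5):
        """events: iterable of (timestamp, key) pairs, assumed ordered by timestamp."""
        window_start = None
        counts = Counter()
        for timestamp, key in events:
            if window_start is None:
                window_start = timestamp
            # close finished windows before handling the current event
            while timestamp >= window_start + window_size:
                yield window_start, dict(counts)   # emit results for the closed window
                counts.clear()                     # free memory before the next window
                window_start += window_size
            counts[key] += 1
        if counts:
            yield window_start, dict(counts)       # flush the last, partial window

    # Example: count page views per user in 5-second windows.
    stream = [(1, "alice"), (2, "bob"), (4, "alice"), (6, "alice"), (9, "bob"), (12, "bob")]
    for start, per_key in tumbling_window_counts(stream, window_size=5):
        print(f"window [{start}, {start + 5}):", per_key)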
Hadoop: the Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. A wide variety of companies and organizations use Hadoop for both research and production. It provides a software framework for the distributed storage and processing of Big Data using the MapReduce model.

MapReduce: Hadoop uses a programming method called MapReduce to achieve parallelism. MapReduce is a programming model and an associated implementation for processing and generating Big Data with a parallel, distributed algorithm on a cluster. It is composed of a "Map" procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a "Reduce" method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). MapReduce processes stored data that may reside in either a file system (unstructured) or a database (structured). It can take advantage of data locality by processing the data near the place where it is stored, in order to minimize communication overhead.

MapReduce stages:
1. Map: each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of any redundant input data is processed.
2. Shuffle: worker nodes redistribute the data based on the output keys produced by the map function, so that all data belonging to one key ends up on the same worker node.
3. Reduce: worker nodes then process each group of output data, per key, in parallel.
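The three stages above can be simulated on a single machine in a few lines of Python, using the student-name example from the text: the map step emits (name, 1) pairs, the shuffle step groups the pairs by key, and the reduce step sums each group, yielding the frequency of each first name. This is only a sketch of the programming model, not Hadoop's actual Java API; on a real cluster each stage runs in parallel across many worker nodes, and the student names used here are made up.

    # A single-machine sketch of the MapReduce model using the student-name
    # example: map emits (name, 1) pairs, shuffle groups the pairs by key,
    # and reduce sums each group, giving the frequency of every first name.
    from collections import defaultdict

    def map_phase(records):
        # Map: emit a (key, value) pair for every input record.
        for student in records:
            yield (student, 1)

    def shuffle_phase(pairs):
        # Shuffle: group all values that share the same key
        # (on a cluster, this is where data moves between worker nodes).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: apply a summary operation (here, a sum) to each group.
        return {key: sum(values) for key, values in groups.items()}

    students = ["Ali", "Sara", "Ali", "Omar", "Sara", "Ali"]
    print(reduce_phase(shuffle_phase(map_phase(students))))
    # {'Ali': 3, 'Sara': 2, 'Omar': 1}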
The commonly cited advantages of this kind of processing framework are:
1. Flexibility: the Hadoop MapReduce programming model offers the flexibility to process structured or unstructured data, so business organizations can make use of, and operate on, different types of data.
2. Scalability: Hadoop is a highly scalable platform, largely because of its ability to store and distribute large data sets across many servers.
3. Cost-effectiveness: previously, businesses were forced to downsize their data and classify it based on assumptions about which data might be valuable to the organization, discarding the raw data. The Hadoop scale-out architecture with MapReduce programming removes this need.
4. Fault tolerance: Hadoop MapReduce can quickly recognize faults and apply a quick fix for automatic recovery.
5. Parallel processing: the programming model divides the work so that independent tasks execute in parallel, which lets the program finish in much less time.
6. A simple programming model: MapReduce is based on a very simple programming model that allows programmers to develop programs that handle many tasks with ease and efficiency.
7. Speed: Hadoop MapReduce processes large volumes of unstructured or semi-structured data in less time.
8. Security and authentication: if an outsider gains access to all of an organization's data and can manipulate multiple petabytes of it, this can do great harm to the organization's business dealings and operations. This risk is addressed by using the MapReduce programming model together with HDFS and HBase, which provide strong security and allow only approved users to operate on the data stored in the system.