Overview of Big Data
Module 1

Evaluation Criteria - Theory
Criteria                    Marks
Mid Marks (Best of Three)   30M
Assignment                  5M
Quiz                        5M
Total                       40M

Evaluation Criteria - Lab
Criteria                Marks
Continuous Evaluation   40M
Lab Exam 1              15M
Lab Exam 2              15M
Coursera                20M
Case Study              10M
Total                   100M

Coursera Course Link
Introduction to Big data with Spark and Hadoop, offered by IBM
Link: https://www.coursera.org/learn/introduction-tobig-data-with-spark-hadoop

Syllabus – Module 1
Getting an overview of Big Data: Big Data definition, History of Data Management, Structuring Big Data, Elements of Big Data, Big Data Analytics.
Exploring use of Big Data in Business Context: Use of Big Data in Social Networking, Use of Big Data in preventing Fraudulent Activities in Insurance Sector & in Retail Industry.

Why Big Data?
• Soaring Demand for Analytics Professionals
• Salary Aspects
• Big Data Analytics: A Top Priority in a lot of Organizations
• Big Data Analytics is Used Everywhere!

Soaring Demand for Analytics Professionals

Salary Aspects

Big Data – Job Titles

Big Data – Required skills

Big Data/Analytics Jobs (Toronto)
• Banks – RBC, TD, CIBC, Scotiabank, AMEX, CapitalOne, ING Direct
• Telecommunications – Rogers, Telus, Bell, etc.
• Technology – BlackBerry, Huawei, CGI
• Manufacture/Services – GM, Canada Post, Workopolis
• Insurance – SunLife, Manulife
• Web/Mobile/Startup – Google, Mozilla
• Digital Media/Agencies – Globe and Mail, Kobo
• Consulting – Accenture, IBM, Deloitte, SAS
• Retail/e-commerce – Amazon, HR, Hudson Bay, Sears, Shoppers, Canadian Tire, Sobeys
• Pharmaceutical/Healthcare – Hospitals, Clinical Research Companies etc.

Job Market
• Where are big data jobs?
• North America
  - Silicon Valley, Seattle, NYC, Toronto
• India/China

Big data jobs around the nation: http://www.tableausoftware.com/public/gallery/big-data-jobs
Big Data Salary: http://goo.gl/4et998
O'Reilly Media Data Science Salary Survey: http://www.oreilly.com/data/free/files/stratasurvey.pdf
KDnuggets 2014 Analytics/Data Science Salary Poll: http://goo.gl/VhO9IW

Why is Big Data important now?
'Big Data' has many valuable applications:
• Product recommendation
• Prediction
• Market Analysis
• Fraud detection
And many, many more ...
Data must be processed to glean insights and derive value from it.

Big Data Made Possible
Hardware: big clusters of commodity machines at lower cost
• Faster processors
• Cheaper memory
• Bigger hard drive space
• Faster network bandwidth
Software: algorithms that allow parallel computing (map-reduce)

What is Big Data?
Think of the following:
• Every second, there are around 8,22 tweets on Twitter.
• Every minute, nearly 510 comments are posted, 293,000 statuses are updated and 136,000 photos are uploaded on Facebook.
• Every hour, Walmart handles more than 1 million customer transactions.
• Every day, customers make around 11.5 million payments by using PayPal.
- In the digital world, data is increasing rapidly along with the fast-growing use of the internet and sensors.
- The sheer volume, variety, velocity and veracity of such data is signified by the term 'Big Data'.

What is Big Data?
• Big data is structured, unstructured and semi-structured in nature.
• It is difficult for computing systems to handle because of its high speed and volume.
• Traditional data management, warehousing and analysis systems fail to keep up with the high speed of the data.
• Hadoop by Apache is widely used for storing and managing Big data.
• According to IBM, every day we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
• Data such as sensor data, climate data, GPS data and bank data, to name a few, is Big data.
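
To put the IBM figure in perspective, the byte count can be converted into larger storage units. The following is a minimal sketch, assuming the 1024-based convention (1 KB = 1024 bytes) used in this module; the function name `humanize_bytes` and the `UNITS` list are illustrative and not part of the course material.

```python
# Unit names for successive powers of 1024 (1 KB = 1024 bytes).
# The names and the helper below are illustrative assumptions.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize_bytes(n_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    unit_index = 0
    # Divide by 1024 until the value fits the current unit
    # or the largest known unit is reached.
    while value >= 1024 and unit_index < len(UNITS) - 1:
        value /= 1024
        unit_index += 1
    return f"{value:.2f} {UNITS[unit_index]}"


daily_bytes = 2.5e18  # 2.5 quintillion bytes per day, the figure attributed to IBM
print(humanize_bytes(daily_bytes))  # 2.17 EB
```

Under this convention, 2.5 quintillion bytes is a little over 2 exabytes per day.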

Big Data - Definition
• "Big data" is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• In simple words, Big data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.

Data Expansion – Day by Day
• One of the sources of data production: smart electronic devices.
• Amount of data: 175 zettabytes by 2025.
• Total volume of data: doubles every two years.

Tabular Representation of various Memory Sizes
NAME             EQUAL TO            SIZE (IN BYTES)
Bit              1 bit               1/8
Nibble           4 bits              1/2 (rare)
Byte             8 bits              1
Kilobyte (KB)    1,024 bytes         1,024
Megabyte (MB)    1,024 kilobytes     1,048,576
Gigabyte (GB)    1,024 megabytes     1,073,741,824
Terabyte (TB)    1,024 gigabytes     1,099,511,627,776
Petabyte (PB)    1,024 terabytes     1,125,899,906,842,624
Exabyte (EB)     1,024 petabytes     1,152,921,504,606,846,976
Zettabyte (ZB)   1,024 exabytes      1,180,591,620,717,411,303,424
Yottabyte (YB)   1,024 zettabytes    1,208,925,819,614,629,174,706,176

In simple words, various memory sizes

Sources of Big Data
• Social media
• Sensors placed in various cities
• Customer satisfaction feedback
• IoT appliances
• E-Commerce
• Global Positioning System (GPS)

Sources of Big Data
Social Media
• WhatsApp, Facebook, Instagram, Twitter, YouTube etc.
• Each activity (uploading a photo or video, making a comment, sending a message, liking a post etc.) creates data.
Sensors
• Sensors in a city gather temperature, humidity etc.
• Cameras beside roads gather information.
• Security cameras in airports and banks create a lot of data.
Customer Satisfaction feedback
• Amazon, Flipkart, FirstCry, Licious, Swiggy, Blinkit, Zepto etc. gather customer feedback on product quality and delivery time. This creates a lot of data.

Sources of Big Data
IoT Appliance
• Electronic devices connected to the internet create data for their smart functionality. Example: Samsung SmartThings.
E-Commerce
• Payments through credit card, debit card, pay later, or any other electronic means are recorded as data.
Global Positioning System (GPS)
• Vehicle movement (directions, traffic congestion) creates a lot of data on vehicle position and movement.

Features of Big data
Big Data:
• Is a new data challenge that requires leveraging existing systems differently
• Is classified in terms of the 4 V's
  - Volume
  - Variety
  - Velocity
  - Veracity
• Is usually unstructured and qualitative in nature

Real world examples – Big data
• Social media analytics: consumer product companies and retail organizations observe data on social media websites to analyze customer behaviour, preferences etc.
• Insurance companies use Big Data analytics (BDA) to see which home insurance applications can be processed immediately and which ones need a validating in-person visit from an agent.
• Hospitals analyse medical data and patient records to predict which patients are likely to be readmitted within a few months of discharge.
• Relying on social networks and analytics, companies gather volumes of data from the web to help musicians and music companies better understand their audiences.

Types and Sources of data
Type: Social Data
  Description: Information collected from various social networking sites and online portals
  Source: Facebook, Twitter and LinkedIn
Type: Machine Data
  Description: Information generated from RFID chips, bar code scanners and sensors
  Source: RFID chip readings, Global Positioning System (GPS) results
Type: Transactional Data
  Description: Information generated from online shopping sites, retailers and Business to Business (B2B) transactions
  Source: Retail websites like eBay and Amazon

Caselet

History of Data Management – Evolution of Big Data
• Big data is the new term for the evolution of data, driven by the velocity, variety and volume of data.
• Velocity implies the speed with which data flows in an organization.
• Variety refers to the varied forms of data, such as structured, semi-structured or unstructured.
• Volume defines the amount or quantity of data an organization has to deal with.

Challenges faced while handling the data over the past few decades
• In the early 60's, technology witnessed problems with velocity. This need inspired the evolution of databases.
• In the 90's, technology witnessed issues with variety (emails, documents, videos), leading to the emergence of non-SQL (NoSQL) stores.
• Today, technology is facing issues related to huge volume, leading to new storage and processing solutions.

Structuring Big Data
• In simple terms, structuring means arranging the available data so that it becomes easy to study, analyse, and derive conclusions from it.
• Information processing systems can analyse what you searched for, what you looked at, and how long you remained on a particular page or website.
• When a user regularly visits or purchases from Amazon, each time he/she logs in, the system can present a recommended list of products that may interest the user on the basis of his/her purchases or searches. This is the power of Big Data Analytics.

Types of Data
• Data that comes from multiple sources, such as databases, ERP systems, weblogs, chat history and GPS maps, varies in its format.
• Data is obtained primarily from the following types of sources:
  (a) Internal sources, such as organizational data
  (b) External sources, such as social data

Types of Data
Data Source: Internal
  Definition: Provides structured data that originates within the enterprise and helps run the business
  Examples:
  • Customer Relationship Management
  • Enterprise Resource Planning
  • Customer details
  • Products and sales data
  Application: This data is used to support the daily business operations of an organization
Data Source: External
  Definition: Provides unstructured data that originates from the external environment of an organization
  Examples:
  • Business partners
  • Internet
  • Market research organizations
  Application: This data is analyzed to understand entities that are mostly external to the organization, such as customers, competitors, market and environment.

Types of Data
• Big data comprises
  - Structured data
  - Unstructured data
  - Semi-structured data

Structured data
• Is organized data in a predefined format
• Is stored in tabular form
• Is the data that resides in fixed fields within a record or file
• Is formatted data that has entities and their attributes mapped
• Is used to query and report against predetermined data types
• SQL is used for managing and querying this data
  - It represents only 5 to 10% of all data
• When data grows beyond the capacity of an RDBMS, it can be stored and analyzed in data warehouses, but only up to a certain limit

Example – Sample of Structured data
Customer ID   Name     Product ID   City        State
123           Jack     4689         Graz        Styria
321           Sandy    5688         Wolfsberg   Carinthia
459           Robert   459          Enns        Upper Austria

Unstructured Data
• Lacks structure
• About 85% of total data is unstructured.
Ex:
• e-mail messages,
• word processing documents,
• videos, photos, audio files, presentations,
• web pages,
• other kinds of business documents.
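
In contrast to the unstructured examples above, structured data sits in fixed fields and can be queried with SQL. The following is a minimal sketch using Python's built-in `sqlite3` module and the sample customer rows from the structured-data example; the table name `customers`, the column names and the helper `customers_in_state` are assumptions made for illustration.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, name TEXT, product_id INTEGER, city TEXT, state TEXT)"
)

# Rows taken from the sample of structured data in this module.
rows = [
    (123, "Jack", 4689, "Graz", "Styria"),
    (321, "Sandy", 5688, "Wolfsberg", "Carinthia"),
    (459, "Robert", 459, "Enns", "Upper Austria"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)


def customers_in_state(state: str) -> list:
    """Return (name, city) pairs for customers in the given state."""
    cur = conn.execute(
        "SELECT name, city FROM customers WHERE state = ? ORDER BY name",
        (state,),
    )
    return cur.fetchall()


print(customers_in_state("Styria"))  # [('Jack', 'Graz')]
```

Because every record has the same predefined fields, a query can filter on `state` directly. An e-mail body or a video offers no such fixed field to query against.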

Semi Structured Data
Also known as schema-less or self-describing data, semi-structured data is a form of structured data that contains tags to separate elements and to generate hierarchies of records and fields within the data.
Sl No   Name                                       E-Mail
1       Sam                                        smj@xyz.com
2       First Name : David, Second Name : Brown    davidb@xyz.com

Elements of Big Data
• According to Gartner, data is growing at the rate of 59% every year. This growth can be depicted in terms of the following four Vs:
  (i) Volume
  (ii) Velocity
  (iii) Variety
  (iv) Veracity

Department of CSE, GIT. 8 May 2023. Course Code: EID449. Course Title: BIG DATA ANALYTICS

Volume
• Volume is the amount of data generated by organizations or individuals.
• At present, the volume of data is measured in exabytes.
• In coming years, the volume of data will be measured in zettabytes.
• Organizations are doing their best to handle this ever-increasing volume of data.
Example:
- Every minute, over 571 new websites are created.
- A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.

Velocity
• Velocity describes the rate at which data is generated, captured and shared.
• Information processing systems face problems with data that keeps adding up but cannot be processed quickly.
Example: eBay analyses around 5 million transactions per day in real time to detect and prevent frauds arising from the use of PayPal.
Sources of high velocity data:
- IT devices, including routers, firewalls, switches etc., generate valuable data.
- Social media, including Facebook posts and tweets, creates a huge amount of data that must be analyzed quickly, as its value degrades with time.

Variety
• Variety refers to structured, unstructured and semi-structured data that is gathered from multiple sources and comes in different formats, such as images, text, videos etc.
• While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more.

Veracity
• Veracity refers to the uncertainty of data: the available data can sometimes be messy, and its quality and accuracy are difficult to control.
Example: Data in bulk could create confusion, whereas too little data could convey only partial or incomplete information.

In short, Simple 4 V's

Big Data Analytics
• Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences.
• Big Data analytics provides various advantages: it can be used for better decision making and for preventing fraudulent activities, among other things.
• There are four main types of business/data analytics:
  (a) Descriptive Analytics
  (b) Diagnostics Analytics
  (c) Predictive Analytics
  (d) Prescriptive Analytics

Big Data Analytics - Descriptive analytics – "What happened in the business?"
• Descriptive analytics analyses a database to provide information on the trends of past or current business events, which can help managers, planners and leaders develop a roadmap for future actions.
Example: During the pandemic, a leading pharmaceuticals company conducted data analysis on its offices and research labs. Descriptive analytics helped it identify unutilized spaces and departments, which were then consolidated, saving the company millions of dollars.

Big Data Analytics - Diagnostics analytics
• Diagnostics analytics helps companies understand why a problem occurred. Big data technologies and tools allow users to mine and recover data that helps dissect an issue and prevent it from happening in the future.
• In short, it identifies the root cause of a problem and the underlying reason for failures.
Example: A clothing company's sales have decreased even though customers continue to add items to their shopping carts.
Diagnostics analytics helped the company understand that the payment page had not been working properly for a few weeks.

Big Data Analytics - Predictive analytics – "What could happen?"
• Predictive analytics means understanding and predicting the future by using statistical models and different forecasting techniques. Here, we use statistics, data mining techniques and machine learning to analyze the future.
Example: In the manufacturing sector, companies can use algorithms based on historical data to predict if or when a piece of equipment will malfunction or break down.

Big Data Analytics - Prescriptive analytics – "What should we do?"
• Prescriptive analytics builds on complex data from descriptive and predictive analyses.
• Using optimization techniques, it determines the best alternative to minimize or maximize some objective, in marketing and many other areas.
Example: If we have to find the best way of shipping goods from a factory to a destination so as to minimize costs, we will use prescriptive analytics.

Questions
• List the four elements of Big Data.
• As an HR manager of a company providing Big Data solutions to clients, what characteristics would you look for when recruiting a potential candidate for the position of data analyst?
• You are planning the marketing strategy for a new product in your company. Identify and list some limitations of structured data related to the work.

Exploring the Use of Big Data in Business Context
• Use of Big Data in Social Networking
• Use of Big Data in preventing fraudulent activities
• Use of Big Data in preventing fraudulent activities in Insurance Sector
• Use of Big Data in Retail Industry

Exploring the Use of Big Data in Business Context
• An organization generally has to spend huge amounts to collect data and information.
• For example, the cost of collecting information through customer surveys keeps escalating as an organization collects more information. The continuously increasing cost decreases the value of the collected information.
• In other words, collecting and maintaining a pool of data and information is just a waste of resources unless logical conclusions and business insights can be derived from it.
• This is where Big Data analytics comes into the picture.

Use of Big Data in Social Networking
• Social network data refers to the data generated by people socializing on social media.
• Some popular social networking sites are Twitter, Facebook, LinkedIn etc.
• On a social networking site, different people constantly add and update comments, statuses, likes, preferences etc. All these activities generate large amounts of data.
• This data can be segregated on the basis of age group, location and gender for the purpose of analysis.

Use of Big Data in Social Networking
• Social Networking Analysis (SNA): analysis performed on the data from social media.
Example: Mobile Network Operator (MNO)
• The data captured by an MNO each day in the form of phone calls, text messages and other record details of all its customers is huge in volume.
• The company should study the data of the people whom the customer called and also of the people who called back. Such a network is called a social network.

Use of Big Data in Social Networking
• The data analysis process can go deeper and deeper within the network to get a complete picture of a social network.
• As the analysis goes deeper, the volume of data to be analyzed also becomes massive.
• The same structure of SNA is followed when it comes to social networking sites.

Use of Big Data in Social Networking
• Following are the areas in which decision-making processes are influenced by social network data:
  (a) Business Intelligence
  (b) Marketing
  (c) Product design and development

Use of Big Data in Social Networking – Business Intelligence (BI)
• BI is a data analysis process that converts a raw dataset into meaningful information.
• It allows a company to collect, store, access and analyse data to add value to decision making.
• The data generated from different social media is analyzed using Social Customer Relationship Management (CRM), which is used to describe the data.

Use of Big Data in Social Networking – Business Intelligence (BI)
Example:
• Consider a mobile service provider that has a low-value customer.
• If the low-value customer is not satisfied with the services and wants to leave, the company generally has no problem letting the customer go, as he provides low revenue.
• With the help of SNA, the organization can identify that some connections in the customer's network make a large number of calls and text messages and have a large network of friends.
• With such an analysis, the organization might make an altogether different decision and start valuing the customer more: the influence of a customer is very important to the organization.

Use of Big Data in Social Networking – Marketing
• Today, customer preferences have changed due to busy schedules: customers have no time to read newspapers, watch TV commercials or go through marketing emails.
• Customers can now make their preferences clear and select the marketing messages they wish to receive.
• In today's world, marketers aim to deliver what consumers want by using interactive communication across digital channels such as email, mobile, social and the Web, which in turn generates social data.

Product Design and Development
• By listening to customers' needs, by understanding where the gap in the offering is, and so on, organizations can make the right decisions about the direction of their product development and offerings.
Example: YouTube – rate a brand on a scale of 1-10, know a brand etc.
• Once the brand rating crosses 300 or more, the application sends out a report on how customers feel about the product, along with a detailed analysis of the brand's reputation.
• In this way, social networks can help organizations improve product development by making sure of customer needs.
• Sentiment analysis analyses human emotions, attitudes and views across popular social networks.

Use of Big Data in Preventing Fraudulent Activities
• Most common types of financial fraud:
  (a) Credit card fraud
  (b) Exchange or return policy fraud – Amazon/Flipkart
  (c) Personal information fraud – obtaining the login details of a customer, purchasing a product online, and then changing the delivery address to a different location. The actual customer keeps calling the retailer to refund the amount, as he has not made the transaction.

Preventing Fraud using Big Data Analytics
Analyzing Big Data allows organizations to:
• Keep track of and process huge volumes of data.
• Differentiate between real and fraudulent entries.
• Identify new methods of fraud and add them to the list of fraud-prevention checks.
• Verify whether a product has actually been delivered to a valid recipient.
• Determine the location of the customer and the time when the product was actually delivered.

Use of Big Data in Detecting Fraudulent Activities in Insurance Sector
• An insurance company wants to improve its ability to take decisions while processing claims.
• It decides to implement a Big Data analytical platform, which will use data from social media to provide a real-time view of the case in hand.
• The information obtained will enable the insurance agent to diagnose patterns in the customer's claims, behavior and other issues.
Example: In some cases, social media could also provide strong triggers to identify fraud. A customer might indicate that his car was destroyed in a flood, but the documentation from the social media feed may show that the car was actually in another city on the day the flood occurred.

Fraud Detection
• Fraudulent claims were identified by insurance companies by using statistical models.
• Social Networking Analysis (SNA) is an innovative way to identify and detect frauds.
• An SNA tool uses a mix of analytical methods, including statistical methods, pattern analysis and link analysis, to identify any kind of relationship or pattern within large amounts of data collected from different sources.
• When link analysis is used in fraud detection, one looks for clusters of data and how these clusters are linked to other data clusters.

Fraud detection using SNA method

Social Customer Relationship Management (CRM)
• Social CRM enables effective fraud detection in the insurance sector.
• Social CRM is a process; it is not a platform or a technology.
• This makes it critical for insurance companies to link social media sites to their CRM systems.
• If social media is integrated within an organization, it provides high transparency on various customer-related issues.

Social CRM Process
• Data is collected from the organization's existing CRM and from different social media platforms.
• Reference data obtained from the social media platforms and the data stored in the CRM are loaded into the claim management system, which compares and analyses the data and provides results.
• The response received from the claim management system is then investigated.

Use of Big Data in Retail Industry
• Big data has huge potential for the retail industry, considering the immense number of transactions and their correlation.
• A single retail location has a small customer database, and it is easy to answer simple questions like:
  (a) How many basic tees did we sell today?
  (b) What time of the year do we sell the most leggings?
  (c) What else has the customer bought, and what kind of coupons can we send to the customer?
• However, with millions of transactions spread across multiple locations, it becomes very difficult to answer such questions with traditional tools.
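
Questions (a) and (b) above amount to filtering and aggregating transaction records from every location. The following is a minimal single-machine sketch of that logic, assuming a hypothetical record layout of (store, sale date, product, quantity); the sample records and the helper names `units_sold_on` and `best_month_for` are invented for illustration. At Big Data scale the same aggregation would be spread over a cluster, for example with a map-reduce style framework, rather than run in one process.

```python
from collections import Counter
from datetime import date

# Hypothetical transaction records: (store, sale_date, product, quantity).
transactions = [
    ("store_a", date(2023, 5, 8), "basic tee", 3),
    ("store_b", date(2023, 5, 8), "basic tee", 5),
    ("store_a", date(2023, 5, 8), "leggings", 2),
    ("store_b", date(2023, 1, 15), "leggings", 7),
    ("store_c", date(2023, 1, 20), "leggings", 4),
]


def units_sold_on(records, product, day):
    """Total units of a product sold on one day, summed across all stores."""
    return sum(qty for _, d, p, qty in records if p == product and d == day)


def best_month_for(records, product):
    """Month number (1-12) with the highest total units sold, or None if no sales."""
    per_month = Counter()
    for _, d, p, qty in records:
        if p == product:
            per_month[d.month] += qty
    if not per_month:
        return None
    month, _ = per_month.most_common(1)[0]
    return month


print(units_sold_on(transactions, "basic tee", date(2023, 5, 8)))  # 8
print(best_month_for(transactions, "leggings"))  # 1
```

Each record is handled independently and the per-store results are simply added together, which is why this kind of question suits parallel processing across many machines.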

Use of Big Data in Detecting Fraudulent Activities in Retail Sector
Retail fraud: an illegal transaction that a fraudster performs using stolen credit card details or loopholes in the order placement and payment systems and company policies. As technology grew, so did the fraudsters' sophistication in executing frauds online.
Types of Retail fraud:
  (a) Transaction fraud
  (b) Return fraud
  (c) Chargeback guarantee fraud

Types of Retail fraud
• Transaction fraud
  Also called card-not-present (CNP) fraud, where the fraudster uses a stolen credit card for online purchases. The company loses money when the original owner of the card demands a chargeback.
• Return fraud
  Example - e-commerce industry
• Chargeback guarantee fraud
  Many online retail fraud prevention solutions guarantee that they will block all transaction frauds and friendly frauds and even pay the admin fee out of their own pocket. The problem arises when the solution blocks even legitimate customers. This is called a false positive, which not only damages the company's reputation but also results in loss of revenue.

Use of Big Data in Detecting Fraudulent Activities in Retail Sector - Fraud Detection in Real time
• Big Data helps to detect frauds in real time.
Example:
  (a) In an online transaction, Big Data analysis compares the incoming IP address with the geotag received from the customer's smartphone apps. A valid match between the two confirms the authenticity of the transaction.
  (b) It also examines the entire historical data to track suspicious patterns in the customer's orders. Retailers perform Big Data analysis in real time to know the actual time at which the product was delivered. Costly products may have sensors attached to transmit their location information, thereby preventing frauds.

Questions
• Discuss some areas in which decision-making processes are influenced by social network data.
• List some common types of financial frauds prevalent in the current business scenario.
• In what ways does analyzing Big Data help organizations prevent fraud?
• List some methods used for verification of credit cards.
• List the steps that SNA follows to detect fraud.
• What is Social Customer Relationship Management (CRM)?