Chapter 01 Introduction
Introduction to Data Science https://sherbold.github.io/intro-to-data-science

Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary

What is “Big Data”?
Is this really about size?

What is Data?
• The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
• With this in place, we can define Big Data.

What is Big Data?
• Big Data is a collection of data that is huge in volume and keeps growing exponentially over time. Its size and complexity are so large that traditional data management tools cannot store or process it efficiently. In other words, Big Data is still data, only at a very large scale.

What is an Example of Big Data?
• The following are some examples of Big Data:
• The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media
• Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day.
This data is mainly generated by photo and video uploads, message exchanges, comments, etc.

Types Of Big Data
• Big Data comes in the following types:
1. Structured
2. Unstructured
3. Semi-structured

Structured
• Any data that can be stored, accessed, and processed in a fixed format is termed “structured” data. Over time, computer science has developed effective techniques for working with this kind of data (where the format is well known in advance) and for deriving value from it. However, problems arise when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
• Did you know? 10^21 bytes equal 1 zettabyte; in other words, one billion terabytes make up a zettabyte.

Examples Of Structured Data
Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000

Unstructured
• Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it is processed to derive value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Nowadays, organizations have a wealth of data available, but unfortunately they often do not know how to derive value from it, because the data is in its raw or unstructured form.

Examples Of Unstructured Data
• The output returned by “Google Search”

Semi-structured
• Semi-structured data can contain both forms of data. It looks structured in form, but it is not actually defined by, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
• Examples Of Semi-structured Data: personal data stored in an XML file

Characteristics Of Big Data
• Big Data can be described by the following characteristics:
• Volume
• Variety
• Velocity

Volume
• Volume: The name Big Data itself refers to an enormous size. The size of data plays a crucial role in determining the value that can be derived from it. Whether a particular data set can be considered Big Data at all also depends on its volume. Hence, volume is one characteristic that needs to be considered when dealing with Big Data solutions.

The 3 Vs: Volume
• Scale of the data must be “big”
• No clear definition
• “that demand […] innovative forms of information processing” (Gartner)
[Figure: Data center storage worldwide. © Statista 2018]

Variety
• Variety: The next aspect of Big Data is its variety.
• Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

The 3 Vs: Variety
• Diversity in data types and data sources
• Structured
  • Data with defined types and structure
  • Example: comma-separated values
• Semi-Structured
  • Textual data with parseable pattern
  • Example: XML files with schema
• Quasi-Structured
  • Textual data with erratic formats that can be formatted with effort
  • Example: clickstream data
• Unstructured
  • Data that has no inherent structure, often with multiple formats
  • Example: web sites, videos

Velocity
• Velocity: The term “velocity” refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential in the data.
• Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

The 3 Vs: Velocity
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time

Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary

Defining Data Science
• Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis

What is data science?
• Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.

A data science project undergoes the following stages:
• Data ingestion: The lifecycle begins with data collection. Raw structured and unstructured data is gathered from all relevant sources using a variety of methods, which can include manual entry, web scraping, and real-time streaming from systems and devices. Data sources can include structured data, such as customer data, along with unstructured data like log files, video, audio, pictures, the Internet of Things (IoT), social media, and more.

A data science project undergoes the following stages:
• Data storage and data processing: Since data can have different formats and structures, companies need to consider different storage systems based on the type of data that needs to be captured. Data management teams help to set standards for data storage and structure, which facilitates workflows around analytics, machine learning, and deep learning models. This stage includes cleaning, deduplicating, transforming, and combining the data using ETL (extract, transform, load) jobs or other data integration technologies. This data preparation is essential for ensuring data quality before the data is loaded into a data warehouse, data lake, or other repository.
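The cleaning, deduplicating, transforming, and combining steps described above can be sketched as a small ETL job. This is a minimal illustration in Python using only the standard library, not a prescribed implementation. The sample records (loosely based on the structured-data table, with defects added on purpose), the field names, the employees table, and the helper functions extract, transform, load, and run_etl are all assumptions made for this example. An in-memory SQLite database stands in for the data warehouse or data lake.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical inputs: the same kind of record arriving in two formats.
# The CSV contains stray whitespace and a duplicate row on purpose.
CSV_TEXT = """employee_id,name,department,salary
2365, Rajesh Kulkarni ,Finance,650000
3398,Pratibha Joshi,Admin,650000
3398,Pratibha Joshi,Admin,650000
"""

# The XML contains inconsistent casing and a missing salary on purpose.
XML_TEXT = """<employees>
  <employee id="7465"><name>Shushil Roy</name><department>admin</department><salary>500000</salary></employee>
  <employee id="7500"><name>Shubhojit Das</name><department>Finance</department><salary></salary></employee>
</employees>"""


def extract(csv_text, xml_text):
    """Extract: read raw records from both sources into plain dicts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for node in ET.fromstring(xml_text).iter("employee"):
        rows.append({
            "employee_id": node.get("id"),
            "name": node.findtext("name", default=""),
            "department": node.findtext("department", default=""),
            "salary": node.findtext("salary", default=""),
        })
    return rows


def transform(rows):
    """Transform: clean values, drop incomplete rows, deduplicate by ID."""
    cleaned = {}
    for row in rows:
        salary = (row["salary"] or "").strip()
        if not salary.isdigit():
            continue  # skip records with a missing or malformed salary
        record = (
            int(row["employee_id"]),
            row["name"].strip(),
            row["department"].strip().title(),
            int(salary),
        )
        cleaned.setdefault(record[0], record)  # keep the first record per ID
    return list(cleaned.values())


def load(records):
    """Load: write the prepared records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, "
        "name TEXT, department TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def run_etl():
    """Run extract, transform, and load, then read the stored rows back."""
    conn = load(transform(extract(CSV_TEXT, XML_TEXT)))
    return conn.execute(
        "SELECT employee_id, name, department, salary "
        "FROM employees ORDER BY employee_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

In this sketch, the duplicate CSV row and the XML record without a salary never reach the table, which is the point of preparing data before loading it. A real pipeline would more likely log or quarantine rejected records than drop them silently.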
A data science project undergoes the following stages:
• Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases, patterns, ranges, and distributions of values within the data. This exploration drives hypothesis generation for A/B testing. It also allows analysts to determine the data’s relevance for use in modeling efforts for predictive analytics, machine learning, and/or deep learning. Depending on a model’s accuracy, organizations can come to rely on these insights for business decision making, allowing them to drive more scalability.

Data science versus business intelligence
• It may be easy to confuse the terms “data science” and “business intelligence” (BI) because both relate to an organization’s data and the analysis of that data, but they differ in focus.
• Business intelligence (BI) is typically an umbrella term for the technology that enables data preparation, data mining, data management, and data visualization. Business intelligence tools and processes allow end users to identify actionable information from raw data, facilitating data-driven decision making within organizations across various industries.

Data science versus business intelligence
• While data science tools overlap with BI in much of this regard, business intelligence focuses more on data from the past, and the insights from BI tools are more descriptive in nature. BI uses data to understand what happened before in order to inform a course of action. BI is geared toward static (unchanging) data that is usually structured.
While data science also uses descriptive data, it typically uses that data to determine predictive variables, which are then used to categorize data or to make forecasts.

Data Science vs. Business Intelligence
• Business Intelligence (Gartner IT Glossary)
  • “[…] best practices that enable access to and analysis of information to improve and optimize decisions and performance.”

                  Business Intelligence                       Data Science
Techniques        Dashboards, alerts, queries                 Optimization, predictive modelling, forecasting
Data Types        Structured, data warehouses                 Any kind, often unstructured
Common questions  What happened…? How much did…? When did…?   What if…? What will…? How can we…?

[Figure: Depth of Insights (low to high) plotted against Time (past, present, future) for Business Intelligence and Data Science]

Mathematical Aspects
• Computational Geometry
• Optimization
• Scientific Computing
• Stochastics
• Machine Learning

Computer Science Aspects
• Data Structures and Algorithms
• Software Engineering
• Databases
• Artificial Intelligence
• Distributed Computing
• Machine Learning

Statistical Aspects
• Linear Models
• Statistical Tests
• Time Series Analysis
• Inference
• Machine Learning

Applications
• Intelligent Systems
• Robotics
• Marketing
• Medicine
• Autonomous Driving
• Social Networks

More Data, More Opportunities
[Figure: volume of information (small to large; terabytes, petabytes, exabytes) over time]
• 1990s: Relational Databases & Data Warehouses
• 2000s: Content Management
• 2010s: Key-Value Storages & Unstructured Data

Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary

What are Data Scientists?
• Not computer scientists
  • But should know about databases, data structures, algorithms, etc.
• Not mathematicians
  • But should know about optimization, stochastics, etc.
• Not statisticians
  • But should know about regression, statistical tests, etc.
• Not domain experts
  • But must work together with them

Skills of Data Scientists
• Quantitative
  • Maths
  • Algorithms
  • Statistics
• Technical
  • Programming
  • Infrastructures
• Collaborative
  • Teamwork
  • Communication skills
• Skeptical
  • Create hypotheses, but be skeptical about them
• A bit of everything … but actually as much as possible of everything

Tips For Becoming a Data Scientist
• There is no single pathway to becoming a data scientist. Everyone follows a different trajectory, but here are some tips that can make the journey easier.

Different types of Data Scientists
• According to Microsoft Research:
• Polymath
  • “Do it all”
• Data Evangelist
  • Data analysis, disseminating and acting on insights
• Data Preparer
  • Querying existing data, preparing data for analysis
• Data Shapers
  • Analyzing and preparing data
• Data Analyzer
  • Analyzing data
• Platform Builder
  • Collect data and create infrastructures
• Moonlighters (50%/20%)
  • “Spare time” data scientists
• Insight Actors
  • Use the outcome and act on insights
Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)

Outline
• Introduction to Big Data
• Data Science and Business Intelligence
• The Skillset of Data Scientists
• Summary

Summary
• Big data has a high volume, velocity, and variety
• Different data structures
  • Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
  • Maths, computer science, statistics, applications
• Data scientists require a diverse skillset

END
Thank you for listening.