UNIVERSITY INSTITUTE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGG.
Introduction To Data Science, Usage Of R Language For Data Analytics
By: Er. Gursimran Bakshi (E8003)
Data Science Using R
DISCOVER . LEARN . EMPOWER

Syllabus/Topics To Be Covered
• Data Science
• Components of Data Science
• Data Strategy
• Data Engineering
• Data Analysis and Mathematical Models
• Data Visualization and Operationalization
• Introduction to R
• Features of R
• R for Data Science

Data Science
• Data science is an interdisciplinary field (it draws on more than one branch of study) that uses statistics, computer science, and machine learning algorithms to gain insights from structured and unstructured data.
• Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.
• Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks that ordinarily require human intelligence.
• More and more companies are coming to realize the importance of data science, AI, and machine learning.
• Regardless of industry or size, organizations that wish to remain competitive in the age of big data need to efficiently develop and implement data science capabilities or risk being left behind.
• Data science is one of the fastest-growing, most challenging, and highest-paying career fields of this decade.
• According to the Economic Times, India has seen a rise of more than 400 per cent in demand for data science professionals across varied industry sectors, at a time when the supply of such talent has grown slowly.

The Four Components of Data Science
1. Data Strategy
2. Data Engineering
3. Data Analysis and Models
4. Data Visualization and Operationalization

Data Strategy
• Developing a data strategy is simply determining what data you are going to gather and why.
As obvious as that seems, it is often overlooked, not given enough thought, or not formalized.
• We are talking only about the data you need to address your business problem or opportunity, and why. Other considerations are important, but they are not the first step.
• Deciding on a data strategy requires you to connect the data you are going to gather to your business goals.
• In the end, the effort you put into gathering data, formatting it correctly, and getting rid of “garbage” data that doesn’t serve your business goals will reflect both how hard that is to do and how valuable it might be.

Data Engineering
• Data engineering is about the technology and systems that are used to access, organize, and use the data. It primarily involves creating software solutions for data problems.
• These solutions typically involve establishing a data system and then creating data pipelines and endpoints within that system.
• Data engineering is important to data science overall because you can’t actually do any science without it.
• Data engineering allows data to flow from or to the product and through the ecosystem to various stakeholders.

Data Analysis and Mathematical Models
• This is the “heart” of data science; it is where much of what we associate with data science happens.
• We take data and, using math or an algorithm (arguably it is always both in some form), try to model how a “system” works.
• Data analysis and mathematical modeling involve: a. Computing (a person could possibly do this, though it is rare today), b. Math and/or statistics, c. A domain (like healthcare), d.
The application of the scientific method, or aspects of it.
• To break it down further, we think of data analysis and mathematical models in terms of how you can use data:
• To describe, extract insights, or make predictions about a service, product, person, business, or technology or, more likely, a combination of them (aka an “ecosystem”).
• To create a “tool” that replaces or supplements what a person does.
• This is what most machine learning does: it plays Go, reads an X-ray, schedules a patient, and so on. Instead of being a mechanical robot replacing a person putting in lug nuts, it replaces a person “thinking about” and doing a task.
• The first use case refers to what science has always done: obtain an understanding and, where possible, create a model that uses data to make a prediction.
• The second use case, again, refers to what engineers have always done with math and science.

Visualization and Operationalization
• We have lumped operationalization and visualization into one category because they so often occur hand in hand.
• Operationalization is the more general notion, though. Simply put, it is the idea that you are going to do something with the data at hand (after analysis and modeling), for instance draw a conclusion or take an action.
• Visualization is often the easiest way to convey the meaning of the data or analysis to the person whose job it is to interpret the output of the data science.

Data Visualization
• Visualization is not just about taking the data analysis and presenting it “correctly”. Sometimes it involves going back into the raw data and understanding what needs to be visualized, based on the needs and goals of both the user and the operations.
• If you are developing a device that visualizes any data, a deep understanding of the following is required: I. How the data will be used, II. The needs and capabilities of the person consuming that data, III.
Users’ context of use, including physical location, devices being used, physical environment, and situational context, IV. The complexity of the analysis (e.g., is it important to convey the number of variables that were analyzed to create a prediction?).

Data Operationalization
• Operationalizing is really about doing something with the data; someone (or occasionally a machine) has to make a decision and/or take an action based on the math and computing that has happened.
• This could be in the form of:
1. A real-time human decision/action (e.g., human intervention based on analysis of patient data gathered by a device);
2. A longer-term response (e.g., the decision to restructure resource deployment in a hospital, based on business operational efficiencies); or
3. A recommendation on a very specific task (e.g., an “AI” diagnosing a broken leg on an X-ray).
• If these ideas are relatively new to your organization and you are, for instance, planning a new release or new product and want to bring data science to bear, then a simple start is to draw an ecosystem diagram.
• Then use this tool to have conversations about what data you are going to gather and why, and how you are going to either optimize or transform a system with your product or service. This will naturally lead to the steps of data strategy, data engineering, and so on.
• If drawing out an ecosystem doesn’t interest you, you could instead follow your existing approach for defining and designing other product features, although, based on our experience, we would highly recommend you give it a go. When it comes to applying data science, treat it exactly as though you were creating a product feature because, from a practical point of view, that is what it is.

Introduction to R
• R is a programming language, but it is unlike C or Java.
• It was not created by software engineers for software development.
• It was developed by statisticians as an interactive environment for data analysis.
• Interactivity is an indispensable feature in data science because, as you will soon learn, the ability to quickly explore data is a necessity for success in this field.
• As in other programming languages, you can save your work as scripts that can be easily executed at any moment.
• These scripts serve as a record of the analysis you performed, a key feature that facilitates reproducible work.
• If you are patient, you will come to appreciate the unequaled power of R when it comes to data analysis and, specifically, data visualization.

Attractive Features of R
1. R is free and open source.
2. It runs on all major platforms: Windows, macOS, and UNIX/Linux.
3. Scripts and data objects can be shared seamlessly across platforms.
4. There is a large, growing, and active community of R users and, as a result, there are numerous resources for learning and asking questions.
5. It is easy for others to contribute add-ons, which enables developers to share software implementations of new data science methodologies.

R for Data Science
• R for data science focuses on the language’s statistical and graphical uses.
• When you learn R for data science, you learn how to use the language to perform statistical analyses and develop data visualizations.
• R’s statistical functions also make it easy to import, clean, and analyze data.
• R can be used with an integrated development environment (IDE).
• According to the software company GitHub, the purpose of an IDE is to make writing and working with software packages easier.
• RStudio is an IDE for R that improves the accessibility of graphics and includes a syntax-highlighting editor that helps with code execution.

THANK YOU
For queries, email: Gursimran.e8003@cumail.in
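
Supplementary example for the “R for Data Science” slide
This is a minimal sketch, not part of the original deck, of the load, clean, analyze, and visualize workflow that slide describes. It assumes only base R and the built-in mtcars data set. The data set, the variable names (cars, fit), and the choice of a simple linear model are illustrative assumptions.

```r
# Load a data set that ships with base R (no extra packages needed).
data(mtcars)

# Explore: inspect the structure and summarize one variable.
str(mtcars)
summary(mtcars$mpg)

# Clean: drop rows with missing values.
# (mtcars has none; the step is shown only to illustrate the workflow.)
cars <- na.omit(mtcars)

# Analyze: fit a simple linear model of fuel economy on vehicle weight.
fit <- lm(mpg ~ wt, data = cars)
print(coef(fit))

# Visualize: scatter plot of the data with the fitted regression line.
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel economy vs. weight (mtcars)")
abline(fit, col = "red")
```

Saved as a script (for example analysis.R, a hypothetical file name), the same steps can be rerun at any time with Rscript analysis.R. That is the sense in which a script serves as a record of the analysis and supports reproducible work.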