Math 1130 - Data Science, Machine Learning, and Artificial Intelligence at a Glance
Kelly Ramsay

Course overview
• Introductions
• Syllabus overview
• Discord
• Project etc.

Goals
By the end of this case we hope to have properly defined data science, machine learning and artificial intelligence. You should be able to identify examples of each in the real world.

What is “data”?
• You can go onto Wikipedia or read books to get an answer to this question, but most of those sources will give you a very pedantic, unintuitive definition.
• Instead, we’re going to go with the colloquial definition of data as “something whose value you care about”.
• You won’t find that in any formal treatment of the subject, but for now, it is good enough.
• Your name, age, and telephone number are data about you. Your bank savings, your address, and your parents’ names are data that relate to you.
• We have data about everything, everywhere.

We can also think of data as recorded information: daily temperature recordings, stock prices over time, or monthly blood pressure recordings. For example:

radius      17.99   20.57   19.69   11.42   20.29
texture     10.38   17.77   21.25   20.38   14.34
perimeter   122.8   132.9   130     77.58   135.1
area        1001    1326    1203    386.1   1297

The data life cycle
Note: To be a data scientist, you must learn skills from each of the stages of the data life cycle.
Note: Other variants of this life cycle exist, but they send the same message.

Capture
• This is actually getting the data
• You may collect this in many ways:
  • Walking outside and recording if it is raining or not
  • Designing a pharmaceutical experiment
  • Data entry
  • Web scraping
  • Application Programming Interface (API)
• In this stage, you are taking the data and moving it into ‘storage’

Maintain
• This is placing the data in a form in which it can be used for analysis
• This may include:
  • Documenting the data: what exactly was recorded and what should be noted
  • Data cleaning and pre-processing
  • Data warehousing
• Data is accessed from the warehouse via a query language such as SQL

Analysis
• Data analysis is where we transform the data into insights, such as predictions, qualitative analysis, and exploratory analysis
• Data analysis is typically done in Python or R
• Knowledge of statistics sits mostly in this category
• The analysis technique will depend on the question we want to answer and the characteristics of the data

Common analysis techniques include:
• Graphical and statistical summaries of the data: mean, standard deviation, bar chart, scatter plot
• Regression: a technique for assessing the relationship between two or more variables
• Natural language processing: a huge variety of methods, many of them machine learning models
• Classification: naive Bayes, logistic regression, neural networks, support vector machines
• Time series analysis: ARIMA modelling, GARCH modelling, etc.

Communicate
• Results of our analysis can be long and technical
• Distill these results into insights, action items and decisions
• Build reports, such as in Power BI or Markdown, which can be read by non-technical staff
• Reports need to contain all of the relevant information, in a concise and clear manner
• For example: “The predicted temperature for the next week is 20 degrees, with a high-likelihood range of (15, 25).”

What is Data Science?
Now that we know what data is, we can ask: “What is data science?” Science, in the language of the scientific method, is:
1. Formulating hypotheses, or guesses about how the world works, based on observations of the world around us
2. Validating or invalidating those hypotheses by conducting experiments

• Unlike the pure sciences, working with data doesn’t necessarily require conducting experiments (although it could!).
• Rather, many times the data has already been collected and organized by someone else.
• So the scientific method, as applied to data, can be summarized as: “Formulating hypotheses based on the world around us, then analyzing relevant data to validate or invalidate our hypotheses.”

Exercise 1
Which of the following reflect the entire data science process?
(a) Anecdotally noticing that millennials seem to respond more positively to discussions of your firm’s new product version that is in beta, versus your existing one. Next, setting up an A/B test funnelling millennials equally to both versions, then conducting statistical significance tests on this data to verify that millennials prefer the new version.
(b) Observing that Uber pricing seems to be correlated with a small set of factors, obtaining open-source data on Uber pricing rates, then building a pricing model based on those factors and verifying that they explain most of the variation in rates.
(c) Converting images of crop circles into structured pixels and storing them in a database for later use.
(d) Building an algorithm that allows a computer to recognize images of cats and dogs.

• Answer. (a) and (b).
Notice that both of these scenarios follow the “hypothesis - experiment/investigation - analysis” sequence described earlier.
• (c) is a data engineering problem; while not reflective of the entire data science process, this sort of manipulation of data to get it into a form suitable for analysis is a crucial part of the data science process.
• (d) is also not reflective of the entire process, as we are not doing any hypothesis testing, just building a model. We will revisit this soon enough.

What is Data Science not?
Notice that data science is NOT what is often brought up in the media:
1. It is NOT computers recognizing images of cats and dogs
2. It is NOT IBM Watson screening human tissues for disease
3. It is NOT AlphaGo beating the world’s top Go player
4. It is NOT ChatGPT telling you how to best do laundry

• In fact, the VAST majority of data use cases are NOT like the four examples above.
• Instead, they are much more similar to those of the traditional sciences.
• Most examples of data science are what we described in choices (a) and (b) in Exercise 1.

What is Data Science?
A common corporate data science example is a firm:
• collecting user data and analyzing it to categorize its customers,
• creating marketing campaigns to better target its potential customers,
• testing the effects of those campaigns,
• updating its marketing materials accordingly.

What skills does a data scientist need?
• You may have noticed that every stage in the data life cycle involves computer programming: Python/IPython, R, SQL, GitHub, Power BI, Excel, cloud computing, Apache products, AWS
• Data analysis requires a strong knowledge of statistics: mathematics, regression, machine learning, exploratory analysis
• Data capturing involves ethics and legal knowledge: privacy concerns, am I allowed to collect this data? How am I allowed to use this data?
• Communication: results must be communicated clearly, and data must be documented appropriately and well understood: charts, KPIs, metrics, presentation skills

In this course, we are going to cover elements from each stage of the data life cycle, giving you the basic skills to complete each one.

What is Statistics?
• A population is a collection of units we would like to learn about, e.g., Canadians, students at York.
• A sample is a part of the population that is observed.
• Statistics concerns using samples to make claims about populations.

Some common areas of statistics are:
• Sampling effectively: Does the sample represent the population?
• Design of experiments: What is the best way to design an experiment which tests a given hypothesis?
• Causal inference: Separating causation from correlation; does X cause Y, does Y cause X, or neither? Do people buy ice cream because it is hot outside, or does it get hot outside when people buy ice cream?
• Inference: Does a drug really work, or did it only appear to work by chance in this particular sample?

Theoretical statistics studies the mathematics of these issues, while applied statistics focuses on modelling different kinds of data.

Though statistics has significant overlap with data science, they are not quite the same. For example, building a web scraper to collect data is not necessarily performing statistics, but it could be part of the data science process.

What is Machine Learning?
Choice (d) of Exercise 1, as well as the examples given on the “What is Data Science not?” slide, are instances of machine learning. What does this mean?

• “Learn” means to “gain or acquire knowledge or skill in something via experience.”
• So one could frame “machine learning” as “how a machine gains or acquires knowledge via experience.” How does a machine gain experience?
• All machine inputs are essentially binary strings of 0s and 1s, which is really just (you guessed it) data!
• So machine learning is really just how a computer acquires knowledge via data.

Of course, this gives no insight into the “how” at all; it just says that there is something that is done with input data to generate this knowledge as an output.

To make a math analogy, machine learning is some function f such that knowledge = f(data).

Other than that, there are no real stipulations on f! Therefore, f could be as mechanical as a simple mathematical function (say, the sum of all the data points) and qualify as machine learning.
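To make the analogy concrete, here is a minimal Python sketch of knowledge = f(data) in which f is nothing more than a sum or, as a second variant, a mean. The function names `f_sum` and `f_mean` are hypothetical, chosen for this illustration only; the data points are the five “radius” values from the example table shown earlier.

```python
# Illustrative sketch: "knowledge = f(data)" where f is purely mechanical.
# The names f_sum and f_mean are hypothetical, chosen for this example only.

def f_sum(data):
    """Return the sum of all data points."""
    return sum(data)


def f_mean(data):
    """Return the average of the data points."""
    return sum(data) / len(data)


# The five "radius" values from the example table earlier in the slides.
radius = [17.99, 20.57, 19.69, 11.42, 20.29]

knowledge_sum = f_sum(radius)
knowledge_mean = f_mean(radius)

print(f"sum of radius values:  {knowledge_sum:.2f}")
print(f"mean of radius values: {knowledge_mean:.3f}")
```

Neither function does anything “smart”, yet each maps data to a piece of knowledge about that data, which is all the loose definition above asks for.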
What is Machine Learning?
In practice, most of the common machine learning algorithms are mechanical functions of this kind, including:
• Logistic regression
• Random forests
• Support vector machines
• k-means clustering
• Neural networks
(You will learn about all of these later in the program.)

This may seem disappointing, given how the media hypes up “artificial intelligence” and makes it seem like there is something “smart” going on with machine learning, but in fact many mechanical methods satisfy the conditions required to be classified as machine learning.

This doesn’t mean these mechanical methods are limited in usefulness (in fact, they are quite powerful if used properly), but it does mean that they don’t resemble anything that we would naturally associate with human-like intelligence.

• More specifically, machine learning is “mechanical” in the sense that how these algorithms “learn” is strictly based upon mathematical principles.
• For example, linear regression is an algorithm that learns by adjusting the coefficients applied to the input data to best predict an output value. How the coefficients change is entirely based on mathematical rules (in this case, the gradient of the prediction error with respect to the coefficients).
• A common application of linear regression would be predicting housing prices based on various input data such as size, number of rooms, and age of the house.
• The model would take in the data and learn from it by choosing the set of coefficients that minimizes the error of its predictions vs. actual prices.

Exercise 2
Based on the above definition, which of the following tasks would likely involve the use of “machine learning”? Select all that apply.
(a) Building the model backing Apple iPhones’ facial recognition system.
(b) Constructing the model backing Netflix’s movie recommendation system, which is based on your previous viewing activity.
(c) Investigating factors which affect Airbnb pricing and developing a pricing tool based on this analysis.
(d) Setting up an automated system to approve or reject mortgage loan applications.

Answer. All of the above.

(a) is similar to the task of recognizing cats and dogs in images, which, as we have discussed, is a machine learning problem.

(b) involves building a computer model that can learn your movie preferences based on your previous viewing activity; again, this fits into our previous definition of machine learning.

While (c) does not explicitly reference building a machine learning model, a pricing tool which takes into account all the factors affecting Airbnb pricing would likely be complex enough to benefit from the use of machine learning.
(d) also does not explicitly reference building a model, but there are many parts to a mortgage loan application, and therefore likely many factors that are relevant in determining approval. Thus, any automated system would likely require a complex model that takes all of these factors into account. Such a model would benefit from incorporating machine learning.

What is Machine Learning?
Machine learning can be, and often is, a part of proper data science. Data science is fundamentally a process, while machine learning is a tool that can be immensely useful in conducting the data science process.

What is artificial intelligence?
But the elephant is still in the room: even though some mechanical, “dumb” methods may qualify as machine learning, this doesn’t exclude human-like, “smart” methods from being classified as such either. That is completely true: it doesn’t. Yet people have chosen to give such methods another name entirely: artificial intelligence.

But why? Why give “smart” methods an entirely different name if they can also fall in the bucket of machine learning? That is the question we will explore for the remainder of this module.

Let’s start by taking a look at an iconic demonstration of this so-called intelligence: AlphaGo beating the world’s top human Go player.

Quite impressive! But does this feat alone prove that a machine exhibits human-like intelligence?

What, to you, counts as “artificial intelligence”? What demonstrations of aptitude would, beyond a shadow of a doubt, convince you that something is as intelligent as us humans are?
You may have found that it was quite hard to come up with answers to the second part of the previous question, and that most ideas you had either:
1. seemed like they could well have a “dumb” mechanical solution, as in the previous discussion of machine learning, despite seeming impressive at first; or
2. actually made you question how unique human intelligence really is and whether virtually all of what we do could be reduced to such “dumb” mechanical methods.
We’ll explore both ideas below.

So, is there any sensible test we could use to determine if something is as intelligent as a human?

There have been many proposals over time. The most famous is the Turing test, named after the English mathematician and World War II cryptographer Alan Turing.

• In the Turing test, there is a human evaluator and two conversation partners: one machine and one human.
• The evaluator conducts a conversation with each through a text-only channel.
• If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test.

What are some shortcomings of this proposal? What insight does this lend into the criteria that a rigorous test of human-like intelligence would have to satisfy?
• Turing did not explicitly state that his test could be used as a measure of intelligence, but many who came after him thrust his test into the limelight.
• Of course, the implication is that if a computer can converse like a human, then it is effectively as intelligent as a human.

• In addition to the flaws with the Turing test (and in fact with almost any other test you can likely come up with), this brings to light one of our society’s unhealthy obsessions when it comes to the field of artificial intelligence: its singular focus on mimicking human intelligence via machines.
• But what if machine intelligence is fundamentally different from (note: not worse or better than) human intelligence? What if machines are more “intelligent” about certain things than we are, and vice versa?

Brainstorm with a partner, and then we will discuss aloud in class:
• What are some things that machines can already do better than we can? What specifically about them allows them to do these better?
• What are some things that we can do better than machines? Do you think this advantage will likely be sustained over time? Why?

Takeaways
In this case, you learned what “data science” and “machine learning” really are, in contrast to the misleading connotations that they are often given in public discussion.
You also learned that “artificial intelligence” is a very ambiguous term: nobody really agrees on its exact definition, and it is unclear if its current focus on imitating human intelligence is even the correct approach.

Throughout this program, we will focus primarily on data science and machine learning, and not so much artificial intelligence. Yet the philosophical questions surrounding artificial intelligence are fascinating, and we encourage you to continue pondering them as you become more involved in this new and exciting field.

In the coming weeks, we will train you on the numerous techniques (e.g. linear regression) and tools (e.g. pandas, scikit-learn) of data science so that you will be able to conduct professional data science processes in your own lives.

Homework
• Write a bullet-form or paragraph summary of the differences between AI, statistics, machine learning and data science.
• Write down a hypothesis you have about the world, and possible data you might use to verify this hypothesis.
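As a preview of the linear regression technique mentioned in the takeaways, the following is a minimal pure-Python sketch of how a model can “learn” coefficients by repeatedly stepping against the gradient of its prediction error. It is one possible illustration, not the course’s reference implementation: the house sizes and prices are made-up numbers, and the function name `fit_line`, the learning rate, and the number of steps are assumptions of this sketch.

```python
# Minimal sketch of linear regression trained by gradient descent (pure Python).
# All data values, names, and hyperparameters here are hypothetical.

def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y ≈ slope * x + intercept by minimizing the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current coefficients.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to each coefficient.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Move each coefficient a small step against its gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data: house size (in units of 100 square metres) and price (in $100,000s).
sizes = [0.5, 1.0, 1.5, 2.0, 2.5]
prices = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 1.8 + intercept  # predicted price for a size of 1.8

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

In practice a library such as scikit-learn would typically be used instead of a hand-written loop; the loop is shown here only to make the “mechanical” nature of the learning step visible.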