Lesson 1 1130

Math 1130 - Data Science, Machine Learning,
and Artificial Intelligence at a Glance
Kelly Ramsay
Course overview
• Introductions
• Syllabus overview
• Discord
• Project etc.
Goals
By the end of this case we hope to have properly defined data
science, machine learning and artificial intelligence. You should be
able to identify examples of each in the real world.
What is “data”?
• You can go onto Wikipedia or read books to get an answer to
this question, but most of those sources will give you a very
pedantic, unintuitive definition.
• Instead, we’re going to go with the colloquial definition of
data as “something whose value you care about”.
• You won’t find that in any formal treatment of the subject,
but for now, it is good enough.
• Your name, age, and telephone number are data about you.
Your bank savings, your address, and your parents’ names are
data that relate to you.
• We have data about everything, everywhere.
What is “data”?
We can also think of data as recorded information: daily
temperature recordings, stock prices over time, or monthly blood
pressure recordings. For example:
radius   texture   perimeter   area
17.99    10.38     122.8       1001
20.57    17.77     132.9       1326
19.69    21.25     130         1203
11.42    20.38     77.58       386.1
20.29    14.34     135.1       1297
The data life cycle
Note: To be a data scientist, you must learn skills from each of the
stages of the data life cycle.
Note: Other variants of this life cycle exist, but they send the same
message.
Capture
• This is actually getting the data
• You may collect this in many ways:
• Walking outside and recording if it is raining or not
• Designing a pharmaceutical experiment
• Data entry
• Webscraping
• Application Programming Interface (API)
• In this stage, you are taking the data and moving it into
‘storage’
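As a rough illustration of the capture stage, the sketch below takes a few hand-recorded observations (was it raining or not) and moves them into "storage" as CSV text. The observation values, the field names `date` and `raining`, and the helper `capture_to_csv` are all hypothetical and chosen only for illustration.

```python
import csv
import io


def capture_to_csv(observations):
    """Write a list of observation dicts to CSV text and return it."""
    buffer = io.StringIO()  # stands in for a file on disk
    writer = csv.DictWriter(buffer, fieldnames=["date", "raining"])
    writer.writeheader()
    writer.writerows(observations)
    return buffer.getvalue()


# Hypothetical hand-recorded observations (data entry).
observations = [
    {"date": "2024-01-01", "raining": "yes"},
    {"date": "2024-01-02", "raining": "no"},
    {"date": "2024-01-03", "raining": "no"},
]

stored = capture_to_csv(observations)
print(stored)
```

In practice the buffer would be replaced by a real file or database; the point is only that captured values end up in a structured, stored form.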
Maintain
• This is placing the data in a form in which it can be used for
analysis
• This may include:
• Documenting the data, what exactly was recorded and what
should be noted
• Data cleaning, pre-processing
• Data warehousing
• Data is accessed from the warehouse via a query language such as SQL
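The maintain steps above can be sketched in a few lines, under some loud assumptions: the raw records are invented, the table name `temperatures` and its columns are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical raw records: (city, temperature as text); one value is missing.
raw_records = [
    ("Toronto", " 21.5 "),
    ("Ottawa", ""),
    ("Montreal", "19.0"),
]

# Cleaning: strip whitespace, convert to numbers, drop missing values.
clean_records = [
    (city, float(temp.strip()))
    for city, temp in raw_records
    if temp.strip() != ""
]

# "Warehouse": an in-memory SQLite database standing in for real storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temperatures (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO temperatures VALUES (?, ?)", clean_records)

# Access the stored data with an SQL query.
rows = conn.execute(
    "SELECT city, temp_c FROM temperatures WHERE temp_c > 20 ORDER BY city"
).fetchall()
print(rows)
conn.close()
```

Dropping the missing value is only one possible cleaning choice; whichever choice is made should be written down as part of documenting the data.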
Analysis
• Data analysis is where we transform the data to insights, such
as predictions, qualitative analysis, exploratory analysis
• Data analysis is typically done in python or R
• Knowledge of statistics sits mostly in this category
• Technique for analysis will be based on the question we want
to answer and the data characteristics
Analysis
Common analysis techniques include:
• Graphical and statistical summaries of the data: Mean,
standard deviation, bar chart, scatter plot
• Regression: Analysis technique for assessing the relationship
between two or more variables
• Natural language processing: Huge variety of methods, many
machine learning models
• Classification: Naive Bayes, logistic regression, neural
networks, support vector machines
• Time series analysis: ARIMA modelling, GARCH modelling,
etc.
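To make the first two techniques concrete, here is a minimal sketch that computes statistical summaries and a simple least-squares regression line using the radius and area values from the example table earlier in these notes. Treating area as the output and radius as the input is an assumption made only for illustration.

```python
import statistics

# Values copied from the example table earlier in these notes.
radius = [17.99, 20.57, 19.69, 11.42, 20.29]
area = [1001, 1326, 1203, 386.1, 1297]

# Statistical summaries.
mean_radius = statistics.mean(radius)
sd_radius = statistics.stdev(radius)  # sample standard deviation

# Simple regression: fit the line area = intercept + slope * radius
# by least squares.
mean_area = statistics.mean(area)
sxy = sum((x - mean_radius) * (y - mean_area) for x, y in zip(radius, area))
sxx = sum((x - mean_radius) ** 2 for x in radius)
slope = sxy / sxx
intercept = mean_area - slope * mean_radius

print(f"mean radius = {mean_radius:.2f}, sd = {sd_radius:.2f}")
print(f"fitted line: area = {intercept:.1f} + {slope:.1f} * radius")
```

With only five rows this is a toy calculation; the same summaries would normally be produced with a library such as pandas on the full data set.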
Communicate
• Results of our analysis can be long and technical
• Distill these results to insights, action items and decision
making
• Build reports, such as in power-bi or markdown, which can be
read by non-technical staff
• Reports need to contain all of the relevant information, in a
concise and clear manner
• For example: “The predicted temperature for next week is 20
degrees, with a high-likelihood range of (15, 25).”
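As a small sketch of distilling technical output into one readable sentence, the code below turns a list of model predictions into a short statement. The prediction values are hypothetical, and using the minimum and maximum as the "likely range" is a simplifying assumption rather than a proper statistical interval.

```python
import statistics

# Hypothetical model output: predicted temperatures (degrees) for one day
# from several runs of a forecasting model.
predictions = [17.0, 19.5, 20.0, 21.0, 23.0, 18.0, 21.5, 20.0]


def summarize(values):
    """Turn raw predictions into one plain-language sentence."""
    centre = statistics.mean(values)
    low, high = min(values), max(values)
    return (
        f"Predicted temperature: {centre:.0f} degrees "
        f"(likely range {low:.0f} to {high:.0f})."
    )


print(summarize(predictions))
```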
What is Data Science?
Now that we know what data is, we can ask: “What is data
science?” Science, in the language of the scientific method, is:
1. Formulating hypotheses, or guesses about how the world
works, based on observations of the world around us
2. Validating or invalidating those hypotheses by conducting
experiments
What is Data Science?
• Unlike the pure sciences, working with data doesn’t necessarily
require conducting experiments (although it could!).
• Rather, many times the data has already been collected and
organized by someone else.
• So the scientific method, as applied to data, can be
summarized as: “Formulating hypotheses based on the world
around us, then analyzing relevant data to validate or
invalidate our hypotheses.”
Which of the following reflect the entire data science process?
(a) Anecdotally noticing that millennials seem to respond more
positively to discussions of your firm’s new product version
that is in beta, versus your existing one. Next, setting up an
A/B test funnelling millennials equally to both versions, then
conducting statistical significance tests on this data to verify
that millennials prefer the new version.
(b) Observing that Uber pricing seems to be correlated to a small
set of factors, obtaining open-source data on Uber pricing
rates, then building a pricing model based on those factors
and verifying that they explain most of the variation in rates.
(c) Converting images of crop circles into structured pixels and
storing them into a database for later use.
(d) Building an algorithm that allows a computer to recognize
images of cats and dogs.
Exercise 1
• Answer. (a) and (b). Notice that both of these scenarios
follow the “hypothesis - experiment/investigation - analysis”
sequence described earlier.
• (c) is a data engineering problem; while not reflective of the
entire data science process, this sort of manipulation of data
to get it into a form suitable for analysis is a crucial part of
the data science process.
• (d) is also not reflective of the entire process, as we are not
doing any hypothesis testing, just building a model. We will
revisit this soon enough.
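The "statistical significance tests" step in scenario (a) could look something like the sketch below, which runs a two-sided two-proportion z-test. This is one common choice of test, not necessarily the one a given firm would use; the function name `two_proportion_z_test` and all of the counts are hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical counts: users who responded positively to each version.
z, p_value = two_proportion_z_test(success_a=120, n_a=1000,
                                   success_b=160, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed difference between versions is unlikely to be due to chance alone, which is the "validate or invalidate" step of the process.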
What is Data Science not?
Notice that data science is NOT what is often brought up in the
media:
1. It is NOT computers recognizing images of cats and dogs
2. It is NOT IBM Watson screening human tissues for disease
3. It is NOT AlphaGo beating the world’s top Go player
4. It is NOT ChatGPT telling you how to best do laundry
What is Data Science not?
• In fact, the VAST majority of data use cases are NOT like the
four examples above.
• Instead, they are much more similar to those of the traditional
sciences.
• Most examples of data science are what we described in
choices (a) and (b) in Exercise 1.
What is Data Science?
A common corporate data science example is a firm:
• collecting user data, analyzing the data to categorize their
customers,
• creating marketing campaigns to better target its potential
customers,
• testing the effects of the solutions,
• updating their marketing materials accordingly.
What skills does a data scientist need?
• You may have noticed that every stage in the data life cycle
involves computer programming: Python/IPython, R, SQL,
GitHub, Power BI, Excel, cloud computing, Apache products,
AWS
• Data analysis requires a strong knowledge of statistics:
Mathematics, regression, machine learning, exploratory
analysis
What skills does a data scientist need?
• Data capturing involves ethics and legal knowledge: Privacy
concerns, am I allowed to collect this data? How am I allowed
to use this data?
• Communication: results must be communicated clearly, and data
must be documented appropriately and well understood:
charts, KPIs, metrics, presentation skills
In this course, we are going to cover elements from each stage of
the data life cycle, giving you the basic skills to complete each
one.
What is Statistics?
• a population is a collection of units we would like to learn
about, e.g., Canadians, students at York
• a sample is a part of the population that is observed.
• statistics concerns using samples to make claims about
populations.
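A quick simulation can illustrate the population and sample idea. In this sketch the population is a made-up list of 10,000 ages (not real data), and we use a random sample of 200 of them to estimate the population mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 people.
population = [random.randint(18, 90) for _ in range(10_000)]

# A sample is the part of the population we actually observe.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.1f}")
print(f"sample mean     = {sample_mean:.1f}")
```

In real problems the population mean is unknown; the sample mean is the claim we make about it, and statistics tells us how far off that claim might be.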
What is Statistics?
Some common areas of statistics are:
• Sampling effectively: Does the sample represent the
population?
• Design of experiments: What is the best way to design an
experiment which tests a given hypothesis?
• Causal inference: Separating causation from correlation; does
X cause Y, does Y cause X, or neither? Do people buy ice cream
because it is hot outside, or does it get hot outside when
people buy ice cream?
• Inference: Does a drug really work, or did it only appear to
work because of an unrepresentative sample?
What is Statistics?
Theoretical statistics studies the mathematics of these issues and
applied statistics focuses on modelling different kinds of data.
Though statistics has significant overlap with data science, they are
not quite the same. For example, building a web scraper to collect
data is not necessarily performing statistics. But it could be part of
the data science process.
What is Machine Learning?
Choice (d) of Exercise 1, as well as the examples given on the “What
is Data Science not?” slide, are instances of machine learning. What
does this mean?
What is Machine Learning?
• “Learn” means to “gain or acquire knowledge or skill in
something via experience.”
• So one could frame “machine learning” as “how a machine
gains or acquires knowledge via experience.” How does a
machine gain experience?
• All machine inputs are essentially binary strings of 0s and 1s,
which is really just – you guessed it – data!
• So machine learning is really just how a computer acquires
knowledge via data.
What is Machine Learning?
Of course, this gives no insight into the “how” at all; it just says
that there is something that is done with input data to generate
this knowledge as an output.
What is Machine Learning?
To make a math analogy, machine learning is some function f such
that
knowledge = f(data).
What is Machine Learning?
Other than that, there are no real stipulations on f!
Therefore, f could be as mechanical as a simple mathematical
function (say, the sum of all the data points) and qualify as
machine learning.
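To see how mechanical such an f can be, here is a deliberately trivial sketch in which the "knowledge" is just the mean of the data (the sum divided by the count). The function names `learn` and `predict` and the temperature values are hypothetical.

```python
def learn(data):
    """A deliberately simple f: the 'knowledge' is the mean of the data."""
    return sum(data) / len(data)


def predict(knowledge):
    """Use the learned knowledge to guess the next value."""
    return knowledge


# Hypothetical daily temperatures used as 'experience'.
data = [18.0, 20.0, 22.0, 19.0, 21.0]
knowledge = learn(data)
print(predict(knowledge))
```

Nothing here looks intelligent, yet it fits the definition: data goes in, and something usable for prediction comes out.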
What is Machine Learning?
In practice, this is what most of the common machine learning
algorithms are, including:
• Logistic regression
• Random forests
• Support vector machines
• k-means clustering
• Neural networks
(You will learn about all of these later in the program.)
What is Machine Learning?
This may seem disappointing, given how the media hypes up
“artificial intelligence” and makes it seem like there is something
“smart” going on with machine learning, but in fact many
mechanical methods satisfy the conditions required to be classified
as machine learning.
What is Machine Learning?
This doesn’t mean these mechanical methods are limited in
usefulness – in fact, they are quite powerful if used properly – but
it does mean that they don’t resemble anything that we would
naturally associate with human-like intelligence.
What is Machine Learning?
• More specifically, machine learning is “mechanical” in the
sense that how these algorithms “learn” is strictly based upon
mathematical principles.
What is Machine Learning?
• For example, linear regression is an algorithm that learns by
adjusting the coefficients applied to the input data to best
predict an output value. How the coefficients change is
determined entirely by mathematical rules (for example, the
gradient of the prediction error with respect to the coefficients).
• A common application of linear regression would be predicting
housing prices based on various input data such as size,
number of rooms, and age of the house.
• The model would take in the data and learn from it by
choosing the set of coefficients that minimizes the error of its
predictions vs. actual prices.
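The learning process described above can be sketched with gradient descent, which is one way (not the only way) to fit a linear regression. To keep it short, the sketch uses a single input, house size; the sizes, prices, learning rate, and number of steps are all hypothetical choices.

```python
# Hypothetical data: house size (1000s of sq ft) and price ($100,000s).
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [2.0, 2.9, 4.1, 5.0, 6.0]


def fit_linear_regression(xs, ys, learning_rate=0.05, steps=5000):
    """Fit price = intercept + slope * size by gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current coefficients.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error w.r.t. each coefficient.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Step against the gradient to reduce the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


slope, intercept = fit_linear_regression(sizes, prices)
print(f"fitted line: price = {intercept:.2f} + {slope:.2f} * size")
```

Each pass nudges the coefficients in the direction that lowers the average squared error, so "learning" here is nothing more than repeated arithmetic. Libraries such as scikit-learn solve the same problem directly rather than step by step.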
Exercise 2
Based on the above definition, which of the following tasks would
likely involve the use of “machine learning”? Select all that apply.
(a) Building the model backing Apple iPhones’ facial recognition
system.
(b) Constructing the model backing Netflix’s movie
recommendation system, which is based on your previous viewing
activity.
(c) Investigating factors which affect Airbnb pricing and developing
a pricing tool based on this analysis.
(d) Setting up an automated system to approve or reject mortgage
loan applications.
Exercise 2
Answer. All of the above.
Exercise 2
(a) is similar to the task of recognizing cats and dogs in images,
which, as we have discussed, is a machine learning problem.
(b) involves building a computer model that can learn your movie
preferences based on your previous viewing activity; again, this fits
into our previous definition of machine learning.
Exercise 2
While (c) does not explicitly reference building a machine learning
model, a pricing tool which takes into account all the factors
affecting Airbnb pricing would likely be complex enough to benefit
from the use of machine learning.
(d) also does not explicitly reference building a model, but there
are many parts to a mortgage loan application, and therefore likely
many factors that are relevant in determining approval. Thus, any
automated system would likely require a complex model that takes
all of these factors into account. Such a model would benefit from
incorporating machine learning.
What is Machine Learning?
Machine learning can be, and often is, a part of proper data
science. Data science is fundamentally a process, while machine
learning is a tool that can be immensely useful in conducting the
data science process.
What is artificial intelligence?
But the elephant is still in the room: even though some
mechanical, “dumb” methods may qualify as machine learning,
this doesn’t exclude human-like, “smart” methods from being
classified as such either. And this is completely true: it doesn’t.
Yet people have chosen to give such methods an entirely different
name: artificial intelligence.
What is artificial intelligence?
But why? Why give “smart” methods an entirely different name if
they can also fall in the bucket of machine learning? That is the
question we will explore for the remainder of this module.
What is artificial intelligence?
Let’s start by taking a look at an iconic demonstration of this
so-called intelligence: AlphaGo beating the world’s top human Go
player.
What is artificial intelligence?
Quite impressive! But does this feat alone prove that a machine
exhibits human-like intelligence?
What is artificial intelligence?
What, to you, counts as “artificial intelligence”? What
demonstrations of aptitude would, beyond a shadow of a doubt,
convince you that something is as intelligent as us humans are?
What is artificial intelligence?
You may have found that it was quite hard to come up with
answers to the second part of the previous question, and that most
ideas you had either:
1) seemed like they could well have a “dumb” mechanical
solution as in the previous discussion of machine learning,
despite seeming impressive at first; or
2) actually made you question how unique human intelligence
really is and whether virtually all of what we do could be
reduced to such “dumb” mechanical methods.
We’ll explore both ideas below.
What is artificial intelligence?
So, is there any sensible test we could use to determine if
something is as intelligent as a human?
What is artificial intelligence?
There have been many proposals over time. The most famous
aptitude test developed was the Turing test, named after the
English mathematician and famous World War II cryptographer
Alan Turing.
What is artificial intelligence?
• In the Turing test, there is a human evaluator and two
conversation partners: one machine and one human.
• The evaluator would conduct a conversation with each
through a text-only channel.
• If the evaluator cannot reliably tell the machine from the
human, the machine is said to have passed the test.
What is artificial intelligence?
What are some shortcomings of this proposal? What insight does
this lend into the criteria that a rigorous test of human-like
intelligence would have to satisfy?
What is artificial intelligence?
• Turing did not explicitly state that his test could be used as a
measure of intelligence, but many who came after him thrust
his test into the limelight.
• Of course, the implication is that if a computer can converse
like a human, then it is effectively as intelligent as a human.
What is artificial intelligence?
• In addition to the flaws with the Turing test (and in fact with
almost any other test you can likely come up with), this brings
to light one of our society’s unhealthy obsessions when it
comes to the field of artificial intelligence – its singular focus
on mimicking human intelligence via machines.
• But what if machine intelligence is fundamentally different
from (note: not worse or better than) human intelligence? What if
machines are more “intelligent” about certain things than we
are, and vice versa?
What is artificial intelligence?
Brainstorm with a partner, and then we will discuss aloud in class:
• What are some things that machines can already do better
than we can? What specifically about them allows them to do
these better?
• What are some things that we can do better than machines?
Do you think this advantage will likely be sustained over time?
Why?
Takeaways
In this case, you learned what “data science” and “machine
learning” really are, in contrast to the misleading connotations that
they are often given in public discussion. You also learned that
“artificial intelligence” is a very ambiguous term – nobody really
agrees on its exact definition and it is unclear if its current focus
on imitating human intelligence is even the correct approach.
Takeaways
Throughout this program, we will focus primarily on data science
and machine learning, and not so much artificial intelligence. Yet
the philosophical questions surrounding artificial intelligence are
fascinating, and we encourage you to continue pondering them as
you become more involved in this new and exciting field.
Takeaways
In the coming weeks, we will train you on the numerous techniques
(e.g. linear regression) and tools (e.g. pandas, scikit-learn) of data
science so that you will be able to conduct professional data
science processes in your own lives.
Homework
• Write a bullet-point list or a paragraph summarising the differences
between AI, statistics, machine learning and data science.
• Write down a hypothesis you have about the world, and
possible data you might use to verify this hypothesis.