Finish this Book and get your Data Scientist Job

Data Science Introduction
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.

What is Data Science?
Data Science is about data gathering, analysis and decision-making. It is about finding patterns in data through analysis, and making future predictions.

By using Data Science, companies are able to make:
- Better decisions (should we choose A or B?)
- Predictive analyses (what will happen next?)
- Pattern discoveries (find patterns, or maybe hidden information, in the data)

Where is Data Science Needed?
Data Science is used in many industries today, e.g. banking, consultancy, healthcare, and manufacturing.

Examples of where Data Science is needed:
- Route planning: to discover the best shipping routes
- To foresee delays for flights/ships/trains etc. (through predictive analysis)
- To create promotional offers
- To find the best-suited time to deliver goods
- To forecast next year's revenue for a company
- To analyze the health benefits of training
- To predict who will win an election

Data Science can be applied in nearly every part of a business where data is available. Examples are:
- Consumer goods
- Stock markets
- Industry
- Politics
- Logistics companies
- E-commerce

How Does a Data Scientist Work?
A Data Scientist requires expertise in several fields:
- Machine Learning
- Statistics
- Programming (Python or R)
- Mathematics
- Databases

A Data Scientist must find patterns within the data. Before he/she can find the patterns, the data must be organized in a standard format.

Here is how a Data Scientist works:
1. Ask the right questions - to understand the business problem.
2. Explore and collect data - from databases, web logs, customer feedback, etc.
3. Extract the data - transform the data into a standardized format.
4. Clean the data - remove erroneous values from the data.
5. Find and replace missing values - check for missing values and replace them with a suitable value (e.g. an average value).
6. Normalize data - scale the values into a practical range (e.g. 140 cm is smaller than 1.8 m, but the number 140 is larger than 1.8, so scaling is important). A small sketch of this appears at the end of this chapter.
7. Analyze data, find patterns and make future predictions.
8. Represent the result - present the result with useful insights in a way the "company" can understand.

What is Data?
Data is a collection of information. One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Data can be categorized into two groups:
- Structured data
- Unstructured data

Unstructured Data
Unstructured data is not organized. We must organize it for analysis purposes.

Structured Data
Structured data is organized and easier to work with.

How to Structure Data?
We can use an array or a database table to structure or present data.

Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

The following example shows how to create an array in Python:

#Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)

o/p: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

Data Science - Database Table
What is a Database Table?
A database table is a table with structured data. The following table shows a database table with health data extracted from a sports watch. This data set contains information about a typical training session, such as duration, average pulse, calorie burnage etc.
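Returning to the normalization step (step 6) of the workflow above, here is a minimal sketch of min-max scaling. The heights list and its values are made up for illustration:

#A minimal sketch of min-max scaling (illustrative values, in centimeters)
heights = [140, 180, 155, 172]

low, high = min(heights), max(heights)
scaled = [(h - low) / (high - low) for h in heights]
print(scaled)

o/p: [0.0, 1.0, 0.375, 0.8]

After scaling, all values lie between 0 and 1, so variables measured in different units can be compared on the same footing.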
Data Science - Database Table
Database Table Structure
A database table consists of columns and rows:
- A row is a horizontal representation of data.
- A column is a vertical representation of data.

Variables
A variable is defined as something that can be measured or counted. Examples can be characters, numbers or time. In the example below, each column represents a variable.

There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep). There are 11 rows, meaning that each variable has 10 observations. But if there are 11 rows, how come there are only 10 observations? It is because the first row is the label, i.e. the name of the variable.

Data Science & Python
Python is a programming language widely used by Data Scientists. Python has built-in mathematical libraries and functions, making it easier to solve mathematical problems and to perform data analysis.

Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools. In this course, we will use the following libraries:
- Pandas - used for structured data operations, like importing CSV files, creating DataFrames, and data preparation
- NumPy - a mathematical library with a powerful N-dimensional array object, linear algebra, Fourier transforms, etc.
- Matplotlib - used for visualization of data
- SciPy - has linear algebra modules

Data Science - Python DataFrame
Create a DataFrame with Pandas
A DataFrame is a structured representation of data. Let's define a DataFrame with 3 columns and 5 rows with fictional numbers:

Example
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)

Example Explained
- Import the Pandas library as pd
- Define data with columns and rows in a variable named d
- Create a DataFrame using the pd.DataFrame() function
- The DataFrame contains 3 columns and 5 rows
- Print the DataFrame with the print() function

We write pd. in front of DataFrame() to let Python know that we want to use the DataFrame() function from the Pandas library. Be aware of the capital D and F in DataFrame!

Interpreting the Output
We see that "col1", "col2" and "col3" are the names of the columns. Do not be confused by the vertical numbers ranging from 0 to 4; they tell us the position of the rows. In Python, the numbering of rows starts with zero.

Now we can use Python to count the columns and rows. We can use df.shape[1] to find the number of columns (a short sketch follows after the next section).

Data Science Functions
The data set above consists of 6 variables, each with 10 observations:
- Duration - how long the training session lasted, in minutes
- Average_Pulse - the average pulse during the training session, measured in beats per minute
- Max_Pulse - the maximum pulse during the training session
- Calorie_Burnage - how many calories were burnt during the training session
- Hours_Work - how many hours we worked at our job before the training session
- Hours_Sleep - how much we slept the night before the training session

We use an underscore (_) to separate words in variable names because Python cannot read a space as a separator.
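Picking up the df.shape note from the DataFrame section above, here is a minimal sketch of counting columns and rows (df.shape is a standard Pandas attribute that returns a (rows, columns) tuple):

import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

#df.shape returns (rows, columns)
print(df.shape[1])  #number of columns
print(df.shape[0])  #number of rows

o/p:
3
5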
The Sports Watch Data Set

Data Science Functions
The max() function
The Python max() function is used to find the highest value in a set of values:

Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)

o/p: 125

The min() function
The Python min() function is used to find the lowest value in a set of values:

Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)

o/p: 80

The mean() function
The NumPy mean() function is used to find the average value of an array:

import numpy as np

Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)

o/p: 285.0

Data Science - Data Preparation
Extract and Read Data With Pandas
Before analyzing data, a Data Scientist must extract the data and make it clean and valuable. Before data can be analyzed, it must be imported/extracted.

In the example below, we show how to import data using Pandas in Python. We use the read_csv() function to import a CSV file with the health data:

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)

Example Explained
- Import the Pandas library.
- Name the DataFrame health_data.
- header=0 means that the variable names are found in the first row (note that 0 means the first row in Python).
- sep="," means that "," is used as the separator between values. This is because we are using the file type .csv (comma-separated values).

Tip: If you have a large CSV file, you can use the head() function to show only the top 5 rows (see the sketch at the end of this chapter).

Data Cleaning
Look at the imported data. As you can see, the data are "dirty", with wrong or unregistered values:
- There are some blank fields
- An average pulse of 9 000 is not possible
- 9 000 will be treated as non-numeric, because of the space separator
- One observation of max pulse is denoted as "AF", which does not make sense

So, we must clean the data in order to perform the analysis.

Remove Blank Rows
We see that the non-numeric values (9 000 and AF) are in the same rows as the missing values. Solution: we can remove the rows with missing observations to fix this problem. When we load a data set using Pandas, all blank cells are automatically converted into "NaN" values. So, removing the NaN cells gives us a clean data set that can be analyzed.

We can use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value:

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")
health_data.dropna(axis=0, inplace=True)
print(health_data)
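Tying this to the steps listed earlier: the Tip above mentioned head(), and step 5 of the workflow suggested replacing missing values with an average instead of dropping them. Here is a minimal sketch of both (head() and fillna() are standard Pandas methods; the mean-fill assumes the Average_Pulse column parses as numeric):

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

#Show only the top 5 rows of a large file
print(health_data.head())

#Alternative to dropping rows: replace missing values with the column mean
mean_pulse = health_data["Average_Pulse"].mean()
health_data["Average_Pulse"] = health_data["Average_Pulse"].fillna(mean_pulse)

Whether to drop or fill is a judgment call: dropping loses observations, while filling with an average invents plausible but unobserved values.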
Data Science - Data Preparation
Data Categories
To analyze data, we also need to know the types of data we are dealing with. Data can be split into three main categories:

- Numerical - contains numerical values. Can be divided into two subcategories:
  Discrete: numbers are counted as "whole". Example: you cannot have trained 2.5 sessions; it is either 2 or 3.
  Continuous: numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes and 20 seconds, or 7.533 hours.
- Categorical - contains values that cannot be measured against each other. Example: a color or a type of training.
- Ordinal - contains categorical data that can be measured against each other. Example: school grades, where A is better than B, and so on.

By knowing the type of your data, you will know which techniques to use when analyzing it.

Data Types
We can use the info() function to list the data types within our data set:

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.info())

We see that this data set has two different types of data: float64 and object. We cannot use objects to calculate and perform analysis here. We must convert the type object to float64 (float64 is a number with a decimal in Python); a sketch of one way to do this follows below.

Analyze the Data
When we have cleaned the data set, we can start analyzing the data. We can use the describe() function in Python to summarize data:

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns', None)
print(health_data.describe())

- Count - counts the number of observations
- Mean - the average value
- Std - standard deviation (explained in the statistics chapter)
- Min - the lowest value
- 25%, 50% and 75% - percentiles (explained in the statistics chapter)
- Max - the highest value
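The book does not show the object-to-float64 conversion code, so here is a minimal sketch of one common way to do it, using pd.to_numeric (a standard Pandas function). errors='coerce' turns entries that cannot be parsed, like "AF" or "9 000", into NaN, which can then be dropped:

import pandas as pd

health_data = pd.read_csv("data.csv", header=0, sep=",")

#Convert every column to a numeric dtype; invalid entries become NaN
for col in health_data.columns:
    health_data[col] = pd.to_numeric(health_data[col], errors="coerce")

health_data.dropna(axis=0, inplace=True)
print(health_data.info())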
DS Math
Data Science - Linear Functions
Mathematical functions are important to know as a data scientist, because we want to make predictions and interpret them.

Linear Functions
In mathematics, a function is used to relate one variable to another. Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable to assume that, in general, the calorie burnage will change as the average pulse changes; we say that the calorie burnage depends upon the average pulse. Furthermore, it may be reasonable to assume that as the average pulse increases, so will the calorie burnage. Calorie burnage and average pulse are the two variables being considered. Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the dependent variable and average pulse is the independent variable. The relationship between a dependent and an independent variable can often be expressed mathematically using a formula (function).

A linear function has one independent variable (x) and one dependent variable (y), and has the following form:

y = f(x) = ax + b

This function is used to calculate a value for the dependent variable when we choose a value for the independent variable. Explanation:
- f(x) = the output (the dependent variable)
- x = the input (the independent variable)
- a = slope = the coefficient of the independent variable. It gives the rate of change of the dependent variable
- b = intercept = the value of the dependent variable when x = 0. It is also the point where the diagonal line crosses the vertical axis

Linear Function With One Explanatory Variable
A function with one explanatory variable means that we use one variable for prediction. Let us say we want to predict calorie burnage using average pulse. We have the following formula:

f(x) = 2x + 80

Here, the numbers and variables mean:
- f(x) = the output. This is where we get the predicted value of Calorie_Burnage
- x = the input, which is Average_Pulse
- 2 = slope = specifies how much Calorie_Burnage increases if Average_Pulse increases by one. It tells us how "steep" the diagonal line is
- 80 = intercept = a fixed value. It is the value of the dependent variable when x = 0

Plotting a Linear Function
The term linearity means "a straight line", so a linear function shown graphically will always be a straight line. The line can slope upwards or downwards, and in some cases may be horizontal or vertical. Here is a graphical representation of the mathematical function above:

Graph Explanations:
- The horizontal axis is generally called the x-axis. Here, it represents Average_Pulse.
- The vertical axis is generally called the y-axis. Here, it represents Calorie_Burnage.
- Calorie_Burnage is a function of Average_Pulse, because Calorie_Burnage is assumed to be dependent on Average_Pulse. In other words, we use Average_Pulse to predict Calorie_Burnage.
- The blue (diagonal) line represents the structure of the mathematical function that predicts calorie burnage.

Data Science - Plotting Linear Functions
The Sports Watch Data Set
Take a look at our health data set. We can plot the values of Average_Pulse against Calorie_Burnage using the matplotlib library. The plot() function plots the points x, y; with kind='line' the points are connected by a straight line:

import pandas as pd
import matplotlib.pyplot as plt

health_data = pd.read_csv("data.csv", header=0, sep=",")
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
plt.ylim(ymin=0)
plt.xlim(xmin=0)
plt.show()

Example Explained:
- Import the pyplot module of the matplotlib library
- Plot the data from Average_Pulse against Calorie_Burnage
- kind='line' tells us which type of plot we want. Here, we want a straight line
- plt.ylim() and plt.xlim() tell us which value we want each axis to start on. Here, we want the axes to begin at zero
- plt.show() shows us the output

The Graph Output
As we can see, there is a relationship between Average_Pulse and Calorie_Burnage. Calorie_Burnage increases proportionally with Average_Pulse. It means that we can use Average_Pulse to predict Calorie_Burnage.

Why is the Line Not Fully Drawn Down to the y-axis?
The reason is that we do not have observations where Average_Pulse or Calorie_Burnage are equal to zero. 80 is the first observation of Average_Pulse and 240 is the first observation of Calorie_Burnage.

Look at the line. What happens to calorie burnage if average pulse increases from 80 to 90? We can use the diagonal line to find the mathematical function to predict calorie burnage. As it turns out:
- If the average pulse is 80, the calorie burnage is 240
- If the average pulse is 90, the calorie burnage is 260
- If the average pulse is 100, the calorie burnage is 280

There is a pattern: if average pulse increases by 10, the calorie burnage increases by 20.

Data Science - Slope and Intercept
Now we will explain how we found the slope and intercept of our function:

f(x) = 2x + 80

The image points to the slope, which indicates how steep the line is, and the intercept, which is the value of y when x = 0 (the point where the diagonal line crosses the vertical axis). The red line is the continuation of the blue line from the previous page.

Find the Slope
The slope is defined as how much calorie burnage increases if average pulse increases by one. It tells us how "steep" the diagonal line is. We can find the slope by using the proportional difference of two points from the graph.

- If the average pulse is 80, the calorie burnage is 240
- If the average pulse is 90, the calorie burnage is 260

We see that if average pulse increases by 10, the calorie burnage increases by 20:

Slope = 20/10 = 2

Mathematically, the slope is defined as:

Slope = (f(x2) - f(x1)) / (x2 - x1)

where:
f(x2) = second observation of Calorie_Burnage = 260
f(x1) = first observation of Calorie_Burnage = 240
x2 = second observation of Average_Pulse = 90
x1 = first observation of Average_Pulse = 80

Slope = (260 - 240) / (90 - 80) = 2

Use Python to Find the Slope
Calculate the slope with the following code:

def slope(x1, y1, x2, y2):
    s = (y2 - y1) / (x2 - x1)
    return s

print(slope(80, 240, 90, 260))

o/p: 2.0

Be consistent and define the observations in the correct order! If not, the prediction will not be correct.

Find the Intercept
The intercept is used to fine-tune the function's ability to predict Calorie_Burnage. The intercept is where the diagonal line crosses the y-axis, if it were fully drawn. The intercept is the value of y when x = 0. Here, we see that if average pulse (x) is zero, then the calorie burnage (y) is 80. So, the intercept is 80.

Sometimes the intercept has a practical meaning, sometimes not.
Does it make sense that average pulse is zero? No, you would be dead, and you certainly would not burn any calories. However, we need to include the intercept for the mathematical function to predict Calorie_Burnage correctly.

Other examples where the intercept of a mathematical function can have a practical meaning:
- Predicting next year's revenue from marketing expenditure (how much revenue will we have next year if marketing expenditure is zero?). It is reasonable to assume that a company will still have some revenue even if it does not spend money on marketing.
- Fuel usage against speed (how much fuel do we use if speed is equal to 0 mph?). A car that uses gasoline will still use fuel when it is idle.

Find the Slope and Intercept Using Python
The np.polyfit() function returns the slope and intercept. If we run the following code, we get both from the function:

import pandas as pd
import numpy as np

health_data = pd.read_csv("data.csv", header=0, sep=",")

x = health_data["Average_Pulse"]
y = health_data["Calorie_Burnage"]
slope_intercept = np.polyfit(x, y, 1)
print(slope_intercept)

o/p: [ 2. 80.]

Example Explained:
- Isolate the variables Average_Pulse (x) and Calorie_Burnage (y) from health_data.
- Call the np.polyfit() function. The last parameter specifies the degree of the polynomial, which in this case is 1.

We have now calculated the slope (2) and the intercept (80). We can write the mathematical function as follows:

f(x) = 2x + 80

Task: Now we want to predict calorie burnage if average pulse is 135. Remember that the intercept is a constant, i.e. a number that does not change. We can substitute the input x with 135:

f(135) = 2 * 135 + 80 = 350

If average pulse is 135, the calorie burnage is 350.

Define the Mathematical Function in Python
Here is the exact same mathematical function in Python. The function returns 2*x + 80, with x as the input:

#Try to replace x with 140 and 150.
def my_function(x):
    return 2*x + 80

print(my_function(135))

o/p: 350

Plot a New Graph in Python
Here we plot the same graph as earlier, but format the axes a little: the max value of the y-axis is now 400, and 150 for the x-axis:

import matplotlib.pyplot as plt

health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
plt.ylim(ymin=0, ymax=400)
plt.xlim(xmin=0, xmax=150)
plt.show()

Example Explained:
- Import the pyplot module of the matplotlib library
- Plot the data from Average_Pulse against Calorie_Burnage
- kind='line' tells us which type of plot we want. Here, we want a straight line
- plt.ylim() and plt.xlim() tell us at which values we want each axis to start and stop
- plt.show() shows us the output
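As a small bridge between the fitted coefficients and the handwritten function above: the np.polyfit() output can be turned directly into a callable prediction function. A minimal sketch; np.poly1d is a standard NumPy helper that is not used elsewhere in this book:

import pandas as pd
import numpy as np

health_data = pd.read_csv("data.csv", header=0, sep=",")

coeffs = np.polyfit(health_data["Average_Pulse"], health_data["Calorie_Burnage"], 1)
predict = np.poly1d(coeffs)  #builds f(x) = 2x + 80 from [ 2. 80.]

print(predict(135))

o/p: 350.0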
DS - Statistics
Introduction to Statistics
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. When we have created a model for prediction, we must assess the prediction's reliability. Statistics is a method of interpreting, analyzing and summarizing data. It is commonly divided into descriptive and inferential statistics: based on representations of the data, such as pie charts, bar graphs, or tables, we analyze and interpret it.

Descriptive Statistics

import pandas as pd

full_health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(full_health_data.describe())

Percentiles
25%, 50% and 75% - Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given percentage of the values are lower than.

Let us try to explain this with some examples, using Average_Pulse:
- The 25% percentile of Average_Pulse means that 25% of all the training sessions have an average pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of all the training sessions have an average pulse of 100 beats per minute or higher.
- The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average pulse of 111 or lower. If we flip the statement, it means that 25% of all the training sessions have an average pulse of 111 beats per minute or higher.

Task: Find the 10% percentile for Max_Pulse. The following example shows how to do it in Python:

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

Max_Pulse = full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)

o/p: 120.0

Example Explained:
- Max_Pulse = full_health_data["Max_Pulse"] - isolate the variable Max_Pulse from the full health data set.
- np.percentile() is used to define that we want the 10% percentile from Max_Pulse.

The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a Max_Pulse of 120 or lower.

Standard Deviation
Standard deviation is a number that describes how spread out the observations are. A mathematical function will have difficulty predicting precise values if the observations are "spread out". Standard deviation is a measure of uncertainty. A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

std = np.std(full_health_data)
print(std)

Coefficient of Variation
The coefficient of variation is used to get an idea of how large the standard deviation is relative to the mean.
Mathematically, the coefficient of variation is defined as:

Coefficient of Variation = Standard Deviation / Mean

We can do this in Python with the following code:

import pandas as pd
import numpy as np

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)

We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation compared to Max_Pulse, Average_Pulse and Hours_Sleep.

Data Science - Statistics: Variance
Variance is another number that indicates how spread out the values are. In fact, if you take the square root of the variance, you get the standard deviation. And the other way around: if you multiply the standard deviation by itself, you get the variance!

We will first use the data set with 10 observations to give an example of how we can calculate the variance. We want to find the variance of Average_Pulse.

Tip: Variance is often represented by the symbol sigma squared: σ^2

Step 1: Find the Mean
1. Find the mean:

(80 + 85 + 90 + 95 + 100 + 105 + 110 + 115 + 120 + 125) / 10 = 102.5

The mean is 102.5.

Step 2: For Each Value, Find the Difference From the Mean
2. Find the difference from the mean for each value:

80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5

Step 3: For Each Difference, Find the Square Value
3. Find the square value of each difference:

(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25

Note: We must square the values to get the total spread.

Step 4: The Variance is the Average of These Squared Values
4. Sum the squared values and find the average:

(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25

The variance is 206.25.

Use Python to Find the Variance
We can use the var() function from NumPy to find the variance. First for the small data set with 10 observations:

import numpy as np

var = np.var(health_data)
print(var)

And here we calculate the variance for each column of the full data set:

import numpy as np

var_full = np.var(full_health_data)
print(var_full)

DS - Statistics: Correlation
Correlation measures the relationship between two variables. We mentioned that a function has the purpose of predicting a value by converting an input (x) to an output (f(x)). We can also say that a function uses the relationship between two variables for prediction.

Correlation Coefficient
The correlation coefficient measures the relationship between two variables. It can never be less than -1 or higher than 1:
- 1 = there is a perfect linear relationship between the variables (like Average_Pulse against Calorie_Burnage)
- 0 = there is no linear relationship between the variables
- -1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)

Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
We will use a scatterplot to visualize the relationship between Average_Pulse and Calorie_Burnage (we use the small data set of the sports watch with 10 observations).
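Before plotting, the coefficient itself can be computed numerically. A minimal sketch using np.corrcoef (a standard NumPy function), assuming health_data is the cleaned 10-observation set with numeric columns:

import pandas as pd
import numpy as np

health_data = pd.read_csv("data.csv", header=0, sep=",").dropna(axis=0)

#np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r = np.corrcoef(health_data["Average_Pulse"], health_data["Calorie_Burnage"])
print(r[0, 1])  #1.0 when the two variables are perfectly linearly related, as here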
This time we want scatter plots, so we change kind to "scatter":

import matplotlib.pyplot as plt

health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')
plt.show()

Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
We have plotted fictional data here. The x-axis represents the number of hours worked at our job before a training session. The y-axis is Calorie_Burnage. If we work longer hours, we tend to have lower calorie burnage because we are exhausted before the training session. The correlation coefficient here is -1.

import pandas as pd
import matplotlib.pyplot as plt

negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}
negative_corr = pd.DataFrame(data=negative_corr)

negative_corr.plot(x='Hours_Work_Before_Training', y='Calorie_Burnage', kind='scatter')
plt.show()

Example of No Linear Relationship (Correlation Coefficient = 0)
Here we have plotted Max_Pulse against Duration from the full_health_data set. As you can see, there is no linear relationship between the two variables. It means that a longer training session does not lead to a higher Max_Pulse. The correlation coefficient here is 0.

import matplotlib.pyplot as plt

full_health_data.plot(x='Duration', y='Max_Pulse', kind='scatter')
plt.show()

DS - Statistics: Correlation Matrix
A matrix is an array of numbers arranged in rows and columns. A correlation matrix is simply a table showing the correlation coefficients between variables. Here, the variables are represented in the first row and in the first column. The table here uses data from the full health data set.

Observations:
- We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of 0.89. This makes sense, as the longer we train, the more calories we burn.
- We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage (correlation coefficient of 0.02).

Can we conclude that Average_Pulse does not affect Calorie_Burnage? No. We will come back to this question later!

Correlation Matrix in Python
We can use the corr() function in Python to create a correlation matrix. We also use the round() function to round the output to two decimals:

import pandas as pd

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

Corr_Matrix = round(full_health_data.corr(), 2)
print(Corr_Matrix)

Correlation Matrix Using a Heatmap
We can use a heatmap to visualize the correlation between variables: the closer the correlation coefficient is to 1, the greener the squares get; the closer it is to -1, the browner the squares get.
Use Seaborn to Create a Heatmap
We can use the Seaborn library to create a correlation heatmap (Seaborn is a visualization library based on matplotlib):

import matplotlib.pyplot as plt
import seaborn as sns

correlation_full_health = full_health_data.corr()

axis_corr = sns.heatmap(
correlation_full_health,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(50, 500, n=500),
square=True
)

plt.show()

Example Explained:
- Import the library seaborn as sns.
- Use the full_health_data set.
- Use sns.heatmap() to tell Python that we want a heatmap to visualize the correlation matrix.
- Use the correlation matrix. Define the maximal and minimal values of the heatmap. Define that 0 is the center.
- Define the colors with sns.diverging_palette. n=500 means that we want 500 shades in the same color palette.
- square=True means that we want to see squares.

DS - Statistics: Correlation vs. Causality
Correlation Does Not Imply Causality
Correlation measures the numerical relationship between two variables. A high correlation coefficient (close to 1) does not mean that we can conclude for sure that there is an actual relationship between the two variables.

A classic example:
- During the summer, the sale of ice cream at a beach increases.
- Simultaneously, drowning accidents also increase.

Does this mean that an increase in ice cream sales is a direct cause of increased drowning accidents?

The Beach Example in Python
Here we constructed a fictional data set for you to try:

import pandas as pd
import matplotlib.pyplot as plt

Drowning = {"Drowning_Accident": [20,40,60,80,100,120,140,160,180,200],
"Ice_Cream_Sale": [20,40,60,80,100,120,140,160,180,200]}
Drowning = pd.DataFrame(data=Drowning)

Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()

correlation_beach = Drowning.corr()
print(correlation_beach)

In other words: can we use ice cream sales to predict drowning accidents? The answer is: probably not. It is likely that these two variables are accidentally correlated with each other.

What causes drowning then?
- Unskilled swimmers
- Waves
- Cramp
- Seizure disorders
- Lack of supervision
- Alcohol (mis)use
- etc.

Let us reverse the argument: does a low correlation coefficient (close to zero) mean that a change in x does not affect y? Back to the question: can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation coefficient? The answer is no. There is an important difference between correlation and causality:
- Correlation is a number that measures how closely the data are related.
- Causality is the conclusion that x causes y.
Tip: Always critically reflect on the concept of causality when making predictions!

Statistics gives us methods of gaining knowledge from data.

What is Statistics Used for?
Statistics is used in all kinds of science and business applications. Statistics gives us more accurate knowledge, which helps us make better decisions. Statistics can focus on making predictions about what will happen in the future. It can also focus on explaining how different things are connected.

Typical Steps of Statistical Methods
The typical steps are:
- Gathering data
- Describing and visualizing data
- Making conclusions

It is important to keep all three steps in mind for any question we want more knowledge about. Knowing which types of data are available can tell you what kinds of questions you can answer with statistical methods. Knowing which questions you want to answer can help guide what sort of data you need. A lot of data might be available, and knowing what to focus on is important.

How is Statistics Used?
Statistics can be used to explain things in a precise way. You can use it to understand and make conclusions about the group that you want to know more about. This group is called the population.

A population could be many different kinds of groups. It could be:
- All of the people in a country
- All the businesses in an industry
- All the customers of a business
- All people that play football who are older than 45

and so on; it just depends on what you want to know about.

Gathering data about the population gives you a sample. This is a part of the whole population. Statistical methods are then used on that sample. The results of the statistical methods from the sample are used to make conclusions about the population.

Important Concepts in Statistics
- Predictions and Explanations
- Populations and Samples
- Parameters and Sample Statistics
- Sampling Methods
- Data Types
- Measurement Levels
- Descriptive Statistics
- Random Variables
- Univariate and Multivariate Statistics
- Probability Calculation
- Probability Distributions
- Statistical Inference
- Parameter Estimation
- Hypothesis Testing
- Correlation
- Regression Analysis
- Causal Inference

Statistics and Programming
Statistical analysis is typically done with computers. Small amounts of data can be analyzed reasonably well without computers; historically, all data analysis was performed manually, which was time-consuming and prone to errors. Nowadays, programming and software are typically used for data analysis. In this course, we will see examples of code to do statistics with the programming languages Python and R.

Statistics - Describing Data
Describing data is typically the second step of statistical analysis, after gathering data.

Descriptive Statistics
The information (data) from your sample or population can be visualized with graphs or summarized by numbers. This will show key information in a simpler way than just looking at raw data.
It can help us understand how the data is distributed. Graphs can visually show the data distribution. Examples of graphs include:
- Histograms
- Pie charts
- Bar graphs
- Box plots

Some graphs have a close connection to numerical summary statistics; calculating those gives us the basis of these graphs. For example, a box plot visually shows the quartiles of a data distribution. Quartiles are the data split into four equally sized parts, or quarters. A quartile is one type of summary statistic. Summary statistics take a large amount of information and sum it up in a few key values. Numbers calculated from the data also describe the shape of the distributions. These are individual 'statistics'.

Statistics - Making Conclusions
Using statistics to make conclusions about a population is called statistical inference. Statistics from the data in the sample are used to make conclusions about the whole population. This is a type of statistical inference. Probability theory is used to calculate the certainty that those statistics also apply to the population. When using a sample, there will always be some uncertainty about what the data looks like for the population. Uncertainty is often expressed as confidence intervals.

Confidence intervals are numerical ways of showing how likely it is that the true value of a statistic is within a certain range for the population.

Hypothesis testing is another way of checking if a statement about a population is true. More precisely, it checks how likely it is that a hypothesis is true based on the sample data. Some examples of statements or questions that can be checked with hypothesis testing:
- Are people in the Netherlands taller than people in Denmark?
- Do people prefer Pepsi or Coke?
- Does a new medicine cure a disease?

Causal Inference
Causal inference is used to investigate if something causes another thing. For example: does rain make plants grow? If we think two things are related, we can investigate to see if they correlate. Statistics can be used to find out how strong this relation is. Even if things are correlated, finding out whether something is caused by other things can be difficult. It can be done with good experimental design or other special statistical techniques.

Note: Good experimental design is often difficult to achieve because of ethical concerns or other practical reasons.

Statistics - Prediction and Explanation
Some types of statistical methods are focused on predicting what will happen. Other types of statistical methods are focused on explaining how things are connected.
Prediction
Some statistical methods are not focused on explaining how things are connected; only the accuracy of prediction is important. Many statistical methods are successful at predicting without giving insight into how things are connected. Some types of machine learning let computers do the hard work, but the way they predict is difficult to understand. These approaches can also be vulnerable to mistakes if the circumstances change, since how they work is less clear.

Note: Predictions about future events are called forecasts. Not all predictions are about the future. Some predictions can be about something else that is unknown, even if it is not in the future.

Explanation
Different statistical methods are often used for explaining how things are connected. These statistical methods may not make good predictions, and they often explain only small parts of the whole situation. But if you only want to know how a few things are connected, the rest might not matter. If these methods accurately explain how all the relevant things are connected, they will also be good at prediction; but managing to explain every detail is often challenging.

Sometimes we are specifically interested in figuring out if one thing causes another. This is called causal inference. If we are looking at complicated situations, many things are connected. To figure out what causes what, we need to untangle every way these things are connected.

Statistics - Population and Samples
Population: everything in the group that we want to learn about.
Sample: a part of the population.

For good statistical analysis, the sample needs to be as "similar" as possible to the population. If they are similar enough, we say that the sample is representative of the population. The sample is used to make conclusions about the whole population. If the sample is not similar enough to the whole population, the conclusions could be useless.

Statistics - Parameters and Statistics
The terms 'parameter' and (sample) 'statistic' refer to key concepts that are closely related in statistics. They are also directly connected to the concepts of populations and samples.

Parameter: a number that describes something about the whole population.
Sample statistic: a number that describes something about the sample.

The parameters are the key things we want to learn about, and they are usually unknown. Sample statistics give us estimates of parameters. There will always be some uncertainty about how accurate the estimates are; more certainty gives us more useful knowledge. For every parameter we want to learn about, we can take a sample and calculate a sample statistic, which gives us an estimate of the parameter.

Some Important Examples
Mean, median and mode are different types of averages (typical values in a population). For example:
- The typical age of people in a country
- The typical profits of a company
- The typical range of an electric car

Variance and standard deviation are two types of values describing how spread out the values are. A single class of students in a school would usually be about the same age, so the age of the students will have a low variance and standard deviation. A whole country will have people of all kinds of different ages, so the variance and standard deviation of age in the whole country would be bigger than in a single school grade.

Statistics - Study Types
A statistical study can be a part of the process of gathering data.
There are different types of studies. Some are better than others, but they might be harder to do.

Main Types of Statistical Studies
The main types of statistical studies are observational and experimental studies. We are often interested in knowing if something is the cause of another thing. Experimental studies are generally better than observational studies for investigating this, but usually require more effort. An observational study is when we observe and gather data without changing anything.

Experimental Studies
In an experimental study, the circumstances around the sample are changed. Usually, we compare two groups from a population, and these two groups are treated differently. One example is a medical study to see if a new medicine is effective: one group receives the medicine and the other does not. These are the different circumstances around those samples. We can compare the health of both groups afterwards and see if the results are different.

Experimental studies can allow us to investigate causal relationships. A well-designed experimental study can be useful, since it can isolate the relationship we are interested in from other effects. Then we can be more confident that we are measuring the true effect.

Statistics - Sample Types
A study needs participants, and there are different ways of gathering them. Some methods are better than others, but they might be more difficult.

Different Types of Sampling Methods:

Random Sampling
A random sample is where every member of the population has an equal chance of being chosen. Random sampling is the best. But it can be difficult, or impossible, to make sure that it is completely random.
Note: Every other sampling method is compared to how close it is to a random sample; the closer, the better.

Convenience Sampling
A convenience sample is where the participants that are easiest to reach are chosen.
Note: Convenience sampling is the easiest to do. In many cases this sample will not be similar enough to the population, and the conclusions can potentially be useless.

Systematic Sampling
A systematic sample is where the participants are chosen by some regular system. For example:
- The first 30 people in a queue
- Every third person on a list
- The first 10 and the last 10

Stratified Sampling
A stratified sample is where the population is split into smaller groups called 'strata'. The 'strata' can, for example, be based on demographics, like:
- Different age groups
- Professions

Stratification of the sample is the first step. Another sampling method (like random sampling) is used for the second step of choosing participants from all of the smaller groups (strata).

Clustered Sampling
A clustered sample is where the population is split into smaller groups called 'clusters'. The clusters are usually natural, like different cities in a country. The clusters are chosen randomly for the sample. All members of the chosen clusters can participate in the sample, or members can be chosen randomly from the clusters in a third step.

Statistics - Data Types
Data can be of different types, and different types require different statistical methods to analyze.

Different Types of Data
There are two main types of data: qualitative (or 'categorical') and quantitative (or 'numerical'). These main types also have different sub-types depending on their measurement level.

Qualitative Data
Information about something that can be sorted into different categories that can't be described directly by numbers.
Examples:
- Brands
- Nationality
- Professions

With categorical data we can calculate statistics like proportions. For example, the proportion of Indian people in the world, or the percentage of people who prefer one brand over another.

Quantitative Data
Information about something that is described by numbers.
Examples:
- Income
- Age
- Height

With numerical data we can calculate statistics like the average income in a country, or the range of heights of players on a football team.

Statistics - Measurement Levels
Different data types have different measurement levels. Measurement levels are important for deciding what types of statistics can be calculated and how best to present the data. The main types of data are qualitative (categories) and quantitative (numerical). These are further split into the following measurement levels, which are also called measurement 'scales'.

Nominal Level
Categories (qualitative data) without any order.
Examples:
- Brand names
- Countries
- Colors

Ordinal Level
Categories that can be ordered (from low to high), but where the precise "distance" between categories is not meaningful.
Examples:
- Letter grade scales from F to A
- Military ranks
- Level of satisfaction with a product

Consider letter grades from F to A: is the grade A precisely twice as good as a B? And is the grade B also twice as good as a C? Exactly how much distance there is between grades is not clear and precise. If the grades are based on numbers of points on a test, you can say that there is a precise "distance" on the point scale, but not between the grades themselves.

Interval Level
Data that can be ordered and where the distance between values is objectively meaningful, but where there is no natural 0-value from which the scale originates.
Examples:
- Years in a calendar
- Temperature measured in Fahrenheit

Note: Interval scales are usually invented by people, like degrees of temperature. 0 degrees Celsius is 32 degrees Fahrenheit. There is a consistent distance between each degree (for every 1 extra degree of Celsius, there is 1.8 extra degrees Fahrenheit), but the scales do not agree on where 0 degrees is.

Statistics - Descriptive Statistics
Descriptive statistics give us insight into data without having to look at all of it in detail. Getting a quick overview of how the data is distributed is an important step in statistical methods. We calculate key numerical values about the data that tell us about the distribution of the data. We also draw graphs showing visually how the data is distributed.

Key Features of Data:
- Where is the center of the data? (location)
- How much does the data vary? (scale)
- What is the shape of the data? (shape)

These can be described by summary statistics (numerical values); a small code sketch follows the next two subsections.

The Center of the Data
The center of the data is where most of the values are concentrated. Different kinds of averages, like mean, median and mode, are measures of the center.
Note: Measures of the center are also called location parameters, because they tell us something about where data is 'located' on a number line.

The Variation of the Data
The variation of the data is how spread out the data are around the center. Statistics like standard deviation, range and quartiles are measures of variation.
Note: Measures of variation are also called scale parameters.
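Here is the promised minimal sketch computing these measures of center and variation in Python. It reuses the seven example values that appear in the Average chapter later; the statistics module is part of the Python standard library:

import numpy as np
from statistics import mode

values = [40, 21, 55, 21, 48, 13, 72]

print(np.mean(values))                      #center: mean
print(np.median(values))                    #center: median
print(mode(values))                         #center: mode
print(np.std(values))                       #variation: standard deviation
print(max(values) - min(values))            #variation: range
print(np.percentile(values, [25, 50, 75]))  #variation: quartiles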
The Shape of the Data
The shape of the data refers to how the data are bunched up on either side of the center. Statistics like skew describe whether the right or left side of the center is bigger. Skew is one type of shape parameter.

Frequency Tables
One typical way of presenting data is with frequency tables. A frequency table counts and orders data into a table. Typically, the data will need to be sorted into intervals. Frequency tables are often the basis for making graphs to visually present the data.

Visualizing Data
Different types of graphs are used for different kinds of data. For example:
- Pie charts for qualitative data
- Histograms for quantitative data
- Scatter plots for bivariate data

Graphs often have a close connection to numerical summary statistics. For example, box plots show where the quartiles are. Quartiles also tell us where the minimum and maximum values, the range, the interquartile range, and the median are.

Statistics - Frequency Tables
From the frequency table of the ages of Nobel Prize winners, we can see that there is only one winner aged 10 to 19, and that the highest number of winners are in their 60s.

Relative Frequency Tables
Relative frequency means the number of times a value appears in the data relative to the total amount. A percentage is a relative frequency. Here are the relative frequencies of the ages of Nobel Prize winners: all the frequencies are divided by the total (934) to give percentages.

Cumulative Frequency Tables
Cumulative frequency counts up to a particular value. Here are the cumulative frequencies of the ages of Nobel Prize winners. Now we can see how many winners have been younger than a certain age. Cumulative frequency tables can also be made with relative frequencies (percentages).

Statistics - Histograms
A histogram visually presents quantitative data. A histogram is a widely used graph for showing the distribution of quantitative (numerical) data. It shows the frequency of values in the data, usually in intervals of values. Frequency is the number of times a value appears in the data. Each interval is represented with a bar, placed next to the other intervals on a number line. The height of the bar represents the frequency of values in that interval.

Here is a histogram of the ages of all 934 Nobel Prize winners up to the year 2020. This histogram uses age intervals of 10 to 19, 20 to 29, and so on.

Note: Histograms are similar to bar graphs, which are used for qualitative data.

Bin Width
The intervals of values are often called 'bins', and the length of an interval is called the 'bin width'. We can choose any width; it is best with a bin width that shows enough detail without being confusing. Here is a histogram of the same Nobel Prize winner data, but with bin widths of 5 instead of 10. This histogram uses age intervals of 15 to 19, 20 to 24, 25 to 29, and so on. Smaller intervals give a more detailed look at the distribution of the age values in the data.

Statistics - Bar Graphs
A bar graph visually presents qualitative data. Bar graphs are used to show the distribution of qualitative (categorical) data. They show the frequency of values in the data. Frequency is the number of times a value appears in the data. Each category is represented with a bar. The height of the bar represents the frequency of values from that category in the data.
Here is a bar graph of the number of people who have won a Nobel Prize in each category up to the year 2020: Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category.
Note: Bar graphs are similar to histograms, which are used for quantitative data.

Statistics - Pie Charts
A pie chart visually presents qualitative data. Pie charts are used to show the distribution of qualitative (categorical) data. They show the frequency or relative frequency of values in the data. Frequency is the number of times a value appeared in the data. Relative frequency is the percentage of the total. Each category is represented with a slice in the 'pie' (circle). The size of each slice represents the frequency of values from that category in the data. Here is a pie chart of the number of people who have won a Nobel Prize in each category up to the year 2020: This pie chart shows relative frequency, so each slice is sized by the percentage for each category. Some of the categories have existed longer than others, and multiple winners are also more common in some categories. So there is a different number of winners in each category.

Statistics - Box Plots
A box plot is a graph used to show key features of quantitative data. A box plot is a good way to show many important features of quantitative (numerical) data. It shows the median of the data. This is the middle value of the data and one type of average value. It also shows the range and the quartiles of the data. This tells us something about how spread out the data is.
Note: Box plots are also called 'box and whiskers plots'.
Here is a box plot of the age of all the Nobel Prize winners up to the year 2020:

Statistics - Box Plots
The median is the red line through the middle of the 'box'. We can see that it sits just above the number 60 on the number line below, so the middle value of age is 60 years. The left side of the box is the 1st quartile. This is the value that separates the first quarter, or 25% of the data, from the rest. Here, this is 51 years. The right side of the box is the 3rd quartile. This is the value that separates the first three quarters, or 75% of the data, from the rest. Here, this is 69 years.

Statistics - Box Plots
The distance between the sides of the box is called the interquartile range (IQR). This tells us where the 'middle half' of the values is. Here, half of the winners were between 51 and 69 years. The ends of the lines from the box at the left and the right are the minimum and maximum values in the data. The distance between these is called the range. The youngest winner was 17 years old, and the oldest was 97 years old. So the range of the age of winners was 80 years.
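As an illustration (not part of the original text), here is a minimal Matplotlib sketch that draws a horizontal box plot from a small, made-up list of ages:
Example
import matplotlib.pyplot as plt
ages = [17, 35, 44, 51, 55, 60, 63, 69, 75, 88, 97]  # hypothetical values
plt.boxplot(ages, vert=False)
plt.xlabel("Age")
plt.show()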
Statistics - Average
An average is a measure of where most of the values in the data are located. The center of the data is where most of the values in the data are located. Averages are measures of the location of the center. There are different types of averages. The most commonly used are:
o Mean
o Median
o Mode
Note: In statistics, averages are often referred to as 'measures of central tendency'.
For example, using the values: 40, 21, 55, 21, 48, 13, 72

Statistics - Average
Median
The median is the 'middle value' of the data. The median is found by ordering all the values in the data and picking the middle value: 13, 21, 21, 40, 48, 55, 72
The median is less influenced by extreme values in the data than the mean. Changing the last value to 356 does not change the median: 13, 21, 21, 40, 48, 55, 356
The median is still 40. Changing the last value to 356, however, changes the mean a lot:
(13 + 21 + 21 + 40 + 48 + 55 + 72)/7 = 38.57
(13 + 21 + 21 + 40 + 48 + 55 + 356)/7 = 79.14
Note: Extreme values are values in the data that are much smaller or larger than the average values in the data.

Statistics - Average
Mode
The mode is the value(s) that appear most often in the data: 40, 21, 55, 21, 48, 13, 72
Here, 21 appears two times, and the other values only once. The mode of this data is 21. Unlike the median and mean, the mode can also be used for categorical data. Categorical data can't be described directly with numbers, like names: Alice, John, Bob, Maria, John, Julia, Carol
Here, John appears two times, and the other values only once. The mode of this data is John.
Note: There can be more than one mode if multiple values appear the same number of times in the data.

Statistics - Mean
The mean is a type of average value, which describes where the center of the data is located.
Mean
The mean is usually referred to as 'the average'. The mean is the sum of all the values in the data divided by the total number of values in the data. The mean is calculated for numerical variables. A variable is something in the data that can vary, like:
• Age
• Height
• Income
Note: There are multiple types of mean values. The most common type of mean is the arithmetic mean. In this tutorial 'mean' refers to the arithmetic mean.

Statistics - Mean
Calculating the Mean
You can calculate the mean for both the population and the sample. The formula is the same, but different symbols are used to refer to the population mean (μ) and the sample mean (x̄):
μ = (sum of all values) / (number of values in the population)
x̄ = (sum of all values) / (number of values in the sample)

Statistics - Mean
Calculation with Programming
The mean can easily be calculated with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as calculating by hand becomes difficult.
Example
With Python, use the NumPy library mean() method to find the mean of the values 4, 11, 7, 14:
import numpy
values = [4,11,7,14]
x = numpy.mean(values)
print(x)
o/p: 9.0
Example
Use the R mean() function to find the mean of the values 4, 7, 11, 14:
values <- c(4,7,11,14)
mean(values)
o/p: [1] 9

Statistics - Median
The median is a type of average value, which describes where the center of the data is located. The median is the middle value in a data set ordered from low to high.
Finding the Median
The median can only be calculated for numerical variables. The formula for finding the middle value is: (n + 1) / 2, where n is the total number of observations. If the total number of observations is an odd number, the formula gives a whole number, and the value of this observation is the median.
13, 21, 21, 40, 48, 55, 72
Here, there are 7 total observations, so the formula gives (7 + 1) / 2 = 4, and the median is the 4th value. The 4th value in the ordered list is 40, so that is the median.

Statistics - Median
If the total number of observations is an even number, the formula gives a decimal number between two observations.
13, 21, 21, 40, 42, 48, 55, 72
Here, there are 8 total observations, so the formula gives (8 + 1) / 2 = 4.5, and the median is between the 4th and 5th values. The 4th and 5th values in the ordered list are 40 and 42, so the median is the mean of these two values. That is, the sum of those two values divided by 2: (40 + 42) / 2 = 41
Note: It is important that the numbers are ordered before you can find the median.

Statistics - Median
Finding the Median with Programming
The median can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.
Example
With Python, use the NumPy library median() method to find the median of the values 13, 21, 21, 40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.median(values)
print(x)
o/p: 41.0

Statistics - Mode
The mode is a type of average value, which describes where most of the data is located.
Mode
The mode is the value(s) that are the most common in the data. A dataset can have multiple values that are modes. A distribution of values with only one mode is called unimodal. A distribution of values with two modes is called bimodal. In general, a distribution with more than one mode is called multimodal. The mode can be found for both categorical and numerical data.

Statistics - Mode
Finding the Mode
Here is a numerical example: 4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12
Both 7 and 12 appear two times each, and the other values only once. The modes of this data are 7 and 12.
Here is a categorical example with names: Alice, John, Bob, Maria, John, Julia, Carol
John appears two times, and the other values only once. The mode of this data is John.

Statistics - Mode
Finding the Mode with Programming
The mode can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as calculating manually becomes difficult.
Example
With Python, use the statistics library multimode() method to find the modes of the values 4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12:
from statistics import multimode
values = [4,7,3,8,11,7,10,19,6,9,12,12]
x = multimode(values)
print(x)
o/p: [7, 12]

Statistics - Variation
Variation is a measure of how spread out the data is around the center of the data. Measures of variation are statistics of how far away the values in the observations (data points) are from each other. There are different measures of variation. The most commonly used are:
o Range
o Quartiles and Percentiles
o Interquartile Range
o Standard Deviation
Measures of variation combined with an average (measure of center) give a good picture of the distribution of the data.
Note: These measures of variation can only be calculated for numerical data.

Statistics - Variation
Range
The range is the difference between the smallest and the largest value of the data. Range is the simplest measure of variation. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the range: The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.

Statistics - Variation
Quartiles and Percentiles
Quartiles and percentiles are ways of separating equal numbers of values in the data into parts.
Quartiles are values that separate the data into four equal parts. Percentiles are values that separate the data into 100 equal parts. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the quartiles: The quartiles (Q0, Q1, Q2, Q3, Q4) are the values that separate each quarter. Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.
o Q0 is the smallest value in the data.
o Q2 is the middle value (median).
o Q4 is the largest value in the data.

Statistics - Variation
Interquartile Range
The interquartile range is the difference between the first and third quartiles (Q1 and Q3). The 'middle half' of the data is between the first and third quartile. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the interquartile range (IQR): Here, the middle half of the values is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.

Statistics - Variation
Standard Deviation
Standard deviation is the most used measure of variation. Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ). Standard deviation is important for many statistical methods. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard deviations:
Note: Values within one standard deviation (σ) are considered to be typical. Values outside three standard deviations are considered to be outliers.

Statistics - Range
The range is a measure of variation, which describes how spread out the data is.
Range
The range is the difference between the smallest and the largest value of the data. Range is the simplest measure of variation. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the range: The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.

Statistics - Range
Calculating the Range
The range can only be calculated for numerical data. First, find the smallest and largest values of this example: 13, 21, 21, 40, 48, 55, 72
Calculate the difference by subtracting the smallest from the largest: 72 - 13 = 59

Statistics - Range
Calculating the Range with Programming
The range can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.
Example
With Python, use the NumPy library ptp() method to find the range of the values 13, 21, 21, 40, 48, 55, 72:
import numpy
values = [13,21,21,40,48,55,72]
x = numpy.ptp(values)
print(x)
o/p: 59

Statistics - Quartiles and Percentiles
Quartiles and percentiles are measures of variation, which describe how spread out the data is. Quartiles and percentiles are both types of quantiles.
Quartiles
Quartiles are values that separate the data into four equal parts. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the quartiles: The quartiles (Q0, Q1, Q2, Q3, Q4) are the values that separate each quarter. Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.
Q0 is the smallest value in the data.
Q1 is the value separating the first quarter from the second quarter of the data.
Q2 is the middle value (median), separating the bottom half from the top half.
Q3 is the value separating the third quarter from the fourth quarter.
Q4 is the largest value in the data.

Statistics - Quartiles and Percentiles
Quartiles and percentiles are measures of variation, which describe how spread out the data is.
Calculating Quartiles with Programming
Quartiles can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.
Example
With Python, use the NumPy library quantile() method to find the quartiles of the values 13, 21, 21, 40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
o/p: [13. 21. 41. 49.75 72. ]

Statistics - Interquartile Range
The interquartile range is a measure of variation, which describes how spread out the data is. The interquartile range is the difference between the first and third quartiles (Q1 and Q3). The 'middle half' of the data is between the first and third quartile. The first quartile is the value in the data that separates the bottom 25% of values from the top 75%. The third quartile is the value in the data that separates the bottom 75% of the values from the top 25%. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the interquartile range (IQR):

Statistics - Interquartile Range
Here, the middle half of the values is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.

Statistics - Interquartile Range
Calculating the Interquartile Range with Programming
The interquartile range can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.
Example
With Python, use the SciPy library iqr() method to find the interquartile range of the values 13, 21, 21, 40, 42, 48, 55, 72:
from scipy import stats
values = [13,21,21,40,42,48,55,72]
x = stats.iqr(values)
print(x)
o/p: 28.75

Statistics - Standard Deviation
Standard deviation is the most commonly used measure of variation, which describes how spread out the data is. Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ). Standard deviation is important for many statistical methods. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard deviations: Each dotted line in the histogram shows a shift of one extra standard deviation. If the data is normally distributed:
o Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
o Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
o Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
Note: A normal distribution has a "bell" shape and spreads out equally on both sides.

Statistics - Standard Deviation
Calculating the Standard Deviation
You can calculate the standard deviation for both the population and the sample.
The formulas are almost the same, but use different symbols to refer to the population standard deviation (σ) and the sample standard deviation (s). The population standard deviation is calculated with this formula:
σ = √( Σ(x − μ)² / n )
The sample standard deviation is calculated with this formula:
s = √( Σ(x − x̄)² / (n − 1) )
Here, Σ(x − μ)² is the sum of the squared differences between each value and the mean, and n is the number of values.

Statistics - Standard Deviation
Calculating the Standard Deviation with Programming
The standard deviation can easily be calculated with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as calculating by hand becomes difficult.
Population Standard Deviation
Example
With Python, use the NumPy library std() method to find the population standard deviation of the values 4, 11, 7, 14:
import numpy
values = [4,11,7,14]
x = numpy.std(values)
print(x)
o/p: 3.8078865529319543
Sample Standard Deviation
Example
With Python, use the NumPy library std() method with ddof=1 to find the sample standard deviation of the values 4, 11, 7, 14 (ddof=1 makes NumPy divide by n − 1 instead of n):
import numpy
values = [4,11,7,14]
x = numpy.std(values, ddof=1)
print(x)
o/p: 4.396968652757639

Inferential Statistics

Statistics - Statistical Inference
Statistical Inference
Using data analysis and statistics to make conclusions about a population is called statistical inference. The main types of statistical inference are:
• Estimation
• Hypothesis testing
Estimation
Statistics from a sample are used to estimate population parameters. The most likely value is called a point estimate. There is always uncertainty when estimating. The uncertainty is often expressed as confidence intervals, defined by a likely lowest and highest value for the parameter. An example could be a confidence interval for the number of bicycles a Dutch person owns: "The average number of bikes a Dutch person owns is between 3.5 and 6."

Statistics - Statistical Inference
Hypothesis Testing
Hypothesis testing is a method to check if a claim about a population is true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data. There are different types of hypothesis testing. The steps of the test depend on:
• The type of data (categorical or numerical)
• Whether you are looking at:
o A single group
o Comparing one group to another
o Comparing the same group before and after a change
Some examples of claims or questions that can be checked with hypothesis testing:
o 90% of Australians are left handed
o Is the average weight of dogs more than 40 kg?
o Do doctors make more money than lawyers?

Statistics - Normal Distribution
The normal distribution is an important probability distribution used in statistics. Many real world examples of data are normally distributed.
Normal Distribution
The normal distribution is described by the mean (μ) and the standard deviation (σ).
The normal distribution is often referred to as a 'bell curve' because of its shape:
• Most of the values are around the center (μ)
• The median and mean are equal
• It has only one mode
• It is symmetric, meaning it decreases the same amount on the left and the right of the center
The area under the curve of the normal distribution represents probabilities for the data. The area under the whole curve is equal to 1, or 100%. Here is a graph of a normal distribution with probabilities between standard deviations (σ):

Statistics - Normal Distribution
• Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
• Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
• Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
Note: Probabilities of the normal distribution can only be calculated for intervals (between two values).

Statistics - Normal Distribution
Different Means and Standard Deviations
The mean describes where the center of the normal distribution is. Here is a graph showing three different normal distributions with the same standard deviation but different means. The standard deviation describes how spread out the normal distribution is. Here is a graph showing three different normal distributions with the same mean but different standard deviations: The purple curve has the biggest standard deviation and the black curve has the smallest. The area under each of the curves is still 1, or 100%.

Statistics - Normal Distribution
A Real Data Example of Normally Distributed Data
Real world data is often normally distributed. Here is a histogram of the age of Nobel Prize winners when they won the prize: The normal distribution drawn on top of the histogram is based on the population mean (μ) and standard deviation (σ) of the real data. We can see that the histogram is close to a normal distribution. Examples of real world variables that can be normally distributed:
• Test scores
• Height
• Birth weight

Statistics - Normal Distribution
Probability Distributions
Probability distributions are functions that calculate the probabilities of the outcomes of random variables. Typical examples of random variables are coin tosses and dice rolls. Here is a graph showing the results of a growing number of coin tosses and the expected values of the results (heads or tails). The expected values of the coin toss make up the probability distribution of the coin toss. Notice how the result of random coin tosses gets closer to the expected values (50%) as the number of tosses increases. Similarly, here is a graph showing the results of a growing number of dice rolls and the expected values of the results (from 1 to 6).

Statistics - Normal Distribution
Notice again how the result of random dice rolls gets closer to the expected values (1/6, or 16.666%) as the number of rolls increases.
When the random variable is a sum of dice rolls, the results and expected values take a different shape. The different shape comes from there being more ways of getting a sum near the middle than a small or large sum.

Statistics - Normal Distribution
As we keep increasing the number of dice in the sum, the shape of the results and expected values looks more and more like a normal distribution. Many real world variables follow a similar pattern and naturally form normal distributions. Normally distributed variables can be analyzed with well-known techniques.

Statistics - Standard Normal Distribution
The standard normal distribution is a normal distribution where the mean is 0 and the standard deviation is 1. Normally distributed data can be transformed into a standard normal distribution. Standardizing normally distributed data makes it easier to compare different sets of data. The standard normal distribution is used for:
• Calculating confidence intervals
• Hypothesis tests
Here is a graph of the standard normal distribution with probability values (p-values) between the standard deviations:

Statistics - Standard Normal Distribution
Standardizing makes it easier to calculate probabilities. The functions for calculating probabilities are complex and difficult to calculate by hand. Typically, probabilities are found by looking up tables of pre-calculated values, or by using software and programming. The standard normal distribution is also called the 'Z-distribution', and the values are called 'Z-values' (or Z-scores).

Statistics - Standard Normal Distribution
Z-Values
Z-values express how many standard deviations from the mean a value is. The formula for calculating a Z-value is:
Z = (x − μ) / σ
x is the value we are standardizing, μ is the mean, and σ is the standard deviation. For example, if we know that:
• The mean height of people in Germany is 170 cm (μ)
• The standard deviation of the height of people in Germany is 10 cm (σ)
• Bob is 200 cm tall (x)
Bob is 30 cm taller than the average person in Germany. 30 cm is 3 times 10 cm, so Bob's height is 3 standard deviations larger than the mean height in Germany. Using the formula:
Z = (200 − 170) / 10 = 3

Statistics - Standard Normal Distribution
Finding the P-value of a Z-Value
Using a Z-table or programming, we can calculate how many people in Germany are shorter than Bob and how many are taller.
Example
With Python, use the SciPy Stats library norm.cdf() function to find the probability of getting less than a Z-value of 3:
import scipy.stats as stats
print(stats.norm.cdf(3))
o/p: 0.9986501019683699

Statistics - Student's T Distribution
Student's t-distribution is similar to a normal distribution and is used in statistical inference to adjust for uncertainty. It is used for estimation and hypothesis testing of a population mean (average). The t-distribution is adjusted for the extra uncertainty of estimating the mean. If the sample is small, the t-distribution is wider. If the sample is big, the t-distribution is narrower. The bigger the sample size is, the closer the t-distribution gets to the standard normal distribution.

Statistics - Student's T Distribution
Notice how some of the curves have bigger tails. This is due to the uncertainty from a smaller sample size; the green curve has the smallest sample size. For the t-distribution this is expressed as 'degrees of freedom' (df), which is calculated by subtracting 1 from the sample size (n). For example, a sample size of 30 gives 29 degrees of freedom for the t-distribution.
The t-distribution is used to find critical t-values and p-values (probabilities) for estimation and hypothesis testing.
Note: Finding the critical t-values and p-values of the t-distribution is similar to finding the z-values and p-values of the standard normal distribution. But make sure to use the correct degrees of freedom.

Statistics - Student's T Distribution
Finding the P-Value of a T-Value
You can find the p-value of a t-value by using a t-table or with programming.
Example
With Python, use the SciPy Stats library t.cdf() function to find the probability of getting less than a t-value of 2.1 with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.cdf(2.1, 29))
o/p: 0.9777290209818548
Finding the T-Value of a P-Value
You can find the t-value of a p-value by using a t-table or with programming.
Example
With Python, use the SciPy Stats library t.ppf() function to find the t-value separating the top 25% from the bottom 75% with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.ppf(0.75, 29))
o/p: 0.6830438592467808

Statistics - Estimation
Point estimates are the most likely value for a population parameter. Confidence intervals express the uncertainty of an estimated population parameter. A point estimate is calculated from a sample. The point estimate depends on the type of data:
• Categorical data: the number of occurrences divided by the sample size.
• Numerical data: the mean (the average) of the sample.
One example could be: The point estimate for the average height of people in Denmark is 180 cm. Estimates are always uncertain. This uncertainty can be expressed with a confidence interval.

Statistics - Estimation
Confidence Intervals
The confidence interval is defined by a lower bound and an upper bound. This gives us a range of values that the true parameter is likely to be between. For example: The average height of people in Denmark is between 170 cm and 190 cm. Here, 170 cm is the lower bound, and 190 cm is the upper bound. The lower and upper bounds of a confidence interval are based on the confidence level.
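As an illustration (this example is not from the original text), here is a minimal SciPy sketch that computes a 95% confidence interval for a mean, using a small, made-up sample of heights and the t-distribution with n − 1 degrees of freedom:
Example
import numpy
import scipy.stats as stats
heights = [172, 181, 175, 178, 169, 184, 176, 173]  # hypothetical sample
# t.interval() gives the lower and upper bound around the sample mean
lower, upper = stats.t.interval(0.95, df=len(heights) - 1, loc=numpy.mean(heights), scale=stats.sem(heights))
print(round(lower, 1), round(upper, 1))
o/p: 171.9 180.1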
Statistics - Hypothesis Testing
Hypothesis testing is a formal way of checking if a hypothesis about a population is true or not. A hypothesis is a claim about a population parameter. A hypothesis test is a formal procedure to check if a hypothesis is true or not. Examples of claims that can be checked:
• The average height of people in Denmark is more than 170 cm.
• The share of left handed people in Australia is not 10%.
• The average income of dentists is less than the average income of lawyers.

Statistics - Hypothesis Testing
The Null and Alternative Hypothesis
Hypothesis testing is based on making two different claims about a population parameter. The null hypothesis (H0) and the alternative hypothesis (H1) are the claims. The two claims need to be mutually exclusive, meaning only one of them can be true. The alternative hypothesis is typically what we are trying to prove. For example, we want to check the following claim: "The average height of people in Denmark is more than 170 cm." In this case, the parameter is the average height of people in Denmark (μ). The null and alternative hypotheses would be:
Null hypothesis: The average height of people in Denmark is 170 cm.
Alternative hypothesis: The average height of people in Denmark is more than 170 cm.

Statistics - Hypothesis Testing
The claims are often expressed with symbols like this:
H0: μ = 170 cm
H1: μ > 170 cm
If the data supports the alternative hypothesis, we reject the null hypothesis and accept the alternative hypothesis. If the data does not support the alternative hypothesis, we keep the null hypothesis.
Note: The alternative hypothesis is also referred to as HA.

Statistics - Hypothesis Testing
The Significance Level
The significance level (α) is the uncertainty we accept when rejecting the null hypothesis in the hypothesis test. The significance level is a percentage probability of accidentally making the wrong conclusion. Typical significance levels are:
• α = 0.1 (10%)
• α = 0.05 (5%)
• α = 0.01 (1%)
A lower significance level means that the evidence in the data needs to be stronger to reject the null hypothesis. There is no "correct" significance level - it only states the uncertainty of the conclusion.
Note: A 5% significance level means that when we reject a null hypothesis, we expect to reject a true null hypothesis 5 out of 100 times.

Statistics - Hypothesis Testing
The Critical Value and P-Value Approach
There are two main approaches used for hypothesis tests: The critical value approach compares the test statistic with the critical value of the significance level. The p-value approach compares the p-value of the test statistic with the significance level.
The Critical Value Approach
The critical value approach checks if the test statistic is in the rejection region. The rejection region is an area of probability in the tails of the distribution. The size of the rejection region is decided by the significance level (α). The value that separates the rejection region from the rest is called the critical value.

Statistics - Hypothesis Testing
Here is a graphical illustration: If the test statistic is inside the rejection region, the null hypothesis is rejected. For example, if the test statistic is 2.3 and the critical value is 2 for a significance level of α = 0.05: We reject the null hypothesis (H0) at the 0.05 significance level (α).

Statistics - Hypothesis Testing
The P-Value Approach
The p-value approach checks if the p-value of the test statistic is smaller than the significance level (α). The p-value of the test statistic is the area of probability in the tails of the distribution from the value of the test statistic. Here is a graphical illustration:

Statistics - Hypothesis Testing
If the p-value is smaller than the significance level, the null hypothesis is rejected. The p-value directly tells us the lowest significance level where we can reject the null hypothesis. For example, if the p-value is 0.03:
We reject the null hypothesis (H0) at a 0.05 significance level (α)
We keep the null hypothesis (H0) at a 0.01 significance level (α)
Note: The two approaches are only different in how they present the conclusion.

Statistics - Hypothesis Testing
Steps for a Hypothesis Test
The following steps are used for a hypothesis test:
1. Check the conditions
2. Define the claims
3. Decide the significance level
4. Calculate the test statistic
5. Conclusion
One condition is that the sample is randomly selected from the population. The other conditions depend on what type of parameter you are testing the hypothesis for. Common parameters to test hypotheses for are:
o Proportions (for qualitative data)
o Mean values (for numerical data)
A minimal example of these steps in code is sketched below.
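This sketch is not from the original text. It uses SciPy's ttest_1samp() on a small, made-up sample of heights to test the earlier claim that the average height is more than 170 cm:
Example
import scipy.stats as stats
heights = [172, 181, 175, 178, 169, 184, 176, 173]  # hypothetical sample
# H0: the average height is 170 cm, H1: the average height is more than 170 cm
t_stat, p_value = stats.ttest_1samp(heights, 170, alternative='greater')
print(t_stat, p_value)
# here the p-value is well below 0.05, so we would reject H0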
We are missing one important variable that affects Calorie_Burnage: the Duration of the training session. Duration in combination with Average_Pulse will together explain Calorie_Burnage more precisely.

Data Science - Linear Regression
The term regression is used when you try to find the relationship between variables. In Machine Learning and in statistical modeling, that relationship is used to predict the outcome of events. In this module, we will cover the following questions:
• Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage?
• Can we use Average_Pulse and Duration to predict Calorie_Burnage?

Data Science - Linear Regression
Least Square Method
Linear regression uses the least square method. The concept is to draw a line through all the plotted data points. The line is positioned in a way that it minimizes the distance to all of the data points. The distances are called "residuals" or "errors". The red dashed lines represent the distance from the data points to the drawn mathematical function.

Data Science - Linear Regression
Linear Regression Using One Explanatory Variable
In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear Regression:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()

Do you think that the line is able to predict Calorie_Burnage precisely? We will show that the variable Average_Pulse alone is not enough to make a precise prediction of Calorie_Burnage.

Data Science - Linear Regression
Example Explained:
• Import the modules you need: Pandas, Matplotlib and SciPy
• Isolate Average_Pulse as x and Calorie_Burnage as y
• Get important key values with: slope, intercept, r, p, std_err = stats.linregress(x, y)
• Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed
• Run each value of the x array through the function. This will result in a new array with new values for the y-axis: mymodel = list(map(myfunc, x))
• Draw the original scatter plot: plt.scatter(x, y)
• Draw the line of linear regression: plt.plot(x, mymodel)
• Define the maximum and minimum values of the axes
• Label the axes: "Average_Pulse" and "Calorie_Burnage"

Data Science - Regression Table
Regression Table
The output from linear regression can be summarized in a regression table. The content of the table includes:
• Information about the model
• Coefficients of the linear regression function
• Regression statistics
• Statistics of the coefficients from the linear regression function
• Other information that we will not cover in this module
You can now begin your journey on analyzing advanced output!

Data Science - Regression Table
Example
import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

model = smf.ols('Calorie_Burnage ~ Average_Pulse', data = full_health_data)
results = model.fit()
print(results.summary())

Example Explained:
Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in Python.
Use the full_health_data data set. Create a model based on Ordinary Least Squares with smf.ols(). Notice that the dependent variable (Calorie_Burnage) is written first in the formula, before the explanatory variable. By calling .fit(), you obtain the variable results. This holds a lot of information about the regression model. Call summary() to get the table with the results of linear regression.

Data Science - Regression Table - Info
The "Information Part" in the Regression Table
Dep. Variable: is short for "Dependent Variable". Calorie_Burnage is here the dependent variable. The dependent variable is here assumed to be explained by Average_Pulse.
Model: OLS is short for Ordinary Least Squares. This is a type of model that uses the least square method.
Date: and Time: show the date and time the output was calculated in Python.

Data Science - Regression Table - Coefficients
Coef is short for coefficient. It is the output of the linear regression function. The linear regression function can be rewritten mathematically as:
Calorie_Burnage = 0.3296 * Average_Pulse + 346.8662
These numbers mean:
• If Average_Pulse increases by 1, Calorie_Burnage increases by 0.3296 (or 0.3 rounded)
• If Average_Pulse = 0, Calorie_Burnage is equal to 346.8662 (or 346.9 rounded)
Remember that the intercept is used to adjust the model's precision of prediction! Do you think that this is a good model?

Data Science - Regression Table - Coefficients
Define the Linear Regression Function in Python
Define the linear regression function in Python to perform predictions. What is Calorie_Burnage if Average_Pulse is 120, 130, 150 or 180?
Example
def Predict_Calorie_Burnage(Average_Pulse):
  return(0.3296*Average_Pulse + 346.8662)

print(Predict_Calorie_Burnage(120))
print(Predict_Calorie_Burnage(130))
print(Predict_Calorie_Burnage(150))
print(Predict_Calorie_Burnage(180))
o/p:
386.4182
389.7142
396.3062
406.1942

Data Science - Regression Table: P-Value
The "Statistics of the Coefficients Part" in the Regression Table:
Now, we want to test if the coefficients from the linear regression function have a significant impact on the dependent variable (Calorie_Burnage). This means that we want to prove that there exists a relationship between Average_Pulse and Calorie_Burnage, using statistical tests.

Data Science - Regression Table: P-Value
The P-value
The P-value is a statistical number used to conclude if there is a relationship between Average_Pulse and Calorie_Burnage. We test if the true value of the coefficient is equal to zero (no relationship). The statistical test for this is called hypothesis testing. A low P-value (< 0.05) means that the coefficient is likely not equal to zero. A high P-value (> 0.05) means that we cannot conclude that the explanatory variable affects the dependent variable (here: that Average_Pulse affects Calorie_Burnage). A high P-value is also called an insignificant P-value.

Data Science - Regression Table: P-Value
Hypothesis Testing
Hypothesis testing is a statistical procedure to test if your results are valid. In our example, we are testing if the true coefficient of Average_Pulse and the intercept are equal to zero. A hypothesis test has two statements: the null hypothesis and the alternative hypothesis.
The null hypothesis can be shortly written as H0, and the alternative hypothesis as HA. Mathematically written:
H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
The sign ≠ means "not equal to".

Data Science - Regression Table: P-Value
Hypothesis Testing and P-value
The null hypothesis can either be rejected or not. If we reject the null hypothesis, we conclude that there exists a relationship between Average_Pulse and Calorie_Burnage. The P-value is used for this conclusion. A common threshold of the P-value is 0.05.
Note: A P-value of 0.05 means that 5% of the time, we will falsely reject a true null hypothesis. It means that we accept that 5% of the time, we might falsely have concluded a relationship.
If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that there exists a relationship between the variables. However, the P-value of Average_Pulse is 0.824. So, we cannot conclude a relationship between Average_Pulse and Calorie_Burnage. A P-value this high means that the observed data is entirely consistent with the true coefficient of Average_Pulse being zero. The intercept is used to adjust the regression function's ability to predict more precisely. It is therefore uncommon to interpret the P-value of the intercept.

Data Science - Regression Table: R-Squared
R-Squared
R-Squared and Adjusted R-Squared describe how well the linear regression model fits the data points. The value of R-Squared is always between 0 and 1 (0% to 100%). A high R-Squared value means that many data points are close to the linear regression function line. A low R-Squared value means that the linear regression function line does not fit the data well.

Data Science - Regression Table: R-Squared
Visual Example of a Low R-Squared Value (0.00)
Our regression model shows an R-Squared value of zero, which means that the linear regression function line does not fit the data well. This can be visualized when we plot the linear regression function through the data points of Average_Pulse and Calorie_Burnage.

Data Science - Regression Table: R-Squared
Visual Example of a High R-Squared Value (0.79)
However, if we plot Duration and Calorie_Burnage, the R-Squared increases. Here, we see that the data points are close to the linear regression function line:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

x = full_health_data["Duration"]
y = full_health_data["Calorie_Burnage"]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
  return slope * x + intercept

mymodel = list(map(myfunc, x))
print(mymodel)

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Duration")
plt.ylabel("Calorie_Burnage")
plt.show()

Summary - Predicting Calorie_Burnage with Average_Pulse
How can we summarize the linear regression function with Average_Pulse as explanatory variable?
• Coefficient of 0.3296, which means that Average_Pulse has a very small effect on Calorie_Burnage.
• High P-value (0.824), which means that we cannot conclude a relationship between Average_Pulse and Calorie_Burnage.
• R-Squared value of 0, which means that the linear regression function line does not fit the data well.

Data Science - Linear Regression Case
Case: Use Duration + Average_Pulse to Predict Calorie_Burnage
Create a linear regression table with Average_Pulse and Duration as explanatory variables:
Example
import pandas as pd
import statsmodels.formula.api as smf

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

model = smf.ols('Calorie_Burnage ~ Average_Pulse + Duration', data = full_health_data)
results = model.fit()
print(results.summary())

o/p: (the regression table)

Data Science - Linear Regression Case
Example Explained:
• Import the library statsmodels.formula.api as smf. Statsmodels is a statistical library in Python.
• Use the full_health_data data set.
• Create a model based on Ordinary Least Squares with smf.ols(). Notice that the dependent variable (Calorie_Burnage) is written first in the formula.
• By calling .fit(), you obtain the variable results. This holds a lot of information about the regression model.
• Call summary() to get the table with the results of linear regression.

Data Science - Linear Regression Case
The linear regression function can be rewritten mathematically as:
Calorie_Burnage = Average_Pulse * 3.1695 + Duration * 5.8434 - 334.5194
Rounded to two decimals:
Calorie_Burnage = Average_Pulse * 3.17 + Duration * 5.84 - 334.52

Data Science - Linear Regression Case
Define the Linear Regression Function in Python
Define the linear regression function in Python to perform predictions. What is Calorie_Burnage if:
• Average pulse is 110 and the duration of the training session is 60 minutes?
• Average pulse is 140 and the duration of the training session is 45 minutes?
• Average pulse is 175 and the duration of the training session is 20 minutes?
Example
def Predict_Calorie_Burnage(Average_Pulse, Duration):
  return(3.1695 * Average_Pulse + 5.8434 * Duration - 334.5194)

print(Predict_Calorie_Burnage(110,60))
print(Predict_Calorie_Burnage(140,45))
print(Predict_Calorie_Burnage(175,20))
o/p:
364.7296
372.1636
337.01110000000006
The Answers:
• Average pulse 110 and duration 60 minutes = 365 calories
• Average pulse 140 and duration 45 minutes = 372 calories
• Average pulse 175 and duration 20 minutes = 337 calories

Data Science - Linear Regression Case
Access the Coefficients
Look at the coefficients:
• Calorie_Burnage increases by 3.17 if Average_Pulse increases by one.
• Calorie_Burnage increases by 5.84 if Duration increases by one.
Access the P-Value
Look at the P-value for each coefficient:
• The P-value is 0.00 for Average_Pulse, Duration and the intercept.
• The P-values are statistically significant for all of the variables, as they are less than 0.05.
So here we can conclude that Average_Pulse and Duration have a relationship with Calorie_Burnage.

Data Science - Linear Regression Case
Adjusted R-Squared
There is a problem with R-squared if we have more than one explanatory variable. R-squared will almost always increase if we add more variables, and will never decrease. This is because more variables give the model more freedom to fit the data points. If we add random variables that do not affect Calorie_Burnage, we risk falsely concluding that the linear regression function is a good fit. Adjusted R-squared adjusts for this problem. It is therefore better to look at the adjusted R-squared value if we have more than one explanatory variable. The Adjusted R-squared is 0.814. A high R-Squared value means that many data points are close to the linear regression function line, while a low value means that the line does not fit the data well.
Conclusion: The model fits the data points well!
Congratulations! You have now finished the final module of the data science library.

Machine Learning
• Machine Learning is making the computer learn from studying data and statistics.
• Machine Learning is a step into the direction of artificial intelligence (AI).
• Machine Learning is a program that analyses data and learns to predict the outcome.

Data Science – Intro
Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database. Example of an array: [99,86,87,88,111,86,103,87,94,78,77,85,86]
Data Types
To analyze data, it is important to know what type of data we are dealing with. We can split the data types into three main categories:
• Numerical
• Categorical
• Ordinal

Data Science – Intro
Numerical data are numbers, and can be split into two numerical categories:
• Discrete Data - numbers that are limited to integers. Example: the number of cars passing by.
• Continuous Data - numbers that can take any value. Example: the price of an item, or the size of an item.
Categorical data are values that cannot be measured up against each other. Example: a color value, or any yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. Example: school grades, where A is better than B and so on.
By knowing the data type of your data source, you will be able to know what technique to use when analyzing it.

Machine Learning - Mean Median Mode
Mean, Median, and Mode
What can we learn from looking at a group of numbers?
In Machine Learning (and in mathematics) there are often three values that interest us:
• Mean - the average value
• Median - the midpoint value
• Mode - the most common value
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
The mean is the sum of all values divided by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
The median is the middle value after the values have been sorted:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Here the median is 87. If there are two numbers in the middle, the median is the sum of those numbers divided by two:
77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103
(86+87) / 2 = 86.5
The mode is the value that appears the most number of times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86
Here 86 appears three times, so the mode is 86.

Machine Learning - Mean Median Mode
With Python, these values can be found with the NumPy and SciPy libraries:
Example
import numpy
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

print(numpy.mean(speed))   # Mean
print(numpy.median(speed)) # Median
print(stats.mode(speed))   # Mode

Machine Learning - Standard Deviation
What is Standard Deviation?
Standard deviation is a number that describes how spread out the values are. A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.
Example: This time we have registered the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
The standard deviation is 0.9, meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is 37.85.

Machine Learning - Standard Deviation
Variance
Variance is another number that indicates how spread out the values are. In fact, if you take the square root of the variance, you get the standard deviation! Or, the other way around, if you multiply the standard deviation by itself, you get the variance! To calculate the variance you have to do as follows:
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
2. For each value: find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
3. For each difference: find the square value:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
4. The variance is the average of these squared differences:
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2

Machine Learning - Standard Deviation
Luckily, NumPy has a method to calculate the variance:
Example
Use the NumPy var() method to find the variance:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
o/p: 1432.2448979591834

Machine Learning - Standard Deviation
Standard Deviation
As we have learned, the formula to find the standard deviation is the square root of the variance:
√1432.25 = 37.85
Or, as in the example from before, use NumPy to calculate the standard deviation:
Example
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
o/p: 37.84501153334721

Machine Learning - Percentiles
What are Percentiles?
Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
Example: Let's say we have an array of the ages of all the people that live on a street:
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger. The NumPy module has a method for finding the specified percentile:
Example
What is the age that 90% of the people are younger than?
import numpy
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
x = numpy.percentile(ages, 90)
print(x)
o/p: 61.0

Machine Learning - Data Distribution
Data Distribution
In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at least at an early stage of a project.
How Can We Get Big Data Sets?
To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets of any size.
Example
Create an array containing 250 random floats between 0 and 5:
import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
o/p: (an array of 250 random floating point numbers)

Machine Learning - Data Distribution
Histogram Explained
We use the array from the example above to draw a histogram with 5 bars. The histogram can be drawn with Matplotlib's hist() method, as in the normal distribution example below:
import matplotlib.pyplot as plt
plt.hist(x, 5)
plt.show()
The first bar represents how many values in the array are between 0 and 1, the second bar represents how many values are between 1 and 2, and so on. Which gives us this result:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
Note: The array values are random numbers and will not show the exact same result on your computer.

Machine Learning - Normal Data Distribution
Normal Data Distribution
In the previous chapter we learned how to create a completely random array, of a given size, and between two given values. Here we will learn how to create an array where the values are concentrated around a given value. In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula of this data distribution.
Example
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
o/p: (a bell-shaped histogram)

Machine Learning - Scatter Plot
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
• The x array represents the age of each car.
• The y array represents the speed of each car.
Example
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()

Machine Learning - Scatter Plot
Scatter Plot Explained
The x-axis represents ages, and the y-axis represents speeds. What we can read from the diagram is that the two fastest cars were both 2 years old, and the slowest car was 12 years old.
Note: It seems that the newer the car, the faster it drives, but that could be a coincidence; after all, we only registered 13 cars.

Machine Learning - Scatter Plot
Random Data Distributions
In Machine Learning the data sets can contain thousands, or even millions, of values. Let us create two arrays that are both filled with 1000 random numbers from a normal data distribution. The first array will have the mean set to 5.0 with a standard deviation of 1.0.
Machine Learning - Normal Data Distribution
Normal Data Distribution
In the previous chapter we learned how to create a completely random array of a given size, between two given values.
Here we will learn how to create an array where the values are concentrated around a given value.
In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with the formula for this distribution.
Example
A typical normal data distribution:
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()

Machine Learning - Scatter Plot
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
• The x array represents the age of each car.
• The y array represents the speed of each car.
Example
Use the scatter() method to draw a scatter plot diagram:
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()

Machine Learning - Scatter Plot
Scatter Plot Explained
The x-axis represents ages, and the y-axis represents speeds.
What we can read from the diagram is that the two fastest cars were both 2 years old, and the slowest car was 12 years old.
Note: It seems that the newer the car, the faster it drives, but that could be a coincidence; after all, we only registered 13 cars.

Machine Learning - Scatter Plot
Random Data Distributions
In Machine Learning the data sets can contain thousands, or even millions, of values.
Let us create two arrays that are both filled with 1000 random numbers from a normal data distribution.
The first array will have the mean set to 5.0 with a standard deviation of 1.0.
The second array will have the mean set to 10.0 with a standard deviation of 2.0:
Example
A scatter plot with 1000 dots:
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)
plt.scatter(x, y)
plt.show()

Machine Learning - Linear Regression
Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.
Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through all of them. This line can be used to predict future values. In Machine Learning, predicting the future is very important.

Machine Learning - Linear Regression
How Does it Work?
Python has methods for finding a relationship between data points and for drawing a line of linear regression. We will show you how to use these methods instead of going through the mathematical formula.
In the example below, the x-axis represents age, and the y-axis represents speed. We have registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we collected could be used in a linear regression:
Example
Start by drawing a scatter plot:
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()

Machine Learning - Linear Regression
Example
Import scipy and draw the line of Linear Regression:
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

Machine Learning - Linear Regression
Example Explained
Create the arrays that represent the values of the x and y axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Execute a method that returns some important key values of Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed:
def myfunc(x):
    return slope * x + intercept
Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of linear regression:
plt.plot(x, mymodel)
Display the diagram:
plt.show()
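Once the line is fitted, the same myfunc() can be used to predict new values. A minimal sketch (the printed number is what this particular data set happens to give; treat it as approximate):

from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

# Predict the speed of a 10 year old car:
print(myfunc(10))  # roughly 85.6 for this data set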
Machine Learning - Polynomial Regression
Polynomial Regression
If your data points clearly will not fit a linear regression (a straight line through all data points), they might be ideal for polynomial regression.
Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points.
How Does it Work?
Python has methods for finding a relationship between data points and for drawing a line of polynomial regression. We will show you how to use these methods instead of going through the mathematical formula.
In the example below, we have registered 18 cars as they were passing a certain tollbooth. We have registered each car's speed and the time of day (hour) the passing occurred.
The x-axis represents the hours of the day and the y-axis represents the speed:

Machine Learning - Polynomial Regression
Example
Start by drawing a scatter plot:
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
plt.scatter(x, y)
plt.show()

Machine Learning - Polynomial Regression
Example
Import numpy and matplotlib, then draw the line of Polynomial Regression:
import numpy
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

Machine Learning - Polynomial Regression
Example Explained
Import the modules you need:
import numpy
import matplotlib.pyplot as plt
Create the arrays that represent the values of the x and y axis:
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
NumPy has a method that lets us make a polynomial model:
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
Then specify how the line will display; we start at position 1 and end at position 22:
myline = numpy.linspace(1, 22, 100)
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of polynomial regression:
plt.plot(myline, mymodel(myline))
Display the diagram:
plt.show()
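How well does the polynomial line fit the data points? A common check is the r2_score() function from sklearn (the R-squared value ranges from 0, no relationship, to 1, a perfect fit; it is discussed further in the Train/Test chapter). A minimal sketch on the same data; the exact number is indicative:

import numpy
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

print(r2_score(y, mymodel(x)))  # roughly 0.94: a very good fit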
Machine Learning - Multiple Regression
Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict a value based on two or more variables.
Take a look at the data set below; it contains some information about cars. We can predict the CO2 emission of a car based on the size of the engine, but with multiple regression we can throw in more variables, like the weight of the car, to make the prediction more accurate.

Machine Learning - Multiple Regression
How Does it Work?
In Python we have modules that will do the work for us. Start by importing the Pandas module:
import pandas
The Pandas module allows us to read csv files and return a DataFrame object. The file is meant for testing purposes only; you can download it here: cars.csv
df = pandas.read_csv("cars.csv")
Then make a list of the independent values and call this variable X. Put the dependent values in a variable called y.
X = df[['Weight', 'Volume']]
y = df['CO2']
Tip: It is common to name the list of independent values with an upper case X, and the list of dependent values with a lower case y.

Machine Learning - Multiple Regression
We will use some methods from the sklearn module, so we have to import that module as well:
from sklearn import linear_model
From the sklearn module we will use the LinearRegression() method to create a linear regression object.
This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship:
regr = linear_model.LinearRegression()
regr.fit(X, y)
Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume:
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])

Machine Learning - Multiple Regression
Example
See the whole example in action:
import pandas
from sklearn import linear_model

df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)
O/P: [107.2087328]

Machine Learning - Multiple Regression
Coefficient
Result Explained
The result array represents the coefficient values of weight and volume:
Weight: 0.00755095
Volume: 0.00780526
These values tell us that if the weight increases by 1 kg, the CO2 emission increases by 0.00755095 g, and if the engine size (Volume) increases by 1 cm3, the CO2 emission increases by 0.00780526 g.
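The "result array" referred to above comes from the fitted regression object: sklearn exposes one coefficient per independent variable through the coef_ attribute, which the text above quotes but does not show being printed. A minimal sketch, assuming the same cars.csv file:

import pandas
from sklearn import linear_model

df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# One coefficient per independent variable, in the order of the columns in X:
print(regr.coef_)  # expected: [0.00755095 0.00780526]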
I think that is a fair guess, but let us test it! We have already predicted that if a car with a 1300cm3 engine weighs 2300kg, the CO2 emission will be approximately 107g. What if we increase the weight by 1000kg?

Machine Learning - Multiple Regression
Coefficient Example
Copy the example from before, but change the weight from 2300 to 3300:
import pandas
from sklearn import linear_model

df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

predictedCO2 = regr.predict([[3300, 1300]])
print(predictedCO2)
Result: [114.75968007]
We have predicted that a car with a 1.3 liter engine and a weight of 3300 kg will release approximately 115 grams of CO2 for every kilometer it drives, which shows that the coefficient of 0.00755095 is correct:
107.2087328 + (1000 * 0.00755095) = 114.75968

Machine Learning - Train/Test
Evaluate Your Model
In Machine Learning we create models to predict the outcome of certain events, like in the previous chapter, where we predicted the CO2 emission of a car when we knew the weight and engine size.
To measure if the model is good enough, we can use a method called Train/Test.
What is Train/Test?
o Train/Test is a method to measure the accuracy of your model.
o It is called Train/Test because you split the data set into two sets: a training set and a testing set.
o 80% for training, and 20% for testing.
o You train the model using the training set.
o You test the model using the testing set.
o Train the model means create the model.
o Test the model means test the accuracy of the model.

Machine Learning - Train/Test
Start With a Data Set
Start with a data set you want to test. Our data set illustrates 100 customers in a shop, and their shopping habits.
Example
import numpy
import matplotlib.pyplot as plt

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

plt.scatter(x, y)
plt.show()
Result: The x-axis represents the number of minutes before making a purchase. The y-axis represents the amount of money spent on the purchase.

Machine Learning - Train/Test
Split Into Train/Test
The training set should be a random selection of 80% of the original data. The testing set should be the remaining 20%.
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
Display the Training Set
Display the same scatter plot with the training set:
Example
plt.scatter(train_x, train_y)
plt.show()
The testing set also looks like the original data set.

Machine Learning - Train/Test
Fit the Data Set
What does the data set look like? In my opinion, the best fit would be a polynomial regression, so let us draw a line of polynomial regression.
To draw a line through the data points, we use the plot() method of the matplotlib module:
Example
Draw a polynomial regression line through the data points:
import numpy
import matplotlib.pyplot as plt

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
myline = numpy.linspace(0, 6, 100)

plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()

Machine Learning - Train/Test
The result can back my suggestion that the data set fits a polynomial regression, even though it would give us some weird results if we try to predict values outside of the data set. Example: the line indicates that a customer spending 6 minutes in the shop would make a purchase worth 200. That is probably a sign of overfitting.
But what about the R-squared score? The R-squared score is a good indicator of how well my data set is fitting the model.
R2
Remember R2, also known as R-squared? It measures the relationship between the x-axis and the y-axis, and the value ranges from 0 to 1, where 0 means no relationship and 1 means totally related.
The sklearn module has a method called r2_score() that will help us find this relationship. In this case we would like to measure the relationship between the minutes a customer stays in the shop and how much money they spend.

Machine Learning - Train/Test
R2 Example
How well does my training data fit a polynomial regression?
import numpy
from sklearn.metrics import r2_score

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

r2 = r2_score(train_y, mymodel(train_x))
print(r2)
O/P: 0.7988645544629795
Note: The result 0.799 shows that there is an OK relationship.

Machine Learning - Train/Test
Bring in the Testing Set
Now we have made a model that is OK, at least when it comes to training data. We want to test the model with the testing data as well, to see if it gives us the same result.
import numpy
from sklearn.metrics import r2_score

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

r2 = r2_score(test_y, mymodel(test_x))
print(r2)
Note: The result 0.809 shows that the model fits the testing set as well, and we are confident that we can use the model to predict future values.
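A side note on the split: slicing with x[:80] works here because the generated data is already in random order. For real data you usually want a randomized split, and a common way to get one is sklearn's train_test_split(). A minimal sketch, assuming the same x and y arrays:

import numpy
from sklearn.model_selection import train_test_split

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# 80% for training, 20% for testing, shuffled at random:
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2)

print(len(train_x), len(test_x))  # 80 20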
Machine Learning - Train/Test
Predict Values
Now that we have established that our model is OK, we can start predicting new values.
Example
How much money will a buying customer spend if he or she stays in the shop for 5 minutes?
print(mymodel(5))
The example predicted the customer to spend 22.88 dollars, which seems to correspond to the diagram.

Machine Learning - Decision Tree
Decision Tree
In this chapter we will show you how to make a "Decision Tree". A Decision Tree is a flow chart, and can help you make decisions based on previous experience.
In the example, a person will try to decide if he/she should go to a comedy show or not.
Luckily our example person has registered every time there was a comedy show in town, registered some information about the comedian, and also registered if he/she went or not.

Machine Learning - Decision Tree
Now, based on this data set, Python can create a decision tree that can be used to decide if any new shows are worth attending.

Machine Learning - Decision Tree
How Does it Work?
First, import the modules you need, and read the dataset with pandas:
Example
Read and print the data set:
import pandas
from sklearn import tree
import pydotplus
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg

df = pandas.read_csv("shows.csv")
print(df)
o To make a decision tree, all data has to be numerical.
o We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.
o Pandas has a map() method that takes a dictionary with information on how to convert the values:
o {'UK': 0, 'USA': 1, 'N': 2}
o means convert the value 'UK' to 0, 'USA' to 1, and 'N' to 2 (see the sketch after this list).
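Applied to the shows.csv columns, the conversion looks like this (the same mapping is used again in the prediction example later in this chapter):

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)

d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

print(df)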
Then we have to separate the feature columns from the target column. The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.
Example
X is the feature columns, y is the target column:
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)

Machine Learning - Decision Tree
Now we can create the actual decision tree, fit it with our details, and save a .png file on the computer:
Example
Create a Decision Tree, save it as an image, and show the image:
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

data = tree.export_graphviz(dtree, out_file=None, feature_names=features)
graph = pydotplus.graph_from_dot_data(data)
graph.write_png('mydecisiontree.png')

img = pltimg.imread('mydecisiontree.png')
imgplot = plt.imshow(img)
plt.show()

Machine Learning - Decision Tree
Result Explained
The decision tree uses your earlier decisions to calculate the odds of you wanting to go see a comedian or not. Let us read the different aspects of the decision tree:
Rank
Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).
gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would mean all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.
samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is the first step.
value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".

Machine Learning - Decision Tree
Gini
There are many ways to split the samples; we use the GINI method in this tutorial.
The Gini method uses this formula:
Gini = 1 - (x/n)² - (y/n)²
where x is the number of positive answers ("GO"), n is the number of samples, and y is the number of negative answers ("NO"), which gives us this calculation:
1 - (7/13)² - (6/13)² = 0.497
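To make the formula concrete, here is a small sketch that recomputes the gini values of the tree's nodes by hand (the counts come from the node descriptions in this chapter):

def gini(go, no):
    # Gini impurity for a node with `go` positive and `no` negative samples
    n = go + no
    return 1 - (go / n) ** 2 - (no / n) ** 2

print(round(gini(7, 6), 3))  # 0.497 - the root node (all 13 comedians)
print(round(gini(7, 1), 3))  # 0.219 - the 8 comedians with Rank > 6.5
print(round(gini(0, 5), 3))  # 0.0   - a pure node: all 5 samples got "NO"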
The next step contains two boxes: one box for the comedians with a 'Rank' of 6.5 or lower, and one box for the rest.

Machine Learning - Decision Tree
True - 5 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 5 means that there are 5 comedians left in this branch (5 comedians with a Rank of 6.5 or lower).
value = [5, 0] means that 5 will get a "NO" and 0 will get a "GO".
False - 8 Comedians Continue:
Nationality
Nationality <= 0.5 means that the comedians with a nationality value of less than 0.5 will follow the arrow to the left (which means everyone from the UK), and the rest will follow the arrow to the right.
gini = 0.219 means that about 22% of the samples would go in one direction.
samples = 8 means that there are 8 comedians left in this branch (8 comedians with a Rank higher than 6.5).
value = [1, 7] means that of these 8 comedians, 1 will get a "NO" and 7 will get a "GO".

Machine Learning - Decision Tree
True - 4 Comedians Continue:
Age
Age <= 35.5 means that comedians at the age of 35.5 or younger will follow the arrow to the left, and the rest will follow the arrow to the right.
gini = 0.375 means that about 37.5% of the samples would go in one direction.
samples = 4 means that there are 4 comedians left in this branch (4 comedians from the UK).
value = [1, 3] means that of these 4 comedians, 1 will get a "NO" and 3 will get a "GO".
False - 4 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 4 means that there are 4 comedians left in this branch (4 comedians not from the UK).
value = [0, 4] means that of these 4 comedians, 0 will get a "NO" and 4 will get a "GO".

Machine Learning - Decision Tree
True - 2 Comedians End Here:
gini = 0.0 means all of the samples got the same result.
samples = 2 means that there are 2 comedians left in this branch (2 comedians at the age of 35.5 or younger).
value = [0, 2] means that of these 2 comedians, 0 will get a "NO" and 2 will get a "GO".
False - 2 Comedians Continue:
Experience
Experience <= 9.5 means that comedians with 9.5 years of experience or less will follow the arrow to the left, and the rest will follow the arrow to the right.
gini = 0.5 means that 50% of the samples would go in one direction.
samples = 2 means that there are 2 comedians left in this branch (2 comedians older than 35.5).
value = [1, 1] means that of these 2 comedians, 1 will get a "NO" and 1 will get a "GO".

Machine Learning - Decision Tree
True - 1 Comedian Ends Here:
gini = 0.0 means all of the samples got the same result.
samples = 1 means that there is 1 comedian left in this branch (1 comedian with 9.5 years of experience or less).
value = [0, 1] means that 0 will get a "NO" and 1 will get a "GO".
False - 1 Comedian Ends Here:
gini = 0.0 means all of the samples got the same result.
samples = 1 means that there is 1 comedian left in this branch (1 comedian with more than 9.5 years of experience).
value = [1, 0] means that 1 will get a "NO" and 0 will get a "GO".

Machine Learning - Decision Tree
Predict Values
We can use the Decision Tree to predict new values.
Example: Should I go see a show starring a 40-year-old American comedian with 10 years of experience and a comedy ranking of 7?
import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

df = pandas.read_csv("shows.csv")

d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

print(dtree.predict([[40, 10, 7, 1]]))
print("[1] means 'GO'")
print("[0] means 'NO'")

Machine Learning - Decision Tree
What would the answer be if the comedy rank was 6? Copy the example from before and change only the prediction line:
print(dtree.predict([[40, 10, 6, 1]]))

Machine Learning - Decision Tree
Different Results
You will see that the Decision Tree gives you different results if you run it enough times, even if you feed it with the same data. That is because the Decision Tree does not give us a 100% certain answer. It is based on the probability of an outcome, and the answer will vary.
Python List VS Array VS Tuple - List
• The following are the main characteristics of a List:
• The list is an ordered collection of data types.
• The list is mutable.
• Lists are dynamic and can contain objects of different data types.
• List elements can be accessed by index number.
# Python program to demonstrate List
mylist = ["mango", "strawberry", "orange", "apple", "banana"]
print(mylist)

# we can specify the range of the
# index by specifying where to start
# and where to end
print(mylist[2:4])

# we can also change the item in the
# list by using its index number
mylist[1] = "grapes"
print(mylist[1])
Output:
['mango', 'strawberry', 'orange', 'apple', 'banana']
['orange', 'apple']
grapes

Python List VS Array VS Tuple - Array
• An array is an ordered collection of items of the same data type.
• An array is mutable.
• An array can be accessed by using its index number.
# Python program to demonstrate
# Creation of Array

# importing "array" for array creations
import array as arr

# creating an array with integer type
a = arr.array('i', [1, 2, 3])

# printing original array
print("The new created array is : ", end=" ")
for i in range(0, 3):
    print(a[i], end=" ")
print()

# creating an array with float type
b = arr.array('d', [2.5, 3.2, 3.3])

# printing original array
print("The new created array is : ", end=" ")
for i in range(0, 3):
    print(b[i], end=" ")
O/P:
The new created array is : 1 2 3
The new created array is : 2.5 3.2 3.3

Python List VS Array VS Tuple - Tuple
• Tuples are immutable and can store any data type.
• A tuple is defined using ().
• It cannot be changed or replaced as it is an immutable data type.
mytuple = ("orange", "apple", "banana")
print(mytuple)

# we can access the items in
# the tuple by its index number
print(mytuple[2])

# we can specify the range of the
# index by specifying where to start
# and where to end
print(mytuple[0:2])
Output:
('orange', 'apple', 'banana')
banana
('orange', 'apple')

Python Set
• Sets are used to store multiple items in a single variable.
• Set items are unordered, unchangeable, and do not allow duplicate values.
• Sets cannot have two items with the same value.
Days1 = {"Monday", "Tuesday", "Wednesday", "Thursday", "Sunday"}
Days2 = {"Friday", "Saturday", "Sunday"}

print(Days1 | Days2)             #printing the union of the sets
print(Days1.union(Days2))        #printing the union of the sets

print(Days1 & Days2)             #prints the intersection of the two sets
print(Days1.intersection(Days2)) #prints the intersection of the two sets

Days3 = Days1.intersection(Days2)
print(Days3)

Python Pandas Cheat Sheet
Simple, expressive, and arguably one of the most important libraries in Python: not only does Pandas make real-world data analysis significantly easier, it is also optimized to be significantly fast.
Import Convention: We need to import the library before we get started.
import pandas as pd
Pandas Data Structures: We have two types of data structures in Pandas, Series and DataFrame.
Series
A Series is a one-dimensional labeled array that can hold any data type.
DataFrame
A DataFrame is a two-dimensional, potentially heterogeneous tabular data structure. Or we can say a Series is the data structure for a single column of a DataFrame.
Now let us see some examples of Series and DataFrames for better understanding (a sketch follows below).
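The original slides showed these examples as images; here is a minimal stand-in sketch with made-up values illustrating a Series and a DataFrame:

import pandas as pd

# A Series: one labeled column of values
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s)

# A DataFrame: a table; each column behaves like a Series
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [88, 92]})
print(df)
print(df["score"])  # selecting one column returns a Series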
Python Pandas Cheat Sheet
Reader and Writer Functions: The Pandas library offers a set of reader functions that can be performed on a wide range of file formats and return a Pandas object. Here is a list of reader functions. Similarly, we have a list of write operations which are useful while writing data into a file.
pd.read_csv("filename")
pd.read_table("filename")
pd.read_excel("filename")
pd.read_sql(query, connection_object)
pd.read_json(json_string)
df.to_csv("filename")
df.to_excel("filename")
df.to_sql(table_name, connection_object)
df.to_json("filename")

Python Pandas Cheat Sheet
Operations:
Create Test/Fake Data: The Pandas library allows us to create fake or test data in order to test our code segments. Check out the examples given below.
pd.DataFrame(np.random.rand(4,3)) – 3 columns and 4 rows of random floats
pd.Series(new_series) – Creates a series from an iterable new_series
View DataFrame contents:
df.head(n) – look at the first n rows of the DataFrame.
df.tail(n) – look at the last n rows of the DataFrame.
df.shape – gives the number of rows and columns.
df.info() – information about the index, datatypes, and memory.
df.describe() – summary statistics for numerical columns.

Python Pandas Cheat Sheet
Selecting: We want to select and have a look at a chunk of data from our DataFrame. There are two ways of achieving this: first, selecting by position, and second, selecting by label.
Selecting by position using iloc:
df.iloc[0] – Select the first row of the data frame
df.iloc[1] – Select the second row of the data frame
df.iloc[-1] – Select the last row of the data frame
df.iloc[:,0] – Select the first column of the data frame
df.iloc[:,1] – Select the second column of the data frame

Python Pandas Cheat Sheet
Sorting: Another very simple yet useful feature offered by Pandas is the sorting of a DataFrame.
df.sort_index() – Sorts by labels along an axis
df.sort_values(column1) – Sorts values by column1 in ascending order
df.sort_values(column2, ascending=False) – Sorts values by column2 in descending order

Python Pandas Cheat Sheet
Groupby: Using the groupby technique you can create a grouping of categories, which is helpful when applying a function to each category. This simple yet valuable technique is used widely in data science.
df.groupby(column) – Returns a groupby object for values from one column
df.groupby([column1, column2]) – Returns a groupby object for values from multiple columns
df.groupby(column1)[column2].mean() – Returns the mean of the values in column2, grouped by the values in column1
df.groupby(column1)[column2].median() – Returns the median of the values in column2, grouped by the values in column1
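A small worked illustration of groupby on a made-up DataFrame (the column names here are our own, for demonstration only):

import pandas as pd

df = pd.DataFrame({
    "category": ["fruit", "fruit", "veg", "veg"],
    "price": [3.0, 5.0, 2.0, 4.0],
})

# Mean price per category:
print(df.groupby("category")["price"].mean())
# fruit    4.0
# veg      3.0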
Python Pandas Cheat Sheet
Functions: There are some special methods available in Pandas which make our calculations easier. Let's apply those methods to our Product_Review DataFrame:
Mean: df.mean() – mean of all columns
Median: df.median() – median of each column
Standard Deviation: df.std() – standard deviation of each column
Max: df.max() – highest value in each column
Min: df.min() – lowest value in each column
Count: df.count() – number of non-null values in each DataFrame column
Describe: df.describe() – summary statistics for numerical columns

Python Pandas Cheat Sheet
Plotting: Data visualization with Pandas is carried out in the following ways.
Histogram: df.plot.hist()
Scatter Plot: df.plot.scatter(x='column1', y='column2')
Note: Call %matplotlib inline to set up plotting inside the Jupyter notebook.

Pandas Project
Creating a Pandas DataFrame: (the project walkthrough slides were images and are not reproduced here)

Data Science Project Ideas
Data Science continues to grow in popularity as a promising career path for this era. It is one of the most exciting and attractive options available. Demand for Data Scientists is increasing in the market, and according to recent reports it will skyrocket in the coming years. Data Science encompasses a wide range of scientific methods, procedures, techniques, and information retrieval systems to detect meaningful patterns in structured and unstructured data. More opportunities emerge in the market as more industries recognize the value of Data Science.

Gender and Age Detection Python Project
First, let us introduce the terminology used in this advanced Python project on gender and age detection.
What is Computer Vision?
Computer Vision is the field of study that enables computers to see and identify digital images and videos as a human would. The challenges it faces largely follow from the limited understanding of biological vision. Computer Vision involves acquiring, processing, analyzing, and understanding digital images to extract high-dimensional data from the real world in order to generate symbolic or numerical information which can then be used to make decisions. The process often includes practices like object recognition, video tracking, motion estimation, and image restoration.
What is OpenCV?
OpenCV is short for Open Source Computer Vision. Intuitively by the name, it is an open-source Computer Vision and Machine Learning library.
This library is capable of processing real-time images and video, while also boasting analytical capabilities. It supports the Deep Learning frameworks TensorFlow, Caffe, and PyTorch.
What is a CNN?
A Convolutional Neural Network is a deep neural network (DNN) widely used for image recognition, image processing, and NLP. Also known as a ConvNet, a CNN has input and output layers, and multiple hidden layers, many of which are convolutional. In a way, CNNs are regularized multilayer perceptrons.

Gender and Age Detection Python Project - Objective
To build a gender and age detector that can approximately guess the gender and age of the person (face) in a picture using Deep Learning on the Adience dataset.

Gender and Age Detection - About the Project
In this Python project, we will use Deep Learning to identify the gender and age of a person from a single image of a face. We will use the models trained by Tal Hassner and Gil Levi. The predicted gender may be one of 'Male' or 'Female', and the predicted age may be one of the following ranges: (0-2), (4-6), (8-12), (15-20), (25-32), (38-43), (48-53), (60-100) (8 nodes in the final softmax layer).
It is very difficult to accurately guess an exact age from a single image because of factors like makeup, lighting, obstructions, and facial expressions. And so, we make this a classification problem instead of a regression problem.
The CNN Architecture
The convolutional neural network for this Python project has 3 convolutional layers:
• Convolutional layer; 96 nodes, kernel size 7
• Convolutional layer; 256 nodes, kernel size 5
• Convolutional layer; 384 nodes, kernel size 3
It has 2 fully connected layers, each with 512 nodes, and a final output layer of softmax type.
To go about the Python project, we'll:
Detect faces
Classify into Male/Female
Classify into one of the 8 age ranges
Put the results on the image and display it

The Dataset
For this Python project, we'll use the Adience dataset; the dataset is available in the public domain and you can find it here. This dataset serves as a benchmark for face photos and is inclusive of various real-world imaging conditions like noise, lighting, pose, and appearance. The images have been collected from Flickr albums and distributed under the Creative Commons (CC) license. It has a total of 26,580 photos of 2,284 subjects in eight age ranges (as mentioned above) and is about 1GB in size. The models we will use have been trained on this dataset.

Prerequisites
You'll need to install OpenCV (cv2) to be able to run this project. You can do this with pip:
pip install opencv-python
Other packages you'll be needing are math and argparse, but those come as part of the standard Python library.

Steps for practicing the gender and age detection Python project
1. Download this zip. Unzip it and put its contents in a directory you'll call gad. The contents of this zip are:
• opencv_face_detector.pbtxt
• opencv_face_detector_uint8.pb
• age_deploy.prototxt
• age_net.caffemodel
• gender_deploy.prototxt
• gender_net.caffemodel
• a few pictures to try the project on
For face detection, we have a .pb file: this is a protobuf (protocol buffer) file; it holds the graph definition and the trained weights of the model. We can use this to run the trained model. And while a .pb file holds the protobuf in binary format, one with the .pbtxt extension holds it in text format. These are TensorFlow files.
For age and gender, the .prototxt files describe the network configuration and the .caffemodel files define the internal states of the parameters of the layers.
2. We use the argparse library to create an argument parser so we can get the image argument from the command prompt. We make it parse the argument holding the path to the image to classify gender and age for.
3. For face, age, and gender, initialize the protocol buffer and model files.
4. Initialize the mean values for the model and the lists of age ranges and genders to classify from.
5. Now, use the readNet() method to load the networks. The first parameter holds the trained weights and the second carries the network configuration.
6. Let's capture the video stream in case you'd like to classify on a webcam's stream. Set padding to 20.
7. Now, until any key is pressed, we read the stream and store the content in the names hasFrame and frame. If it isn't a video, it must wait, and so we call up waitKey() from cv2, then break.
8. Let's make a call to the highlightFace() function with the faceNet and frame parameters, and store what it returns in the names resultImg and faceBoxes. If we got 0 faceBoxes, it means there was no face to detect. Inside highlightFace():
• Here, net is faceNet: this model is the DNN Face Detector and holds only about 2.7MB on disk.
• Create a shallow copy of frame and get its height and width.
• Create a blob from the shallow copy.
• Set the input and make a forward pass through the network.
• faceBoxes is an empty list now. For each value in 0 to 127, define the confidence (between 0 and 1). Wherever we find a confidence greater than the confidence threshold, which is 0.7, we get the x1, y1, x2, and y2 coordinates and append a list of those to faceBoxes.
• Then, we put up rectangles on the image for each such list of coordinates and return two things: the shallow copy and the list of faceBoxes.
9. But if there are indeed faceBoxes, for each of those we define the face and create a 4-dimensional blob from the image. In doing this, we scale it, resize it, and pass in the mean values.
10. We feed the input and give the network a forward pass to get the confidence of the two classes. Whichever is higher, that is the gender of the person in the picture.
11. Then, we do the same thing for age.
12. We'll add the gender and age texts to the resulting image and display it with imshow() (a condensed sketch of these steps follows below).
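A condensed sketch of steps 3-11, assuming the model files from the zip are in the working directory. The mean values and class lists below are the ones commonly used with the Levi-Hassner models; the image name is a placeholder, and for brevity this sketch classifies a whole image instead of first cropping the face with highlightFace():

import cv2

# Step 3/5: load the networks (first parameter: trained weights, second: configuration)
genderNet = cv2.dnn.readNet('gender_net.caffemodel', 'gender_deploy.prototxt')
ageNet = cv2.dnn.readNet('age_net.caffemodel', 'age_deploy.prototxt')

# Step 4: mean values and class lists (as used with the Levi-Hassner models)
MODEL_MEAN_VALUES = (78.4263377603, 87.7689143744, 114.895847746)
genderList = ['Male', 'Female']
ageList = ['(0-2)', '(4-6)', '(8-12)', '(15-20)',
           '(25-32)', '(38-43)', '(48-53)', '(60-100)']

# Step 9: build a 4-dimensional blob (here from a whole image; 'sample.jpg' is hypothetical)
face = cv2.imread('sample.jpg')
blob = cv2.dnn.blobFromImage(face, 1.0, (227, 227), MODEL_MEAN_VALUES, swapRB=False)

# Step 10: forward pass for gender; the class with the higher confidence wins
genderNet.setInput(blob)
gender = genderList[genderNet.forward()[0].argmax()]

# Step 11: same thing for age
ageNet.setInput(blob)
age = ageList[ageNet.forward()[0].argmax()]

print(f'Gender: {gender}, Age: {age}')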
Python Project Examples for Gender and Age Detection
Let's try this gender and age classifier out on some of our own images now. We'll get to the command prompt, run our script with the image option, and specify an image to classify:
Python Project Example 1 O/P: (output image)
Python Project Example 2 O/P: (output image)
Python Project Example 3 O/P: (output image)
Python Project Example 4 O/P: (output image)
Python Project Example 5 O/P: (output image)

Finished this Book ………… Congrats! You got your Data Scientist Job.
Author - Rohit Dubey, Data Science Trainer