Statistics

Google Colaboratory
https://colab.research.google.com/drive/1b1tb8yA15D-ltAAK7TObj-_0NQB2JLSp#scrollTo=LI5QLn7xNMrX

What is Statistics?
Statistics is the science of collecting, organizing and analysing data → better decision making.

What is Data?
Facts or pieces of information that can be measured. Ex. the IQ of students in a class.

Types of Statistics:
Descriptive Statistics: consists of organizing and summarizing the data. Ex. what is the average mark of the students in the classroom?
Inferential Statistics: techniques wherein we use the data we have measured to form conclusions. Ex. are the ages of the students in this classroom similar to the ages of the students in the college's math class?

What is population and sample?
Population: the whole data. Ex. elections → all voters of a state. Size written as capital N.
Sample: a subset of the population. Ex. to learn whom people voted for, you don't go to every single voter (too hectic); you choose people at random from the population and make assumptions based on the sample result. Size written as small n.

Things to be careful about when creating samples:
- Randomness
- Sample size
- Representativeness

Parameter vs Statistic:
A parameter is a characteristic of the population; it is generally unknown and is estimated by a statistic.
A statistic is a characteristic of a sample. The goal of statistical inference is to use the information obtained from the sample to make inferences about the population parameter.

Sampling Techniques (a code sketch follows at the end of this section):
1. Simple Random Sampling:
   a. Every member of the population has an equal chance of being selected for the sample.
2. Stratified Sampling:
   a. The population is split into non-overlapping groups (strata) and members are sampled from each stratum.
   b. Ex. gender → male, female → survey
   c. Ex. age → (0-10)(10-20)(20-30) → non-overlapping groups
3. Systematic Sampling:
   a. From the population N, pick every nth individual.
   b. Ex. mall → survey → ask every 7th person that I see to take the survey.
4. Convenience Sampling:
   a. Only the people who are easy to reach and willing end up participating in the survey.
   b. Ex. data science → survey → anyone who is interested in data science and has some knowledge of it.

Variable:
A variable is a property that can take on any value. Ex. height, weight.
1. Quantitative Variable
   a. Measured numerically → can add, subtract, multiply, divide
   b. Discrete variable → whole numbers → ex. number of bank accounts
   c. Continuous variable → any value in a range → ex. height = 174.56
2. Qualitative/Categorical Variable
   a. Derived from some characteristic, not a numerical measurement.
   b. Ex. gender, blood group, T-shirt size

Variable Measurement Scales:
Nominal
- Categorical / qualitative data
- Ex. colour, gender
- No order, no measurement
Ordinal
- The order of the data matters, the values themselves don't.
- We focus on the rank or order, not on the values.
- Ex. student ranks
Interval
- Order matters and the gaps between values are meaningful, but a natural zero is not present.
- Ex. temperature → 70-80, 80-90 → ordered ranges of values, but 0 does not mean "no temperature".
Ratio
- Ratio variables allow meaningful comparisons and calculations of ratios and percentages. They have:
- A clear definition of zero
- Equal intervals between values
- Meaningful ratios between values
- Continuous or discrete values

True Zero Point
A true zero point is a value on a scale where the absence of the property or attribute being measured is represented by zero: an absolute minimum value that represents the complete absence of the thing being measured. For example, for weight, a true zero point is the complete absence of weight. Similarly, for temperature measured in Kelvin, zero Kelvin (absolute zero) represents the complete absence of thermal energy. The presence or absence of a true zero point matters because it determines whether meaningful ratios can be calculated between values: on a ratio scale, ratios are meaningful because they represent the relative amounts of the thing being measured, while on an interval scale they are not, because there is no true zero to use as a reference.
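As a rough illustration of simple random vs. stratified sampling, here is a minimal pandas sketch on a made-up survey table (the column names and population sizes are invented for the example):

```python
import pandas as pd

# Hypothetical population: 1000 people with a gender column to stratify on
population = pd.DataFrame({
    "person_id": range(1000),
    "gender": ["male"] * 600 + ["female"] * 400,
})

# Simple random sampling: every member has an equal chance of selection
simple = population.sample(n=100, random_state=42)

# Stratified sampling: sample 10% from each non-overlapping group (stratum)
stratified = (
    population.groupby("gender", group_keys=False)
    .sample(frac=0.10, random_state=42)
)

print(simple["gender"].value_counts())      # proportions vary by chance
print(stratified["gender"].value_counts())  # 60 male / 40 female preserved
```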
Meaning of Exit Poll:
A survey in which people who have just voted are asked whom they voted for, in order to predict the result of the election.

Frequency Distribution:
How many times each particular item occurred → frequency.
Add the previous frequency to the current frequency; at the end you get the total number of items, n → cumulative frequency.
Discrete values → bar chart
Continuous values → histogram
PDF → smoothing of the histogram → kernel density estimator
(PDF → probability distribution function / probability density function.)

Univariate Analysis
⇒ Bar plot
⇒ Pie chart

Bar vs Histogram:
1. Data type → categorical data vs continuous data
2. Axes → bar → x-axis → categories, y-axis → frequency or count of the values
   a. histogram → continuous → x-axis → ranges of values (depends on the bins), y-axis → same (frequency)
3. Shape of bars: bar chart → evenly spaced bars of equal width; histogram → widths can vary a bit with the bins
4. Gaps between bars: bar charts have gaps, histograms don't

Graph for Bivariate Analysis
Numerical vs numerical ⇒ scatter plot

Central Tendency:
Refers to the measures used to determine the centre of the distribution of the data.
- Mean → prone to outliers
- Median → middle value in the dataset after arranging it in ascending order
- Mode → most frequent value in the dataset → useful for categorical data
- Weighted mean
- Trimmed mean

Outlier:
An outlier is a value that is completely different from the rest of the distribution. It has an adverse impact on the entire distribution of the data; there are different techniques to remove outliers.

Measures of central tendency:
Arithmetic mean for population and sample:
- population mean: μ = Σx / N
- sample mean: x̄ = Σx / n
The mean shifts noticeably when an outlier is included, as the earlier example showed.

Median:
Sort the values and take the centre element (for an even count, the average of the two centre elements). It works well with outliers; an outlier has little impact on the median.

Mode:
The most frequent value; used for both categorical and discrete variables.
Ex. a dataset with 10% of flower names missing → which measure of central tendency should fill these nulls? → the mode, because the values are categorical.
Which measure should fill null values for people's ages or salaries? → the mean or median, since these are numerical (prefer the median when outliers are present).

Measure of Dispersion (spread → how well spread out your data is):
Variance
When we want to see how two distributions differ, we use variance.
- population variance: σ² = Σ(x − μ)² / N
- sample variance: s² = Σ(x − x̄)² / (n − 1)
In the first plot → low variance; in the second plot → high variance.
High variance for blue because the data is widely spread out; low variance for red because the data is not. A small computational sketch follows.
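A minimal NumPy sketch of these measures on made-up numbers (the data values are invented for illustration):

```python
import numpy as np

data = np.array([32, 35, 36, 38, 40, 41, 43])
with_outlier = np.append(data, 120)  # one extreme value

print(np.mean(data), np.mean(with_outlier))      # mean is pulled by the outlier
print(np.median(data), np.median(with_outlier))  # median barely moves

# ddof=0 → divide by N (population variance)
# ddof=1 → divide by n-1 (sample variance, Bessel's correction)
print(with_outlier.var(ddof=0), with_outlier.var(ddof=1))
```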
Standard Deviation:
The root of the variance.
One standard deviation to the left, one standard deviation to the right: mean + 1(standard deviation), mean − 1(standard deviation).
Variance → how spread out is the data?
Standard deviation → what range does the data fall in around the mean (e.g. within 1 standard deviation)?
The standard deviation is more commonly used than the variance, as it is in the same units as the original data, while variance is in squared units. Standard deviation is also easier to interpret, as it represents the typical distance of data points from the mean. In summary, variance measures how spread out a dataset is, while standard deviation is the square root of variance and represents the typical distance of data points from the mean.

Percentile (used to find outliers):
A percentile is a value below which a certain percentage of observations lie.
Things to remember while calculating this measure:
- The data must be sorted from low to high.
- Percentiles are positions, not necessarily actual values in the data.
- All other "-tiles" (quartiles, deciles) can easily be derived from percentiles.
- You are basically finding the location of an observation.

Five Number Summary:
- Minimum
- First quartile → Q1
- Median → M
- Third quartile → Q3
- Maximum

Removing outliers:
IQR → Inter Quartile Range
[lower fence, upper fence] = [Q1 − 1.5·IQR, Q3 + 1.5·IQR]
IQR = Q3 − Q1, where Q1 = 25th percentile and Q3 = 75th percentile.

Benefits of a Box Plot:
- Easy way to see the distribution of the data
- Tells about the skewness of the data
- Can identify outliers
- Can compare 2 categories of data

Why is the sample variance divided by n − 1? (degrees of freedom)
The sample variance is calculated by dividing the sum of the squared differences from the sample mean by n − 1, where n is the number of observations in the sample. The reason for dividing by n − 1 instead of n is to correct for the bias introduced by using the sample mean to estimate the population mean. The sample mean is itself a random variable, and the squared deviations taken around it are on average slightly smaller than those around the true population mean, so dividing by n would underestimate the true population variance. Dividing by n − 1 corrects for this bias by making the estimate slightly larger. This adjustment is known as Bessel's correction, named after Friedrich Bessel. In summary, we divide by n − 1 to obtain an unbiased estimate of the population variance.

Covariance:
Correlation:
(The worked examples for these two were figures in the original notes; both measure how two variables vary together, covariance in raw units and correlation on a normalized −1 to +1 scale.)

Random Variable:
The sample space contains the values that a random variable can take.
Types of random variable:
1. Discrete RV → coin, dice
2. Continuous RV → height, weight, CGPA → holds a range of values

Probability Distribution
What are probability distributions?
A probability distribution is a list of all possible outcomes of a random variable along with their corresponding probability values.
Coin toss = {H, T} = {1/2, 1/2}
2 dice probability distribution → table → rows (die 1), columns (die 2) → sum of the two.
P(sum = 2) = 1/36. Not every sum has the same probability: P(sum = 7) = 6/36.
In a probability distribution, write down all of the outcomes along with their corresponding probability values, as in the sketch below.
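A small sketch that builds this two-dice table programmatically with exact fractions (no external data needed):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) pairs give each sum
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {total: Fraction(c, 36) for total, c in sorted(counts.items())}

print(pmf[2])             # 1/36
print(pmf[7])             # 1/6 (= 6/36, the most likely sum)
print(sum(pmf.values()))  # 1 — probabilities over all outcomes sum to 1
```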
In many scenarios the number of outcomes can be much larger, and a table would be tedious to write down. Worse still, the number of outcomes could be infinite — ex. heights of people. What if we use a mathematical function to model the relationship between outcome and probability, i.e. a function that returns the probability for a given outcome?

A probability distribution function is a mathematical function that describes the probability of obtaining different values of a random variable in a particular probability distribution.
Y = F(X) → find the function F that takes the value of an outcome and returns its probability. Using this mathematical function we can also draw a graph.
Probability distribution function → the set of all possible values of the random variable along with their probabilities.

Types of probability distributions
There are several famous probability distributions; in nature, many datasets show a lot of similarity to them.

Why are probability distributions important?
They give an idea about the shape/distribution of the data, and if our data follows a famous probability distribution then we automatically know a lot about the data.

Note on parameters:
Parameters of a probability distribution are numerical values that determine the shape, scale and location of the distribution. Different probability distributions have different sets of parameters, and understanding these parameters is essential in statistical analysis and inference.

Types of Probability Distribution Functions:
Probability Mass Function (PMF)
A probability distribution function that generates probabilities for a discrete random variable is known as a probability mass function. Ex. rolling a die, tossing a coin.
Probability Density Function (PDF)
If you describe the probabilities of a continuous random variable, it's known as a PDF.
There is also one more function defined on top of both:
Cumulative distribution function (CDF): PDF → CDF, PMF → CDF.

Probability Mass Function
Y = F(X) → rolling a die → {1: 1/6, 2: 1/6, …, 6: 1/6}, and 0 otherwise.
2 dice → sum → probability.
It gives the probability of each outcome of the random variable separately, as seen in the graph.
Examples of distributions with a PMF: Bernoulli distribution, binomial distribution.

Cumulative Distribution Function (CDF) of a PMF
The CDF describes the probability that a random variable X with a given probability distribution takes a value less than or equal to x:
F(x) = P(X ≤ x)
It gives the probability at a particular point together with all of the outcomes less than that outcome.

Probability Density Function (PDF)
The PDF describes the probability distribution of a continuous random variable. It's a mathematical function that generates probability densities for a continuous random variable.
X-axis → the same outcome values we had in a PMF.
Y-axis → not the probability we had in a PMF, but probability density.
1. Why probability density and not probability?
   a. Since the x-axis holds infinitely many values, the probability of any single exact value is essentially zero.
   b. Ex. CGPA → 0 to 10 with infinitely many values → what is the probability of a CGPA of exactly 7.694? → close to 0, even among 100 children in a class.
   c. Since we are dealing with an infinite range of values we cannot assign probability mass to each individual CGPA; each single-point probability would be zero.
   d. The area under the curve gives the total probability of every possible outcome, which here lies between 0 and 10.
   e. Probability density → used to get the probability that a value lies between two values.
   f. We calculate probability with the help of the area under the graph, via the probability density.
2. What does the area of the graph represent?
   a. The total area under the curve is 1 because it covers the whole range of values. Ex. CGPA → values between 0 and 10 → P(0 ≤ x ≤ 10) = 1.
3. How do we calculate a probability then?
   a. As the area under the density curve between two data values (see the sketch below).
4. Examples of PDFs:
   a. Normal distribution → parameters: mean, sigma squared
   b. Log-normal distribution → parameters: mean, sigma
   c. Poisson distribution → parameter: lambda (note: the Poisson distribution is discrete, so strictly it has a PMF)
5. How is the graph calculated? → via density estimation, covered next.
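A small scipy sketch of points 1-3, treating CGPA as roughly normal with an assumed mean of 7 and standard deviation of 1 (both numbers invented for illustration):

```python
from scipy.stats import norm

cgpa = norm(loc=7, scale=1)  # assumption: CGPA ~ Normal(mean=7, sd=1)

# Density at a single point is NOT a probability (it can even exceed 1)
print(cgpa.pdf(7.694))

# Probabilities come from areas under the curve, i.e. CDF differences
print(cgpa.cdf(8) - cgpa.cdf(7))    # P(7 <= CGPA <= 8)
print(cgpa.cdf(10) - cgpa.cdf(0))   # ~1: almost all the mass lies in [0, 10]
```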
Density Estimation
2 techniques of density estimation:
- Parametric density estimation → the data follows a specific probability distribution
- Non-parametric density estimation → the data doesn't follow any specific probability distribution

Commonly used techniques:
- Kernel density estimation (KDE)
- Histogram estimation
- Gaussian mixture models (GMM)

Google Colaboratory
https://colab.research.google.com/drive/1x97XwbK6TT4csmwQRq3zqsbfYNFKjf5-#scrollTo=fljBTmcZj256

Parametric Density Estimation
Plot a histogram → make an assumption about the data distribution → find the parameters of the assumed distribution.
Available data → mean, standard deviation → estimate the population mean and standard deviation → put every value of X into the PDF formula → it returns a probability density value.
The whole game is to estimate parameters as close as possible to the population mean and standard deviation.

Non-Parametric Density Estimation
Kernel Density Estimation
Kernel → a probability distribution → typically the Gaussian (normal) distribution.
- Take every point → treat it as a centre → place a normal distribution around that particular point, as in the second plot.
- Do the same thing for all of the data points. At the end you have as many Gaussian kernels as data points.
- Take one point on the x-axis → draw a perpendicular line toward the y-axis → check how many Gaussian curves the line intersects and read off their y-values → add all of the y-values. Do the same for all points.
- To generate the Gaussian for each point you need its parameters: mean and standard deviation.
  - mean → the data point itself
  - standard deviation → set through the kernel bandwidth → a hyperparameter
- Low bandwidth (small standard deviation): the estimated distribution shows spikes; the spread of each kernel decreases.
- High bandwidth (large standard deviation): the curve becomes smoother.
(A bandwidth sketch follows below.)

Cumulative Distribution Function
PDF → calculate the area under the graph → get the CDF.
CDF → calculate the slope of the CDF → get the PDF.
PDF → integrate → CDF; CDF → differentiate → PDF.

What is the difference between PDF and CDF?
First decide a rule: if petal_width lies within a particular range, classify the flower as setosa, and so on for the other types — that rule comes from the PDF. The CDF is useful to check whether the range we defined is right or not: it gives a quantitative measure of what percentage of values falls into the right category.

You have the iris dataset and 4 kdeplots showing the distributions of the 4 variables, each graph containing 3 distributions, one per category. Which 2 plots would you choose to remove? Keep the ones whose distributions are clearly differentiable. If all 3 distributions overlap each other, it's very hard to identify which category a value lies in, even for a machine learning model.
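A minimal sketch of the bandwidth effect using scipy's gaussian_kde on made-up bimodal data (the scalar bw_method roughly plays the bandwidth role here):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
xs = np.linspace(-4, 9, 400)

spiky = gaussian_kde(data, bw_method=0.05)(xs)  # low bandwidth → spikes
smooth = gaussian_kde(data, bw_method=0.5)(xs)  # high bandwidth → smoother curve

# Each estimate is a density, so it integrates to ~1
dx = xs[1] - xs[0]
print((spiky * dx).sum(), (smooth * dx).sum())
```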
In the plots above you can differentiate between the 3 categories (setosa, virginica and versicolor); remove the plots where the categories are not easily differentiable. When the distributions overlap each other too much it's very hard to tell the categories apart, so it's fine to remove that feature.
The same analysis works for the titanic dataset → age vs survived. By using this type of analysis, you learn which features are important for the model and which are not.

How is the Cumulative Distribution Function useful?
The cumulative distribution function (CDF) of a random variable gives the probability that the variable takes a value less than or equal to a particular value. In other words, the CDF gives us information about the probability distribution of the random variable. Specifically, the CDF provides the following:
1. Probability of a value less than or equal to a specific value: this helps us understand the likelihood of different outcomes.
2. Probability of a value within a range: by subtracting the CDF value at a lower value from the CDF value at a higher value, we find the probability that the random variable takes a value within that range.
3. Probability near a specific value: although the CDF does not directly provide the probability of a specific value, we can use it to find the probability that the random variable takes a value very close to that value.
In summary, the CDF provides important information about the probability distribution of a random variable, which helps us make predictions and draw conclusions about its behaviour.

2D Density Plot
Shows, for two numerical columns at one time, how they relate to each other.
Dark area → high density; light area → low density.
Which combinations have the highest probability/density? → you get the paired probability density of the two columns.
Contour plot → the 3rd dimension is colour.

Normal Distribution
By changing the mean, shifting the curve is possible; by changing the standard deviation, spreading it is possible.

Standard Normal Distribution
The probability/area values for the standard normal distribution are already calculated and stored in the Z-table.

Properties of the Normal Distribution
- The measures of central tendency are equal → mean = median = mode for a proper normal distribution.
- Empirical rule (68-95-99.7)
- Area under the curve → 1

What is Skewness?
In a positively skewed distribution the tail is on the right side; outliers on the right side pull the mean toward the right.
Measure of skewness:
1. 0 → proper normal distribution → not skewed
2. -0.5 to 0.5 → approximately symmetric
3. -1 to -0.5 and 0.5 to 1 → moderately skewed
4. beyond ±1 → highly skewed
It is not necessarily true that data which isn't normally distributed is skewed; check both the skewness and the shape of the distribution. A non-normal distribution can also be symmetrical.

CDF of Normal Distribution

Uses of the Normal Distribution in data science:
- Outlier detection
- Assumptions on data for ML algorithms: linear regression, GMM
- Hypothesis testing
- Central limit theorem

What is Kurtosis?
The 4 main statistical moments:
1. Mean
2. Standard deviation (the second moment, via the variance)
3. Skewness
4. Kurtosis
There are more moments as well, but these are the main 4.
Kurtosis tells us the heaviness of a distribution's tails — how fat the tails are; it is not about peakedness. Fat tails → a higher chance of having outliers.
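A quick scipy check of these two moments on synthetic data (the samples are invented for illustration):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(1)
normal = rng.normal(size=10_000)
right_skewed = rng.exponential(size=10_000)

print(skew(normal), skew(right_skewed))          # ~0 vs clearly positive
# fisher=True (the default) reports excess kurtosis: ~0 for a normal distribution
print(kurtosis(normal), kurtosis(right_skewed))  # ~0 vs heavy right tail
```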
QQ Plot
Process:
1. Take theoretical data → e.g. normal → sort → calculate quantiles → Y quantiles.
2. Take the original data → sort → calculate quantiles → X quantiles.
3. Take the first point from X and Y and compare; do the same thing for all points. If the points fall along a straight line, the data matches the theoretical distribution.

Does a QQ plot only detect normal distributions?
No. You can compare against any type of distribution that you want.

What are quantiles?
In statistics, a quantile is a specific value or cut-off point in a distribution that divides the data into smaller groups or subsets with equal proportions. Quantiles are used to describe the spread of a dataset, and they can be used to identify the location of individual observations within a distribution.
The most commonly used quantiles are the quartiles, which divide the data into four equal parts. The first quartile (Q1) is the 25th percentile, meaning that 25% of the data is less than or equal to this value while 75% is greater than or equal to it. The second quartile (Q2) is the median, the 50th percentile, and the third quartile (Q3) is the 75th percentile. Other common quantiles include the deciles (which divide the data into ten equal parts) and the percentiles (which divide the data into 100 equal parts).
Quantiles are useful for identifying outliers, detecting skewness in the data, and comparing datasets. For example, the difference between the 25th and 75th percentiles (the interquartile range) can be used to measure the spread of a dataset, while comparing the distributions of two datasets through their quartiles can provide insights into their differences.

Non-Gaussian Distributions
- Continuous non-Gaussian distributions
- Discrete non-Gaussian distributions

Uniform Distribution
It has two types:
1. Discrete uniform distribution
2. Continuous uniform distribution
Skewness → 0 → symmetrical, like the normal distribution.

Log-Normal Distribution
A right-skewed random variable whose logarithm is normally distributed: calculate the log → the distribution becomes normal.
How to check if a random variable is log-normally distributed?
X data → take the log of the data → it should be normally distributed. Verify with a QQ plot → if the points fall along the line, it is; otherwise it isn't. (See the sketch below.)

Transformation:
In statistics, a transformation refers to a mathematical function applied to a dataset in order to alter its distribution or make it more suitable for a particular analysis or modelling technique. Transformations are commonly used in data analysis to better satisfy the assumptions of statistical models or to improve the accuracy of statistical inference.
The most common transformations are linear, which involve multiplying the data by or adding a constant value to it. For example, a common linear transformation converts temperature from Fahrenheit to Celsius by subtracting 32 and multiplying by 5/9.
Nonlinear transformations are also commonly used, particularly when the data is not normally distributed or exhibits skewness or outliers. Some common nonlinear transformations include:
- Logarithmic transformations: used to reduce the effect of outliers or to compress data that spans several orders of magnitude.
- Square root transformations: used to reduce the effect of positive skewness or to linearize relationships between variables.
- Box-Cox transformations: a family of transformations that adjusts the skewness and kurtosis of a dataset by selecting a parameter that maximizes the likelihood of the transformed data.
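A minimal sketch of the log-normal check described above: generate right-skewed data, log-transform it, and compare QQ plots (matplotlib assumed available):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0, sigma=0.8, size=1000)  # right-skewed sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
probplot(x, dist="norm", plot=ax1)           # curved → not normal
probplot(np.log(x), dist="norm", plot=ax2)   # along the line → log-normal
ax1.set_title("raw data")
ax2.set_title("log of data")
plt.show()
```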
Transformations can be applied to the entire dataset or to specific variables or subsets of the data. When selecting a transformation, it is important to consider the underlying assumptions of the statistical model or analysis, as well as the interpretability and practical implications of the transformed data.

Bernoulli Distribution
- A discrete distribution.
- Two outcomes → success, failure → a coin toss, a spam/not-spam email, rolling a die and getting a five or not.
- Models a random experiment with a binary outcome.
PMF: P(X = x) = p^x · (1 − p)^(1−x) for x ∈ {0, 1}.
Coin toss → probability of success, i.e. heads (x = 1): (0.5)^1 · (1 − 0.5)^(1−1) = 0.5.
Coin toss → probability of failure, i.e. tails (x = 0): (0.5)^0 · (1 − 0.5)^(1−0) = 0.5.

Binomial Distribution
Describes the number of successes in a fixed number of independent Bernoulli trials.
Perform a Bernoulli trial n times → binomial.
P(X = x) = n! / (x!(n − x)!) · p^x · (1 − p)^(n−x)
- n: number of trials
- p: probability of success
- x: desired number of successes in the random experiment
Ex. getting a like from exactly 2 people out of 3:
3! / (2!·1!) · (0.5)² · (1 − 0.5)^(3−2) = 3 · 0.25 · 0.5 = 0.375
If we choose a high value like p = 0.8, the distribution moves toward the right: since the probability of getting heads is 0.8, most of the random experiments land on values around 7 to 10 out of 10 trials. For a low value of p it moves toward the left, as the sketch below shows.
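A small scipy sketch of that shift (n = 10 trials; the particular p values are illustrative):

```python
from scipy.stats import binom

n = 10
for p in (0.2, 0.5, 0.8):
    pmf = [binom.pmf(k, n, p) for k in range(n + 1)]
    mode = max(range(n + 1), key=lambda k: pmf[k])
    # higher p → the bulk of the distribution sits further right
    print(f"p={p}: most likely number of successes = {mode}")

print(binom.pmf(2, 3, 0.5))  # 0.375, the 2-likes-out-of-3 example
```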
Day 3: Intermediate to Advanced

Distributions:
We use different types of distributions to get an idea about a dataset. How the data is spread out or arranged is known as its distribution. How can we see the data visually? For continuous data → which type of graph is useful to understand the data? There are multiple ways to visualize data using different graphs.

What is a Distribution?
In statistics, a distribution refers to the way in which a set of data is spread out or arranged across its range of values. There are several types of distributions, including:
1. Normal distribution: a symmetrical, bell-shaped distribution in which most of the data falls near the mean or average value, with the rest distributed evenly on either side.
2. Skewed distribution: a distribution in which the data is not symmetrical and is "skewed" to one side or the other. A positively skewed distribution has a long tail on the right, while a negatively skewed distribution has a long tail on the left.
3. Bimodal distribution: a distribution with two peaks or modes, indicating that two different groups or populations are represented.
4. Uniform distribution: a distribution in which the data is evenly spread across the range of values, with no peaks or modes.

The uniform distribution describes a situation where every possible outcome is equally likely: each value within a given range has the same probability of occurring. The probability density function (pdf) of a continuous uniform distribution is:
f(x) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise,
where a is the minimum value of the range and b is the maximum.
The cumulative distribution function (cdf) of a continuous uniform distribution is:
F(x) = 0 if x < a; (x − a)/(b − a) if a ≤ x ≤ b; 1 if x > b.
The uniform distribution is often used in statistics and probability theory as a simple and tractable model for situations where every outcome in a range is equally likely, such as rolling a fair die or selecting a random number from a given interval.

Distributions can be described using measures of central tendency (such as the mean, median, and mode) and measures of variability (such as the range, variance, and standard deviation). Understanding the distribution of a dataset is important in statistical analysis, as it provides insight into the nature of the data, helps identify outliers or anomalies, and informs decisions about appropriate statistical tests or models.

1. Gaussian or Normal Distribution:
The mean of the data lies at the centre of the distribution, and one side is exactly symmetrical to the other. A symmetrical, bell-shaped distribution where most of the data falls near the mean, with the rest spread evenly on both sides, is known as a normal or Gaussian distribution.

Empirical Formula:
The 68-95-99.7% rule. For a dataset of, say, 100 data points:
- Within the 1st standard deviation lies around 68% of the entire distribution.
- Within the 2nd standard deviation lies around 95%.
- Within the 3rd standard deviation lies around 99.7%.
If you have a normal/Gaussian distribution, these three percentages definitely hold. Ex. height is roughly normally distributed → a domain expert (a doctor) knows how much data falls within the 1st, 2nd and 3rd standard deviations; similarly for weight or the iris dataset.

With mean = 4 and sd = 1, the value 4.5 is +0.5 SD toward the right side; but for a value like 4.75 this is hard to work out by eye, and that's why we use the Z-score.

Z-Score:
It helps find how many standard deviations a value is away from the mean. A z-score measures the distance between a data point and the mean in units of standard deviations. Z-scores can be positive or negative; the sign tells you whether the observation is above or below the mean.
z = (x − population mean) / standard deviation → e.g. +0.75 SD → a positive value, so it lies on the right side.
After applying the z-score the data is converted onto a scale of roughly {-3, -2, -1, 0, 1, 2, 3}; this is called the standard normal distribution.

Standard Normal Distribution:
The most important property of the SND: mean = 0 and standard deviation = 1. A random variable belongs to the standard normal distribution if its mean is zero and its standard deviation is one.

Why do we convert using the z-score?
Dataset → Age (years), Salary (rupees), Weight (kg) → each measured in its own units:

Age (years) | Salary (rupees) | Weight (kg)
34          | 56k             | 70
56          | 110k            | 87

Note that the units of the columns are all different. The main target → bring them to a common form → SND(0, 1). Take the entire data → apply the z-score → SND(0, 1) → this is standardization. Whenever we talk about standardization, it is always the z-score.

Normalization: shift the entire range of values into the range (0, 1).
MinMaxScaler → (0, 1) → if you want to shift the range to (-1, 1), you can.
Ex. CNN → image → pixels → range (0, 255) → MinMaxScaler → (0, 1). (A sketch of both scalers follows.)
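A minimal sklearn sketch of standardization vs normalization on the little age/salary/weight table above (a third row is invented so the scalers have something to fit):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[34, 56_000, 70],
              [56, 110_000, 87],
              [45, 80_000, 78]], dtype=float)  # age, salary, weight

standardized = StandardScaler().fit_transform(X)  # per-column z-score → mean 0, sd 1
normalized = MinMaxScaler().fit_transform(X)      # per-column rescale → [0, 1]

print(standardized.mean(axis=0).round(2), standardized.std(axis=0).round(2))
print(normalized.min(axis=0), normalized.max(axis=0))
```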
Practical Example of the Z-Score
Match → Ind vs SA:
- Series average, 2021 and 2020 → 250 and 260
- Standard deviation of scores, 2021 and 2020 → 10 and 12
- Team's final score, 2021 and 2020 → 240 and 245
Comparing the two scores — in which year was the team's final score better? We use the z-score for this: 2021 → (240 − 250)/10 = −1.0; 2020 → (245 − 260)/12 = −1.25. The higher (less negative) z-score means the better score relative to that series, so the 2021 score was the better one even though 245 > 240 in raw terms.

The z-score also helps find areas under the bell curve.
The Z-table accumulates area from left to right. If you look up 0.25, the table gives 0.5987: the whole left half (0.5) plus the extra 0.0987 between the mean and +0.25 SD. If you look up −1, the left-tail table directly gives 0.1587, because values run from negative toward positive; following this one simple rule makes the table easy to read. Note you may come across Z-tables in different formats, so be careful while reading off the value.
IQ between 90 and 120? → find the z-scores of 90 and 120 → use the Z-table → take the difference of the two areas to get the percentage.

Where do we use Standardization and Normalization?
Normalization and standardization are both techniques used to preprocess data before training a machine learning model, but they are used in different situations depending on the nature and distribution of the data.
Normalization is used when the features are on different scales and ranges. It rescales the data to values between 0 and 1 (or −1 and 1), so that each feature contributes equally to distance computations.
Standardization is used when the data is normally distributed, or when the algorithm used for training the model assumes a normal distribution of the data. It transforms the data to have a mean of 0 and a standard deviation of 1, which allows the algorithm to assume the data is centred around 0 with a similar scale across features.
In summary: use normalization when the data is on different scales and ranges, and standardization when the data is (approximately) normally distributed or the algorithm assumes it.

How can you find outliers using the z-score?
Using the z-score, find how many standard deviations each value is away from the mean. The empirical rule tells us that within 1 standard deviation falls 68% of the data, within 2 standard deviations 95%, and within 3 standard deviations 99.7%. Whatever value lies beyond the 3rd standard deviation is considered an outlier.
The z-score, also known as the standard score, indicates how many standard deviations a data point is from the mean of a distribution: subtract the mean from the data point, then divide by the standard deviation. Z-scores standardize data so it can easily be compared across different distributions or datasets; by converting data from different units or scales into a standard unit of measurement, analysis and interpretation become easier. Z-scores are commonly used in statistics and data analysis to identify outliers or unusual observations: an observation with a z-score greater than 3 or less than −3 is considered an outlier, meaning it is significantly different from the other observations in the dataset (see the sketch below).
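A minimal NumPy version of that rule, with the 3-standard-deviation cutoff described above (the data is invented, with two extreme values planted):

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.append(rng.normal(50, 5, 1000), [85.0, 20.0])  # two planted outliers

z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 3]  # empirical-rule cutoff at 3 standard deviations

# Recovers the planted extremes (plus the occasional natural 3-sigma point)
print(outliers)
```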
In addition to identifying outliers, z-scores can be used to calculate confidence intervals and p-values in statistical hypothesis testing, as well as in machine learning algorithms such as clustering and anomaly detection. Overall, the z-score provides a standardized way to compare and analyse data, making it a useful tool in a wide range of applications.

Probability:
Probability is a measure of the likelihood of an event. In statistics, it is a number between 0 and 1, with 0 indicating that an event is impossible and 1 indicating that an event is certain. Probability describes the uncertainty or randomness of a particular event or outcome, and is used in a wide range of statistical applications, from estimating the likelihood of a particular event occurring to modelling complex systems and making predictions about future outcomes.
P(event) = number of ways the event can occur / number of possible outcomes

Addition Rule of Probability (OR → +):
Mutually Exclusive Events: two events that cannot occur at the same time; the occurrence of one precludes the occurrence of the other. For example, when flipping a coin it can land heads up or tails up, but not both at the same time, so these two events are mutually exclusive. For mutually exclusive events, the addition rule says the probability of either event occurring equals the sum of their individual probabilities: P(A or B) = P(A) + P(B).

Non-Mutually Exclusive Events: both events can occur at the same time.
Ex. picking a card from a deck of cards: what is the probability of choosing a card that is a queen or a heart? These events are non-mutually exclusive because both can occur at once (the queen of hearts).
- There are 4 queens in a deck of cards.
- There are 13 hearts in a deck of cards.
- There is 1 queen of hearts in a deck of cards.
The addition rule for non-mutually exclusive events is:
P(A or B) = P(A) + P(B) − P(A and B)
where P(A and B) is the probability of both events occurring together. Subtracting it avoids double counting when both events can occur at the same time. So P(queen or heart) = 4/52 + 13/52 − 1/52 = 16/52.

Multiplication Rule:
Independent Events: rolling a die → {1,2,3,4,5,6}. If I get 1 on the first try, on the next try I may still get any number between 1 and 6; every number has the same chance and the same probability of occurring. One outcome does not depend on another, so these are independent events.
What is the probability of rolling a 5 and then a 4 with a die? 1/6 × 1/6 = 1/36.

Dependent (non-independent) Events: a bag with 3 red marbles and 2 blue marbles. What is the probability of drawing marbles one after another? If on the first try you pick out one red marble, 4 marbles remain → 2 red, 2 blue. The first event has an impact on the next event, so we say the second event is dependent on the first: a dependent event. (See the sketch below.)
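A small exact-arithmetic sketch of both rules using Python fractions:

```python
from fractions import Fraction as F

# Addition rule (non-mutually exclusive): queen OR heart from a 52-card deck
p_queen, p_heart, p_queen_of_hearts = F(4, 52), F(13, 52), F(1, 52)
print(p_queen + p_heart - p_queen_of_hearts)  # 4/13 (= 16/52)

# Multiplication rule, independent: a 5 then a 4 on two die rolls
print(F(1, 6) * F(1, 6))  # 1/36

# Multiplication rule, dependent: red then red from 3 red + 2 blue marbles
print(F(3, 5) * F(2, 4))  # 3/10 — the second draw depends on the first
```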
Naive Bayes → conditional probability.
What is the probability of drawing a queen and then an ace from a deck of cards? 4/52 × 4/51 (the second draw is dependent on the first).

Permutation and combination are two fundamental concepts in combinatorics, the branch of mathematics that deals with counting and arranging objects.

Permutation:
School trip → chocolate factory → {6 different types of chocolate} → each student has to note down 3 chocolates in order → how many permutations are possible? → 6 × 5 × 4 = 120.
Permutation refers to the arrangement of objects in a specific order: arrangements where the order of the objects matters. For example, with three distinct objects A, B, and C, there are six possible permutations: ABC, ACB, BAC, BCA, CAB, and CBA.
The formula for the number of permutations of n objects taken r at a time is:
nPr = n! / (n − r)!
where n is the total number of objects and r is the number of objects chosen.

Combination:
Combination, on the other hand, refers to the selection of objects from a set without regard to the order in which they are selected: arrangements where the order does not matter. For example, choosing 2 of the three distinct objects A, B, and C gives only three possible combinations: AB, AC, and BC (since BA is the same combination as AB, and so on).
The formula for the number of combinations of n objects taken r at a time is:
nCr = n! / (r! · (n − r)!)
In summary, the main difference between permutation and combination is that permutation is concerned with the arrangement of objects in a specific order, while combination is concerned with the selection of objects without regard to their order. (A quick sketch follows.)
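Python's standard library covers both counts directly (math.perm and math.comb, available from Python 3.8):

```python
import math

print(math.perm(6, 3))  # 120 — ordered choices of 3 chocolates out of 6
print(math.comb(6, 3))  # 20  — unordered choices of 3 out of 6
print(math.comb(3, 2))  # 3   — AB, AC, BC from {A, B, C}

# nPr = nCr * r! : every combination can be ordered in r! ways
assert math.perm(6, 3) == math.comb(6, 3) * math.factorial(3)
```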
P-Value:
Intuition: the mouse pad of a laptop → you touch the middle area most frequently → the middle shows a high density of touches and both sides show far less. Suppose the edge of the mouse pad has a p-value of 0.01 for being touched: out of 100 touches, only 1 landed in that particular area. This is the kind of "how likely is an observation this unusual" statement a p-value makes about a specific experiment.
P-value is a statistical measure of the evidence against a null hypothesis. It is used to determine whether the results of a statistical test are significant and can be used to reject the null hypothesis. The null hypothesis is a statement that there is no significant difference between the two groups or variables being compared.
A p-value is calculated by comparing the observed data with the data expected under the null hypothesis. If the p-value is small (usually less than 0.05), the observed data is unlikely to have occurred by chance alone, and we can reject the null hypothesis. If the p-value is large, the observed data is likely to have occurred by chance alone, and we cannot reject the null hypothesis.
For example, if we are testing the hypothesis that a new drug is effective in treating a particular disease, we might conduct a randomized controlled trial and compare the outcomes of patients who received the drug with those who received a placebo. If the p-value is less than 0.05, the difference in outcomes between the two groups is statistically significant, and we can conclude that the drug is effective.
P-values are commonly used in hypothesis testing and statistical inference, and are an important tool in many scientific fields. However, they can be controversial and are sometimes criticized for being misinterpreted or misused. As with any statistical measure, it is important to use p-values appropriately and to interpret the results in the context of the study design and the underlying scientific question.

Hypothesis Testing:
Coin → test whether this coin is a fair coin or not by performing 100 tosses. When do you think a coin is fair? → P(H) = 0.5, P(T) = 0.5. If I get about 50 heads then I can reasonably say this coin is fair.
- Null hypothesis → usually given in the problem statement → the coin is fair.
- Alternate hypothesis → the coin is unfair.
- Run the experiment → reject or retain the null hypothesis.
For 100 fair tosses, the number of heads has mean = 50 and standard deviation = √(100 × 0.5 × 0.5) = 5.
Our experimental result should be near the mean, but how do we decide how far from the mean it may be? For that we use the significance value → defined by the domain expert. In statistics the symbol for it is alpha (α). Alpha is the significance level in hypothesis testing and represents the probability of rejecting the null hypothesis when it is actually true; it is usually set to 0.05 or 0.01, corresponding to a 5% or 1% chance of rejecting a true null hypothesis. (A sketch of this coin test follows at the end of this section.)
With the p-value you talk about the probability, under the null hypothesis, of a result at least as extreme as that of your specific experiment; the significance level says what should lie within your confidence interval.

What are Type 1 and Type 2 Errors?
Null hypothesis → H0 → the coin is fair.
Alternate hypothesis → H1 → the coin is not fair.
- Outcome 1: we reject the null hypothesis, and in reality it is false → correct.
- Outcome 2: we reject the null hypothesis, but in reality it is true → wrong → Type 1 error.
- Outcome 3: we retain/accept the null hypothesis, but in reality it is false → wrong → Type 2 error.
- Outcome 4: we retain/accept the null hypothesis, and in reality it is true → correct.
Type 1 error, also known as a false positive, occurs when the null hypothesis is rejected even though it is true: we conclude that there is a statistically significant difference between two groups when in fact there is not. The probability of making a Type 1 error is denoted by alpha (α) and is usually set at 0.05 or 0.01.
Type 2 error, also known as a false negative, occurs when the null hypothesis is accepted even though it is false: we fail to detect a statistically significant difference between two groups when in fact there is one. The probability of making a Type 2 error is denoted by beta (β) and depends on the sample size, the effect size, and the level of significance.
In hypothesis testing the goal is to minimize both Type 1 and Type 2 errors. However, there is often a trade-off between the two, as reducing the probability of one type of error increases the probability of the other. Statisticians use various methods to balance the risks, such as increasing the sample size or adjusting the significance level. Ultimately, the choice of which type of error to prioritize depends on the goals and context of the study, as well as the consequences of each type of error.
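A minimal scipy version of the coin test (scipy ≥ 1.7 provides binomtest; the observed count of 60 heads in 100 tosses is invented):

```python
from scipy.stats import binomtest

result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")
print(result.pvalue)  # ~0.057 for this borderline outcome

alpha = 0.05
if result.pvalue < alpha:
    print("reject H0: the coin looks unfair")
else:
    print("retain H0: not enough evidence the coin is unfair")
```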
One-tailed and two-tailed tests
Example question: colleges in Karnataka have an 85% placement rate. A new college was recently opened, and a sample of 150 of its students had a placement rate of 88% with a standard deviation of 4%. Does this college have a different placement rate than the other colleges? (Worked as a sketch after this section.)

A hypothesis test can be either a one-tailed test or a two-tailed test, depending on the directionality of the alternative hypothesis.
In a one-tailed test, the alternative hypothesis is directional and predicts that the population parameter is either greater than or less than the null hypothesis value. The one-tailed test is used when there is a clear directional prediction about the outcome of the experiment. For example, if we are testing the hypothesis that a new drug is more effective than a placebo, we might use a one-tailed test with the alternative hypothesis that the mean improvement in the drug group is greater than the mean improvement in the placebo group.
In a two-tailed test, the alternative hypothesis is non-directional and predicts that the population parameter is simply different from the null hypothesis value, without specifying the direction of the difference. The two-tailed test is used when there is no clear directional prediction about the outcome of the experiment. For example, if we are testing the hypothesis that a new drug has an effect on blood pressure, we might use a two-tailed test with the alternative hypothesis that the mean change in blood pressure in the drug group is different from zero.
One-tailed test: the null hypothesis is that the mean of Group A equals the mean of Group B; the alternative is that the mean of Group A is greater than the mean of Group B. The shaded area (one tail) is the rejection region at the 0.05 level of significance: if the test statistic falls in this region, we reject the null hypothesis in favour of the alternative and conclude that Group A has a higher mean than Group B.
Two-tailed test: the null hypothesis is the same; the alternative is that the mean of Group A is different from the mean of Group B. The shaded areas in both tails are the rejection regions at the 0.05 level: if the test statistic falls in either region, we reject the null hypothesis in favour of the alternative and conclude that Group A has a different mean than Group B.
It is important to choose the appropriate type of test based on the research question and the nature of the data being analysed. One-tailed tests are more powerful than two-tailed tests when there is a clear directional prediction, but they cannot detect an effect in the opposite direction. Two-tailed tests are more conservative and appropriate when there is no clear directional prediction, but they are less powerful than one-tailed tests.
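One way to work the Karnataka placement example as a two-tailed test, treating the sample standard deviation from the question as the scale of individual outcomes (with n = 150 the normal approximation is reasonable; the exact test choice is an assumption):

```python
import math
from scipy.stats import norm

mu0, xbar, s, n = 0.85, 0.88, 0.04, 150   # numbers from the question

z = (xbar - mu0) / (s / math.sqrt(n))     # standard error = s / sqrt(n)
p_two_tailed = 2 * norm.sf(abs(z))        # area in both tails

print(round(z, 2), p_two_tailed)          # z ≈ 9.19 → p ≈ 0 → reject H0
```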
Confidence Interval
A confidence interval is a range of values that is likely to contain the true value of a population parameter with a certain level of confidence or probability. It is calculated from a sample of data and is used to estimate the likely range of values for a population parameter, such as the mean or a proportion. It is typically expressed as a range of values with an associated level of confidence, such as "we are 95% confident that the true value of the population parameter lies between x and y." The level of confidence is usually set at 95% or 99%, although other levels may be used depending on the nature of the data and the research question.
The confidence interval is calculated using statistical methods and takes into account the sample size, the variability of the data, and the level of confidence desired. A wider confidence interval indicates greater uncertainty about the true value of the population parameter, while a narrower one indicates greater precision in the estimate.
Confidence intervals are commonly used in statistical inference to make predictions and draw conclusions about a population based on a sample of data. They are also used in hypothesis testing to determine whether a hypothesis about a population parameter is supported by the data or not.

Point Estimate: the value of any statistic that estimates the value of a parameter.

Z-Test:
Used when the population standard deviation is known and n ≥ 30. Interval: point estimate ± margin of error.
A Z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large. It is based on the standard normal distribution and uses the Z-score, which measures how many standard deviations a data point is from the mean of a distribution.
To perform a two-sample Z-test, we calculate the difference between the means of the two populations and divide it by the standard error of the difference, which is the square root of the sum of the variances of the two populations divided by their respective sample sizes.
If the Z-score is greater than the critical value for the desired level of significance (usually α = 0.05), we reject the null hypothesis and conclude that the two population means are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to support the claim that they differ.
For a one-sample test, the Z-score is:
Z = (x̄ − μ) / (σ / sqrt(n))
where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.
We can use the Z-test to compare the means of two populations, such as the effectiveness of two different treatments or the performance of two different groups of students. However, it is important to ensure that the assumptions of the test are met, such as the normality of the populations and knowledge of their variances. If these assumptions are not met, alternative tests such as the t-test may be more appropriate. (A confidence-interval sketch using the normal critical value follows.)
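A short sketch of "point estimate ± margin of error" for a mean with a known population standard deviation (all numbers invented):

```python
import math
from scipy.stats import norm

xbar, sigma, n = 68.0, 3.0, 36   # sample mean, known population sd, sample size
z_crit = norm.ppf(0.975)          # ±1.96 cuts off 2.5% in each tail → 95% level

margin = z_crit * sigma / math.sqrt(n)
print(f"95% CI: ({xbar - margin:.2f}, {xbar + margin:.2f})")  # 68 ± 0.98
```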
T-Test
When the population standard deviation is not given to us, we use the t-test. The interval has the same form, point estimate ± margin of error, but the statistic changes: to look up the critical t-value, use the t-table with degrees of freedom n − 1.
A t-test is a statistical test used to determine whether means differ when the population variance is not known and the sample size is small. It is based on the t-distribution, which is similar to the standard normal distribution but has fatter tails. The t-test is used when the population standard deviation is unknown and cannot be estimated beforehand; instead, we use the sample standard deviation. It is also used when the sample size is small (less than 30), where the normal approximation suggested by the central limit theorem is not yet reliable.
The formula for the t-test is similar to the formula for the z-test, but instead of the standard normal distribution we use the t-distribution:
t = (x̄ − μ) / (s / sqrt(n))
where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the sample size. The degrees of freedom for the t-distribution are n − 1.
Here's an example to illustrate the difference between the t-test and the z-test. Suppose we want to test whether the mean height of a sample of 20 men differs from the population mean of 66 inches. The sample mean height is 68 inches and the sample standard deviation is 2 inches; suppose we also happen to know that the population standard deviation is 3 inches.
If we use the z-test, the test statistic is:
z = (68 − 66) / (3 / sqrt(20)) = 2 / 0.671 ≈ 2.98
If we use the t-test, the test statistic is:
t = (68 − 66) / (2 / sqrt(20)) = 2 / 0.447 ≈ 4.47
The critical value for a two-tailed test at the 0.05 level of significance is 1.96 for the z-test and 2.093 for the t-test with 19 degrees of freedom. In both cases the calculated statistic exceeds the critical value, so we reject the null hypothesis and conclude that the mean height of the sample is significantly different from the mean height of the population.

One-Sample Z-Test:
- The population standard deviation is given.
- The sample size is greater than or equal to 30.
Steps:
1. Define the null hypothesis → H0 → μ = 100.
2. Define the alternative hypothesis → H1 → μ ≠ 100.
3. State the alpha value → 0.05.
4. State the decision rule (the critical values).
5. Calculate the Z test statistic: Z = (x̄ − μ) / (σ / √n).
6. State the decision: reject or retain H0.
For a single observation, √n is just 1, which is why the plain z-score formula we used earlier has no √n in it. Also, since we work with sample data there is some sampling error; as the sample size keeps increasing, the sample mean keeps moving toward the population mean.

One-Sample T-Test:
Same steps, but here we don't have the population standard deviation, so use s and the t-table. In the worked example from the notes, the p-value came out below the significance value, so we reject the null hypothesis. (A sketch of both one-sample tests follows.)
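A minimal scipy sketch of a one-sample z-test and t-test against H0: μ = 100 (the sample values are invented, and the "known" σ = 3 for the z branch is an assumption):

```python
import math
import numpy as np
from scipy.stats import norm, ttest_1samp

sample = np.array([102, 99, 104, 101, 98, 103, 100, 105, 97, 106], dtype=float)
mu0 = 100.0

# z-test: pretend the population sd is known (assumed to be 3 here)
sigma = 3.0
z = (sample.mean() - mu0) / (sigma / math.sqrt(len(sample)))
p_z = 2 * norm.sf(abs(z))

# t-test: population sd unknown → use the sample sd, df = n - 1
t_stat, p_t = ttest_1samp(sample, popmean=mu0)

print(round(z, 2), round(p_z, 4))
print(round(t_stat, 2), round(p_t, 4))
```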
Upcoming topics: Chi-Square, Covariance, Pearson Correlation Coefficient, Spearman Rank Correlation, Practical Implementation, F-Test (ANOVA).

Chi-Square:
Used for claims about population proportions. It's a non-parametric test that is performed on categorical (nominal or ordinal) data.
Non-parametric test: usually arises with population proportions — when we are given some kind of proportion of the data, think non-parametric test. A non-parametric test is a statistical test that does not assume a specific distribution for the population being sampled. Instead, it makes fewer assumptions about the data and can be used when the data are not normally distributed or when the sample size is small. Non-parametric tests are often used for categorical or ordinal data, or for data that do not meet the assumptions of parametric tests. They are generally less powerful than parametric tests, but are more robust to outliers and other violations of the assumptions. Examples of non-parametric tests include the Wilcoxon rank-sum test and the Kruskal-Wallis test.

A Chi-Square test is a statistical test used to determine whether there is a significant association between two categorical variables. It is used when the data is categorical, and it checks whether there is a difference between the observed frequencies and the expected frequencies in one or more categories.
For example, suppose we want to test whether there is a significant association between gender and political affiliation. We would collect data on the number of men and women in each political party and use a Chi-Square test to compare the observed frequencies with the expected frequencies.
The expected frequencies are calculated based on the assumption that there is no association between the two variables being tested. If the observed frequencies are significantly different from the expected frequencies, we can reject the null hypothesis and conclude that there is a significant association between the two variables.
The use of the Chi-Square test is not limited to gender and political affiliation. It can be used in many other situations where there are two categorical variables to be compared — for example, to test whether there is a significant association between smoking status and lung cancer, or between education level and income.
In summary, the Chi-Square test is a non-parametric statistical test used to determine whether there is a significant association between two categorical variables, usable in many different situations.
Worked-example settings from the notes: set alpha = 0.05; degrees of freedom = n − 1 = 2 (three categories); compare the statistic against the upper-tail critical value from the chi-square table. (A code sketch follows.)

Remaining topics: Central Limit Theorem, Confidence Interval, Hypothesis Testing, and measuring skill and chance in games.
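A minimal scipy sketch of the gender × party association test (the counts in the contingency table are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: men, women; columns: party A, party B, party C (made-up counts)
observed = np.array([[60, 30, 10],
                     [40, 45, 15]])

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 4), dof)
print(expected.round(1))  # frequencies expected if gender and party were independent

alpha = 0.05
print("reject H0 (association exists)" if p < alpha else "retain H0")
```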