HCS214 ARTIFICIAL INTELLIGENCE
Naive Bayes Classification Models, Linear Regression, Significance Tests and Confidence Level
Presentation By: Mabhande Charles
Reg Number: M210669

NAIVE BAYES' CLASSIFICATION MODELS
Assumption: The fundamental Naive Bayes assumption is that each feature makes an:
• independent
• equal
contribution to the outcome.

With relation to our dataset, this concept can be understood as follows:
• We assume that no pair of features is dependent. For example, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the wind. Hence, the features are assumed to be independent.
• Secondly, each feature is given the same weight (or importance). For example, knowing the temperature alone, or the humidity alone, cannot predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.

Bayes' Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

where A and B are events and P(B) ≠ 0.
• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, it is event B).
• P(B|A) is the likelihood: the probability of observing the evidence given that A is true.
• P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.

Naive Assumption
Now it is time to add the naive assumption to Bayes' theorem: independence among the features. We therefore split the evidence into independent parts. If any two events A and B are independent, then:

P(A, B) = P(A) × P(B)

A small numeric sketch of this computation follows at the end of this section.

Gaussian Naive Bayes Classifier
In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values.
The likelihood of the features is assumed to be Gaussian, hence the conditional probability is given by:

P(x_i | y) = (1 / √(2πσ_y²)) × exp(−(x_i − μ_y)² / (2σ_y²))

where μ_y and σ_y² are the mean and variance of feature x_i for class y. A scikit-learn sketch of this classifier also follows at the end of this section.

Other popular Naive Bayes classifiers are:
• Multinomial Naive Bayes: feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
• Bernoulli Naive Bayes: in the multivariate Bernoulli event model, features are independent booleans (binary variables) describing the inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).
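To make the naive assumption concrete, here is a minimal sketch in plain Python that computes a posterior by multiplying a class prior with per-feature likelihoods. The weather counts and class names below are toy numbers in the style of the classic play-tennis dataset, invented here for illustration only.

```python
# Hand-computed Naive Bayes posterior (toy counts, invented for illustration).
# Under the naive assumption, P(class | features) is proportional to
# P(class) multiplied by the product of P(feature_i | class).

priors = {"play": 9 / 14, "no_play": 5 / 14}          # assumed class priors
likelihoods = {
    "play":    {"outlook=sunny": 2 / 9, "humidity=high": 3 / 9},
    "no_play": {"outlook=sunny": 3 / 5, "humidity=high": 4 / 5},
}

evidence = ["outlook=sunny", "humidity=high"]

# Unnormalised posterior score for each class.
scores = {}
for cls, prior in priors.items():
    score = prior
    for feature in evidence:
        score *= likelihoods[cls][feature]            # independence assumption
    scores[cls] = score

# Normalise so the posteriors sum to 1, then report them.
total = sum(scores.values())
for cls, score in scores.items():
    print(cls, round(score / total, 3))
```

Note that the denominator P(B) of Bayes' theorem never has to be computed explicitly: normalising the scores at the end plays the same role.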
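And here is a minimal Gaussian Naive Bayes sketch using scikit-learn's GaussianNB. The temperature/humidity readings and the play/don't-play labels are made-up toy data, chosen only to show the API.

```python
# Gaussian Naive Bayes with scikit-learn (toy data, invented for illustration).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features: [temperature (deg C), humidity (%)]
X = np.array([
    [30, 85], [27, 90], [22, 70], [18, 65],
    [20, 60], [25, 80], [16, 75], [23, 55],
])
# Labels: 1 = play, 0 = don't play
y = np.array([0, 0, 1, 1, 1, 0, 1, 1])

model = GaussianNB()      # fits one Gaussian per feature, per class
model.fit(X, y)

# Posterior probabilities and prediction for a new day: 21 deg C, 68 % humidity
print(model.predict_proba([[21, 68]]))
print(model.predict([[21, 68]]))
```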
LINEAR REGRESSION
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting. Different regression models differ in the kind of relationship between the dependent and independent variables they consider, and in the number of independent variables used.

Linear Regression (cont.)
Linear regression predicts a dependent variable value (y) based on a given independent variable (x). So this regression technique finds a linear relationship between x (input) and y (output); hence the name Linear Regression. As a typical illustration, x (input) might be the work experience and y (output) the salary of a person; the regression line is then the best-fit line for the model.

Hypothesis Function for Linear Regression
While training the model we are given:
• x: input training data (univariate – one input variable/parameter)
• y: labels for the data (supervised learning)

When training, the model fits the best line to predict the value of y for a given value of x. The hypothesis takes the form:

y = θ1 + θ2·x

• θ1: intercept
• θ2: coefficient of x

The model obtains the best regression fit line by finding the best θ1 and θ2 values. Once we find the best θ1 and θ2 values, we get the best-fit line, and when we finally use the model for prediction, it will predict the value of y for any input value of x. A short fitting sketch follows below.
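Here is a minimal sketch of finding θ1 and θ2 by ordinary least squares with NumPy. The experience/salary numbers are invented toy data echoing the illustration above.

```python
# Fit y = theta1 + theta2 * x by ordinary least squares (toy data, invented).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)        # years of work experience
y = np.array([30, 35, 42, 48, 55], dtype=float)   # salary (in $1000s)

# Design matrix with a column of ones so theta1 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])
theta1, theta2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"theta1 (intercept)   = {theta1:.2f}")
print(f"theta2 (coefficient) = {theta2:.2f}")
print(f"prediction at x = 6  = {theta1 + theta2 * 6:.2f}")
```

Solving the least-squares system directly is one of several ways to obtain the best θ values; gradient descent on the squared error reaches the same line iteratively.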
SIGNIFICANCE TESTS
A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth is being assessed.
• The claim is a statement about a parameter, such as the population proportion p or the population mean μ.
• The results of a significance test are expressed in terms of a probability that measures how well the data and the claim agree.

Stating Hypotheses
A significance test starts with a careful statement of the claims being compared. The claim tested by a statistical test is called the null hypothesis (H0). The test is designed to assess the strength of the evidence against the null hypothesis. Often the null hypothesis is a statement of "no difference." The claim about the population for which evidence is being sought is the alternative hypothesis (Ha). The alternative is one-sided if it states that a parameter is larger or smaller than the null hypothesis value. It is two-sided if it states that the parameter is different from the null value (it could be either smaller or larger).

Example
Suppose a researcher tests whether a sample of students has a higher mean IQ than the general population. The mean IQ score for the sample is 108, which is slightly higher than the population mean of 100. The sample mean is obviously different from the population mean, but a test of significance must be done to determine whether the difference is statistically significant; the difference could possibly be attributed to chance or to sampling error.
The first step is to compute the test statistic. The test statistic is simply the z score for the sample mean, except that the population standard deviation is divided by the square root of N, just as when a confidence interval is computed:

z = (x̄ − μ) / (σ / √N)

To compute the test statistic, the population standard deviation must be known for the variable; the population standard deviation for IQ is 16. The sample size must also be known; in this case it is 16. (In a real research scenario the sample size would be larger; a small sample size is being used in this example to keep the calculations simple.)
After putting the needed information into the formula, the result is a z score of 2: z = (108 − 100) / (16 / √16) = 8 / 4 = 2. This means that the sample mean is exactly two standard deviations above the population mean.

P-Value
After computing the test statistic, the next step is to find the probability of obtaining this score when the null hypothesis is true.
• The Normal curve helps researchers determine the percentage of individuals in the population who are located within certain intervals, or above or below a certain score.
• To find this information, the score needs to be standardized. In the case of the example, this was already done by computing z, the test statistic.

P-Value (cont.)
The Normal distribution can be used to compute the probability of obtaining a certain z score. Assuming that H0 is true:
• Area to the left of z = the probability of obtaining scores lower than z.
• Area to the right of z (the p-value) = the probability of obtaining scores higher than z.
The smaller the p-value, the stronger the evidence against H0 provided by the data.

P-Value Example
The z score in the example is exactly 2, so all the decimals are zero. The area under the curve to the left of this z score is 0.9772. However, for hypothesis testing, the area to the right of z is needed; this is the p-value. Since the entire area under the curve is equal to one, simply subtract the area to the left of the value from one to obtain the p-value: 1 − 0.9772 = 0.0228. This means that, if the null hypothesis were true (i.e. if the population mean really is 100), there would be only about a 2% chance of obtaining a sample mean of 108 or higher.

P-Value and Statistical Significance
It is important to know how small the p-value needs to be in order to reject the null hypothesis.
• The cutoff value for p is called alpha, or the significance level.
• The researcher establishes the value of alpha prior to beginning the statistical analysis.
• In the social sciences, alpha is typically set at 0.05 (or 5%). This represents the amount of acceptable error, i.e. the probability of rejecting a null hypothesis that is in fact true. It is also called the probability of a Type I error.
Once the alpha level has been selected and the p-value has been computed:
• If the p-value is larger than alpha, fail to reject the null hypothesis.
• If the p-value is smaller than alpha, reject the null hypothesis in favour of the alternative hypothesis.

Two-Sided Alternative Hypotheses
• A two-sided alternative hypothesis is used when there is no reason to believe that the sample mean can only be higher, or only lower, than a given value.
• Researchers are then hypothesizing only that the values are significantly different.
• In the example, the alternative hypothesis was one-sided, so it was only necessary to look at the probability that the sample mean was larger than 100.
• However, when the alternative hypothesis is two-sided, the sample mean can be higher or lower than the given value, so researchers must look for both extremely high and extremely low values. This means that alpha is located at both ends of the curve: half of alpha is at the higher end and half at the lower end, so there is both a low cutoff value and a high cutoff value.
• Therefore, in two-sided cases, the p-value is obtained by multiplying the area to the right of z by 2. Only after doubling this value can the p-value be compared to alpha.

Example (cont.)
In the example above, the area to the right of z was 0.0228. So if the alternative hypothesis were two-sided, the p-value would be 2 × 0.0228 = 0.0456. The p-value (0.0456) is smaller than the alpha level (0.05), so the null hypothesis is rejected in favour of the alternative hypothesis. A short computational sketch follows below.
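Here is a minimal sketch that reproduces these numbers with SciPy. The IQ figures (x̄ = 108, μ = 100, σ = 16, N = 16) come directly from the worked example above.

```python
# Reproduce the worked example: z test statistic, one- and two-sided p-values.
from math import sqrt
from scipy.stats import norm

x_bar, mu, sigma, n = 108, 100, 16, 16

# Test statistic: z score of the sample mean.
z = (x_bar - mu) / (sigma / sqrt(n))
print(f"z = {z}")                          # 2.0

# One-sided p-value: area to the right of z.
p_one_sided = norm.sf(z)                   # same as 1 - norm.cdf(z)
print(f"one-sided p = {p_one_sided:.4f}")  # 0.0228

# Two-sided p-value: double the one-sided area.
p_two_sided = 2 * p_one_sided
print(f"two-sided p = {p_two_sided:.4f}")  # 0.0455
```

The two-sided value prints as 0.0455 because it is computed from the unrounded area; the 0.0456 above comes from doubling the rounded table value 0.0228.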
CONFIDENCE LEVEL
A confidence interval is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level; the 95% confidence level is most common, but other levels, such as 90% or 99%, are sometimes used.

Formula

CI = x̄ ± Z × (s / √n)

where:
• CI = confidence interval
• x̄ = sample mean
• Z = confidence level value (the critical z value, e.g. 1.96 for 95%)
• s = sample standard deviation
• n = sample size

Factors that Affect Confidence Intervals (CI)
• Population size: this does not usually affect the CI, but it can be a factor if you are working with small, known groups of people.
• Sample size: the smaller your sample, the less confident you can be that the results reflect the true population parameter.
• Percentage: extreme answers come with better accuracy. For example, if 99 percent of voters are for gay marriage, the chance of error is small. However, if 49.9 percent of voters are "for" and 50.1 percent are "against," the chance of error is bigger.

0% and 100% Confidence Level
A 0% confidence level means you have no faith at all that if you repeated the survey you would get the same results. A 100% confidence level means there is no doubt at all that if you repeated the survey you would get the same results. In reality, you would never publish the results of a survey in which you had no confidence at all that your statistics were accurate (you would probably repeat the survey with better techniques). A 100% confidence level doesn't exist in statistics unless you surveyed an entire population, and even then you probably couldn't be 100 percent sure that your survey wasn't open to some kind of error or bias.

Confidence Coefficient
The confidence coefficient is the confidence level stated as a proportion rather than as a percentage. For example, if you had a confidence level of 99%, the confidence coefficient would be 0.99. In general, the higher the coefficient, the more certain you are that your results are accurate. For example, a 0.99 coefficient is more accurate than a coefficient of 0.89. It is extremely rare to see a coefficient of 1 (meaning that you are positive without a doubt that your results are completely, 100% accurate). A coefficient of zero means that you have no faith that your results are accurate at all.

Confidence coefficients and the equivalent confidence levels:

Confidence coefficient (1 − α)    Confidence level ((1 − α) × 100%)
0.90                              90%
0.95                              95%
0.99                              99%
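As a closing sketch, here is the formula applied to the IQ example from the significance-test section (x̄ = 108, n = 16, 95% level). Treating the slides' standard deviation of 16 as the sample standard deviation s is an assumption made purely for illustration.

```python
# 95% confidence interval for the IQ example (assuming s = 16 is the
# sample standard deviation, purely for illustration).
from math import sqrt
from scipy.stats import norm

x_bar, s, n = 108, 16, 16
confidence = 0.95

z = norm.ppf(1 - (1 - confidence) / 2)   # critical value, approx. 1.96
margin = z * s / sqrt(n)                 # Z * (s / sqrt(n))

print(f"CI = {x_bar} +/- {margin:.2f}")
print(f"   = ({x_bar - margin:.2f}, {x_bar + margin:.2f})")  # approx. (100.16, 115.84)
```

The End.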