HCS214 ARTIFICIAL INTELLIGENCE
Naive Bayes Classification Models, Linear Regression, Significance Tests and Confidence Level.
Presentation By: Mabhande Charles
Reg Number: M210669
NAIVE BAYES’ CLASSIFICATION MODELS
Assumption:
The fundamental Naive Bayes' assumption is that each feature makes an:
• independent
• equal
contribution to the outcome.
With relation to our dataset, this concept can be understood as follows:
• We assume that no pair of features is dependent. For example, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the winds. Hence, the features are assumed to be independent.
• Secondly, each feature is given the same weight (or importance). For example, knowing the temperature and humidity alone can't predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.
Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:

P(A|B) = [P(B|A) × P(A)] / P(B)
Bayes' Theorem
• where A and B are events and P(B) ≠ 0.
• Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
• P(A) is the prior probability of A (i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, it is event B).
• P(B|A) is the likelihood: the probability of the evidence given that event A has occurred.
• P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
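As a quick worked illustration (all numbers here are invented for the illustration): suppose 20% of emails are spam, and the word "free" appears in 60% of spam emails but in only 10% of non-spam emails. Then:

P("free") = 0.6 × 0.2 + 0.1 × 0.8 = 0.20
P(spam | "free") = [P("free" | spam) × P(spam)] / P("free") = (0.6 × 0.2) / 0.20 = 0.60

Seeing the evidence "free" raises the probability that the email is spam from the 20% prior to a 60% posterior.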
Naive assumption
Now, it's time to add the naive assumption to Bayes' theorem, which is independence among the features. So now, we split the evidence into its independent parts.
Now, if any two events A and B are independent, then:
P(A,B) = P(A)P(B)
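Applying this independence assumption to an instance with features x1, …, xn and class label y gives the standard Naive Bayes form:

P(y | x1, …, xn) = [P(y) × P(x1|y) × P(x2|y) × … × P(xn|y)] / P(x1, …, xn)

Since the denominator does not depend on y, the predicted class is simply the y that maximises P(y) × P(x1|y) × … × P(xn|y).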
Gaussian Naive Bayes classifier
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values, as sketched below:
[Figure: bell-shaped Gaussian curve, symmetric about the mean of the feature values]
Gaussian Naive Bayes classifier
The likelihood of the features is assumed to be Gaussian, hence, conditional probability is given by:
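The standard Gaussian (Normal) density supplies this likelihood:

P(xi | y) = (1 / √(2π·σy²)) · exp(−(xi − μy)² / (2·σy²))

where μy and σy² are the mean and variance of feature xi estimated from the training samples of class y.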
Other popular Naive Bayes classifiers are:
• Multinomial Naive Bayes: Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
• Bernoulli Naive Bayes: In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).
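To make these event models concrete, here is a minimal sketch using scikit-learn's GaussianNB, MultinomialNB and BernoulliNB (the tiny datasets below are invented purely for illustration):

```python
# Minimal sketch of the three Naive Bayes variants (toy data, for illustration only).
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features (e.g. temperature, humidity) -> Gaussian NB
X_cont = np.array([[30.0, 85.0], [27.0, 90.0], [22.0, 70.0], [20.0, 65.0]])
y = np.array([0, 0, 1, 1])  # 0 = "don't play", 1 = "play"
gnb = GaussianNB().fit(X_cont, y)
print(gnb.predict([[21.0, 68.0]]))        # predicted class
print(gnb.predict_proba([[21.0, 68.0]]))  # posterior P(y | x)

# Word counts per document -> Multinomial NB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 1, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# Binary word occurrence per document -> Bernoulli NB
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))
```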
LINEAR REGRESSION
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting. Regression models differ in the kind of relationship between the dependent and independent variables they consider, and in the number of independent variables used.
Linear Regression cont...
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x). So, this regression technique finds a linear relationship between x (input) and y (output); hence the name Linear Regression.
[Figure: scatter plot of data points with the best-fit regression line]
In the figure, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.
Hypothesis function for Linear Regression:
While training the model we are given:
x: input training data (univariate: one input variable)
y: labels for the data (supervised learning)
When training, the model fits the best line to predict the value of y for a given value of x. The hypothesis function is:
y = θ1 + θ2·x
θ1: intercept
θ2: coefficient of x
The model gets the best regression fit line by finding the best θ1 and θ2 values. Once we find the best θ1 and θ2 values, we get the best-fit line; when we finally use the model for prediction, it predicts the value of y for the input value of x.
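As a sketch of how the best θ1 and θ2 can be found in practice, here is a minimal univariate example using numpy's least-squares solver (the experience/salary numbers are invented for illustration):

```python
# Minimal univariate linear regression via least squares (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # work experience (years)
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])  # salary (thousands)

# Design matrix [1, x] so the solver returns [theta1 (intercept), theta2 (slope)].
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
theta1, theta2 = theta
print(f"intercept theta1 = {theta1:.2f}, slope theta2 = {theta2:.2f}")

# Predict y for a new x using the fitted hypothesis y = theta1 + theta2 * x.
x_new = 6.0
print(f"predicted salary: {theta1 + theta2 * x_new:.2f}")
```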
SIGNIFICANCE TESTS
A test of significance is a formal procedure for comparing observed data with a claim (also
called a hypothesis), the truth of which is being assessed.
• The claim is a statement about a parameter, like the population proportion p or the population mean
μ.
• The results of a significance test are expressed in terms of a probability that measures how well the
data and the claim agree.
Stating Hypotheses
A significance test starts with a careful statement of the claims being compared.
The claim tested by a statistical test is called the null hypothesis (H0). The test is designed to assess the strength of the
evidence against the null hypothesis. Often the null hypothesis is a statement of “no difference.”
The claim about the population for which evidence is being sought is the alternative hypothesis (Ha).
The alternative is one-sided if it states that a parameter is larger or smaller than the null hypothesis value.
It is two-sided if it states that the parameter is different from the null value (it could be either smaller or larger).
Example
[Example slide: a sample of 16 people has a mean IQ score of 108; the population mean IQ is 100 and the population standard deviation is 16. H0: μ = 100; Ha: μ > 100.]
Example....cont...
In the above example, the mean IQ score for the sample is 108. This is slightly higher than the
population mean, which is 100. The sample mean is obviously different from the population mean, but
tests of significance must be done to determine if the difference is statistically significant. The difference
could possibly be attributed to chance or to sampling error.
The first step is to compute the test statistic. The test statistic is simply the z score for the sample mean.
The only difference is that the population standard deviation is divided by the square root of N, just like
when a confidence interval is computed.
To compute the test statistic, the population standard deviation must be known for the variable. The
population standard deviation for IQ is 16.
To compute the test statistic, the sample size must also be known. In this case, it is 16. (In a real research scenario, the sample size would be larger; a small sample size is used in this example to keep the calculations simple.)
After putting the needed information into the formula, the result is a z score of 2. This means that the
sample mean is exactly two standard deviations above the population mean.
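Written out with the numbers from the example:

z = (x̄ − μ) / (σ / √N) = (108 − 100) / (16 / √16) = 8 / 4 = 2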
P-Value
• After computing the test statistic, the next step is to find out the probability of obtaining this score when the null hypothesis is true.
• The Normal curve helps researchers determine the percentage of individuals in the population who are located within certain intervals, or above or below a certain score.
• To find this information, the score needs to be standardized. In the case of the example, this was already done by computing z, the test statistic.
P-Value cont...
The Normal distribution can be used to compute the probability of obtaining a certain z score.
Assuming that H0 is true:
Area to the left of z = the probability of obtaining scores lower than z
Area to the right of z (p-value) = the probability of obtaining scores higher than z
The smaller the p-value, the stronger the evidence against H0 provided by the data.
P-Value example
[Figure: standard Normal curve with the area to the right of z = 2 shaded]
P-Value Example cont...
• The z score in the example is exactly 2, so all decimals are zero. The area under the curve to the left of this z score is 0.9772.
• However, for hypothesis testing, the area to the right of z is needed. This is called the p-value. Since the entire area under the curve is equal to one, simply subtract the area to the left of the value from one to obtain the p-value.
• In this example, 1 − 0.9772 = 0.0228.
• This means that, if the population mean really is 100, there is only about a 2% chance of obtaining a sample mean as high as 108. Such a result would be unlikely to occur by chance alone if the null hypothesis were true.
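The same arithmetic can be sketched in Python with scipy's Normal distribution (values taken from the example above):

```python
# Test statistic and p-value for the IQ example.
import math
from scipy.stats import norm

mu, sigma = 100, 16   # population mean and standard deviation
x_bar, n = 108, 16    # sample mean and sample size

z = (x_bar - mu) / (sigma / math.sqrt(n))
p_one_sided = 1 - norm.cdf(z)            # area to the right of z
p_two_sided = 2 * (1 - norm.cdf(abs(z)))  # both tails (see two-sided tests below)

print(z)            # 2.0
print(p_one_sided)  # ~0.0228
print(p_two_sided)  # ~0.0455 (0.0456 using the rounded table value)
```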
P-Value and Statistical Significance
It is important to know how small the p-value needs to be in order to reject the null hypothesis.
P-Value and Statistical Significance (cont...)
• The cutoff value for p is called alpha, or the significance level.
• The researcher establishes the value of alpha prior to beginning the statistical analysis.
• In the social sciences, alpha is typically set at 0.05 (or 5%). This represents the amount of acceptable error, or the probability of rejecting a null hypothesis that is in fact true. It is also called the probability of Type I error.
Once the alpha level has been selected and the p-value has been computed:
• If the p-value is larger than alpha, fail to reject the null hypothesis (the data do not provide sufficient evidence for the alternative hypothesis).
• If the p-value is smaller than alpha, reject the null hypothesis in favour of the alternative hypothesis.
Two-Sided Alternative Hypotheses
• A two-sided alternative hypothesis is used when there is no reason to believe that the sample mean can only be higher, or only lower, than a given value.
• Researchers are only hypothesizing that the values are significantly different.
• In the example, the alternative hypothesis was one-sided, so it was only necessary to look at the probability that the sample mean was larger than 100.
• However, when the alternative hypothesis is two-sided, the sample mean can be higher or lower than the given value, so researchers must look for both extremely high and extremely low values. This means that alpha is located at both ends of the curve.
• Half of alpha is located at the higher end, and half is located at the lower end.
• So, there is both a low cutoff value and a high cutoff value.
• Therefore, in two-sided cases, the p-value is obtained by multiplying the area to the right of z by 2.
• Only after doubling this value can the p-value be compared to alpha.
Example
In the above example, the area to the right of z was 0.0228. So, if the alternative
hypothesis were two-sided, the p-value would be 0.0456.
The p-value (0.0456) is smaller than the alpha level (0.05), so the null hypothesis is
rejected and the alternative hypothesis is accepted.
CONFIDENCE LEVEL
A confidence interval is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level; the 95% confidence level is most common, but other levels, such as 90% or 99%, are sometimes used.
Formula
CI = x̄ ± Z × (S / √n)
where:
CI = confidence interval
x̄ = sample mean
Z = z-value for the chosen confidence level (e.g. 1.96 for 95%)
S = sample standard deviation
n = sample size
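A minimal sketch of this formula in Python (the sample values are assumed for illustration):

```python
# Confidence interval for a mean: CI = x_bar ± Z * S / sqrt(n).
import math

x_bar = 108.0   # sample mean (illustrative)
s = 16.0        # sample standard deviation (illustrative)
n = 16          # sample size
z = 1.96        # z-value for a 95% confidence level

margin = z * s / math.sqrt(n)
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")  # (100.16, 115.84)
```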
Factors that Affect Confidence Intervals (CI)
• Population size: this does not usually affect the CI, but it can be a factor if you are working with small and known groups of people.
• Sample size: the smaller your sample, the less likely it is that you can be confident the results reflect the true population parameter.
• Percentage: extreme answers come with better accuracy. For example, if 99 percent of voters are for gay marriage, the chances of error are small. However, if 49.9 percent of voters are "for" and 50.1 percent are "against", then the chances of error are greater.
0% and 100% Confidence Level
A 0% confidence level means you have no faith at all that if you repeated the survey you would get the same results. A 100% confidence level means there is no doubt at all that if you repeated the survey you would get the same results. In reality, you would never publish the results from a survey where you had no confidence at all that your statistics were accurate (you would probably repeat the survey with better techniques). A 100% confidence level doesn't exist in statistics unless you surveyed an entire population, and even then you probably couldn't be 100 percent sure that your survey wasn't open to some kind of error or bias.
Confidence Coefficient
The confidence coefficient is the confidence level stated as a proportion, rather than as a percentage. For example, if you had a
confidence level of 99%, the confidence coefficient would be .99.
In general, the higher the coefficient, the more certain you are that your results are accurate. For example, a .99 coefficient indicates more certainty than a coefficient of .89. It's extremely rare to see a coefficient of 1 (meaning that you are positive, without a doubt, that your results are completely, 100% accurate). A coefficient of zero means that you have no faith that your results are accurate at all.
Confidence coefficients and the equivalent confidence levels

Confidence coefficient (1 − α)    Confidence level ((1 − α) × 100%)
0.90                              90%
0.95                              95%
0.99                              99%
The End.