Uploaded by Fadyibrahem1299

Statistics Summary

[ Basic Definitions ]
Population
the entire group under study
Sample
a representative subset of population
[ Measures of Center ]
Mean (average)
the sum of the numbers divided by the count of numbers
[9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
SUM = 243, COUNT = 15, MEAN = 243 / 15 = 16.2
Median (middle)
the number in the middle after arranging all values from lowest to highest
if there are two middle numbers, add them and divide by two
[9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
MEDIAN = 15 (the 8th of the 15 values)
Mode (most)
the number that appears most often in the sample.
[9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
MODE = 13 (appears 3 times)
Special Cases*:
No Mode ... if all numbers occur the same number of times
Two (or more) Modes ... when more than one number ties for most frequent
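The three measures of center can be checked with Python's standard `statistics` module. This is only a small sketch using the data set above; `multimode` returns a list, so it also covers the special case of several modes.

```python
from statistics import mean, median, multimode

data = [9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]

print(mean(data))       # sum / count = 243 / 15
print(median(data))     # middle (8th) value of the sorted list
print(multimode(data))  # list of every value tied for most frequent
```

For a list such as [1, 1, 2, 2], `multimode` returns both values, which matches the "two modes" case.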
[ Measures of Spread ]
Range (difference)
the difference between the largest and the smallest number.
[72, 110, 134, 190, 238, 287, 305, 324]
MAX (324) - MIN (72) = RANGE = 252
IQR (interquartile range)
For any distribution that's ordered from low to high,
the IQR spans the middle half of the values.
[72, 110, 134, 190, 238, 287, 305, 324]
To find the IQR, first find the values at Q1 and Q3.
Multiply the count of values (8) by 0.25 and 0.75 to get their positions:
Q1 position: 0.25 x 8 = 2, i.e., Q1 = 110
Q3 position: 0.75 x 8 = 6, i.e., Q3 = 287
IQR = Q3 – Q1 = 287 – 110 = 177
Outliers
Outliers are extreme values that sit far from the rest of the data,
like 72 or 61 in this array: [12, 5, 9, 11, 72, 7, 61].
A common rule: data points that fall below Q1 – 1.5 x IQR
or above Q3 + 1.5 x IQR are outliers.
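The range, IQR, and outlier fences can be sketched in a few lines of Python. The helper `quartile_by_position` is a hypothetical name that follows this sheet's positional rule (count x 0.25 and count x 0.75) and assumes the position comes out as a whole number. Other quartile conventions, such as the one used by `statistics.quantiles`, can give slightly different values.

```python
def quartile_by_position(sorted_values, fraction):
    """Value at 1-based position fraction * count (the rule used in this sheet).

    Assumes fraction * count is a whole number, as in the 8-value example.
    """
    position = round(fraction * len(sorted_values))
    return sorted_values[position - 1]


values = sorted([72, 110, 134, 190, 238, 287, 305, 324])

value_range = values[-1] - values[0]      # largest minus smallest
q1 = quartile_by_position(values, 0.25)   # 2nd value
q3 = quartile_by_position(values, 0.75)   # 6th value
iqr = q3 - q1

# Fences for the 1.5 x IQR outlier rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(value_range, q1, q3, iqr)            # 252 110 287 177
print(lower_fence, upper_fence, outliers)  # -155.5 552.5 []
```

With these fences, none of the eight values is flagged as an outlier.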
Variance (distance)
The average squared distance from each data point to the mean.
1. Calculate the mean of the set.
2. From each number, subtract the mean
to find the differences (deviations).
3. Square each difference.
4. Work out the average of the squared differences.
Standard Deviation
The square root of the variance;
measures how far a group of numbers typically is from the mean
standard deviation S = √(variance)
Z-Score (standard score)
how far a data point is from the mean, counted in standard deviations [-3, -2, -1, 0, +1, +2, +3]
z = (data point - mean) / standard deviation
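The variance steps, the standard deviation, and the z-score can be traced in plain Python. This sketch reuses the data set from the Mean example, which is an assumption since this section gives no data of its own. It divides by the count (population variance), as step 4 describes; a sample variance would divide by count - 1 instead.

```python
from math import sqrt

# Assumption: reusing the data set from the Mean example
data = [9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]

mean = sum(data) / len(data)                  # step 1: mean of the set
differences = [x - mean for x in data]        # step 2: subtract the mean
squared = [d ** 2 for d in differences]       # step 3: square each difference
variance = sum(squared) / len(squared)        # step 4: average of the squares
std_dev = sqrt(variance)                      # standard deviation S


def z_score(x):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


print(variance, std_dev)
print(z_score(24), z_score(9))  # positive for values above the mean, negative below
```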
[ Measures of Chance ]
Probabilities
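Let's assume we have the following Palomar College male student characteristics
(figures used below: 10,000 men in total; 6,100 attractive; 1,700 wealthy; 910 with an A average; 2,460 with a B average):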
1- Simple Probability:
P (attractive) = 6100 attractive men /10,000 men
P (attractive) = 6100 / 10,000 = 0.61 (61%)
2- Joint Probability:
Probability of Event A AND Event B happening (assuming the two events are independent)
p (attractive AND wealthy) = p (attractive) x p (wealthy)
p (attractive AND wealthy) = (6100/10,000) x (1700/10,000)
p (attractive AND wealthy) = (0.61) x (0.17) = 0.104 (10.4%)
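The simple and joint probabilities above can be reproduced directly from the counts. The variable names are illustrative, and the multiplication step is only valid under the assumption that being attractive and being wealthy are independent.

```python
total_men = 10_000
attractive = 6_100
wealthy = 1_700

# Simple probability: favourable count / total count
p_attractive = attractive / total_men
p_wealthy = wealthy / total_men

# Joint probability via the multiplication rule (assumes independence)
p_attractive_and_wealthy = p_attractive * p_wealthy

print(round(p_attractive, 2))              # 0.61
print(round(p_attractive_and_wealthy, 3))  # 0.104
```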
3- Union Probability:
• Mutually Exclusive (cannot occur simultaneously):
p (A average OR B average) = p (A average) + p (B average)
p (A average OR B average) = (910/10,000) + (2460/10,000)
p (A average OR B average) = (0.091) + (0.246) = 0.337 (33.7%)
• Non-Mutually Exclusive (can occur simultaneously):
p (A average OR attractive) = p (A average) + p (attractive) - p (A average AND attractive)
p (A average OR attractive) = (910/10,000) + (6100/10,000) - [(910/10,000) x (6100/10,000)]
p (A average OR attractive) = (0.091) + (0.61) - [(0.091) x (0.61)]
p (A average OR attractive) = (0.701) - (0.056) = 0.645 (64.5%)
4- Conditional Probability:
P(A | B) = P(A AND B) / P(B), where A is the event to measure and B is the condition
P(COVID) = 0.13
P(fever) = 0.42
P(COVID | fever) = P(COVID AND fever) / P(fever)
= 0.054 / 0.42
≈ 0.129 (12.9%)
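A short sketch of the union and conditional rules, using the same figures. The function names `p_union` and `p_conditional` are hypothetical helpers, and the overlap for the non-mutually-exclusive case is computed by multiplication, which again assumes independence.

```python
def p_union(p_a, p_b, p_a_and_b=0.0):
    """P(A OR B). Leave p_a_and_b at 0 for mutually exclusive events."""
    return p_a + p_b - p_a_and_b


def p_conditional(p_a_and_b, p_b):
    """P(A | B) = P(A AND B) / P(B)."""
    return p_a_and_b / p_b


p_a_average = 910 / 10_000
p_b_average = 2_460 / 10_000
p_attractive = 6_100 / 10_000

# Mutually exclusive: a student cannot hold an A average and a B average at once
print(round(p_union(p_a_average, p_b_average), 3))   # 0.337

# Non-mutually exclusive: subtract the overlap (assumes independence)
overlap = p_a_average * p_attractive
print(round(p_union(p_a_average, p_attractive, overlap), 3))  # 0.645

# Conditional: P(COVID | fever) with P(COVID AND fever) = 0.054, P(fever) = 0.42
print(round(p_conditional(0.054, 0.42), 3))  # 0.129
```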
[ Probability Distributions ]
1- Discrete Distributions
Discrete probability distributions describe outcomes that are countable (they can be listed one by one),
such as a value of 1, 2, 3, true, false, success, or failure.
Binomial Distribution
Probability distribution of the number of successes in a
sequence of independent trials:
• Number of heads in a sequence of coin flips
• 10 women in a sample of 100 people
n = the number of trials (occurrences)
p = the probability of success in a single trial
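As a sketch, the standard binomial formula P(X = k) = C(n, k) x p^k x (1 - p)^(n - k) can be written with `math.comb`. The symbol k (the number of successes) and the function name `binomial_pmf` are assumptions added here; n and p are as defined above.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Illustrative example: exactly 5 heads in 10 flips of a fair coin
print(binomial_pmf(5, 10, 0.5))  # 252 / 1024 = 0.24609375
```

Summing `binomial_pmf(k, n, p)` over k = 0..n gives 1, which is a quick sanity check on any n and p.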
Poisson Distribution
A Poisson distribution measures how many times
an event is likely to occur within a fixed period of time
• Probability of 12 people arriving at a restaurant per hour
• Probability of <200 visits to a website per day
x = the number of occurrences (a Poisson random variable)
λ = the average rate of occurrence (close to the graph's peak)
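The standard Poisson formula P(X = x) = λ^x x e^(-λ) / x! is easy to sketch with the `math` module. The average rate of 10 arrivals per hour below is a made-up value for illustration only; the sheet does not state a rate for the restaurant example.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the average rate is lam."""
    return lam ** x * exp(-lam) / factorial(x)


# Hypothetical rate: an average of 10 arrivals per hour (not from the sheet)
average_arrivals = 10

# Probability of exactly 12 people arriving in one hour
print(poisson_pmf(12, average_arrivals))

# Probability of fewer than 12 arrivals: add up the cases 0..11
print(sum(poisson_pmf(x, average_arrivals) for x in range(12)))
```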
2- Continuous Distributions
A probability distribution in which the random variable X can take on any value within a range.
The probability that X falls between two values (a and b) equals the integral (area under
the curve) between a and b.
The Normal Distribution
Also known as the Gaussian distribution, it is a
probability distribution that is symmetric
about the mean (bell curve).
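The "area under the curve between a and b" idea can be sketched with `statistics.NormalDist` from the Python standard library, where the area is the difference of two cumulative probabilities. The helper name `probability_between` and the choice of a standard normal (mean 0, standard deviation 1) are assumptions for the example.

```python
from statistics import NormalDist

# Assumption for the example: a standard normal distribution
standard_normal = NormalDist(mu=0, sigma=1)


def probability_between(dist, a, b):
    """Area under the curve between a and b, i.e. P(a < X < b)."""
    return dist.cdf(b) - dist.cdf(a)


# About 68% of values lie within one standard deviation of the mean
print(round(probability_between(standard_normal, -1, 1), 4))  # 0.6827

# Symmetry about the mean: half of the area lies on each side
print(standard_normal.cdf(0))  # 0.5
```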
[ Hypothesis Testing ]
Hypothesis testing is a type of statistical analysis in which you put your assumptions
about a population parameter to the test. It is used to assess the relationship between 2
statistical variables, for example:
• Does changing the name of the website increase traffic?
• Does taking a vitamin C supplement affect the sex ratio in pregnancy?
Null Hypothesis (H0):
Assumes no effect: no difference exists
Alternative Hypothesis (H1):
Assumes a difference exists, e.g., the birth sex ratio differs between the two populations
P-Value (probability value):
how likely it is that data at least as extreme as yours would occur if the null hypothesis were true.
Alpha α (significance level):
probability threshold for rejecting the null hypothesis,
alpha is decided before data collection happens (typically 0.05)
if ( p ≤ α ) reject null hypothesis
Hypothesis testing workflow:
1- Define the population: adults aged 18-30 who eat meat or are vegans
2- Define Null Hypothesis: no difference in cancer frequency between meat eaters and vegans
3- Define Alternative Hypothesis: meat eaters are more likely to develop cancer
4- Collect Data: diet status and cancer frequency in the past 3 years
5- Perform Statistical Test: difference in mean number of cancers between the two populations
6- Draw Conclusion: is the frequency of cancer higher in adults who consume meat?
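Steps 5 and 6 can be sketched with a one-sided two-proportion z-test, which is one common choice for comparing two frequencies and not necessarily the test the workflow has in mind. The counts below are invented purely for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_p_value(cases_a, n_a, cases_b, n_b):
    """p-value for H1 'group A has a higher rate than group B' (pooled z-test)."""
    rate_a = cases_a / n_a
    rate_b = cases_b / n_b
    pooled = (cases_a + cases_b) / (n_a + n_b)   # rate if H0 (no difference) holds
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_error
    return 1 - NormalDist().cdf(z)               # upper-tail probability


alpha = 0.05  # significance level, fixed before collecting data

# Hypothetical counts (not real data): cancer cases among 1,000 meat eaters
# and 1,000 vegans
p_value = one_sided_two_proportion_p_value(30, 1000, 18, 1000)

reject_null = p_value <= alpha
print(p_value, reject_null)
```

With equal rates in both groups the function returns 0.5, so the null hypothesis would not be rejected.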
Statistical Tests & Experiments:
Treatment: the independent variable (advertisement)
Response: the dependent variable (number of purchases)
Controlled Experiments:
Treatment group: sees the advertisement
Control group: doesn’t see the advertisement
Experiment Standards:
Randomization: participants are assigned to the treatment/control group randomly.
Blinding: participants don't know which group they are in.
*Experiments that use these techniques are referred to as Randomized Controlled Trials (RCTs)
A/B Testing: a form of RCT that tests only two different treatments
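Randomization can be sketched as shuffling the participant list and splitting it in half. The function name `assign_groups`, the participant IDs, and the fixed seed (used only to make the example repeatable) are all assumptions.

```python
import random


def assign_groups(participants, seed=None):
    """Randomly split participants into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)           # random order removes any selection bias
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant IDs 1..20
treatment, control = assign_groups(range(1, 21), seed=42)

print(len(treatment), len(control))  # 10 10
```

In an A/B test, the treatment group would see version A (for example, the advertisement) and the control group version B.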
[ Correlation ]
Correlation means association. More precisely, it measures the extent to which two variables are
related. We can describe the relationship by its direction, as either positive or negative,
or by its strength, as strong, moderate, or weak.
Correlation does not imply causation, remember that!!
Pearson Correlation Coefficient (r):
• Quantifies the strength of a relationship between two variables, with a number between -1 & 1
• Applies only to linear relationships between two variables
• The magnitude corresponds to the strength of the relationship
• While the sign (+ or -) corresponds to the direction of the relationship
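A minimal sketch of the Pearson coefficient, computed as the sum of the products of deviations divided by the square root of the product of the summed squared deviations. The function name and the small data lists are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(sum_sq_x * sum_sq_y)


# Illustrative data: y rises with x in the first pair, falls in the second
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # strong positive, close to +1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # strong negative, close to -1
```

The function is undefined when either list has no variation, because the denominator is then zero.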