Statistics

Sources: CrashCourse - YouTube
Outliers are numbers at the extremes of a data set.
Discrete variables represent counts (e.g. the number of objects in a
collection). Continuous variables represent measurable amounts (e.g.
water volume or weight).
Central tendency
mean = m, the arithmetic average of the data
mode = the most frequent value in the data set
median = the centre of the list when arranged in ascending or descending order
Skewed data = when mean ≠ median; skew ≠ 0 means the distribution is not symmetrical
Normal distribution: mean = mode = median, skew = 0
mode, median < mean: the graph has a tail of large values (positive skew)
mode, median > mean: the graph has a tail of small values (negative skew)
The mean is affected by unusually large values.
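A minimal Python sketch of the three measures, using the standard-library statistics module (the data list is made up for illustration):

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 9]  # made-up data set

print(statistics.mean(data))    # arithmetic mean: sum / count = 4.5
print(statistics.median(data))  # middle of the sorted list = 4.5
print(statistics.mode(data))    # most frequent value = 5
```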
Measures of Spread
range = max(n) − min(n)
IQR = Q3 − Q1 (interquartile range)
The closer a quartile is to the median, the less spread out the data points are in that region.
Outliers are not always bad; sometimes they can be useful.
The range is easily affected by outliers but the IQR is not, making it a better way of measuring spread.
deviation = x − mean (x = data point)
Variance = ∑(deviation²) / (n − 1). Dividing by n − 1 rather than n corrects bias: dividing by n always gives a result a little smaller than the actual variance. Units are squared. Variance shows how spread out the data is and can be affected by outliers.
For a binomial distribution, variance = npq, where n = number of trials, p = probability of success, and q = probability of failure.
σ = √variance. σ is the standard deviation; it shows the average deviation from the mean seen in the group and can be affected by outliers.
Pooled standard deviation, the average standard deviation of 2 groups: σ_pooled = √((s₁² + s₂²) / 2)
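A quick sketch of these spread measures with made-up data (statistics.quantiles is assumed available, Python 3.8+):

```python
import math
import statistics

data = [4, 8, 15, 16, 23, 42]              # made-up data set

r = max(data) - min(data)                  # range = max - min
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                              # interquartile range = Q3 - Q1
var = statistics.variance(data)            # sample variance, divides by n - 1
sd = statistics.stdev(data)                # standard deviation = sqrt(variance)

# pooled SD of two groups, per the formula above
s1 = statistics.stdev([1, 2, 3, 4])
s2 = statistics.stdev([10, 20, 30, 40])
pooled = math.sqrt((s1**2 + s2**2) / 2)
print(r, iqr, var, sd, pooled)
```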
Data Visualisation
2 types of data: categorical (comparing categories, like types of pasta) and quantitative (e.g. ounces of oil; meaningful, constant spacing).
Frequency tables are used for categorical data; contingency tables for more than one variable.
Types of graphs for categorical data:
bar chart (stacked or multiple bars)
pie chart (only one variable)
pictograph (uses pictures)
Always look at the axes; they can amplify the data.
Binning turns quantitative data into categorical data so it can be expressed in a graph.
Changing a frequency table's bins can suppress or amplify a group in the data.
Histograms are good for quantitative data.
A dot plot is a bar graph in which the bars are replaced by dots, each representing a data point.
stem and leaf
A stem-and-leaf plot keeps the information of each data point. For the data 15, 16, 21, 23, 23, 26, 26, 30, 32, 41:
Stem | Leaf
  1  | 5 6
  2  | 1 3 3 6 6
  3  | 0 2
  4  | 1
(e.g. "32" is placed as stem 3, leaf 2)
box and whiskers
The lower fence is the lowest expected value (Q1 − 1.5·IQR); the upper fence is the highest expected value (Q3 + 1.5·IQR).
An outlier is a point outside the upper or lower fence.
Cumulative frequency graph
Shows the cumulative frequency of the data up to each bin.
Can be used to answer questions like "which length occurs fewer than 20 times?"
Distributions
A distribution is the curve formed by repeatedly dividing a histogram's bins until a smooth curve is formed. It gives us the shape of the data, which can be used to compare results and find relations between similar (but not identical) data sets.
For example, if day-wise electricity bills are compared across 2 years, they might be different data sets but have the same shape.
Normal distribution
It is unimodal and symmetrical, as the mean, median, and mode are the same.
The shape is set by the mean (the point of the peak) and the standard deviation, which controls the width (the smaller it is, the slimmer the graph).
68% of the data is within one standard deviation of the mean; in a box-and-whiskers plot, the boxes on either side of the median are equal.
Examples: IQ scores, the number of Froot Loops in a box (normal distributions are very common).
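A small sketch checking the 68% figure (and its 2- and 3-sd cousins) numerically, assuming SciPy is available:

```python
from scipy.stats import norm

# fraction of a normal distribution within 1, 2, 3 standard deviations
for k in (1, 2, 3):
    frac = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {frac:.3f}")   # ~0.683, 0.954, 0.997
```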
skewed distribution
A long tail in the positive direction = positive skew; a long tail in the negative direction = negative skew.
Outliers are present in the graph; the stretched quartile contains a lot of the outliers.
Skew can be useful for showing a problem: for example, a difficult test would have a high skew, as many students would have low marks.
Most skewed data has a larger range and a larger standard deviation; the bigger the tail, the more the data sits on the opposite side of the skewed tail.
Bimodal or multimodal
Data with 2 (or more) modes, i.e. peaks; possibly caused by measuring 2 different mechanisms, thus combining 2 unimodal distributions.
Uniform distribution
A straight line; everything has the same odds, like rolling a die. Note that not every uniform distribution found in theory occurs in practice.
Correlation and relationship
A scatter plot puts all the data in a diagram, each data point represented by a point; a line of best fit (regression line) is drawn through them.
The regression line equation, y = mx + c, can tell us the relation between the data sets: m shows how much a change in one variable leads to a change in the other.
Regression coefficient: m ≠ 0 shows there is some relation. A change in units changes the regression coefficient, so it is standardised using the standard deviations; the result is known as the correlation coefficient, r.
r = 1 , prefect negative correlation , can predict one variable by
another, r = 1 , prefect positive correlation , can predict one variable
by another, 0 no correlation
r2 , is between 0 and 1, one means another variable can be
projected from knowing one
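A minimal sketch fitting a line of best fit and computing r and r², assuming NumPy (the paired data is made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # made-up paired data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, 1)       # least-squares line of best fit y = mx + c
r = np.corrcoef(x, y)[0, 1]      # correlation coefficient, unit-free
print(m, c, r, r**2)             # r near 1 here: strong positive relation
```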
Controlled Experiments
Experiments compare 2 groups to see the effect of a treatment and predict its impact.
Allocation bias - when scientists allocate the treatment to people who might react to it more.
Selection bias - when a specific group of people self-selects (signs up) for the experiment, producing biased results.
Random selection of participants is the best option, as it avoids bias.
Randomised block design - separates the focus group into blocks, e.g. ensuring 30% Indian participants in each group.
Controls - a group which has been given a placebo or no treatment at all. Placebos are important, as just the act of taking a treatment can make you feel better.
Single blind study - the participant does not know which study they are taking part in (or even that they are in one), to avoid bias.
Double blind study - neither the participant nor the researcher knows which group is which, to avoid perception biases in the study. It is the gold standard.
Matched-pairs experiments - taking twins (as their environment and genotype are the same) or choosing a specific group (like an experiment on only women).
Repeated-measures design - tests different scenarios on the same person to keep conditions constant; sometimes regarded as a gold standard for science.
Sampling Methods and Bias with Surveys
surveys
Sometimes it is not possible to perform a controlled experiment, so we take surveys instead.
Problems: sometimes the questions can be too restrictive, or worded to bias towards a certain response.
Sometimes you might ask biased groups, corrupting the responses.
Non-response bias - where people who are likely to complete a survey are systematically different from those who don't.
Under-representation - a minority might not be considered; you can weight the minority's sample higher, but if the sample does not represent the minority it can cause bias.
Stratified random sampling - splits the population into groups of
interest and randomly selects people from each of the "stratas" so
that each group in the overall sample is represented appropriately.
cluster sampling - creates clusters that are naturally occurring and
randomly selects a few clusters to survey instead of randomly
selecting individuals.
snowball sampling - asking respondents to find other members of the group of interest so they can be surveyed too.
census - surveying the entire population; it can give a lot of data, which can be important for finding out about minorities and general patterns.
"When a study reports correlations
Or has mice as its main population
The results it declares
May not be quite fair
So be careful about generalisations."
Alright, let's see you do better.
Probability
Empirical probability - probability derived from an experimental sample.
Theoretical probability - the actual probability of something happening in an ideal setting.
Addition rule - for "either-or" probabilities:
P(mutually exclusive) = P(A) + P(B)
P(non-mutually exclusive) = P(A) + P(B) − P(A and B)
Multiplication rule - if independent, P(A and B) = P(A) × P(B)
Conditional probability: P(2|1) = P(1 and 2) / P(1)
Bayes' theorem: P(B|A) = P(A|B) · P(B) / P(A)
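A worked numeric sketch of Bayes' theorem (the disease-testing numbers are made up for illustration):

```python
# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_disease = 0.01            # P(B): prior probability of the disease
p_pos_given_disease = 0.95  # P(A|B): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# total probability of a positive test, P(A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ~0.16: a positive test is still probably a false alarm
```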
Bayesian statistics is a statistical approach in which probabilities are constantly updated according to new information.
The law of large numbers says the sample mean will get closer to the theoretical mean as the data set gets larger, provided the variance is not ∞.
Binomial distribution
Binomial distribution allows us to compute the probability of
observing a specified number of "successes" when the process
is repeated a specific number of times
Binomial distribution: P(k successes in n trials) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ
Binomial coefficient: C(n, k) = n! / (k!(n − k)!)
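A direct translation of the formula into Python (math.comb supplies the binomial coefficient; the coin example is made up):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(k successes in n trials) = C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binom_pmf(3, 10, 0.5))  # exactly 3 heads in 10 fair flips ~0.117
```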
Bernoulli distribution
Bernoulli distribution: X = 1 = true (success), X = 0 = false (failure); P(X = x) = pˣ · (1 − p)¹⁻ˣ
Geometric distribution
Gives the probability of the first success occurring on a given try.
Geometric distribution: geom(k; p) = (1 − p)ᵏ⁻¹ · p
Cumulative geometric distribution: P(a success within n tries) = 1 − (1 − p)ⁿ
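The same two formulas as a sketch, with a made-up dice example:

```python
def geom_pmf(k: int, p: float) -> float:
    """P(first success on try k) = (1-p)^(k-1) * p."""
    return (1 - p) ** (k - 1) * p

def geom_cdf(n: int, p: float) -> float:
    """P(at least one success within n tries) = 1 - (1-p)^n."""
    return 1 - (1 - p) ** n

print(geom_pmf(3, 1/6))  # first six on the 3rd roll ~0.116
print(geom_cdf(3, 1/6))  # a six within 3 rolls ~0.421
```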
Randomness
Expectation = ∑(R · x), where R = relative frequency and x = each value; use integrals for continuous distributions.
The mean of the sum of 2 independent variables is the sum of their means; the same holds for variance.
Moments of data (about the mean centre)
1st moment = E(x) = ∑x / n (the mean)
2nd moment = E((x − μ)²) (the variance)
Variance = ∑(deviation²) / (n − 1) = E(x²) − (E(x))² = ∑(x − μ)² / n (population form); for a binomial, variance = npq. Dividing by n − 1 corrects the bias of the sample estimate (dividing by n gives a result a little smaller than the actual); units are squared; variance shows how spread out the data is and can be affected by outliers.
3rd moment = E((x − μ)³); standardised skewness = ∑(x − μ)³ / (n·σ³). Positive skew = larger extreme values than the mean; negative skew = smaller extreme values than the mean.
4th moment = E((x − μ)⁴), related to kurtosis = the thickness of the tails in a distribution.
There are three types of peakedness: leptokurtic (very peaked), platykurtic (relatively flat), mesokurtic (in between).
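A sketch of the four moment-based measures, assuming NumPy and SciPy (the right-skewed data is made up):

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 2, 3, 3, 3, 4, 10], dtype=float)  # made-up, right-skewed

print(data.mean())       # 1st moment: the mean
print(data.var())        # 2nd central moment (population form, divides by n)
print(stats.skew(data))  # standardised 3rd moment: > 0 = positive skew
print(stats.kurtosis(data))  # 4th-moment based; excess kurtosis (normal = 0)
```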
ZScores and Percentiles
Z-score = (x − μ) / σ
Percentile: the 95th percentile means 95% score less than you; thus, picking a random person from the crowd, there is a 95% probability that they have a lower score than you.
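Converting a score into a percentile via its z-score, assuming SciPy (the IQ-style numbers are made up):

```python
from scipy.stats import norm

mu, sigma = 100, 15        # made-up IQ-style scale
x = 124.7

z = (x - mu) / sigma       # z-score = (x - mu) / sigma
percentile = norm.cdf(z)   # fraction of people scoring below x
print(z, percentile)       # z ~1.65 -> ~95th percentile
```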
The Normal Distribution
The central limit theorem: the distribution of sample means for an independent random variable will get closer and closer to a normal distribution as the size of the sample gets bigger and bigger, even if the original population distribution isn't normal itself.
The bigger the sample, the more it looks like a normal distribution.
The more trials, the better the sample mean reflects the population mean.
Standard error: SE = σ / √n
Standard error is used to estimate the efficiency, accuracy, and consistency of a sample. In other words, it measures how precisely a sampling distribution represents a population. It helps to compare samples.
Confidence interval
A range of plausible values for the mean is known as a confidence interval.
A 95% confidence interval means the middle 95% of the distribution is covered by the interval.
It runs from the 2.5th to the 97.5th percentile, as it leaves 2.5% on each side.
CI = μ ± (z · SE), where z = the z-score for the chosen percentile.
A 100% CI = ±∞.
A sample of 30 or more makes the sampling distribution approximately normal.
The t-distribution can be used when there is less data; the less information, the thicker the tails. It is unimodal; use a t-score instead of a z-score for the CI.
Margin of error = t* · SE; for a proportion, = z* · √(p̂(1 − p̂) / n), where p̂ = the success rate (sample proportion).
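A sketch of a 95% CI and a proportion's margin of error, assuming SciPy (all numbers made up):

```python
import math
from scipy.stats import norm

xbar, sigma, n = 2.5, 0.8, 100       # made-up sample mean, sd, size
se = sigma / math.sqrt(n)            # standard error
z = norm.ppf(0.975)                  # z leaving 2.5% in each tail, ~1.96

ci = (xbar - z * se, xbar + z * se)  # 95% confidence interval for the mean
print(ci)

# margin of error for a proportion: z * sqrt(p_hat * (1 - p_hat) / n)
p_hat = 0.6
moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(moe)
```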
PValues
Null hypothesis significance testing (NHST) is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation; the logic is a form of reductio ad absurdum.
Representation of the null hypothesis: H₀: μₑ = 2300
The p-value shows how rare the data would be if it were just random chance.
A one-sided p-value compares only one side; in a two-sided test, both the higher and lower sides from the mean are compared.
p < 0.05 is the standard in the community; such results can be called statistically significant. In medicine, 0.01 is used.
People do not agree on the alpha (cut-off) for p-values.
It does not tell you the probability of your hypothesis being correct; it gives the probability of the data given the null hypothesis, not the probability of the hypothesis given the data.
It also does not tell us how big the difference is.
We "fail to reject" the null hypothesis rather than "accept" it, as there is an absence of evidence, which should not be confused with evidence of absence.
Type I error - rejecting the null even though it's true: a false positive.
Type II error - failing to reject the null even though it's false: a false negative.
Changing the cut-off line increases the chance of one error and decreases the other.
1 − α = the probability of correctly not rejecting a true H₀; 1 − β = statistical power, the probability of correctly rejecting a false H₀.
To increase statistical power: a larger effect size (the distance between the means of the distributions) helps, but it is out of our control.
Increasing the sample size makes the standard error smaller, leading to less overlap.
Sufficient statistical power = 80% or more.
P-hacking
P-hacking is when analyses are manipulated to intentionally produce statistically significant p-values.
The family-wise error rate is the probability of making one or more false discoveries (Type I errors) when performing multiple hypothesis tests.
The Bonferroni correction is a multiple-comparison correction
used when several dependent or independent statistical tests
are being performed simultaneously
Bonferroni correction: use α / n as the per-test significance level, where n = the number of tests.
Bayes
Updating beliefs.
Bayes' theorem: P(B|A) = P(A|B) · P(B) / P(A)
Bayes' factor = P(data | H₁) / P(data | H₂): a likelihood ratio of the marginal likelihoods of two competing hypotheses, usually a null and an alternative; it is data-based.
Posterior odds = Bayes' factor × prior odds
Helps to adjust hypotheses with new data.
Can be subjective, as prior beliefs differ.
The more data, the less weight on the prior odds.
Beta(α_posterior, β_posterior) = Beta(α_likelihood + α_prior, β_likelihood + β_prior)
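A sketch of the Beta update above, read as Beta-binomial updating (SciPy assumed; prior and data made up):

```python
from scipy.stats import beta

# Prior Beta(a, b); observe k successes in n trials;
# posterior is Beta(a + k, b + (n - k)).
a_prior, b_prior = 2, 2        # made-up prior belief about a success rate
k, n = 7, 10                   # new data: 7 successes in 10 tries

a_post = a_prior + k
b_post = b_prior + (n - k)
print(beta(a_post, b_post).mean())  # updated estimate ~0.64
```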
Test Statistics
Z-score = (x − μ) / σ
Z-statistic for a group = (X̄_group − μ) / (σ / √n)
Z-statistics bigger than 1 or smaller than −1 are more extreme.
Critical value = the z-score corresponding to α in the z-score table.
As above, when there is less data, use the t-distribution and t-scores instead:
t-statistic = (x̄ − μ) / (s / √n), using the sample standard deviation s
General formula: test statistic = (observed difference − H₀ value) / average variation (the standard error)
TTests
Test statistic = σ
w
=
σ2 1
n1
+
σ2 2
n2
Variation can occur due to random selection.
Two-sample t-test - 2 samples are taken and compared.
Paired t-test - each person is given both treatments, removing between-person variation.
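Both tests as a sketch with SciPy (the measurements are made up for illustration):

```python
from scipy import stats

a = [5.1, 4.9, 6.0, 5.5, 5.8]   # made-up measurements, group 1
b = [4.2, 4.8, 4.5, 4.9, 4.4]   # group 2

t, p = stats.ttest_ind(a, b)    # two-sample t-test
print(t, p)

before = [80, 75, 90, 60, 70]
after  = [78, 70, 88, 61, 65]
t, p = stats.ttest_rel(before, after)  # paired t-test: same people, both conditions
print(t, p)
```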
Degrees of Freedom and Effect Sizes
Degrees of freedom
Degrees of freedom refers to the maximum number of logically
independent values, which are values that have the freedom to
vary, in the data sample
df = N1
Effect size
Sometimes a result can be statistically significant, but the true difference might not be practically significant.
ES = (μ₁ − μ₂) / σ
1. Small (0.2): such an effect between the two groups is negligible and cannot be spotted with the naked eye.
2. Medium (0.5): usually identified when the researcher goes through the data; a medium effect can have a reasonable overall impact.
3. Large (0.8 or greater): a large effect can be observed without any calculation; the impact is significant in real-world scenarios.
Chi-Square Tests
Used for categorical data, like frequency tables.
Chi-square: χ² = ∑ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ = observed frequency and Eᵢ = expected frequency.
Degrees of freedom: for a table of rows and columns, (r − 1)(c − 1); for a single variable, (number of categories − 1).
Expected frequencies should be > 5 for the test to work.
Types of chi-square test:
Goodness of fit - to see how well certain proportions fit our sample; has one categorical variable.
Test of homogeneity - Looking at whether it's likely that
different samples come from the same population.
Test of independence - to see whether 2 categorical variables are completely independent.
Expected frequency = (nᵢ / n) · ∑rᵢ, where nᵢ = the category (column) total, n = the overall total, and ∑rᵢ = the row total.
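A test-of-independence sketch with SciPy (the 2×2 contingency table is made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# made-up 2x2 contingency table: rows = group, columns = outcome
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # dof = (r-1)(c-1) = 1
print(expected)       # expected = row total * column total / grand total
```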
The Replication Crisis
Replication - re-running studies to confirm results
Reproducible analysis - the ability for other scientists to repeat the analyses you performed.
Unscrupulous researchers - researchers that are more concerned
with attention and publishing and splashy headlines than good
science.
There are multiple ways to analyse a given data set, which can lead to different results even when a study is replicated properly.
Many scientists do not know how to use p-values.
published studies have a bias toward overestimating effects
Smaller sample can alter results
Replication can be helpful in addressing all of these.
Replication does not get enough attention and funding.
Regression
t = Observed Coefficient - Null coefficient) /Standard error
Linear general model = model + erro = y
F stat = SSR/d.fssr
SSE/d.fsse
SST = SSE + SSR
SStotal = (Yi − Yˉ )2
Var(Y) = SSntotal
^
SSR = ∑(Y
− Yˉ )2
^ )2
SSE = ∑(Yi − Y
t2 = f
d.f. of SSE =
n−2
d.f of SSR = 1
= mx + b
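A sketch computing these sums of squares and the F-statistic by hand, assuming NumPy (data made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)   # made-up data
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

m, b = np.polyfit(x, y, 1)            # regression line y = mx + b
y_hat = m * x + b

sst = np.sum((y - y.mean())**2)       # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)   # explained by the model
sse = np.sum((y - y_hat)**2)          # left-over error; SST = SSR + SSE

f = (ssr / 1) / (sse / (len(x) - 2))  # F = (SSR/df_SSR) / (SSE/df_SSE)
print(m, b, f)
```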
ANOVA
ANOVA = analysis of variance.
Can be used when there are many categories in the data.
slope = rise / run, where the rise = μ₀ − μ₁ (the difference between group means)
SST = ∑(xᵢ − μ)², where xᵢ = each data point
SST = SSM + SSE
SSM = ∑(x̄ᵢ − x̄)², where x̄ᵢ = the group mean and x̄ = the grand mean
d.f. for the categorical variable (SSM) = k − 1
d.f. for error (SSE) = n − k
Omnibus test - tests many items and groups at once.
Follow-up t-tests can then be done to find more specific differences.
SSB (sum of squares between groups) = ∑(X̄_group − μ_grand)²
In a factorial ANOVA, the SSB of the combined factors captures the interaction.
η² = SS_effect / SS_total
An interaction plot is made from the means of each category; parallel lines signify no interaction.
In statistics, an interaction may arise when considering the
relationship among three or more variables, and describes a
situation in which the effect of one causal variable on an outcome
depends on the state of a second causal variable
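A minimal one-way ANOVA (the omnibus F-test) sketch with SciPy, using made-up group scores:

```python
from scipy.stats import f_oneway

# made-up scores for three groups
g1 = [85, 86, 88, 75, 78]
g2 = [80, 81, 84, 79, 82]
g3 = [60, 65, 70, 72, 68]

f, p = f_oneway(g1, g2, g3)   # one-way ANOVA F-test across all groups
print(f, p)                   # small p: at least one group mean differs
```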
Other GLMs
ANCOVA - uses multiple variables to reduce error, making the model better at predicting; but running multiple ANCOVAs might be considered p-hacking.
Repeated-measures ANOVA - like a paired t-test, but each individual has a baseline; the differences from it go into the ANOVA table and the F-test, removing between-subject variation.
Supervised Machine Learning
Some data is used to train the model and the rest is used to test it, measuring its accuracy score so it can be improved.
Confusion matrix - a table of true/false positives and negatives.
Accuracy score = (TP + TN) / N
Logistic regression - a simple twist on linear regression. It gets its name from the fact that it is a regression that predicts the log odds of an event occurring.
Linear discriminant analysis (LDA) - uses Bayes' theorem and a linear model to predict a future result, comparing the distribution for one class against the distribution for the other. LDA also helps with dimensionality reduction.
k-nearest neighbours - uses nearby data points to predict a point's categorical variable; it is a classifier.
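A train/test split plus KNN classifier sketch, assuming scikit-learn (the iris data set is just a convenient built-in example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # built-in labelled data set

# hold some data back for testing, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)  # classify by 5 nearest neighbours
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))        # fraction predicted correctly
```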
Unsupervised Machine Learning
Unsupervised means there are no pre-made groups (labels), so these methods help in analysing and sorting the data into categories.
Usually harder to test.
k-means - choose k random points known as centroids; assign every data point to its closest centroid to form groups; find each group's centre point as the new centroid; and run it again until the centroids converge (lead to the same result).
Silhouette score - shows how well separated the cluster groups are from each other; the higher, the better the model.
Hierarchical clustering - arranges data into small groups, merges them with others, and continues until all the data is under one big group; like a fractal Venn diagram with sub-groups.
Can be shown in a radar graph.
Has helped in classifying autism spectrum disorder.
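A k-means plus silhouette-score sketch, assuming scikit-learn (the blob data is generated for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# made-up unlabelled data with 3 natural clusters
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)          # assign points to centroids, iterate to converge

print(silhouette_score(X, labels))  # closer to 1 = better-separated clusters
```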
Big data
Big data is data so big that our collection and analysis tools may fall behind.
Facebook likes can be used to predict the Big Five personality traits and other psychometric traits.
This can even be used to predict political beliefs.
Cambridge Analytica has helped many politicians with targeted marketing.
Big data can be used to personalise medication, power Google Maps, and many more features, like City Brain.
Problems
Bias - the data might be biased; e.g. higher arrest rates for Black people in the data might be copied by the algorithm. Sometimes algorithms confuse correlation and causation, which can lead to flawed results.
Algorithmic transparency - showing what the algorithm is detecting can be one of the solutions.
Privacy - the data might be used for reasons the person did not agree to, or they may not be comfortable giving their information out; because of big data, identification has also become easier.
K-anonymity - increasing the number of similar profiles in a system to increase anonymity. But DNA companies like 23andMe can detect your data even if you have not uploaded it, through the pedigree of relatives who have, which can be concerning.
Regulation - there have been some new regulations like COPPA and GDPR which increase algorithmic transparency and decrease companies' control over your information; but big data is very new, and the regulations are not yet polished, with many loopholes and open questions.
A hack might also put your information in public, which can be dangerous, like the iCloud hack.
Statistics in the Courts
It makes more sense to watch the video: www.youtube.com/watch?v=HqH 6yw60m0&list=PL8dPuuaLjXtNM Y bUAhblSAdWRnmBUcr&index=41
The prosecutor's fallacy - is a fallacy of statistical reasoning
involving a test for an occurrence, such as a DNA match. A positive
result in the test may paradoxically be more likely to be an
erroneous result than an actual occurrence, even if the test is very
accurate
Neural Networks
There are input nodes which feed an output node; in between, there are hidden nodes which increase or decrease the weighting of inputs through what is known as an activation function.
Used for data which is too complex.
Networks with more than one hidden layer can be called deep learning, used in image detection and much more.
In between the nodes, feature generation happens: made-up variables which help get to the output.
Rectified linear unit (ReLU) - turns negative values to zero and leaves positive values as they are.
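The whole activation function in two lines of Python:

```python
def relu(x: float) -> float:
    """Rectified linear unit: zero for negatives, identity for positives."""
    return max(0.0, x)

print(relu(-3.2), relu(0.0), relu(2.5))  # 0.0 0.0 2.5
```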
Feed-forward neural network - data moves only forward.
Recurrent neural network - data made in the network can be fed back into it; helpful for tasks in which the last output needs to be remembered, like writing music or detecting words.
Convolutional neural network - uses windows of pixel weights to find certain characteristics in an image; they use convolution (finding features) and pooling (fitting the found data together). Used in Snapchat, for example.
Generative adversarial networks (GANs) - use sets of existing data to learn how to create new data. There are 2 networks: one creates data (the generator); the other checks whether the data is similar enough to the base data (the discriminator). Both keep trying to outdo each other. Can be used to create art.
War
It makes more sense to watch: www.youtube.com/watch?v=rRhHY7Mh5T0&list=PL8dPuuaLjXtNM Y bUAhblSAdWRnmBUcr&index=43
Estimated maximum = m + m/n − 1, where m = the largest number observed and n = the sample size.
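The estimator as a sketch (reading this as the "German tank" style maximum estimate; the serial numbers are made up):

```python
def estimate_max(observed: list[int]) -> float:
    """Estimated maximum = m + m/n - 1, where m = largest value seen,
    n = number of observations."""
    m, n = max(observed), len(observed)
    return m + m / n - 1

print(estimate_max([19, 40, 42, 60]))  # ~74: guess at the true maximum
```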
When Predictions Fail
Markets - in 2008, lenders reduced loan requirements and sold the loans on as assets known as mortgage-backed securities. Many investors had bad statistical models and overestimated which variables were independent. The models also used bad weightings.
Earthquakes - to predict an earthquake you need a location, a magnitude, and a time. There is a lot of variation and very little data.
Elections - low probabilities do not equal impossible events, as seen in the 2016 election. There was also non-response bias.
Prediction is hard: you need a lot of data and an accurate model.
When Predictions Succeed
It makes more sense to watch the video: www.youtube.com/watch?v=uJFdLKkuYc4&list=PL8dPuuaLjXtNM Y bUAhblSAdWRnmBUcr&index
The more complex the model, the more data it needs.
It helps to update our beliefs and see through our certainty.