Statistics
Sources: CrashCourse Statistics - YouTube

Outliers are values at the extremes of the data.
Discrete variables represent counts (e.g. the number of objects in a collection). Continuous variables represent measurable amounts (e.g. water volume or weight).

Central tendency
mean = the average of all values
mode = the most frequent value in the data set
median = the centre of the list arranged in ascending or descending order
Skewed data = when mean ≠ median; the distribution is not symmetrical (skewness ≠ 0)
Normal distribution: mean = mode = median, skewness = 0
If mode and median < mean, the graph has a tail of large values (positive skew).
If mode and median > mean, the graph has a tail of small values (negative skew).
The mean is affected by unusually large values; the median is not.

Measures of Spread
range = max − min
IQR = Q3 − Q1 (interquartile range)
The closer the quartiles are to the median, the less spread out the data points are.
Outliers are not always bad; sometimes they are useful.
The range is easily affected by outliers, but the IQR is not, so it is a better way of measuring average spread.
deviation = x − mean (x = a data point)
Variance = Σ deviation² / (n − 1); n = number of observations. We divide by n − 1 because dividing by n is biased, coming out a little smaller than the actual variance. Units are squared. Variance shows how spread out the data is and can be affected by outliers.
For a binomial variable, variance = npq (n = trials, p = probability of success, q = 1 − p = probability of failure).
σ = standard deviation = √variance; shows the average deviation from the mean seen in the group; can be affected by outliers.
Pooled standard deviation is the combined standard deviation of 2 groups:
σ_pooled = √((s1² + s2²) / 2)

Data Visualisation
2 types of data: categorical (comparing categories, like types of pasta) and quantitative (ounces of oil; meaningful, constant spacing).
Frequency table for one categorical variable; contingency table for more than one.
Graphs for categorical data: bar chart (stacked or multiple bars), pie chart (shows only one variable), pictogram (uses pictures).
Always look at the axes; scaling can amplify the data.
Binning turns quantitative data into categories so it can be graphed; changing the bins of a frequency table can suppress or amplify a group.
Histograms are good for quantitative data.
A dot plot is a bar graph with the bars replaced by stacked dots,
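These measures are easy to check with Python's standard statistics module. A minimal sketch (the data set is made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)                  # sum / n
median = statistics.median(data)              # middle of the sorted list
mode = statistics.mode(data)                  # most frequent value
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1
var = statistics.variance(data)               # sample variance, divides by n - 1
sd = statistics.stdev(data)                   # square root of the variance

print(mean, median, mode, data_range, iqr, round(sd, 2))
```

Note that statistics.variance is the sample variance (the n − 1 version from the notes); statistics.pvariance divides by n instead.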
each of which represents a data point.

Stem-and-leaf plot: keeps the information of each data point.
Data: 15, 16, 21, 23, 23, 26, 26, 30, 32, 41

Stem | Leaf
  1  | 5 6
  2  | 1 3 3 6 6
  3  | 0 2
  4  | 1

(How to place "32": stem 3, leaf 2.)

Box and whiskers
lower fence = Q1 − 1.5·IQR, the lowest value still considered usual
upper fence = Q3 + 1.5·IQR, the highest value still considered usual
An outlier is a point outside the upper or lower fence.

Cumulative frequency graph
Shows the frequency of the data up to and including each bin; can be used to answer questions like "which lengths occur fewer than 20 times?"

Distributions
A distribution is the curve formed by repeatedly dividing a histogram's bins until a smooth curve appears. It gives us the shape of the data, which can be used to compare similar (but not identical) data sets: e.g. if day-by-day electricity bills are compared for 2 years, they may be different data sets with the same shape.

Normal distribution
It is unimodal and symmetrical, as mean, median, and mode are the same. The shape is determined by the mean (the point of the peak) and the standard deviation (the smaller it is, the slimmer the graph). About 68% of the data is within one standard deviation of the mean. In a box-and-whiskers plot, the Q1-to-median and median-to-Q3 sections are equal. Examples: IQ scores, Froot Loops in a box (very common).

Skewed distribution
Long tail in the positive direction = positive skew; long tail in the negative direction = negative skew.
Outliers stretch the quartiles; a skewed graph has a lot of outliers.
Skew can usefully reveal a problem: for example, a difficult test would have a high positive skew, as many students have low marks.
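The box-and-whiskers fences can be computed directly. A quick Python sketch using the stem-and-leaf data above, with one invented extreme value added to trigger the outlier rule:

```python
import statistics

values = [15, 16, 21, 23, 23, 26, 26, 30, 32, 41, 70]  # 70 added as a suspect point

q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # lowest value still considered usual
upper_fence = q3 + 1.5 * iqr   # highest value still considered usual

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)
```

The 1.5·IQR multiplier is the usual boxplot convention; points beyond the fences are flagged rather than deleted.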
Most skewed data has a larger range and a larger standard deviation; a bigger tail, with most of the data on the side opposite the skewed tail.
Bimodal or multimodal = data with 2 (or more) modes/peaks (possibly measuring 2 different mechanisms, i.e. combining 2 unimodal distributions).

Uniform distribution
A flat line; every outcome has the same odds, like rolling a fair die. Note that uniform distributions are rarely found in practice.

Correlation and relationships
A scatter plot puts all the data in a diagram, each observation represented by a point; a line of best fit (regression line) is drawn through it.
The regression line equation tells us the relationship between the data sets: in y = mx + c, m shows how much change in one variable leads to change in the other.
Regression coefficient: m ≠ 0 shows there is some relationship.
A change of units changes the regression coefficient, so we standardise using the standard deviations; the result is known as the correlation coefficient, r.
r = −1: perfect negative correlation; one variable can be predicted from the other.
r = 1: perfect positive correlation; one variable can be predicted from the other.
r = 0: no correlation.
r² is between 0 and 1; 1 means one variable can be fully predicted from knowing the other.

Controlled Experiments
Experiments compare 2 groups to see and predict the effect of a treatment.
Allocation bias = when scientists allocate the treatment to the people who might react to it more.
Selection bias = when a specific group of people signs up for the experiment, producing biased results.
Random selection of participants is the best option, as it avoids bias.
Randomised block design separates the participants into blocks (e.g. 30% Indian participants in each group) before randomising.
Controls = a group given a placebo, or to which the treatment is not given. Placebos are important because just the act of taking a treatment can make you feel better.
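The regression slope and the correlation coefficient can both be computed from the same deviation sums. A minimal sketch (the x/y values are invented; variable names like sxy are mine):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# sums of products of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

m = sxy / sxx                   # regression slope in y = mx + c
c = mean_y - m * mean_x         # intercept
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient, unit-free

print(round(m, 3), round(c, 3), round(r, 4))
```

Dividing sxy by √(sxx·syy) instead of sxx is exactly the standardisation step the notes describe: it removes the units, so r always lands between −1 and 1.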
Single-blind study = the participants do not know which study they are taking part in, or whether they are in one at all, to avoid bias.
Double-blind study = neither the participants nor the researchers know who received which treatment, to avoid perception biases in the study. It is the gold standard.
Matched-pairs experiments = taking twins, as their environment and genotype are the same, or choosing a specific group (e.g. an experiment on only women).
Repeated-measures design = testing different scenarios on the same person to keep the conditions the same for the experiment; sometimes regarded as the gold standard for science.

Sampling Methods and Bias with Surveys
Sometimes it is not possible to perform a controlled experiment, so we take surveys instead.
Problems: the questions can be too restrictive, or worded to bias towards a certain response. Sometimes you might ask biased groups, corrupting the responses.
Non-response bias = people who are likely to complete a survey are systematically different from those who don't.
Under-representation = a minority might not be considered. You can weight the sample from the minority higher, but if the sample does not represent the minority, it can still cause bias.
Stratified random sampling = splits the population into groups of interest and randomly selects people from each of the "strata", so that each group in the overall sample is represented appropriately.
Cluster sampling = uses naturally occurring clusters and randomly selects a few clusters to survey, instead of randomly selecting individuals.
Snowball sampling = asking people to find others in the group of interest and survey them.
Census = surveying the entire population; gives a lot of data, which can be important for finding out about minorities and general patterns.

When a study reports correlations
Or has mice as its main population
The results it declares
May not be quite fair
So be careful about generalisations

Alright, let's see you do better.
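Stratified random sampling is mechanical enough to sketch in code. A minimal Python illustration (the population, group labels, and the helper name stratified_sample are all invented for the example):

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

# hypothetical population: 80 people in group A, 20 in group B
population = [("person%d" % i, "A" if i < 80 else "B") for i in range(100)]

def stratified_sample(pop, key, n_per_stratum):
    """Split the population into strata, then randomly sample each one."""
    strata = {}
    for item in pop:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample

sample = stratified_sample(population, key=lambda p: p[1], n_per_stratum=5)
print(len(sample))
```

Because sampling is done within each stratum, the minority group B is guaranteed representation, which a simple random sample of 10 would not guarantee.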
Probability
Empirical probability = probability derived from an experiment or sample.
Theoretical probability = the actual probability of something happening in an ideal setting.
Addition rule (either/or):
P(mutually exclusive) = P(A) + P(B)
P(not mutually exclusive) = P(A) + P(B) − P(A and B)
Multiplication rule: if independent, P(A and B) = P(A) · P(B)
Conditional probability: P(2|1) = P(1 and 2) / P(1)
Bayes' theorem: P(B|A) = P(A|B) · P(B) / P(A)
Bayesian statistics is the statistical approach in which the probability is constantly updated according to new information.
Law of large numbers: the sample mean gets closer to the theoretical mean as the data gets larger, provided the variance is not ∞.

Binomial distribution
The binomial distribution allows us to compute the probability of observing a specified number of "successes" when the process is repeated a specific number of times:
P(k successes) = C(n, k) · p^k · (1 − p)^(n−k)
Binomial coefficient: C(n, k) = n! / (k!(n − k)!)

Bernoulli distribution
X = 1 = true (success), X = 0 = false (failure); P(X = x) = p^x · (1 − p)^(1−x)

Geometric distribution
Gives the probability of the first success on a given try:
geom(k; p) = (1 − p)^(k−1) · p
Cumulative geometric distribution (success within n tries) = 1 − (1 − p)^n

Randomness
Expectation = Σ(x · P(x)), where x runs over the possible values and P(x) is the relative frequency; use integrals for continuous distributions.
The mean of the sum of 2 independent variables is the sum of their means; the same holds for variance.
mean = the centre of mass of the distribution.

Moments of data
1st moment = E(x) = Σx / n (the mean)
2nd central moment = E((x − μ)²) = variance = Σ(x − μ)² / n = E(x²) − (E(x))²; for a sample, Σ deviation² / (n − 1)
3rd standardised moment = E((x − μ)³) / σ³ = skewness; positive skew = larger extreme values than the mean, negative skew = smaller extreme values than the mean
4th standardised moment = E((x − μ)⁴) / σ⁴ = kurtosis, the thickness of the tails in a
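The binomial and geometric formulas above translate directly into code. A minimal Python sketch (function names are mine; math.comb gives the binomial coefficient):

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """P(first success on trial k): k - 1 failures, then a success."""
    return (1 - p)**(k - 1) * p

def geometric_cdf(n, p):
    """P(at least one success within the first n trials)."""
    return 1 - (1 - p)**n

print(binomial_pmf(2, 4, 0.5))  # two heads in four fair coin flips
```

A useful sanity check: the binomial probabilities for k = 0..n must sum to 1, since exactly one of those outcomes occurs.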
distribution.
There are three types of peakedness:
Leptokurtic = very peaked
Platykurtic = relatively flat
Mesokurtic = in between

Z-Scores and Percentiles
z-score = (x − μ) / σ
Percentile: being at the 95th percentile means 95% of people have a lower score than you; picking a person from the crowd, there is a 95% probability they would have a lower score than you.

The Normal Distribution
The distribution of sample means for an independent random variable will get closer and closer to a normal distribution as the size of the sample gets bigger and bigger, even if the original population distribution isn't normal itself.
The larger the sample, the more it looks like a normal distribution; the more trials, the more the sample mean reflects the population mean.
Standard error: SE = σ / √n
Standard error is used to estimate the efficiency, accuracy, and consistency of a sample. In other words, it measures how precisely a sampling distribution represents a population, and helps with comparisons.

Confidence interval
A range around the mean is known as a confidence interval.
A 95% confidence interval means the middle 95% of the distribution is covered by the interval; it runs from the 2.5th to the 97.5th percentile, leaving 2.5% on each side.
CI = μ ± (z · SE), where z = the z-score of the percentile.
A 100% CI would be ±∞.
Samples of 30 or more lead to an approximately normal sampling distribution; the t-distribution can be used when there is less data. Less information = thicker tails.
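The standard error and confidence interval formulas can be sketched with the standard library's NormalDist. A minimal example (the sample numbers are invented, IQ-style values):

```python
import math
from statistics import NormalDist

sigma, n, sample_mean = 15.0, 36, 103.0  # hypothetical known sd, sample size, mean
se = sigma / math.sqrt(n)                # standard error: SE = sigma / sqrt(n)

z = NormalDist().inv_cdf(0.975)          # ~1.96: leaves 2.5% in each tail
ci = (sample_mean - z * se, sample_mean + z * se)
print(ci)
```

inv_cdf(0.975) is where the 97.5th percentile of the standard normal sits, which is exactly the "leaves 2.5% on each side" construction from the notes.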
It is unimodal; use the t-score instead of the z-score for the CI.
Margin of error = t* · SE; for a proportion, = z* · √(p̂(1 − p̂) / n), where p̂ = the success rate.

P-Values
Null hypothesis significance testing (NHST) is a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship, based on a given observation; the logic is known as reductio ad absurdum.
Representation of NHST: H0: μ = 2300
The p-value shows how rare the data would be if it were just random chance.
A one-sided p-value compares only one side; a two-sided p-value compares both the higher and lower sides of the mean.
p < 0.05 is the standard in the community; such results can be called statistically significant. In medicine it is often 0.01. People do not agree on the alpha level.
A p-value does not tell you the probability of your hypothesis being correct; it is not the probability of the hypothesis given the data, and it does not tell us how large the difference is.
We "fail to reject" the null hypothesis rather than "accept" it, as absence of evidence should not be confused with evidence of absence.
Type I error = rejecting the null even though it's true: a false positive.
Type II error = failing to reject the null even though it's false: a false negative.
Moving the cut-off line increases the chance of one error and decreases the other.
P(keep H0 | H0 true) = 1 − α; P(reject H0 | H0 false) = 1 − β (the statistical power).
To increase statistical power: a larger effect size (the distance between the distribution means, which is out of our control), a larger sample size, or a smaller σ; all lead to less overlap.
Sufficient statistical power = 80% or more.

P-hacking
P-hacking is manipulating analyses until the p-values come out intentionally statistically significant.
Family-wise error rate = the probability of making one or more false discoveries (type I errors) when performing multiple hypothesis tests.
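A two-sided test against the H0: μ = 2300 example can be sketched with a z-test, assuming a known σ (the sample numbers here are invented):

```python
import math
from statistics import NormalDist

mu0, sigma = 2300.0, 200.0   # hypothetical null mean and known population sd
sample_mean, n = 2390.0, 25  # hypothetical observed sample

se = sigma / math.sqrt(n)
z = (sample_mean - mu0) / se                    # how many SEs from the null mean
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails, per the notes

print(round(z, 2), round(p_two_sided, 4))
```

Doubling the one-tail probability is what makes the test two-sided: data this extreme in either direction counts against the null.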
The Bonferroni correction is a multiple-comparison correction used when several dependent or independent statistical tests are performed simultaneously:
corrected α = α / n

Bayes: updating beliefs
Bayes' theorem: P(B|A) = P(A|B) · P(B) / P(A)
Bayes factor = the likelihood ratio of the marginal likelihoods of two competing hypotheses, usually a null and an alternative: P(data|H1) / P(data|H2). It is data-based.
Posterior odds = Bayes factor × prior odds
Helps to adjust hypotheses with new data. Can be quite relative, as prior beliefs differ; with more data, less weight falls on the prior odds.
Beta(α_posterior, β_posterior) = Beta(α_likelihood + α_prior, β_likelihood + β_prior)

Test Statistics
z-score = (x − μ) / σ
z-statistic for a group mean = (x̄ − μ) / (σ / √n)
A z-statistic bigger than 1 or smaller than −1 means the result is more extreme than average.
Critical value = the z-score corresponding to α in the z table.
The t-distribution can be used when there is less data; less information = thicker tails. It is unimodal; use the t-score instead of the z-score for the CI.
t-statistic = (x̄ − μ) / (s / √n)
General formula = (observed difference − H0 difference) / average variation

T-Tests
Test statistic SE = √(s1²/n1 + s2²/n2)
Variation can occur due to random selection.
Two-sample t-test = 2 samples are taken and compared.
Paired t-test = the same person is given both treatments, removing between-subject variation.

Degrees of Freedom and Effect Sizes
Degrees of freedom = the maximum number of logically independent values (values that have the freedom to vary) in the data sample; df = N − 1.
Effect size: sometimes a result can be statistically significant, but the true difference might not be practically significant.
ES = (μ1 − μ2) / σ
1. Small (0.2): such an effect between the two groups is negligible and cannot be spotted with the naked eye.
2. Medium (0.5): this level is usually identified when the researcher goes through the data; a medium effect can have a reasonable overall impact.
3.
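The two-sample t-statistic follows the general formula above: observed difference over its standard error. A minimal Python sketch (the two groups are invented data):

```python
import math
import statistics

group1 = [5.1, 4.9, 6.2, 5.8, 5.5, 5.0]
group2 = [4.2, 4.8, 4.1, 4.9, 4.4, 4.0]

m1, m2 = statistics.mean(group1), statistics.mean(group2)
v1, v2 = statistics.variance(group1), statistics.variance(group2)
n1, n2 = len(group1), len(group2)

se = math.sqrt(v1 / n1 + v2 / n2)  # SE of the difference between the means
t = (m1 - m2) / se                 # observed difference / average variation

print(round(t, 2))
```

Turning t into a p-value needs the t-distribution's CDF (e.g. scipy.stats), which the standard library does not provide, so this sketch stops at the statistic itself.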
Large (0.8 or greater): a large effect can be observed without any calculation; the impact is significant in real-world scenarios.

Chi-Square Tests
Used for categorical data, like frequency tables.
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ = observed count, Eᵢ = expected count
Degrees of freedom: for rows and columns, (r − 1)(c − 1); for a single set of categories, (number of categories) − 1.
Expected frequency should be > 5 in each cell for the test to work.
Types of chi-square test:
Goodness of fit = to see how well certain proportions fit our sample; has one categorical variable.
Test of homogeneity = looking at whether it's likely that different samples come from the same population.
Test of independence = to see whether 2 categorical variables are completely independent.
Expected frequency = (row total × column total) / n, where n = the total count.

The Replication Crisis
Replication = re-running studies to confirm results.
Reproducible analysis = the ability of other scientists to repeat the analyses you ran.
Unscrupulous researchers = researchers more concerned with attention, publishing, and splashy headlines than with good science.
There are multiple ways to analyse the same data, which can lead to different results when a study is replicated.
Many scientists do not know how to use p-values properly; published studies have a bias toward overestimating effects; smaller samples can distort results.
Replication can help to solve all of this, but it does not get enough attention and funding.

Regression
t = (observed coefficient − null coefficient) / standard error
General linear model: y = model + error
F-statistic = (SSR / d.f.SSR) / (SSE / d.f.SSE)
SST = SSE + SSR
SST = Σ(Yᵢ − Ȳ)²
Var(Y) = SST / n
SSR = Σ(Ŷᵢ − Ȳ)²
SSE = Σ(Yᵢ − Ŷᵢ)²
t² = F (for a single coefficient)
d.f.
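The chi-square statistic from the section above is a short computation. A goodness-of-fit sketch (the die-roll counts are invented for illustration):

```python
# goodness of fit: is a six-sided die fair?
observed = [8, 9, 12, 11, 6, 14]       # hypothetical counts from 60 rolls
expected = [sum(observed) / 6] * 6     # fair die: 10 expected per face

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                 # one set of categories: categories - 1

print(round(chi_sq, 2), df)
```

All expected counts here are 10 (> 5), so the test's validity condition from the notes is met; a p-value would then come from the chi-square distribution with df = 5.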
of SSE = n − 2; d.f. of SSR = 1.
Model: ŷ = mx + b; slope = rise / run.

ANOVA
ANOVA = analysis of variance; can be used for categorical predictors with many groups.
SST = Σ(xᵢ − μ)², where xᵢ = a data point and μ = the grand mean
SST = SSM + SSE
SSM = Σ(x̄_group − μ)², summed over all data points (the between-group sum of squares)
d.f. for the categorical variable (SSM) = k − 1; d.f. for error = n − k
Omnibus test = covers many items and groups; follow-up t-tests can find more specific differences.
Sum of squares between groups: SSB = Σ n_group(X̄_group − μ)²; the SSB for the interaction term captures the interaction.
η² = SS_effect / SS_total
An interaction plot is made of the means of each category; parallel lines signify no interaction. In statistics, an interaction may arise when considering the relationship among three or more variables, and describes a situation in which the effect of one causal variable on an outcome depends on the state of a second causal variable.

Other GLMs
ANCOVA = uses multiple variables to reduce error and make the model better at predicting; but running many ANCOVAs might be considered p-hacking.
Repeated-measures ANOVA = like a paired t-test: each individual has a baseline, and the differences from it go into the ANOVA table for the F-test, removing between-subject variation.

Supervised Machine Learning
Some data is used to train the model and other data is used to test it, to learn its accuracy score and improve it.
Confusion matrix; accuracy score = (TP + TN) / N
Logistic regression = a simple twist on linear regression. It gets its name from the fact that it is a regression that predicts what's called the log odds of an event occurring.
Linear discriminant analysis (LDA) = uses Bayes' theorem and a GLM to build a model that predicts a future result, comparing the distribution for one outcome against the distribution for the other.
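The ANOVA decomposition SST = SSM + SSE and the F-statistic can be sketched in a few lines (the three groups below are invented data):

```python
import statistics

groups = {
    "a": [4.0, 5.0, 6.0],
    "b": [7.0, 8.0, 9.0],
    "c": [5.0, 6.0, 7.0],
}

all_values = [v for vals in groups.values() for v in vals]
grand_mean = statistics.mean(all_values)
n, k = len(all_values), len(groups)

# between-group (model) sum of squares: group mean vs grand mean, per data point
ssm = sum(len(vals) * (statistics.mean(vals) - grand_mean) ** 2
          for vals in groups.values())
# within-group (error) sum of squares: data point vs its own group mean
sse = sum((v - statistics.mean(vals)) ** 2
          for vals in groups.values() for v in vals)

f_stat = (ssm / (k - 1)) / (sse / (n - k))  # d.f.: k - 1 and n - k
print(round(f_stat, 2))
```

A good sanity check is that SSM + SSE equals SST computed directly from the grand mean, which the decomposition guarantees.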
LDA helps with dimensionality reduction.
k-nearest neighbours = uses nearby data points to predict a point's categorical variable; it is a classifier.

Unsupervised Machine Learning
Unsupervised means there are no pre-made groups; these methods help with categorical analysis and sorting the data. Usually harder to test.
k-means = choose k random points known as centroids; assign every data point to its closest centroid to form groups; find each group's centre point and run it again until the centroids converge (lead to the same result).
Silhouette score = shows how far apart the cluster groups are from each other; the higher, the better the model.
Hierarchical clustering = arranges data into small groups, merges them with other groups, and continues until all the data is under one big group, like a fractal Venn diagram with subgroups. Can be shown in a radar graph. Has helped in classifying autism spectrum disorder.

Big data
Big data is data so big that our collection and analysis tools may fall behind.
Facebook likes can be used to predict Big Five traits and other psychometric traits; this can even be used to predict political beliefs. Cambridge Analytica has helped many politicians with targeted marketing.
Can be used to personalise medication, power Google Maps, and many more features, like City Brain.
Problems:
Bias = the data might be biased; e.g. higher black arrest rates in the data might be copied by the algorithm.
Sometimes an algorithm can confuse correlation and causation, which can lead to flawed results.
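The k-means loop described above (assign to nearest centroid, move centroids to cluster means, repeat until convergence) can be sketched in one dimension. A minimal illustration (the points and starting centroids are invented; real k-means starts from random centroids):

```python
import statistics

def k_means_1d(points, centroids, iterations=20):
    """Assign each point to its nearest centroid, then move each
    centroid to the mean of its cluster; repeat until stable."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [statistics.mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: same result again
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 9.0])
print(centroids)
```

With real multi-dimensional data the distance becomes Euclidean and the run is usually repeated from several random starts, since k-means can converge to a poor local optimum.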
Algorithmic transparency = showing what the algorithm is detecting; can be one of the solutions.
Privacy = the data might be used for reasons the person may not want, or they may not be comfortable giving their information to people; because of big data, identification has also become easier.
K-anonymity = increasing the number of similar profiles in a system to increase anonymity. But DNA companies like 23andMe can infer your data even if you have not uploaded it, through the pedigree you share with relatives, which can be concerning.
Regulation = there have been some new regulations like COPPA and GDPR which increase algorithmic transparency and decrease companies' control over your information; but big data is very new and the regulations are not polished, with many loopholes and open questions.
A hack might also put your information in the public, which can be dangerous, like the iCloud hack.

Statistics in the Courts
Makes more sense to watch the video: www.youtube.com/watch?v=HqH 6yw60m0&list=PL8dPuuaLjXtNM Y bUAhblSAdWRnmBUcr&index=41
The prosecutor's fallacy = a fallacy of statistical reasoning involving a test for an occurrence, such as a DNA match: a positive result may paradoxically be more likely to be an erroneous result than an actual occurrence, even if the test is very accurate.

Neural Networks
There are input nodes which feed an output node; between them are hidden nodes which increase or decrease the weighting of inputs via an activation function. Used for data that is too complex.
More than one layer can be called deep learning; used in image detection and much more.
In between the nodes, feature generation happens: made-up variables which help get to the output.
Rectified linear unit (ReLU) = turns negative values to zero and leaves positive values as they are.
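ReLU and one fully connected layer of hidden nodes can be sketched in plain Python (the weights and biases are invented for illustration; a real network learns them during training):

```python
def relu(x):
    """Rectified linear unit: negatives become zero, positives pass through."""
    return max(0.0, x)

def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then ReLU."""
    return [relu(sum(w * x for w, x in zip(node_weights, inputs)) + b)
            for node_weights, b in zip(weights, biases)]

# hypothetical 2-input, 2-node hidden layer
hidden = dense_layer([1.0, -2.0],
                     weights=[[0.5, 0.25], [-1.0, 0.5]],
                     biases=[0.1, 0.2])
print(hidden)
```

Stacking several such layers, with each layer's outputs becoming the next layer's inputs, is exactly the "more than one layer = deep learning" structure from the notes.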
Feed-forward neural network = movement of data is only forward.
Recurrent neural network = data made in the network can be fed back into it; helpful for tasks in which the last output needs to be remembered, like writing music or detecting words.
Convolutional neural network = uses windows of pixel weightings to find certain characteristics in an image; they use convolution (finding) and pooling (fitting the found data together). Used in Snapchat, for example.
Generative adversarial networks (GANs) = use sets of existing data to learn how to create new data. There are 2 networks: one creates data (the generator), and the other checks whether the data is accurate or similar enough to the base data (the discriminator). Both keep trying to get better. Can be used to create art.

War
Makes more sense to watch: www.youtube.com/watch? v=rRhHY7Mh5T0&list=PL8dPuuaLjXtNM Y bUAhblSAdWRnmBUcr&index=43
Estimated max = m + m/n − 1, where m = the largest observed number and n = the sample size (the German tank problem).

When Predictions Fail
Market: in 2008, lenders reduced loan requirements and sold the loans as an asset known as a mortgage-backed security. Many investors had bad statistical models, overestimated which variables were independent, and had models badly weighted towards the banks.
Earthquakes: to predict an earthquake you need a location, magnitude, and time; there is a lot of variation and very little data.
Elections: low probabilities do not equal impossible events, as seen in the 2016 election; there was also non-response bias.
Prediction is hard: you need a lot of data and an accurate model.

When Predictions Succeed
Makes more sense to watch the video: www.youtube.com/watch? v=uJFdLKkuYc4&list=PL8dPuuaLjXtNM Y bUAhblSAdWRnmBUcr&index
More complex models need more data; it helps to update our beliefs and see through our certainty.