Advanced Statistics is a scientific body of knowledge that deals with the collection, organization and presentation, and analysis and interpretation of data. The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a small sample chosen from it.

Population – the complete collection of measurements, outcomes, objects, or individuals under study.
Parameter – a number that describes a population characteristic.
Sample – a subset of a population, containing the objects or outcomes that are actually observed.
Statistic – a number that describes a sample characteristic.

Descriptive statistics – methods that focus on the collection, presentation, and characterization of a set of data in order to properly describe the various features of that set.
Inferential statistics – methods that make possible the estimation of a characteristic of a population, or the making of a decision concerning a population, based only on sample results.

Examples of Descriptive Statistics
Out of the 2.3 million documented overseas Filipino workers, about 46.3% are male (Quickstat, NSO, September 2017).
The teacher-student ratio in the elementary public schools is 1:31 (Fact Sheet, DepEd, as cited from The Philippine Star, 2018).
Among the currently married women aged 15–49 and belonging to the low-income group, around 52.7% are not using any birth control method (2006 Family Planning Survey, NSO).
Jeff Bezos is the richest man in the world with a net worth of 145.3 billion dollars (Forbes, 2019).

Examples of Inferential Statistics
With the rate at which the river system in Metro Manila is being polluted, the water supply in the metropolis will be totally depleted by the year 2025 (Greenpeace, 2007).
Wearing a seatbelt increases the chances of survival in vehicular accidents.
A new milk formulation designed to improve the psychomotor development of infants was tested on randomly selected infants. Based on the results, it was concluded that the new milk formulation is effective in improving the psychomotor development of infants.

Variable – a characteristic or attribute that can assume different values; a characteristic of interest, one that can be expressed as a number, possessed by each item under study.
Dependent variable – the variable we wish to explain.
Independent variable – the variable used to explain the dependent variable.
Confounding variables – all factors that the researcher has not accounted for but which may have influenced the social interactions under study.
Data – the values that variables can assume.

Qualitative Data – refers to the attributes or characteristics of the samples. They represent differences in character or kind but not in amount. Such data are gathered in categorical responses. Examples: gender (male or female), personality (introvert, ambivert, extrovert), socio-economic status (high, middle, low), educational attainment (elementary, high school, college graduate).

Quantitative (Numerical) Data – refers to the numerical information gathered about the sample; data that represent counts or measurements. Such data are numerical in nature, can be ordered or ranked, and are the result of counting or measuring. Examples: scores in an achievement test, number of votes received by an election candidate, monthly income of an employee, number of graduate school students in BulSU, length of time to perform a statistical problem, prices of houses, gross sales.
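To make the parameter/statistic distinction above concrete, here is a minimal Python sketch (not part of the original notes; the population values are invented for illustration). It treats a small list as the entire population, draws a random sample from it, and compares the population mean (a parameter) with the sample mean (a statistic).

```python
# A minimal sketch: a parameter describes the population,
# a statistic describes a sample drawn from it.
import random
import statistics

population = [12, 15, 9, 22, 18, 14, 20, 11, 16, 19, 13, 17, 21, 10, 15]
parameter = statistics.mean(population)   # describes the whole population

sample = random.sample(population, k=5)   # the subset actually observed
statistic = statistics.mean(sample)       # describes only the sample

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
# Inferential statistics uses the statistic to estimate the parameter.
```

Running the sketch repeatedly shows the statistic varying from sample to sample around the fixed parameter, which is exactly the gap that inferential statistics has to bridge.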
Discrete variables – assume values that can be counted; the possible values are finite or countable. Example: number of students in a class.
Continuous variables – can assume all values between any two specific values and are obtained by measuring. Examples: weight, age, height, temperature, salary.

Levels of Measurement
1. Nominal Data – can be classified into two or more distinct categories. They are either:
Real nominal – classified based on naturally occurring characteristics. (Examples: sex, nationality, race.)
Artificial nominal – classified based on man-made characteristics following certain rules. (Examples: political affiliation, religious denomination, smoking habits of people, passing or failing a test.)
2. Ordinal Data – data are grouped according to ranks or orders of categories. In this classification, one category is higher or lower than the other categories. Examples: winners in a contest, birth order, faculty rank, income categories, boxing divisions.
3. Interval Data – not only is ordering of observations possible, but the arithmetic differences between them are also meaningful. Here, every value is an actual amount and there is an equal unit of measurement separating each score. However, the zero point in this scale is arbitrary and does not reflect the absence of the characteristic. Examples: achievement scores, temperature, RBC count, time.
4. Ratio Data – data wherein the equality of ratios or proportions has meaning. In this scale, the zero point is not arbitrary, since it indicates a total absence of the attribute being measured. The concepts of algebraic operations, absolute zero, and inequalities have meaning. Examples: length, weight, area, volume, density, money, age, power.

Summary measures of a data set (reliable, mean-based measures vs. non-mean-based measures):
Center – mean (reliable); mode, median (non-mean-based)
Spread – variance (standard deviation) and range (max − min) (reliable); interquartile range (1st–3rd quartile) and mean deviation (non-mean-based)
Skew/Symmetry – skewness
Peakedness – kurtosis

Summary of when to use the mean, median and mode: [table omitted in source]

Hypothesis Testing – the process of making an inference or generalization about a population by using data gathered from a sample of the population. It is an area of statistical inference in which one evaluates a conjecture about some characteristic of the parent population based upon the information contained in a random sample.

Null hypothesis (Ho) – the hypothesis to be tested, which one hopes to reject. It shows equality or no significant difference, effect, or relationship between variables. For the mean, the null hypothesis will be stated in one of three possible forms: Ho: µ = some value, Ho: µ ≤ some value, Ho: µ ≥ some value.

Alternative hypothesis (Ha) – generally represents the idea which the researcher wants to prove. For the mean, the alternative hypothesis will be stated in only one of three possible forms: Ha: µ ≠ some value, Ha: µ > some value, Ha: µ < some value.

Types of Hypothesis Testing
1. Two-tailed test – a nondirectional test with the region of rejection lying on both tails of the normal curve. It is used when the alternative hypothesis uses words such as "not equal to," "significantly different," etc.
2. One-tailed test – a directional test with the region of rejection lying on either the left or the right tail of the normal curve.
Right directional test – the region of rejection is on the right tail. It is used when the alternative hypothesis uses comparatives such as greater than, higher than, better than, superior to, exceeds, etc.
Left directional test – the region of rejection is on the left tail. It is used when the alternative hypothesis uses comparatives such as less than, smaller than, inferior to, lower than, below, etc.

Level of Confidence
0.05 level – 95% sure that the error is only 5%. When a different set of samples is taken from the same population, the probability of getting a result similar to the present study is 95%.
0.01 level – 99% sure that the error is only 1%. An α of 0.01 (compared with 0.05) means the researcher is being relatively careful: he is only willing to risk being wrong 1 in 100 times in rejecting a null hypothesis which is true.
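The rejection regions described above can be illustrated with a short sketch, assuming scipy is available (not part of the original notes; α = 0.01 and df = 9 are chosen to match Example 1 later in these notes). A two-tailed test splits α across both tails of the t distribution, while a one-tailed test places all of α in a single tail.

```python
# A minimal sketch of how critical values follow from the significance level.
from scipy import stats

alpha, df = 0.01, 9   # significance level and degrees of freedom

# Two-tailed test: split alpha between both tails.
t_two = stats.t.ppf(1 - alpha / 2, df)
print(f"two-tailed critical values: +/-{t_two:.3f}")   # +/-3.250

# One-tailed (right directional) test: all of alpha in the right tail.
t_right = stats.t.ppf(1 - alpha, df)
print(f"right-tailed critical value: {t_right:.3f}")   # 2.821
```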
Critical Value – the dividing point between the region where the null hypothesis is rejected and the region where it is not rejected. This is also known as the tabular value.

Decision Rule – a statement of the specific conditions under which the null hypothesis is rejected and the conditions under which it is accepted.

P-value – the probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true; a way to express the likelihood that Ho is false. This process compares the p-value with the significance level. If the p-value is smaller than the significance level, Ho is rejected; if it is larger than the significance level, Ho is not rejected.

Make a Decision – this step includes the computation of the test statistic, comparing it to the critical value, and making a decision to reject or not to reject the null hypothesis.

Errors in Decision Making
Type I (α error) – rejecting a true Ho.
Type II (β error) – accepting a false Ho.

Test Statistic for Testing the Significance of the Difference Between Means (one-sample case): t = (x̄ − µ) / (s/√n), with df = n − 1.

Assumptions of the One-Sample T-test
1. The dependent variable should be measured at the interval or ratio level (i.e., continuous).
2. The data are independent (i.e., not correlated/related), which means that there is no relationship between the observations.
3. There should be no significant outliers.
4. The dependent variable should be approximately normally distributed.

Example 1
Ten randomly selected oil wells in a large field produced 21, 19, 20, 22, 24, 21, 19, 22, 22, and 20 barrels of crude oil per day. Is this enough evidence to conclude that the oil wells are not producing an average of 22.5 barrels of crude oil per day? Test at the 0.01 level of significance.

The 5-step solution of the t-test:
A. Ho: µ = 22.5 (the oil wells are producing an average of 22.5 barrels of oil a day)
   Ha: µ ≠ 22.5 (the oil wells are not producing an average of 22.5 barrels of oil a day)
B. Let α = 0.01. The df = n − 1 = 9, µ = 22.5, and tabular t = 3.250 (critical t).
C. Decision Rule: Use a two-tailed test. Reject the null hypothesis if tcomp < −3.250 or tcomp > 3.250; otherwise, accept it.
D. Computed: x̄ = 21, sx = 1.56, so t = (21 − 22.5) / (1.56/√10) ≈ −3.04.
E. Decision: Do not reject Ho (accept Ho) because −3.04 does not fall in the rejection region (−3.04 > −3.250).
F. Conclusion: The oil wells are producing an average of 22.5 barrels of oil a day. The difference could only have been brought about by chance or by sampling error.
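As a cross-check of Example 1, here is a sketch using scipy's one-sample t-test on the oil-well data (the call is standard scipy.stats; the interpretation comments are added here, not taken from the notes):

```python
from scipy import stats

# Daily output of the ten randomly selected oil wells (from Example 1).
barrels = [21, 19, 20, 22, 24, 21, 19, 22, 22, 20]

# Two-tailed one-sample t-test against the hypothesized mean of 22.5.
t_comp, p_value = stats.ttest_1samp(barrels, popmean=22.5)
print(f"t = {t_comp:.2f}, p = {p_value:.4f}")   # t = -3.03, p = 0.0141

# p > 0.01, so Ho is not rejected at the 0.01 level, matching step E above.
# (The notes' -3.04 comes from rounding s to 1.56 before computing t.)
```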
Assumptions of the T-test for Independent Samples
1. Variables involved: one independent, categorical variable that has two levels/groups, and one continuous dependent variable.
2. The groups under study are unrelated groups, also called unpaired groups or independent groups. These are groups in which the cases (e.g., participants) in each group are different.
3. The independent t-test requires that the dependent variable is approximately normally distributed within each group.
4. The independent t-test assumes the variances of the two groups being measured are equal in the population (homogeneity of variance).

Test of Differences of Two Means
Example: A teacher wanted to find out if the Team Based Instruction (TBI) method of teaching Statistics is more effective than the Individually Guided Instruction (IGI) method. Two classes of approximately equal intelligence were selected. From one class, she considered 15 students with whom she used the TBI method, and from the other class, she considered 14 students with whom she used the IGI method. After several sessions, a 30-item test was given. The scores are shown in the table below. [score table omitted in source]

The 5-step solution for the t-test:
Ho: The Team Based Instruction method of teaching Statistics is as effective as the Individually Guided Instruction method. (Ho: µTBI = µIGI)
Ha: The Team Based Instruction method of teaching Statistics is more effective than the Individually Guided Instruction method. (Ha: µTBI > µIGI)
Let α = 0.05; one-tailed; df = 15 + 14 − 2 = 27; ttab = 1.703.
Criterion: Reject Ho if tcomp ≥ ttab.
Decision: Do not reject Ho since tcomp(1.69) < ttab(1.703).
Conclusion: The Team Based Instruction method of teaching Statistics is as effective as the Individually Guided Instruction method.

Two-Sample Tests of Hypothesis: DEPENDENT SAMPLES
Now we consider situations where the samples are not independent; rather, the samples are dependent or related. How do we tell the difference between dependent and independent samples? There are two types of dependent samples. One is characterized by a measurement, followed by an intervention of some type, and then another measurement; this could be called a "before and after" study. The other is characterized by matching and pairing observations. Why do we prefer dependent samples over independent samples? By using dependent samples, we are able to reduce the variation in the sampling distribution.

T-test for DEPENDENT SAMPLES (Paired T-test)
The test statistic used is the paired t-test.

Example: The following are the weights in pounds of 15 students before and after 6 months of attending aerobics. [weight table omitted in source]

The 5-step solution is as follows:
Ho: Aerobics is not effective in reducing weight. (Ho: µB = µA)
Ha: Aerobics is effective in reducing weight. (Ha: µB > µA)
Let α = 0.05; one-tailed; df = 14; ttab = 1.761.
Criterion: Reject Ho if tcomp ≥ ttab.
Decision: Reject Ho, since tcomp(4.21) > ttab(1.761).
Conclusion: Based on the sample evidence, aerobics is effective in reducing weight.

It can be observed that the two quantities (weights before and after) are strongly positively related. The paired samples design, in this case, provides a more powerful hypothesis test than would an independent samples test carried out on the same data. The computed t-value we got from SPSS is equal to the one we got using the formula. Also, since p = 0.001/2 = 0.0005 is less than 0.05, we reject Ho in favor of Ha. Conclusion: Aerobics is effective in reducing weight.

Assumptions of the Paired T-test (T-test for Dependent or Correlated Means)
1. The dependent variable should be measured on a continuous scale.
2. The independent variable should consist of two categorical, "related groups" or "matched pairs".
3. There should be no significant outliers in the differences between the two related groups.
4. The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.
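Both two-sample designs can be run the same way in scipy. The score and weight lists below are hypothetical placeholders, since the original data tables were not reproduced in these notes; only the structure of the calls is the point. Halving the two-sided p-value for a one-tailed test mirrors the p = 0.001/2 step used above.

```python
# A sketch of the independent and paired designs (hypothetical data).
from scipy import stats

# Independent samples (e.g., TBI vs. IGI test scores, unpaired groups).
tbi = [24, 22, 27, 20, 25, 23, 26, 21, 24, 28, 22, 25, 23, 26, 24]  # 15 students
igi = [21, 23, 20, 24, 19, 22, 21, 25, 20, 23, 22, 19, 24, 21]      # 14 students
t_ind, p_ind = stats.ttest_ind(tbi, igi)   # assumes equal variances
# Halving p is valid when t lies in the direction stated by Ha.
print(f"independent: t = {t_ind:.2f}, one-tailed p = {p_ind / 2:.4f}")

# Dependent samples (e.g., weight before vs. after aerobics, same 15 students).
before = [160, 155, 172, 148, 166, 158, 170, 152, 163, 157, 168, 150, 161, 159, 165]
after  = [155, 154, 166, 145, 160, 157, 164, 150, 158, 155, 161, 148, 157, 156, 159]
t_rel, p_rel = stats.ttest_rel(before, after)   # works on the paired differences
print(f"paired: t = {t_rel:.2f}, one-tailed p = {p_rel / 2:.4f}")
```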
ANALYSIS OF VARIANCE
Assumptions of ANOVA
1. The populations follow the normal distribution.
2. The populations have equal standard deviations.
3. The populations are independent.
4. The data must be at least interval scale.
When these conditions are satisfied, F is used as the distribution of the test statistic.

CORRELATION
Correlation and regression are two related statistical tools. We use correlation to determine if a relationship exists between two variables. On the other hand, we use regression to predict the value of one variable from our knowledge of the other variable.

The Scatter Diagram
One can usually make a rough estimate of whether a relationship exists between two variables by constructing a scatter diagram. This is done by plotting the point corresponding to each observation on a rectangular coordinate system.

Correlational Tests
1. Pearson Product-Moment Correlation – measures the degree of relation between two sets of at least interval-scale data.
2. Spearman's Rank Correlation Coefficient – the measure of the correlation between two ordinal variables.
3. Phi Coefficient – determines the degree of relationship between two variables which are both nominal dichotomous, like sex (male/female) and marital status (married/unmarried).
4. Point Biserial – measures the correlation between an interval variable and a nominal dichotomous variable.

Interpretation of the Correlation Coefficient
Once the value of r is found significant, the degree of relationship between the two quantitative variables can be interpreted using a rule-of-thumb set of criteria. [criteria table omitted in source]

Regression Analysis
Regression analysis is used to:
1. Predict the value of a dependent variable based on the value of at least one independent variable.
2. Explain the impact of changes in an independent variable on the dependent variable.
Dependent variable: the variable we wish to explain. Independent variable: the variable used to explain the dependent variable.

EXAMPLE: Describing the Problem
A random sample of 14 students is selected from an elementary school, and each student is measured on a creativity score (Create) using a new testing instrument and on a task score (Task) using a standard instrument. The Task score is the mean time taken to perform several hand-eye coordination tasks. Because the creativity test is much cheaper, it is of interest to know whether you can substitute it for the more expensive Task score. That is, can you create a regression equation that will effectively predict a Task score (the dependent variable) from the Create score (the independent variable)?

Assumptions of Linear Regression
1. The two variables should be measured at the continuous level.
2. There needs to be a linear relationship between the two variables.
3. There should be no significant outliers.
4. There should be independence of observations.
5. The data need to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.
6. The residuals (errors) of the regression line are approximately normally distributed.

The figure shows a scatterplot of these two variables along with the regression line and the confidence intervals for Y given X. In the plot, we use the standard practice of plotting the independent variable (Create) on the x-axis and the dependent variable (Task) on the y-axis. By observing the scatterplot, it can be seen that there is a positive correlation between the two variables (in this case, r = .74), and it appears that knowing Create should help in predicting Task.
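Here is a sketch of how r and the fitted line could be computed with scipy.stats.linregress. The Create/Task pairs below are hypothetical stand-ins, since the 14-student data set is not reproduced in these notes; the prediction at the end uses the coefficients (1.599, 0.083) reported from the SPSS output discussed next.

```python
# A sketch of correlation and simple linear regression (hypothetical data).
from scipy import stats

create = [30, 35, 41, 45, 47, 50, 52, 55, 58, 60, 63, 66, 70, 75]  # hypothetical
task   = [4.1, 4.4, 4.9, 5.0, 5.4, 5.3, 5.9, 6.0, 6.2, 6.1, 6.8, 6.9, 7.2, 7.8]

result = stats.linregress(create, task)   # least-squares fit of Task on Create
print(f"r = {result.rvalue:.2f}, slope = {result.slope:.3f}, "
      f"intercept = {result.intercept:.3f}, p = {result.pvalue:.4f}")

# Prediction works the same way as the notes' equation,
# Predicted Task = 1.599 + 0.083 * Create (coefficients from the SPSS output):
print("Predicted Task at Create = 52:", 1.599 + 0.083 * 52)   # 5.915
```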
It is also clear, however, that knowing Create does not in any way perfectly predict Task.

Regression analysis results are shown in the Table. The "create" line gives the results of the two-sided hypothesis test that the theoretical slope of the regression line for predicting Task from Create is 0. In this case, p = 0.002 indicates that you should reject the null hypothesis and conclude that there is a statistically significant linear relationship between the two variables and, therefore, that Create should be useful in predicting Task.

The sample regression equation is created from the "Unstandardized Coefficients" in the coefficients table. Thus, the regression equation for predicting Task from Create is Predicted Task = 1.599 + 0.083 × Create, or, in words, you predict the Task value by multiplying Create by 0.083 and adding 1.599. As was shown previously, the scatterplot along with the regression line provides reasonable estimates of Task for each value of Create. For a new student who has a Create score of 52, you would predict a Task score using the following equation: Predicted Task = 1.599 + 0.083(52) = 5.915, which is visually consistent with the regression line in Figure 4.5 at X = 52.

Multiple Regression
Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target, or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory, or regressor variables).

NONPARAMETRIC TESTS
The inferential techniques discussed so far normally require the use of parametric tests. A parametric test has certain assumptions that need to be satisfied. Among these are:
1. The samples are randomly chosen from normal populations with equal variances.
2. The sample sizes are relatively large (greater than 30).
3. The samples are measured at least in the interval scale.
However, in many studies, one or more of these assumptions may not be met. Hence, there is a need to consider other types of tests, tests which are not very restrictive. These tests are called nonparametric tests. They may be used in the following conditions:
1. The samples come from populations which are of doubtful normality.
2. They are more applicable for a small sample size (n < 30), provided that the nature of the population distribution from which the sample came is not known.
3. They may be used for treating samples made up of observations from different populations.
4. They can be used for data that are only nominal or ordinal.

Overview of Nonparametric Methods
There is at least one nonparametric equivalent for each parametric general type of test. The general types of tests fall into the following categories (a sketch of one common equivalent per category follows this list):
1. Tests of differences between groups (independent samples)
2. Tests of differences between variables (dependent samples)
3. Tests of relationships between variables
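Assuming scipy, one widely used nonparametric counterpart for each of the three categories is sketched below; the two small score lists are hypothetical and serve only to show the calls.

```python
# A sketch of common nonparametric equivalents (hypothetical ordinal-style scores).
from scipy import stats

group_a = [3, 5, 4, 6, 7, 5, 4]
group_b = [6, 7, 8, 6, 9, 7, 8]

# 1. Differences between independent groups: Mann-Whitney U test
#    (counterpart of the independent-samples t-test).
print(stats.mannwhitneyu(group_a, group_b))

# 2. Differences between dependent samples: Wilcoxon signed-rank test
#    (counterpart of the paired t-test); treats the two lists as matched pairs.
print(stats.wilcoxon(group_a, group_b))

# 3. Relationships between variables: Spearman rank correlation
#    (counterpart of the Pearson correlation, suitable for ordinal data).
print(stats.spearmanr(group_a, group_b))
```

Each call returns a test statistic and a p-value that are interpreted against the significance level exactly as in the parametric tests above.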