Introduction to Statistics: Definitions and Concepts

MODULE 1: DEFINITION OF STATISTICS

INTRODUCTION TO THE STATISTICAL CONCEPTS

Statistics plays a major role in many aspects of our lives. It is used in sports, for example, to help a general manager decide which player might be the best fit for a team. It is used in politics to help candidates understand how the public feels about various policies. And statistics is used in medicine to help determine the effectiveness of new drugs. Used appropriately, statistics can enhance our understanding of the world around us. Used inappropriately, it can lend support to inaccurate beliefs. Understanding statistical methods will provide you with the ability to analyze and critique studies and the opportunity to become an informed consumer of information. Understanding statistical methods will also enable you to distinguish solid analysis from bogus “facts.”

Objectives:
After successful completion of this module, you should be able to:
• Define statistics.
• Enumerate the importance and limitations of statistics.
• Explain the process of statistics.
• Know the difference between descriptive and inferential statistics.
• Distinguish between qualitative and quantitative variables.
• Distinguish between discrete and continuous variables.
• Determine the level of measurement of a variable.

Many people say that statistics is numbers. After all, we are bombarded by numbers that supposedly represent how we feel and who we are. Certainly, statistics has a lot to do with numbers, but this definition is only partially correct. Statistics is also about where the numbers come from (that is, how they were obtained) and how closely the numbers reflect reality.

Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. In addition, statistics is about providing a measure of confidence in any conclusions.

Let’s break this definition into four parts. The first part states that statistics involves the collection of information. The second refers to the organization and summarization of information. The third states that the information is analyzed to draw conclusions or answer specific questions. The fourth part states that results should be reported using some measure that represents how convinced we are that our conclusions reflect reality.

• Statistics is important because it enables people to make decisions based on empirical evidence.
• Statistics provides us with the tools needed to convert massive data into pertinent information that can be used in decision making.
• Statistics can provide us with information that we can use to make sensible decisions.

What information is referred to in the definition? The information referred to in the definition is the data. According to the Merriam-Webster dictionary, data are “factual information used as a basis for reasoning, discussion, or calculation”. Data can be numerical, as in height, or nonnumerical, as in gender. In either case, data describe characteristics of an individual.

Fields of Statistics
A. Mathematical Statistics - The study and development of statistical theory and methods in the abstract.
B. Applied Statistics - The application of statistical methods to solve real problems involving randomly generated data, and the development of new statistical methodology motivated by real problems. Example branches of Applied Statistics: psychometrics, econometrics, and biostatistics.

Limitations of Statistics
1. Statistics is not suited to the study of qualitative phenomena.
2. Statistics does not study individuals.
3. Statistical laws are not exact.
4. Statistical tables may be misused.
5. Statistics is only one of the methods of studying a problem.

Definitions:
• Universe is the set of all entities under study.
• A population is the total or entire group of individuals or observations from which information is desired by a researcher.
Apart from persons, a population may consist of mosquitoes, villages, institutions, etc.
• An individual is a person or object that is a member of the population being studied.
• A statistic is a numerical summary of a sample.
• A sample is a subset of the population.
• Descriptive statistics consist of organizing and summarizing data. Descriptive statistics describe data through numerical summaries, tables, and graphs.
• Inferential statistics uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.
• A parameter is a numerical summary of a population.

Example: Consider the following scenario. You are walking down the street and notice that a person walking in front of you drops PHP100. Nobody seems to notice the PHP100 except you. Since you could keep the money without anyone knowing, would you keep the money or return it to the owner?

Suppose you wanted to use this scenario as a gauge of the morality of students at your school by determining the percent of students who would return the money. How might you do this? You could attempt to present the scenario to every student at the school, but this would be difficult or impossible if the student body is large. A second possibility is to present the scenario to 50 students and use the results to make a statement about all the students at the school.

In the PHP100 study presented, the population is all the students at the school. Each student is an individual. The sample is the 50 students selected to participate in the study.

Suppose 39 of the 50 students stated that they would return the money to the owner. We could present this result by saying that the percent of students in the survey who would return the money to the owner is 78%. This is an example of a descriptive statistic because it describes the results of the sample without making any general conclusions about the population. So 78% is a statistic because it is a numerical summary based on a sample. Descriptive statistics make it easier to get an overview of what the data are telling us.

If we extend the results of our sample to the population, we are performing inferential statistics. The generalization contains uncertainty because a sample cannot tell us everything about a population. Therefore, inferential statistics includes a level of confidence in the results. So rather than saying that 78% of all students would return the money, we might say that we are 95% confident that between 74% and 82% of all students would return the money. Notice how this inferential statement includes a level of confidence (measure of reliability) in our results. It also includes a range of values to account for the variability in our results. One goal of inferential statistics is to use statistics to estimate parameters.

PROCESS OF STATISTICS
1. Identify the research objective. A researcher must determine the question(s) he or she wants answered. The question(s) must clearly identify the population that is to be studied.
2. Collect the information needed to answer the questions. Conducting research on an entire population is often difficult and expensive, so we typically look at a sample. This step is vital to the statistical process, because if the data are not collected correctly, the conclusions drawn are meaningless. Do not overlook the importance of appropriate data collection.

  Example: A research objective is presented. For each research objective, identify the population and sample in the study.
  1. The Philippine Mental Health Association contacts 1,028 teenagers who are 13 to 17 years of age and live in Antipolo City and asks whether or not they have been prescribed medications for any mental disorders, such as depression or anxiety.
     Population: Teenagers 13 to 17 years of age who live in Antipolo City
     Sample: The 1,028 teenagers 13 to 17 years of age who live in Antipolo City and were contacted
  2. A farmer wanted to learn about the weight of his soybean crop. He randomly sampled 100 plants and weighed the soybeans on each plant.
     Population: Entire soybean crop
     Sample: The 100 selected soybean plants

3. Organize and summarize the information. Descriptive statistics allow the researcher to obtain an overview of the data and can help determine the type of statistical methods the researcher should use.
4. Draw conclusions from the information. In this step the information collected from the sample is generalized to the population. Inferential statistics uses methods that take results obtained from a sample, extend them to the population, and measure the reliability of the result.

Take Note! If the entire population is studied, then inferential statistics is not necessary, because descriptive statistics will provide all the information that we need regarding the population.

Example: For each of the following statements, decide whether it belongs to the field of descriptive statistics or inferential statistics.
1. A badminton player wants to know his average score for the past 10 games. (Descriptive Statistics)
2. A car manufacturer wishes to estimate the average lifetime of batteries by testing a sample of 50 batteries. (Inferential Statistics)
3. Janine wants to determine the variability of her six exam scores in Algebra. (Descriptive Statistics)
4. A shipping company wishes to estimate the number of passengers traveling via their ships next year using their data on the number of passengers in the past three years. (Inferential Statistics)
5. A politician wants to determine the total number of votes his rival obtained in the past election based on his copies of the tally sheet of electoral returns. (Descriptive Statistics)

DISTINCTION BETWEEN QUALITATIVE AND QUANTITATIVE VARIABLES
Variables are the characteristics of the individuals within the population. For example, recently my mother and I planted a tomato plant in our backyard.
We collected information about the tomatoes harvested from the plant. The individuals we studied were the tomatoes. The variable that interested us was the weight of a tomato. My mom noted that the tomatoes had different weights even though they came from the same plant. She discovered that variables such as weight may vary. If variables did not vary, they would be constants, and statistical inference would not be necessary. Think about it this way: If each tomato had the same weight, then knowing the weight of one tomato would allow us to determine the weights of all tomatoes. However, the weights of the tomatoes vary. One goal of research is to learn the causes of the variability so that we can learn to grow plants that yield the best tomatoes.

It is helpful to divide variables into different types, as different statistical methods are applicable to each. The main division is into qualitative (or categorical) and quantitative (or numerical) variables.

Variables can be classified into two groups:
1. A qualitative (categorical) variable yields categorical responses. It is a word or a code that represents a class or category.
2. A quantitative (numeric) variable takes on numerical values representing an amount or quantity.

Example: Determine whether the following variables are qualitative or quantitative.
1. Hair color (Qualitative)
2. Temperature (Quantitative)
3. Stages of breast cancer (Qualitative)
4. Number of hamburgers sold (Quantitative)
5. Number of children (Quantitative)
6. Zip code (Qualitative)
7. Place of birth (Qualitative)
8. Degree of pain (Qualitative)

DISTINCTION BETWEEN DISCRETE AND CONTINUOUS
Quantitative variables may be further classified into:
1. A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. If you count to get the value of a quantitative variable, it is discrete.
2. A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable. If you measure to get the value of a quantitative variable, it is continuous.

Example: Determine whether the following quantitative variables are discrete or continuous.
1. The number of heads obtained after flipping a coin five times. (Discrete)
2. The number of cars that arrive at a McDonald’s drive-through between 12:00 P.M. and 1:00 P.M. (Discrete)
3. The distance a 2005 Toyota Prius can travel in city conditions with a full tank of gas. (Continuous)
4. Number of words correctly spelled. (Discrete)
5. Time of a runner to finish one lap. (Continuous)

LEVELS OF MEASUREMENT
It is important to know which type of scale is represented by your data, since different statistics are appropriate for different scales of measurement. A characteristic may be measured using nominal, ordinal, interval, and ratio scales.

1. Nominal Level - These are sometimes called categorical scales or categorical data. Such a scale classifies persons or objects into two or more categories. Whatever the basis for classification, a person can only be in one category, and members of a given category have a common set of characteristics.
Example:
- Method of payment (cash, check, debit card, credit card)
- Type of school (public vs. private)
- Eye color (blue, green, brown)

2. Ordinal Level - This involves data that may be arranged in some order, but differences between data values either cannot be determined or are meaningless. An ordinal scale not only classifies subjects but also ranks them in terms of the degree to which they possess a characteristic of interest. In other words, an ordinal scale puts the subjects in order from highest to lowest, from most to least. Although ordinal scales indicate that some subjects are higher or lower than others, they do not indicate how much higher or how much better.
Example:
- Food preferences
- Stage of disease
- Socioeconomic class (upper, middle, lower)
- Severity of pain

3.
Interval Level - This measurement level not only classifies and orders the measurements, but also specifies that the distances between each interval on the scale are equivalent along the scale from low interval to high interval. A value of zero does not mean the absence of the quantity. Arithmetic operations such as addition and subtraction can be performed on values of the variable.
Example:
- Temperature on a Fahrenheit/Celsius thermometer
- Trait anxiety (e.g., high anxious vs. low anxious)
- IQ (e.g., high IQ vs. average IQ vs. low IQ)

4. Ratio Level - A ratio scale represents the highest, most precise level of measurement. It has the properties of the interval level of measurement, and the ratios of the values of the variable have meaning. A value of zero means the absence of the quantity. Arithmetic operations such as multiplication and division can be performed on the values of the variable.
Example:
- Height and weight
- Time
- Time until death

[Table: Operations that make sense for variables of different scales.]

Both interval and ratio data involve measurement. Most data analysis techniques that apply to ratio data also apply to interval data. Therefore, in most practical aspects, these types of data (interval and ratio) are grouped under metric data. In some other instances, these types of data are also known as numerical discrete and numerical continuous.

Example: Categorize each of the following as nominal, ordinal, interval, or ratio measurement.
1. Ranking of college athletic teams. (Ordinal)
2. Employee number. (Nominal)
3. Number of vehicles registered. (Ratio)
4. Brands of soft drinks. (Nominal)
5. Number of car passers along C5 on a given day. (Ratio)
6. Zip code (Nominal)
7. Degree of pain (Ordinal)

ACTIVITIES/ASSESSMENTS:
Read each item carefully. Write the answer on the yellow paper. Answers only.

I. A research objective is presented. For each, identify the (A) population and (B) sample in the study.
1.
A polling organization contacts 2,141 male university graduates who have a white-collar job and asks whether or not they had received a raise at work during the past 4 months.
A. ______________________________
B. ______________________________
2. Every year the PSA releases the Current Population Report based on a survey of 50,000 households. The goal of this report is to learn the demographic characteristics, such as income, of all households within the Philippines.
A. ______________________________
B. ______________________________
3. Researchers want to determine whether or not higher folate intake is associated with a lower risk of hypertension (high blood pressure) in women (27 to 44 years of age). To make this determination, they look at 7,373 cases of hypertension in these women and find that those who consume at least 1,000 micrograms per day of total folate had a decreased risk of hypertension compared with those who consume less than 200.
A. ______________________________
B. ______________________________

II. Indicate whether the following statements require the use of descriptive or inferential statistics.
______________ 1. A teacher wants to know the attitudes of all students towards abortion.
______________ 2. A market analyst of a sales firm draws a chart showing the sales figures of a given product for the period 2006-2007.
______________ 3. A forecaster predicts the results of an election using the number of votes cast in 15 out of 25 barangays.
______________ 4. Men are better in math than women.
______________ 5. Forty percent of the employees of an organization were recorded tardy for at least 15 working days.
______________ 6. There are very few gender-related occupations.
______________ 7. An accountant predicts the accuracy rate of a client’s financial resources.
______________ 8. A quality control manager wishes to check production output.
______________ 9. Records indicated that 75% of the faculty in the graduate school are doctoral degree holders.
______________ 10. There is no relationship between educational qualification of parents and academic achievement of their children.

III. Identify the qualitative and quantitative variables and indicate the highest level of measurement required in each. If quantitative, classify whether discrete or continuous.
______________ 1. Occupation
______________ 2. Number of government officials
______________ 3. Favorite color
______________ 4. Temperature in Celsius degrees
______________ 5. Type of school
______________ 6. Volume of mineral water sold daily
______________ 7. Employee number
______________ 8. Civil status
______________ 9. Equity accounts
______________ 10. Brands of soft drinks
______________ 11. Socioeconomic status
______________ 12. Employment status
______________ 13. Number of missing teeth
______________ 14. Number of vehicles registered
______________ 15. Jersey number
______________ 16. Number of employees collecting retirement benefits from GSIS
______________ 17. Duration of a seizure
______________ 18. Cause of death
______________ 19. Dividends
______________ 20. Current assets list
______________ 21. Number of heart attacks
______________ 22. Accounts receivable
______________ 23. Clothing size
______________ 24. Blood type
______________ 25. Ethnic group

REFERENCES:
Statistics: Informed Decisions Using Data by Michael Sullivan, III. Fifth Edition
Sampling: Design and Analysis by Sharon L. Lohr. Second Edition

MODULE 2: DATA COLLECTION AND BASIC CONCEPTS IN SAMPLING DESIGN

Objectives:
After successful completion of this module, you should be able to:
• Determine the sources of data (primary and secondary data).
• Distinguish the different methods of data collection under primary and secondary data.
• Determine the appropriate sample size.
• Differentiate various sampling techniques.
• Know the sources of errors in sampling.
DATA COLLECTION
Everybody collects, interprets, and uses information, much of it in numerical or statistical form, in day-to-day life. People receive large quantities of information every day through conversations, television, computers, radio, newspapers, posters, notices, and instructions. Because so much information is available, people need to be able to absorb, select, and reject it. In everyday life, in business, and in industry, certain statistical information is necessary, and it is important to know where to find it and how to collect it.

Analysis of data can lead to powerful results. Data can be used to offset anecdotal claims, such as the suggestion that cellular telephones cause brain cancer. Anecdotal means that the information being conveyed is based on casual observation, not scientific research. Because data are powerful, they can be dangerous when misused. The misuse of data usually occurs when data are incorrectly obtained or analyzed. For example, radio or television talk shows regularly ask poll questions for which respondents must call in or use the Internet to supply their vote. Most likely, the individuals who are going to call in are those who have a strong opinion about the topic. This group is not likely to be representative of people in general, so the results of the poll are not meaningful. Whenever we look at data, we should be mindful of where the data come from.

Even when data tell us that a relation exists, we need to investigate. For example, a study showed that breast-fed children have higher IQs than those who were not breast-fed. Does this study mean that a mother who breast-feeds her child will increase the child’s IQ? Not necessarily. It may be that some factor other than breast-feeding contributes to the IQ of the children. In this case, it turns out that mothers who breast-feed generally have higher IQs than those who do not.
Therefore, it may be genetics that leads to the higher IQ, not breast-feeding.

Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

Without proper planning for data collection, a number of problems can occur. If the data collection steps and processes are not properly planned, the research project can end up with a data set that does not serve the purpose for which it was intended. For example, if more than one person is involved in the data collection but the data collectors do not follow consistent practices, they can end up with data that have different units, collection processes, and variable names.

Consequences of Improperly Collected Data
• Inability to answer research questions accurately.
• Inability to repeat and validate the study.
• Distorted findings resulting in wasted resources.
• Misleading other researchers to pursue fruitless avenues of investigation.
• Compromising decisions for public policy.
• Causing harm to human participants and animal subjects.

Steps in Data Gathering
1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data collection points.
4. Design the data gathering forms to be used.
5. Collect data.

Choosing a Method of Data Collection
Decision-makers need information that is relevant, timely, accurate, and usable. The cost of obtaining, processing, and analyzing these data is high. The challenge is to find ways that lead to information that is cost-effective, relevant, timely, and important for immediate use. Some methods pay attention to timeliness and reduction in cost. Others pay attention to accuracy and the strength of the method in using scientific approaches.

The statistical data may be classified under two categories, depending upon the sources: Primary Data and Secondary Data.

SOURCES OF DATA
Whether conducting research in the social sciences, humanities, arts, or natural sciences, the ability to distinguish between primary and secondary sources is essential.

Primary Sources - Provide a first-hand account of an event or time period and are considered to be authoritative. They represent original thinking, report on discoveries or events, or share new information. Often these sources are created at the time the events occurred, but they can also include sources that are created later. They are usually the first formal appearance of original research.

Primary Data - Data documented by the primary source. The data collectors documented the data themselves. First-hand information obtained by the investigator is more reliable and accurate, since the investigator can extract the correct information by removing doubts, if any, in the minds of the respondents regarding certain questions. High response rates might be obtained since the answers to various questions are obtained on the spot. It permits explanation of questions concerning difficult subject matter.

Secondary Sources - Offer an analysis, interpretation, or restatement of primary sources and are considered to be persuasive. They often involve generalization, synthesis, interpretation, commentary, or evaluation in an attempt to convince the reader of the creator's argument. They often attempt to describe or explain primary sources.

Secondary Data - Data documented by a secondary source. The data collectors had the data documented by other sources. Such data are primary data for the agency that collected them and become secondary for someone else who uses them for his own purposes. Secondary data are less expensive to collect both in money and time.
These data can also be better utilized, and sometimes their quality may be better because they might have been collected by persons who were specially trained for that purpose. On the other hand, such data must be used with great care, because they may also be full of errors: the purpose for which the primary agency collected the data may have been different from the purpose of the user of these secondary data. In addition, bias may have been introduced, the size of the sample may have been inadequate, or there may have been arithmetic or definition errors. Hence, it is necessary to critically investigate the validity of the secondary data.

The primary data can be collected by the following five methods:

1. Direct personal interviews - The researcher has direct contact with the interviewee. The researcher gathers information by asking the interviewee questions.

2. Indirect/Questionnaire Method - This method of data collection involves sourcing and accessing existing data that were originally collected for the purpose of the study.

Designing good “questioning tools” forms an important and time-consuming phase in the development of most research proposals. Once the decision has been made to use these techniques, the following questions should be considered before designing our tools:
• What exactly do we want to know, according to the objectives and variables we identified earlier? Is questioning the right technique to obtain all answers, or do we need additional techniques, such as observations or analysis of records?
• Of whom will we ask questions and what techniques will we use? Do we understand the topic sufficiently to design a questionnaire, or do we need some loosely structured interviews with key informants or a focus group discussion first to orient ourselves?
• Are our informants mainly literate or illiterate? If illiterate, the use of self-administered questionnaires is not an option.
• How large is the sample that will be interviewed? Studies with many respondents often use shorter, highly structured questionnaires, whereas smaller studies allow more flexibility and may use questionnaires with a number of open-ended questions. Key Design Principles of a Good Questionnaire 1. Keep the questionnaire as short as possible. Example: - Can you describe exactly what the traditional birth attendant did when your labor started? - What do you think are the reasons for a high drop-out rate of village health committee members? A closed-ended question is a type of question that includes a list of response categories from which the respondent will select his answer. It is useful if the range of possible responses is known. This type of question is usually appropriate for collecting objective data. 2. Decide on the type of questionnaire (Open Ended or Closed Ended). Example: 3. Write the questions properly. Did you eat any of the following foods yesterday? 4. Order the questions appropriately. 5. Avoid questions that prompt or motivate the respondent to say what you would like to hear. • Fish or meat Yes No • Eggs. Yes No • Milk or cheese Yes No 6. Write an introductory letter or an introduction. Take Note! 7. Write special instructions for interviewers or respondents. Question wording and question order have a large effect on the responses obtained. 8. Translate the questions if necessary. Example: 9. Always test your questions before taking the survey. (Pre-test) Two surveys were taken in late 1993/early 1994 about Elvis Presley. An open-ended question is a type of question that does not include response categories. The respondent is not given any possible answers to choose from. This type of question is usually appropriate for collecting subjective data. It permit free responses that should be recorded in the respondent’s own words. 
One survey asked: “In the past few years, there have been a lot of rumors and stories about whether Elvis Presley is really dead. How do you feel about this? Do you think there is any possibility that these rumors are true and that Elvis Presley is still alive, or don’t you think so?” A second survey asked: “A recent television show examined various theories about Elvis Presley’s death. Do you think it is possible that Elvis is alive or not?” Of the respondents to the first question, 8% said it is possible that Elvis is still alive; of the respondents to the second question, 16% said so.

3. A focus group is a group interview of approximately six to twelve people who share similar characteristics or common interests. A facilitator guides the group based on a predetermined set of topics.

4. An experiment is a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest. Bear in mind that the experimental method has several limitations that you should be aware of:
- Ethical, moral, and legal concerns
- Unrealistic controlled environments
- Inability to control for all variables

5. Observation is a technique that involves systematically selecting, watching, and recording behaviors of people or other phenomena, and aspects of the setting in which they occur, for the purpose of gaining specified information. It includes all methods from simple visual observation to the use of high-level machines and measurements, and sophisticated equipment or facilities such as:
- Radiographic
- Biochemical
- X-ray machines
- Microscope
- Clinical examinations
- Microbiological examinations
Observation gives relatively more accurate data on behavior and activities, but it is exposed to the investigator’s or observer’s own biases, prejudices, and desires, and it needs more resources and skilled manpower when high-level machines are used.

The secondary data can be collected by the following five methods:
1. Published reports in newspapers and periodicals.
2. Financial data reported in annual reports.
3. Records maintained by the institution.
4. Internal reports of the government departments.
5. Information from official publications.

Take Note!
• Always investigate the validity and reliability of the data by examining the collection method employed by your source.
• Do not use inappropriate data for your research.
• The choice of methods of data collection is largely based on the accuracy of the information they yield.

SAMPLE SIZE
“How many participants should be chosen for a survey?”
One of the most frequent problems in statistical analysis is the determination of the appropriate sample size. One may ask why sample size is so important. The answer is that an appropriate sample size is required for validity. If the sample size is too small, it will not yield valid results, and the results from the small sample will be questionable. An appropriate sample size can produce accurate results. A sample that is too large wastes money and time, because a sample of adequate size will normally give an accurate result.

The sample size is typically denoted by n and it is always a positive integer. No exact sample size can be mentioned here, and it can vary in different research settings. However, all else being equal, a larger sample leads to increased precision in estimates of various properties of the population.

Take Note!
- Representativeness, not size, is the more important consideration.
- Use no fewer than 30 subjects if possible.
- If you use complex statistics, you may need a minimum of 100 or more in your sample (varies with method).

(Figure: Representative Sample)

The choice of sample size depends on non-statistical considerations and statistical considerations.
• Non-statistical considerations - These may include availability of resources, manpower, budget, ethics, and sampling frame.
• Statistical considerations - These include the desired precision of the estimate.

Three criteria need to be specified to determine the appropriate sample size:

1. Level of Precision
Also called sampling error, the level of precision is the range in which the true value of the population is estimated to be.

2. Confidence Level
It is a statistical measure of the number of times out of 100 that results can be expected to be within a specified range. For example, a confidence level of 90% means that results of an action will probably meet expectations 90% of the time.
Take Note: To find the right z-score to use, refer to the table:

Desired Confidence Level    80%     85%     90%     95%     99%
Z-Score                     1.28    1.44    1.65    1.96    2.58

3. Degree of Variability
Depending upon the target population and attributes under consideration, the degree of variability varies considerably. The more heterogeneous a population is, the larger the sample size required to get an optimum level of precision.

Methods in Determining the Sample Size

• Estimating the Mean or Average
The sample size required to estimate the population mean µ at a given level of confidence with specified margin of error e is given by
$$n \ge \left(\frac{Z\sigma}{e}\right)^2$$
where:
Z is the z-score corresponding to the level of confidence.
σ is the population standard deviation.
e is the level of precision.
When σ is unknown, it is common practice to conduct a preliminary survey to determine s and use it as an estimate of σ, or to use results from previous studies to obtain an estimate of σ. When using this approach, the size of the sample should be at least 30. The formula for the sample standard deviation s is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Example: A soft drink machine is regulated so that the amount of drink dispensed is approximately normally distributed with a standard deviation equal to 0.5 ounce. Determine the sample size needed if we wish to be 95% confident that our sample mean will be within 0.03 ounce of the true mean.
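As a cross-check for hand calculations with the mean formula above, here is a minimal Python sketch. The function name `sample_size_for_mean` and the dictionary `Z_TABLE` are illustrative choices, not part of the module; the z-scores are the rounded values from the table in the text, and the result is rounded up because n must be a whole number.

```python
import math

# Rounded z-scores copied from the table in the text.
Z_TABLE = {0.80: 1.28, 0.85: 1.44, 0.90: 1.65, 0.95: 1.96, 0.99: 2.58}

def sample_size_for_mean(sigma, e, confidence=0.95):
    """Return the smallest whole number n with n >= (Z * sigma / e) ** 2.

    sigma      -- population standard deviation (or an estimate s of it)
    e          -- level of precision (margin of error), same units as sigma
    confidence -- one of the confidence levels listed in Z_TABLE
    """
    z = Z_TABLE[confidence]
    return math.ceil((z * sigma / e) ** 2)

# Soft drink machine: sigma = 0.5 ounce, e = 0.03 ounce, 95% confidence.
n_soft_drink = sample_size_for_mean(sigma=0.5, e=0.03, confidence=0.95)
print(n_soft_drink)
```

The function only accepts the five confidence levels in the table; any other level would need its own z-score.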
Solution: The z-score for a confidence level of 95% in the z-table is 1.96.
$$n \ge \left(\frac{1.96(0.5)}{0.03}\right)^2 = 1067.11$$
Rounding up, we need a sample of 1,068 for our study.

• Estimating Proportion (Infinite Population)
The sample size required to obtain a confidence interval for p with specified margin of error e is given by
$$n \ge \left(\frac{Z}{e}\right)^2 p(1 - p)$$
where:
Z is the z-score corresponding to the level of confidence.
e is the level of precision.
p is the population proportion.
There is a dilemma in this formula: it depends on p, which we can estimate by $\hat{p} = x/n$ only after we have taken the sample. There are two ways to solve this dilemma:
1. We could determine a preliminary value for p based on a pilot study or an earlier study.
Example: If last month 37% of all voters thought that state taxes are too high, then it is likely that the proportion with that opinion this month will not be dramatically different, and we would use the value 0.37 for p in the formula.
2. Simply replace p in the formula by 0.5. The product p(1 − p) reaches its maximum value, 0.25, when p = 0.5. This is called the most conservative estimate, since it gives the largest possible estimate of n. The conservative formula is
$$n \ge \frac{1}{4}\left(\frac{Z}{e}\right)^2 \approx 385$$
where the confidence level is 95% (Z = 1.96) and the level of precision is 0.05.
Example: Suppose we are doing a study on the inhabitants of a large town, and want to find out how many households serve breakfast in the mornings. We don’t have much information on the subject to begin with, so we’re going to assume that half of the families serve breakfast: this gives us maximum variability. So p = 0.5. We want 99% confidence and at least 1% precision.
Solution: The z-score for a confidence level of 99% in the z-table is 2.58.
$$n \ge \left(\frac{2.58}{0.01}\right)^2 0.5(1 - 0.5) = 16{,}641$$
We need a sample of 16,641 for our study.

• Slovin’s Formula
Slovin’s formula is used to calculate the sample size n given the population size and error. It is computed as
$$n \ge \frac{N}{1 + Ne^2}$$
where:
N is the total population.
e is the level of precision.
Example: A researcher plans to conduct a survey about the food preference of BS Stat students. If the population of students is 1000, find the sample size if the error is 5%.
Solution:
$$n \ge \frac{1000}{1 + 1000(0.05)^2} = 285.71$$
The researcher needs to survey 286 BS Stat students.

• Finite Population Correction
If the population is small, then the sample size can be reduced slightly:
$$n \ge \frac{n_0}{1 + \frac{n_0 - 1}{N}}$$
where:
n_0 is Cochran’s sample size recommendation.
N is the population size.

These are links to online calculators of sample size:
https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
https://www.calculator.net/sample-size-calculator.html

BASIC SAMPLING DESIGN
The goal in sampling is to obtain individuals for a study in such a way that accurate information about the population can be obtained.

Reason for Sampling
- It is important that the individuals included in a sample represent a cross section of individuals in the population.
- If a sample is not representative, it is biased, and you cannot generalize to the population from your statistical data.

Some definitions are needed to make the notion of a good sample more precise.

Definitions:
• Observation unit - An object on which a measurement is taken. This is the basic unit of observation, sometimes called an element. In studying human populations, observation units are often individuals.
• Target population - The complete collection of observations we want to study.
• Sampled population - The collection of all possible observation units that might have been chosen in a sample; the population from which the sample was taken.
• Sample - A subset of a population.
• Sampling unit - A unit that can be selected for a sample. We may want to study individuals, but do not have a list of all individuals in the target population. Instead, households serve as the sampling units, and the observation units are the individuals living in the households.
• Sampling frame - A list, map, or other specification of sampling units in the population from which a sample may be selected. For a survey using in-person interviews, the sampling frame might be a list of all street addresses.
• Sampling technique/Sampling strategy - A plan you set forth to be sure that the sample you use in your research study represents the population from which you drew your sample.
• Sampling bias - A problem in your sampling that makes your sample unrepresentative of your population.

The following examples indicate some ways in which selection bias can occur:
- Deliberately or purposively selecting a “representative” sample.
- Misspecifying the target population.
- Failing to include all of the target population in the sampling frame, called undercoverage.
- Including population units in the sampling frame that are not in the target population, called overcoverage.
- Having multiplicity of listings in the sampling frame.
- Substituting a convenient member of a population for a designated member who is not readily available.
- Failing to obtain responses from all of the chosen sample (nonresponse).
- Allowing the sample to consist entirely of volunteers.

Advantages of Sampling Over Complete Enumeration
- Less Labor
- Reduced Cost
- Greater Speed
- Greater Scope
- Greater Efficiency and Accuracy
- Convenience
- Ethical Considerations

Two Types of Samples
1. Probability Sample
- Samples are obtained using some objective chance mechanism, thus involving randomization.
- They require the use of a complete listing of the elements of the universe, called the sampling frame.
- The probabilities of selection are known.
- They are generally referred to as random samples.
- They allow drawing of valid generalizations about the universe/population.
2. Non-probability Sample
- Samples are obtained haphazardly, selected purposively, or taken as volunteers.
- The probabilities of selection are unknown.
- They should not be used for statistical inference.

Sampling Procedure
- Identify the population.
- Determine if the population is accessible.
- Select a sampling method.
- Choose a sample that is representative of the population.
- Ask the question: can I generalize to the general population from the accessible population?

Sampling techniques can be grouped according to how the items are selected: probability sampling and non-probability sampling.

Basic Sampling Technique of Probability Sampling

• Simple Random Sampling
- Most basic method of drawing a probability sample.
- Assigns equal probabilities of selection to each possible sample.
- Results in a simple random sample.
Advantage: It is very simple and easy to use.
Disadvantage: The sample chosen may be distributed over a wide geographic area.
When to use: This is preferable if the population is not widely spread geographically. It is also more appropriate if the population is more or less homogeneous with respect to the characteristics under study.
(Figure: Simple Random Sampling)

• Systematic Random Sampling
- It is obtained by selecting every kth individual from the population.
- The first individual selected corresponds to a random number between 1 and k.

Obtaining a Systematic Random Sample
1. Decide on a method of assigning a unique serial number, from 1 to N, to each one of the elements in the population.
2. Compute the sampling interval
$$k = \frac{N}{n} = \frac{\text{Population Size}}{\text{Sample Size}}$$
3. Select a number i, from 1 to k, using a randomization mechanism. The element in the population assigned to this number is the first element of the sample. The other elements of the sample are those assigned to the numbers i + k, i + 2k, i + 3k, and so on until you get a sample of size n.
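The steps above can be sketched in Python using only the standard library. The function names `simple_random_sample` and `systematic_sample`, the `seed` parameter, and the list `students` are illustrative assumptions; the sketch also assumes that N is a multiple of n, so that the interval k = N / n is a whole number.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every possible sample is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Select every kth element after a random start between 1 and k."""
    N = len(population)
    k = N // n                   # sampling interval (assumes N is a multiple of n)
    rng = random.Random(seed)
    i = rng.randint(1, k)        # random start, 1 <= i <= k (serial numbers are 1-based)
    # Serial numbers i, i + k, i + 2k, ... map to list positions i - 1, i - 1 + k, ...
    return [population[i - 1 + j * k] for j in range(n)]

# Hypothetical frame: students carrying serial numbers 1 to 500.
students = list(range(1, 501))
srs = simple_random_sample(students, 50, seed=1)
sys_sample = systematic_sample(students, 50, seed=1)
print(sys_sample[:5])
```

Fixing `seed` only makes a run repeatable for checking; in an actual survey the random start would not be chosen in advance.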
Example: We want to select a sample of 50 students from 500 students. Under this method, every kth item is picked from the sampling frame.
Solution:
$$k = \frac{500}{50} = 10$$
We start the sample from a random number i between 1 and k and take every kth unit after it. Suppose the random number i is 6; then we select units 6, 16, 26, 36, 46, and so on.
Advantage: Drawing of the sample is easy. It is easy to administer in the field, and the sample is spread evenly over the population.
Disadvantage: May give poor precision when unsuspected periodicity is present in the population.
When to use: This is advisable to use if the ordering of the population is essentially random and when stratification with numerous data is used.
(Figure: Systematic Random Sampling)

• Stratified Random Sampling
- It is obtained by separating the population into non-overlapping groups called strata and then obtaining a simple random sample from each stratum.
- The individuals within each stratum should be homogeneous (or similar) in some way.
Example: A sample of 50 students is to be drawn from a population consisting of 500 students belonging to two institutions, A and B. The number of students in institution A is 200 and in institution B is 300. How will you draw the sample using proportional allocation?
Solution: There are two strata in this case.
Given: N1 = 200, N2 = 300, N = 500, n = 50
$$n_1 = \left(\frac{n}{N}\right) N_1 = \left(\frac{50}{500}\right) 200 = 20$$
$$n_2 = \left(\frac{n}{N}\right) N_2 = \left(\frac{50}{500}\right) 300 = 30$$
The sample sizes are 20 from A and 30 from B. The units from each institution are then selected by simple random sampling.
Advantage: Stratification of respondents is advantageous in terms of the precision of the estimates of the characteristics of the population. Sampling designs may vary by stratum to adjust for the differences in conditions across strata. It is easy to use as a random sampling design.
Disadvantage: Values of the stratification variable may not be easily available for all units in the population, especially if the characteristic of interest is homogeneous. It is possible that one or two strata have no representatives. Also, transportation costs can be high if the population covers a wide geographic area.
When to use: Use this if the characteristics of the respondents under consideration are concentrated in small and spread-out segments of the population. It is preferred if precise estimates are desired for stratified parts of the population and if sampling problems differ in the various strata of the population.
(Figure: Stratified Random Sampling)

• Cluster Sampling
- You take the sample from naturally occurring groups in your population.
- The clusters are constructed such that the sampling units are heterogeneous within the cluster and homogeneous among the clusters.

Obtaining a Cluster Sample
1. Divide the population into non-overlapping clusters.
2. Number the clusters in the population from 1 to N.
3. Select n distinct numbers from 1 to N using a randomization mechanism. The selected clusters are the clusters associated with the selected numbers.
4. The sample will consist of all the elements in the selected clusters.
Example: A researcher wants to survey the academic performance of high school students in MIMAROPA.
1. The researcher can divide the entire population into different clusters.
2. Then the researcher selects a number of clusters, depending on the research, through simple or systematic random sampling.
3. Then, from the selected clusters, the researcher can either include all the high school students as subjects or select a number of subjects from each cluster through simple or systematic random sampling.
Advantage: There is no need to come up with a list of units in the population; all that is needed is a list of the clusters. It is also less costly since the elements are physically closer together.
Disadvantage: In actual field applications, adjacent households tend to have more similar characteristics than households far apart.
When to use: This is preferable if the population can be grouped into clusters where individual population elements are known to be different with respect to the characteristics under study.
(Figure: Cluster Sampling)

• Multi-Stage Sampling
- Selection of the sample is done in two or more steps or stages, with sampling units varying in each stage.
- The population is first divided into a number of first-stage sampling units from which a sample is drawn. Smaller units, called the secondary sampling units, comprising the selected first-stage units then serve as the sampling units for the next stage. If needed, additional stages may be added until the units of observation for the survey are clearly identified. The units comprising the samples selected from the previous stage constitute the frame for the next stage.

Obtaining a Multi-Stage Sample
1. Organize the sampling process into stages where the unit of analysis is systematically grouped.
2. Select a sampling technique for each stage.
3. Systematically apply the sampling technique to each stage until the unit of analysis has been selected.
Example: Suppose we wish to study the expenditure patterns of households in NCR. We can select a sample of households for this study using simple three-stage sampling.
- First, NCR is divided into cities/municipalities, and a random sample of these cities/municipalities is collected.
- Second, a random sample of smaller areas such as barangays is taken from within each of the cities/municipalities chosen in the first stage.
- Third, a random sample of even smaller units such as households is taken from within each of the areas chosen in the second stage.
Advantage: It is easier to generate adequate sampling frames. Transportation costs are greatly reduced since there is some form of clustering among the ultimate or final samples; i.e., they lie within the sampled lower-stage units.
Disadvantage: Its complexity in theory may make it difficult to apply in the field. Estimation procedures may be difficult for non-statisticians to follow.
When to use: If no population list is available and if the population covers a wide area.
(Figure: Multi-Stage Sampling)

Take Note!
Use probability sampling if the main objective of the sample survey is making inferences about the characteristics of the population under study.

Basic Sampling Technique of Non-Probability Sampling
• Accidental Sampling - There is no system of selection; only those whom the researcher or interviewer meets by chance are included.
• Quota Sampling - A specified number of persons of certain types is included in the sample. The researcher is aware of categories within the population and draws samples from each category. The size of each categorical sample is proportional to the proportion of the population that belongs in that category.
• Convenience Sampling - It is a process of picking out people in the most convenient and fastest way to get reactions immediately. This method can be done by telephone interview to get the immediate reactions of a certain group to a certain issue.
• Purposive Sampling - It is based on certain criteria laid down by the researcher. People who satisfy the criteria are interviewed. It is used to determine the target population of those who will be taken for the study.
• Judgement Sampling - It selects the sample in accordance with an expert’s judgment.

Cases wherein Non-Probability Sampling is Useful
- Only a few are willing to be interviewed.
- There are extreme difficulties in locating or identifying subjects.
- Probability sampling is more expensive to implement.
- The population elements cannot be enumerated.

Sources of Errors in Sampling
1. Non-sampling Error
- Errors that result from the survey process.
- Any errors that cannot be attributed to the sample-to-sample variability.
Sources of Non-Sampling Error
1. Non-responses
2. Interviewer Error
3. Misrepresented Answers
4. Data Entry Errors
5. Questionnaire Design
6. Wording of Questions
7. Selection Bias
2. Sampling Error
- Error that results from taking one sample instead of examining the whole population.
- Error that results from using sampling to estimate information regarding a population.

ACTIVITIES/ASSESSMENTS:

I. Determine if the source would be a primary or a secondary source.
______________1. Government records
______________2. Dictionary
______________3. Artifact
______________4. A TV show explaining what happened in the Philippines.
______________5. Autobiography about Rodrigo Duterte.
______________6. Enrile’s diary describing what he thought about World War II.
______________7. Audio and video recordings
______________8. Speeches
______________9. Newspaper
______________10. Review articles

II. Determine the sample size for the following problems. Show your solution.
1. A dermatologist wishes to estimate the proportion of young adults who apply sunscreen regularly before going out in the sun in the summer. Find the minimum sample size required to estimate the proportion with a precision of 3% and 90% confidence.
2. The administration at a college wishes to estimate the proportion of all its entering freshmen who graduate within four years, with 95% confidence. Estimate the minimum sample size required. Assume that the population standard deviation is σ = 1.3 and the precision level is 0.05.
3. A government agency wishes to estimate the proportion of drivers aged 16–24 who have been involved in a traffic accident in the last year. It wishes to make the estimate to within 1% error and at 90% confidence. Find the minimum sample size required, using the information that several years ago the proportion was 0.12.
4. An internet service provider wishes to estimate, to within one percentage point, the current proportion of all email that is spam, with 85% confidence. Last year the proportion that was spam was 71%. Estimate the minimum sample size required if the total email that is spam is 10,000.

III. Determine the type of sampling. (ex. Simple Random Sampling, Purposive Sampling)
______________1. To determine customer opinion of its boarding policy, Southwest Airlines randomly selects 60 flights during a certain week and surveys all passengers on the flights.
______________2. A member of Congress wishes to determine her constituency’s opinion regarding estate taxes. She divides her constituency into three income classes: low-income households, middle-income households, and upper-income households. She then takes a simple random sample of households from each income class.
______________3. The presider of a guest-lecture series at a university stands outside the auditorium before a lecture begins and hands every fifth person who arrives, beginning with the third, a speaker evaluation survey to be completed and returned at the end of the program.
______________4. 24 Hour Fitness wants to administer a satisfaction survey to its current members. Using its membership roster, the club randomly selects 40 club members and asks them about their level of satisfaction with the club.
______________5. A radio station asks its listeners to call in their opinion regarding the use of U.S. forces in peacekeeping missions.
______________6. A tax auditor selects every 1000th income tax return that is received.
______________7. For a survey, a sample of municipalities was selected from every province in the country and included all child laborers in the selected municipalities.
______________8. To determine his DSL Internet connection speed, Shawn divides up the day into four parts: morning, midday, evening, and late night. He then measures his Internet connection speed at 5 randomly selected times during each part of the day.
______________9. A college official divides the student population into five classes: freshman, sophomore, junior, senior, and graduate student.
The official takes a simple random sample from each class and asks the members’ opinions regarding student services.
______________10. In the game of lotto, 6 balls are selected from a container with 42 balls.

IV. Using proportional allocation, determine the sample size needed for every school. The total population of students is 10,679, and the minimum sample is 2,450.

School                               Population per School    Sample
Antipolo National High School        3,360
Bagong Nayon National High School    2,540
Dela Paz National High School        2,122
Sta. Cruz National High School       1,290
Tubigan National High School         1,367
Total                                10,679

REFERENCES:
Sullivan, Michael, III. Statistics: Informed Decisions Using Data. Fifth Edition.
Lohr, Sharon L. Sampling: Design and Analysis. Second Edition.
http://www.economicsdiscussion.net/statistics/sampling/advantages-of-sampling-overcompleteenumeration-in-statistics/11980
http://www.natco1.org/research/files/SamplingStrategies.pdf
https://data36.com/statistical-bias-types-explained/

MODULE 3: DESCRIPTIVE STATISTICS

OBJECTIVES: After successful completion of this module, you should be able to:
✦ Distinguish the three main forms of data presentation.
✦ Know the different parts of the table.
✦ Choose appropriate diagrams/graphs to present a given set of data.
✦ Organize qualitative and quantitative data in tables.
✦ Compute measures of central tendency, measures of variation, and measures of relative position of grouped and ungrouped data.
✦ Describe the shape of a distribution.
✦ Identify regions under the normal curve corresponding to different standard normal values.
✦ Compute probabilities using the standard normal table and Excel.

Polytechnic University of the Philippines
College of Science
Department of Mathematics and Statistics

Data Presentation
Data are usually collected in a raw format, and thus the inherent information is difficult to understand.
Therefore, raw data need to be summarized, processed, and analyzed to derive useful information from them. However, no matter how well manipulated, the information derived from the raw data should be presented in an effective format; otherwise, it would be a great loss for both authors and readers. Planning how the data will be presented is essential before processing the raw data.

Presentation of Data
Presentation of data refers to exhibiting data in an attractive and useful manner such that it can be easily interpreted. The three main forms of presentation of data are:
✦ Textual Presentation
✦ Tabular Presentation
✦ Graphical Presentation

Textual Presentation
• All the data is presented in the form of text, phrases, or paragraphs.
• It involves enumerating important characteristics, emphasizing significant figures, and identifying important features of data.
• Text is the principal method for explaining findings, outlining trends, and providing contextual information.

Example: A researcher is asked to present the performance of a section in the statistics test. The following are the test scores:

34 50 37 24 49 42 18 38 29 48
20 35 38 25 46 50 43 39 26 45
17 50 39 28 45  9 23 38 27 46
34 23 38 44 45 43 35 39 44 46

The data presented in textual form would be like this:
In the statistics class of 40 students, 3 obtained the perfect score of 50. Sixteen students got a score of 40 and above, while only 3 got 19 and below. Generally, the students performed well in the test, with 23 or 57.5% getting a passing score of 38 and above.
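The counts quoted in a textual presentation can be checked directly against the raw scores. The short Python sketch below tallies the 40 test scores from the example; the variable names are illustrative, and the perfect score of 50 and the passing score of 38 are taken from the text.

```python
# The 40 test scores listed in the example.
scores = [34, 50, 37, 24, 49, 42, 18, 38, 29, 48,
          20, 35, 38, 25, 46, 50, 43, 39, 26, 45,
          17, 50, 39, 28, 45, 9, 23, 38, 27, 46,
          34, 23, 38, 44, 45, 43, 35, 39, 44, 46]

n = len(scores)                                    # number of students
perfect = sum(1 for s in scores if s == 50)        # perfect score of 50
forty_up = sum(1 for s in scores if s >= 40)       # 40 and above
nineteen_down = sum(1 for s in scores if s <= 19)  # 19 and below
passing = sum(1 for s in scores if s >= 38)        # passing score of 38 and above
passing_pct = 100 * passing / n                    # share of students who passed

print(n, perfect, forty_up, nineteen_down, passing, passing_pct)
```

Tallying from the raw data in this way guards against arithmetic slips when the figures are written into a paragraph.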
Advantages of Textual Presentation
✦ The data can be interpreted more fully.
✦ It can help in emphasizing some important points in the data.
✦ Small sets of data can be easily presented.

Remember!
✦ Keep your paragraphs simple and short.
✦ Always make sure that the readers are provided with additional explanations about the relevance of the figures and their implications.

Tabular Presentation
• It is a systematic and logical arrangement of data in the form of rows and columns with respect to the characteristics of data.
• A table is best suited for representing individual information and represents both quantitative and qualitative information.

Advantages of Tabular Presentation
✦ More information may be presented.
✦ Exact values can be read from a table to retain precision.
✦ Flexibility is maintained without distortion of data.
✦ Less work and less cost are required in the preparation.

Preparing Tables
The making of a compact table is itself an art. A table should contain all the information needed within the smallest possible space. The purpose of the tabulation and how the tabulated information is to be used are the main points to keep in mind while preparing a statistical table. An ideal table should consist of the following main parts:

A. Title: The title must tell as simply as possible what is in the table. It should answer the questions:
✦ Who? Example: White females with breast cancer, black males with lung cancer.
✦ What are the data? Example: Counts, percentage distributions, rates.
✦ Where are the data from? Example: One hospital, or the entire population covered by your registry.
✦ When? Example: A particular year, time period.

B. Boxhead: The boxhead contains the captions or column headings. The heading of each column should contain as few words as possible, yet explain exactly what the data in the column represent.

C. Stubs: The row captions are known as the stub. Items in the stub should be grouped to facilitate interpretation of the data. For example, rows may stand for score classes and columns for data related to the sex of students. In that case, there will be many rows for score classes but only two columns, for male and female students.

D. Footnotes: Footnotes are given at the foot of the table to explain any fact or information included in the table that needs some explanation. Thus, they are meant for explaining or providing further details about the data that have not been covered in the title, captions, and stubs.

E. Sources of Data: We should also mention the source of information from which the data are taken. This may preferably include the name of the author, volume, page, and the year of publication. It should also state whether the data contained in the table are of a primary or secondary nature.
Parts of the Table
(Figure: Parts of the Table. Source: https://byjus.com/commerce/tabular-presentation-of-data/)

Construction of Data Tables
✦ The title should be in accordance with the objective of the study
✦ Comparison
✦ Alternative location of stubs
✦ Headings
✦ Footnote
✦ Size of columns
✦ Use of abbreviations
✦ Units

Example: Simple or One-Way Table
Optionally, the table may also include totals or percentages.

Example: Compound Table
A compound table is just an extension of a simple table in which more than one variable is distributed among its attributes (sub-variables). An attribute is just a quality, property, or component of a variable according to which it can be differentiated with respect to other variables. We may refer to a compound table as a cross tabulation or even as a contingency table, depending on the context in which it is used.

Organize Quantitative Variable in Table
Classes are categories into which data are grouped. When a data set consists of a large number of different discrete data values, or when a data set consists of continuous data, we create classes by using intervals of numbers. Make sure that the classes do not overlap. This is necessary to avoid confusion as to which class a data value belongs. Also, make sure that the class widths are equal for all classes.

Age        Number (in thousands)
25 - 34    14,482
35 - 44    14,156
45 - 54    13,801
55 - 64    12,123
65 - 74    7,010

In the class 25 - 34, 25 is the lower class limit (LC) and 34 is the upper class limit (UC). The class width is the difference between consecutive lower class limits; in this table it is 35 − 25 = 10.

One exception to the requirement of equal class widths occurs in open-ended tables. A table is open ended if the first class has no lower class limit or the last class has no upper class limit.

Scores         Frequency
10 - 19        25
20 - 29        36
30 - 39        40
40 and over    12

Guidelines for Determining the Lower Class Limit of the First Class and Class Width

Choosing the Lower Class Limit of the First Class:
Choose the smallest observation in the data set or a convenient number slightly lower than the smallest observation in the data set. For example, if the smallest observation is 10.2, a convenient lower class limit of the first class is 10.

Determining the Class Width:
• Decide on the number of classes. Generally, there should be between 5 and 20 classes. The smaller the data set, the fewer classes you should have.
• Determine the class width by computing
$$cw = \frac{x_{max} - x_{min}}{nc}$$
where cw is the class width and nc is the number of classes. Round this value up to a convenient number.

Remember!
Creating the classes for summarizing continuous data is an art form. There is no such thing as the correct frequency distribution. However, there can be less desirable frequency distributions. The larger the class width, the fewer classes a frequency distribution will have.

How to Construct Frequency Distribution Table?
A frequency distribution lists each category of data and the number of occurrences for each category of data.

Example: Use the “Sample Data file”.
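The class-width guideline and the idea of a frequency distribution can also be sketched outside Excel. The following is a minimal Python illustration under stated assumptions: the data list is hypothetical (it is not the module's “Sample Data file”), the function names are illustrative, the data are whole numbers, and the "convenient" lower limit and width are chosen by hand as the guidelines suggest.

```python
import math

def class_width(data, num_classes):
    """(max - min) / number of classes, rounded up to the next whole number."""
    return math.ceil((max(data) - min(data)) / num_classes)

def frequency_table(data, lower_limit, width, num_classes):
    """Count whole-number observations in equal-width, non-overlapping classes."""
    table = []
    for i in range(num_classes):
        lo = lower_limit + i * width   # lower class limit
        hi = lo + width - 1            # upper class limit (whole-number data)
        count = sum(1 for x in data if lo <= x <= hi)
        table.append((lo, hi, count))
    return table

# Hypothetical scores, used only for illustration.
data = [12, 15, 21, 22, 24, 30, 31, 35, 38, 41, 44, 47]

suggested = class_width(data, 4)   # (47 - 12) / 4 = 8.75, rounded up
# A convenient choice near the suggestion: first lower limit 10, width 10.
table = frequency_table(data, lower_limit=10, width=10, num_classes=4)
for lo, hi, count in table:
    print(f"{lo} - {hi}: {count}")
```

The suggested width is only a starting point; choosing 10 instead of 9 here follows the advice to round up to a convenient number.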
Solution: To answer this question, we need to construct a frequency distribution to determine how many female and male respondents participated in the study.

Procedure in Constructing Frequency Table
✦ If the data is in the form of qualitative data
To construct the frequency distribution using Excel, use the command:
=frequency(data_array,bins_array)
Then press Ctrl + Shift + Enter so that it is entered as an array formula:
{=frequency(data_array,bins_array)}

Final Output
Table 1 shows the frequency and percentage distribution of the respondents in terms of sex. It can be gleaned from the table that, out of the 128 respondents considered in the study, 65 or 50.8% are male and 63 or 49.2% are female.

Example: Use the “Sample Data file”.

Procedure in Constructing Frequency Table
✦ If the data is in the form of quantitative data
Steps
1. Set an interval or range for your data. It is needed for the “BIN RANGE”.
2. Click “DATA” on the menu bar, then click “DATA ANALYSIS” on the toolbar.
3. The “DATA ANALYSIS” dialog box will appear. Choose “HISTOGRAM”, then click OK.
4. Highlight your data for the “INPUT RANGE”.
5. Highlight the intervals you set in step 1 for the “BIN RANGE”.
6. Check the “LABELS IN FIRST ROW” box, then click “OK”.
7. The result will appear on a new worksheet of the Excel file. Compute the percentages and the total.

Final Output

Example: Identify problems with the following table.
Answer:
✦ Useless Information: Don’t show decimals if they are not needed.
✦ Poor Alignment: Make sure the alignment makes sense.
  • Don’t center numbers; always right justify, and try to align decimal points.
  • Consider the appropriate placement of row titles.
✦ Difficult to Read: Use commas when a number exceeds a thousand.

Graphical Presentation
✦ A graph is a very effective visual tool: it displays data at a glance, facilitates comparison, and can reveal trends and relationships within the data, such as changes over time, correlation, or relative share of a whole.
✦ It is considered an important medium of communication because it creates a pictorial representation of numerical figures.
✦ It is suited to showing the results of a study to nonprofessionals or to people who dislike numbers and lengthy text.
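The frequency and percentage table produced in Excel earlier can be cross-checked with a short script. The sketch below is only an illustration: the `sex` list is a hypothetical stand-in for the sex column of the sample data file, built to match the counts reported in Table 1, and `frequency_table` is a helper name introduced here.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, frequency, percentage) rows, followed by a total row."""
    counts = Counter(values)
    n = sum(counts.values())
    rows = [(cat, f, round(100 * f / n, 1)) for cat, f in counts.most_common()]
    rows.append(("Total", n, 100.0))
    return rows

# Hypothetical stand-in for the "sex" column of the sample data file:
# 65 male and 63 female respondents, as reported in Table 1.
sex = ["Male"] * 65 + ["Female"] * 63

for category, freq, pct in frequency_table(sex):
    print(f"{category:<8}{freq:>5}{pct:>8.1f}%")
```

The same counts are what a bar graph or pie chart of the variable would display.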
Bar Graph
✦ It is constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis. Rectangles of equal width are drawn for each category. The height of each rectangle represents the category’s frequency or relative frequency.
✦ It is used to organize discrete data.

Example: Simple Bar Graph
The simple bar chart is used for the case of one variable only.

Example: Multiple Bar Graph / Grouped Column Chart
The multiple bar chart is an extension of a simple bar chart for cases where quantities of several variables are to be displayed. The bars representing the quantities for the different variables are placed next to one another for each attribute. The figure becomes very cumbersome when there are too many variables and components.

Example: Component Bar Graph / Subdivided Column Chart
In this type of bar chart, the components (quantities) of each variable are piled on top of one another. It saves space compared to a multiple bar chart. One disadvantage of this graph is that it is not always easy to compare the sizes of the components, or parts. It is used to represent data in which the total magnitude is divided into different parts or components.

Remember!
• Bar graphs may also be drawn with horizontal bars. Horizontal bars are preferable when category names are lengthy.
• In bar graphs, the order of the categories does not usually matter.
However, bar graphs that have categories arranged in decreasing order of frequency help prioritize categories for decision-making purposes in areas such as quality control, human resources, and marketing.

Histogram
✦ It is constructed by drawing rectangles for each class of data. The height of each rectangle is the frequency or relative frequency of the class. The width of each rectangle is the same, and the rectangles touch each other.
✦ It is a graph used to present quantitative data and is similar to the bar graph.
✦ It is used to organize continuous data.
(Figure source: https://newonlinecourses.science.psu.edu/stat500/lesson/1/1.6/1.6.2)

Pie Chart
✦ It is a circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.
✦ Pie charts are typically used to present the relative frequency of qualitative data. In most cases the data are nominal, but ordinal data can also be displayed in a pie chart.

When should a bar graph or a pie chart be used?
✦ Pie charts are useful for showing the division of all possible values of a qualitative variable into its parts.
✦ Bar graphs are useful when we want to compare the different parts, not necessarily the parts to the whole.

Line Graph
✦ A graph that shows information that is connected in some way (such as change over time). Line segments are drawn connecting the points.
✦ It is used to organize continuous data.
✦ It is very useful in identifying trends in the data over time.
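Because the area of a pie-chart sector is proportional to the category's frequency, each sector's central angle is 360 × f / n. The following is a minimal sketch of that arithmetic, assuming the counts from the sex distribution in Table 1; `sector_angles` is a name introduced here for illustration.

```python
def sector_angles(frequencies):
    """Map each category to its central angle in degrees (360 * f / n)."""
    n = sum(frequencies.values())
    return {category: 360 * f / n for category, f in frequencies.items()}

# Counts taken from the sex distribution reported in Table 1.
respondents = {"Male": 65, "Female": 63}
angles = sector_angles(respondents)

for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

The angles always add up to 360 degrees, which is a quick check that no category was left out.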
Example: Simple Line Graph
The simplest of line graphs is the single line graph, so called because it displays information concerning one variable only, in terms of its frequencies.

Example: Multiple Line Graph
Multiple line graphs illustrate information on several variables so that comparison between them is possible.

Guidelines for Constructing Good Graphics
✦ Title and label the graphic axes clearly, providing explanations if needed. Include units of measurement and a data source when appropriate.
✦ Avoid distortion.
✦ Minimize the amount of white space in the graph. Use the available space to let the data stand out. If you truncate the scales, clearly indicate this to the reader.
✦ Avoid clutter, such as excessive gridlines and unnecessary backgrounds or pictures.
✦ Don’t distract the reader.
✦ Avoid three dimensions.
✦ Do not use more than one design in the same graphic. Let the data speak for themselves.

Grouped and Ungrouped Data
Data is often described as ungrouped or grouped. Grouped data is data that has been classified into groups after collection. Ungrouped data, also known as raw data, is data that has not been placed in any group or category after collection.
Ungrouped data without a frequency distribution
1, 5, 4, 7, 2, 4, 1, 3, 8, 2, 2, 9

Grouped data

| Scores | Frequency |
| --- | --- |
| 1 - 10 | 5 |
| 11 - 20 | 9 |
| 21 - 30 | 10 |
| 31 - 40 | 12 |
| 41 - 50 | 24 |
| Total | 60 |

Ungrouped data with a frequency distribution

| No. of Television Sets | Frequency |
| --- | --- |
| 0 | 7 |
| 1 | 15 |
| 2 | 12 |
| 3 | 4 |
| 4 | 5 |
| 5 | 2 |
| Total | 45 |

Measures of Central Tendency: MEAN
• It is the sum of the data values divided by the number of data values.
• It is also called the average.
• It is appropriate only for data measured on the interval and ratio scales.

Advantage of Mean
✦ Simple to understand and easy to calculate.
✦ It is rigidly defined.
✦ It is least affected by fluctuations of sampling.
✦ It takes into account all the values in the series.

Formula for Mean:

Sample Mean
✦ For Ungrouped Data
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where: $x_i$ = data values, n = no. of sample observations
✦ For Grouped Data
$$\bar{x} = \frac{\sum_{i=1}^{r} f x_i}{n}$$
where: $x_i$ = data values, f = frequency, n = no. of sample observations

Population Mean
✦ For Ungrouped Data
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where: $x_i$ = data values, N = no. of observations
✦ For Grouped Data
$$\mu = \frac{\sum_{i=1}^{r} f x_i}{N}$$
where: $x_i$ = data values, f = frequency, N = no. of observations

Measures of Central Tendency: MEDIAN
• It is the “middle observation” when the data set is sorted (in either increasing or decreasing order).
• The median divides the distribution into two equal parts.

Advantage of Median
✦ The median is not affected by the size of extreme values but by the number of observations.
✦ The median can be calculated even when the frequency distribution contains “open-ended” intervals.
✦ It can also be used to define the middle of a number of objects, properties, or quantities which are not really quantitative in nature.
✦ It can be easily interpreted.
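As a quick illustration of the mean formula with frequencies and of the "middle observation" idea, the television-set frequency distribution above can be worked through in a few lines. This is a sketch only; the variable names are introduced here and are not part of the module.

```python
# Frequency distribution from the table above: number of television sets.
tv_sets = [0, 1, 2, 3, 4, 5]
frequency = [7, 15, 12, 4, 5, 2]

n = sum(frequency)  # total number of observations

# Mean: sum of f * x divided by n.
mean = sum(f * x for x, f in zip(tv_sets, frequency)) / n

# Median: expand the table back into sorted raw data and take the middle value.
raw = [x for x, f in zip(tv_sets, frequency) for _ in range(f)]
middle = n // 2
median = raw[middle] if n % 2 == 1 else (raw[middle - 1] + raw[middle]) / 2

print(n, mean, median)
```

With 45 observations the median is the 23rd value in sorted order, which falls among the households that own 2 television sets.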
Formula for Median:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest (or highest to lowest).
2. For an odd number of data values, the median is the “middle observation”. When the number of data values is even, the median is the “average of the two middle scores”.
✦ For Grouped Data
$$\tilde{x} = LB + \left(\frac{\frac{n}{2} - {<}cf}{f}\right) i$$
where:
LB = lower boundary of the median class
i = class width
n = no. of observations
< cf = less than cumulative frequency of the class preceding the median class
f = frequency of the median class

Measures of Central Tendency: MODE
• It is the most frequently occurring value in a list of data.
• It is sometimes called the nominal average.
• It is an appropriate measure of average for data using the nominal scale of measurement.
• It is the only measure of central tendency used in both quantitative and qualitative data.

Advantage of Mode
✦ The mode is easy to understand.
✦ Like the median, it is not greatly affected by extreme values.
✦ Like the median, it can be computed even when the frequency distribution contains “open-ended” intervals.

Formula for Mode:
✦ For Ungrouped Data
1. Obtain a frequency distribution of the distinct values of the data.
2. The mode is the most frequently occurring data value (if there is one).
✦ For Grouped Data
$$\hat{x} = LB + \left(\frac{d_1}{d_1 + d_2}\right) i$$
where:
LB = lower boundary of the modal class
i = class width
$d_1$ = difference between the frequency of the modal class and the class preceding it
$d_2$ = difference between the frequency of the modal class and the class following it

Remember!
• Whenever you hear the word average, be aware that the word may not always refer to the mean. One average could be used to support one position, while another average could be used to support a different position.
• Unlike the mean and median, a mode is not always present in a data set.
• If you are interested in the “center of gravity” of your data, use the mean; if you are interested in the “middle value” within your data, use the median.

Choosing a Measure of Central Tendency:
We have discussed three measures of central tendency (the mode, the mean, and the median) and examined how they differ in finding the center of a data distribution. The next legitimate question to ask may be “When do we use which measure?” Consider the following data sets:

| Data Set I | Data Set II |
| --- | --- |
| 108 | 108 |
| 112 | 112 |
| 116 | 116 |
| 120 | 120 |
| 124 | 205 |

Determine the mean, median and mode.

In both data sets, the median is 116, as it is the number that divides the data set into two exact halves. However, the mean is not identical in both data sets. For the first data set, the mean is equal to 116, whereas the mean of the second data set is equal to 661/5 = 132.2. Notice how the mean of the second data set has been influenced by the presence of an unusual case (outlier). If we were to say that the mean of 132.2 represents a typical case in the second data set, this would not make much sense, because the majority of the data values are less than 120. Therefore, the mean should not be used when unusual, or outlying, data values are present in the data set, as the mean tends to be extremely sensitive to unusual values. Rather, the median should be reported in this case.
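The comparison of Data Set I and Data Set II can be recomputed with Python's standard `statistics` module. This is a small sketch for checking the arithmetic, not part of the original module; the variable names are chosen here.

```python
from statistics import mean, median, multimode

data_set_1 = [108, 112, 116, 120, 124]
data_set_2 = [108, 112, 116, 120, 205]  # same values except for one outlier

for name, data in (("Data Set I", data_set_1), ("Data Set II", data_set_2)):
    # multimode returns every value when nothing repeats,
    # which signals that the data set has no meaningful mode.
    print(name, "mean:", mean(data), "median:", median(data),
          "most frequent:", multimode(data))
```

Only the mean moves when 124 is replaced by 205; the median stays at the middle value in both sets.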
• The mode is simply the most frequently occurring data value in the data set. Therefore, it is mainly useful for the nominal level of measurement. Both the median and the mean are useful when the variable being measured can be quantified. Neither data set has a mode (no value repeats), so the mode is not an appropriate measure to use for these data sets.
• It is better to use the median than the mean when the sample is small or asymmetrical (i.e., skewed) and unusual cases (outliers) are present in the data set. This is why the typical housing price is usually reported as a median: even one exceptionally expensive house can distort the average housing price when most of the houses are in the Php500,000–Php650,000 range.

Example: The data given below is the age of the residents in Barangay 634, Sta. Mesa, Manila. Compute mean, median and mode.

| Class Interval | Frequency |
| --- | --- |
| 55 - 59 | 3 |
| 50 - 54 | 6 |
| 45 - 49 | 7 |
| 40 - 44 | 9 |
| 35 - 39 | 6 |
| 30 - 34 | 4 |
| 25 - 29 | 5 |

Solution: To compute the mean of grouped data, first you need to fill out this table.

| Class Interval | Frequency (f) | x | fx |
| --- | --- | --- | --- |
| 55 - 59 | 3 | | |
| 50 - 54 | 6 | | |
| 45 - 49 | 7 | | |
| 40 - 44 | 9 | | |
| 35 - 39 | 6 | | |
| 30 - 34 | 4 | | |
| 25 - 29 | 5 | | |
| Total | n = | | $\sum_{i=1}^{7} f x_i =$ |

Here x is the midpoint of every class interval.
To compute this:
$$x = \frac{LC + UC}{2}$$
Ex:
$$x = \frac{55 + 59}{2} = 57 \qquad x = \frac{50 + 54}{2} = 52$$

Solution:

| Class Interval | Frequency (f) | x | fx |
| --- | --- | --- | --- |
| 55 - 59 | 3 | 57 | 171 |
| 50 - 54 | 6 | 52 | 312 |
| 45 - 49 | 7 | 47 | 329 |
| 40 - 44 | 9 | 42 | 378 |
| 35 - 39 | 6 | 37 | 222 |
| 30 - 34 | 4 | 32 | 128 |
| 25 - 29 | 5 | 27 | 135 |
| Total | n = 40 | | $\sum_{i=1}^{7} f x_i = 1{,}675$ |

$$\bar{x} = \frac{\sum_{i=1}^{7} f x_i}{n} = \frac{1{,}675}{40} = 41.88$$
The average age is 41.88.

Solution: To compute the median and mode of grouped data, first you need to fill out a table with the columns Class Interval, f, LB, and < cf.

To compute the lower boundary (LB), always subtract 0.5 from the lower class limit (LC).
Ex: 55 − 0.5 = 54.5, 50 − 0.5 = 49.5, 45 − 0.5 = 44.5

To compute the < cf when the class intervals are arranged in descending order, always start at the bottom. Copy the frequency of the lowest class interval, then keep adding the next frequency:
5; 5 + 4 = 9; 9 + 6 = 15; 15 + 9 = 24; 24 + 7 = 31; 31 + 6 = 37; 37 + 3 = 40

| Class Interval | f | LB | < cf |
| --- | --- | --- | --- |
| 55 - 59 | 3 | 54.5 | 40 |
| 50 - 54 | 6 | 49.5 | 37 |
| 45 - 49 | 7 | 44.5 | 31 |
| 40 - 44 | 9 | 39.5 | 24 |
| 35 - 39 | 6 | 34.5 | 15 |
| 30 - 34 | 4 | 29.5 | 9 |
| 25 - 29 | 5 | 24.5 | 5 |
| Total | n = 40 | | |

$$\tilde{x} = LB + \left(\frac{\frac{n}{2} - {<}cf}{f}\right) i$$
First, compute n/2; it will help us determine the median class and the < cf.
$$\frac{n}{2} = \frac{40}{2} = 20$$
The median class is the class containing the 20th item. Hence, the median class is 40 - 44.
$$\tilde{x} = 39.5 + \left(\frac{20 - 15}{9}\right) 5 = 42.28$$

Solution:

| Class Interval | f | LB | < cf |
| --- | --- | --- | --- |
| 55 - 59 | 3 | 54.5 | 40 |
| 50 - 54 | 6 | 49.5 | 37 |
| 45 - 49 | 7 | 44.5 | 31 |
| 40 - 44 | 9 | 39.5 | 24 |
| 35 - 39 | 6 | 34.5 | 15 |
| 30 - 34 | 4 | 29.5 | 9 |
| 25 - 29 | 5 | 24.5 | 5 |

$$\hat{x} = LB + \left(\frac{d_1}{d_1 + d_2}\right) i$$
The modal class is the class interval with the highest frequency. The modal class is 40 - 44. If two class intervals share the highest frequency, always choose the higher class interval.
$$d_1 = 9 - 6 = 3 \qquad d_2 = 9 - 7 = 2$$
$$\hat{x} = 39.5 + \left(\frac{3}{3 + 2}\right) 5 = 42.5$$

Measures of Relative Position
Quantiles are statistics that describe various subdivisions of a frequency distribution into equal proportions.
Three special quantiles:
1. Quartiles
2. Deciles
3. Percentiles

Quartiles - split the ordered data into four equal parts.
Deciles - split the ordered data into ten equal parts.
Percentiles - split the ordered data into 100 equal parts.

Formula for Quartile:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest. Then use this formula.
$$Q_{class} = \frac{nk}{4} + 0.5$$
2. If the resulting positioning point is an integer, the particular numerical observation corresponding to that point is chosen for the quartile. If not, use interpolation.
✦ For Grouped Data
$$Q_k = LB + \left(\frac{\frac{nk}{4} - {<}cf}{f}\right) i$$
where:
LB = lower boundary of the quartile class
i = class width
n = no. of observations
k = quartile position
< cf = less than cumulative frequency of the class preceding the quartile class
f = frequency of the quartile class

Formula for Decile:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest. Then use this formula.
$$D_{class} = \frac{nk}{10} + 0.5$$
2. If the resulting positioning point is an integer, the particular numerical observation corresponding to that point is chosen for the decile. If not, use interpolation.
✦ For Grouped Data
$$D_k = LB + \left(\frac{\frac{nk}{10} - {<}cf}{f}\right) i$$
where:
LB = lower boundary of the decile class
i = class width
n = no. of observations
k = decile position
< cf = less than cumulative frequency of the class preceding the decile class
f = frequency of the decile class

Formula for Percentile:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest. Then use this formula.
$$P_{class} = \frac{nk}{100} + 0.5$$
2. If the resulting positioning point is an integer, the particular numerical observation corresponding to that point is chosen for the percentile. If not, use interpolation.
✦ For Grouped Data
$$P_k = LB + \left(\frac{\frac{nk}{100} - {<}cf}{f}\right) i$$
where:
LB = lower boundary of the percentile class
i = class width
n = no. of observations
k = percentile position
< cf = less than cumulative frequency of the class preceding the percentile class
f = frequency of the percentile class

Example 1: The data given below is the total number of hours lost due to tardiness and absences of employees in a company in a given year. Find Q3, D4 and P55.

| Month | Hours Lost (x) |
| --- | --- |
| January | 55 |
| February | 23 |
| March | 37 |
| April | 37 |
| May | 48 |
| June | 42 |
| July | 27 |
| August | 20 |
| September | 30 |
| October | 32 |
| November | 24 |
| December | 40 |

Solution: To compute Q3 of ungrouped data:
1. Arrange the data from lowest to highest.

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Value | 20 | 23 | 24 | 27 | 30 | 32 | 37 | 37 | 40 | 42 | 48 | 55 |

$$Q_{class} = \frac{(12)(3)}{4} + 0.5 = 9.5$$
2. Use interpolation since the computed Qclass is not an integer.
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Value | 20 | 23 | 24 | 27 | 30 | 32 | 37 | 37 | 40 | 42 | 48 | 55 |

Position 9.5 lies halfway between the 9th value (40) and the 10th value (42):
$$Q_3 = 40 + 0.5(42 - 40) = 41$$

Solution: To compute D4 of ungrouped data:
1. Arrange the data from lowest to highest (see the ordered values and positions above).
$$D_{class} = \frac{(12)(4)}{10} + 0.5 = 5.3$$
2. Use interpolation since the computed Dclass is not an integer. Position 5.3 lies between the 5th value (30) and the 6th value (32):
$$D_4 = 30 + 0.3(32 - 30) = 30.6$$

Solution: To compute P55 of ungrouped data:
1. Arrange the data from lowest to highest (see the ordered values and positions above).
$$P_{class} = \frac{(12)(55)}{100} + 0.5 = 7.1$$
2. Use interpolation since the computed Pclass is not an integer. Position 7.1 lies between the 7th value (37) and the 8th value (37):
$$P_{55} = 37 + 0.1(37 - 37) = 37$$

Example 2: The data given below is the age of the residents in Barangay 634, Sta. Mesa, Manila. Compute Q1, D7, and P10.

| Class Interval | Frequency |
| --- | --- |
| 55 - 59 | 3 |
| 50 - 54 | 6 |
| 45 - 49 | 7 |
| 40 - 44 | 9 |
| 35 - 39 | 6 |
| 30 - 34 | 4 |
| 25 - 29 | 5 |

Solution: To compute Q1, D7, and P10 of grouped data, first you need to fill out a table with the columns Class Interval, f, LB, and < cf.

To compute the lower boundary (LB), always subtract 0.5 from the lower class limit (LC).
Ex: 55 − 0.5 = 54.5, 50 − 0.5 = 49.5, 45 − 0.5 = 44.5

To compute the < cf when the class intervals are arranged in descending order, always start at the bottom. Copy the frequency of the lowest class interval, then keep adding the next frequency:
5; 5 + 4 = 9; 9 + 6 = 15; 15 + 9 = 24; 24 + 7 = 31; 31 + 6 = 37; 37 + 3 = 40

Solution:

| Class Interval | f | LB | < cf |
| --- | --- | --- | --- |
| 55 - 59 | 3 | 54.5 | 40 |
| 50 - 54 | 6 | 49.5 | 37 |
| 45 - 49 | 7 | 44.5 | 31 |
| 40 - 44 | 9 | 39.5 | 24 |
| 35 - 39 | 6 | 34.5 | 15 |
| 30 - 34 | 4 | 29.5 | 9 |
| 25 - 29 | 5 | 24.5 | 5 |
| Total | n = 40 | | |

$$Q_k = LB + \left(\frac{\frac{nk}{4} - {<}cf}{f}\right) i$$
First, compute nk/4; it will help us determine the quartile class and the < cf.
$$\frac{nk}{4} = \frac{(40)(1)}{4} = 10$$
The quartile class is the class containing the 10th item. Hence, the quartile class is 35 - 39.
$$Q_1 = 34.5 + \left(\frac{10 - 9}{6}\right) 5 = 35.33$$

Solution (using the same table):
$$D_k = LB + \left(\frac{\frac{nk}{10} - {<}cf}{f}\right) i$$
First, compute nk/10; it will help us determine the decile class and the < cf.
$$\frac{nk}{10} = \frac{(40)(7)}{10} = 28$$
The decile class is the class containing the 28th item. Hence, the decile class is 45 - 49.
$$D_7 = 44.5 + \left(\frac{28 - 24}{7}\right) 5 = 47.36$$

Solution (using the same table):
$$P_k = LB + \left(\frac{\frac{nk}{100} - {<}cf}{f}\right) i$$
First, compute nk/100; it will help us determine the percentile class and the < cf.
$$\frac{nk}{100} = \frac{(40)(10)}{100} = 4$$
The percentile class is the class containing the 4th item. Hence, the percentile class is 25 - 29.
Since 25 - 29 is the lowest class, no class precedes it and the < cf is 0:
$$P_{10} = 24.5 + \left(\frac{4 - 0}{5}\right) 5 = 28.5$$

Example 3: The ages of the townspeople in a certain community are as follows:

| Class Interval | Frequency |
| --- | --- |
| 18 - 24 | 28 |
| 25 - 31 | 54 |
| 32 - 38 | 38 |
| 39 - 45 | 20 |
| 46 - 52 | 17 |
| 53 - 59 | 3 |

Find Q2, D5, and P50.

Solution: To compute Q2, D5, and P50 of grouped data, first you need to fill out a table with the columns Class Interval, f, LB, and < cf.

To compute the lower boundary (LB), always subtract 0.5 from the lower class limit (LC).
Ex: 18 − 0.5 = 17.5, 25 − 0.5 = 24.5, 32 − 0.5 = 31.5

To compute the < cf when the class intervals are arranged in ascending order, always start at the top. Copy the frequency of the lowest class interval, then keep adding the next frequency:
28; 28 + 54 = 82; 82 + 38 = 120; 120 + 20 = 140; 140 + 17 = 157; 157 + 3 = 160

Solution:

| Class Interval | f | LB | < cf |
| --- | --- | --- | --- |
| 18 - 24 | 28 | 17.5 | 28 |
| 25 - 31 | 54 | 24.5 | 82 |
| 32 - 38 | 38 | 31.5 | 120 |
| 39 - 45 | 20 | 38.5 | 140 |
| 46 - 52 | 17 | 45.5 | 157 |
| 53 - 59 | 3 | 52.5 | 160 |
| Total | n = 160 | | |

$$Q_k = LB + \left(\frac{\frac{nk}{4} - {<}cf}{f}\right) i$$
First, compute nk/4; it will help us determine the quartile class and the < cf.
$$\frac{nk}{4} = \frac{(160)(2)}{4} = 80$$
The quartile class is the class containing the 80th item. Hence, the quartile class is 25 - 31.
$$Q_2 = 24.5 + \left(\frac{80 - 28}{54}\right) 7 = 31.24$$

Solution:

| Class Interval | f | LB | < cf |
| --- | --- | --- | --- |
| 18 - 24 | 28 | 17.5 | 28 |
| 25 - 31 | 54 | 24.5 | 82 |
| 32 - 38 | 38 | 31.5 | 120 |
| 39 - 45 | 20 | 38.5 | 140 |
| 46 - 52 | 17 | 45.5 | 157 |
| 53 - 59 | 3 | 52.5 | 160 |
| Total | n = 160 | | |

$$D_k = LB + \left(\frac{\frac{nk}{10} - {<}cf}{f}\right) i$$
First, compute nk/10; it will help us determine the decile class and the < cf.
$$\frac{nk}{10} = \frac{(160)(5)}{10} = 80$$
The decile class is the class containing the 80th item. Hence, the decile class is 25 - 31.
$$D_5 = 24.5 + \left(\frac{80 - 28}{54}\right) 7 = 31.24$$

Solution (using the same table):
$$P_k = LB + \left(\frac{\frac{nk}{100} - {<}cf}{f}\right) i$$
First, compute nk/100; it will help us determine the percentile class and the < cf.
$$\frac{nk}{100} = \frac{(160)(50)}{100} = 80$$
The percentile class is the class containing the 80th item. Hence, the percentile class is 25 - 31.
$$P_{50} = 24.5 + \left(\frac{80 - 28}{54}\right) 7 = 31.24$$

Sample Interpretation:
1. Jennifer just received the results of her SAT exam. Her SAT Mathematics score of 600 is in the 74th percentile. What does this mean?
A percentile rank of 74% means that 74% of SAT Mathematics scores are less than or equal to 600 and 26% of the scores are greater. So 26% of the students who took the exam scored better than Jennifer.
2. The time taken to finish a test is 35 minutes. This time was the first quartile. What does this mean?
25% of the learners finished the exam in 35 minutes or less, and 75% of the learners finished the exam in more than 35 minutes.
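The quantile formulas used in the worked examples can be collected into two small functions. This is a minimal sketch under the module's conventions (position nk/parts + 0.5 with interpolation for ungrouped data; lower boundary = lower class limit − 0.5 and equal class widths for grouped data). The function names `ungrouped_quantile` and `grouped_quantile` are introduced here for illustration.

```python
def ungrouped_quantile(data, k, parts):
    """k-th quantile out of `parts`, using position n*k/parts + 0.5 and interpolation."""
    x = sorted(data)
    n = len(x)
    pos = min(max(n * k / parts + 0.5, 1), n)   # 1-based position, kept inside the data
    lower = int(pos)                            # position of the value just below
    frac = pos - lower
    if lower == n:
        return x[-1]
    return x[lower - 1] + frac * (x[lower] - x[lower - 1])

def grouped_quantile(classes, k, parts):
    """classes: (lower_limit, upper_limit, frequency) tuples in ascending order."""
    n = sum(f for _, _, f in classes)
    target = n * k / parts
    width = classes[1][0] - classes[0][0]       # difference of consecutive lower limits
    cum = 0                                     # < cf of the class preceding the current one
    for lower_limit, _, f in classes:
        if f > 0 and cum + f >= target:
            return (lower_limit - 0.5) + (target - cum) / f * width
        cum += f
    raise ValueError("quantile position exceeds the number of observations")

hours_lost = [55, 23, 37, 37, 48, 42, 27, 20, 30, 32, 24, 40]
barangay_ages = [(25, 29, 5), (30, 34, 4), (35, 39, 6), (40, 44, 9),
                 (45, 49, 7), (50, 54, 6), (55, 59, 3)]
town_ages = [(18, 24, 28), (25, 31, 54), (32, 38, 38),
             (39, 45, 20), (46, 52, 17), (53, 59, 3)]

print("Q3 :", round(ungrouped_quantile(hours_lost, 3, 4), 2))
print("D4 :", round(ungrouped_quantile(hours_lost, 4, 10), 2))
print("P55:", round(ungrouped_quantile(hours_lost, 55, 100), 2))
print("Median:", round(grouped_quantile(barangay_ages, 2, 4), 2))
print("Q1 :", round(grouped_quantile(barangay_ages, 1, 4), 2))
print("D7 :", round(grouped_quantile(barangay_ages, 7, 10), 2))
print("Q2 :", round(grouped_quantile(town_ages, 2, 4), 2))
```

Because Q2, D5, and P50 all target the same position (half of n), calling `grouped_quantile` with (2, 4), (5, 10), or (50, 100) gives the same value, just as in the worked example.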
Measures of Dispersion/Variability
Based on the figures below, determine which of the two scatter diagrams illustrates larger variability.
[Figure 1 and Figure 2: scatter diagrams]
Since the data points in Figure 2 are more scattered than the data points in Figure 1, the data set depicted in Figure 2 is more varied.

Measures of Dispersion/Variability: RANGE
It is the difference between the largest and the smallest observations or items in a set of data.
$$R = X_{max} - X_{min}$$
The range is simple to calculate. However, we should be cautious about using the range as a measure of variability. It is a very crude measure because it uses only the highest and lowest values in its computation. Therefore, it does not accurately capture how the data values in the set differ when the data set contains unusual cases (outliers).

Measures of Dispersion/Variability: STANDARD DEVIATION
• It is a measure of how far away items in a data set are from the mean.
• The larger the standard deviation, the more variation there is in the data set.
• The standard deviation can never be a negative number, due to the way it is calculated and the fact that it measures a distance (distances are never negative numbers).
• The smallest possible value for the standard deviation is 0, and that happens only in contrived situations where every single number in the data set is exactly the same (no deviation).
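Two of the claims above are easy to check numerically: the range reacts strongly to a single outlier, and the standard deviation is 0 only when every value is identical. The sketch below reuses Data Set I and Data Set II from the discussion of central tendency; `value_range` and the `identical` list are names and data introduced here for illustration.

```python
from statistics import pstdev

def value_range(data):
    """Range: largest observation minus smallest observation."""
    return max(data) - min(data)

data_set_1 = [108, 112, 116, 120, 124]
data_set_2 = [108, 112, 116, 120, 205]   # one outlier replaces 124

print(value_range(data_set_1))   # 16
print(value_range(data_set_2))   # 97

# Hypothetical data set in which every value is the same: no deviation at all.
identical = [7, 7, 7, 7]
print(pstdev(identical))         # 0.0
```

Four of the five values are the same in both data sets, yet the range jumps from 16 to 97, which is why the range is described as a crude measure.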
Formula for Standard Deviation:

Sample Standard Deviation
✦ For Ungrouped Data
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
where: $x_i$ = data values, $\bar{x}$ = mean, n = no. of sample observations
✦ For Grouped Data
$$s = \sqrt{\frac{\sum_{i=1}^{r} f (x_i - \bar{x})^2}{n - 1}}$$
where: $x_i$ = data values, $\bar{x}$ = mean, f = frequency, n = no. of sample observations

Population Standard Deviation
✦ For Ungrouped Data
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where: $x_i$ = data values, $\mu$ = mean, N = no. of observations
✦ For Grouped Data
$$\sigma = \sqrt{\frac{\sum_{i=1}^{r} f (x_i - \mu)^2}{N}}$$
where: $x_i$ = data values, $\mu$ = mean, f = frequency, N = no. of observations

Measures of Dispersion/Variability: VARIANCE
It takes all data points in a set into account and is calculated by averaging the squared deviation of each data point from the mean. The variance is not easy to interpret because it is expressed in squared units. The standard deviation, being in the same units as the mean, makes the spread of the data easier to understand.

Formula for Variance:

Sample Variance
✦ For Ungrouped Data
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where: $x_i$ = data values, $\bar{x}$ = mean, n = no. of sample observations
✦ For Grouped Data
$$s^2 = \frac{\sum_{i=1}^{r} f (x_i - \bar{x})^2}{n - 1}$$
where: $x_i$ = data values, $\bar{x}$ = mean, f = frequency, n = no. of sample observations

Population Variance
✦ For Ungrouped Data
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where: $x_i$ = data values, $\mu$ = mean, N = no. of observations
✦ For Grouped Data
$$\sigma^2 = \frac{\sum_{i=1}^{r} f (x_i - \mu)^2}{N}$$
where: $x_i$ = data values, $\mu$ = mean, f = frequency, N = no. of observations

Example 1: The data given below is the age of the residents in Barangay 634, Sta. Mesa, Manila. Compute sample standard deviation and sample variance.
Class Interval    Frequency
55 - 59           55
50 - 54           23
45 - 49           37
40 - 44           37
35 - 39           48
30 - 34           42
25 - 29           27

Solution: To compute the standard deviation and variance of grouped data, first fill out the table below, where x is the class midpoint.

Class Interval     f     x      fx     (xi − x̄)²    f(xi − x̄)²
55 - 59            3    57     171      228.61        685.83
50 - 54            6    52     312      102.41        614.46
45 - 49            7    47     329       26.21        183.47
40 - 44            9    42     378        0.01          0.09
35 - 39            6    37     222       23.81        142.86
30 - 34            4    32     128       97.61        390.44
25 - 29            5    27     135      221.41      1,107.05
Total          n = 40         1,675                  3,124.20

Mean:

$$\bar{x} = \frac{\sum_{i=1}^{7} f x_i}{n} = \frac{1{,}675}{40} = 41.88$$

Squared deviations (first three classes shown):

$(x_1 - \bar{x})^2 = (57 - 41.88)^2 = 228.61$
$(x_2 - \bar{x})^2 = (52 - 41.88)^2 = 102.41$
$(x_3 - \bar{x})^2 = (47 - 41.88)^2 = 26.21$

Weighted squared deviations (first three classes shown):

$f(x_1 - \bar{x})^2 = 3(228.61) = 685.83$
$f(x_2 - \bar{x})^2 = 6(102.41) = 614.46$
$f(x_3 - \bar{x})^2 = 7(26.21) = 183.47$

$$\sum_{i=1}^{7} f(x_i - \bar{x})^2 = 3{,}124.20$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{7} f(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{3{,}124.20}{40 - 1}} = 8.95$$

Sample variance:

$$s^2 = \frac{\sum_{i=1}^{7} f(x_i - \bar{x})^2}{n-1} = \frac{3{,}124.20}{40 - 1} = 80.11$$
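The grouped-data computation in Example 1 can be reproduced with a short Python sketch. It uses the class midpoints and the frequencies from the solution table (f = 3, 6, 7, 9, 6, 4, 5); the function name `grouped_sample_variance` is an illustrative choice, not something defined in the module.

```python
import math

# Class midpoints and frequencies from the solution table of Example 1.
midpoints = [57, 52, 47, 42, 37, 32, 27]
freqs = [3, 6, 7, 9, 6, 4, 5]


def grouped_sample_variance(x, f):
    """Return (mean, sample variance) for grouped data.

    mean = sum(f * x) / n
    s^2  = sum(f * (x - mean)^2) / (n - 1)
    """
    n = sum(f)
    mean = sum(fi * xi for xi, fi in zip(x, f)) / n
    sum_sq = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f))
    return mean, sum_sq / (n - 1)


mean, variance = grouped_sample_variance(midpoints, freqs)
sd = math.sqrt(variance)

print("n        =", sum(freqs))
print("mean     =", round(mean, 2))
print("variance =", round(variance, 2))
print("sd       =", round(sd, 2))
```

The sketch carries the unrounded mean (41.875) through the calculation, whereas the hand solution rounds it to 41.88 first. The two approaches agree to two decimal places for both the variance and the standard deviation.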
How to interpret variance and standard deviation?

Consider the following data set of toddler weights, in pounds, taken in an outpatient clinic:

Data Set: 15, 13, 20, 19, 14

The computed variance for this data set is 9.7. The computed standard deviation for this data set is 3.11. What does this mean?

The variance is hard to interpret directly as a measure of variability. The values here are weights measured in pounds for five subjects. Because the deviation of each observation from the mean has been squared, the unit of the variance is (pound)². What does (pound)² mean? If we were to say that the data values differ from the mean on average by about 9.7 (pound)², would this claim make sense? Probably not, since there is no such unit as a (pound)².

Why do we then square the deviations if the (unit)² cannot be interpreted sensibly at the end? The answer is simple: if you sum the deviations without squaring them, they will always add up to zero, no matter what data set you work with:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0 \quad\text{whereas}\quad \sum_{i=1}^{n} (x_i - \bar{x})^2 \neq 0 \text{ (unless all values are equal)}$$

How can we talk about variability if the measure of variability always comes out equal to zero? This is why we square the deviations to compute the variance first and then take its square root to compute the standard deviation, which brings us back to the original unit of measurement.

We get the standard deviation of 3.11 by taking the square root of 9.7; we can then say that the data values differ from the mean (16.2 lbs.) by about 3.11 pounds on average. We can interpret this finding to mean that typical weights fall within one standard deviation of the mean, that is, between 13.09 and 19.31 pounds. This makes more sense when you look at the data set, compared to the variance.
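The toddler-weight figures quoted above can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are illustrative.

```python
import statistics

weights = [15, 13, 20, 19, 14]  # toddler weights in pounds, from the text

mean = statistics.mean(weights)

# Unsquared deviations cancel out: their sum is zero up to rounding error.
deviations = [w - mean for w in weights]
sum_of_deviations = sum(deviations)

# Sample variance (divides by n - 1) and sample standard deviation.
variance = statistics.variance(weights)
sd = statistics.stdev(weights)

print("mean              =", mean)
print("sum of deviations =", round(sum_of_deviations, 10))
print("variance (lb^2)   =", round(variance, 2))
print("sd (lb)           =", round(sd, 2))
print("mean - sd, mean + sd =", round(mean - sd, 2), round(mean + sd, 2))
```

The last line gives the interval of one standard deviation around the mean, which is the range of "typical" weights used in the interpretation.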
Note that the mean and standard deviation should always be reported together!

16.2 − 3.11 = 13.09
16.2 + 3.11 = 19.31

Choosing a Measure of Dispersion/Variability

We have discussed four measures of dispersion/variability (the range, the interquartile range, the variance, and the standard deviation) and examined how they differ. The next legitimate question to ask may be "When do we use which measure?"

You should use the range only as a crude measure, since it is extremely sensitive to unusual values in the data set. The interquartile range is not as sensitive to unusual data values, whereas the standard deviation is very sensitive to them. Therefore, the interquartile range should be used with the median when the data contain unusual values, and the standard deviation should be used with the mean when the data are free of unusual values.

Shape of Distribution

Two statistics, skewness and kurtosis, give you insight into the shape of the distribution.

✦ Skewness is the degree of distortion from the symmetrical bell curve, or the normal distribution. It measures the lack of symmetry in the data distribution.
✦ Kurtosis is a measure of the combined sizes of the two tails. It is often described in terms of how tall and sharp the central peak is relative to a standard bell curve, although it is determined mostly by the tails.

Skewness

A symmetrical distribution will have a skewness of 0. So, a normal distribution will have a skewness of 0. In a symmetrical distribution with a single peak, the mean, median and mode are equal to each other, and the ordinate at the mean divides the distribution into two equal parts.
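The statement that a symmetrical distribution has a skewness of 0 can be illustrated numerically. The sketch below assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment), which is one common definition and is not a formula given in this module. The data sets are hypothetical.

```python
import statistics


def moment_skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets, chosen only for illustration.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 9]
left_skewed = [-v for v in right_skewed]  # mirror image of right_skewed

for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "skewness =", round(moment_skewness(data), 3))
```

For the symmetric data the mean, median and mode all equal 3 and the skewness is 0; mirroring a data set flips the sign of its skewness.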
There are two types of skewness:

• Negatively Skewed/Skewed Left: the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.
• Positively Skewed/Skewed Right: the tail on the right side of the distribution is longer or fatter than the tail on the left side. The mean and median will be greater than the mode.

[Figure: three distributions labeled Skewness < 0, Skewness > 0, and Skewness = 0]

Karl Pearson's Measure of Skewness

Notice that the mean, median and mode are not equal in a skewed distribution. Karl Pearson's measure of skewness is based on the divergence of the mean from the mode in a skewed distribution. Karl Pearson's Coefficient of Skewness (Sk) is given by

$$S_k = \frac{\bar{x} - \hat{x}}{s}$$

where: $\bar{x}$ is the mean, $\hat{x}$ is the mode, and $s$ is the sample standard deviation.

As defined above, Sk depends on the mode. If the mode is not defined for a distribution, we cannot find Sk. However, an empirical relation between the mean, median and mode states that, for a moderately skewed distribution,

Mean − Mode ≈ 3(Mean − Median)

Hence Karl Pearson's coefficient of skewness is defined in terms of the median as

$$S_k = \frac{3(\bar{x} - \tilde{x})}{s}$$

where: $\bar{x}$ is the mean, $\tilde{x}$ is the median, and $s$ is the sample standard deviation.

Kurtosis

Kurtosis is essentially a measure of the outliers present in the distribution. The outliers in a sample therefore have even more effect on the kurtosis than they do on the skewness. Higher kurtosis means that more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
In other words, it is the tails that mostly account for kurtosis, not the central peak. The kurtosis decreases as the tails become lighter and increases as the tails become heavier.

• Mesokurtic (Kurtosis = 3): The distribution has a kurtosis statistic similar to that of the normal distribution.
• Leptokurtic (Kurtosis > 3): The peak is higher and sharper than that of the normal distribution, and the data are heavy-tailed, that is, there is a profusion of outliers.
• Platykurtic (Kurtosis < 3): Compared to a normal distribution, the tails are shorter and thinner, and often the central peak is lower and broader.

Percentile Coefficient of Kurtosis

A measure of kurtosis based on quartiles and percentiles is

$$k = \frac{QD}{P_{90} - P_{10}}$$

where QD is the semi-interquartile range:

$$QD = \frac{Q_3 - Q_1}{2}$$

How to Calculate Measures of Central Tendency, Measures of Variation, Skewness and Kurtosis for Ungrouped and Sample Data Using Excel?

Example: The data given below are the scores of randomly selected applied statistics undergraduate students in Section A and Section B. Compare the scores of Section A and Section B based on measures of central tendency and measures of variation, and determine which section performed better in the final examination. Also, describe the shape of the distribution of these two data sets using skewness and kurtosis.

Data Set A: 40 38 42 40 39 39 43 40 39 40
Data Set B: 46 37 40 33 42 36 40 47 34 45

1. Click "DATA" on the menu bar and click "DATA ANALYSIS" on the toolbar. The dialog box will appear.
2. Select "Descriptive Statistics", then click "OK".
3. Highlight your data for the "INPUT RANGE" and tick the "LABELS IN FIRST ROW" box.
4. Tick "Summary statistics" and then click "OK".

Repeat the process for Data Set B.

When comparing distributions, it is better to use a measure of variation/dispersion in addition to a measure of central tendency. Because Data Set A and Data Set B in this example have the same values for the measures of central tendency, we will use only the measures of variation/dispersion to compare the two data sets.

Based on the result, Data Set B has the larger variability, since it has the larger computed value for each of the measures of variation. This means that Data Set B is much more spread out than Data Set A. In this example, the section that performed better is the one with a large mean and a small standard deviation. Section A and Section B have the same mean, but Section A has a smaller standard deviation than Section B; therefore, Section A performed better in the final examination. In terms of the shape of the distribution, the two data sets have the same shape with respect to skewness and kurtosis: both Data Set A and Data Set B are platykurtic and skewed to the right.

Normal Distribution

✦ The normal distribution is sometimes called the bell curve because the graph of its probability density looks like a bell.
✦ It is also known as the Gaussian distribution, named after the German mathematician Carl Friedrich Gauss.
✦ It is a probability function that describes how the values of a variable are distributed.

Normal Curve

[Figure: relative frequency histogram with a red normal curve overlaid; horizontal axis marked 50, 100, 150]

The red curve is a model called the normal curve, which is used to describe continuous random variables that are said to be normally distributed. A continuous random variable is normally distributed, or has a normal probability distribution, if its relative frequency histogram has the shape of a normal curve.

No data will ever be exactly normally distributed in reality. How, then, do we know whether or not a collected data set is approximately normally distributed? We can begin with a visual display of the data in a histogram. However, a visual check alone may not be sufficient. There are statistical measures, skewness and kurtosis, which, along with a histogram, allow us to judge whether the data set is normally distributed.

Why is it important to know if the data follow a normal distribution? The most important reasons are that many human characteristics are approximately normally distributed and that measurement scores are assumed to be normally distributed in most statistical analyses. Therefore, the statistical results you get at the end may not be trustworthy if the variable is not normally distributed.

Properties of Normal Curve

1. The normal curve is bell-shaped and symmetric about the mean, μ.
2. Because the mean, median and mode are equal, the normal curve has a single peak, and the highest point occurs at x = μ.
3.
The normal curve has inflection points at μ − σ and μ + σ.

[Figure: normal curve with the inflection points marked at μ − σ and μ + σ]

4. The area under the normal curve is 1.
5. The area under the normal curve to the right of μ equals the area under the curve to the left of μ, which equals 0.50.
6. The normal curve approaches, but never touches, the x-axis as it extends farther and farther away from the mean.

[Figure: normal curve with total area = 1 and area 0.50 on each side of μ]

Mean: Changing the mean shifts the entire curve left or right on the X-axis.

Standard Deviation: Changing the standard deviation either tightens or spreads out the width of the distribution along the X-axis. Larger standard deviations produce distributions that are more spread out.

[Figure: pairs of normal curves for the cases μ1 < μ2, σ1 = σ2; μ1 = μ2, σ1 < σ2; and μ1 < μ2, σ1 < σ2]

Determine whether each graph represents a normal curve.

[Figures A, B, C, and D]

None of them represents a normal curve.

Role of Area under a Normal Curve

Suppose that a random variable X is normally distributed with mean μ and standard deviation σ. The area under the normal curve for any interval of values of the random variable X represents either

✦ the proportion of the population with the characteristic described by the interval of values, or
✦ the probability that a randomly selected individual from the population will have the characteristic described by the interval of values.
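The link between area and probability can be made concrete with Python's `statistics.NormalDist`, whose `cdf` method returns the area under the normal curve to the left of a value. The mean of 100 and standard deviation of 15 below are hypothetical values chosen for illustration, and `area_between` is an illustrative helper.

```python
from statistics import NormalDist

# Hypothetical normal random variable X with mu = 100 and sigma = 15.
X = NormalDist(mu=100, sigma=15)


def area_between(dist, a, b):
    """Area under the normal curve between a and b, i.e. P(a <= X <= b)."""
    return dist.cdf(b) - dist.cdf(a)


area_left_of_mean = X.cdf(100)               # property 5: equals 0.50
area_right_of_mean = 1 - X.cdf(100)          # total area is 1 (property 4)
area_within_one_sd = area_between(X, 85, 115)
area_above_130 = 1 - X.cdf(130)              # right-tail area

print("P(X < 100)        =", round(area_left_of_mean, 4))
print("P(X > 100)        =", round(area_right_of_mean, 4))
print("P(85 <= X <= 115) =", round(area_within_one_sd, 4))
print("P(X > 130)        =", round(area_above_130, 4))
```

Each printed area can be read either as a proportion of the population or as the probability that a randomly selected individual falls in the interval.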
Standard Normal Distribution

A normal random variable with mean μ = 0 and standard deviation σ = 1 is called a standard normal random variable, and its density curve is called the standard normal curve. It will always be denoted by the letter Z.

Standardizing a Normal Random Variable

A value of the standard normal random variable is called a standard score or a z-score. Every normal random variable X can be transformed into a z-score via the following equation:

$$z = \frac{x - \mu}{\sigma}$$

where x is a value of the normal random variable X, μ is the mean of X, and σ is the standard deviation of X.

Probabilities for a standard normal random variable are computed using the Standard Normal Distribution Table, which shows the cumulative probability associated with a particular z-score.

Remember! Positive z-scores indicate how far above the mean a score falls, and negative z-scores indicate how far below the mean a score falls. Whether positive or negative, z-scores that are larger in absolute value mean that scores are far from the mean, and z-scores that are smaller in absolute value mean that scores are close to the mean.

Standard Normal Distribution Table 1 (Positive Side, P(Z < z))

Standard Normal Distribution Table 2 (Negative Side, P(Z < −z))

Patterns for Finding Areas under a Standard Normal Curve Using Table 1

A. Area to the right of a negative z value or to the left of a positive z value. Use Table 1 directly.

B.
Area between z values on either side of 0. Take the Table 1 area for the positive z value and subtract (1 − Area) for the negative z value.

C. Area between z values on the same side of 0. Subtract the smaller Table 1 area from the larger one.

D. Area to the right of a positive z value or to the left of a negative z value. Compute 1 − Area.

E. Area between a given z value and 0. Compute Area − 0.50.

Patterns for Finding Areas under a Standard Normal Curve Using Table 2

A. Area to the right of a positive z value or to the left of a negative z value. Use Table 2 directly.

B. Area between z values on the same side of 0. Subtract the smaller Table 2 area from the larger one.

C. Area between z values on either side of 0. Add (0.50 − Area) for one z value and (0.50 − Area) for the other.

D. Area to the right of a negative z value or to the left of a positive z value. Add 0.50 and (0.50 − Area).

E. Area between a given z value and 0. Compute 0.50 − Area.

Example 1: Scores on a standardized college entrance examination (CEE) are normally distributed with mean 510 and standard deviation 60. A selective university considers for admission only applicants with CEE scores over 560. Find the proportion of all individuals who took the CEE who meet the university's CEE requirement for consideration for admission.

Solution:

Given: μ = 510, σ = 60 and x = 560

Step 1: Draw a normal curve and shade the desired area.
[Figure: normal curve for X with the horizontal axis marked 450, 510, 570; the area to the right of 560 is shaded, Area = P(X > 560)]

Using Table 1 (By-hand Approach)

Step 2: Convert the value of x to a z-score.

$$P(X > 560) = P\left(Z > \frac{560 - 510}{60}\right) = P(Z > 0.83) = 1 - P(Z \le 0.83) = 1 - 0.7967 = 0.2033$$

Use the Complement Rule and determine one minus the area.

[Figure: standard normal curve with the area to the right of z = 0.83 shaded, Area = P(Z > 0.83) = 0.2033]

The proportion of all CEE scores that exceed 560 is 0.2033 or 20.33%.

Using Table 2 (By-hand Approach)

Step 2: Convert the value of x to a z-score.

$$P(X > 560) = P\left(Z > \frac{560 - 510}{60}\right) = P(Z > 0.83) = 0.2033$$

[Figure: standard normal curve with the area to the right of z = 0.83 shaded, Area = P(Z > 0.83) = 0.2033]

The proportion of all CEE scores that exceed 560 is 0.2033 or 20.33%.

Technology Approach

Step 2: Use Excel to determine the area under any normal curve. Use "TRUE" for cumulative since we want the area under the normal curve.

The proportion of all CEE scores that exceed 560 is 0.2033 or 20.33%.

Example 2: A pediatrician obtains the heights of her three-year-old female patients. The heights are approximately normally distributed, with mean 38.72 inches and standard deviation 3.17 inches. Determine the proportion of the three-year-old females that have a height less than 35 inches.

Solution:

Given: μ = 38.72, σ = 3.17 and x = 35

Step 1: Draw a normal curve and shade the desired area.

[Figure: normal curve for X with the horizontal axis marked 35.55, 38.72, 41.89; the area to the left of 35 is shaded, Area = P(X < 35)]

Using Table 1 (By-hand Approach)

Step 2: Convert the value of x to a z-score.
$$P(X < 35) = P\left(Z < \frac{35 - 38.72}{3.17}\right) = P(Z < -1.17) = 1 - P(Z \ge -1.17) = 1 - 0.8790 = 0.1210$$

Use the Complement Rule and determine one minus the area.

[Figure: standard normal curve with the area to the left of z = −1.17 shaded, Area = P(Z < −1.17) = 0.1210]

The proportion of the pediatrician's three-year-old females who are less than 35 inches tall is 0.1210 or 12.10%.

Using Table 2 (By-hand Approach)

Step 2: Convert the value of x to a z-score.

$$P(X < 35) = P\left(Z < \frac{35 - 38.72}{3.17}\right) = P(Z < -1.17) = 0.1210$$

[Figure: standard normal curve with the area to the left of z = −1.17 shaded, Area = P(Z < −1.17) = 0.1210]

The proportion of the pediatrician's three-year-old females who are less than 35 inches tall is 0.1210 or 12.10%.

Technology Approach

Step 2: Use Excel to determine the area under any normal curve. Use "TRUE" for cumulative since we want the area under the normal curve.

The proportion of the pediatrician's three-year-old females who are less than 35 inches tall is 0.1210 or 12.10%.

Example 3: A pediatrician obtains the heights of her three-year-old female patients. The heights are approximately normally distributed, with mean 38.72 inches and standard deviation 3.17 inches. Determine the probability that a randomly selected three-year-old girl is between 35 and 40 inches tall, inclusive.

Solution:

Given: μ = 38.72, σ = 3.17, and 35 ≤ X ≤ 40

Step 1: Draw a normal curve and shade the desired area.

[Figure: normal curve for X with the horizontal axis marked 35.55, 38.72, 41.89; the area between 35 and 40 is shaded, Area = P(35 ≤ X ≤ 40)]

Using Table 1 (By-hand Approach)

Step 2: Convert the value of x to a z-score.
$$P(35 \le X \le 40) = P\left(\frac{35 - 38.72}{3.17} \le Z \le \frac{40 - 38.72}{3.17}\right) = P(-1.17 \le Z \le 0.40)$$

$$= P(Z \le 0.40) - [1 - P(Z \ge -1.17)] = 0.6554 - [1 - 0.8790] = 0.6554 - 0.1210 = 0.5344$$

[Figure: standard normal curve with the area between z = −1.17 and z = 0.40 shaded, Area = P(−1.17 ≤ Z ≤ 0.40)]

The probability that a randomly selected three-year-old female is between 35 and 40 inches tall is 0.5344.

Using Table 2 (By-hand Approach)

Step 2: Convert the value of x to a z-score.

$$P(35 \le X \le 40) = P\left(\frac{35 - 38.72}{3.17} \le Z \le \frac{40 - 38.72}{3.17}\right) = P(-1.17 \le Z \le 0.40)$$

$$= [0.50 - P(Z \ge 0.40)] + [0.50 - P(Z \le -1.17)] = [0.50 - 0.3446] + [0.50 - 0.1210] = 0.1554 + 0.3790 = 0.5344$$

[Figure: standard normal curve with the area between z = −1.17 and z = 0.40 shaded, Area = P(−1.17 ≤ Z ≤ 0.40)]

The probability that a randomly selected three-year-old female is between 35 and 40 inches tall is 0.5344.

Technology Approach

Step 2: Use Excel to determine the area under any normal curve. Use "TRUE" for cumulative since we want the area under the normal curve.

ACTIVITIES/ASSESSMENTS:

1. Which one do you think is more informative? Why?

[Figures for comparison]

2. What features of the 'Good Presentation' make it better than the 'Bad Presentation'?

[Figures A and B]

3. Review the table and consider questions such as the following.

Origin / Rating   Poor   Needs Improvement   Satisfactory   Very Good   Excellent   Total
External           0%          2%                12%           19%          9%        41%
Internal           4%          8%                15%           23%          9%        59%
Grand Total        4%         10%                27%           41%         17%       100%

1. What percentage of the employees originated from within the organization?
2.
What percentage of the employees are both internal and rated 'Very Good'?
3. What percentage of the employees received 'Needs Improvement' or 'Poor'?
4. What category contains the greatest number of employees?
5. Do you see any notable differences in the percentage by category?

4. Consider the following Frequency Distribution of Salaries.

Salary               Frequency   Percentage
41,000 - 50,000           1           1%
51,000 - 60,000          20          13%
61,000 - 70,000          53          35%
71,000 - 80,000          43          29%
81,000 - 90,000          26          17%
91,000 - 100,000          6           4%
101,000 - 110,000         1           1%
Total                   150         100%

1. What percentage of the employees earn less than or equal to 80,000?
2. What is the range of salary values?
3. Which salary categories have a percentage of less than 5?
4. Which salary category includes the most employees?

5. The length of life of an instrument produced by a machine has a normal distribution with a mean of 12 months and a standard deviation of 2 months. Find the probability that an instrument produced by this machine will last
A. less than 7 months.
B. between 7 and 12 months.
Be sure to draw a normal curve with the area corresponding to the probability shaded.

6. The lengths of human pregnancies are approximately normally distributed, with mean μ = 266 days and standard deviation σ = 16 days.
A. What proportion of pregnancies lasts more than 270 days?
B. What proportion of pregnancies lasts less than 250 days?
C. What proportion of pregnancies lasts between 240 and 280 days?
D. What is the probability that a randomly selected pregnancy lasts more than 280 days?
Be sure to draw a normal curve with the area corresponding to the probability shaded.

7.
Construct a frequency distribution table based on the scores of 75 randomly selected students.

37 46 37 26 30 41 28 49 29 34 46 35 46 45 27
41 26 45 39 43 46 36 49 47 30 43 31 34 38 41
39 45 28 38 30 29 38 26 31 42 44 48 43 37 42
33 42 42 43 39 39 31 46 46 48 50 32 43 46 48
38 46 37 38 50 35 36 39 27 45 42 48 26 50 31

Scores      Frequency   Percentage (%)
26 to 30
31 to 35
36 to 40
41 to 45
46 to 50
Total

A. Based on the frequency distribution, compute the measures of central tendency, the measures of variation, Q1, D9, P10, skewness and kurtosis.
B. Based on the raw data, compute the measures of central tendency, the measures of variation, skewness and kurtosis using Excel.
C. Compute the skewness and kurtosis of the grouped and ungrouped data. Make sure to describe the shape of the distribution.
D. Do you think that the computed values for the grouped and ungrouped data are the same?

8. Begin with the following set of data, call it Data Set I.

5, −2, 6, 14, −3, 0, 1, 4, 3, 2, 5

A. Compute the sample standard deviation and sample mean of Data Set I.
B. Form a new data set, Data Set II, by adding 3 to each number in Data Set I. Calculate the sample standard deviation and sample mean of Data Set II.
C. Form a new data set, Data Set III, by subtracting 6 from each number in Data Set I. Calculate the sample standard deviation and sample mean of Data Set III.
D. Comparing the answers to parts (A), (B), and (C), can you guess the pattern? State the general principle that you expect to be true.

9. Using the "Encoded Data file", construct frequency distribution tables for age, sex, marital status and educational attainment, and interpret the tables.
MODULE 4: INFERENTIAL STATISTICS

OBJECTIVES: After successful completion of this module, you should be able to:

✦ Differentiate the null and alternative hypotheses.
✦ Formulate the appropriate null and alternative hypotheses.
✦ Explain the logic of hypothesis testing.
✦ Assess and test whether the data follow a normal distribution.
✦ Distinguish between independent and dependent sampling.
✦ Identify the appropriate test statistics for normally distributed data.
✦ Conduct a test for two categorical variables.

What is HYPOTHESIS TESTING?

Hypothesis testing is a procedure, based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.

What is a HYPOTHESIS?

• A statement or claim regarding a characteristic of one or more populations.
• A preconceived idea, assumed to be true, that has to be tested for its truth or falsity.

Procedures for Testing Hypothesis

1. State the null and alternative hypotheses.
2. Set the level of significance or alpha level (α).
3. Determine the test distribution to use.
4. Calculate the test statistic or p-value.
5. Make a statistical decision.
6. Draw a conclusion.

1. State the Null and Alternative Hypothesis

Two Types of Hypothesis

1.
Null Hypothesis
• Denoted by $H_0$.
• The statement being tested.
• Assumed true until evidence indicates otherwise.
• Must contain the condition of equality and must be written with the symbol =, ≤, or ≥.

2. Alternative Hypothesis
• Denoted by $H_a$.
• The statement that must be true if the null hypothesis is false.
• Sometimes referred to as the research hypothesis.
• Must not contain the condition of equality and must be written with the symbol ≠, <, or >.

Example Hypothesis:

Null Hypothesis:
✦ Students who eat breakfast and students who do not eat breakfast will perform the same on a math exam.
✦ Students who experience test anxiety prior to an English exam and students who do not will get the same scores.
✦ Motorists who talk on the phone while driving and motorists who do not will make the same number of errors on a driving course.

Alternative Hypothesis:
✦ Students who eat breakfast will perform better on a math exam than students who do not eat breakfast.
✦ Students who experience test anxiety prior to an English exam will get higher scores than students who do not experience test anxiety.
✦ Motorists who talk on the phone while driving will be more likely to make errors on a driving course than those who do not talk on the phone.

Reminders: If you are conducting a research study and you want to use a hypothesis test to support your claim, the claim must be stated in such a way that it becomes the alternative hypothesis, so it cannot contain the condition of equality.

Two Types of Alternative Test

1. One-tailed test
✦ Left-tailed
✦ Right-tailed
2. Two-tailed test

2. Set the Level of Significance or Alpha Level (α)

You should establish a predetermined level of significance; you will reject the null hypothesis when the p-value is at or below this level.
• The generally accepted levels are 0.10, 0.05, and 0.01. • Be as rigorous as possible. Two Types of Error • Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Example: H0: The defendant is innocent. Ha: The defendant is not innocent. What happen to the defendant if the jury made type I and type II error? Answer: A type I error is like putting an innocent person in jail. A type II error is like letting a guilty person go free. Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Reminders: It is important to note that we want to set ( α ) before we start our study because the Type I error is the more ‘grevious’ error to make. The smaller (α ) is, the smaller the region of rejection. Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics 3. Determine the Test Distribution to Use. Determine the appropriate statistical test to be used. ✦ Dependent Sample t - Test ✦ Independent Sample t - Test ✦ ✦ ✦ One Way Analysis of Variance (ANOVA) Test Pearson r Chi - Square Test Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics 4. Calculate Test Statistic or p - value. Performing statistical analysis using statistical software such as Excel, SPSS, R, Minitab, SAS, etc. 5. Make Statistical Decision ✦ Using confidence interval ✦ Using p-value approach ✦ Using traditional method Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Decision Rule: ✦ Using Confidence Interval Reject the null hypothesis if the test statistic is not within the range specified by the confidence interval. ✦ Using Traditional Approach Reject Ho if the computed value of the test statistic falls in the region of rejection. 
✦ Using P-value Approach Reject the null hypothesis if the computed p-value is less than or equal to the set significance level , otherwise do not reject the null hypothesis. Example: If the level of significance (α = 0.05), Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics P-value 0.01 0.05 0.10 Decision Reject H0 Reject H0 Failed to Reject H0 Traditional Approach Rejection of region or critical region is the set of all values of the test statistic which will lead to the rejection of H0. Acceptance Region is the set of all values of the test statistic that leads the researcher to retain H0. Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics One-tailed and Left tailed One-tailed and Right tailed Ha : μ1 < μ2 Ha : μ1 > μ2 Rejection Region Rejection Region -2 0 2 -2 0 2 Two-tailed Ha : μ1 ≠ μ2 Rejection Region Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics -2 Rejection Region 0 2 In stating your decision you can use: ✦ Fail to reject the null hypothesis/ Do not reject the null hypothesis/ Retain the null hypothesis ✦ Reject the null hypothesis. It is important to recognize that we never accept the null hypothesis. We are merely saying that the sample evidence is not strong enough to warrant rejection of the null hypothesis. 6. Draw Conclusion Record conclusions and recommendations in a report, and associate interpretations to justify your conclusion or recommendations. Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Assessing and Testing Normality of the Data To determine if the data is follows a normality distribution, we can use the graphical or numerical method. 
Graphical:
• Normal Q-Q plot
• Histogram
Numerical:
• Shapiro-Wilk test
• Kolmogorov-Smirnov test

How to Check Normality?
A histogram plots the observed values against their frequency, giving a visual estimate of whether the distribution is bell-shaped.
A Q-Q probability plot displays the observed values against normally distributed data (represented by the line).

Reminders: Graphical methods are typically not very useful when the sample size is small.

Hypotheses of Normality Test
The hypotheses used are:
H0: The sample data follow a normal distribution.
Ha: The sample data do not follow a normal distribution.
When we are testing normality:
• If p-value > α, fail to reject H0: the data are treated as normal.
• If p-value ≤ α, reject H0: the data are NOT normal.

How to Calculate Shapiro-Wilk Test in Excel?
[Figure: sample data]
STEP 1: Rearrange the data in ascending order.
STEP 2: Calculate SS (the sum of squares) as follows:
$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$
Use the "=DEVSQ( )" function in Excel.
STEP 3: Calculate b as follows:
$b = \sum_{i=1}^{m} a_i \left( x_{n+1-i} - x_i \right)$
where n is the number of observations, $m = \frac{n}{2}$ if n is even, and $m = \frac{n-1}{2}$ if n is odd. The $a_i$ weights are taken from the Shapiro-Wilk table (based on the value of n).
[Table: Shapiro-Wilk table of weights]
Since n is even in this example (n = 16), m = 8. That is why we used a1 to a8.
Note that if n is odd, the median data value is not used in the calculation of b.
STEP 4: Calculate the test statistic:
$W = \dfrac{b^2}{SS}$
STEP 5: Find the value in the Shapiro-Wilk table (for the given value of n) that is closest to W, interpolating if necessary. This gives the p-value for the test.
We choose this interval in the Shapiro-Wilk table because n = 16 and our test statistic (W = 0.955) is within this interval.
We used interpolation to get the p-value of the Shapiro-Wilk test.

Test Result
Since the computed p-value is greater than the set level of significance, we fail to reject the null hypothesis. Therefore, the sample data can be treated as following a normal distribution.

Inferential Statistics
1. Parametric Tests
✦ Assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met so that the result of a parametric test is reliable.
✦ Apply to data in ratio scale, and some apply to data in interval scale.
2. Non-Parametric Tests
✦ Refer to statistical methods in which the data are not required to fit a normal distribution.
✦ Most non-parametric tests apply to data in an ordinal scale, and some apply to data in nominal scale.
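Because parametric tests rest on distributional assumptions, a normality check usually comes first. As a minimal sketch, the hand calculation of the Shapiro-Wilk statistic described earlier (sort the data, compute SS, compute b, then W = b²/SS) could be written in Python as follows. The function name `shapiro_wilk_w` and the five-value sample are assumptions made for illustration. The weights 0.6646 and 0.2413 are the tabulated values for n = 5, and the table lookup that turns W into a p-value is not reproduced here.

```python
def shapiro_wilk_w(data, weights):
    """Return (SS, b, W) following Steps 1-4 of the hand calculation.

    `weights` holds a_1..a_m read from a Shapiro-Wilk coefficient table
    for the given n; the caller supplies them, they are not computed here.
    """
    x = sorted(data)                       # Step 1: ascending order
    n = len(x)
    m = n // 2                             # n/2 if n is even, (n-1)/2 if n is odd
    if len(weights) != m:
        raise ValueError("expected %d weights for n = %d" % (m, n))
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)   # Step 2: what Excel's DEVSQ returns
    # Step 3: pair the i-th smallest value with the i-th largest value.
    b = sum(a * (x[n - 1 - i] - x[i]) for i, a in enumerate(weights))
    w = b ** 2 / ss                        # Step 4
    return ss, b, w


# Hypothetical sample with n = 5 (not the module's data).
sample = [3, 1, 4, 2, 5]
ss, b, w = shapiro_wilk_w(sample, [0.6646, 0.2413])
print(round(ss, 4), round(b, 4), round(w, 4))
```

For odd n the integer division leaves the middle value unpaired, which matches the note that the median is not used in the calculation of b.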
Inference About Two Means
To perform inference on the difference of two population means, we must first determine whether the data come from independent or dependent samples.

Distinguish between Independent and Dependent Samples
✦ A sampling method is independent when the individuals selected for one sample do not dictate which individuals are to be in a second sample.
✦ A sampling method is dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample.

Example: Determine whether the sample is independent or dependent.
1. An urban economist believes that commute times to work in the South are less than commute times to work in the Midwest. He randomly selects 40 employed individuals in the South and 45 employed individuals in the Midwest and determines their commute times.
Answer: Independent
2. In an experiment conducted in biology class, Prof. Rhea measured the time required for 12 students to catch a falling meter stick using their dominant hand and non-dominant hand. The goal of the study was to determine whether the reaction time in an individual's dominant hand is different from the reaction time in the non-dominant hand.
Answer: Dependent
3. A researcher wants to know if the mean length of stay in for-profit hospitals is different from the mean length of stay in not-for-profit hospitals. He randomly selected 20 individuals in the for-profit hospital and matched them with 20 individuals in the not-for-profit hospital by diagnosis.
Answer: Dependent

Dependent Sample t-Test
The dependent sample t-test (also called the paired t-test or paired-samples t-test) compares the means of two related groups to determine whether there is a statistically significant difference between these means.
H0: μ1 ≥ μ2 and Ha: μ1 < μ2
H0: μ1 ≤ μ2 and Ha: μ1 > μ2
H0: μ1 = μ2 and Ha: μ1 ≠ μ2

Assumptions
1. Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
2. Your independent variable should consist of two categorical "related groups" or "matched pairs".
3. There should be no significant outliers in the differences between the two related groups.
4. The differences in the dependent variable between the two related groups should be approximately normally distributed.

Example: A teacher is interested to know if the new learning program will help to increase the number of correctly remembered words. Ten subjects learn a list of 50 words. Learning performance is measured using a recall test. After the first test, all subjects are instructed how to use the learning program and then learn a second list of 50 words. Learning performance is again measured with the recall test. In the following table the number of correctly remembered words is listed for both tests.
[Table: number of correctly remembered words per subject, first and second test]

1. State the Null and Alternative Hypothesis
Null hypothesis: H0: μ1 ≥ μ2
The new learning program will not help to increase the number of correctly remembered words.
Alternative hypothesis: Ha: μ1 < μ2
The new learning program will help to increase the number of correctly remembered words.

2. Set the Level of Significance or Alpha Level (α)
α = 0.05

3. Determine the Test Distribution to Use
Dependent Variable: Number of correctly remembered words
Independent Variable: Treatment (Before and After)
Since we are comparing the means of two related groups, we will use the dependent sample t-test.

4. Calculate Test Statistic or p-value
In Excel, click "Data", then click "Data Analysis".

5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject H0; otherwise, fail to reject H0.
Decision: Reject H0

6. Draw Conclusion
There is sufficient evidence to support the claim that the new learning program helps to increase the number of correctly remembered words.

Proper Presentation of Results
[Figure: proper presentation of results]

Exercises: Apply the procedure in testing the hypothesis.
Professor Rhea measured the time (in seconds) required to catch a falling meter stick for 10 randomly selected students' dominant hand and non-dominant hand. Professor Rhea claims that the reaction time in an individual's dominant hand is less than the reaction time in their non-dominant hand. Test the claim at the level of significance.
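The statistic behind the dependent sample t-test can also be computed by hand from the paired differences. The sketch below is an illustration only, under stated assumptions: the function name `paired_t` and the five pairs of scores are made up, and the critical value -2.132 is the t-table entry for a left-tailed test at α = 0.05 with df = 4. With other sample sizes the critical value has to be looked up again.

```python
import math


def paired_t(first, second):
    """Return (t, df) computed from the differences d = first - second."""
    if len(first) != len(second):
        raise ValueError("paired samples must have equal length")
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)   # sample variance of d
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Made-up recall scores for 5 subjects (not the module's data).
first_test = [20, 22, 19, 24, 21]
second_test = [23, 25, 20, 27, 25]
t, df = paired_t(first_test, second_test)

# Traditional approach, left-tailed test (Ha: mu1 < mu2) at alpha = 0.05.
critical = -2.132          # t-table value for df = 4 (assumption: n = 5 pairs)
decision = "Reject H0" if t <= critical else "Fail to reject H0"
print(round(t, 3), df, decision)
```

Working on the differences is what makes the test "dependent": each subject serves as their own control, so only the spread of the differences enters the standard error.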
The data obtained are presented:
[Table: reaction times for the dominant and non-dominant hand]
Result: [Figure: Excel output]

Independent Sample t-Test
The independent sample t-test allows researchers to compare the means of two populations using data from two separate samples. It is used to test whether population means are significantly different from each other, using the means from randomly drawn samples.
H0: μ1 ≥ μ2 and Ha: μ1 < μ2
H0: μ1 ≤ μ2 and Ha: μ1 > μ2
H0: μ1 = μ2 and Ha: μ1 ≠ μ2

Assumptions
1. Your dependent variable should be measured on a continuous scale (i.e., it is measured at the interval or ratio level).
2. Your independent variable should consist of two categorical, independent groups.
3. You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.
4. There should be no significant outliers.
5. Your dependent variable should be approximately normally distributed for each group of the independent variable.
6. There needs to be homogeneity of variances.

Example: Researchers wanted to know whether there was a difference in comprehension among students learning a computer program based on the style of the text. They randomly divided 18 students into two groups of 9 each. The researchers verified that the 18 students were similar in terms of educational level, age, and so on. Group 1 individuals learned the software using a visual manual (multimodal instruction), while Group 2 individuals learned the software using a textual manual (unimodal instruction). The following data represent the scores the students received on an exam given to them after they studied from the manuals.
[Table: exam scores for the visual and textual groups]

1. State the Null and Alternative Hypothesis
Null hypothesis: H0: μ1 = μ2
There is no significant difference between the scores of the students learning the computer program using textual and visual style.
Alternative hypothesis: Ha: μ1 ≠ μ2
There is a significant difference between the scores of the students learning the computer program using textual and visual style.

2. Set the Level of Significance or Alpha Level (α)
α = 0.05

3. Determine the Test Distribution to Use
Dependent Variable: Scores
Independent Variable: Style of the Text (Visual and Textual)
Since we are comparing the means of two independent groups, we will use the independent sample t-test.
In Excel, click "Data", then click "Data Analysis". Determine whether the variances are equal or not equal.
Using p-value approach: If p-value ≤ α, reject H0; otherwise, fail to reject H0.
H0: Equal variances assumed
Ha: Equal variances not assumed
Decision: Fail to reject H0
Since we failed to reject H0, we will proceed to "t-Test: Two-Sample Assuming Equal Variances".

4. Calculate Test Statistic or p-value
In Excel, click "Data", then click "Data Analysis".
Result: [Figure: Excel output]

5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject H0; otherwise, fail to reject H0.
Decision: Fail to reject H0

6. Draw Conclusion
There is not enough evidence to support the claim that there is a difference in comprehension among students learning a computer program based on the style of the text.

Proper Presentation of Results
[Figure: proper presentation of results]

Exercises: Apply the procedure in testing the hypothesis.
Twenty participants were given a list of 20 words to process. The 20 participants were randomly assigned to one of two treatment conditions. Half were instructed to count the number of vowels in each word (shallow processing). Half were instructed to judge whether the object described by each word would be useful if one were stranded on a desert island (deep processing). After a brief distractor task, all subjects were given a surprise free recall task. Did the instruction affect the level of recall? The number of words correctly recalled was recorded for each subject. Here are the data:
[Table: number of words recalled under shallow and deep processing]
Since the result of the F-test concludes that the variances of the two groups are equal, we will apply "Assuming Equal Variances".
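The two calculations used in this section, the variance ratio behind the F-test and the equal-variance t statistic, can be sketched in Python as follows. This is a hedged illustration, not a reproduction of the Excel output: the function names and the two small samples are invented, and 2.306 is the t-table critical value for a two-tailed test at α = 0.05 with df = 8.

```python
import math


def sample_var(x):
    """Sample variance with n - 1 in the denominator."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / (len(x) - 1)


def pooled_t(x1, x2):
    """Return (t, df) for the two-sample t-test assuming equal variances."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * sample_var(x1) + (n2 - 1) * sample_var(x2)) / df
    t = (sum(x1) / n1 - sum(x2) / n2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


# Made-up exam scores for two independent groups (not the module's data).
visual = [70, 75, 80, 85, 90]
textual = [65, 70, 75, 80, 85]

f_ratio = sample_var(visual) / sample_var(textual)   # statistic of the F-test for variances
t, df = pooled_t(visual, textual)

# Traditional approach, two-tailed test at alpha = 0.05.
critical = 2.306           # t-table value for df = 8 (assumption: 5 scores per group)
decision = "Reject H0" if abs(t) >= critical else "Fail to reject H0"
print(round(f_ratio, 3), round(t, 3), df, decision)
```

A variance ratio close to 1 is what one expects when the equal-variance assumption is reasonable; the formal decision still requires the p-value or critical value of the F-test.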
Result: [Figure: Excel output]

One-Way Analysis of Variance (ANOVA)
One-way analysis of variance (ANOVA) is a method of testing the equality of three or more population means by analyzing sample variances.
H0: μ1 = μ2 = ... = μk
Ha: At least one of the population means is different from the others.

Assumptions
1. Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
2. Your independent variable should consist of two or more categorical, independent groups.
3. You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.
4. There should be no significant outliers.
5. Your dependent variable should be approximately normally distributed for each category of the independent variable.
6. There needs to be homogeneity of variances.

Example: Researchers wanted to compare math test scores of students at the end of secondary school from various cities. Eight randomly selected students from each of Makati, Manila, and Quezon City were administered the same exam; the results are presented in the following table. Can the researchers conclude that the distribution of exam scores is different for each city at the level of significance?
[Table: exam scores for Makati, Manila, and Quezon City]

1. State the Null and Alternative Hypothesis
Null hypothesis: There is no significant difference between the mathematics scores of students in the various cities.
Alternative hypothesis: There is a significant difference between the mathematics scores of students in the various cities.

2. Set the Level of Significance or Alpha Level (α)
α = 0.10

3. Determine the Test Distribution to Use
Dependent Variable: Mathematics Scores
Independent Variable: Cities (Makati, Manila, Quezon City)
Since we are comparing the means across one independent variable that consists of two or more categorical groups, we will use the one-way ANOVA.
In Excel, click "Data", then click "Data Analysis". Determine whether the variances are equal or not equal.
Using p-value approach: If p-value ≤ α, reject H0; otherwise, fail to reject H0.
H0: Equal variances assumed
Ha: Equal variances not assumed
The check is shown three times, and each time the decision is: Fail to reject H0, so equal variances are assumed.
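Once equal variances are assumed, the F statistic that a one-way ANOVA reports can be reproduced by hand from the between-group and within-group sums of squares. The following is a minimal sketch under stated assumptions: the function name `one_way_anova` and the three small groups of scores are invented, and 3.46 is the F-table critical value for α = 0.10 with 2 and 6 degrees of freedom.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Made-up scores for three cities (not the module's data).
scores = [
    [81, 82, 83],
    [82, 83, 84],
    [85, 86, 87],
]
f, df_between, df_within = one_way_anova(scores)

# Traditional approach at alpha = 0.10.
critical = 3.46            # F-table value for (2, 6) degrees of freedom
decision = "Reject H0" if f >= critical else "Fail to reject H0"
print(round(f, 3), df_between, df_within, decision)
```

A large F means the group means differ by more than the within-group scatter would explain, which is the sense in which ANOVA tests means "by analyzing sample variances".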
4. Calculate Test Statistic or p-value
In Excel, click "Data", then click "Data Analysis".
Result: [Figure: Excel output]

5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject H0; otherwise, fail to reject H0.
Decision: Reject H0

6. Draw Conclusion
There is enough evidence to support the claim that the distribution of exam scores of students in mathematics is different for each city.

Proper Presentation of Results
[Figure: proper presentation of results]

Exercises: Apply the procedure in testing the hypothesis.
A teacher is concerned about the level of knowledge possessed by PUP students regarding Philippine history. Students completed a high school senior level standardized history exam. The academic major of each student was also recorded. Data in terms of percent correct are recorded below for 24 students. Is there a significant difference between the levels of knowledge possessed by PUP students regarding Philippine history when grouped according to their academic major?
[Table: percent correct by academic major]
Result: [Figure: Excel output]

Pearson Product Moment Correlation
The Pearson product moment correlation coefficient (Pearson r) is a measure of the strength of a linear association between two variables and is denoted by r.
H0: There is no significant relationship between the two continuous variables.
Ha: There is a significant relationship between the two continuous variables.

Features of r
• Unit free
• Ranges between -1 and 1
• The closer to -1, the stronger the negative linear relationship.
• The closer to 1, the stronger the positive linear relationship.
• The closer to 0, the weaker the linear relationship.

If r is positive, the correlation is direct. If r is negative, the correlation is inverse.

[Figure: sample scatterplots of Y against X for r = -1, r = -0.6, r = 0, r = 0.6, and r = 1]

Reminders:
• Correlation does not imply causation.
• Watch out for hidden (lurking) variables.

Lurking Variable
• A variable that is not included as an explanatory or response variable in the analysis but can affect the interpretation of relationships between variables.
• It can falsely suggest a strong relationship between variables, or it can hide the true relationship.

Assumptions
1. Your two variables should be measured at the interval or ratio level (i.e., they are continuous).
2. There is a linear relationship between your two variables.
3. There should be no significant outliers.
4. Your variables should be approximately normally distributed.

Significance Testing of Pearson r
Test Statistic:
$t = r\sqrt{\dfrac{df}{1 - r^2}}$
where:
df = degrees of freedom
r = Pearson correlation coefficient
Note: df = n − 2

Example: A dietetics student wanted to look at the relationship between calcium intake and knowledge about calcium in sports science students. The table shows the data she collected. Is there a relationship between calcium intake and knowledge about calcium in sports science students?
[Table: calcium intake and knowledge about calcium]

1. State the Null and Alternative Hypothesis
Null hypothesis: There is no significant relationship between calcium intake and knowledge about calcium in sports science students.
Alternative hypothesis: There is a significant relationship between calcium intake and knowledge about calcium in sports science students.

2. Set the Level of Significance or Alpha Level (α)
α = 0.05

3. Determine the Test Distribution to Use
Dependent Variable: Calcium Intake
Independent Variable: Knowledge about Calcium
Since we are testing the significant relationship of two variables, we will use Pearson r.

4. Calculate Test Statistic or p-value
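As a rough sketch of this step, r and its test statistic can be computed directly from the definitions given under Significance Testing of Pearson r. The function names and the five data pairs below are assumptions for illustration only, and 3.182 is the t-table critical value for a two-tailed test at α = 0.05 with df = 3.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """Return (t, df) with t = r * sqrt(df / (1 - r^2)) and df = n - 2.

    Undefined when r is exactly -1 or 1 (division by zero).
    """
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2)), df


# Made-up paired observations (not the module's data).
knowledge = [1, 2, 3, 4, 5]
intake = [2, 1, 4, 3, 5]

r = pearson_r(knowledge, intake)
t, df = r_to_t(r, len(knowledge))

# Traditional approach, two-tailed test at alpha = 0.05.
critical = 3.182           # t-table value for df = 3 (assumption: n = 5)
decision = "Reject H0" if abs(t) >= critical else "Fail to reject H0"
print(round(r, 3), round(t, 3), df, decision)
```

In this made-up sample r is fairly high, yet the test does not reject H0, because with so few observations even a strong sample correlation is weak evidence.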
Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics t=r df 1 − r2 df = n − 2 Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Result Polytechnic University of the Philippines Polytechnic University of the Philippines College College of Science of Science Department of Mathematics and Statistics Department of Mathematics and Statistics 5. Make Statistical Decision Using p-value approach: If pvalue ≤ α , reject Ho, otherwise failed to reject Strong and Ho D i r e c t Correlation Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Reject Ho 6. Draw Conclusion There is sufficient evidence to conclude that there is significant relationship between the calcium intake and knowledge about calcium in sports science students. Proper Presentation of Results Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Exercises: Apply the procedure in testing the hypothesis. A group of twelve children participated in a psychological study designed to assess the relationship, if any, between age (years) and average total sleep time (minutes). To obtain a measure for average total sleep time, recordings were taken on each child on five consecutive nights and then averaged. The results obtained are shown in the table. Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Result Polytechnic University of the Philippines College of Science Department of Mathematics and Statistics Chi-Square Distribution Definition: The chi-square distribution is written as χ 2 distribution. The symbol χ is the Greek letter “chi”, pronounced as “ki”. 
Chi-Square: Test for Independence
✦ Used to discover if there is an association between two categorical variables.
✦ Used when you want to decide whether two variables are independent or dependent.
✦ A contingency table will be constructed.

Chi-Square: Test for Independence
H0: The two categorical variables are independent.
Ha: The two categorical variables are dependent.

Chi-Square: Test for Independence
The test statistic for a test of independence is given by

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where:
O is the observed frequency for a category
E is the expected frequency for a category

$$E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

Observed and Expected Frequencies
The frequencies obtained from the performance of an experiment are called the observed frequencies and are denoted by O. The expected frequencies, denoted by E, are the frequencies that we expect to obtain if the null hypothesis is true.

Example of Contingency Table: Observed Values

                     Low   Medium   High   Row Total
Some College          20       35     20          80
Bachelor's Degree     17       33     25          70
Master's Degree       11       18     21          50
Column Total          48       86     66         200

Assumptions
1. There are 2 variables, and both are measured as categories, usually at the nominal level.
2. The two variables should consist of two or more categorical, independent groups.
3. The data in the cells should be frequencies, or counts of cases, rather than percentages or some other transformation of the data.
4. For a 2 by 2 table, all expected frequencies > 5.
5. For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.

Example:
1. A doctor who knows that hypertension depends on smoking habits can tell his smoking patients what they should do.
2. If the traffic condition (light, moderate, heavy, standstill) is found to be dependent on vehicle plate numbers (odd, even), a traffic officer may decide to revise traffic law enforcement.
3. If the poverty status of households is found to be correlated with family size, the government ought to adopt a viable poverty management program.

Reminders:
The word contingency refers to dependence, but this is only a statistical dependence and cannot be used to establish a direct cause-and-effect link between the two variables in question.

Example:
Educators are always looking for novel ways in which to teach statistics to undergraduates as part of a non-statistics degree course (e.g., psychology). With current technology, it is possible to present how-to guides for statistical programs online instead of in a book. However, different people learn in different ways. An educator would like to know whether gender (male/female) is associated with the preferred type of learning medium (online vs. books). Use “Data_Example and Exercises file”.

1. State the Null and Alternative Hypothesis
Null hypothesis: Gender is independent of the preferred type of learning medium.
Alternative hypothesis: Gender is dependent on the preferred type of learning medium.

2. Set the Level of Significance or Alpha Level (α)
α = 0.05

3. Determine the Test Distribution to Use
Two categorical variables:
Gender (male and female)
Preferred type of learning medium (online vs. books)
Since we are testing for a significant relationship between two categorical variables, we will use the chi-square test.

4. Calculate Test Statistic or p-value
Click “Insert”, then click “Pivot Table” to build the contingency table. Use the row totals, column totals, and grand total of the pivot table to compute each expected frequency:

$$E = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

5. Make Statistical Decision
Using the p-value approach: if p-value ≤ α, reject H0; otherwise, fail to reject H0.
Reject H0.

6. Draw Conclusion
There is sufficient evidence to conclude that gender is associated with the preferred type of learning medium.
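The six steps above can also be traced in code. The following Python sketch computes the expected frequencies, the χ² statistic, and the degrees of freedom, then applies the p-value decision rule. It rests on several assumptions: the 2 by 2 counts are hypothetical (the contents of “Data_Example and Exercises file” are not reproduced here), the function names are invented for illustration, and the p-value shortcut `math.erfc(sqrt(χ²/2))` is valid only when df = 1, as in a 2 by 2 table. Larger tables would need a chi-square table or a statistical package.

```python
import math


def expected_frequencies(observed):
    """E = (row total)(column total) / grand total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Return (chi2, df) with chi2 = sum((O - E)^2 / E), df = (rows-1)(cols-1)."""
    expected = expected_frequencies(observed)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


def p_value_df1(chi2):
    """Upper-tail p-value of the chi-square distribution, valid for df = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts (an assumption): rows = gender, columns = online, books.
observed = [[30, 20],
            [15, 35]]
alpha = 0.05

chi2, df = chi_square_statistic(observed)
p = p_value_df1(chi2)  # acceptable here because df == 1
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(f"chi2 = {chi2:.4f}, df = {df}, p = {p:.4f}: {decision}")
```

Before trusting the result, check the expected frequencies against the assumptions listed earlier (all greater than 5 for a 2 by 2 table).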
[Figure: Proper Presentation of Results]

Exercises: Apply the procedure in testing the hypothesis.
A survey was conducted at a community college of 102 randomly selected students who dropped a course in the current semester to learn why students drop courses. Personal drop reasons include financial, transportation, family issues, health issues, and lack of child care. Course drop reasons include reducing one's load, being unprepared for the course, the course not being what was expected, dissatisfaction with teaching, and not getting the desired grade. Work drop reasons include an increase in hours, a change in shift, and obtaining full-time employment. Test whether gender is independent of drop reason at the 1% level of significance. Use “Data_Example and Exercises file”.

[Figure: Result]

ACTIVITIES/ASSESSMENTS:
Determine whether the sampling is dependent or independent.

________1. A researcher wishes to compare academic aptitudes of married mathematicians and their spouses. She obtains a random sample of 287 such couples who take an academic aptitude test and determines each spouse's academic aptitude.

________2. A political scientist wants to know how a random sample of 18- to 25-year-olds feel about Democrats and Republicans in Congress. She obtains a random sample of 1030 registered voters 18 to 25 years of age and asks, "Do you have a favorable/unfavorable opinion of the Democratic/Republican party?" Each individual was asked to disclose his or her opinion about each party.

________3. An educator wants to determine whether a new curriculum significantly improves standardized test scores for third-grade students. She randomly divides 80 third-graders into two groups. Group 1 is taught using the new curriculum, while group 2 is taught using the traditional curriculum. At the end of the school year, both groups are given the standardized test and the mean scores are compared.

________4. A stock analyst wants to know if there is a difference between the mean rate of return from energy stocks and that from financial stocks. He randomly selects 13 energy stocks and computes the rate of return for the past year. He randomly selects 13 financial stocks and computes the rate of return for the past year.

________5. An urban economist believes that commute times to work in the South are less than commute times to work in the Midwest. He randomly selects 40 employed individuals in the South and 45 employed individuals in the Midwest and determines their commute times.

ACTIVITIES/ASSESSMENTS:
Solve the following problems. Make sure to follow the six-step procedure.

1. A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults with osteopenia (a low bone density which may lead to osteoporosis), and adults with osteoporosis. Adults 60 years of age with normal bone density, osteopenia, and osteoporosis are selected at random from hospital records and invited to participate in the study. Each participant's daily calcium intake is measured based on reported food intake and supplements. The data are shown below. Is there a statistically significant difference in mean calcium intake in patients with normal bone density as compared to patients with osteopenia and osteoporosis?

Normal Bone Density   Osteopenia   Osteoporosis
1200                  1000         890
1000                  1100         650
980                   700          1100
900                   800          900
750                   500          400
800                   700          350

2. Some studies have shown that in the United States, men spend more than women buying gifts and cards on Valentine’s Day. Suppose a researcher wants to test this hypothesis by randomly sampling nine men and 10 women with comparable demographic characteristics from various large cities across the United States to be in a study. Each study participant is asked to keep a log beginning one month before Valentine’s Day and record all purchases made for Valentine’s Day during that one-month period. The resulting data are shown below. Use these data and a 1% level of significance to determine whether, on average, men actually do spend significantly more than women on Valentine’s Day. Assume that such spending is normally distributed in the population and that the population variances are equal.

Men (in $)   Women (in $)
107.48       125.98
143.61       45.53
90.19        56.35
125.53       80.62
70.7         46.37
83           44.34
129.63       75.21
154.22       68.48
93.8         85.82
             126.11

3. A researcher is interested in whether a training course increases the teaching performance of the teachers who attended it. Test at the 10% level of significance. The data are shown below:

Case   Before   After
1      85       95
2      84       98
3      86       97
4      87       92
5      89       96
6      82       93
7      80       94
8      84       95
9      86       90
10     82       82
11     89       97
12     87       98
13     82       95
14     81       95
15     86       92
16     89       91
17     89       94
18     84       95
19     85       96
20     88       97

4. A pediatrician wants to determine the relation that may exist between a child’s height and head circumference. She randomly selects eleven 3-year-old children from her practice, measures their heights and head circumferences, and obtains the data shown in the table below.

Height (inches)   Head Circumference (inches)
27.75             17.5
24.5              17.1
25.5              17.1
26                17.3
25                16.9
27.75             17.6
26.5              17.3
27                17.5
26.75             17.3
26.75             17.5
27.5              17.5

5. The following data represent the smoking status from a random sample of 1054 U.S. residents 18 years or older by level of education. Test whether smoking status and level of education are independent at the α = 0.05 level of significance.

                            Smoking Status
No. of Years of Education   Current   Former   Never
Less than 12                178       88       208
12                          137       69       143
13 - 15                     44        25       44
16 or more                  34        33       51

References
https://wolfweb.unr.edu/homepage/ania/stat352f12lectures/352lecture21f12.pdf
Sullivan, Michael, III. Statistics: Informed Decisions Using Data. Fifth Edition.
http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-test/
