BHMC3004

Chapter 1 INTRODUCTION TO STATISTICS

1.1 Introduction
• Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
• Reasons for studying statistics:
  1. To be able to read and understand the various statistical studies performed in your field.
  2. You may be called on to conduct research in your field, since statistical procedures are basic to research.
  3. To use the knowledge gained to become a better consumer and citizen.
• Role of statistics in the social research process:
  1. Asking the research question
  2. Formulating the hypothesis
  3. Collecting data
  4. Analyzing data
  5. Evaluating the hypothesis
• The field of statistics is usually divided into two categories:
  A. Descriptive / Deductive Statistics
    o Description and analysis without drawing conclusions or inferences about a larger group.
    o Collecting, organizing, summarizing, and presenting the data by using tables, graphs, and summary measures.
  B. Inferential / Inductive Statistics
    o Making inferences / drawing conclusions about a population, based on information obtained from samples.
    o Performing estimations and hypothesis tests, determining relationships among variables, and making predictions.
• Example: On the last 3 Sundays, Henry sold 2, 1, and 0 new cars, respectively.
  o An example of descriptive statistics is: Henry averaged 1 new car sold for the last 3 Sundays.
  o An example of inferential statistics is: Henry never sells more than 2 cars on a Sunday.

Example 1
The last four semesters an instructor taught Introductory Statistics, the following numbers of students passed the course: 17, 19, 4, and 20. Determine which of the following statements are descriptive in nature and which are inferential.
i) The last four semesters the instructor taught Introductory Statistics, an average of 15 students passed the course.
ii) The next time the instructor teaches Introductory Statistics, we can expect approximately 15 students to pass the course.
iii) The instructor will never pass more than 20 students in an Introductory Statistics class.
iv) The last four semesters the instructor taught Introductory Statistics, no more than 20 students passed the course.
v) Only 4 students passed one semester because the instructor was in a bad mood the entire semester.
vi) The instructor passed so few students in his Introductory Statistics class because he does not like teaching that course.

1.2 Basic Terms
o Population
  ✓ A collection, or set, of individuals, objects, or events whose properties are to be analyzed.
  ✓ The set of all the items under consideration.
o Sample
  ✓ A subset of the population.
  ✓ Should possess the same or similar characteristics as the subjects in the population.
  ✓ Used to draw conclusions about the population.
o Variable
  ✓ A characteristic of interest about each individual element of a population or sample.
  ✓ Dependent variable: the variable that the researcher wants to explain (the “effect”); the object of the research.
  ✓ Independent variable: the variable that is expected to “cause” or account for the dependent variable.
  ✓ The independent variable usually occurs earlier in time than the dependent variable.
o Data
  ✓ The set of values collected for the variable from each of the elements belonging to the sample.
o Parameter
  ✓ A numerical value summarizing all the data of an entire population.
o Statistic
  ✓ A numerical value summarizing the sample data.

Example 2
A statistics student is interested in finding out the percentage of all households in Malaysia that have a single woman as the head of the household. To estimate the percentage, the student conducts a survey of 200 households and finds that 75 of them are headed by a single woman. Identify each of the following terms.
i) Population
ii) Sample
iii) Variable
iv) Data
v) Parameter
vi) Statistic

1.3 Types of Variables

Data
  Qualitative / Attribute
  Quantitative / Numerical
    Discrete
    Continuous

• Qualitative Variable / Attribute
  o Cannot assume a numerical value.
  o Falls into two or more non-numerical categories.
  o E.g., Hair colour, hometown, level of satisfaction.
• Quantitative Variable
  o Can be measured numerically.
  o E.g., Number of cars owned, time it takes to get to school.
  o Can be further divided into two types:
    1. Discrete
      ✓ Values are countable.
      ✓ Takes certain values with no intermediate values.
      ✓ E.g., Number of children in the family, number of houses.
    2. Continuous
      ✓ Can take any numerical value over a certain interval.
      ✓ Any variable that involves money is considered a continuous variable.
      ✓ E.g., Height of students, income.
• The table below provides examples of the various types of data.

Data type | Question type | Responses / Data
Qualitative | Do you own a car? | Yes / No
Qualitative | What type of car do you own? | Toyota / Honda / Perodua
Quantitative, Discrete | How many cars do you own? | 1 / 2 / 3 / … (integer)
Quantitative, Continuous | What is the price of your car? | … (figures)

1.4 Levels of Measurement
• Important in determining which statistical inference test should be used to analyze the data.
• 4 levels:
  1) Nominal
  2) Ordinal
  3) Interval
  4) Ratio
• All are mutually exclusive and exhaustive.
  * Mutually Exclusive: An individual, object, or measurement is included in only one category (non-overlapping).
  * Exhaustive: Each individual, object, or measurement must appear in one of the categories.

Nominal Variable
o A qualitative variable that can only be categorized and counted, with no particular order.
o Arithmetic operations are not meaningful.
o The lowest / most primitive level of measurement; the least informative.
o E.g., Hair colour, religion, and hometown.

Ordinal Variable
o A qualitative variable that incorporates an ordered position or ranking.
o Precise differences between data values cannot be determined or are meaningless.
o A higher level than the nominal variable.
o E.g., Level of satisfaction (“very satisfied”, “satisfied”, “not satisfied”) and grade (A, B, C, F).

Interval Variable
o The next highest level of measurement.
o Meaningful differences between data values can be determined.
o No natural zero point.
o E.g., Temperature on the Celsius scale.
  ✓ 0°C is just a point on the scale and does not represent the absence of the condition (no heat).
  ✓ It is incorrect to say that 60°C is twice as hot as 30°C, just that it is 30°C warmer.

Ratio Variable
o The highest level; gives the most information.
o The interval level with an inherent zero starting point, i.e., the 0 point is meaningful, which means the zero point is the absence of the characteristic.
o Ratios and differences between two numbers are meaningful.
o E.g., Monthly income, age.

Levels of Data
Nominal | Data may only be classified
Ordinal | Data are ranked
Interval | Meaningful difference between values
Ratio | Meaningful 0 point and ratio between values

Example 3
Identify the level of measurement for the following data.
i) Number of persons in a family.
ii) Colour of cars.
iii) Marital status of people.
iv) Length of a frog’s jump.
v) Reading group of a student (low, medium, or high).
vi) The most frequent use of your microwave oven (reheating, defrosting, warming, other).
vii) Number of consumers who refuse to answer a telephone survey.
viii) The door chosen by a mouse in an experiment (A, B, or C).

1.5 Sources of Data
• The availability of accurate and appropriate data is essential for deriving reliable results. Data may be obtained from internal sources, external sources, or surveys and experiments.

A) Internal Data
o Data taken from the records of the organization itself, such as a company’s own personnel files or accounting records.
o For example, if a company wishes to forecast the future sales of its product, it may use the data of past periods from its own records.
o Accurate and reliable, since these records are kept by the organization itself.

B) External Data
o Data taken from sources outside the organization, often collected for another purpose.
o A large number of government and private publications can be used as external sources of data.
o For instance, the Statistical Abstract of the United States, Employment and Earnings, the Handbook of Labour Statistics, and census data.

C) Primary Data
o The data are published or released by the same organization that collected them (collected for the first time and specifically for the present purpose of one particular statistical inquiry, e.g., surveys and experiments).
o Can be time-consuming and costly to collect. However, it can be more accurate, more detailed, and more complete.
o For example, if we want to study the relationship between family background and the course selected by students, we could collect all the relevant information by means of a questionnaire. We then have to process the data and present them in the most convenient form for our study.

D) Secondary Data
o The data are published by an organization other than the one that collected them, or were collected for other purposes, e.g., data obtained from internal or external sources.
o Secondary data are convenient and cheaper to collect. However, they may be inadequate for the purpose of the inquiry.

• Collecting primary data is much more complicated and time-consuming than collecting secondary data. The method of collection has to be decided upon, questionnaires have to be designed for the collection of data, and the researcher has to decide whether to conduct a census or a sample survey.
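As a recap of the Chapter 1 vocabulary, the minimal sketch below computes the descriptive summary from the Henry example in Section 1.1 and the sample statistic from Example 2 in Section 1.2. The variable names are illustrative assumptions, not part of the notes. Both results only describe the data in hand; any claim about all Sundays or all households in Malaysia would be inferential.

```python
from statistics import mean

# Section 1.1 example: new cars Henry sold on the last 3 Sundays.
henry_sales = [2, 1, 0]
henry_average = mean(henry_sales)  # descriptive summary of the observed data

# Example 2: 200 households surveyed, 75 headed by a single woman.
sample_size = 200
headed_by_single_woman = 75

# A statistic: it summarizes the sample, not the whole population.
sample_percentage = headed_by_single_woman / sample_size * 100

print(henry_average)      # 1
print(sample_percentage)  # 37.5
```

The corresponding parameter in Example 2 (the percentage for all households) stays unknown; the statistic is only an estimate of it.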
Chapter 2 DATA COLLECTION

2.1 Data Collection Process

2.1.1 Personal Interview
• Advantages:
  i) The purpose and meaning of each question can be explained, so the answers given are more valid.
  ii) High response rate (80 - 90%).
• Disadvantages:
  i) Interviewer bias, either conscious or unconscious.
  ii) More expensive (interviewers must be recruited, trained, and paid).
  iii) People may not like to give confidential or embarrassing information.

2.1.2 Postal Questionnaire
• Advantages:
  i) Cheaper.
  ii) Wider area coverage.
  iii) Can ask about many things, including personal habits.
  iv) No interruption; the respondent can answer the questions at a convenient time.
• Disadvantages:
  i) Poor response rate (about 20%), so it is hard to get a good sample.
  ii) For questions that are not clear, the answers given may not be accurate or relevant.
  iii) Needs a mailing list.

2.1.3 Direct Observation
• Advantage:
  i) The most accurate and precise of the methods of collecting data.
• Disadvantages:
  i) Expensive and time-consuming.
  ii) Not applicable or uneconomical in many situations.

2.1.4 Telephone Enquiries
• Advantages:
  i) Cheaper.
  ii) Wider area coverage.
  iii) All sessions can be controlled and monitored properly.
  iv) The questionnaire can be computerized, and questions can be changed based on the respondent’s answers.
• Disadvantages:
  i) Poor response rate (respondents may hang up on the interviewer).
  ii) Wasted time (respondents not at home).
  iii) Limited interview time.

2.1.5 Online Survey
• Advantages:
  i) Faster, and a large volume of data can be collected.
  ii) Saves cost and allows flexible design.
  iii) Anonymity.
  iv) Respondent acceptability.
• Disadvantages:
  i) Sample bias.
  ii) Length, response, and dropout rates.
  iii) Technical problems.

2.1.6 Focus Group
• Advantages:
  i) Requires fewer resources and less time.
  ii) Can request clarification of unclear responses.
  iii) Can view both sides of the coin and build a balanced perspective on the matter.
• Disadvantages:
  i) The sample selected may not represent the population accurately.
  ii) Dominant participants can influence the responses of others.

2.2 Designing a Questionnaire
• Prepared either to be used as a postal questionnaire or as a basis for personal and telephone interviews.
• Consists of two sections:
  (a) Classification section
    o Personal details of the respondents such as gender, age, marital status, occupation, etc.
  (b) Questioning section
    o Related to the subject matter of the inquiry.
    o The characteristics of the questions:
      ✓ Simple
      ✓ Not ambiguous
      ✓ Short
      ✓ Capable of a precise answer
      ✓ Not too personal
      ✓ Avoid questions that lead to a particular answer
      ✓ Questions are in a logical sequence
      ✓ Questionnaire should be as short as possible
      ✓ Cover the exact object of the inquiry

2.3 Sample and Census Data
• Sample survey
  o A technique of collecting information from a portion of the population.
  o The results of the sample survey are usually used to make inferences about the larger population.
  o Produces sample data.
• Census
  o A survey that includes every member of the population.
  o Many countries carry out a census of their population every ten years to update the information on the residents.
  o Produces census data.
• Pilot study
  o A study that is done before the actual fieldwork is carried out.
  o The purpose:
    ✓ to identify possible problems and difficulties
    ✓ to test out and improve questionnaires
• A sample survey can reduce cost and time, and the results may be as accurate as those of a census if the sample is selected using a proper sampling technique.

2.4 Sampling Techniques
• Sampling
  o The process of selecting a representative subset (by a random process) from the population.
• Sampling Techniques
  o Scientific methods of selecting samples from populations.
• Sampling Frame
  o A list of all elements in the population from which the sample will be drawn.
  o Should be complete, up to date, and adequate for the purpose.
• Reasons for Sampling:
  1) The destructive nature of certain tests.
  2) The physical impossibility of checking all items in the population.
  3) Studying the whole population is often prohibitively costly and time-consuming.
  4) The adequacy of sample results.
• A useful sample (one from which conclusions can be drawn about the population) is a sample that is
  1) of proper size (larger is more reliable);
  2) randomly chosen (to avoid bias).
• Two types of sampling methods:
  A) Probability Sampling
    o Each item or person in the population being studied has a known (nonzero) likelihood of being included in the sample.
    o Simple random sampling, systematic random sampling, stratified random sampling, cluster sampling, multi-stage random sampling.
  B) Non-probability Sampling
    o Not all items or persons have a chance of being included in the sample.
    o The sample is based on the judgment of the person selecting the sample.
    o Convenience sampling, judgment sampling, quota sampling, snowball sampling.

2.4.1 Methods of Probability Sampling

2.4.1.1 Simple Random Sampling
• Each item or person has the same chance of being chosen.
• Can be obtained by
  a) thorough mixing and simply picking;
  b) using the output of some mechanical process, such as a revolving drum in the drawing of lottery tickets;
  c) using a random number table.
• The use of a random number table:
  a) Number all items in the sampling frame (population) in sequential order.
  b) Select a starting point randomly in the random number table.
  c) After that, continue selecting random numbers in a consistent manner, that is, row by row or column by column. Select groups of random numbers with the same number of digits as the total population size.
  d) Select the items that have the same digits as the random numbers chosen in step (c).

Example 1
Select a sample of 10 students out of 300 students for a seminar.
1) Number the students from 001 to 300.
2) Refer to the random number table, start from ______________________ (starting point).
3) Refer the first three digits, the random numbers are _______________________________________________________________________________________ ________________________________________________________________________________________________. 4) Thus, students numbered___________________________________________________________________ ________________________________________________________________________________________________ are being chosen. Do not select the same number more than once 4 BHMC3004 Chapter 2 Table of Random Numbers Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 1–5 13962 43905 00504 61274 43753 83503 36807 19110 82615 05621 06936 84981 66354 49602 78430 33331 62843 19528 16737 99389 36160 05505 85962 28763 42222 43626 97761 49275 15797 04497 95468 01420 74633 46662 10853 68583 75818 16395 53892 66009 45292 34033 13364 03343 46145 37703 12622 56043 43401 18053 6 – 10 70992 46941 48658 57238 21159 51662 71420 55680 86984 26584 37293 60458 88441 94109 72391 51803 84445 15445 01887 06685 38196 45420 19758 04900 40446 40039 43444 44270 75134 24853 87411 74218 40171 99688 10393 01032 78982 16837 15105 26869 93427 45008 09937 62593 24476 51658 98083 00251 35924 53460 11 – 15 65172 72300 38051 47267 16239 21636 35804 18792 93290 36493 55875 16194 96191 36460 96973 15934 56652 77764 50934 45945 77705 44016 92795 54460 82240 51492 95895 52512 39856 43879 30647 71047 97092 59576 03013 67938 24258 00538 40963 91829 92326 41621 00535 93332 62507 17420 17689 70085 28308 32125 16 – 20 28053 11641 59408 35303 50595 68192 44862 41487 87971 63013 71213 92403 04794 62353 70437 75807 91797 33446 43306 62000 28891 79662 00458 22083 79159 36488 24102 03951 73527 07613 88711 14401 79137 04887 90372 29733 93051 57133 69267 65078 70206 79437 88122 09921 19530 30593 59677 28067 55140 81357 Column 21 – 25 26 – 30 02190 83634 43548 30455 16508 82979 29066 02140 62509 61207 
84294 38754 23577 79551 16614 83053 60022 35415 68181 57702 83025 46063 80951 80068 14714 64749 00721 66980 97803 78683 46561 80188 45284 25842 41204 70067 75190 86997 76228 60645 12106 56281 92069 27628 71289 05884 89279 43492 44168 38213 70280 24218 07006 71923 21651 53867 78417 36208 26400 17180 01765 57688 74537 14820 30698 97915 02310 35508 89639 65800 71176 35699 02081 83890 89398 78205 85534 00533 89616 49016 15847 14302 98745 84455 47278 90758 25306 57483 41257 97919 39637 64220 56603 93316 78135 53000 07515 53854 26935 67234 31 – 35 66012 07686 92002 60867 86816 84755 42003 00812 20852 49510 74665 47076 43097 82554 04670 78984 96246 33354 56561 87750 86222 50002 37963 00066 46839 14596 04800 73531 59510 18880 60665 45248 36305 69481 88532 10551 66944 72122 27130 14200 60043 66769 23542 98115 02290 45486 79858 18138 23023 78460 36 – 40 70305 31840 63606 39847 29902 34053 58684 16749 02909 75304 12178 23310 83976 90270 70667 29317 73504 70680 79018 46329 66116 32540 23322 40857 26598 04744 32062 70073 76913 66083 57636 78007 42613 30300 71789 15091 99856 99655 90420 97469 30530 94729 35273 33460 40357 03698 52548 40564 70268 47833 41 – 45 66761 03261 41078 50968 23395 94582 09271 45347 99476 38724 10741 74899 83281 12312 58912 27971 21631 66664 34273 46544 39626 19848 73243 86568 29983 89336 41425 45542 22499 02196 36070 65911 87251 94047 59964 52947 87950 25294 72584 88307 57149 17975 67912 55304 38408 80220 67367 77086 80435 20496 46 – 50 88344 89139 86326 96719 72640 29215 68396 88199 45568 15712 58362 87929 72038 56299 21883 16440 81223 75486 25196 95665 06080 27319 98185 49336 67645 35630 66862 22831 68467 10638 37285 38583 75608 57096 50681 20134 13952 20941 84576 92282 08642 50963 97670 43572 50031 12139 72416 49557 24269 35645 5 BHMC3004 Chapter 2 2.4.1.2 Systematic Random Sampling • The items or individuals of the population are arranged in some order. 
• A random starting point is selected and then every k-th member of the population is selected for the sample.
• Sampling interval: $k = \dfrac{\text{population size } (N)}{\text{sample size } (n)}$
• Can be biased if the population has a repetitive or systematic pattern.

Example 2
Consider a population of 200; select a sample that is 10% of the population.
Sample size = 10% of population = ______
Then the sampling interval is ______
Select a starting point randomly, say ______; the items selected for the sample would be ______________________________________________________.

2.4.1.3 Stratified Random Sampling
• A population is first divided into strata, according to its various prominent characteristics such as sex, age, and household income.
• Elements in each subgroup or stratum are homogeneous.
• A sub-sample is drawn by taking a simple random sample within each stratum.
• Advantage: More accurate in reflecting the characteristics of the population.
• The sample can be divided among the strata proportionately or non-proportionately.

 | Example 1 | Example 2 | Example 3
Population | All people in the United States | All intercollegiate athletes | All primary students in the local school district
Strata | 4 time zones in the United States (Eastern, Central, Mountain, Pacific) | 26 intercollegiate teams | 11 different primary schools in the local school district
Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 teams | 20 students from each of the 11 primary schools
Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students

Example 3
Refer to Example 1. The 300 students are classified according to their year of study as shown in the following table. Draw a proportionate stratified sample of 10 students.
Year | Number of students | Label | n
1 | 120 | 001 – 120 |
2 | 90 | 121 – 210 |
3 | 90 | 211 – 300 |

Using the same starting point in the random number table as in Example 1, the random numbers (within the range) are

Year | Label | n | Sample
1 | 001 – 120 | |
2 | 121 – 210 | |
3 | 211 – 300 | |

2.4.1.4 Cluster Sampling
• The population is first divided into small subdivisions, called primary units or clusters.
• Clusters or primary units should be as heterogeneous as the population itself.
• Clusters are then chosen randomly. All the items in the chosen clusters are included in the sample.
• A simple and less costly procedure.
• The area sample is the most popular type of cluster sample.

 | Example 1 | Example 2 | Example 3
Population | All people in the United States | All intercollegiate athletes | All primary students in the local school district
Clusters | 4 time zones in the United States (Eastern, Central, Mountain, Pacific) | 26 intercollegiate teams | 11 different primary schools in the local school district
Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 primary schools from the 11 possible primary schools
Sample | Every person in the 2 selected time zones | Every athlete on the 8 selected teams | Every student in the 4 selected primary schools

2.4.1.5 Multi-stage Sampling
• The area of the survey is divided into a number of areas, and three or four areas are selected by random means.
• Each area selected is again sub-divided and another sample of smaller areas is selected at random.
• The process continues until ultimately a number of quite small areas have been selected.
• A random sample of the relevant people within each of these areas is then interviewed.
• It reduces the area of the survey and thus brings the cost down to a reasonable level.

2.4.2 Methods of Nonprobability Sampling

2.4.2.1 Convenience Sampling
• Used for pre-testing questionnaires, gathering ideas and insights, or forming hypotheses.
• The selection is left primarily to the interviewers.
• Often, respondents are selected because they happen to be in the right place at the right time.

2.4.2.2 Judgment Sampling
• The researcher selects respondents who, based on his or her experience, possess certain characteristics that represent the population of interest.

2.4.2.3 Quota Sampling
• As in stratified random sampling, one has to take note of the various characteristics of the population, for example, the divisions by gender, age, and job type.
• The sample size is then divided into sub-sample sizes (quotas) to include similar proportions of people with these characteristics.
• Each interviewer is then given a quota of people with these characteristics to contact. The final selection of the individuals is left up to the interviewers (similar to convenience sampling).

2.4.2.4 Snowball Sampling
• An initial group of respondents is selected, usually at random.
• After being interviewed, these respondents are asked to identify others who belong to the target population of interest.
• This procedure is applied until the researcher obtains the required number of respondents.

Summary: Strengths and Weaknesses of Basic Sampling Techniques

Probability Sampling Techniques | Strengths | Weaknesses
Simple Random Sampling | Easily applied. Results can be projected onto the population. | Difficult to obtain a sampling frame, expensive, sometimes no assurance of representativeness.
Systematic Sampling | Easier to implement than simple random sampling. | Can decrease representativeness if certain patterns exist in the sampling frame.
Stratified Sampling | Includes all important subpopulations; precision is improved. | Difficult to select relevant stratification variables, not feasible to stratify on many variables, expensive.
Cluster Sampling | Easy to implement, cost-effective, and work is reduced. | Imprecise, difficult to compute and to interpret results.
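The probability sampling techniques summarised above can be sketched in a few lines of code. The block below is a minimal illustration, not a prescribed procedure: it uses Python's `random` module in place of a random number table, and the function names, the fixed seed, and the use of years of study as clusters are all assumptions made for the example. The population of 300 numbered students and the strata sizes (120, 90, 90) come from Examples 1 and 3; the population of 200 with a 10% sample comes from Example 2 of Section 2.4.1.2.

```python
import random

rng = random.Random(42)  # fixed seed (an assumption) so the sketch is reproducible


def simple_random_sample(frame, n):
    """Each item has the same chance of selection; no item is chosen twice."""
    return rng.sample(frame, n)


def systematic_sample(frame, n):
    """Random start within the first interval, then every k-th item (k = N // n)."""
    k = len(frame) // n
    start = rng.randrange(k)
    return frame[start::k][:n]


def proportionate_allocation(stratum_sizes, n):
    """Share n among strata in proportion to stratum size.

    Rounding can make the shares miss n for awkward sizes; here they are exact.
    """
    total = sum(stratum_sizes.values())
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}


def stratified_sample(strata, n):
    """Simple random sample inside every stratum, sized proportionately."""
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportionate_allocation(sizes, n)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters):
    """Choose whole clusters at random and keep every item in them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


students = list(range(1, 301))  # students numbered 001 to 300
years = {
    "Year 1": students[:120],     # labels 001 - 120
    "Year 2": students[120:210],  # labels 121 - 210
    "Year 3": students[210:],     # labels 211 - 300
}

srs = simple_random_sample(students, 10)
systematic = systematic_sample(list(range(1, 201)), 20)  # N = 200, n = 20, k = 10
allocation = proportionate_allocation({y: len(m) for y, m in years.items()}, 10)
stratified = stratified_sample(years, 10)
clustered = cluster_sample(years, 1)  # years treated as clusters purely for illustration

print(allocation)  # {'Year 1': 4, 'Year 2': 3, 'Year 3': 3}
```

The selected student numbers depend on the seed, so they will not match a sample drawn from the printed random number table; only the allocation of 4, 3, and 3 students is fixed by the stratum sizes.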
Non-probability Sampling Techniques | Strengths | Weaknesses
Convenience Sampling | Less expensive, less time-consuming, most convenient. | Selection bias, sample not representative, not recommended for descriptive or causal research.
Judgment Sampling | Less expensive, less time-consuming, most convenient. | Does not allow generalisation, subjective.
Quota Sampling | Sample can be controlled for certain characteristics. | Selection bias, no assurance of representativeness.
Snowball Sampling | Can estimate rare characteristics. | Time-consuming.

Chapter 3 DATA PRESENTATION

3.1 Frequency Distribution
• A grouping of data into mutually exclusive categories showing the number of observations in each class.

3.1.1 Qualitative Data
• Lists all categories and the number of elements that belong to each of the categories.
• Nominal Variable

Gender | Frequency, f | Relative Frequency = $\frac{f}{\Sigma f}$ | Percentage, % = $\frac{f}{\Sigma f} \times 100$
Female | 15 | 15/40 = 0.375 | 0.375 × 100 = 37.5%
Male | 25 | 25/40 = 0.625 | 0.625 × 100 = 62.5%
Total, Σf | 40 | 1 | 100

• Ordinal Variable

Grade | f | Relative Frequency = $\frac{f}{\Sigma f}$ | Percentage, %
A | 8 | |
B | 15 | |
C | 10 | |
F | 7 | |
Total, Σf | 40 | 1 | 100

• Joint Frequency Distribution

Gender | Grade A | Grade B | Grade C | Grade F | Total
Female | 3 | 6 | 4 | 2 | 15
Male | 5 | 9 | 6 | 5 | 25
Total | 8 | 15 | 10 | 7 | 40

  o The table is referred to as a bivariate table or contingency table, reporting the overlap between two variables.

3.1.2 Quantitative Data
• Lists all the classes and the number of values that belong to each class.
• Discrete Variable - Ungrouped

Number of Children | f | Relative Frequency | Percentage
0 | | |
1 | | |
2 | | |
3 | | |
4 | | |
Total | | |

  o Number of classes, c = 5
• Discrete Variable - Grouped

Class Limit | Class Boundary | f | Class Midpoint
1 – 5 | 0.5 –< 5.5 | |
6 – 10 | 5.5 –< 10.5 | |
11 – 15 | 10.5 –< 15.5 | |
16 – 20 | 15.5 –< 20.5 | |
Total | | |

  o The class limit 1 – 5 means 1 to 5, i.e. 1, 2, 3, 4 and 5.
• Continuous Variable

Class Limit = Class Boundary | f | Class Midpoint
0 –< 5 | |
5 –< 10 | |
10 –< 15 | |
15 –< 20 | |
20 and above (open-ended class; assume 20 –< 25) | |
Total | |

  o The class 0 –< 5 means 0 to less than 5.
  o Current upper class limit = subsequent lower class limit.
  o Number of classes, c = 5
  o For an open-ended class, in further calculations, assume it to be of the same size as the immediate neighbouring class.
• Steps in Constructing a Grouped Frequency Distribution from a Set of Raw Data or Ungrouped Data
  o Step 1: Decide the number of classes, c, such that $2^c > n$, where n = number of observations.
  o Step 2: Determine the class width, i (same for all classes): $i > \dfrac{\text{Highest value} - \text{Lowest value}}{c}$
  o Step 3: Set the class limits and class boundaries, if necessary.
  o Step 4: Tally the observations into the classes.
  o Disadvantage: The information on individual observations is lost.

Example 1
A random sample of 30 students were asked to give the number of hours (to the nearest hour) they spent per week studying outside of class. Their eye colour and the number of pets they owned were also recorded. The results are given as follows.

Student | Eye Colour | Number of Pets | Number of Hours Studying
1 | Blue | 1 | 10
2 | Brown | 0 | 7
3 | Brown | 3 | 15
4 | Green | 1 | 20
5 | Blue | 2 | 6
6 | Green | 1 | 25
7 | Hazel | 0 | 22
8 | Brown | 3 | 13
9 | Blue | 4 | 12
10 | Hazel | 3 | 21
11 | Blue | 1 | 16
12 | Green | 1 | 22
13 | Brown | 1 | 25
14 | Grey | 2 | 20
15 | Brown | 0 | 29
16 | Green | 4 | 25
17 | Green | 0 | 27
18 | Hazel | 1 | 15
19 | Grey | 2 | 14
20 | Brown | 2 | 17
21 | Grey | 1 | 8
22 | Blue | 0 | 18
23 | Blue | 3 | 24
24 | Brown | 0 | 28
25 | Hazel | 1 | 24
26 | Blue | 1 | 25
27 | Brown | 2 | 11
28 | Grey | 2 | 9
29 | Hazel | 4 | 10
30 | Brown | 2 | 17

Construct the frequency distributions for the data on eye colour, number of pets owned, and number of hours spent per week studying outside of class.
Eye Colour | f | Relative Frequency | Percentage
Blue | | |
Brown | | |
Grey | | |
Hazel | | |
Green | | |
Total | 30 | 1 | 100

No. of Pets | f | Relative Frequency | Percentage
0 | 6 | |
1 | 10 | |
2 | 7 | |
3 | 4 | |
4 | 3 | |
Total | 30 | |

Number of Hours Studying:
n =
lowest =
highest =
i >
Class limit:

Example 2
A random sample of 30 students was selected and the average number of hours each student studied in a week was determined.

15.0 23.7 19.7 15.4 18.3 23.0 14.2 20.8 13.5 20.7
17.4 18.6 12.9 20.3 13.7 21.4 18.3 29.8 17.1 18.9
10.3 26.1 15.7 14.0 17.8 33.8 23.2 12.9 27.1 16.6

Organize the data into a frequency distribution.
n =
lowest =
highest =
i >
Class limit:

Number of Hours | Frequency | Relative Frequency | Percentage

3.2 Cumulative Frequency Distribution
• For variables that are ordinal level and above.
• Gives the total number of values that fall below the upper boundary, or above the lower boundary, of each class.
• “Less than” cumulative frequency distribution
  o A table showing the total frequency of all values less than the upper class boundary of each class interval.
• “More than” cumulative frequency distribution
  o A table showing the total frequency of all values more than or equal to the lower class boundary of each class interval.
• $\text{Cumulative relative frequency} = \dfrac{\text{Cumulative frequency of a class}}{\text{Total frequency}}$
• $\text{Cumulative percentage} = (\text{Cumulative relative frequency}) \times 100$

Example 3
Construct a “Less than” and a “More than” cumulative frequency distribution for the data in Example 2.

Number of Hours | Cumulative Frequency
Less than |
Less than |
Less than |
Less than |
Less than |
Less than |

Number of Hours | Cumulative Frequency
More than or equal to |
More than or equal to |
More than or equal to |
More than or equal to |
More than or equal to |
More than or equal to |

3.3 Graphic Presentation for Qualitative Data
• Well suited for a non-technical audience such as executives and managers.
• Provides overview information rather than detail.
• Examples: Bar chart, Pie chart
• A good diagram or graph should have
  i. a title
  ii. a source
  iii. units of measurement
  iv. a key, if appropriate
  v. a scale that is appropriately chosen and stated
  vi. axes that are clearly stated

3.3.1 Bar Chart
• A graph made of bars of the same width; the heights or lengths of the bars represent the frequencies of the respective categories.
• Can be used to depict any level of measurement (nominal, ordinal, interval, or ratio).
• Can be constructed vertically or horizontally.
• Can show positive and negative values.
• 3 types:
  A) Simple Bar Chart
  B) Component Bar Chart
  C) Multiple Bar Chart
Note:
• Leave a small gap between adjacent bars (say 1/2 of the bar width).

3.3.1.1 Simple Bar Chart
• Used to represent a qualitative variable.
• E.g., Daily sales of ice cream

Day | Sales
Monday | 100
Tuesday | 140
Wednesday | 100
Thursday | 100
Friday | 170

  o Vertical Bar Chart
  o Horizontal Bar Chart

3.3.1.2 Component Bar Chart
• Useful to illustrate a breakdown in the figures.
• The constituent parts of each bar are always stacked in the same order, with the height of each part representing the individual values or frequencies.
• E.g., The total sales of ice cream could be broken down into sales by flavour.

Flavour | Monday | Tuesday | Wednesday | Thursday | Friday
Vanilla | 50 | 60 | 40 | 50 | 80
Chocolate | 50 | 80 | 60 | 50 | 90
Total | 100 | 140 | 100 | 100 | 170

3.3.1.3 Multiple Bar Chart
• Uses a separate bar to represent each constituent part of the total.
• These bars are joined into a set for each class of data.
• E.g., The multiple bar chart for the previous example.

Example 4
Construct a multiple bar chart and a component bar chart for the following data.

Month: Jan, Feb, Mar, Apr, May, Total
Product (in thousand): A, B, C
15 20 30 29 18 14 30 25 15 24 18 32 35 27 15

3.3.3 Pie Chart
• A circle divided proportionally to the relative frequencies; portions of the circle are allocated to the different groups.
• The angle of each sector = relative frequency × 360°
• Useful for displaying a relative frequency distribution.
• The whole pie chart represents the total sample or population.

Example 5
In a study of retractions in biomedical journals, 436 were due to error, 201 were due to plagiarism, 888 were due to fraud, 290 were duplications of publications, and 287 had other causes. Illustrate the above information in a graph and interpret the graph.

Cause         Frequency   Relative frequency   Angle
Error         436         20.74%
Plagiarism    201         9.56%
Fraud         888         42.25%
Duplication   290         13.80%
Other         287         13.65%
Total         2102

3.4 Graphic Presentation for Quantitative Data
• Well suited for a technical audience such as engineers, supervisors, etc.
• Provides more numerical detail.
• Examples: Histogram, Frequency polygon, Cumulative frequency polygon, Stem and leaf display

3.4.1 Histogram
• A graph, with a set of rectangles, in which classes (midpoints or class boundaries) are marked on the horizontal axis and frequencies (frequency histogram), relative frequencies (relative frequency histogram), or percentages (percentage histogram) are marked on the vertical axis.
• Each rectangle is constructed so that its area is proportional to the frequency of the class interval it represents.
• 2 types:
  A) Equal-width histogram
  B) Unequal-width histogram
• The bars in a histogram are drawn adjacent to each other.
• The symbol -⁄⁄- (truncation) is used to indicate that the entire axis is not shown.

3.4.1.1 Equal-width Histogram
• All the class intervals have the same width (or size).
• The vertical axis, which represents the height of each rectangle, is the class frequency, relative frequency or percentage.

Example 6
The data below represent the defective items produced by machines of varying age. Draw a histogram for the data.
Age (to the nearest month)   Frequency
1 – 5                        2
6 – 10                       3
11 – 15                      7
16 – 20                      15
21 – 25                      20
26 – 30                      22
31 – 35                      17

[Figure: Histogram for the Defective Items Produced by Machines of Varying Age. Vertical axis: Frequency, f (0 to 22). Horizontal axis: Age (to the nearest month), class boundaries 0.5 to 35.5.]

3.4.1.2 Unequal-width Histogram
• Class intervals are of unequal width (or size).
• The height of each rectangle must be adjusted where its class width differs from the “standard” class width, i.e., from the class width of the majority of class intervals.
• For example, when the width of a particular class interval (base of the rectangle) doubles, the height (class frequency) must be halved, and so on.
• The vertical axis is the frequency density (adjusted height), where
  Frequency density = $\dfrac{\text{frequency}}{\text{no. of standard class widths}}$

Example 7
Construct a histogram for the following data.

Time (minutes)   Frequency   Class width   No. of standard class widths   Frequency density / Adjusted frequency
40 –< 45         8           5             1
45 –< 50         13          5             1
50 –< 55         16          5             1
55 –< 60         24          5             1
60 –< 70         24          10            2                              24 / 2 = ____
70 –< 85         15          15            3

[Figure: Histogram for the Time. Vertical axis: Height (frequency density), 0 to 25. Horizontal axis: Time (minutes), 35 to 85.]

3.4.1.3 Shapes of Histogram
1. symmetric / normal / triangular
   • identical on both sides of its central point
2. skewed
   • non-symmetric, with a longer tail on one side than the other
   i. skewed to the right
      o longer tail on the right side
   ii. skewed to the left
      o longer tail on the left side
3. uniform / rectangular
   • same frequency for each class

3.4.2 Frequency Polygon
• Consists of line segments connecting the points formed by the class midpoint and the class frequency.
• Join the midpoints of the tops of successive bars in a histogram with straight lines.
• Join the points at each end of the diagram to the base line at the centres of the adjoining class intervals (2 classes with 0 frequencies).
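The construction described above can be sketched in code. This is a minimal illustration, not part of the original notes: it uses the Example 6 classes and frequencies, and the function name `polygon_points` is a hypothetical helper. It assumes equal class widths, so the two closing classes with zero frequency sit one class width beyond the first and last midpoints.

```python
# Class limits and frequencies from Example 6 (age to the nearest month).
classes = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
freqs = [2, 3, 7, 15, 20, 22, 17]


def polygon_points(classes, freqs):
    """Return (midpoint, frequency) vertices of a frequency polygon.

    Assumes equal class widths. Two adjoining classes with zero
    frequency are added so the polygon closes on the base line.
    """
    width = classes[1][0] - classes[0][0]          # class width (5 here)
    mids = [(lo + hi) / 2 for lo, hi in classes]   # class midpoints
    points = list(zip(mids, freqs))
    return [(mids[0] - width, 0)] + points + [(mids[-1] + width, 0)]


points = polygon_points(classes, freqs)
for x, f in points:
    print(f"midpoint {x:5.1f}  frequency {f}")
```

Plotting these vertices in order and joining them with straight lines gives the polygon. For an unequal-width histogram such as Example 7, the heights would be frequency densities rather than raw frequencies, and the choice of closing classes needs more care.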
Example 8
Construct a frequency polygon for the data in Example 6 and Example 7. (Draw on the histogram.)

Example 6
[Figure: Histogram for the Defective Items Produced by Machines of Varying Age. Vertical axis: Frequency, f. Horizontal axis: Age (to the nearest month), class boundaries 0.5 to 35.5.]

Example 7
[Figure: Histogram for the Time. Vertical axis: Frequency density / Adjusted frequency. Horizontal axis: Time (minutes), 35 to 85.]

3.4.3 Cumulative Frequency Polygon
• A line drawn for a cumulative frequency distribution by joining the dots marked above the upper boundaries of the classes at heights equal to the cumulative frequencies of the respective classes.
• Used to determine how many, or what proportion, of the data values are below or above a certain value.
• If the dots are joined by a smooth curve, it is called a cumulative frequency curve or ogive.

Example 9
Construct a cumulative frequency polygon for the data in Example 6 and Example 7.
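As a rough check on the “less than” table for the Example 6 data, the running totals can be computed directly. The sketch below is illustrative only; the variable names are assumptions, and the boundaries are the upper class boundaries implied by the Example 6 classes.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies from Example 6.
upper_boundaries = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5]
freqs = [2, 3, 7, 15, 20, 22, 17]

less_than = list(accumulate(freqs))   # running totals of the frequencies
total = less_than[-1]                 # total frequency

# The polygon starts at the lower boundary of the first class with F = 0.
table = [(0.5, 0)] + list(zip(upper_boundaries, less_than))

for boundary, F in table:
    pct = 100 * F / total             # cumulative percentage
    print(f"Less than {boundary:4.1f}: F = {F:2d} ({pct:.1f}%)")
```

Each pair in `table` is one dot of the cumulative frequency polygon: the boundary on the horizontal axis and the cumulative frequency F on the vertical axis.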
Example 6
Age in months     F
Less than 0.5
Less than 5.5
Less than 10.5
Less than 15.5
Less than 20.5
Less than 25.5
Less than 30.5
Less than 35.5

Example 7
Time (minutes)              F
More than or equal to 40
More than or equal to 45
More than or equal to 50
More than or equal to 55
More than or equal to 60
More than or equal to 70
More than or equal to 85

Example 6
[Figure: Cumulative Frequency Polygon for the Defective Items Produced by Machines of Varying Age. Vertical axis: Cumulative frequency, F (0 to 100). Horizontal axis: Age (to the nearest month), –4.5 to 35.5.]

Example 7
[Figure: Cumulative Frequency Polygon for the Time. Vertical axis: Cumulative frequency, F (0 to 100). Horizontal axis: Time (minutes), 40 to 85.]

3.4.4 Stem and Leaf Display
• A display of data in which each numerical value is divided into two parts: the leading digit(s) become the stem and the trailing digit(s) become the leaf.
• The purpose is to display the shape of a distribution.
• Steps:
  i) Split each value into two parts: the stem and the leaf.
  ii) Draw a vertical line and write the stems on the left side of it, from the lowest to the highest.
  iii) Record the leaves next to the corresponding stems on the right side of the vertical line.
• The leaves are usually arranged in increasing order.
• No comma is placed between leaf digits.
• Each leaf contains only a single digit, while the stem may have as many digits as needed.
• The advantage of the display is that no information on individual observations is lost.
• The stem-and-leaf display reveals some important features:
  i) Range of data values
  ii) Where the values are concentrated
  iii) Whether the distribution is symmetrical or not
  iv) Whether gaps exist or not
  v) Presence of outliers
  o If the leaves become too crowded, then each distinct stem from the basic plot can be split into either 2 or 5 different intervals.
    2 intervals:
    Stem   Leaf digits
    1st    0, 1, 2, 3, or 4
    2nd    5, 6, 7, 8, or 9

    5 intervals:
    Stem   Leaf digits
    1st    0 or 1
    2nd    2 or 3
    3rd    4 or 5
    4th    6 or 7
    5th    8 or 9

  o E.g., 12, 13, 13, 15, 17, 18, 19, 20, 21, 23, 25, 27.
    ✓ Split at the ones digit:
      1 | 2 3 3 5 7 8 9
      2 | 0 1 3 5 7
      Too few stems; the shape is not clearly seen, so this is not a suitable display.
    ✓ Thus, if each stem is duplicated:
      1 | 2 3 3
      1 | 5 7 8 9
      2 | 0 1 3
      2 | 5 7

  o E.g., 10, 12, 13, 13, 14, 15, 15, 15, 16, 16, 19
    ✓ Split at the ones digit:
      1 | 0 2 3 3 4 5 5 5 6 6 9
      Unable to comment on the shape.
    ✓ If each stem is duplicated:
      1 | 0 2 3 3 4
      1 | 5 5 5 6 6 9
      Also not suitable.
    ✓ Thus, with each stem split into 5 intervals:
      1 | 0
      1 | 2 3 3
      1 | 4 5 5 5
      1 | 6 6
      1 | 9
      Better display; the shape can be seen more clearly.

  o If the range between the smallest and largest data values is large and there are relatively few data values, the stem and leaf display will have many stem rows with few leaves in any one row (or empty rows). In that case we may produce a condensed stem and leaf display by truncating the last digit of the data values and reconstructing the plot.
  o E.g., 4, 25, 78, 105, 136, 143, 198, 200, 261
    ✓ Split at the ones digit:
       0 | 4
       1 |
       2 | 5
       3 |
       4 |
       :
       7 | 8
       :
      10 | 5
      11 |
       :
      25 |
      26 | 1
      Too many stems; the shape is not clearly seen, so this is not a suitable display.
      * Cannot skip the “in-between” stems which have no leaf.
  o Consider: 004, 025, 078, 105, 136, 143, 198, 200, 261 (last digit truncated, no rounding up)
    ✓ Split at the tens digit:
      0 | 0 2 7
      1 | 0 3 4 9
      2 | 0 6
      Better display; the shape can be seen more clearly.

Example 10
Alice achieved the following scores on her quizzes this semester: 86, 79, 92, 84, 69, 88, 91, 83, 96, 78, 82, 85. Construct a stem and leaf display for the data.
  6 |
  7 |
  8 |
  9 |

Example 11
Below are the weights for a sample of 30 students (in kg):
19.1 19.8 18.0 19.2 19.5 17.3 20.0 20.3 19.6 18.5
18.1 19.7 18.4 17.6 21.2 20.6 22.2 19.1 21.1 19.3
20.8 21.2 21.0 18.7 19.9 18.7 22.1 17.2 18.4 21.4
Construct a stem and leaf display.

Example 12
The ages (in months) at which 25 children were first enrolled in a preschool are listed below.
38 40 38 35 39 34 37 36 35 36 45 35 36
36 43 41 36 37 43 38 40 34 41 39 36
Construct a stem and leaf display for the distribution of the age of the preschoolers.

Chapter 4 DESCRIPTIVE STATISTICS

4.1 Measure of Central Location (Average)
• A single value within the range of data used to represent all the values in the series.
• The point of location around which individual values cluster.
• Also known as a measure of central value or central tendency.
• Two types:
  o Mathematical Average: Mean
  o Positional Average: Median, Mode, Fractiles

4.1.1 Mean
• Properties:
  o For interval-level and ratio-level data.
  o All values are used.
  o Unique.
  o $\Sigma(x_i - \bar{x}) = 0$.
  o Used to compare populations.
• Advantages:
  o Simple, always exists and unique.
  o Fully representative.
  o Can be used for further mathematical analysis.
  o Can be calculated even when only the total value and the number of items are known.
  o Relatively reliable.
• Disadvantages:
  o Affected by extreme values.
  o Cannot be determined for data with open-ended class(es). If such classes contain a large proportion of the values, then the mean may be subject to substantial error.

A) Raw Data
• Population mean, $\mu = \dfrac{\Sigma X}{N}$
• Sample mean, $\bar{x} = \dfrac{\Sigma x}{n}$
• $\mu$ is a parameter and $\bar{x}$ is a statistic.

Example 1
The following are the ages (in years) of all eight employees of a small company:
53 32 61 27 39 44 49 57
Find the mean age of these employees. Interpret your answer.
$\Sigma X$ = 53 + 32 + 61 + 27 + 39 + 44 + 49 + 57 = ____
$\mu = \dfrac{\Sigma X}{N}$ = ____
The average age of all eight employees of this company is ____ years old.

Example 2
Following are the list prices (in $) of eight homes randomly selected from all homes for sale in a city.
245,670 176,200 360,280 272,440 450,394 310,160 393,610 374,480
Calculate and interpret the mean.
$\Sigma x$ = 245,670 + … + 374,480 = ____
$\bar{x} = \dfrac{\Sigma x}{n}$ = ____
The average ____

B) Ungrouped Frequency Distribution
• $\bar{x} = \dfrac{\Sigma f x}{\Sigma f}$, where x : data value, f : frequency

Example 3
In a survey of 50 households, the number of children in each household is shown below.

Number of children     0   1    2    3   4   5
Number of households   8   15   13   9   3   2

Find and interpret the mean for the above data.
$\bar{x} = \dfrac{\Sigma f x}{\Sigma f}$ = ____
The average ____

C) Grouped Frequency Distribution
• $\bar{x} = \dfrac{\Sigma f x}{\Sigma f}$, where x : class midpoint

Example 4
Determine the mean for the data below.

Score     f    x    fx
60 – 62   10
63 – 65   36
66 – 68   84
69 – 71   54
72 – 74   16
Total

$\bar{x} = \dfrac{\Sigma f x}{\Sigma f}$ = ____

4.1.2 Median, $\tilde{x}$
• The midpoint of the ordered values.
• There are as many values above the median as below it in the data array.
• Properties:
  o Unique
  o For ratio, interval and ordinal-level data
  o Can be found for an open-ended frequency distribution
• Advantages:
  o Not influenced by outliers.
  o Preferred for data sets that contain outliers.
• Disadvantages:
  o Data have to be arranged.
  o Does not fully reflect the distribution.
  o Unsuitable for use in further calculations.
  o May not be truly representative if there are too few items.

A) Raw Data
• $\tilde{x}$ = the $\left(\dfrac{n+1}{2}\right)$th item (n is odd)
  $\tilde{x}$ = the mean of the $\left(\dfrac{n}{2}\right)$th and $\left(\dfrac{n}{2} + 1\right)$th items (n is even)

Example 5
The following data relate to the marks obtained by 15 students. Find and interpret the median value.
30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72
Rank the data: 30, 35, 35, 40, 41, 46, 47, 52, 52, 59, 60, 61, 65, 70, 72
n = 15 (odd), $\tilde{x}$ = the $\left(\dfrac{n+1}{2}\right)$th item = ____
Half of the group of students obtained less than or equal to 52 marks while the other half obtained more than or equal to 52 marks.

Example 6
Find the median of the data speeds (in Mbps) of smartphones from six different telecommunication companies. Interpret the finding.
38.5 55.6 22.4 14.1 23.1 24.5
Rank the data: ____
n = 6 (even), $\tilde{x}$ = between the $\left(\dfrac{6}{2}\right)$th and $\left(\dfrac{6}{2} + 1\right)$th items = ____
Half of the telecommunication ____

B) Ungrouped Frequency Distribution
• Determined using the cumulative frequency distribution table.
• Position of the median is the same as in a raw data set.

Example 7
Given below are quiz scores (out of 10) obtained by 150 students. Determine the median.

Score   f
6       30
7       52
8       26
9       22
10      20
Total   150

$\tilde{x}$ = between the $\left(\dfrac{150}{2}\right)$th and $\left(\dfrac{150}{2} + 1\right)$th items = ____

C) Grouped Frequency Distribution
• Estimate from the cumulative frequency polygon.
• $\tilde{x} = x_{n/2}$, or 50% of the total (regardless of even or odd n).

[Figure: sketch of a cumulative frequency polygon. Vertical axis: F, marked at n and n/2. Horizontal axis: Class boundary, with $\tilde{x}$ read off at F = n/2.]

Example 8
Estimate the median for the data in Example 4 using a cumulative frequency polygon.

Cumulative frequency distribution
Class Boundary    F
Less than 59.5    0
Less than 62.5    10
Less than 65.5    46
Less than 68.5    130
Less than 71.5    184
Less than 74.5    200

$\tilde{x}$ = the $\left(\dfrac{200}{2}\right)$th item = ____

[Figure: Cumulative Frequency Polygon for the Score. Vertical axis: F (0 to 200). Horizontal axis: Score, class boundaries 59.5 to 74.5.]

Example 9
The following table gives the frequency distribution of the workers of a factory according to their average monthly income in a certain year. Estimate the median value using a cumulative frequency polygon.

Income (RM)      f
500 –< 1000      28
1000 –< 1500     34
1500 –< 2000     46
2000 –< 2500     32
2500 –< 3000     24
3000 –< 3500     12

Income (RM)        F
Less than 500
Less than 1000
Less than 1500
Less than 2000
Less than 2500
Less than 3000
Less than 3500

$\tilde{x}$ = ____

[Figure: Cumulative Frequency Polygon for the Average Monthly Income in a Certain Year. Vertical axis: F (0 to 180). Horizontal axis: Income (RM), 500 to 3500.]

4.1.3 Mode, $\hat{x}$
• The value that appears most frequently.
• Useful for nominal and ordinal data.
• Advantages:
  o Simple and easy to understand.
  o Not affected by extreme values.
  o Can be found for open-ended classes.
  o For quantitative and qualitative variables.
  o Can be the value of an actual item in the distribution.
• Disadvantages:
  o May not exist and may not be unique.
  o Not suitable for further calculations or mathematical analysis.
  o Data have to be arranged.
• A distribution with one mode is called unimodal.
• When two values occur with the same (highest) frequency, the distribution is called bimodal.
• If more than two modal values occur, it is said to be multimodal.

A) Raw data
Example 10
The following data give the speeds (in km per hour) of the cars that were stopped for speeding violations at two locations, A and B.
Location A: 125, 130, 120, 135, 127, 125, 118, 125
Location B: 115, 120, 110, 113, 112, 125, 118, 123
Determine the mode for the two locations.
$\hat{x}_A$ = ____
The most frequent ____
$\hat{x}_B$ = ____
The distribution of the speeds of cars that are stopped for speeding violations in location B ____

Example 11
A printing press turns out 5 impressions: ‘very sharp’, ‘sharp’, ‘sharp’, ‘sharp’, ‘blurred’. Then the modal value is ____

B) Ungrouped Frequency Distribution
• Choose the item with the highest frequency.

Example 12
i)
Days of birth   freq
Monday          22
Tuesday         10
Wednesday       32
Thursday        17
Friday          13
Saturday        32
Sunday          14

$\hat{x}$ = ____
The most frequent ____

ii)
Height (cm)   f
155           3
156           7
157           10
158           15
159           16
160           9
161           2

$\hat{x}$ = ____
The most frequent ____

C) Grouped Frequency Distribution
• If a frequency polygon or curve is given, $\hat{x}$ is the x value with the highest peak.
• Plot a histogram.
[Figure: sketch of a histogram with the modal class marked. Vertical axis: f. The mode $\hat{x}$ is read off the horizontal axis within the modal class.]

Example 13
Estimate the mode using the histogram.

Weight (gram)   No. of packages, f
450 –< 452      11
452 –< 454      26
454 –< 456      34
456 –< 458      24
458 –< 460      20

From the graph, $\hat{x}$ = ____
The most frequent ____

[Figure: Histogram for the Weight. Vertical axis: f (0 to 36). Horizontal axis: Weight (gram), 450 to 460.]

• Considerations for Choosing a Measure of Central Tendency

4.1.4 The Relative Positions of the Mean, Median, and Mode
a) Symmetric Distribution
   o Zero skewness
   o $\bar{x} = \tilde{x} = \hat{x}$
b) Positively Skewed
   o Skewed to the right
   o $\hat{x} < \tilde{x} < \bar{x}$
c) Negatively Skewed
   o Skewed to the left
   o $\bar{x} < \tilde{x} < \hat{x}$

4.1.5 Fractiles and Quartiles
• Measures of location/position.
• Include not only the central location but also any position based on the number of equal divisions in a given distribution.
• Median ($\tilde{x}$): divides the distribution into 2 equal parts
• Quartiles ($Q_i$): divide it into 4 equal parts
• Deciles ($D_i$): divide it into 10 equal parts
• Percentiles ($P_k$): divide it into 100 equal parts
• $Q_2 = D_5 = P_{50} = \tilde{x}$

A) Raw data/Ungrouped frequency distribution
• Organize the data into ascending order and calculate the location (exact location):
  o $Q_1$ = the $\frac{1}{4}(\Sigma f + 1)$th item
  o $Q_3$ = the $\frac{3}{4}(\Sigma f + 1)$th item
  o $D_i$ = the $\frac{i}{10}(\Sigma f + 1)$th item, i = 1, 2, 3, …, 8, 9, 10
  o $P_k$ = the $\frac{k}{100}(\Sigma f + 1)$th item, k = 1, 2, …, 99, 100

B) Grouped Frequency Distribution
• Location (approximate location):
  o $Q_1$ = the $\frac{1}{4}\Sigma f$th item
  o $Q_3$ = the $\frac{3}{4}\Sigma f$th item
  o $D_i$ = the $\frac{i}{10}\Sigma f$th item
  o $P_k$ = the $\frac{k}{100}\Sigma f$th item
• Use the ogive / cumulative frequency polygon to estimate the quartiles, deciles, and percentiles.

Example 14
For the following data, determine $Q_1$, $Q_3$, $D_7$, $P_{59}$.
46 47 49 49 51 53 54 54 55 55 59
$Q_1$ = the $\frac{1}{4}(11 + 1)$th item = ____
$Q_3$ = the $\frac{3}{4}(11 + 1)$th item = ____
$D_7$ = the $\frac{7}{10}(11 + 1)$th item = ____
$P_{59}$ = the $\frac{59}{100}(11 + 1)$th item = ____

Example 15
A company selling a consumer product directly to retail outlets has collected the following information:

Number of Orders   No. of salesmen
10 – 19            3
20 – 29            8
30 – 39            16
40 – 49            22
50 – 59            19
60 – 69            8
70 – 79            4

Determine the Quartiles, $D_2$, $P_{66}$.
$Q_1$ = the $\frac{80}{4}$th item = ____
$Q_3$ = the $\frac{3(80)}{4}$th item = ____
$D_2$ = the $\frac{2(80)}{10}$th item = ____
$P_{66}$ = the $\frac{66(80)}{100}$th item = ____

Class Boundary    F
Less than 9.5     0
Less than 19.5    3
Less than 29.5    11
Less than 39.5    27
Less than 49.5    49
Less than 59.5    68
Less than 69.5    76
Less than 79.5    80

[Figure: Cumulative Frequency Polygon for the Number of Orders. Vertical axis: F (0 to 80). Horizontal axis: Number of Orders, class boundaries 9.5 to 79.5.]

4.2 Measure of Dispersion
• Measure of variability: describes diversity and variability in the distribution of a variable.
• Nominal variable: Index of Qualitative Variation
• Interval/Ratio: Two main types:
  A) Distance measures
     o Measure the distance between any two significant positional values.
     o Range, Interquartile Range.
  B) Average Deviation Measures
     o Measure the average or mean deviation of all the data from some measure of central tendency.
     o Variance, Standard Deviation and Coefficient of Variation.

4.2.1 Index of Qualitative Variation
• For nominal variables, to compare the diversity of a variable in different groups or to find out whether a group has become more diverse over time.
• Based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution.
• Varies from 0.00 to 1.00.
• When all the cases in the distribution are in one category, there is no variation or diversity, IQV = 0.00.
• When the cases in the distribution are distributed evenly across the categories, there is maximum variability or diversity, IQV = 1.00.
• $\text{IQV} = \dfrac{K(100^2 - \Sigma Pct^2)}{100^2 (K - 1)}$
  where K : number of categories
        $\Sigma Pct^2$ : sum of all squared percentages in the distribution

Example 16
The following table shows the top five ethnic groups for two states by percentage, 2010. Comment on and compare the ethnic diversity of the following two states.
Ethnic Group                          Maine (%)   Hawaii (%)
White                                 97.3        29.7
Latino                                1.3         10.7
Asian                                 1.1         46.3
Native Hawaiian or Pacific Islander   -           11.9
Other                                 0.3         1.5
Total                                 100.0       100.0

Ethnic Group                          Maine (%)   (%)²
White                                 97.3
Latino                                1.3
Asian                                 1.1
Native Hawaiian or Pacific Islander   -           -
Other                                 0.3
Total                                 100.0

Ethnic Group                          Hawaii (%)  (%)²
White                                 29.7
Latino                                10.7
Asian                                 46.3
Native Hawaiian or Pacific Islander   11.9
Other                                 1.5
Total                                 100.0

$\text{IQV}_{\text{Maine}} = \dfrac{K(100^2 - \Sigma Pct^2)}{100^2 (K - 1)}$ = ____
The number of ethnic differences in Maine is 7% of the maximum possible differences.
$\text{IQV}_{\text{Hawaii}}$ = ____
The number of ethnic differences in Hawaii is 84% of the maximum possible differences.
There is considerably more ethnic variation in ____ than in ____.

4.2.2 Range
• Influenced by extreme value(s), especially if they are unrepresentative values.
• Easy to compute and understand.
A) Raw Data/Ungrouped Frequency Distribution
• Range = highest value – lowest value
B) Grouped Frequency Distribution
• Range = Upper class boundary of the last class – Lower class boundary of the first class

Example 17
Find the range for the data below.

Data Set                                 Range
i) {2, 2, 3, 4, 5}
ii) {2, 5, 7, 10, 100}
iii) {-4, -8, 12, 10, 17, 7, 1, -3}

Example 18
Find the range for the data in Example 15.
Lowest value = ____
Highest value = ____
Range = ____

4.2.3 Interquartile Range and Quartile Deviation
• Interquartile Range
  o Measures the spread of the middle 50 percent of the observations.
  o IR = $Q_3 - Q_1$
• Quartile Deviation
  o QD = $\dfrac{Q_3 - Q_1}{2}$
  o The smaller the QD, the greater the concentration of the middle half of the observations in the data.
• Can be computed for open-ended classes.
• Not influenced by extreme values.
• Not fully representative of a set of measurements, as it is not based on all the information available.

Example 18
For a set of heights for a group of students, the upper quartile is 24cm and the lower quartile is 10cm. What is the quartile deviation? Give an interpretation for the finding.
IR = $Q_3 - Q_1$ = ____
[The height of the middle 50 percent of the students varied with a spread of ____.]
QD = $\dfrac{Q_3 - Q_1}{2}$ = ____
The height of half of the middle 50 percent of the students varied with a spread of ____.

4.2.4 Standard Deviation and Variance
• Variance
  o The arithmetic mean of the squared deviations from the mean.
  o All values are used.
  o Influenced by extreme values, since every squared deviation from the mean enters the calculation.

A) Raw data
• Population variance,
  $\sigma^2 = \dfrac{\Sigma(X - \mu)^2}{N} = \dfrac{\Sigma X^2}{N} - \mu^2$, where $\mu = \dfrac{\Sigma X}{N}$
• Sample variance,
  $s^2 = \dfrac{\Sigma(x - \bar{x})^2}{n - 1} = \dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1}$, where $\bar{x} = \dfrac{\Sigma x}{n}$
  * In each case the first expression is the deviation formula and the second is the direct formula.

B) Grouped Frequency Distribution
• Population variance,
  $\sigma^2 = \dfrac{\Sigma f(X - \mu)^2}{\Sigma f} = \dfrac{\Sigma f X^2}{\Sigma f} - \mu^2$
• Sample variance,
  $s^2 = \dfrac{\Sigma f(x - \bar{x})^2}{\Sigma f - 1} = \dfrac{\Sigma f x^2 - \frac{(\Sigma f x)^2}{\Sigma f}}{\Sigma f - 1}$
  * X, x : class midpoint; f : class frequency
  * In each case the first expression is the deviation formula and the second is the direct formula.

• Standard Deviation
  o The square root of the variance.
  o Population standard deviation, $\sigma = \sqrt{\sigma^2}$.
  o Sample standard deviation, $s = \sqrt{s^2}$.
• For a data set with a large amount of variation, the data values will, on average, be far from the mean; the standard deviation will be large.
• For a data set with a small amount of variation, the data values will, on average, be close to the mean; the standard deviation will be small.

Example 19
Refer to Example 1. Find the population mean and standard deviation.
Ages (in years) of all eight employees of a small company:
53 32 61 27 39 44 49 57
$\Sigma X$ = 362, $\mu$ = 45.25 years old
$\sigma^2 = \dfrac{\Sigma(X - \mu)^2}{N}$ = ____

Example 20
The hourly wages earned by a sample of five students are $7, $5, $11, $8, $6. Find the variance and the standard deviation.
$\Sigma x^2$ = 295, $\Sigma x$ = 37
$s^2 = \dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1}$ = ____

Example 21
Calculate the sample standard deviation for the following set of data.

Score     No. of students, f   Midpoint, x   fx     fx²
60 – 62   10
63 – 65   36                                 2304   147456
66 – 68   84                                 5628   377076
69 – 71   54                                 3780   264600
72 – 74   16                                 1168   85264
Total     200

4.2.5 Coefficient of Deviation / Coefficient of Variation
• Ratio of the standard deviation to the arithmetic mean.
• Expressed as a percentage.
• $\text{CV} = \dfrac{\sigma}{\mu} \times 100\%$ for a population
• $\text{CV} = \dfrac{s}{\bar{x}} \times 100\%$ for a sample
• Used to compare the variability between two or more different distributions, or when the means differ markedly.

Example 22
Consider the measurements of yield and plant height of a paddy variety. The mean and standard deviation for yield are 50kg and 10kg respectively. The mean and standard deviation for plant height are 55cm and 5cm respectively. Compare and comment on the variability of the distributions.
$\text{CV}_{\text{yield}}$ = ____
$\text{CV}_{\text{height}}$ = ____
The distribution of the ____ of the paddy is more dispersed/variable as compared to the distribution of the ____ of the paddy.

• Considerations for Choosing a Measure of Variation

4.3 Measure of Skewness
• Measurement of the lack of symmetry of the distribution.

4.3.1 Pearsonian Coefficient of Skewness
• Pearson's first coefficient of skewness,
  $Sk_{(1)} = \dfrac{\text{mean} - \text{mode}}{\text{standard deviation}}$
• Pearson's second coefficient of skewness,
  $Sk_{(2)} = \dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
  $Sk_{(2)} = \dfrac{3(\bar{x} - \tilde{x})}{s}$ for a sample and $Sk_{(2)} = \dfrac{3(\mu - \tilde{\mu})}{\sigma}$ for a population
• Ranges from –3.00 up to 3.00.
• A value of 0 indicates a symmetric distribution.

4.3.2 Quartile Measure of Skewness
• $Sk_Q = \dfrac{Q_3 + Q_1 - 2\,\text{median}}{\text{Interquartile range}}$
• Takes values between –1 and +1.
• Convenient to use when the median and the quartiles are used to describe the distribution.

Example 23
The lengths of stay by patients on the cancer floor of a local hospital were organized into a frequency distribution. The mean length of stay was 28 days, the median 25 days, and the standard deviation was found to be 4.2 days. Calculate the coefficient of skewness. Interpret the result.
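The two skewness formulas from Sections 4.3.1 and 4.3.2 are simple enough to check with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the first call uses the figures stated in Example 23, and the quartile values in the second call are made-up numbers chosen to show a symmetric case.

```python
def pearson_second_skewness(mean, median, sd):
    """Pearson's second coefficient of skewness: 3(mean - median) / sd."""
    return 3 * (mean - median) / sd


def quartile_skewness(q1, median, q3):
    """Quartile measure of skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Example 23: mean 28 days, median 25 days, standard deviation 4.2 days.
sk = pearson_second_skewness(mean=28, median=25, sd=4.2)
print(f"Sk(2) = {sk:.4f}")

# Hypothetical quartiles (not from the notes) with the median midway
# between Q1 and Q3, which gives a quartile skewness of zero.
sk_q = quartile_skewness(q1=10, median=15, q3=20)
print(f"SkQ = {sk_q:.4f}")
```

A positive coefficient points to a longer right tail (mean above median), a negative one to a longer left tail, and zero to symmetry.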
$Sk = \dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}}$ = ____
The distribution for the lengths of stay by patients on the cancer floor of a local hospital is ____

4.4 Box Plot/Box and Whisker Plot
• A graphical display, based on quartiles, that helps to picture a set of data.
• Five values are needed:
  i) Minimum value
  ii) First Quartile
  iii) Median
  iv) Third Quartile
  v) Maximum value

[Figure: box plot with the box and the two whiskers labelled, and the positions of values (i) to (v) marked.]

• Right-skewed: the right side whisker is much longer than the left side whisker.
• Perfectly symmetrical: the length of the left whisker equals the length of the right whisker, and the median line divides the box in half.
• Left-skewed: the left side whisker is much longer than the right side whisker.

Example 24
In a study of memory recall times, a series of stimulus words was shown to a subject on a computer screen. For each word, the subject was instructed to recall either a pleasant or an unpleasant memory associated with that word. Successful recall of a memory was indicated by the subject pressing a bar on the computer keyboard. The table below shows the recall times (in seconds) for 11 pleasant and 7 unpleasant memories.

Pleasant memory:     1.07  1.22  1.63  2.12  2.56  2.93  3.03  3.22  4.63  5.55  6.17
Unpleasant memory:   1.45  1.9   2.32  2.43  2.57  3.87  4.33

Pleasant memory: n = 11
Minimum = ____   Q1 = ____   Median = ____   Q3 = ____   Maximum = ____

Unpleasant memory: n = 7
Minimum = ____   Q1 = ____   Median = ____   Q3 = ____   Maximum = ____

[Figure: Boxplots for the Recall Time, one for unpleasant memory and one for pleasant memory. Horizontal axis: 1 to 6 (seconds).]

The distributions for the recall times for ____
However, the distribution for ____ is more dispersed/variable as compared to unpleasant memory.

4.5 Reliability and Validity
• All measurements, especially measurements of behaviours, opinions, and constructs, are subject to fluctuations (error) that can affect the measurement’s reliability and validity.
• Reliability
  o Measurement of consistency and stability of test scores.
  o Prerequisite for validity.
  o Analogous to variance (low reliability = high variance).
  o A reliability coefficient of 0.70 is considered to indicate good reliability; a test with a coefficient below 0.50 would not be considered very reliable.

  Type: Over time (test-retest reliability)
    Definition: Administer the same test twice over a period of time to the same individuals.
    Measured by: Correlation between scores at Time 1 and Time 2.
  Type: Across items (internal consistency)
    Definition: Consistency of people’s responses across the items on a multiple-item measure.
    Measured by: Cronbach’s alpha.
  Type: Across different researchers (inter-rater reliability)
    Definition: The extent to which different observers are consistent in their judgments.
    Measured by: Cronbach’s alpha (quantitative) or Cohen’s kappa (categorical).
  Type: Alternate forms (parallel-forms reliability)
    Definition: Administer different versions of an assessment tool (different in wording, but both containing items that probe the same construct) to the same group of individuals.
    Measured by: Correlation between the responses to the pairs of questions.

• Validity
  o Suitability or meaningfulness of the measurement.
  o Analogous to unbiasedness (valid = unbiased).

  Type: Content
    Definition: The extent to which the content of the test matches the instructional objectives.
    Measured by: Correlating experts’ judgment, or item-item or item-total correlation.
  Type: Criterion
    Definition: The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) an external criterion.
    Measured by: Correlating the test with the criteria during data collection (concurrent validity) or at some point in the future (predictive validity).
  Type: Construct
    Definition: The extent to which an assessment corresponds to other variables, as predicted by some rationale or theory.
    Measured by: Factor analysis, or correlating with other theoretical measures with which the instrument being developed should correlate.

• Measures to ensure the validity of a piece of research:
  o Appropriate time scale for the study
  o Appropriate methodology
  o Most suitable sampling method
  o The respondents must not be pressured in any way to select specific choices among the answer sets

Example
Sample raw data: 13, 24, 44, 56, 67, 70, 82
After entering the data: $\Sigma x$ = 356, $\Sigma x^2$ = 21930, n = 7
$\bar{x} = \dfrac{\Sigma x}{n} = \dfrac{356}{7} = 50.8571$
$s = \sqrt{\dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1}} = \sqrt{\dfrac{21930 - \frac{(356)^2}{7}}{7 - 1}} = 25.2483$

Sample grouped data:

Class     f   Midpoint, x
10 –< 15  4   12.5
15 –< 20  7   17.5
20 –< 25  3   22.5

After entering the data: $\Sigma f x$ = 240, $\Sigma f x^2$ = 4287.5, $\Sigma f$ = 14
$\bar{x} = \dfrac{\Sigma f x}{\Sigma f} = \dfrac{240}{14} = 17.1429$
$s = \sqrt{\dfrac{\Sigma f x^2 - \frac{(\Sigma f x)^2}{\Sigma f}}{\Sigma f - 1}} = \sqrt{\dfrac{4287.5 - \frac{(240)^2}{14}}{14 - 1}} = 3.6502$

Chapter 5 REGRESSION

5.1 Regression Analysis
• A prediction model using one or more independent/explanatory/predictor variables to predict the values of a dependent/response/outcome variable.
• Explains and predicts the dependent variable on the basis of information on the independent variable(s).
• Bivariate regression/Simple linear regression
  o Examines changes in the dependent variable as a function of changes or differences in values of ONE independent variable.
  o E.g.
    i) What is the relationship between education and income? For each year of education, how much does income increase (on average)?
    ii) What will be the rate of return on investment? For each dollar invested, how much will sales increase?
    iii) For a political candidate, how many votes will he or she get for each dollar spent on advertising?
• Multiple linear regression
  o Attempts to model the relationship between two or more independent variables and a dependent variable by fitting a linear equation to observed data.
  o E.g.
    i) Do age and IQ scores effectively predict GPA?
    ii) Do weight, height, and age explain the variance in cholesterol levels?
• Nonlinear Regression
  o Observational data are modelled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.

5.2 Scatter Diagram
• A plot of paired observations.
• Illustrates whether there is
  o any relationship between the DV and IVs;
  o a positive / negative relationship;
  o a linear / non-linear relationship.
• Positive relationship: An increase in the IV will lead to an increase in the DV, and vice versa.
• Negative relationship: An increase in the IV will lead to a decrease in the DV, and vice versa.

Example 1
The following data show the educational attainment (in years), X, and the Internet usage (in hours) per week, Y, for a sample of 10 individuals. Draw a scatter diagram of the two variables and comment on the graph.

X   10   9   12   13   19   11   16   23   14   21
Y   1    0   3    4    7    2    6    9    5    8

[Figure: Scatter Diagram for Educational Attainment and Internet Usage. Vertical axis: Internet Usage per Week (hour), 0 to 10. Horizontal axis: Educational Attainment (year), 4 to 24.]

There exists a ____ relationship between ____

5.3 Simple Linear Equation
• The relationship between the variables is linear, i.e., the equation model is a straight line.
• The general form: $\hat{Y} = a + bX$
  $\hat{Y}$ : predicted value of the Y variable.
  X : any value of the independent variable.
  * $\hat{Y}$ can also be denoted as Y’.
• a : Y-intercept; the estimated value of Y when X = 0, i.e., where the regression line crosses the Y-axis.
  b : slope of the regression line; the average change in $\hat{Y}$ for each change of one unit in X.
  o b positive: positive linear relationship.
  o b negative: negative linear relationship.
  o Be careful when interpreting a. If X = 0 is outside the range of X in the data set, the prediction may not carry much credibility.
• By the least squares method:
o Slope of the regression line, b:
$b = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$
n : the total number of observations (X, Y)
o Y-intercept, a:
$a = \frac{\Sigma Y}{n} - b\,\frac{\Sigma X}{n}$ or $a = \bar{Y} - b\bar{X}$
* $\bar{Y}$ and $\bar{X}$ are the means of Y and X, respectively.
• a and b : estimated regression coefficients, or simply regression coefficients.
• The model Ŷ = a + bX is also called the least-squares regression line of Y on X.
• Assumptions:
o For each value of X, there is a group of Y values, and these Y values are normally distributed.
o The means of these normal distributions of Y values all lie on the regression line.
o The standard deviations of these normal distributions are equal.
o The Y values are statistically independent.
• Two types of estimation using the regression equation:
1. Interpolation estimate
o Estimates the values of Y within the range of the observations of X in the data set.
o More accurate and more reliable.
2. Extrapolation estimate
o Estimates the values of Y outside the range of the observations of X in the data set.
o Most commonly used for forecasting using a time series.
o May be less accurate and unreliable to a certain extent.

Example 2
Find the least squares equation for the Internet usage on educational attainment based on the data in Example 1. Interpret the regression coefficients obtained.
X     Y    XY      X²
10    1    ____    ____
9     0    0       81
12    3    36      144
13    4    52      169
19    7    133     361
11    2    22      121
16    6    96      256
23    9    207     529
14    5    70      196
21    8    168     441

$b = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$ = ______
$a = \frac{\Sigma Y}{n} - b\,\frac{\Sigma X}{n}$ = ______
The least squares line: Ŷ = ______
The regression coefficient b is 0.6166 and is interpreted as "The average change in the estimated ______ with every 1 year of change in ______" or "With each additional ______ of ______, ______ is predicted to ______."
The Y-intercept a is -4.6257 and may not have a clear substantive interpretation. (The estimated ______ is ______ when ______.)

Example 3
The ages of the respondents in the sample from Example 1 are recorded as follows.
Age, X                               55   60   45   35   23   40   22   27   41   30
Internet Usage per week (hour), Y     1    0    3    4    7    2    6    9    5    8

i) Refer to the scatter diagram for age and Internet usage per week and interpret the diagram.
The diagram suggests there is a ______ relationship between ______.
ii) Obtain a least squares regression line for Internet usage per week on age.
$b = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$ = ______
$a = \frac{\Sigma Y}{n} - b\,\frac{\Sigma X}{n}$ = ______
The least squares regression line: ______
iii) Interpret the regression coefficients obtained from ii).
The average change in the estimated ______ with every 1 ______ of change in the ______
or
With each additional ______
The estimated ______
iv) Predict the Internet usage per week for a respondent of age (1) 50; (2) 20. Comment on the reliability and accuracy of each estimate.
(1) Ŷ = 12.2641 – 0.2054(50)
The value of 50 falls ______ the range of the data set, hence the estimate is obtained by the technique of ______, thus it is considered as ______.
(2) Ŷ = 12.2641 – 0.2054(20)
The value of 20 falls ______ the range of the data set, hence the estimate is obtained by the technique of ______, thus it is considered as ______.

5.4 Coefficient of Determination, r²
• The proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
• Also known as goodness of fit.
• 0 ≤ r² ≤ 1.
• r² = 0: the DV cannot be predicted from the IV.
• r² = 1: the DV can be predicted without error from the IV.
• The closer the value is to 1 or 100%, the better the fit of the regression model.
$r^2 = \left( \frac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2][n\Sigma Y^2 - (\Sigma Y)^2]}} \right)^2$

Example 4
Calculate and interpret the coefficient of determination for Example 2 and Example 3.
Example 2
ΣX = 148, ΣY = 45, ΣXY = 794, ΣX² = 2398, ΣY² = 285, n = 10
$r^2 = \left( \frac{10(794) - (148)(45)}{\sqrt{[10(2398) - (148)^2][10(285) - (45)^2]}} \right)^2$ = ______
About ______ of the total variation in the ______ is explained or accounted for by the variation in the ______.
Thus the regression line Ŷ = -4.6257 + 0.6166X is ______.

Example 3
$r^2 = \left( \frac{10(1391) - (378)(45)}{\sqrt{[10(15798) - (378)^2][10(285) - (45)^2]}} \right)^2$ = ______
About ______ of the total variation in the ______ is explained or accounted for by the variation in the ______.
Thus the regression line Ŷ = 12.2641 – 0.2054X is ______.

5.5 Multiple Linear Regression
• Extension of bivariate regression used to examine the effect of two or more independent variables on the dependent variable.
• General form: Ŷ = a + b1*X1 + b2*X2
o where Ŷ = the predicted value of the DV
o X1 = the value of IV X1
o X2 = the value of IV X2
• a = the Y-intercept; the estimated value of Y when X1 = 0 and X2 = 0
• bi* = the partial slope of Y on Xi; the average change in Y with a unit change in a specific Xi, while controlling for (holding constant) the value of the other IV(s)
• If there is a curvilinear relation between the IV and the DV, the model can be a polynomial regression model, $\hat{Y} = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_h X^h$.
o Considered a multiple linear regression since it is linear in the regression coefficients $\beta_1, \beta_2, \dots, \beta_h$.
o When h = 2, the model is called a quadratic regression; h = 3 is a cubic regression; h = 4 is a quartic regression, and so on.
• The multiple coefficient of determination, R square, measures the proportion of the total variation in the DV that is explained jointly by two or more IVs.
• Pearson's multiple correlation coefficient, R, measures the linear relationship between the DV and the combined effect of two or more IVs.

Example 5
Refer to the previous examples. Let educational attainment and age be the independent variables and Internet usage per week be the dependent variable.
Let Y : Internet usage per week (Usage)
X1 : Educational attainment (Edu)
X2 : Age

Output from SPSS

Variables Entered/Removed(a)
Model    Variables Entered    Variables Removed    Method
1        Age, Edu(b)          .                    Enter
a. Dependent Variable: Usage
b. All requested variables entered.

Model Summary
Model    R           R Square    Adjusted R Square    Std. Error of the Estimate
1        .9884(a)    .9769       .9703                .5220
a. Predictors: (Constant), Age, Edu

ANOVA(a)
Model 1       Sum of Squares    df        Mean Square    F           Sig.
Regression    80.5924           2.0000    40.2962        147.8695    .0000(b)
Residual      1.9076            7.0000    .2725
Total         82.5000           9.0000
a. Dependent Variable: Usage
b. Predictors: (Constant), Age, Edu

Coefficients(a)
              Unstandardized Coefficients    Standardized Coefficients
Model 1       B         Std. Error           Beta                         t         Sig.
(Constant)    -.6051    1.7175                                            -.3523    .7350
Edu           .491      .062                 .779                         7.883     .000
Age           -.057     .023                 -.245                        -2.477    .042
a. Dependent Variable: Usage

b1* = 0.49: the estimated Internet usage increases by 0.49 hours for each additional year of educational attainment, holding age constant.
b2* = -0.06: the estimated Internet usage decreases by 0.06 hours with each year of increase in age when educational attainment is held constant.
a = -0.61: the estimated Internet usage per week is -0.61 hours when both educational attainment and age are 0 (not meaningful).
Multiple coefficient of determination, R² = 0.98: 98% of the total variation in the Internet usage per week can be explained by the model containing Educational Attainment and Age.

5.6 Non-linear Regression
• Observational data are modeled by a function which is a non-linear combination of the model parameters and depends on one or more IVs.
• Some non-linear equations can be transformed to mimic a linear equation. If this happens, the non-linear equation is called "intrinsically linear".
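As a hedged illustration of an intrinsically linear model: the exponential model y = a·e^(bx) becomes a straight line after taking natural logarithms, ln y = ln a + bx, so it can be fitted with the ordinary least-squares formulas applied to (x, ln y). The data below are hypothetical, generated exactly from a = 2 and b = 0.5, and the helper names are illustrative. The sketch assumes every y value is positive, since ln y is undefined otherwise.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for Y = intercept + slope * X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by regressing ln(y) on x (requires y > 0)."""
    log_ys = [math.log(y) for y in ys]
    ln_a, b = fit_line(xs, log_ys)
    # Back-transform the intercept: ln a -> a.
    return math.exp(ln_a), b


# Hypothetical data lying exactly on y = 2 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a_hat, b_hat = fit_exponential(xs, ys)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

Note that fitting on the log scale minimizes squared errors in ln y rather than in y, so on noisy data the result can differ from a direct non-linear fit.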
• Non-linear transformations:
Model              Non-linear equation    Linear form                 Variable transformation    Parameter transformation
Standard linear    y = α + βx             None                        NA                         NA
Power              y = a·x^b              log y = log a + b log x     Y = log y, X = log x       α = log a, β = b
Logarithmic        y = ln(a·x^b)          y = ln a + b(ln x)          Y = y, X = ln x            α = ln a, β = b
Exponential        y = a·e^(bx)           ln y = ln a + bx            Y = ln y, X = x            α = ln a, β = b
Reciprocal         y = 1/(a + bx)         1/y = a + bx                Y = 1/y, X = x             α = a, β = b
Square root        y = (a + bx)²          √y = a + bx                 Y = √y, X = x              α = a, β = b
                   y = a + b√x            (already linear in √x)      Y = y, X = √x              α = a, β = b

Example 6
Fit the data using a suitable regression model.
Dose        0      1.3    2.8    5.0    10.2    16.5    21.3    31.8    52.2
Response    0.1    0.5    0.9    2.6    7.1     12.3    15.3    20.4    24.2
Plot a scatter diagram. There is a possible non-linear relationship between the variables.

Result from statistical software:
Model Summary and Parameter Estimates
Dependent Variable: response
               Model Summary                    Parameter Estimates
Equation       R Square    F           Sig.      Constant    b1        b2         b3
Linear         0.9294      92.0945     0.0000    1.2464      0.5116
Quadratic      0.9955      659.7932    0.0000    -1.0123     0.9378    -0.0087
Cubic          0.9972      601.7636    0.0000    -0.6141     0.7637    0.0014     -0.0001
Exponential    0.6200      11.4205     0.0118    0.8864      0.0875
The independent variable is dose.

Models that fit the data:
Linear: ______
Quadratic: ______
Cubic: ______

Chapter 6 CORRELATION

6.1 Correlation Analysis
• A group of statistical techniques used to measure the strength of the association between two variables.

6.2 Coefficient of Correlation
• A measure of the strength of the relationship between two variables.
• Ranges between -1 and 1.
• A value of 0 indicates that there is no linear relationship between the two variables.

6.2.1 Pearson's Correlation Coefficient, r
• Pearson product-moment correlation coefficient.
• A measure for interval-ratio variables that describes the strength and the direction of the linear relationship between the variables.
• $r = \pm\sqrt{r^2}$, the square root of the coefficient of determination, taking the sign of the slope b:
$r = \frac{\Sigma (X - \bar{X})(Y - \bar{Y})}{(n - 1) S_X S_Y} = \frac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2][n\Sigma Y^2 - (\Sigma Y)^2]}}$
• A symmetrical measure, thus the correlation between X and Y is identical to the correlation between Y and X.
• Guideline for the strength of a relationship:
Strength / Degree    Negative correlation    Positive correlation
Perfect              –1                      1
Very strong          –1.0 < r ≤ –0.8         0.8 ≤ r < 1.0
Strong               –0.8 < r ≤ –0.6         0.6 ≤ r < 0.8
Moderate             –0.6 < r ≤ –0.4         0.4 ≤ r < 0.6
Weak                 –0.4 < r ≤ –0.2         0.2 ≤ r < 0.4
Very weak            –0.2 < r < 0            0 < r < 0.2
No relationship      0                       0

Example 1
The weight of a car can influence the mileage that the car can obtain. Based on the given data, calculate and interpret the coefficient of correlation. Hence, calculate the coefficient of determination and interpret the result.
Weight (in hundreds of pounds)    23      25      28      30      35      35      40
Mileage (mpg)                     53.3    40.9    46.9    32.2    31.3    28.0    23.1
X = Weight, Y = Mileage
ΣX = ______   ΣX² = ______   ΣY = ______   ΣY² = ______   ΣXY = ______
$r = \frac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2][n\Sigma Y^2 - (\Sigma Y)^2]}}$ = ______
There exists a ______ linear relationship between the ______. When the ______ increases, the ______.
Coefficient of determination, r² = ______
______ of the variability in ______ can be predicted from the relationship with the ______.

Example 2
Use the Pearson product-moment correlation coefficient to illustrate the strength of the relationship between
i) Educational attainment and Internet usage;
ii) Age and Internet usage.
Educational Attainment (year)    Age    Internet Usage per week (hour)
10                               55     1
9                                60     0
12                               45     3
13                               35     4
19                               23     7
11                               40     2
16                               22     6
23                               27     9
14                               41     5
21                               30     8
Let r1 be the correlation coefficient between Educational attainment and Internet usage, and r2 be the correlation coefficient between Age and Internet usage.
[Refer Chapter 5, Example 4]
ΣX1 = 148, ΣY = 45, ΣX1Y = 794, ΣX1² = 2398, ΣY² = 285, n = 10
$r_1 = \frac{10(794) - (148)(45)}{\sqrt{[10(2398) - (148)^2][10(285) - (45)^2]}}$ = ______

ΣX2 = 378, ΣY = 45, ΣX2Y = 1391, ΣX2² = 15798, ΣY² = 285, n = 10
$r_2 = \frac{10(1391) - (378)(45)}{\sqrt{[10(15798) - (378)^2][10(285) - (45)^2]}}$ = ______

6.2.2 Spearman's Rank Correlation Coefficient
• Rank correlation
o Used to measure the strength of a relationship between variables that are measured on at least an ordinal scale. E.g., discipline and exam marks, job performance and qualification.
• Spearman's rank correlation coefficient, rs
o A measure of rank correlation.
o Can be used even though the variables to be correlated are not representable in numeric form.
• Spearman's rank correlation coefficient:
$r_s = 1 - \frac{6\Sigma D^2}{n(n^2 - 1)}$
where D = rX – rY, and rX / rY = rank of X / Y;
X and Y are the characteristics of the data.
• Guideline for the strength of a relationship:
Strength / Degree    Agreement between the rankings    Disagreement between the rankings
Perfect              1                                 –1
Very high            0.8 ≤ r < 1.0                     –1.0 < r ≤ –0.8
High                 0.6 ≤ r < 0.8                     –0.8 < r ≤ –0.6
Moderate             0.4 ≤ r < 0.6                     –0.6 < r ≤ –0.4
Low                  0.2 ≤ r < 0.4                     –0.4 < r ≤ –0.2
Very low             0 ≤ r < 0.2                       –0.2 < r < 0
No relationship      0                                 0

Example 3
Consider a musical talent contest where 10 competitors are evaluated by two judges, X and Y. The scores of the judges (out of 10) were as follows:
Contestant          1    2    3    4    5    6    7    8    9    10
Score by Judge X    5    9    3    8    6    7    4    8    4    6
Score by Judge Y    7    8    6    7    8    5    10   6    5    8
Describe the relationship between the scores by the judges using the Spearman rank correlation coefficient.
Contestant    Score by Judge X    Score by Judge Y    rX      rY      D = rX – rY    D²
1             5                   7                   ____    ____    ____           ____
2             9                   8                   ____    ____    ____           ____
3             3                   6                   ____    ____    ____           ____
4             8                   7                   ____    ____    ____           ____
5             6                   8                   ____    ____    ____           ____
6             7                   5                   ____    ____    ____           ____
7             4                   10                  ____    ____    ____           ____
8             8                   6                   ____    ____    ____           ____
9             4                   5                   ____    ____    ____           ____
10            6                   8                   ____    ____    ____           ____
ΣD² = ______
rs = ______
There is a ______ degree of ______ between the rankings of ______.
If Judge X evaluates a particular contestant with a higher score, then ______.

6.2.3 Comparison of Rank and Product Moment Correlation
• Product moment coefficient
o The standard measure of correlation.
o Data must be numeric.
• Rank coefficient
o An approximation to r.
o Easier to use, fewer calculations.
o Can be used with non-numeric data.
o Insensitive to small changes in actual values.

6.3 Other Measures of Association
• If one or both of the variables are nominal:
o Contingency Coefficient
o Phi and Cramer's V
o Lambda
• If both of the variables are ordinal:
o Gamma
o Kendall's tau-b
o Kendall's tau-c
* Dichotomies should be treated as ordinal.

6.3.1 Contingency Coefficient
• Ranges between 0 and 1, with higher values indicating a stronger association.
• Highly sensitive to the size of the table. The larger the number of categories, the closer the maximum value is to 1.

6.3.2 Phi and Cramer's V
• Vary between 0 and 1, regardless of the number of rows and columns.
• Nondirectional measures, with 0 indicating no association and 1 indicating perfect association.

6.3.3 Lambda, λ
• Asymmetrical measure of association; its value varies depending on which variable is considered the independent variable and which the dependent variable.
• Often underestimates the strength of the relationship.
• 3 versions of Lambda: one that you would use when one variable is the dependent variable, another that you would use if the other variable were dependent, and a third that you would use if you don't want to designate either of the variables as dependent.
• Ranges from 0 to 1.
• 0.0: nothing is gained by using the IV to predict the DV.
• 1.0: by using the IV as a predictor, we are able to predict the DV without any error.

6.3.4 Gamma, Kendall's tau-b, and Kendall's tau-c
• Symmetrical measures of association.
• Vary from -1.0 to 1.0 and provide an indication of the strength (absolute value) and direction (sign) of the association between the variables.
• Gamma will always be at least as large in absolute value as tau-b and tau-c.

Chapter 7 HYPOTHESIS TESTING

7.1 Introduction
• A hypothesis is a statement about a population parameter developed for the purpose of testing.
• Hypothesis testing is
o a procedure based on sample evidence and probability theory to determine whether the hypothesis is a reasonable statement, or
o an inferential procedure that uses the data from a sample to draw a general conclusion about a population.
• Steps in hypothesis testing:
Step 1: State the null and alternative hypotheses.
Step 2: Determine the significance level and the critical value.
Step 3: Identify the test statistic.
Step 4: Formulate a decision rule.
Step 5: Take a sample and calculate the value of the test statistic. Make a decision: 1. Reject H0; 2. Fail to reject H0.

7.2 Definition
• Null Hypothesis, H0
o A statement about the value of a population parameter.
o No effect, no change, or no significant difference.
o Includes =, ≤ or ≥. It always contains the equal sign because the null hypothesis is the statement to be tested, and we need a specific value to include in our calculations.
o Also used to state that there is no relationship between two variables.
• Alternative Hypothesis, H1
o Research hypothesis.
o Inverse, or opposite, of H0.
o Expressed in terms of population parameters, but its specific form varies from test to test.
o Can include ≠, > or <; directly contradicts H0.
o A statement that there is a statistically significant relationship between two variables.
o A statement that is accepted if the sample data provide sufficient evidence that H0 is false.
• Level of Significance, α
o The probability of rejecting H0 when it is true.
o Level of risk.
• Critical Region
o Composed of extreme sample values that are very unlikely to be obtained if the null hypothesis is true.
o If the outcome of a statistical test falls in the critical region, H0 is rejected.
• Critical Value
o The dividing point between the acceptance region and the rejection region (critical region).
o The boundary of the critical region.
o Based on the level of significance, the type of test, and the type of test statistic.
• Type I Error
o Rejecting H0 when it is true.
o In a typical research situation, a Type I error means the researcher concludes that a treatment does have an effect when in fact it has no effect.
o The probability of committing a Type I error is α.
• Type II Error
o Failing to reject H0 when it is false.
o In a typical research situation, a Type II error means that the hypothesis test has failed to detect a real treatment effect.
o The probability of committing a Type II error is β.
• Test Statistic
o A value, determined from sample information, used to determine whether to reject the null hypothesis.
• Decision Rule
o A statement of the specific condition under which the null hypothesis is rejected and the condition under which it is not rejected.

7.3 Hypothesis Testing for One Population Mean
• The claims are statements about a population mean, μ.
• Type of hypothesis test:
i) One-tailed test / Directional hypothesis test.
o H1 specifies either an increase (right-tailed test) or a decrease (left-tailed test) in the population mean score.
o Makes a statement about the direction of the effect.
o The rejection region is at the right tail or the left tail of the distribution.
ii) Two-tailed test / Non-directional hypothesis test.
o The primary concern is deciding whether a population mean is different from a specific value.
o The rejection region is in both tails of the distribution.
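The decision rules for left-tailed, right-tailed, and two-tailed tests can be sketched in Python as follows. This is a minimal sketch that assumes the test statistic follows the standard normal distribution (a z-statistic); the function name z_decision and its tail argument are illustrative. Critical values come from NormalDist.inv_cdf in the standard library (Python 3.8 or later).

```python
from statistics import NormalDist


def z_decision(z, alpha, tail):
    """Return True if H0 is rejected for test statistic z at level alpha.

    tail is "left", "right", or "two" (an illustrative convention).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        # Rejection region: z below the lower alpha quantile.
        return z < std_normal.inv_cdf(alpha)
    if tail == "right":
        # Rejection region: z above the upper alpha quantile.
        return z > std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        # Rejection region: alpha/2 in each tail.
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Hypothetical test statistics, for illustration only.
print(z_decision(-2.50, 0.01, "left"))   # left-tailed test at alpha = 0.01
print(z_decision(1.50, 0.05, "right"))   # right-tailed test at alpha = 0.05
print(z_decision(2.00, 0.05, "two"))     # two-tailed test at alpha = 0.05
```

For a two-tailed test the significance level is split equally between the two tails, which is why the critical value is taken at 1 – α/2.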
Sign in H0    Sign in H1    Type of Test         Keywords (H1 / H0)
≤             >             Right-tailed test    more / not more than, at most
=             ≠             Two-tailed test      different, change / same, equal
≥             <             Left-tailed test     less / not less than, at least

• Type of test statistic:
i) If σ, the population standard deviation, is known, the test statistic is the z-test,
$z = \frac{\text{sample mean} - \text{hypothesized population mean}}{\text{standard error between } \bar{x} \text{ and } \mu} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
ii) If σ is unknown but n ≥ 30, the test statistic is the z-test where σ is estimated by s, the sample standard deviation,
$z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
iii) If σ is unknown and n < 30, the test statistic is the t-statistic, where σ is estimated by s,
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ with (n – 1) degrees of freedom

Percentage Points of the Normal Distribution
The table gives the 100α percentage points, $u_\alpha$, of a standardised Normal distribution, where
$\alpha = \frac{1}{\sqrt{2\pi}} \int_{u_\alpha}^{\infty} e^{-u^2/2}\, du$.
Thus, $u_\alpha$ is the value of a standardised Normal variate which has probability α of being exceeded.
α        uα        α        uα        α           uα
0.50     0.0000    0.029    1.8957    0.009       2.3656
0.45     0.1257    0.028    1.9110    0.008       2.4089
0.40     0.2533    0.027    1.9268    0.007       2.4573
0.35     0.3853    0.026    1.9431    0.006       2.5121
0.30     0.5244    0.025    1.9600    0.005       2.5758
0.25     0.6745    0.024    1.9774    0.004       2.6521
0.20     0.8416    0.023    1.9954    0.003       2.7478
0.15     1.0364    0.022    2.0141    0.002       2.8782
0.10     1.2816    0.021    2.0335    0.001       3.0902
0.05     1.6449    0.020    2.0537    0.0005      3.2905
0.048    1.6646    0.019    2.0749    0.0001      3.7190
0.046    1.6849    0.018    2.0969    0.00005     3.8906
0.044    1.7060    0.017    2.1201    0.00001     4.2649
0.042    1.7279    0.016    2.1444    0.000005    4.4172
0.040    1.7507    0.015    2.1701
0.038    1.7744    0.014    2.1973
0.036    1.7991    0.013    2.2262
0.034    1.8250    0.012    2.2571
0.032    1.8522    0.011    2.2904
0.030    1.8808    0.010    2.3263

Critical Values of Student's t Distribution
[Figure: t distribution curves. One-tailed tests: area α in one tail. Two-tailed tests: area α/2 in each tail.]
Significance levels: one-tailed tests α = 0.10, 0.05, 0.02, 0.01; two-tailed tests α = 0.10, 0.05, 0.02, 0.01.
The critical values are listed column by column, in order of degrees of freedom 1, 2, 3, ..., 30, 35; the four values following each group of four columns are for infinite degrees of freedom.
One-tailed, α = 0.10: 3.078 1.886 1.638 1.533 1.476 1.440 1.415 1.397 1.383 1.372 1.363 1.356 1.350 1.345 1.341
1.337 1.333 1.330 1.328 1.325 1.323 1.321 1.319 1.318 1.316 1.315 1.314 1.313 1.311 1.310 1.306
One-tailed, α = 0.05: 6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 1.690
One-tailed, α = 0.02: 15.895 4.849 3.482 2.999 2.757 2.612 2.517 2.449 2.398 2.359 2.328 2.303 2.282 2.264 2.249 2.235 2.224 2.214 2.205 2.197 2.189 2.183 2.177 2.172 2.167 2.162 2.158 2.154 2.150 2.147 2.133
One-tailed, α = 0.01: 31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.438
Infinite degrees of freedom (one-tailed, α = 0.10, 0.05, 0.02, 0.01): 1.282 1.645 2.054 2.326
Two-tailed, α = 0.10: 6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 1.690
Two-tailed, α = 0.05: 12.706 4.303 3.182 2.776 2.571 2.447 2.365 2.306 2.262 2.228 2.201 2.179 2.160 2.145 2.131 2.120 2.110 2.101 2.093 2.086 2.080 2.074 2.069 2.064 2.060 2.056 2.052 2.048 2.045 2.042 2.030
Two-tailed, α = 0.02: 31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.438
Two-tailed, α = 0.01: 63.657 9.925 5.841 4.604 4.032 3.707 3.499 3.355 3.250 3.169 3.106 3.055 3.012 2.977 2.947 2.921 2.898 2.878 2.861 2.845 2.831 2.819 2.807 2.797 2.787 2.779 2.771 2.763 2.756 2.750 2.724
Infinite degrees of freedom (two-tailed, α = 0.10, 0.05, 0.02, 0.01): 1.645 1.960 2.326 2.576

Hypothesis                    Test Statistic    Critical Value    Critical Region
Left-tailed test              z-statistic       –z_α              z < –z_α
H0: μ ≥ μ0, H1: μ < μ0        t-statistic       –t_{α, n–1}       t < –t_{α, n–1}
Right-tailed test             z-statistic       z_α               z > z_α
H0: μ ≤ μ0, H1: μ > μ0        t-statistic       t_{α, n–1}        t > t_{α, n–1}
Two-tailed test               z-statistic       z_{α/2}           z > z_{α/2} or z < –z_{α/2}
H0: μ = μ0, H1: μ ≠ μ0        t-statistic       t_{α/2, n–1}      t > t_{α/2, n–1} or t < –t_{α/2, n–1}

• Refer to the Standard Normal z table or the Student's t table for the critical value.
• Compare the test statistic to the critical value and make a decision to reject or not to reject the null hypothesis.
• Interpret the results of the test.

Example 1
It is known that, nationally, doctors working for health maintenance organizations (HMOs) average 13.5 years of experience in their specialties, with a standard deviation of 7.6 years. The executive director of an HMO in a Western state is interested in determining whether its doctors have less experience than the national average. A random sample of 150 doctors from HMOs shows a mean of only 10 years of experience. Test at the 0.01 level of significance.
Let μ be the true population mean number of years of experience.
H0: μ ≥ 13.5
H1: μ < 13.5 (Claim, left-tailed test)
Since σ is known, the z test is used.
α = 0.01, critical value = –z0.01 = –2.3263
[Figure: standard normal curve with the rejection region to the left of –2.3263.]
Test statistic,
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{10 - 13.5}{7.6 / \sqrt{150}} = -5.6403$
If z < –2.3263, H0 is rejected. Otherwise, we fail to reject H0.
Since z = –5.6403 < –2.3263, H0 is rejected.
Therefore, the doctors have less experience than the national average at the 0.01 level of significance.

Example 2
The average cost of a hotel room in town A is said to be $168 per night. To determine if this is true, a random sample of 25 hotels is taken, resulting in a mean of $172.50 and a standard deviation of $15.40. Test the appropriate hypothesis at the 0.05 level of significance.
Let μ be the true population mean cost of a hotel room per night.
H0: ______
H1: ______
Since σ is not given and n < 30, the test statistic is the t test.
α = 0.05, df = ______, critical value = ______
Test statistic,
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ = ______
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Therefore, we can conclude that the average cost of a hotel room in town A is ______

7.4 Hypothesis Testing for Two Population Means

7.4.1 Independent Groups
• Compare the means of two independent populations and test the hypothesis about μ1 – μ2.
• E.g., a social psychologist may want to compare men and women in terms of their attitudes towards abortion.
• Assumptions:
i) The observations within each sample must be independent.
ii) The two populations from which the samples are selected must be normal.
iii) The two populations from which the samples are selected must have equal variances.
• Types of test statistic:
i) If σ1 and σ2 are known, the test statistic is the z-test,
$z = \frac{\text{sample mean difference} - \text{hypothesized population mean difference}}{\text{estimated standard error}} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$.
If $\sigma^2 = \sigma_1^2 = \sigma_2^2$,
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$.
ii) If σ1 and σ2 are unknown but n1 ≥ 30 and n2 ≥ 30, the test statistic is the z-test where σ1 and σ2 are estimated by s1 and s2, the sample standard deviations,
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
iii) If σ1 and σ2 are unknown and n1 < 30 and n2 < 30, the test statistic is the t-statistic, where σ1 and σ2 are estimated by s1 and s2,
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_w \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ with $df = n_1 + n_2 - 2$,
where $s_w^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ (pooled variance).

Hypothesis                                     Test Statistic    Critical Value           Critical Region
Left-tailed test                               z-statistic       –z_α                     z < –z_α
H0: μ1 – μ2 ≥ d0, H1: μ1 – μ2 < d0             t-statistic       –t_{α, n1+n2–2}          t < –t_{α, n1+n2–2}
Right-tailed test                              z-statistic       z_α                      z > z_α
H0: μ1 – μ2 ≤ d0, H1: μ1 – μ2 > d0             t-statistic       t_{α, n1+n2–2}           t > t_{α, n1+n2–2}
Two-tailed test                                z-statistic       z_{α/2}                  z > z_{α/2} or z < –z_{α/2}
H0: μ1 – μ2 = d0, H1: μ1 – μ2 ≠ d0             t-statistic       t_{α/2, n1+n2–2}         t > t_{α/2, n1+n2–2} or t < –t_{α/2, n1+n2–2}

Example 3
The salaries of 35 faculty members from private institutions and 30 faculty members from public institutions are randomly and independently selected. Their annual salaries ($000) are recorded and the summary of the information is as follows.
Private Institutions    Public Institutions
x̄1 = 98.19              x̄2 = 83.18
s1 = 26.21              s2 = 23.95
n1 = 35                 n2 = 30
At the 5% significance level, do the data provide evidence to conclude that mean salaries for faculty in private and public institutions differ?
Let μ1 be the true population mean annual salary for faculty members from private institutions; and
μ2 be the true population mean annual salary for faculty members from public institutions.
H0: ______
H1: ______
σ1 and σ2 are unknown but n1 ≥ 30 and n2 ≥ 30, so the z test is used.
α = 0.05, critical value = ______
Test statistic,
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ = ______
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Therefore, the data provide ______

Example 4
A sample of 10 children from City A showed that the mean time they spent watching television is 28.50 hours per week with a standard deviation of 4 hours. Another sample of 15 children from City B showed that the mean time they spent watching television is 23.25 hours per week with a standard deviation of 5 hours. Using a 1% level of significance, can you conclude that the mean time spent watching television by children in City A is greater than that for children in City B? Assume that the standard deviations for the two populations are equal.
Let μA be the true population mean time spent watching television by children in City A; and
μB be the true population mean time spent watching television by children in City B.
H0: ______
H1: ______
σ1 and σ2 are unknown, n1 < 30 and n2 < 30, so the t test is used.
α = 0.01, df = 10 + 15 – 2 = 23, critical value = ______
sw² = ______
Test statistic, t = ______
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Thus, we are 99% confident that the mean time spent watching television by children in City A ______

7.4.2 Correlated Groups
• A single sample of individuals is measured more than once on the same dependent variable. The same subjects are used in all the treatment conditions.
o E.g., a clinical psychologist may want to evaluate a therapy technique by comparing depression scores for patients before therapy with their scores after therapy.
• In a matched-subjects study, each individual in one sample is matched with a subject in the other sample. The matching is done so that the two individuals are equivalent (or nearly equivalent) with respect to a specific variable that the researcher would like to control.
• Assumptions:
o The observations within each treatment condition must be independent.
o The population distribution of difference scores (D values) must be normal.
• The t test begins by computing a difference between the first and second measurements for each subject (or the difference for each matched pair).
o The difference scores are obtained by D = X2 – X1.
o The mean difference, $\bar{x}_D = \frac{\Sigma D}{n}$, where ΣD is the sum of differences.
o The test statistic is $t = \frac{\bar{x}_D - \mu_D}{s_D / \sqrt{n}}$ with df = (n – 1), where $s_D = \sqrt{\frac{\Sigma D^2 - \frac{(\Sigma D)^2}{n}}{n - 1}}$.

Hypothesis                    Test Statistic    Critical Value    Critical Region
Left-tailed test              t-statistic       –t_{α, n–1}       t < –t_{α, n–1}
H0: μD ≥ 0, H1: μD < 0
Right-tailed test             t-statistic       t_{α, n–1}        t > t_{α, n–1}
H0: μD ≤ 0, H1: μD > 0
Two-tailed test               t-statistic       t_{α/2, n–1}      t > t_{α/2, n–1} or t < –t_{α/2, n–1}
H0: μD = 0, H1: μD ≠ 0

Example 5
The following data are the weights of a group of 10 participants in a study, before and after administration of a drug proposed to result in weight loss. At the α = 0.05 level of significance, do these data provide sufficient evidence to indicate that the drug will help in reducing weight?
Subject        1       2       3       4       5       6       7       8       9       10
Before         55.4    63.9    60.1    78.8    59.2    68.7    70.0    69.2    84.9    75.3
After          55.2    63.6    58.8    77.2    58.5    69.2    70.0    68.9    83.9    74.8
D = xa – xb    ____    ____    ____    -1.6    -0.7    0.5     0.0     -0.3    -1.0    -0.5
ΣD = –5.4, x̄D = –0.54, ΣD² = 6.46, sD = 0.6275
Let μD be the true population mean difference between the weights before and after the administration of the drug, where D = xafter – xbefore.
H0: ______
H1: ______
df = 10 – 1 = 9, α = 0.05, critical value = ______
Test statistic,
$t = \frac{\bar{x}_D - \mu_D}{s_D / \sqrt{n}}$ = ______
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Therefore, these data ______

Example 6
Listed below are brain volumes (cm³) of 10 pairs of twins. Use α = 0.10 to test the claim that there is no difference in brain volumes between the first-born and the second-born twins.
First Born     1005    1035    1281    1051    1034    1079    1104    1439    1029    1160
Second Born    963     1027    1272    1079    1070    1173    1067    1347    1100    1204
D = X1 – X2    ____    ____    9       -28     -36     -94     37      92      -71     -44
D²             ____    ____    81      784     1296    8836    1369    8464    5041    1936
Let μD be the true population mean difference in brain volumes between the first-born and second-born twins.
H0: ______
H1: ______
df = ______, α = 0.1, critical value = ______
Test statistic,
$t = \frac{\bar{x}_D - \mu_D}{s_D / \sqrt{n}}$ = ______
If t ______, H0 is rejected. Otherwise, we fail to reject H0.
Since –1.833 < t = –0.4742 < 1.833, we fail to reject H0.
Therefore, there is ______

7.5 The Chi-Square Test
• An inferential statistical technique designed to test hypotheses about qualitative variables.
• Used to test the
i) shape of the distribution of a variable (Goodness of Fit Test);
ii) significance of the relationship between two variables (Independence Test);
iii) comparison of the distributions of a variable between two or more populations (Homogeneity Test).
• Relies on the Chi-square distribution, χ².
• Critical value = χ²(α, df); critical region: χ² > χ²(α, df).
[Figure: chi-square distribution curve showing the acceptance region and the rejection region in the right tail.]
• Test statistic,
$\chi^2 = \Sigma \frac{(f_o - f_e)^2}{f_e}$ with a specific degree of freedom, df,
where fo: observed frequency (from the sample)
fe: expected frequency (predicted from H0)

7.5.1 Goodness of Fit Test
• Determines how well the obtained sample proportions fit the population proportions specified by H0.
• The null hypothesis assumes that there is no significant difference between the observed and expected distributions.
• The alternative hypothesis states that the population distribution has a different shape from that specified in H0.
• Degrees of freedom, df = C – 1, where C is the number of categories.
• fe = np, where n is the sample size and p is the proportion stated in H0.

Chi-square (χ²) Distribution
        Proportion in Critical Region
df      0.1        0.05       0.025      0.01       0.005
1       2.706      3.841      5.024      6.635      7.879
2       4.605      5.991      7.378      9.210      10.597
3       6.251      7.815      9.348      11.345     12.838
4       7.779      9.488      11.143     13.277     14.860
5       9.236      11.070     12.833     15.086     16.750
6       10.645     12.592     14.449     16.812     18.548
7       12.017     14.067     16.013     18.475     20.278
8       13.362     15.507     17.535     20.090     21.955
9       14.684     16.919     19.023     21.666     23.589
10      15.987     18.307     20.483     23.209     25.188
11      17.275     19.675     21.920     24.725     26.757
12      18.549     21.026     23.337     26.217     28.300
13      19.812     22.362     24.736     27.688     29.819
14      21.064     23.685     26.119     29.141     31.319
15      22.307     24.996     27.488     30.578     32.801
16      23.542     26.296     28.845     32.000     34.267
17      24.769     27.587     30.191     33.409     35.718
18      25.989     28.869     31.526     34.805     37.156
19      27.204     30.144     32.852     36.191     38.582
20      28.412     31.410     34.170     37.566     39.997
21      29.615     32.671     35.479     38.932     41.401
22      30.813     33.924     36.781     40.289     42.796
23      32.007     35.172     38.076     41.638     44.181
24      33.196     36.415     39.364     42.980     45.559
25      34.382     37.652     40.646     44.314     46.928
26      35.563     38.885     41.923     45.642     48.290
27      36.741     40.113     43.195     46.963     49.645
28      37.916     41.337     44.461     48.278     50.993
29      39.087     42.557     45.722     49.588     52.336
30      40.256     43.773     46.979     50.892     53.672
40      51.805     55.758     59.342     63.691     66.766
50      63.167     67.505     71.420     76.154     79.490
60      74.397     79.082     83.298     88.379     91.952
70      85.527     90.531     95.023     100.425    104.215
80      96.578     101.879    106.629    112.329    116.321
90      107.565    113.145    118.136    124.116    128.299
100     118.498    124.342    129.561    135.807    140.169

Example 7
The human resources manager at a company is concerned about absenteeism among hourly workers. She decides to sample the records to determine whether absenteeism is distributed evenly throughout the six working days. The sample results are as follows:
Day           Mon    Tues    Wed    Thurs    Fri    Sat
No.
Absent 12 9 11 10 9 9 Use 0.01 level of significance to test the hypothesis. H0 : In the general population, the absenteeism is distributed evenly throughout the six working days and the distribution of the absenteeism is as follows: Day Mon Tues Wed Thurs Fri Sat p 1/6 1/6 1/6 1/6 1/6 1/6 H1 : The absenteeism is not distributed evenly throughout the six working days. = 0.01, df = 6 – 1 = 5, Critical value = Day Mon Tues Wed Thurs Fri Sat 𝑓𝑜 12 9 11 10 9 9 𝑓𝑒 10 10 10 10 10 60 Test statistic, (𝑓𝑜 − 𝑓𝑒 )2 2 χ =Σ 𝑓𝑒 If , H0 is rejected. Otherwise, it is failed to reject H0. Since Therefore, the absenteeism is Example 8 The American Accounting Association classifies accounts receivable as “current”, “late” and “not collectable”. Industry figures shows that 60 percent of accounts receivable are current, 30 percent are late, and 10 percent are not collectable. An accountancy firm has 500 accounts receivable: 320 are current, 120 are late, and 60 are not collectable. Are these numbers in agreement with the industry distribution? Use 0.05 level of significance. H0: In the general population, the distribution of the classification of the accounts receivable is as follows: 60% current, 30% late, and 10% not collectable H1: The distribution of the classification of the accounts receivable is different from that specified in H0. = 0.05, df = , Critical value = 15 BHMC3004 𝑓𝑜 Chapter 7 Current Late Not Collectible Total 320 120 60 500 𝑓𝑒 (𝑓𝑜 − 𝑓𝑒 )2 χ =Σ 𝑓𝑒 2 If , H0 is rejected. Otherwise, it is failed to reject H0. Since Therefore, we are 95% confident that the distribution of accounts 7.5.2 Independence Test • The null hypothesis always states that the two variables are independent or there is no consistent, predictable relationship between them. • The data are presented in the form of matrix, called as a contingency table. • df = (R – 1)(C – 1) where R is the number of rows and C is the number of columns. 
• Expected frequency,

$$f_e = \frac{\text{Column total} \times \text{Row total}}{\text{Grand total}}$$

Example 9
Recent recession and bad economic conditions forced many people to hold more than one job. A sample of 500 persons who held more than one job produced the following two-way table. Test at the 5% level of significance whether gender and marital status are related for all people who hold more than one job.

          Single   Married   Other
Male          72       209      39
Female        33       102      45

H0: In the general population, there is no relationship between gender and marital status for all people who hold more than one job.
H1: There is a consistent and predictable relationship between gender and marital status for all people who hold more than one job.

α = 0.05, df = ____, critical value = ____

fo (fe)   Single       Married         Other         Total
Male      72 (____)    209 (199.04)    39 (53.76)      320
Female    33 (____)    102 (111.96)    45 (30.24)      180
Total     105          311             84              500

Test statistic,

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \text{\_\_\_\_}$$

If ____, H0 is rejected. Otherwise, we fail to reject H0.
Since ____
Therefore, gender and marital status are ____

7.5.3 Homogeneity Test
• The null hypothesis states that the populations are homogeneous with respect to the variable.
• The steps for carrying out the independence test and the homogeneity test are the same.

Example 10
In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 300 first graders: 100 boys and 200 girls. Each child is asked which of the following TV programs they like best: The Lone Ranger, Sesame Street, or The Simpsons. Results are shown in the contingency table below.

                  Viewing Preferences
          Lone Ranger   Sesame Street   The Simpsons   Total
Boys               50              30             20     100
Girls              50              80             70     200
Total             100             110             90     300

Do the boys’ preferences for these TV programs differ significantly from the girls’ preferences? Use a 0.05 level of significance.

H0: In the general population, the boys’ preferences for these TV programs do not differ significantly from the girls’ preferences.
H1: The boys’ preferences for these TV programs differ significantly from the girls’ preferences.

α = 0.05, df = ____, critical value = ____

fo (fe)   Lone Ranger   Sesame Street   The Simpsons   Total
Boys      50 (____)     30 (36.7)       20 (30)          100
Girls     50 (____)     80 (73.3)       70 (60)          200
Total     100           110             90               300

Test statistic,

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \text{\_\_\_\_}$$

If ____, H0 is rejected. Otherwise, we fail to reject H0.
Since ____
Therefore, the boys' preferences for these TV programs ____

7.6 p-Value in Hypothesis Testing
• p-Value
  o The probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true.
  o If p-value ≤ α, H0 is rejected. Otherwise, we fail to reject H0.

BHMC3004 Chapter 8

Chapter 8 Cross Tabulation

8.1 Introduction
• A technique for analysing the relationship between two or more nominal or ordinal variables that have been organized in a table.
• A type of bivariate analysis, a statistical method designed to detect and describe the relationship between two nominal or ordinal variables.

8.2 Bivariate Table
• Contingency table: a joint frequency distribution of two nominal or ordinal variables.
• r × c table
  o r: number of rows
  o c: number of columns
• Characteristics:
  o Title: Description of the variables
  o Column variable: Independent variable
  o Row variable: Dependent variable
  o Order the categories from lowest to highest: from left to right across the columns; from top to bottom along the rows.
  o Cell: Intersection of a row and a column
  o Marginal: Row and column totals
  o Source of data
• E.g., 2 × 2 Contingency Table

                        Independent Variable
  Dependent Variable    I1        I2        Total
  D1                    a         b         a + b
  D2                    c         d         c + d
  Total                 a + c     b + d     a + b + c + d

  (The independent variable is the column variable and the dependent variable is the row variable. Each of a, b, c, and d is a cell; the totals are the marginals.)

• Two basic rules:
  1. Calculate percentages within each category of the independent variable (IV).

                        Independent Variable
  Dependent Variable    I1                   I2
  D1                    a/(a + c) × 100%     b/(b + d) × 100%
  D2                    c/(a + c) × 100%     d/(b + d) × 100%
  Total                 100% (Total I1)      100% (Total I2)

  2. Interpret the table by comparing the percentage point difference for different categories of the independent variable.
     o Limit comparisons to categories with at least a 10 percentage point difference.
     o For a 2 × 2 table, only one comparison is needed for interpretation.

8.3 Properties of a Bivariate Relationship
1. Existence of a relationship
   • Percentage distributions vary across the different categories of the independent variable.
2. Strength of the relationship
   • The larger the percentage difference across the categories, the stronger the association.
   • Percentage differences are a rough indicator of the strength of a relationship between two variables.
3. Direction of the relationship
   • Applicable to variables at the ordinal or interval-ratio level.
   • Positive relationship: the variables vary in the same direction (both go up or both go down).
   • Negative relationship: the variables vary in opposite directions (when one goes up, the other goes down).

Example 1
Refer to the following bivariate table and describe the relationship between race and home ownership.

Home Ownership by Race

                        Race
  Home Ownership    Black   White   Total
  Own                   3       7      10
  Rent                  6       4      10
  Total                 9      11      20

Let Race be the independent variable.

                        Race
  Home Ownership    Black       White        Total
  Own               33%         64%          50%
  Rent              67%         36%          50%
  Total (N)         100% (9)    100% (11)    100% (20)

There is a 31 percentage point difference between the percentage of white homeowners (64%) and black homeowners (33%). In other words, in this group, whites are more likely to be homeowners than blacks. Therefore, we can conclude that one’s race appears to be associated with the likelihood of being a homeowner.

Example 2
Analyse the following bivariate table to examine whether the frequency of church attendance by respondents had an effect on their support for abortion.
Support for abortion was measured with the following question: “Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if the woman wants it for any reason.” Frequency of church attendance was determined by asking respondents to indicate how often they attend religious services.

Support for Abortion by Church Attendance

                      Church Attendance
  Abortion     Never         Infrequently   Frequently    Total
  Yes          55%           50%            26%           43%
  No           45%           50%            74%           57%
  Total (N)    100% (111)    100% (212)     100% (157)    100% (480)

Let the hypothesis be that those who attend church frequently are more likely to be pro-life.

We may observe that the percentage that supports abortion changes across ____
Thus, the table indicates ____
Besides that, the largest percentage difference between respondents who ____
The differences indicate a ____

Example 3
Describe the direction of the relationship based on the following bivariate tables.

i) Health Condition by Social Class

                      Class
  Health       Low          Middle        High
  Poor         39%          12%           9%
  Fair         36%          45%           28%
  Good         25%          43%           63%
  Total (N)    100% (39)    100% (254)    100% (202)

  * As “class” goes up, “health” goes up. There is a ____

ii) Frequency of Trauma by Social Class

                      Class
  Trauma       Low          Middle        High
  0            31%          41%           48%
  1            22%          42%           20%
  2+           47%          17%           32%
  Total (N)    100% (48)    100% (220)    100% (180)

  * As “class” goes up, “trauma” goes down. There is a ____
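
The first basic rule of Section 8.2 (calculate percentages within each category of the independent variable) can be sketched in Python. This is only an illustration, not part of the course procedure: the function name `column_percentages` and the dictionary layout are assumptions made for the sketch, and the counts are those of Example 1 (Home Ownership by Race).

```python
# Hypothetical helper (not from the course notes): percentages within each
# category of the independent variable, i.e. column percentages.

def column_percentages(table):
    """Return each cell as a percentage of its column total.

    `table` maps a row label (dependent variable category) to a dict of
    column label (independent variable category) -> observed count.
    """
    # Column totals are the marginals of the independent variable.
    column_totals = {}
    for row in table.values():
        for column, count in row.items():
            column_totals[column] = column_totals.get(column, 0) + count

    return {
        row_label: {
            column: 100 * count / column_totals[column]
            for column, count in row.items()
        }
        for row_label, row in table.items()
    }


# Observed counts from Example 1 (Home Ownership by Race).
home_ownership = {
    "Own": {"Black": 3, "White": 7},
    "Rent": {"Black": 6, "White": 4},
}

percentages = column_percentages(home_ownership)

# Percentage point difference in home ownership between the two categories
# of the independent variable, using rounded percentages as in the notes.
difference = round(percentages["Own"]["White"]) - round(percentages["Own"]["Black"])

for row_label, row in percentages.items():
    print(row_label, {column: f"{value:.0f}%" for column, value in row.items()})
print("Percentage point difference (Own):", difference)
```

Dividing by the column totals, rather than the row totals, follows the convention in these notes that the independent variable is the column variable. The resulting difference can then be checked against the 10 percentage point guideline in the second rule.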