DRAFT 02/15/16 DRAFT
Chapter 1
INTRODUCTION
Gratzer & Jantzen

Welcome. The goal of this book is to introduce you to some of the ways that health care professionals use statistical methods as guides in the decision-making process. The basic goal of a statistical investigation or study is to learn something about the group of objects (e.g., people, hospitals, vendors) under study. The health care professional may, for example, be interested in knowing whether the average response times of two ambulance companies are the same, or whether there has been a change in the proportion of people insured by certain health insurance companies. Before we begin answering such questions we must agree on the language that will be used. This chapter will introduce the language of statistics and familiarize you with some basic concepts.

A population is defined as the complete collection of objects with which a study is concerned. It is often impossible to collect the desired information for the entire population of interest. Consider a study, funded by a manufacturer of exercise equipment, that has as its goal the publication of the average weight of a citizen of California. Can the data collection staff hired to gather these measurements ever complete their task? People move to other states, people from other states move to California, people die, children are born, people gain and lose weight. Because a complete and accurate compilation of data will never be achieved, investigators must draw conclusions based on information gathered from a subset of that population, that is, from a sample. Thus, the statistician attempts to estimate a characteristic of a population by measuring that same characteristic on a sample. A measurable characteristic of a population is known as a parameter. A descriptive measure that is calculated entirely from the observations in a sample is called a statistic.
The actual average weight of a Californian is a parameter; the average calculated from the measurements taken on the people in the sample is a statistic.

In today’s society we are presented with more numeric information than at any time in history. Thus, a basic knowledge of statistics has become an indispensable tool for understanding our world. However, if you ask a typical person “What is statistics?” you usually get an answer suggesting that statistics is little more than batting averages, points per game, the proportion of voters in favor of a candidate, or the proportion of hospital beds filled weekly. Let us take a moment to explore this view of the nature of statistics.

During the 1997 National Basketball Association finals Michael Jordan averaged 32.3 points per game. During the 1990 season the New York Yankees’ average attendance at home games was 24,770.9, while the New York Mets averaged 33,737.6 at their home games. The average distance an Iona College alumnus traveled to the 1997 homecoming game was 143.6 miles.

All of the above pieces of information present the reader with an “average”, which we will assume was calculated by adding a series of measurements and then dividing by the number of measurements. However, not all averages are created equal. Jordan’s reported average is most probably, to within rounding error, an exact summary, since the limited data on which it is based are accurately recorded and easily obtainable. The attendance averages are also calculated from easily obtainable and accurately recorded data. However, they may not be directly comparable, because in 1990 the manner in which attendance at major league baseball games was counted differed from league to league. It is unreasonable to believe that every alumnus at the 1997 Iona College homecoming game was found and asked the distance he or she traveled to get to the game.
Therefore, at best this number is a summary based on representative but incomplete information. Thus, even though each of the above situations presented an “average” calculated with the same formula, the critical reader should realize that the information contained in these summaries is not quite the same.

Even when a statistic is well defined and seemingly well understood it may hold some surprises. Consider a fictitious hospital that needs a pacemaker and calls to order one. The manufacturer has one in stock and will have it delivered by one of its two regular delivery companies, the Speed-D Delivery Co. or the Prompt Delivery Co. The manager tells her assistant, “Prompt has a better overall on-time delivery rate, but their out-of-state on-time delivery rate is not as good as Speed-D’s, so call Speed-D for this delivery.” While dialing, the assistant notices that the delivery address is in-state and hangs up to call Prompt. The manager says, “Why are you calling Prompt? Speed-D’s in-state on-time delivery rate is better than Prompt’s. Call Speed-D.” The assistant is confused and mutters, “Prompt has a higher on-time delivery rate than Speed-D. How can Speed-D have a higher on-time rate both in-state and out-of-state?”

What’s going on? Has the manager made a terrible, possibly life-threatening blunder? The answer to this question can be found in the summary of these companies’ on-time records below.

                 Speed-D           Prompt
Out-of-State     71/80 = .888      26/30 = .867
In-State         19/20 = .950      66/70 = .943
Overall          90/100 = .900     92/100 = .920

We see that the manager has not made a mistake. The seemingly contradictory statement that Speed-D has a higher on-time delivery rate both in and out of state, but a lower overall on-time rate than Prompt, is in fact true. The reversal occurs because the two companies’ deliveries are distributed differently: 80 of Speed-D’s 100 deliveries were out-of-state, where on-time rates are lower for both companies, while 70 of Prompt’s 100 deliveries were in-state. Thus, we have learned that numbers must be carefully studied to be fully understood.
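
The arithmetic behind the delivery table can be checked with a short program. The following is only a sketch in Python; the names `records`, `rate` and `overall_rate` are illustrative choices, and the counts are copied from the table above. It recomputes each regional rate and then pools the two regions to obtain the overall rate.

```python
from fractions import Fraction

# (on-time deliveries, total deliveries), copied from the table in the text.
records = {
    "Speed-D": {"Out-of-State": (71, 80), "In-State": (19, 20)},
    "Prompt": {"Out-of-State": (26, 30), "In-State": (66, 70)},
}


def rate(on_time, total):
    """On-time rate as an exact fraction."""
    return Fraction(on_time, total)


def overall_rate(company):
    """Pool both regions: all on-time deliveries over all deliveries."""
    on_time = sum(n for n, _ in records[company].values())
    total = sum(t for _, t in records[company].values())
    return Fraction(on_time, total)


for region in ("Out-of-State", "In-State"):
    s = rate(*records["Speed-D"][region])
    p = rate(*records["Prompt"][region])
    print(f"{region}: Speed-D {float(s):.3f}, Prompt {float(p):.3f}")

print(f"Overall: Speed-D {float(overall_rate('Speed-D')):.3f}, "
      f"Prompt {float(overall_rate('Prompt')):.3f}")
```

The overall rate is not the simple average of the two regional rates. It weights each region by the number of deliveries made there, which is why the comparison can reverse when the regions are combined.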
The following definition of statistics expands the layman’s concept of statistics as merely numbers to a process that attempts to uncover the information contained in the numbers. Statistics is defined as methods of collecting, organizing, summarizing and generalizing the information contained in measurements taken from a sample.

The information with which statistics is concerned is called data. The data used in statistics are numbers. The numbers used are either measurements (e.g., height, cholesterol level, cost of hospital stay) or counts (e.g., number of patients admitted for hypothermia, number of patients with blue eyes). The smallest object or individual that can be investigated and measured is called a unit. Units are the source of the basic information in statistical studies.

Data can be collected in different ways. This fact gives rise to the measurement scales listed below.

a) Nominal Scale: Classification of units into unordered categories. Such unordered categories might include male-female, married-single-divorced-widowed, and cause of death.
b) Ordinal Scale: Units placed into ordered categories. Such ordered categories might include socioeconomic status, or convalescing patients rated as unimproved-improved-well.
c) Interval Scale: Distances between consecutive values are equal. However, the zero point is arbitrary. An example of an interval scale is degrees Fahrenheit. One can add and subtract temperatures, but it makes no sense to multiply or divide them: 20 degrees is not twice as warm as 10 degrees.
d) Ratio Scale: Distances between consecutive values are equal, with a non-arbitrary zero. An example of a ratio scale is height. You can add, subtract, multiply and divide height measurements.
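
The difference between the interval and ratio scales can be illustrated with a unit conversion. The sketch below is written in Python; the function names and the sample heights of 72 and 36 inches are assumptions made only for illustration. It shows that the ratio of two Fahrenheit readings changes when the same temperatures are expressed in Celsius, while the ratio of two heights is the same in inches and in centimeters.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion; note that the two scales have different zero points."""
    return (f - 32) * 5 / 9


def inches_to_cm(inches):
    """Standard conversion; zero inches is also zero centimeters."""
    return inches * 2.54


# Interval scale: the ratio depends on the arbitrary zero point.
ratio_f = 20 / 10
ratio_c = fahrenheit_to_celsius(20) / fahrenheit_to_celsius(10)

# Ratio scale: the ratio survives a change of units.
ratio_in = 72 / 36
ratio_cm = inches_to_cm(72) / inches_to_cm(36)

print(f"Temperature ratio in Fahrenheit: {ratio_f:.3f}, in Celsius: {ratio_c:.3f}")
print(f"Height ratio in inches: {ratio_in:.3f}, in centimeters: {ratio_cm:.3f}")
```

Because “twice as warm” depends on which temperature scale is used, the statement carries no real information. “Twice as tall” holds in any unit of length.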

DATA

There are many different questions that can be asked in statistical studies, such as “How long should a post-appendectomy patient be allowed to stay in the hospital?” or “What proportion of all ER patients are uninsured?” Any characteristic observable on a subject is called a variable, and a variable being studied is called a response variable.

Some variables record the category or group to which a subject belongs. This type of variable is called a categorical variable. Examples of categorical variables are insurance carrier, religious affiliation and blood type. A variable that takes on a numerical value on which it makes sense to perform arithmetic is called a quantitative variable. Examples of such a variable include the weight of a person, the time it takes to get an orderly to move a patient, and the cost of a hospital stay. Notice that it makes perfect sense to add ten waiting times to get a total waiting time, or to average the ten times.

It is important to note that not all observations that are recorded as numbers are quantitative variables. State assembly districts are often given numeric labels, e.g., John lives in assembly district 47. However, assembly district is a categorical variable that is recorded in a numerical format, since it makes no sense to compute the average assembly district for a sample.

When subjects are chosen for a study in a truly random manner, the investigator has no way of predicting the value of the variable for any particular subject; for example, there is no way of knowing in advance how tall the next subject will be. Thus, the value of the variable is the result of chance; such a variable is known as a random variable. If a random variable can take on only certain values in a range, that is, it is characterized by interruptions, it is called a discrete random variable.
An example of such a variable is the number of children in a given family, since fractional children are not a possibility. However, if a random variable can potentially take on any value within a range, it is called a continuous random variable. The weight of a person is an example of a variable that can potentially take on any value over a range.

Homework

1. In a medical research study the following information is gathered from each of the experimental subjects: systolic blood pressure, diastolic blood pressure, heart rate, cholesterol level, hemoglobin count, sex, occupation, pain reliever used, average daily caloric intake, and state of residence. State whether each of the variables listed above is categorical or quantitative. If quantitative, is it discrete or continuous?
2. In a 1996 survey of 225 households, 97 reported owning a personal computer. This number did not shock the researchers because 40% of all 1996 households owned a personal computer. Is the number 97 a statistic or a parameter? Is the number 40% a statistic or a parameter? What is the number 225?
3. Car dealers are judged on both the number of units they sell and the total dollar amount of the sales. Explain why both variables are needed. Which, if any, of the variables are discrete? Why? Which, if any, of the variables are continuous? Why?
4. In the text it was stated that the characteristic “height of a subject” is a random variable because there is no way to predict the height of an unknown person. However, we do have some “feeling” about what to expect. A 7’2” subject would take us by surprise, while a 5’7” subject would not raise an eyebrow. Flipping 2 heads in a row would not be considered unusual, but flipping 20 heads in a row would give even the most trusting player pause. What does this discussion suggest to you about the long-term and short-term behavior of random variables?
5.
Make a list of the variables you would want to measure if you were doing a study on 3rd grade honor-roll children. How many variables on your list are categorical? How many are quantitative? How many are discrete? How many are continuous?

Gathering Data

A statistical study data set can contain primary data, secondary data, or both. Primary data refer to original information collected from experiments or surveys conducted by the researcher. A survey is a study that attempts to assess conditions as they exist in nature, that is, to take a snapshot of the population. While conducting a survey the researcher makes every attempt to alter as little as possible. A questionnaire that simply asks the subjects to check off, from a short list, the meal they would prefer is an example of a survey. In an experiment the researcher alters existing conditions in a defined manner in order to assess the effect of the alteration. A study in which people’s white blood cell count is measured before and after they are given an antibiotic is an example of an experiment.

Secondary data refer to information developed by others. For example, if the Health Care Financing Administration (HCFA) wanted to assess how satisfied senior citizens are with the Medicare program, it could collect primary data by surveying insured seniors in differing areas of the country. Because enrollee satisfaction is likely to reflect the ease or difficulty of finding an appropriate physician, HCFA could also incorporate secondary data on what fraction of each area’s local physicians have agreed to be “participating” providers. The latter variable, unknown to each survey respondent, might explain why some areas have higher levels of Medicare satisfaction.

Regardless of whether the sample data are primary or secondary, it is important that the collected data are valid for answering the questions being asked.
If a sample is to be considered reliable for statistical purposes, it must be representative of the population to be studied. Thus, the process by which a sample is collected must avoid the systematic favoring of a certain type of outcome or unit. Such systematic favoring is called bias. For example, asking the parents of only honor-roll students to grade the quality of the teachers may well produce biased results. Any sample that is the result of a biased collection procedure is of questionable use for statistical analysis.

No data collection process can guarantee that a particular sample is representative; however, the process known as simple random sampling eliminates bias in the selection of units by ensuring that every sample of a given size n has an equal chance of being chosen for use in the study. This means that if six balls are drawn from a population of Ping-Pong balls numbered from 1 to 54, the result 1, 2, 3, 4, 5, 6 is just as likely as any other. In this method each member of the population is assigned a number and these numbers are put in a “hat”. Then an impartial device, like a computer, calculator, printed random number table or blindfolded person, is used to pick numbers from the “hat.” If a subject’s number is picked, that subject becomes part of the sample.

Example I

Three of the ten cost centers in a hospital are audited every year. To ensure that no cost center is able to prepare for the audit, the three centers are randomly chosen every year. The list of cost centers appears below.

Outpatient Services
Surgical Services
Inpatient Services
Admitting
Fund Raising
Public Relations
Plant & Property
Food Services
Financial Services
Administrative Offices

Three centers can be randomly chosen in the following manner.

1) Assign each center a number. We will label the centers with the numbers 10 to 19, in the order listed, with Outpatient Services being #10.
2) Use a random number generator to get three randomly chosen numbers between 10 and 19.
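
Steps 1 and 2 can also be carried out in software rather than with a printed table. The sketch below is one possible way to do so, assuming Python’s standard `random` module; the names `COST_CENTERS`, `LABELS` and `choose_for_audit` are illustrative and not part of the example. The call to `random.sample` draws labels without repetition, and every set of three labels is equally likely.

```python
import random

# The ten cost centers from Example I, in the order listed in the text.
COST_CENTERS = [
    "Outpatient Services",
    "Surgical Services",
    "Inpatient Services",
    "Admitting",
    "Fund Raising",
    "Public Relations",
    "Plant & Property",
    "Food Services",
    "Financial Services",
    "Administrative Offices",
]

# Step 1: assign the labels 10 to 19, with Outpatient Services as #10.
LABELS = {10 + i: name for i, name in enumerate(COST_CENTERS)}


def choose_for_audit(k=3, seed=None):
    """Step 2: draw k distinct labels; every set of k labels is equally likely."""
    rng = random.Random(seed)  # the optional seed only makes a run repeatable
    chosen = rng.sample(sorted(LABELS), k)  # sampling without replacement
    return [(label, LABELS[label]) for label in chosen]


for label, name in choose_for_audit():
    print(label, name)
```

Each run prints three different cost centers. Passing a value for `seed` reproduces the same selection, which can be useful when a drawing must be documented.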
In this instance we will use the random number table on page A-1 of your text. Close your eyes and place your pencil somewhere on the page. The number closest to your pencil point is your starting value. Let us assume you landed on row 22 column 11112. Since in this example our labels are two digits, we will read the table two digits at a time. (If our labels were three digits we would read the table three digits at a time.) The first two digits are 14; therefore Fund Raising will be audited. The next two digits are 81, which can’t be used. The next two digits, 24, also can’t be used; neither can the 88 or 95 that follow. However, the two numbers that follow, 11 and 19, are usable and require Surgical Services and Administrative Offices to undergo audits.

It must be noted that in this particular situation a cost center cannot be audited more than once a year. Thus, if a valid label happened to be repeated, it must be skipped the second time. However, some situations allow for the repeated use of a single element of the population. An example of such a situation is the numbers games played in many states. In these games Ping-Pong balls labeled 0 to 9 are placed in each of three randomizing devices and a ball is picked from each device. Since each device contains all the numbers, it is possible that a particular number will be chosen more than once. In such instances a repeated label is accepted.

While random samples are the preferred method for generating representative samples, researchers often find it difficult to collect such data. In most cases, not all of those chosen for the sample provide the requested information, leading to the possibility of nonresponse bias. For example, if HCFA wants to gauge the level of satisfaction with the Medicare program, it could randomly select a group of seniors for a survey.
If all selected seniors responded to the survey, the sample’s viewpoints would be representative of the overall population’s viewpoints. However, it is extremely unlikely that all of the individuals chosen for the survey would be willing to respond. Those who did respond might be more (or less) satisfied with Medicare than those who didn’t respond. Unfortunately, the researcher has no way of knowing whether there is a satisfaction difference between the two groups, and the respondents’ answers might not reflect the overall population’s perceptions of the Medicare program.

Because of the difficulty in obtaining randomly generated sample data, many studies rely on data that may or may not be representative of the population being studied. Such nonrandom sample data studies often try to “guesstimate” whether sample bias is present by comparing the characteristics of the surveyed sample to known characteristics of the population. However, even if the sample’s respondents are similar to the overall population in measured characteristics, this does not eliminate the possibility that the statistics generated by the sample are biased measures of the population’s parameters. The Medicare survey is a case in point. Even if respondents to the satisfaction survey had the same average age, lived in the same areas, had a similar gender mix, etc., as the overall population of Medicare enrollees, there is no guarantee that the sample’s satisfaction with Medicare is similar to that of the overall population. It’s possible (even likely) that the sample would contain a disproportionate share of individuals with negative perceptions of the program, because those who have complaints and desire changes in the program are more likely to respond to such a survey than those without complaints.
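
The size of this effect can be illustrated with simple arithmetic. The sketch below is written in Python and uses entirely hypothetical figures: it assumes that 30% of enrollees are dissatisfied, that 60% of dissatisfied enrollees return the survey, and that 30% of satisfied enrollees return it. None of these numbers comes from an actual Medicare survey, and the function name is an illustrative choice.

```python
def dissatisfied_share_among_respondents(p_dissatisfied,
                                         response_rate_dissatisfied,
                                         response_rate_satisfied):
    """Expected fraction of returned surveys that come from dissatisfied people."""
    responding_dissatisfied = p_dissatisfied * response_rate_dissatisfied
    responding_satisfied = (1 - p_dissatisfied) * response_rate_satisfied
    return responding_dissatisfied / (responding_dissatisfied + responding_satisfied)


# Hypothetical inputs, chosen only for illustration.
share = dissatisfied_share_among_respondents(0.30, 0.60, 0.30)
print(f"Dissatisfied share of the population:  {0.30:.1%}")
print(f"Dissatisfied share of the respondents: {share:.1%}")
```

Under these assumed response rates, dissatisfied enrollees make up a noticeably larger share of the respondents than of the population, even if the original selection was perfectly random. When both groups respond at the same rate, the two shares agree.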
In conclusion, “caveat emptor” (let the buyer beware) is probably an appropriate viewpoint to adopt when reading the many analyses that rely on nonrandomly generated sample data.

A Brief Overview of Survey Design

In order to conduct your own primary survey research, you will have to identify the research question(s) that need to be addressed, design a questionnaire, choose the right sample, collect the data, and employ statistical analysis.

The first step, identifying the point of the study, may seem obvious, but it is a critical part of the process. A study that lacks focus runs the risk of not collecting the information needed to answer the research question(s) that prompted the study.

In the second stage, a survey form (or questionnaire) needs to be developed that will generate the desired data. Prior to actually creating the questionnaire, you must decide how you will elicit responses from those surveyed. Personal interviews tend to generate the most accurate information, but are more costly to conduct. Telephone surveys, while less accurate, are considerably cheaper. Mail surveys are the cheapest, but generate responses with the largest measurement error.

Once the mode of measurement has been decided, the questionnaire can be developed. The questionnaire should include only those questions pertaining to the variables likely to be relevant to the research question being studied. Collecting information on variables that are irrelevant to the study is unnecessarily time consuming and expensive. In addition, the response rate to surveys usually decreases as the length of the questionnaire increases, which increases the chance of nonresponse bias. Questions must also be clearly worded; the fewer words, the better. Finally, the questions must be unambiguous, so that they mean the same thing to differing individuals. For example, the question “Do you smoke? Yes___ No ___” has several possible ambiguities.
It’s not clear whether only cigarettes (vs. cigars, crack, etc.) are being referred to. Occasional smokers may also respond differently, some saying they don’t smoke, while others say they do. A better question would be “How many cigarettes do you usually smoke per day?” In addition, operational definitions sometimes have to be listed on the questionnaire, so that terms take on the same meaning for differing respondents. For example, Loubeau and Jantzen’s mail survey of hospital CEOs asked if their hospitals were members of “integrated delivery systems (IDSs).” Because there are many possible definitions of an IDS, ranging from multihospital systems owned and operated by a common corporate owner to loose affiliations of independent hospitals (like joint purchasing arrangements), the questionnaire also provided the American Hospital Association’s definition of an IDS with the question.

Once the questionnaire has been designed, it is essential to pre-test it on a small group of individuals, and to make any necessary revisions, prior to using it for the final survey. Pre-testing can illuminate problems in specific questions, in respondents’ answers, and in the response rate. Failure to pre-test questionnaires can be disastrous, because once a survey has been conducted, it’s extremely difficult to collect additional information or to amend items.

After the questionnaire has been pre-tested and revised, you’re ready to select the sample and collect the data. As noted above, a probability-based sample (like the simple random sample) is most likely to generate responses that are unbiased reflections of the population’s characteristics. Non-probability samples (like mail surveys to all former patients, or suggestions from the suggestion box), where the data are collected without regard to whether the sample is representative of the population, are more likely to generate biased responses.
In addition to choosing the sampling method, you must also choose the sample size (how many persons you will try to contact). Larger sample sizes generate more precise estimates of population parameters, but are more expensive to obtain. How to estimate the appropriate sample size is a concept that will be developed more fully later in this course.

Once your sample has been contacted and the questionnaires completed, you will have to create a data file that contains all of the information on each questionnaire in order to use a statistical program for analysis. Since every respondent has answered the same questions, but in differing ways, each question on the questionnaire is considered a variable, while each person’s responses constitute an individual record or observation in the data file. So, for example, if you sampled 100 individuals and asked them “How many cigarettes do you smoke per day?”, “How many years of formal schooling have you completed?”, and “Are you male or female?”, your data file will contain three variables and 100 observations. The responses for the first five questionnaires might look like this in a data file:

Observation   Cigarettes   Schooling   Gender
1             0            12          1
2             5            11          2
3             0            16          1
4             20           10          1
5             1            16          2

Note that each questionnaire’s information is recorded on a separate line, and that all of the information has been “coded” as numbers, including gender. Gender has been coded as either a 1 (= Male) or a 2 (= Female) because statistical programs generally cannot perform calculations on alphabetic information, and it’s a lot easier to type in a 1 or 2, rather than “male” or “female,” 100 times.

With a properly collected sample and a basic knowledge of the type of data collected, statistics provides the user with many analytical tools. This course will introduce you to several basic, yet powerful, techniques for the analysis of data.
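
The layout of the data file described above can be sketched in a few lines of code. The following Python example is only an illustration, not a prescribed file format: it stores the five records shown in the text as rows, treats each column as a variable, averages the two quantitative variables, and counts the coded gender variable instead of averaging it. The names `records` and `GENDER_LABELS` are illustrative.

```python
from statistics import mean

# The five records from the text: (observation, cigarettes, schooling, gender).
records = [
    (1, 0, 12, 1),
    (2, 5, 11, 2),
    (3, 0, 16, 1),
    (4, 20, 10, 1),
    (5, 1, 16, 2),
]

# Coding scheme from the text: 1 = Male, 2 = Female.
GENDER_LABELS = {1: "Male", 2: "Female"}

# Each column of the data file is one variable.
cigarettes = [row[1] for row in records]
schooling = [row[2] for row in records]
gender = [row[3] for row in records]

# Quantitative variables can be averaged.
mean_cigarettes = mean(cigarettes)
mean_schooling = mean(schooling)

# Gender is categorical even though it is stored as a number, so count it.
gender_counts = {label: gender.count(code) for code, label in GENDER_LABELS.items()}

print("Mean cigarettes per day:", mean_cigarettes)
print("Mean years of schooling:", mean_schooling)
print("Gender counts:", gender_counts)
```

An average of the gender codes could be computed, but it would have no meaning, because the codes 1 and 2 are only labels for categories.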

Homework

1) Describe how you would generate a sample that could be used to measure the average cost of a hospital stay in your institution.
2) Discuss any problems you may expect to encounter when trying to estimate the annual income of people living in the United States who are over 30 years of age.
3) A student suggests the following as an “easier” way of generating a simple random sample. Randomly pick a starting point on the list, then take every 13th subject on the list until you have the required number of subjects. Explain any weaknesses you may see in this method. Give an example of how this method may have a built-in bias.
4) Imagine a room containing 30 people. What are the chances that two or more of the people in the room share the same birthday? Simulate this situation by generating a list of 30 random integers with values between 1 and 365. Record whether a shared birthday occurred. Repeat this process 10 times.