Chapter 1: Introduction

DRAFT
02/15/16
DRAFT
Chapter 1
INTRODUCTION
Gratzer & Jantzen
INTRODUCTION
Welcome. The goal of this book is to introduce you to some of the ways that health care
professionals use statistical methods as guides in the decision-making process. The basic goal of
a statistical investigation or study is to learn something about the group of objects (e.g., people,
hospitals, vendors) under study. The health care professional may, for example, be interested in
knowing whether the average response times of two ambulance companies are the same, or whether there has
been a change in the proportion of people being insured by certain health insurance companies.
Before we begin answering such questions we must agree on the language that will be used.
This chapter will introduce the language of statistics and familiarize you with some basic
concepts.
A population is defined as the complete collection of objects with which a study is
concerned. It is often impossible to collect the information desired for the population of interest.
Consider a study, funded by a manufacturer of exercise equipment, that has as its goal the
publication of the average weight of a citizen of California. Can the data collection staff hired to
gather these measurements ever complete their task? People move to other states, people from
other states move to California, people die, children are born, people gain and lose weight.
Investigators must draw conclusions based on information gathered from a subset of that
population, that is, from a sample, since a complete and accurate compilation of data will never
be achieved. Thus, the statistician attempts to estimate a characteristic of a population by
measuring that same characteristic on a sample. A measurable characteristic of a population is
known as a parameter. A descriptive measure that is calculated entirely from the observations
in a sample is called a statistic. The actual average weight of a Californian is a parameter; the
average calculated from the measurements taken on the people in the sample is a statistic.
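The distinction between a parameter and a statistic can be sketched in a few lines of Python. The population below is entirely made up (100,000 simulated weights with an assumed mean of 170 pounds); it is not real data on Californians and serves only to show that the parameter is computed from the whole population while the statistic is computed from a sample.

```python
import random
import statistics

# Hypothetical population: 100,000 simulated weights in pounds (not real data).
rng = random.Random(0)
population = [rng.gauss(170, 30) for _ in range(100_000)]

# Parameter: a characteristic computed from every member of the population.
parameter = statistics.mean(population)

# Statistic: the same characteristic computed only from a sample.
sample = rng.sample(population, 200)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.1f}")
print(f"statistic (sample mean):     {statistic:.1f}")
```

In practice the parameter is unknown, which is why the statistic is used to estimate it; the simulation can compute both only because the population was invented.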
In today’s society we are presented with more numeric information than at any time in
history. Thus, a basic knowledge of statistics has become an indispensable tool in understanding
our world. However, if you ask a typical person "What is Statistics?" you usually get an answer
which suggests that statistics is little more than things like batting averages, points per game,
proportion of voters in favor of a candidate or proportion of hospital beds filled weekly. Let us
take a moment to explore this view of the nature of statistics. During the 1997 National
Basketball Association finals Michael Jordan averaged 32.3 points per game. During the 1990
season the New York Yankees' average attendance at home games was 24,770.9, while the New
York Mets averaged 33,737.6 at their home games. The average distance an Iona College
alumnus traveled to the 1997 homecoming game was 143.6 miles. All of the above pieces of
information present the reader with an “average”, which we will assume was calculated by
adding a series of measurements and then dividing by the number of measurements. However,
not all averages are created equal. Jordan’s reported average is most probably, to within
rounding error, an exact summary, since the limited data on which it is based is accurately
recorded and easily obtainable. The attendance averages are also calculated from easily
obtainable and accurately recorded data. However, they may not be directly comparable because
in 1990 the manner in which attendance at major league baseball games was counted differed
from league to league. It is unreasonable to believe that every alumnus at the 1997 Iona College
homecoming game was found and asked the distance he/she traveled to get to the game.
Therefore, at best this number is a summary based on representative but incomplete information.
Thus, even though each of the above situations presented an “average”, which was calculated by
implementation of the same formula, the critical reader should realize that the information
contained in these summaries is not quite the same.
Even when a statistic is well defined and seemingly well understood it may hold some
surprises. Consider the fictitious hospital that needs to have a pacemaker and calls to order one.
The manufacturer has one in stock and will have it delivered by one of its two regular delivery
companies, the Speed–D Delivery Co. or the Prompt Delivery Co. The manager tells her
assistant, "Prompt has a better overall on-time delivery rate but their out-of-state on-time
delivery rate is not as good as Speed–D's, so call Speed–D for this delivery." While dialing, the
assistant notices that the delivery address is in-state and hangs up to call Prompt. The manager
says "Why are you calling Prompt? Speed–D's in-state on-time delivery rate is better than
Prompt's. Call Speed–D." The assistant is confused and mutters “Prompt has a higher on-time
delivery rate than Speed–D. How can Speed–D have a higher on-time rate both in-state and
out-of-state?” What’s going on? Has the manager made a terrible, possibly life-threatening blunder?
The answer to this question can be found in the summary of these companies' on-time records
shown below.
                Speed–D           Prompt
Out-of-State    71/80 = .888      26/30 = .867
In-State        19/20 = .950      66/70 = .943
Overall         90/100 = .900     92/100 = .920
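The rates in the table can be recomputed directly from the counts. The following is a minimal sketch using the figures given above; the dictionary layout and the function names `rate` and `overall` are illustrative choices, not part of the original text.

```python
from fractions import Fraction

# (on-time deliveries, total deliveries), taken from the table above
records = {
    "Speed-D": {"out_of_state": (71, 80), "in_state": (19, 20)},
    "Prompt": {"out_of_state": (26, 30), "in_state": (66, 70)},
}

def rate(on_time, total):
    """On-time rate as an exact fraction."""
    return Fraction(on_time, total)

def overall(company):
    """Pool both regions: total on-time deliveries over total deliveries."""
    on_time = sum(n for n, _ in records[company].values())
    total = sum(t for _, t in records[company].values())
    return rate(on_time, total)

for region in ("out_of_state", "in_state"):
    s = rate(*records["Speed-D"][region])
    p = rate(*records["Prompt"][region])
    print(f"{region}: Speed-D {float(s):.3f}, Prompt {float(p):.3f}")

print(f"overall: Speed-D {float(overall('Speed-D')):.3f}, "
      f"Prompt {float(overall('Prompt')):.3f}")
```

Note that the overall rate is not the average of the two regional rates; it weights each region by the number of deliveries made there, which is what allows the ordering to reverse.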
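We see that the manager has not made a mistake. The seemingly contradictory statement
that Speed–D has a higher on-time delivery rate both in and out of state, but a lower overall
on-time rate than Prompt, is in fact true. The reversal occurs because most of Speed–D's
deliveries (80 of 100) were out-of-state, where both companies are on time less often, while most
of Prompt's deliveries (70 of 100) were in-state. Thus, we have learned that numbers must be carefully
studied to be fully understood. The following definition of statistics expands the layman's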
concept of statistics as merely numbers to a process that attempts to uncover the information
contained in the numbers. Statistics is defined as methods of collecting, organizing,
summarizing and generalizing the information contained in measurements taken from a sample.
The information with which statistics is concerned is called data. The data used in statistics are
numbers. The numbers used are either measurements (e.g., height, cholesterol level, cost of
hospital stay), or counts (e.g., number of patients admitted for hypothermia, number of patients
with blue eyes). The smallest object or individual that can be investigated and measured is called
a unit. Units are the source of the basic information in statistical studies.
Data can be collected in different ways. This fact gives rise to the Measurement Scales
listed below.
a) Nominal Scale:
Classification of units into unordered categories. Such unordered
categories might include male-female, married-single-divorced-widowed,
and cause of death.
b) Ordinal Scale:
Units placed into ordered categories. Such ordered categories
might include socioeconomic status, or convalescing patients rated as
unimproved-improved-well.
c) Interval Scale:
Distances between consecutive values are equal. However, the zero
point is arbitrary. An example of an interval scale is degrees Fahrenheit.
One can add and subtract temperatures, but it makes no sense to multiply
or divide them: 20°F is not twice as warm as 10°F.
d) Ratio Scale:
Distances between consecutive values are equal, with a non-arbitrary zero.
An example of a ratio scale is height. You can add, subtract, multiply and
divide height measurements.
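One way to see why ratios are meaningless on an interval scale is to express the same two temperatures in a second unit. The sketch below assumes the standard Fahrenheit-to-Celsius conversion; the specific heights used for the ratio-scale comparison are invented for illustration.

```python
def f_to_c(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9

warm_f, cool_f = 20.0, 10.0
warm_c, cool_c = f_to_c(warm_f), f_to_c(cool_f)

# Interval scale: the ratio depends on where the arbitrary zero sits.
print(warm_f / cool_f)   # ratio of the readings in Fahrenheit
print(warm_c / cool_c)   # ratio of the same two temperatures in Celsius

# Differences survive the change of unit (up to the constant factor 5/9).
print((warm_f - cool_f) * 5 / 9, warm_c - cool_c)

# Ratio scale: height has a true zero, so ratios agree in inches and centimeters.
tall_in, short_in = 72.0, 36.0   # hypothetical heights in inches
print(tall_in / short_in, (tall_in * 2.54) / (short_in * 2.54))
```

The two temperature ratios disagree, so neither can be read as "twice as warm", whereas the height ratio is the same in either unit.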
DATA
There are many different questions that can be asked in statistical studies such as “How long
should a post-appendectomy patient be allowed to stay in the hospital?” “What proportion of all
ER patients are uninsured?” Any characteristic observable on the subject is called a variable and
a variable being studied is called a response variable. Some variables record the category or
group to which a subject belongs. This type of variable is called a categorical variable.
Examples of categorical variables are insurance carrier, religious affiliation and blood type. A
variable that takes on a numerical value on which it makes sense to perform arithmetic is called a
quantitative variable. Examples of such a variable include the weight of a person, the time it
takes to get an orderly to move a patient and the cost of a hospital stay. Notice that it makes perfect
sense to add ten waiting times to get a total waiting time or to average the ten
waiting times. It is important to note that not all observations that result from numeric recordings
are quantitative variables. State assembly districts are often given numeric labels, e.g., John
lives in assembly district 47. However, assembly districts are categorical variables that are
recorded in a numerical format, since it makes no sense to get the average assembly district for a
sample.
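The difference shows up in which summaries make sense. In the sketch below all values are hypothetical, invented only to contrast the two kinds of variable: the waiting times are summed and averaged, while the district labels are merely counted.

```python
import statistics
from collections import Counter

# Hypothetical values, invented for illustration only.
waiting_minutes = [12, 7, 15, 9, 11, 8, 14, 10, 6, 13]        # quantitative
assembly_districts = [47, 12, 47, 88, 12, 47, 3, 88, 47, 12]  # categorical labels

# Arithmetic is meaningful for a quantitative variable.
total_wait = sum(waiting_minutes)
mean_wait = statistics.mean(waiting_minutes)
print(f"total wait: {total_wait} min, mean wait: {mean_wait} min")

# For a categorical variable recorded as numbers, count the labels instead.
district_counts = Counter(assembly_districts)
most_common_district, frequency = district_counts.most_common(1)[0]
print(f"most common district: {most_common_district} ({frequency} subjects)")
```

Nothing stops a program from averaging the district numbers; the point is that the result would describe no real district and answer no real question.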
When subjects are chosen for a study in a truly random manner the investigator has no way
of predicting the value of the variable for any particular subject; e.g., there is no way of knowing
in advance how tall the next subject will be. Thus, the value of the variable is the result of
chance; this is known as a random variable. If a random variable can take on only certain
values in a range, that is, it is characterized by interruptions, it is called a discrete random
variable. An example of such a variable is the number of children in a given family, since
fractional children are not a possibility. However, if a random variable can potentially take on
any value within a range it is called a continuous random variable. The weight of a person is
an example of a variable that can potentially take on any value over a range.
Homework
1. In a medical research study the following information is gathered from each of the
experimental subjects: systolic blood pressure, diastolic blood pressure, heart rate,
cholesterol level, hemoglobin count, sex, occupation, pain reliever used, average daily caloric
intake, and state of residence. State whether each of the variables listed above is categorical
or quantitative. If quantitative, is it discrete or continuous?
2. In a 1996 survey of 225 households 97 reported owning a personal computer. This number
did not shock the researchers because 40% of all 1996 households owned a personal
computer. Is the number 97 a statistic or a parameter? Is the number 40% a statistic or a
parameter? What is the number 225?
3. Car dealers are judged on both the number of units they sell and the total dollar amount of the
sales. Explain why both variables are needed. Which, if any, of the variables are discrete?
Why? Which, if any, of the variables are continuous? Why?
4. In the text it was stated that the characteristic “height of a subject” is a random variable
because there is no way to predict the height of an unknown person. However, we do have
some “feeling” about what to expect. A 7’2” subject would take us by surprise, while a 5’7”
subject would not raise an eyebrow. Flipping 2 heads in a row would not be considered
unusual, but flipping 20 heads in a row would give even the most trusting player pause. What
does this discussion suggest to you about the long-term and short-term behavior of random
variables?
5. Make a list of the variables you would want to measure if you were doing a study on 3rd
grade honor-roll children. How many variables on your list are categorical? How many are
quantitative? How many are discrete? How many are continuous?
Gathering Data
The data set for a statistical study can contain primary data or secondary data, or both. Primary data
refer to original information collected from experiments or surveys conducted by the researcher.
A survey is a study that attempts to assess conditions as they exist in nature, that is, take a
snapshot of the population. While conducting a survey the researcher makes every attempt to
alter as little as possible. A questionnaire which simply asks the subjects to check off, from a
short list, the meal they would prefer is an example of a survey. In an experiment the researcher
alters existing conditions in a defined manner in order to assess the effect of the alteration. A
study in which people's white blood cell count is measured before and after they are given an
antibiotic is an example of an experiment. Secondary data refer to information developed by
others. For example, if the Health Care Financing Administration (HCFA) wanted to assess how
satisfied senior citizens are with the Medicare program, they could collect primary data by
surveying insured seniors in differing areas of the country. Because enrollee satisfaction is
likely to reflect the ease or difficulty in finding an appropriate physician, HCFA could also
incorporate secondary data on what fraction of each area’s local physicians have agreed to be
“participating” providers. The latter variable, unknown to each survey respondent, might
explain why some areas have higher levels of Medicare satisfaction.
Regardless of whether the sample data are primary or secondary, it is important that the
collected data are valid for answering the questions being asked. If a sample is to be considered
reliable for statistical purposes it must be representative of the population to be studied. Thus, the
process by which a sample is collected must avoid the systematic favoring of a certain type of
outcome or unit. Such a systematic favoring is called bias. For example, asking the parents of
only honor roll students to grade the quality of the teachers may well produce biased results.
Any sample which is the result of a biased collection procedure is of questionable use for
statistical analysis.
No data collection process can guarantee that a particular sample is representative.
However, the process known as simple random sampling eliminates bias in selection by ensuring that every
sample of a given size n has an equal chance of being chosen for use in the study. This means
that if from a population of Ping-Pong balls numbered from 1 to 54 six are drawn, the result 1, 2,
3, 4, 5, 6 is just as likely as any other. In this method each member of the population is assigned
a number and these numbers are put in a “hat”. Then, an impartial device like a computer,
calculator, printed list or blindfolded person is used to pick numbers from the “hat.” If a
subject’s number is picked that subject becomes part of the sample.
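The "hat" can be played by Python's standard `random` module. The sketch below counts how many different samples of six can be drawn from 54 numbered balls and then draws one of them; the seed value is an arbitrary assumption, used only so the run can be repeated.

```python
import math
import random

# Number of distinct samples of size 6 from 54 numbered balls.
# Under simple random sampling, every one of them is equally likely,
# including the sample 1, 2, 3, 4, 5, 6.
n_samples = math.comb(54, 6)
print(n_samples)

# Each member of the population is assigned a number from 1 to 54.
population = list(range(1, 55))

rng = random.Random(2016)           # arbitrary seed, for repeatability only
sample = rng.sample(population, 6)  # draws 6 distinct numbers, no repeats
print(sorted(sample))
```

`random.sample` draws without replacement, which matches the situation in which a subject can appear in the sample only once.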
Example I
Three of the ten cost centers in a hospital are audited every year. To assure that no cost center is
able to prepare for the audit, the three centers are randomly chosen every year. The list of cost
centers appears below.
Outpatient Services
Surgical Services
Inpatient Services
Admitting
Fund Raising
Public Relations
Plant & Property
Food Services
Financial Services
Administrative Offices
Three centers can be randomly chosen in the following manner.
1) Assign each center a number. We will label the centers, in the order listed, with the numbers
10 – 19, with Outpatient Services being #10.
2) Use a random number generator to get three randomly chosen numbers between 10 and 19.
In this instance we will use the random number table on page A–1 of your text. Close your
eyes and place your pencil somewhere on the page. The number closest to your pencil point
is your starting value. Let us assume you landed on row 22 column 11112. Since in this
example our labels are two digits, we will read the table two digits at a time. (If our labels
were three digits we would read the table three digits at a time). The first two digits are 14,
therefore Fund Raising will be audited. The next two digits are 81, which can't be used. The
next two digits 24 also can't be used; neither can the 88 or 95 that follow. However, the two
numbers that follow, 11 and 19, are usable and require Surgical Services and Administrative
Offices to undergo audits.
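The table-reading procedure in step 2 can be sketched as follows. The digit string is simply the sequence of two-digit values quoted in the example (14, 81, 24, 88, 95, 11, 19); the function name `read_table` and its parameters are illustrative assumptions. A label that has already been chosen is skipped, since a cost center cannot be audited twice in one year.

```python
centers = [
    "Outpatient Services", "Surgical Services", "Inpatient Services",
    "Admitting", "Fund Raising", "Public Relations", "Plant & Property",
    "Food Services", "Financial Services", "Administrative Offices",
]
# Step 1: label the centers 10 through 19 in the order listed.
labels = {10 + i: name for i, name in enumerate(centers)}

def read_table(digits, labels, how_many, width=2):
    """Read `digits` in chunks of `width`, keeping valid, unrepeated labels."""
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if label in labels and label not in chosen:
            chosen.append(label)
        if len(chosen) == how_many:
            break
    return [labels[label] for label in chosen]

# Step 2: the digits read from the random number table in the example.
table_digits = "14812488951119"
audited = read_table(table_digits, labels, how_many=3)
print(audited)
```

With three-digit labels, the same function could be called with `width=3`.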
It must be noted that for this particular situation a cost center cannot be audited more than once
a year. Thus, if a valid label happens to be repeated, it must be skipped over the second time.
However, some situations allow for the repeated use of a single element of the population. An
example of such a situation is the numbers games played in many states. In these games Ping-Pong
balls labeled 0 – 9 are placed in each of three randomizing devices and a ball is picked from each
device. Since each device contains all the numbers it is possible that a particular number will be
chosen more than once. In such instances a repeated label is accepted.
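The two situations correspond to sampling without and with replacement. As a rough sketch, the code below enumerates every outcome of the three-device game to count how many contain a repeated digit, then shows the two kinds of draw using the standard `random` module (the seed is an arbitrary assumption).

```python
import random
from itertools import product

digits = list(range(10))

# Enumerate all 10 * 10 * 10 equally likely outcomes of the three devices.
outcomes = list(product(digits, repeat=3))
with_repeat = [o for o in outcomes if len(set(o)) < 3]
print(len(with_repeat), "of", len(outcomes), "outcomes repeat a digit")

rng = random.Random(7)  # arbitrary seed, for repeatability only

# Numbers game: each device holds all ten digits, so repeats are allowed.
draw_with_replacement = rng.choices(digits, k=3)

# Audit example: a label already used cannot be chosen again.
draw_without_replacement = rng.sample(digits, k=3)

print(draw_with_replacement, draw_without_replacement)
```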
While random samples are the preferred method for generating representative samples,
researchers often find it difficult to collect such data. In most cases, not all of those chosen for
the sample provide the requested information, leading to the possibility of nonresponse bias.
For example, if HCFA wants to gauge the level of satisfaction with the Medicare program, it
could randomly select a group of seniors for a survey. If all selected seniors responded to the
survey, the sample’s viewpoints would be representative of the overall population’s viewpoints.
However, it is extremely unlikely that all of the individuals chosen for the survey would be
willing to respond. Those who did respond might be more (or less) satisfied with Medicare than
those who didn’t respond. Unfortunately, the researcher has no way of knowing whether there is
a satisfaction difference between the two groups, and the respondents’ answers might not reflect
the overall population’s perceptions of the Medicare program. Because of the difficulty in
obtaining randomly generated sample data, many studies rely on data that may or may not be
representative of the population being studied. Such nonrandom sample data studies often try to
“guesstimate” whether sample bias is present or not by comparing the characteristics of the
surveyed sample to known characteristics of the population. However, even if the sample’s
respondents are similar to the overall population in measured characteristics, it does not
eliminate the possibility that the statistics generated by the sample are biased measures of the
population’s parameters. The Medicare survey is a case in point. Even if respondents to the
satisfaction survey had the same average age, lived in the same areas, had a similar gender mix,
etc., as the overall population of Medicare insurees, there is no guarantee that the sample’s
satisfaction with Medicare is similar to that of the overall population. It’s possible (even
likely) that the sample would contain a disproportionate share of individuals with negative
perceptions of the program, because those who have complaints and desire changes in the program are
more likely to respond to such a survey than those without complaints. In conclusion, “caveat
emptor (let the buyer beware)” probably is an appropriate viewpoint to adopt when reading the
many analyses that rely on nonrandomly generated sample data.
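A small calculation shows how unequal response rates can distort a survey result. Every number below is hypothetical: the population is assumed to be half satisfied, and dissatisfied seniors are assumed to return the survey twice as often as satisfied ones. The function name is likewise an illustrative choice.

```python
def satisfied_share_among_respondents(p_satisfied,
                                      respond_if_satisfied,
                                      respond_if_dissatisfied):
    """Expected fraction of respondents who are satisfied."""
    satisfied_responding = p_satisfied * respond_if_satisfied
    dissatisfied_responding = (1 - p_satisfied) * respond_if_dissatisfied
    return satisfied_responding / (satisfied_responding + dissatisfied_responding)

# Hypothetical inputs: 50% of the population satisfied, response rates of
# 30% (satisfied) and 60% (dissatisfied). None of these come from real data.
share = satisfied_share_among_respondents(0.50, 0.30, 0.60)
print(f"satisfied in population: 50.0%, among respondents: {share:.1%}")

# With equal response rates the respondents mirror the population.
no_bias = satisfied_share_among_respondents(0.50, 0.40, 0.40)
print(f"with equal response rates: {no_bias:.1%}")
```

Under these assumed rates the respondents understate satisfaction even though the original sample was drawn at random, which is the essence of nonresponse bias.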
A Brief Overview of Survey Design
In order to conduct your own primary survey research, you will have to identify the
research question(s) that need to be addressed, design a questionnaire, choose the right sample,
collect the data, and employ statistical analysis.
The first step, identifying the point of the study, may seem obvious, but it is a critical part
of the process. A study that lacks focus runs the risk of not collecting the appropriate
information needed to answer the research question(s) that prompted the study.
In the second stage, a survey form (or questionnaire) needs to be developed that will
generate the data desired. Prior to actually creating the questionnaire, you must decide how you
will elicit responses from those surveyed. Personal interviews tend to generate the most accurate
information, but are more costly to conduct. Telephone surveys, while less accurate, are
considerably cheaper. Mail surveys are the cheapest, but generate responses with the largest
measurement error.
Once the mode of measurement has been decided, the questionnaire can be developed.
The questionnaire should include only those questions pertaining to the variables likely to be
relevant for the research question being studied. Collecting information for variables that are
irrelevant to the study is unnecessarily time consuming and expensive, and the response rate to
surveys usually decreases as the length of the questionnaire increases, increasing the chance of
nonresponse bias. Questions must also be clearly worded; the fewer words, the better. Finally,
the questions must be unambiguous, so that they mean the same thing to differing individuals.
For example, the question “Do you smoke? Yes___ No ___” has several possible ambiguities.
It’s not clear if only cigarettes (vs. cigars, crack, etc.) are being referred to. Occasional smokers
may also respond differently, some saying they don’t smoke, while others say they do. A better
question would be “How many cigarettes do you usually smoke per day?” In addition,
sometimes operational definitions have to be listed on the questionnaire, so that terms take on
the same meaning for differing respondents. For example, Loubeau and Jantzen’s mail survey of
hospital CEOs asked if their hospitals were members of “integrated delivery systems (IDSs).”
Because there are many possible definitions of an IDS, ranging from multihospital systems
owned and operated by a common corporate owner, to loose affiliations of independent hospitals
(like joint purchasing arrangements), the questionnaire also provided the American Hospital
Association’s definition of an IDS with the question.
Once the questionnaire has been designed, it is essential to pre-test it on a small group of
individuals, and to make any necessary revisions, prior to utilizing it for the final survey.
Pre-testing can illuminate problems in specific questions, in the respondents' answers, and in the
response rate. Failure to pre-test questionnaires can be disastrous, because once a survey has
been conducted, it’s extremely difficult to collect additional information or to amend items.
After the questionnaire has been pre-tested and revised, you’re ready to select the sample
and collect the data. As noted above, a probability-based sample (like the simple random
sample) is most likely to generate responses that are unbiased reflections of the population’s
characteristics. Non-probability samples (like mail surveys to all former patients, or suggestions
from the suggestion box), where the data are collected without regard to whether the sample is
representative of the population, are more likely to generate biased responses. In addition to
choosing the sampling method, you must also choose the sample size (how many persons will
you try to contact). Larger sample sizes generate more precise estimates of population
parameters, but are more expensive to obtain. How to estimate the appropriate sample size is a
concept that will be developed more fully later in this course.
Once your sample has been contacted and the questionnaires completed, in order to
utilize a statistical program for analysis, you will have to create a data file that contains all of the
information on each questionnaire. Since every respondent has answered the same questions,
but in differing ways, each question on the questionnaire is considered a variable, while each
person’s responses constitute an individual record or observation in the data file. So, for
example, if you sampled 100 individuals and asked them “How many cigarettes do you smoke
per day?”, “How many years of formal schooling have you completed?”, and “Are you male or female?”,
your data file will contain three variables and 100 observations. The responses for the first
five questionnaires might look like this in a data file:
Observation    Cigarettes    Schooling    Gender
1              0             12           1
2              5             11           2
3              0             16           1
4              20            10           1
5              1             16           2
Note that each questionnaire’s information is recorded on a separate line, and that all of the
information has been “coded” as numbers, including gender. Gender has been coded as either a
1 (=Male) or 2 (=Female) because many statistical programs work more readily with numeric codes
than with alphabetic information, and
it’s a lot easier to type in a 1 or 2, rather than “male” or “female,” 100 times.
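As an illustration, the five records shown above can be stored as comma-separated text and read with Python's standard `csv` module. The comma-separated layout, the lowercase column names, and the `GENDER_CODES` dictionary are assumptions made for this sketch; the course's statistical program may expect a different format.

```python
import csv
import io
import statistics

# The five records shown above: one line per questionnaire, one column per variable.
raw = """observation,cigarettes,schooling,gender
1,0,12,1
2,5,11,2
3,0,16,1
4,20,10,1
5,1,16,2
"""

GENDER_CODES = {1: "Male", 2: "Female"}  # coding scheme described in the text

# Each row becomes one observation; each column name is one variable.
records = [
    {name: int(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

mean_cigarettes = statistics.mean(r["cigarettes"] for r in records)
genders = [GENDER_CODES[r["gender"]] for r in records]

print(f"{len(records)} observations, mean cigarettes per day: {mean_cigarettes}")
print(genders)
```

Averaging the cigarette counts is meaningful; averaging the gender codes would not be, because the 1 and 2 are labels for categories, not quantities.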
With a properly collected sample and a basic knowledge of the type of data collected,
statistics provides the user with many analytical tools. This course will introduce you to several
basic, yet powerful, techniques for the analysis of data.
Homework
1) Describe how you would generate a sample that could be used to measure the average cost of
a hospital stay in your institution.
2) Discuss any problems you may expect to encounter when trying to estimate the annual
income of people living in the United States who are over 30 years of age.
3) A student suggests the following as an “easier” way of generating a simple random sample.
Randomly pick a starting point on the list, then take every 13th subject on the list until you
have the required number of subjects. Explain any weaknesses you may see in this method.
Give an example of how this method may have a built-in bias.
4) Imagine a room containing 30 people. What are the chances that two or more of the people in
the room share the same birthday? Simulate this situation by generating a list of 30 random
integers with values between 1 and 365. Record whether a shared birthday occurred. Repeat
this process 10 times.
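If you choose to run the simulation on a computer, one possible setup is sketched below. The function names are illustrative, and the random number generator is left unseeded so that each run produces different rooms; a calculator or random number table would serve equally well.

```python
import random

def has_shared_birthday(birthdays):
    """True if at least two entries in the list are the same day."""
    return len(set(birthdays)) < len(birthdays)

def simulate_room(rng, people=30, days=365):
    """One room: a list of random integers between 1 and `days`, inclusive."""
    return [rng.randint(1, days) for _ in range(people)]

rng = random.Random()  # unseeded: results differ from run to run
results = [has_shared_birthday(simulate_room(rng)) for _ in range(10)]
print(results.count(True), "of 10 simulated rooms had a shared birthday")
```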