Identification of Variables is
Specification of the Sample
As Cynthia, a perceptive student, remarked in one of my classes,
Identification of the variables of interest as the first step of planning research is equivalent to
specifying the raw data (i.e., the sample) that will be collected by performing an experimental
or observational study.
This is so because a statement such as
Ai = the age (years) of the ith randomly sampled Labrador Retriever in Baltimore, MD,
i = 1,2,..., 24.
is actually 24 statements. These 24 statements are
A1 = the age (years) of the 1st randomly sampled Labrador Retriever in Baltimore, MD.
A2 = the age (years) of the 2nd randomly sampled Labrador Retriever in Baltimore, MD.
A3 = the age (years) of the 3rd randomly sampled Labrador Retriever in Baltimore, MD.
⋮
A23 = the age (years) of the 23rd randomly sampled Labrador Retriever in Baltimore, MD.
A24 = the age (years) of the 24th randomly sampled Labrador Retriever in Baltimore, MD.
Thus, these statements¹ define the planned observations (data) of the Ai column of the data table shown in Table 1. Moreover,
A random variable of interest is always a characteristic of the individuals of the population or
sample. It is not a characteristic of the population or the sample as a whole. E.g., the mean age of
all individuals in the population (called the population mean age) is a characteristic of the
population as a whole. Likewise, the mean age of all individuals in the sample (called the sample
mean age) is a characteristic of the sample as a whole. Neither mean would be referred to as a
random variable of interest.
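To make the expansion of the single indexed statement for Ai concrete, here is a minimal sketch in Python that generates the 24 statements. The helper names `ordinal` and `expand_definition` are illustrative assumptions, not part of the original notes.

```python
def ordinal(i: int) -> str:
    """Return i with its English ordinal suffix, e.g. 1 -> '1st', 23 -> '23rd'."""
    if 10 <= i % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th, ... are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(i % 10, "th")
    return f"{i}{suffix}"


def expand_definition(symbol: str, description: str, n: int) -> list[str]:
    """Expand one indexed definition into n explicit statements."""
    return [
        f"{symbol}{i} = {description} of the {ordinal(i)} randomly sampled "
        "Labrador Retriever in Baltimore, MD."
        for i in range(1, n + 1)
    ]


# One indexed definition stands for 24 separate statements.
statements = expand_definition("A", "the age (years)", 24)
print(len(statements))
print(statements[0])
print(statements[-1])
```

Each generated statement describes one dog, which is why the variable is a characteristic of the individuals rather than of the sample as a whole.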
¹ Recall that in mathematical writing, an ellipsis, whether written horizontally (...) or vertically (⋮), means “and so on, continuing the sequence”, i.e., “et cetera”.
Document1
1
2/8/2016
Example
This is how Elizabeth, another student, identified three variables in a Melanoma Study. The
bullet points are how Cynthia correctly interpreted Elizabeth’s definitions.
1. Ci = the coat color (black, yellow, chocolate) of the ith randomly sampled Labrador Retriever
in Baltimore, MD, i = 1, 2, ..., 24.
   • coat color (black, yellow, chocolate)
   • qualitative, nominal, three levels
   • denoted by Ci
   • n = 24
   • from the population of all Labrador Retrievers in Baltimore, MD
2. Ai = the age (years) of the ith randomly sampled Labrador Retriever in Baltimore, MD,
i = 1, 2, ..., 24.
   • age (years)
   • quantitative, continuous, ratio scale; units = years
   • denoted by Ai
   • n = 24
   • from the population of all Labrador Retrievers in Baltimore, MD
3. Mi = the number of melanomas (0, 1, 2, ...) found on the ith randomly sampled Labrador
Retriever in Baltimore, MD, i = 1, 2, ..., 24.
   • number of melanomas (0, 1, 2, ...) found on the dog
   • quantitative, discrete, ratio scale, i.e., a "count" with levels (0, 1, 2, ...)
   • denoted by Mi
   • n = 24
   • from the population of all Labrador Retrievers in Baltimore, MD
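The interpretation of the three definitions can also be written down as structured records. The sketch below is one possible representation in Python; the class name `VariableSpec` and its field names are assumptions made for illustration, not notation from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    symbol: str           # letter that denotes the variable, e.g. "C" for Ci
    characteristic: str   # what is observed on each sampled dog
    kind: str             # "qualitative" or "quantitative"
    scale: str            # e.g. "nominal", "continuous ratio", "discrete ratio"
    levels_or_units: str  # parenthetical list of levels, or the units
    n: int                # planned sample size
    population: str       # population the sample is drawn from


POPULATION = "all Labrador Retrievers in Baltimore, MD"

variables = [
    VariableSpec("C", "coat color", "qualitative", "nominal",
                 "black, yellow, chocolate", 24, POPULATION),
    VariableSpec("A", "age", "quantitative", "continuous ratio",
                 "years", 24, POPULATION),
    VariableSpec("M", "number of melanomas", "quantitative", "discrete ratio",
                 "0, 1, 2, ...", 24, POPULATION),
]

for v in variables:
    print(f"{v.symbol}i: {v.characteristic} ({v.levels_or_units}); "
          f"{v.kind}, {v.scale}; n = {v.n}; from {v.population}")
```

Every field of a record is recoverable from the corresponding one-sentence definition, which is the point of the bullet lists above.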
Table 1
The data-table representation of the sample (data) that is intrinsic in the
definitions of the three variables

i     Ci = Coat Color       Ai = age    Mi = Melanoma Frequency
      (black, yellow, …)    (years)     (0, 1, …)
1     c1                    a1          m1
2     c2                    a2          m2
3     c3                    a3          m3
⋮     ⋮                     ⋮           ⋮
i     ci                    ai          mi
⋮     ⋮                     ⋮           ⋮
23    c23                   a23         m23
24    c24                   a24         m24
Summary
1. Identification of the variables of interest as the first step of planning research is
equivalent to specifying the raw data (i.e., the sample) that will be collected by
performing an experimental or observational study.
2. A random variable of interest is always a characteristic of the individuals of the
population or sample. It is not a characteristic of the population or the sample as a whole.
3. The symbolic notation, e.g., Ci, Ai, and Mi, is used to denote the data because at Step 1 of
planning a study, we don’t know the values that we are going to observe. So, we denote
the values symbolically by letters.
4. In an observational study, the subscript i is an ordinal label of the ith observational unit
randomly sampled from the population. In an experiment, it is an ordinal label of the
random outcome of the ith independent trial of the experiment. In this example, there are
n = 24 observations (not 24 × 3 = 72 observations). Each observation records three
characteristics of the sampled observational or experimental unit.
5. Each column of the table holds one of the three characteristics observed (or to be
observed). The characteristics are called random variables because they randomly vary
from observation to observation.
All of this information is encompassed by Elizabeth’s three sentences identifying/defining the
three random variables.
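As a rough illustration of summary points 3 to 5, the planned data table can be built before any data exist by filling it with symbolic placeholders. This is only a sketch; the names `N`, `SYMBOLS`, and `table` are assumptions chosen for the example.

```python
N = 24                     # number of sampled dogs, i.e. the number of observations
SYMBOLS = ("c", "a", "m")  # lower-case placeholders for the values not yet observed

# One row per observation; each row holds the three planned values for one dog.
table = [
    {"i": i, **{s.upper(): f"{s}{i}" for s in SYMBOLS}}
    for i in range(1, N + 1)
]

n_observations = len(table)
# Count the recorded values, excluding the index column "i".
n_values = sum(len(row) - 1 for row in table)

print(table[0])
print(n_observations, n_values)
```

The table has 24 rows, one per observation, and 72 cells of values; the 72 counts recorded values, not observations.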
A minor detail
Elizabeth listed the levels of coat color as (black, yellow, chocolate), whereas in the data-table
they are listed as (black, yellow, …). The list (black, yellow, chocolate) implies that only these
three colors are possible. [Whether or not that is correct depends on more knowledge of Labrador
Retrievers (the population) than I possess.] The list (black, yellow, …), in effect, includes all
colors. It simply informs us that coat color is going to be characterized nominally (rather than as
a wavelength, for example). Neither list says that black Labs or yellow Labs will necessarily
be included in the study; if coat color is random, we cannot know that in advance. The purpose of the
parenthetical list of levels is simply to clarify exactly what is going to be observed or measured,
e.g., to distinguish between color (black, yellow, …) versus color wavelength (nanometers).
[Note that … is the mathematical way of saying “and so forth” or “etc.”] The point is that the purpose
of the parenthetical list is to clarify rather than to limit. The list of levels must include all
possible levels [as color (black, yellow, …) does and as color (black, yellow) doesn’t], and it’s fine
if it contains some values that are not possible or that may not be observed.
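The difference between a closed list and an open-ended list of levels can be sketched as follows. The class `LevelList` and its `admits` method are hypothetical names introduced for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelList:
    named: tuple[str, ...]  # levels written out explicitly
    open_ended: bool        # True when the list ends with "…"

    def admits(self, level: str) -> bool:
        """Return True if this list allows the given level to be recorded."""
        return self.open_ended or level in self.named


closed = LevelList(("black", "yellow", "chocolate"), open_ended=False)
open_list = LevelList(("black", "yellow"), open_ended=True)   # (black, yellow, …)
too_short = LevelList(("black", "yellow"), open_ended=False)  # (black, yellow)

for name, levels in [("closed", closed), ("open", open_list), ("too short", too_short)]:
    print(name, levels.admits("chocolate"))
```

Under this sketch, only the list (black, yellow) fails to admit a chocolate Lab, which is why it would not be an acceptable list of levels.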