Identification of Variables is Specification of the Sample

As Cynthia, a perceptive student, remarked in one of my classes:

Identification of the variables of interest as the first step of planning research is equivalent to specifying the raw data (i.e., the sample) that will be collected by performing an experimental or observational study.

This is so because a statement such as

Ai = the age (years) of the ith randomly sampled Labrador Retriever in Baltimore, MD, i = 1, 2, ..., 24.

is actually 24 statements. These 24 statements are

A1 = the age (years) of the 1st randomly sampled Labrador Retriever in Baltimore, MD.
A2 = the age (years) of the 2nd randomly sampled Labrador Retriever in Baltimore, MD.
A3 = the age (years) of the 3rd randomly sampled Labrador Retriever in Baltimore, MD.
⋮
A23 = the age (years) of the 23rd randomly sampled Labrador Retriever in Baltimore, MD.
A24 = the age (years) of the 24th randomly sampled Labrador Retriever in Baltimore, MD.

Thus, these statements¹ define the planned observations (data) of the Ai column of the data table shown in Table 1.

Moreover:

A random variable of interest is always a characteristic of the individuals of the population or sample. It is not a characteristic of the population or the sample as a whole.

E.g., the mean age of all individuals in the population (called the population mean age) is a characteristic of the population as a whole. Likewise, the mean age of all individuals in the sample (called the sample mean age) is a characteristic of the sample as a whole. Neither mean would be referred to as a random variable of interest.

¹ Recall that in mathematical writing, an ellipsis, whether written horizontally (...) or vertically (⋮), means “and so on, continuing the sequence”, i.e., “et cetera”.

Document1 1 2/8/2016

Example

This is how Elizabeth, another student, identified three variables in a Melanoma Study. The bullet points are how Cynthia correctly interpreted Elizabeth’s definitions.

1. Ci = the coat color (black, yellow, chocolate) of the ith randomly sampled Labrador Retriever in Baltimore, MD, i = 1, 2, ..., 24.
   • coat color (black, yellow, chocolate)
   • qualitative, nominal, three levels
   • denoted by Ci
   • n = 24 from the population of all Labrador Retrievers in Baltimore, MD

2. Ai = the age (years) of the ith randomly sampled Labrador Retriever in Baltimore, MD, i = 1, 2, ..., 24.
   • age (years)
   • quantitative, continuous, ratio, units = years
   • denoted by Ai
   • n = 24 from the population of all Labrador Retrievers in Baltimore, MD

3. Mi = the number of melanomas (0, 1, 2, ...) found on the ith randomly sampled Labrador Retriever in Baltimore, MD, i = 1, 2, ..., 24.
   • number of melanomas (0, 1, 2, ...) found on the dog
   • quantitative, discrete, ratio, i.e., a “count” with levels (0, 1, 2, ...)
   • denoted by Mi
   • n = 24 from the population of all Labrador Retrievers in Baltimore, MD

Table 1. The data-table representation of the sample (data) that is intrinsic in the definitions of the three variables

 i    Ci = Coat Color        Ai = Age    Mi = Melanoma Frequency
      (black, yellow, …)     (years)     (0, 1, …)
 1    c1                     a1          m1
 2    c2                     a2          m2
 3    c3                     a3          m3
 ⋮    ⋮                      ⋮           ⋮
 i    ci                     ai          mi
 ⋮    ⋮                      ⋮           ⋮
 23   c23                    a23         m23
 24   c24                    a24         m24

Summary

1. Identification of the variables of interest as the first step of planning research is equivalent to specifying the raw data (i.e., the sample) that will be collected by performing an experimental or observational study.
2. A random variable of interest is always a characteristic of the individuals of the population or sample. It is not a characteristic of the population or the sample as a whole.
3. The symbolic notation, e.g., Ci, Ai, and Mi, is used to denote the data because at Step 1 of planning a study, we don’t know the values that we are going to observe. So, we denote the values symbolically by letters.
4. In an observational study, the subscript i is an ordinal label of the ith observational unit randomly sampled from the population.
   In an experiment, it is an ordinal label of the random outcome of the ith independent trial of the experiment. In this example, there are n = 24 observations (not 24 × 3 = 72 observations). Each observation records, in this example, three characteristics of the sampled observational or experimental unit.
5. Each column of the table holds one of the three characteristics observed (or to be observed). The characteristics are called random variables because they randomly vary from observation to observation.

All of this information is encompassed by Elizabeth’s three sentences identifying/defining the three random variables.

A minor detail

Elizabeth listed the levels of coat color as (black, yellow, chocolate), whereas in the data table they are listed as (black, yellow, …). The list (black, yellow, chocolate) implies that only these three colors are possible. [Whether or not that is correct depends on more knowledge of Labrador Retrievers (the population) than I possess.] The list (black, yellow, …) includes virtually all colors. It simply informs us that coat color is going to be characterized nominally (rather than as a wavelength, for example). Neither list says that black Labs or yellow Labs will necessarily be included in the study; if coat color is random, we cannot know in advance which colors will be observed.

The purpose of the parenthetical list of levels is simply to clarify exactly what is going to be observed or measured, e.g., to distinguish between color (black, yellow, …) and color wavelength (nanometers). [Note that … is the mathematical way of saying “and so forth” or “etc.”] The point is that the purpose of the parenthetical list is to clarify rather than to limit. The list of levels must include all possible levels [as color (black, yellow, …) does and as color (black, yellow) doesn’t], and it’s fine if it contains some values that are not possible or that may not be observed.
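
The idea that defining the variables already specifies the sample can be illustrated with a small Python sketch. It builds the empty data table implied by the three definitions: one row per sampled dog and one column per variable. The names used here (Observation, planned_sample, coat_color, age_years, melanomas) are hypothetical and chosen only for illustration; the handout itself defines only the symbols Ci, Ai, and Mi and the sample size n = 24. Unobserved values are represented by None, which plays the role of the symbolic entries ci, ai, and mi.

```python
from dataclasses import dataclass
from typing import List, Optional

N = 24  # sample size stated in the variable definitions


@dataclass
class Observation:
    """One randomly sampled Labrador Retriever: one row of the data table.

    Field names are illustrative assumptions, not part of the handout.
    """
    i: int                             # ordinal label of the sampled dog, 1..N
    coat_color: Optional[str] = None   # Ci: qualitative, nominal
    age_years: Optional[float] = None  # Ai: quantitative, continuous, ratio
    melanomas: Optional[int] = None    # Mi: quantitative, discrete count


def planned_sample(n: int = N) -> List[Observation]:
    """Build the not-yet-observed data table implied by the definitions."""
    return [Observation(i=i) for i in range(1, n + 1)]


sample = planned_sample()
variables = ["coat_color", "age_years", "melanomas"]

print(len(sample))     # 24 observations (rows), not 24 * 3 = 72
print(len(variables))  # 3 characteristics recorded per observation (columns)
```

Each Observation is a characteristic-by-characteristic record of one individual, which mirrors the point that a random variable describes the individuals in the sample. A summary such as the sample mean age would be computed from the whole list and so would not be a field of Observation.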