Uploaded by Lou Chua

Describing Variation and
Distribution of Data
CHAPTER 8
“Variability is the law of life, as no two faces
are the same, so no two bodies are alike, and
no two individuals react alike and behave
alike under the abnormal conditions which
we know as disease”
-William Osler
VARIABLE- A measure of a single
characteristic that can vary
VARIATIONS
CAUSES
Biologic differences- can result from many factors such as genes, nutrition,
environmental exposures, age, sex and race.
Presence or absence of disease and stages or extent of disease
Example: cancer of the cervix may be in situ, localized, invasive, or metastatic.
Different conditions of measurement often account for the variations observed in
medical data and include factors such as time of the day, ambient temperature or
noise, and the presence of fatigue or anxiety in the patient.
Example: Blood pressure is higher with anxiety or following exercise and lower after
sleep.
VARIATIONS
CAUSES
Different techniques of measurement – can produce different results.
Example: A blood pressure measurement derived from the use of an
intraarterial catheter may differ from a measurement derived from the
use of an arm cuff.
Measurement Error –can also cause variation
Example: Two different blood pressure cuffs of the same size may give
different measurements in the same patient because of defective
performance by one of the cuffs.
VARIATIONS

Some types of variation distort data systematically in one direction. This
form of distortion is called systematic error and can introduce bias.
Example: measuring and weighing patients while wearing shoes

Other types of variation are random; this is called random error. It makes
some readings too high and others too low, but it is not systematic and does
not introduce bias.
Example: slight, inevitable inaccuracies in obtaining any measurement, such as
blood pressure.
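The contrast between the two kinds of error can be illustrated with a small simulation. This is only a sketch: the true weight of 70 kg, the 1.5 kg added by shoes, and the 0.5 kg of random scatter are made-up numbers chosen for illustration, not values from the lecture.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_WEIGHT = 70.0       # hypothetical true weight (kg)
SHOE_OFFSET = 1.5        # hypothetical systematic error from shoes (kg)
NOISE_SD = 0.5           # hypothetical random error of the scale (kg)

# Random error only: readings scatter above and below the true value.
random_only = [TRUE_WEIGHT + rng.gauss(0, NOISE_SD) for _ in range(10_000)]

# Systematic plus random error: every reading is shifted in the same direction.
with_shoes = [TRUE_WEIGHT + SHOE_OFFSET + rng.gauss(0, NOISE_SD)
              for _ in range(10_000)]

bias_random = statistics.mean(random_only) - TRUE_WEIGHT
bias_shoes = statistics.mean(with_shoes) - TRUE_WEIGHT
print(f"average error, random only: {bias_random:+.2f} kg")
print(f"average error, with shoes:  {bias_shoes:+.2f} kg")
```

Averaged over many readings, the random errors largely cancel out, while the shoe offset remains in full. That remaining shift is the bias.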
Statistics and Variables
Quantitative and Qualitative Data

A quantitative characteristic, such as systolic blood pressure
measurement or serum sodium level, is characterized using a defined,
continuous measurement scale.

Qualitative characteristic, such as coloration of the skin, is
described by its features, generally in words rather than numbers.
Statistics and Variables
Types of Variables

Nominal Variables

Dichotomous (binary) variables

Ordinal (ranked) variables

Continuous (dimensional) variables

Ratio variables

Risks and proportions
Types of Variables
Nominal Variables

Naming or categoric variables that are not based on measurement
scales or rank order.
Examples:
- Blood groups (O, A, B, and AB)
- Occupations
- Food groups
- Skin color
- Assigning a number to each color (e.g., 1 is bluish purple, 2 is red, 3
is white, 4 is blue, and 5 is yellow) does not change this; the numbers are
only labels
Types of Variables
Dichotomous (Binary) Variables

Dichotomous (Greek, “cut into two”)

Variable with only two levels
Example: Investigators might choose to create a variable with only two levels: normal
skin color (coded as a 1) and abnormal skin color (coded as 2)

In many cases dichotomous variables inadequately describe the information needed.
Example: Study of heart murmurs
- Dichotomous data concerning a murmur’s timing (e.g., systolic or diastolic)
- Nominal data on its location (e.g., aortic valve area) and character (e.g., rough)
- Ordinal data on its loudness (e.g., grade III)

Dichotomous, nominal and often ordinal variables are referred to as discrete variables
because the numbers of possible values they can take are countable
Types of Variables
Ordinal (Ranked) Variables

Data that can be characterized in terms of three or more qualitative
values that have a clearly implied direction from better to worse.
Examples
- Satisfaction with care (“very satisfied”, “fairly satisfied”, “not
satisfied”)
- Amount of swelling in a patient’s legs (“none”, or 1+, 2+, 3+, or 4+)
- Pain (absent, mild, moderate, or severe; or a scale of 0-10, where 0 is no
pain and 10 is the worst imaginable pain)
Types of Variables
Continuous (Dimensional) Variables

Data that are measured in continuous (dimensional) measurement
scales.

Continuous data show not only the position of the different
observations relative to each other, but also the extent to which one
observation differs from another.
Examples
-Patients’ height, weights, systolic and diastolic blood pressures and
serum glucose levels.
Types of Variables
Ratio Variables

If a continuous scale has a true 0 point, the variables derived from it
can be called ratio variables.

The Kelvin temperature scale is a ratio scale because 0 degrees on this
scale is absolute 0.

The Centigrade temperature scale is a continuous scale, but not a ratio
scale, because 0 degrees on this scale does not mean the absence of
heat.
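A short calculation shows why the true 0 point matters. The two temperatures below (10 and 20 degrees Centigrade) are arbitrary illustrative values; the conversion constant 273.15 is the standard offset between the two scales.

```python
def celsius_to_kelvin(celsius):
    # 0 K is absolute zero; 0 degrees Centigrade corresponds to 273.15 K.
    return celsius + 273.15

cool, warm = 10.0, 20.0   # illustrative temperatures in degrees Centigrade

# On the Centigrade scale the ratio looks like "twice as hot", but it is not
# meaningful because 0 on this scale is not the absence of heat.
ratio_celsius = warm / cool

# On the Kelvin (ratio) scale the same two temperatures differ by about 3.5%.
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(ratio_celsius)
print(round(ratio_kelvin, 3))
```

Differences are meaningful on both scales (the gap is 10 degrees either way), but ratios are meaningful only on the scale with a true 0.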
Examples of the Different Types of Data

Information Content   Variable Type              Examples
Higher                Ratio                      Temperature (Kelvin); blood pressure
Higher                Continuous (dimensional)   Temperature (Fahrenheit)
Higher                Ordinal (ranked)           Edema = 3+ out of 5; perceived quality of care = good/fair/poor
Higher                Binary (dichotomous)       Gender = male/female; heart murmur = present/absent
Lower                 Nominal                    Blood type; skin color
Types of Variables
Risks and Proportions as Variables

Risk is the conditional probability of an event (e.g., death or disease)
in a defined population in a defined period.

Risks and proportions are variables created by the ratio of
counts in the numerator to counts in the denominator.

Risks and proportions can be analyzed using the statistical methods
for continuous variables.
COUNTS AND UNITS OF OBSERVATION

The unit of observation is the person or thing from which the data originated.
Examples:
- Persons
- Animals
- Cells

Counts may be arranged in a frequency table, with one characteristic on the
x-axis, another on the y-axis, and the counts in the cells.
COUNTS AND UNITS OF OBSERVATION
TABLE 8.2 Standard 2x2 Table Showing Gender of 71 Participants and Whether
Serum Total Cholesterol Was Checked

          CHOLESTEROL LEVEL (NO. OF PARTICIPANTS)
GENDER    Checked     Not Checked   Total
Female    17 (63%)    10 (37%)      27 (100%)
Male      25 (57%)    19 (43%)      44 (100%)
Total     42 (59%)    29 (41%)      71 (100%)

Data from unpublished findings in a sample of 71 young adults in
Connecticut.
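The percentages in Table 8.2 are proportions: each cell count divided by its row total. A minimal sketch that recomputes them from the counts in the table (the dictionary layout and the helper name `row_percentages` are illustrative choices, not part of the source):

```python
# Cell counts from Table 8.2.
counts = {
    "Female": {"Checked": 17, "Not checked": 10},
    "Male": {"Checked": 25, "Not checked": 19},
}

def row_percentages(row):
    # Each cell count divided by the row total, as a whole-number percentage.
    total = sum(row.values())
    return {column: round(100 * n / total) for column, n in row.items()}

# The "Total" row is the column-wise sum of the two gender rows.
total_row = {
    column: sum(counts[gender][column] for gender in counts)
    for column in ("Checked", "Not checked")
}

for label, row in {**counts, "Total": total_row}.items():
    print(label, row, row_percentages(row), "n =", sum(row.values()))
```

The numerator and denominator are both counts of units of observation (here, participants), which is what makes the result a proportion.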
Combining Data

Combining data is the conversion of a continuous variable to an ordinal
variable by grouping units with similar values together.

Example:

Individual birth weights of infants can be converted to a range of birth weights.

Advantage:

Percentages can be calculated for each group, for example to show
mortality and survival rates.

Disadvantage:

Loss of individual information
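Grouping birth weights into ranges might look like the sketch below. The ten birth weights are hypothetical values invented for illustration, and the cutoffs (1500 g and 2500 g) are an assumption based on commonly used birth-weight categories, not something the lecture specifies.

```python
from collections import Counter

# Hypothetical birth weights in grams (illustrative only, not study data).
birth_weights_g = [980, 1450, 2100, 2480, 2500, 2950, 3200, 3400, 3650, 4100]

def weight_group(grams):
    # Convert a continuous value into an ordered (ordinal) category.
    if grams < 1500:
        return "<1500 g"
    if grams < 2500:
        return "1500-2499 g"
    return ">=2500 g"

groups = Counter(weight_group(w) for w in birth_weights_g)

for label in ("<1500 g", "1500-2499 g", ">=2500 g"):
    n = groups[label]
    share = 100 * n / len(birth_weights_g)
    print(f"{label}: {n} infants ({share:.0f}%)")
```

After grouping, a 2100 g infant and a 2480 g infant are indistinguishable, which is the loss of individual information noted above.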
FREQUENCY DISTRIBUTIONS
FREQUENCY DISTRIBUTIONS OF CONTINUOUS
VARIABLES
A frequency distribution can be shown by creating a table that lists the
values of the variable according to the frequency with which each value
occurs.
[Figure: frequency distribution. x-axis: Serum level of Total Cholesterol
(mg/dL), 123 to 263; y-axis: Number of persons, 0 to 4.5]
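A frequency table can be built by counting how often each value occurs. The individual cholesterol values for the 71 participants are not listed in these slides, so this sketch uses the 26 HDL cholesterol values from Table 8.4 instead.

```python
from collections import Counter

# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

# Map each observed value to the number of participants who have it.
frequency = Counter(hdl)

# Print the table in order of value, with a simple text bar for each count.
for value in sorted(frequency):
    print(f"{value:>3} mg/dL  {'*' * frequency[value]}  ({frequency[value]})")
```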
FREQUENCY DISTRIBUTIONS
Range of a Variable
Range
The distance between the lowest and highest
observations of the variable.
Example
Based on the table of serum levels of total cholesterol reported in 71
participants, the cholesterol levels vary from a value of
124 mg/dL to a value of 264 mg/dL.
Range = (264 - 124) = 140 mg/dL
FREQUENCY DISTRIBUTIONS
Real and Theoretical Frequency Distributions
Real frequency distributions – are those obtained from actual data or a
sample.
Theoretical frequency Distributions –are calculated using assumptions
about the population from which the sample was obtained.
- Normal Distribution
FREQUENCY DISTRIBUTIONS
Real and Theoretical Frequency Distributions
NORMAL DISTRIBUTION
- It is also called the Gaussian distribution (after Johann Karl Gauss)
- Bell-shaped
- Bell-shaped curves are often used to represent the expected or
theoretical distribution of the observations (the height of the curve
on the y-axis) for the different possible values on a measurement
scale (on the x-axis)
normal (Gaussian) distribution
mean=median=mode
Symmetrical distribution
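These properties can be checked numerically with Python's built-in `statistics.NormalDist`. The mean of 57.5 and standard deviation of 13.1 are used purely as illustrative parameters; nothing here claims that any real data set follows this curve.

```python
from statistics import NormalDist

MEAN, SD = 57.5, 13.1               # illustrative parameters only
curve = NormalDist(mu=MEAN, sigma=SD)

# Half of the theoretical observations lie at or below the mean,
# so the mean is also the median.
share_below_mean = curve.cdf(MEAN)

# Symmetry: the curve has the same height 10 units either side of the mean.
height_left = curve.pdf(MEAN - 10)
height_right = curve.pdf(MEAN + 10)

# The curve is highest at the mean, so the mode is there as well.
peak_is_at_mean = (curve.pdf(MEAN) > curve.pdf(MEAN - 0.1)
                   and curve.pdf(MEAN) > curve.pdf(MEAN + 0.1))

print(share_below_mean)
print(height_left, height_right)
print(peak_is_at_mean)
```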
FREQUENCY DISTRIBUTIONS
Parameters of a Frequency Distribution

Measures of Central Tendency and Measures of Dispersion
- Two types of descriptors, known as parameters, that define the
frequency distributions of continuous data.
Measures of central tendency
Examining a distribution
First step = look for the central tendency of the observations
The next step= examine in detail the mode, median, and the mean.
Measures of Central Tendency
Mode

The most commonly observed value.

A frequency distribution typically has a mode at
more than one value.
Example
In the table of serum levels of total cholesterol
reported in 71 participants, the most commonly
observed cholesterol levels (each with four
observations) are 171 mg/dL and 180 mg/dL.
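A distribution with several modes can be handled with `statistics.multimode`, which returns every value tied for the highest count. The 71 cholesterol values are not listed in these slides, so this sketch uses the HDL values from Table 8.4, where four values are tied.

```python
import statistics

# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

# multimode returns all of the most frequent values, not just one of them.
modes = statistics.multimode(hdl)
print(modes)   # [47, 48, 58, 60], each observed twice
```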
Measures of Central Tendency
Median

It is the middle observation when data have
been arranged in order from the lowest value
to the highest value.
Example
In the table shown, the median value is 178
mg/dL.
Measures of Central Tendency
Median

When there is an even number of observations, the
median is considered to lie halfway between the two
middle observations.
Example:
In the table, the two middle observations are the 13th
and 14th observations. The corresponding values for
these are 57 and 58 mg/dL.
Initial HDL cholesterol values (mg/dL) of participants:
31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
60, 60, 62, 63, 64, 67, 69, 70, 77, 81, and 90
Median (mg/dL): (57 + 58)/2 = 57.5
The median value is also called the
50th percentile observation because
50% of the observations lie at or below
this value.
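The halfway rule for an even number of observations can be written out step by step. The helper name `median_by_hand` is an illustrative choice; the result is compared with Python's built-in `statistics.median`.

```python
import statistics

# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

def median_by_hand(values):
    ordered = sorted(values)          # arrange from lowest to highest
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]        # odd N: the single middle observation
    # Even N: halfway between the two middle observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

median = median_by_hand(hdl)
print(median)                         # 57.5, halfway between 57 and 58
print(statistics.median(hdl))         # the built-in gives the same value
```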
Measures of Central Tendency
Mean

It is the average value, or the
sum (∑) of all the observed values
(x_i) divided by the total number of
observations (N), where the
subscript i means “the value
of x for individual i, where i
ranges from 1 to N”.
Mean = $\bar{x} = \frac{\sum x_i}{N}$
Example:
No. of observations or N: 26
Initial HDL cholesterol values (mg/dL) of participants:
31, 41, 44, 46, 47, 47, 48, 48, 49, 52,
53, 54, 57, 58, 58, 60, 60, 62, 63,
64, 67, 69, 70, 77, 81, and 90
Mean or x̅ (mg/dL): 1496/26 = 57.5
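The same arithmetic in code, using the 26 HDL values from the example:

```python
# Initial HDL cholesterol values (mg/dL) of the 26 participants.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

total = sum(hdl)     # sum of all the observed values
n = len(hdl)         # N, the total number of observations
mean = total / n     # the sum divided by N

print(total, n)          # 1496 26
print(round(mean, 1))    # 57.5
```

The unrounded mean is about 57.54 mg/dL. Here it happens to round to the same figure as the median, but the two are calculated differently and need not agree in general.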
Measures of dispersion

After the central tendency of a frequency distribution is
determined, the next step is to determine how spread out (dispersed)
the numbers are.

Based on Percentiles

Based on the Mean
Measures of Dispersion
Based on Percentiles

Percentile of a distribution: a point at which a certain percentage
of the observations lie below the indicated point when all the
observations are ranked in descending order.
Example:
- The median, discussed previously, is the 50th percentile because 50% of
the observations lie at or below it.
- The 75th percentile is the point at or below which 75% of the
observations lie.
- The 25th percentile is the point at or below which 25% of the
observations lie.
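One simple way to find the 25th and 75th percentiles is to take the median of the lower half and of the upper half of the ordered data. This convention is an assumption made for the sketch; statistical software uses several slightly different percentile rules that can give different answers. For the HDL values in Table 8.4 it reproduces the interquartile range reported there (64 - 48 = 16).

```python
import statistics

# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

ordered = sorted(hdl)
half = len(ordered) // 2

# Assumed convention: quartiles are the medians of each half of the data.
lower_half = ordered[:half]       # the 13 lowest observations
upper_half = ordered[-half:]      # the 13 highest observations

q1 = statistics.median(lower_half)    # 25th percentile
q2 = statistics.median(ordered)       # 50th percentile (the median)
q3 = statistics.median(upper_half)    # 75th percentile
iqr = q3 - q1                         # interquartile range

print(q1, q2, q3, iqr)                # 48 57.5 64 16
```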
Measures of Dispersion
Based on the Mean
Three measures of dispersion based on the mean

Mean deviation

Variance

Standard deviation
Mean Absolute Deviation

This is seldom used, but helps define the concept of dispersion.
Mean deviation = $\frac{\sum |x_i - \bar{x}|}{N}$

It lacks the mathematical properties needed to serve as the basis for many
statistical tests.

Variance has become the fundamental measure of dispersion.
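As a worked illustration of the mean absolute deviation, the sketch below applies the definition (the average distance of the observations from the mean, ignoring sign) to the HDL values from Table 8.4. The slides do not report this figure, so the result is computed here rather than quoted.

```python
# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

n = len(hdl)
mean = sum(hdl) / n

# Absolute deviations: distance of each value from the mean, sign dropped.
absolute_deviations = [abs(x - mean) for x in hdl]

# Mean absolute deviation: the sum of those distances divided by N.
mean_deviation = sum(absolute_deviations) / n
print(round(mean_deviation, 2))   # about 10.08 mg/dL
```

Without the absolute value, the positive and negative deviations would cancel and always sum to zero, which is why the sign has to be removed (or, for the variance, squared away).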
Variance

The fundamental measure of dispersion in statistics that are based on the normal distribution.

It is the sum of the squared deviations from the mean, divided by the number of observations
minus 1.
Variance = $s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$
$s^2$ = symbol for variance calculated from the observed data
$N - 1$ = degrees of freedom
$\sum (x_i - \bar{x})^2$ = numerator of the variance, an extremely important measure in statistics. It is
usually called either the sum of squares (SS) or the total sum of squares (TSS)
How to compute the variance?
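A step-by-step sketch of the variance calculation, using the HDL values from Table 8.4. The intermediate results match the TSS and variance reported in that table, and the last line cross-checks against Python's built-in sample variance, which also divides by N - 1.

```python
import statistics

# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

n = len(hdl)
mean = sum(hdl) / n

# Step 1: squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in hdl]

# Step 2: add them up to get the total sum of squares (TSS).
tss = sum(squared_deviations)

# Step 3: divide by the degrees of freedom, N - 1.
variance = tss / (n - 1)

print(round(tss, 2))                       # 4298.46
print(round(variance, 2))                  # 171.94
print(round(statistics.variance(hdl), 2))  # same value from the built-in
```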
Standard deviation
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$

It is the square root of the variance

It is used to describe the amount of spread in the frequency
distribution.

It can be thought of as a kind of average of the deviations from the mean.
How to compute the standard deviation?

Review (variance): $s^2$ = 171.94
$s = \sqrt{171.94} = 13.1$
The standard deviation is 13.1 mg/dL.
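In code, the standard deviation is one more step after the variance. This sketch uses the HDL values from Table 8.4 and compares the result with the built-in `statistics.stdev`.

```python
import math
import statistics

# Initial HDL cholesterol values (mg/dL) of the 26 participants in Table 8.4.
hdl = [31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57, 58, 58,
       60, 60, 62, 63, 64, 67, 69, 70, 77, 81, 90]

variance = statistics.variance(hdl)       # sample variance, divides by N - 1
standard_deviation = math.sqrt(variance)  # square root of the variance

print(round(variance, 2))                 # 171.94
print(round(standard_deviation, 1))       # 13.1
print(round(statistics.stdev(hdl), 1))    # same value from the built-in
```

Taking the square root returns the measure to the original units (mg/dL), which is why the standard deviation is easier to interpret than the variance.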
TABLE 8.4 Raw Data and Results of Calculations in Study of Serum Levels of High-Density
Lipoprotein (HDL) Cholesterol in 26 Participants

Parameters                                Raw Data or Results of Calculation
No. of observations or N                  26
Initial HDL cholesterol values (mg/dL)    31, 41, 44, 46, 47, 47, 48, 48, 49, 52, 53, 54, 57,
of participants                           58, 58, 60, 60, 62, 63, 64, 67, 69, 70, 77, 81, and 90
Highest value (mg/dL)                     90
Lowest value (mg/dL)                      31
Mode (mg/dL)                              47, 48, 58, and 60
Median (mg/dL)                            (57 + 58)/2 = 57.5
Sum of the values, or sum of xi (mg/dL)   1496
Mean, or x̅ (mg/dL)                        1496/26 = 57.5
Range (mg/dL)                             90 - 31 = 59
Interquartile range (mg/dL)               64 - 48 = 16
Sum of (xi - x̅)², or TSS                  4298.46 mg/dL squared
Variance, or s²                           171.94 mg/dL squared
Standard deviation, or s                  √171.94 = 13.1 mg/dL
Problems in Analyzing a Frequency
Distribution

In a normal (Gaussian) distribution, the following holds true:
mean=median=mode
Symmetrical distribution
Problems in Analyzing a Frequency
Distribution
Skewness

SKEWNESS – A horizontal stretching of a frequency
distribution to one side or the other, so that one tail of
observations is longer and has more observations than the
other tail.

Skewed to the left

Skewed to the right
Problems in Analyzing a Frequency
Distribution

Skewed to the left – when a histogram or a frequency polygon has a
longer tail on the left side of the diagram
- Negatively skewed distribution
Problems in Analyzing a Frequency
Distribution

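Skewed to the right – when a histogram or a frequency polygon has a
longer tail on the right side of the diagram
- Positively skewed distribution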
Problems in Analyzing a Frequency
Distribution

Kurtosis- characterized by a vertical stretching or flattening of the
frequency distribution.
C. Abnormal peaking
D. Abnormal flattening
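Skewness and kurtosis can also be expressed as numbers. The sketch below uses the common moment-based formulas, which are an assumption here since the lecture describes these ideas only graphically. The four small data sets are invented purely to show each shape.

```python
def central_moment(values, k):
    # Average of the k-th power of the deviations from the mean.
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    # Negative: longer tail on the left. Positive: longer tail on the right.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Relative to a normal curve: positive is more peaked, negative is flatter.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]                  # illustrative data only
right_tail = [1, 1, 2, 2, 3, 10]             # one extreme high value
left_tail = [-x for x in right_tail]         # mirror image of right_tail
peaked = [-5, 0, 0, 0, 0, 0, 0, 0, 5]        # sharp peak with thin tails

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
print(excess_kurtosis(symmetric), excess_kurtosis(peaked))
```

The sign of the skewness gives the direction of the longer tail, and the sign of the excess kurtosis distinguishes abnormal peaking from abnormal flattening.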
Thank you very much for listening!
Reference:

Elmore, J. G., Wild, D. M. G., Nelson, H. D., & Katz, D. L. (2020).
Jekel’s Epidemiology, Biostatistics, Preventive Medicine, and Public
Health. Elsevier.