part i: descriptive statistics

PROBLEM SET 5

STATISTICS ANALYSIS

PART I: DESCRIPTIVE STATISTICS

The purpose of PART I is to use descriptive statistics to explore patterns in the prevalence of diabetes in a high-risk population, the Pima Indians of Arizona.

Database:

The data file PIMA.XLS contains medical data derived from observations of individuals in the Pima population. Each record of the file refers to an individual's fasting blood glucose level and blood glucose level measured 2 hours after the ingestion of 75 g of carbohydrate. All values in the file are non-zero and no values are missing.

Each record of the file contains the following fields:

X1, X2, X3, X4, X5, X6, X7 where

X1 = NIH # for identification of the individual

X2 = fasting glucose, mg/100 ml of plasma

X3 = 2-hr glucose, mg/100 ml of plasma

X4 = sex; 1 = male, 2 = female

X5 = age, years

X6 = height, cm

X7 = weight, kg

PIMA.XLS contains 1211 records and can be found on the Biol 315 homepage.

Required Work:

In performing the required work for this exercise, you will use database functions from an Excel spread-sheet. New spread-sheet functions will include advanced data filtering and descriptive statistical functions (AVERAGE, STDEV, SKEW, KURT). You will also need to set up a frequency analysis with which to generate histograms. The required work will be in two parts:

PART I a.

To illustrate how a transformation can be necessary before a random variable follows a Gaussian (Normal) distribution, you will first transform the results of the glucose tolerance test.

The required transformation is the logarithm to the base 10 of variable X3 for males and females aged 15 to 24 years. With an EXCEL spread-sheet, produce the following histograms in order to compare their shapes:

a. X3 for males, 15-24 years old

b. log10(X3) for males, 15-24 years old

c. X3 for females, 15-24 years old

d. log10(X3) for females, 15-24 years old
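The transformation and binning steps above can also be sketched outside the spread-sheet. The following Python fragment is only an illustration: the function names, the sample readings, and the bin widths are assumptions and are not taken from PIMA.XLS. It uses bins that include their lower edge, whereas Excel's FREQUENCY function counts values up to and including each bin value, so readings that fall exactly on a bin edge may land in a different bin.

```python
import math

def log10_values(values):
    """Return log10 of each value; inputs must be positive (the file has no zeros)."""
    return [math.log10(v) for v in values]

def histogram(values, bin_width, start):
    """Count values into equal-width bins [start + k*w, start + (k+1)*w)."""
    counts = {}
    for v in values:
        k = math.floor((v - start) / bin_width)
        counts[k] = counts.get(k, 0) + 1
    # Return (lower bin edge, count) pairs in ascending order of bin edge.
    return [(start + k * bin_width, counts[k]) for k in sorted(counts)]

# Hypothetical 2-hr glucose readings (mg/100 ml), NOT taken from PIMA.XLS.
x3 = [80, 95, 100, 110, 125, 140, 180, 250, 400]
raw_hist = histogram(x3, bin_width=50, start=50)
log_hist = histogram(log10_values(x3), bin_width=0.1, start=1.9)
```

Comparing `raw_hist` with `log_hist` for the same readings shows how the long right tail of the raw values is pulled in by the logarithm.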


PART I b.

Obtain the histograms and a set of descriptive statistics (mean, standard deviation, skewness and kurtosis) of the log to the base 10 2-hr glucose values for the following twelve groups:

a. Males, 0-14 years old      g. Females, 0-14 years old
b. Males, 15-24               h. Females, 15-24
c. Males, 25-34               i. Females, 25-34
d. Males, 35-44               j. Females, 35-44
e. Males, 45-54               k. Females, 45-54
f. Males, 55 and older        l. Females, 55 and older

For each sex group (males and females), plot the mean and standard deviation of log two-hour blood glucose versus age group midpoint for each of the 6 age groups above.

Use 1 standard deviation as the Y error.
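For checking spread-sheet results, the four descriptive statistics can be computed by hand. The sketch below assumes the usual small-sample formulas for sample skewness and excess kurtosis (the same forms Excel documents for SKEW and KURT); the function name `describe` is hypothetical, and the input is any list of log10 2-hr glucose values for one age-sex group.

```python
import math

def describe(values):
    """Mean, sample SD, skewness and excess kurtosis with small-sample corrections."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values for the kurtosis formula")
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator, like STDEV).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z3 = sum(((v - mean) / sd) ** 3 for v in values)
    z4 = sum(((v - mean) / sd) ** 4 for v in values)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"n": n, "mean": mean, "sd": sd, "skew": skew, "kurt": kurt}

# Tiny symmetric example, only to exercise the formulas.
example = describe([1, 2, 3, 4, 5])
```

A symmetric sample gives a skewness of zero, and a Gaussian sample gives skewness and excess kurtosis near zero, which is the reference against which the twelve groups are compared.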

PART I Submit:

1. In an EXCEL file (15 points):

a. Two X-Y plots (males and females): age vs. mean of the log-transformed 2-hr glucose, plus and minus 1 standard deviation.

b. 14 histograms (Parts I a and I b) and statistics for the 12 combinations of age and sex groups (Part I b).

2. Your discussion in PART I must cover the following points (15 points):

a. Explain the basis for the use of the Normal or Gaussian distribution as a reference in these studies. Comment on the effectiveness of the logarithmic transformation.

b. Describe the patterns of variation of the mean and standard deviation of log 2-hr glucose with age in both sexes.

c. Identify patterns of variation of the log-transformed 2-hr plasma glucose with age and sex in terms of skewness and kurtosis.

PART II: INFERENCE

The purpose of this PART II is to acquaint you with statistical inference by exploring blood glucose measurements obtained from Pima Indians. In addition, you will use a simple function to characterize changes in blood glucose levels with age and sex in this population. For this exercise, you will need to refer to the distributions of Log 2-hr glucose for various sex and age combinations. You may consult the distributions you plotted in problem set 9.

PART IIa.

A function describing two overlapping Gaussian distributions has been proposed to explain the observed Log 2-hr blood glucose distributions in the Pima Indians. This model is

$$f(x) = p_1 \, N(\mu_1, \sigma_1^2) + p_2 \, N(\mu_2, \sigma_2^2) \qquad (1)$$

where $N(\mu_1, \sigma_1^2)$ is the Gaussian distribution for Log 2-hr glucose levels in "normal" individuals with mean $\mu_1$ and standard deviation $\sigma_1$, and $N(\mu_2, \sigma_2^2)$ is the Gaussian distribution for Log 2-hr glucose levels in hyperglycemic individuals with mean $\mu_2$ and standard deviation $\sigma_2$. The quantities $p_1$ and $p_2$ are the relative proportions of "normal" and hyperglycemic individuals in the population.

Pool the log-2hr glucose values for all ages and both sexes to form a single frequency distribution and a single histogram. Using this histogram, determine a cut off point that you think best separates the two component distributions (i.e. pick an antimode value C). Specify this antimode in the analyses that follow and use only this value. For your own information, you might make a guess (based on visual inspection of the distributions) of the components of the model in equation 1 for each age and sex group.
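To get a feel for how the two-component model and the antimode relate, the model can be evaluated numerically. The sketch below is an illustration under stated assumptions: the proportions are written `p1 = 1 - p2` and `p2`, the parameter values are invented rather than estimated from the Pima data, and the grid search is a convenience that does not replace the visual choice of C from the pooled histogram that the exercise asks for.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, p2, mu1, sigma1, mu2, sigma2):
    """Two overlapping Gaussians: proportion p1 = 1 - p2 "normal", p2 hyperglycemic."""
    p1 = 1.0 - p2
    return p1 * normal_pdf(x, mu1, sigma1) + p2 * normal_pdf(x, mu2, sigma2)

def antimode(p2, mu1, sigma1, mu2, sigma2, steps=1000):
    """Grid-search the density minimum between the two component means."""
    grid = [mu1 + k * (mu2 - mu1) / steps for k in range(steps + 1)]
    return min(grid, key=lambda x: mixture_pdf(x, p2, mu1, sigma1, mu2, sigma2))

# Hypothetical parameters on the log10 scale, chosen only for illustration.
c = antimode(p2=0.5, mu1=2.0, sigma1=0.1, mu2=2.5, sigma2=0.1)
```

If the two components overlap so much that the mixture has a single mode, the search returns one of the endpoints, which is a sign that no clear antimode exists for those parameters.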

PART II b.

Next, reanalyze each of the age and sex groups more formally. For each age-sex group, you must separate "normal" and hyperglycemic individuals. If the Log 2-hr glucose value is less than C, the individual is "normal"; otherwise, the individual is hyperglycemic. In each of the age-sex groups, you must calculate the following statistics:

a. $\bar{X}$, the sample mean of Log 2-hr glucose, overall;

b. $s / \sqrt{n}$, the sample standard error of Log 2-hr glucose, overall;

c. $\bar{X}_1$, the sample mean of Log 2-hr glucose for normal individuals;

d. $s_1 / \sqrt{n_1}$, the sample standard error of Log 2-hr glucose for normal individuals;

e. $\bar{X}_2$, the sample mean of Log 2-hr glucose for hyperglycemic individuals;

f. $s_2 / \sqrt{n_2}$, the sample standard error of Log 2-hr glucose for hyperglycemic individuals;

g. $p_2 = n_2 / n = n_2 / (n_1 + n_2)$, the sample proportion of hyperglycemic individuals; and

h. $\sqrt{p_2 q_2 / n}$, the sample standard error of the proportion of hyperglycemic individuals, where $q_2 = 1 - p_2$.

You may choose any procedure to calculate these statistics.
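One possible procedure, shown here only as a sketch, is to split each group at C and compute the eight statistics directly. The function and key names are hypothetical, the group is assumed to be non-empty, and the example values and cutoff are invented; a standard error is reported as `None` when a subgroup has fewer than two members.

```python
import math
import statistics

def mean_and_se(values):
    """Sample mean and standard error s/sqrt(n); SE is None when n < 2."""
    n = len(values)
    if n == 0:
        return None, None
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n) if n > 1 else None
    return mean, se

def group_statistics(log_glucose, cutoff):
    """Statistics a-h for one age-sex group, split at the antimode `cutoff` (C)."""
    normal = [v for v in log_glucose if v < cutoff]   # "normal": value less than C
    hyper = [v for v in log_glucose if v >= cutoff]   # hyperglycemic: otherwise
    n = len(log_glucose)
    p2 = len(hyper) / n
    mean_all, se_all = mean_and_se(log_glucose)
    mean_1, se_1 = mean_and_se(normal)
    mean_2, se_2 = mean_and_se(hyper)
    return {
        "n": n, "n1": len(normal), "n2": len(hyper),
        "mean": mean_all, "se": se_all,
        "mean_1": mean_1, "se_1": se_1,
        "mean_2": mean_2, "se_2": se_2,
        "p2": p2, "se_p2": math.sqrt(p2 * (1 - p2) / n),
    }

# Hypothetical log10 2-hr glucose values and cutoff, for illustration only.
stats = group_statistics([2.0, 2.1, 2.2, 2.4, 2.5], cutoff=2.3)
```

Running the same function over all twelve age-sex groups with a single value of C yields the rows of the table requested in the next part.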


PART II c.

Make a table of these (a to h) parameters for each age-sex group. The table must have informative labels.

PART II d.

Use an EXCEL spread-sheet to produce the following three XY-plots for both males and females (i.e., six graphs):

a. Mean Log 2-hr glucose of normal individuals with 2 S.E. versus age class midpoint;

b. Mean Log 2-hr glucose of hyperglycemic individuals with 2 S.E. versus age class midpoint; and

c. Proportion of hyperglycemic individuals with 2 S.E. versus age class midpoint.

PART II Submit:

1. In your EXCEL file (10 Points):

a. A table of parameters for each age and gender group as outlined in PART II c.

b. 6 XY-plots: Changes in the mean Log 2-hr glucose and proportion of hyperglycemic individuals with age for males and females (PART II d).

2. In your PART II discussion, address the following issues (15 Points):

a. Examine variations of the proportion of hyperglycemic individuals with age.

b. The overall mean Log 2-hr glucose value for any subgroup is related to the mean levels for "normal" and hyperglycemic individuals by a simple relationship:

$$\mu_{overall} = p_1 \mu_1 + p_2 \mu_2$$

where $p_1$ and $p_2$ are the proportions of "normal" and hyperglycemic individuals and $\mu_1$ and $\mu_2$ are their respective means. Using this model, comment on causes of change with age (pooled sexes) of the overall mean for Log 2-hr glucose.

c. Apply appropriate statistical tests to test the following:

i. The mean Log 2-hr glucose value for Pima males age 15-24 is equal to 2.00. Apply the test to determine if females 15-24 have the same value.

ii. The mean values for "normal" Pima age 45-54 are equal for males and females.

iii. The proportions of hyperglycemic individuals for Pima age 45-54 are the same for males and females.

d. Comment on the pattern of variation of Log 2-hr glucose with age for each gender.
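The three hypotheses in item c can be approached with standard test statistics. The sketch below is one possible choice and not necessarily the form the course expects: a one-sample t statistic for (i), Welch's unequal-variance t statistic for (ii) (a pooled-variance t-test is an equally common alternative), and a pooled two-proportion z-test for (iii). All function names are hypothetical. The t statistics are returned with their degrees of freedom so that they can be compared against a t table.

```python
import math
from statistics import NormalDist, fmean, stdev

def one_sample_t(values, mu0):
    """t statistic and degrees of freedom for H0: population mean equals mu0."""
    n = len(values)
    t = (fmean(values) - mu0) / (stdev(values) / math.sqrt(n))
    return t, n - 1

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for H0: equal means."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic and two-sided normal p-value for H0: equal proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, `two_proportion_z(x1, n1, x2, n2)` would take the counts of hyperglycemic males and females aged 45-54 (`x1`, `x2`) and the corresponding group sizes (`n1`, `n2`) from the table built in PART II c.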

PART III: REGRESSION AND CORRELATION

The purpose of this PART III is to apply correlation analysis to the identification of risk factors for diabetes in the Pima population. In the analysis that follows, you will need to calculate a body mass index (BMI) for each individual in the sample data set, PIMA.

BMI is simply the weight of the individual (kg) divided by the square of the height (m). Heights in the PIMA file are recorded in cm and must be converted to m first.

Required Work:

PART III a.

Sort the PIMA file and generate four data sets. Each data set should have the format:

V1, V2, V3, V4 where V1 is Log 2-hr glucose, V2 is age, V3 is Log fasting glucose, and V4 is Log BMI

(Log is log to the base 10).


Constraints on the data sets are:

Data set   Sex     BMI (kg/m^2)   V1        Label
Data 1     males   <= 30          <= 2.3    N, N
Data 2     males   <= 30          > 2.3     N, H
Data 3     males   > 30           <= 2.3    O, N
Data 4     males   > 30           > 2.3     O, H
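The BMI calculation and the four constraints can be expressed compactly. The sketch below assumes the field meanings listed for PIMA.XLS (sex coded 1 for male, height in cm, weight in kg, 2-hr glucose in mg/100 ml); the function names are hypothetical, and the thresholds of 30 and 2.3 are the ones given in the constraints.

```python
import math

def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2; height is converted from cm to m."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

def classify(sex, weight_kg, height_cm, glucose_2hr, bmi_limit=30.0, v1_limit=2.3):
    """Return the data set number (1-4) for a male record, or None otherwise."""
    if sex != 1:  # all four data sets are restricted to males
        return None
    obese = bmi(weight_kg, height_cm) > bmi_limit
    hyper = math.log10(glucose_2hr) > v1_limit  # V1 is log10 of 2-hr glucose
    # Data 1: not obese, V1 <= limit; Data 2: not obese, V1 > limit;
    # Data 3: obese, V1 <= limit;     Data 4: obese, V1 > limit.
    return 1 + (2 if obese else 0) + (1 if hyper else 0)
```

In a spread-sheet the same split can be done by adding BMI and log columns and then filtering on the two thresholds.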

PART III b. Use EXCEL to find the correlations among V1, V2, V3, and V4 for all 4 data sets indicated above.
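As a cross-check on the spread-sheet output, the Pearson correlation coefficient (the quantity Excel's CORREL function returns) can be computed directly. This is a minimal sketch; `pearson` and `correlation_matrix` are hypothetical names, and `columns` is assumed to be a dictionary mapping the labels V1 to V4 to equal-length lists of values from one data set.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

The matrix is symmetric with ones on the diagonal, so only the six off-diagonal pairs carry information for each data set.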

PART III c. Use EXCEL to produce scattergrams (x-y plots with points only) of V1 versus V2 for all four data files.

PART III Submit:

1. In your EXCEL file (10 points):

a. A summary table containing a 4x4 correlation matrix for each of the four data sets;

b. Four scattergrams with trendlines of V1 versus V2.

2. A discussion of risk factors for diabetes that addresses at least the following issues (15 points):

a. Possible reasons for statistically significant correlation coefficients among the four variables in the four data sets studied.

b. By developing a graphical representation of the relation between BMI and the proportion of the population that is hyperglycemic (i.e. presumptively diabetic), determine the extent of risk imposed by obesity on the incidence of diabetes.
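One way to prepare the graphical representation in item b is to group individuals into BMI classes and compute the hyperglycemic proportion in each class. The sketch below is an assumption-laden illustration: the function name, the 5 kg/m^2 class width, and the sample records are invented, and the 2.3 threshold on V1 is borrowed from the data set constraints in PART III a.

```python
import math

def hyperglycemic_proportion_by_bmi(records, bin_width=5.0, v1_limit=2.3):
    """records: iterable of (BMI, log10 2-hr glucose) pairs.

    Returns {lower edge of BMI class: (class size, proportion with V1 > v1_limit)}.
    """
    totals, hyper = {}, {}
    for bmi_value, v1 in records:
        edge = math.floor(bmi_value / bin_width) * bin_width
        totals[edge] = totals.get(edge, 0) + 1
        if v1 > v1_limit:
            hyper[edge] = hyper.get(edge, 0) + 1
    return {e: (totals[e], hyper.get(e, 0) / totals[e]) for e in sorted(totals)}

# Hypothetical (BMI, V1) pairs, not taken from the PIMA file.
result = hyperglycemic_proportion_by_bmi(
    [(22, 2.0), (24, 2.4), (31, 2.5), (33, 2.6), (34, 2.1)]
)
```

Plotting the proportion against the class midpoint, with the class size kept alongside to judge reliability, gives a direct picture of how risk changes with BMI.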
