Programming with CDC dataset

(Text 2, Chapters 13, 19 & 24)

HCA 741: Essential Programming for Health Informatics

Rohit Kate

CDC Mortality Dataset

• Text 2 Chapter 13

• The U.S. Centers for Disease Control and Prevention (CDC) prepares a dataset of de-identified records of almost every death in the U.S.

http://www.cdc.gov/nchs/deaths.htm

• One file for each year; each exceeds 1 GB

• Publicly available: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/

CDC Data Format

• Each line in the file is a record of one death

(Perhaps the saddest dataset)

• Each line is a sequence of characters

• The following documentation explains what each character means for the year 1999 (each year may have a slightly different format):

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/mortality/Mort99doc.pdf

The documentation is specific to the 1999 file, which is also available at: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/mortality/Mort99us.zip

Representation of Causes of Death

• In a prototypical death certificate, the causes of death are listed under item 27, parts I and II

• A chain of causes is specified, starting with the immediate cause, each due to the next, and ending with the underlying cause, for example:

– Bleeding of esophageal varices

– (due to) Portal hypertension

– (due to) Liver cirrhosis

– (due to) hepatitis B

• These are specified by the ICD codes

Representation of Causes of Death

• These ICD codes for the causes of death appear at characters 162-301 on each line (for year 1999), separated by spaces

• Each ICD code for the cause is preceded by two characters corresponding to its position on the death certificate (which line & which cause number on the line); we will ignore this for our purpose

• Example: 11I219 21I251 61I500 62R54

– I219 (I21.9, Acute myocardial infarction unspecified)

– I251 (I25.1, Atherosclerotic heart disease)

– I500 (I50.0, Congestive heart failure)

– R54 (R54, Senility)

• Note that the “dots” of ICD codes are not included
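As a minimal sketch of how such a field could be unpacked in Python, the snippet below splits the example string above on spaces and drops the two position characters from each entry. The function name `parse_causes` is an assumption made for illustration; it is not part of the course files.

```python
def parse_causes(field):
    """Return the ICD codes (without dots) listed in a causes-of-death field."""
    codes = []
    for entry in field.split():   # entries are separated by spaces
        codes.append(entry[2:])   # skip the two position characters
    return codes

example = "11I219 21I251 61I500 62R54"
print(parse_causes(example))
```

Because `split()` with no argument ignores runs of spaces, the same function also works on the padded, fixed-width field read from a line of the file.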

Find Number of Occurrences of Death for Every Cause

• Use 1999 file: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/Mort1999us.zip

• Unzip it and save at C:\Python33\Mort99us.dat

• Make a dictionary with ICD codes as keys and number of deaths as values

• Loop through each line, find the sequence of characters listing the causes

• Make a list of the causes

• For each cause, increment the corresponding dictionary value

• Display using the ICD code names (use the corresponding dictionary without dots, icd_nodot.py)

Save as a .csv file

• We can save the output as a .csv (comma-separated values) file and manipulate it with Excel

• In general, save the output in a format that existing software can read, then work with that software instead of coding everything yourself

Code for Saving as a .csv file (CDC.py)

# This program computes how many deaths were caused by which diseases
# using the CDC dataset and ICD codes.
# The output is saved in a .csv file which can be manipulated in Excel.

import icd_nodot

def main():
    di = {}  # dictionary with causes of death (ICD codes) as keys and
             # number of occurrences as values
    infile = open("Mort99us.dat", "r")
    for line in infile:
        causes = line[161:301]        # read the causes (characters 162-301 in 1999)
        cause_list = causes.split()
        for cause in cause_list:      # for each cause
            a = cause[2:]             # drop the two position characters
            di[a] = di.get(a, 0) + 1
    infile.close()

    icddi = icd_nodot.ICD_dictionary()  # ICD dictionary with dots removed from codes
    outfile = open("CDC_out.csv", "w")
    for key in di:
        if key in icddi:
            # commas are removed from names so they do not split the .csv column
            print(di[key], icddi[key].replace(",", ""), sep=",", file=outfile)
        else:
            print(di[key], key, sep=",", file=outfile)
    outfile.close()

main()

Top 10 Causes (After Sorting in Excel)

Count    Cause
412827   Atherosclerotic heart disease
352559   Cardiac arrest unspecified
273644   Congestive heart failure
244162   Acute myocardial infarction unspecified
210394   Chronic obstructive pulmonary disease unspecified
206996   Pneumonia unspecified
203906   Essential (primary) hypertension
176834   Stroke not specified as hemorrhage or infarction
162128   Malignant neoplasm of bronchus or lung unspecified
149777   Unspecified diabetes mellitus without complications
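The same ranking could also be produced in Python instead of Excel. The sketch below sorts a dictionary of counts in descending order. The dictionary `di` here holds only three of the counts from the table above, keyed by the dotless ICD codes from the earlier example, and the helper name `top_causes` is hypothetical.

```python
def top_causes(counts, n=10):
    """Return the n (code, count) pairs with the largest counts, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

# Three counts taken from the table above; a real run would use the full dictionary.
di = {"I219": 244162, "I251": 412827, "I500": 273644}
for code, count in top_causes(di, n=2):
    print(count, code)
```

Sorting in code keeps the whole analysis in one script, while the .csv route is handy when further manual exploration in a spreadsheet is expected.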

Case Study: Emphysema Rates in CDC Dataset

• Text 2 Chapter 19

• A good example of Medical Discovery from datasets

• Alpha-1 antitrypsin disease gene variants, which are believed to play a role in emphysema cases, are much less common among African-Americans than among whites

• Hypothesis: Blacks should have fewer deaths due to emphysema than whites

• We will test it empirically from the CDC dataset

• We will use the 1999 dataset

• Characters 60-61 encode race

– 01 White

– 02 Black

ICD Codes for Emphysema
(From our ICD.txt file)

J43     Emphysema
J43.0   MacLeod's syndrome
J43.1   Panlobular emphysema
J43.2   Centrilobular emphysema
J43.8   Other emphysema
J43.9   Emphysema, unspecified

ICD codes for emphysema: Start with “J43”

Code (CDC_Emphysema.py)

# This program computes the proportions of blacks and whites
# who died of emphysema.

def main():
    infile = open("Mort99us.dat", "r")
    whitecount = 0
    blackcount = 0
    whiteemph = 0
    blackemph = 0
    for line in infile:               # for each record
        race = line[59:61]            # read the race (characters 60-61)
        if race == "01":
            whitecount += 1
        elif race == "02":
            blackcount += 1
        causes = line[161:301]        # read the causes (characters 162-301 in 1999)
        if "J43" in causes:           # emphysema was one of the causes
            if race == "01":
                whiteemph += 1
            elif race == "02":
                blackemph += 1
    infile.close()
    print("Total Whites in file: ", whitecount)
    print("Total Blacks in file: ", blackcount)
    print("Total Whites with Emphysema: ", whiteemph)
    print("Total Blacks with Emphysema: ", blackemph)
    print("Percent Whites with Emphysema: ", 100*whiteemph/whitecount, "%")
    print("Percent Blacks with Emphysema: ", 100*blackemph/blackcount, "%")

main()

Results

Total Whites in file: 2064169

Total Blacks in file: 285276

Total Whites with Emphysema: 32595

Total Blacks with Emphysema: 2125

Percent Whites with Emphysema: 1.5790858209768677 %

Percent Blacks with Emphysema: 0.7448926653486448 %
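The percentages can be checked by hand from the four counts. The short sketch below redoes that arithmetic and also computes the ratio of the two percentages, which is not part of the original program; the variable names are chosen for illustration.

```python
white_total, black_total = 2064169, 285276   # counts reported above
white_emph, black_emph = 32595, 2125

white_pct = 100 * white_emph / white_total
black_pct = 100 * black_emph / black_total
ratio = white_pct / black_pct   # how many times higher the white percentage is

print(round(white_pct, 2), round(black_pct, 2), round(ratio, 2))
```

On these counts the white percentage is a little over twice the black percentage.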

Conclusions from the Results

• Blacks have proportionally fewer deaths due to emphysema than whites (0.74% compared to 1.58% of deaths, in 1999)

• This does not prove that that particular gene variant increases the risk of emphysema

• But it supports the claim and warrants further investigation

Another Case Study: Sickle Cell Rates

• Text 2, Chapter 24

• Sickle cell disorders are produced by an alteration of hemoglobin

• ICD codes start with “D57”

• We want to know whether deaths due to these disorders are increasing or decreasing

– There is no publication that informs us about this

– We will analyze CDC mortality datasets of a few years to find it ourselves

ICD Codes for Sickle Cell Disorders

(From our ICD.txt file)

D57     Sickle-cell disorders
D57.0   Sickle-cell anemia with crisis
D57.1   Sickle-cell anemia without crisis
D57.2   Double heterozygous sickling disorders
D57.3   Sickle-cell trait
D57.8   Other sickle-cell disorders

ICD codes for Sickle Cell Disorders: Start with “D57”

Counting Deaths due to Diseases

• We will write a general function to count the number of deaths due to diseases in any CDC file

– Sickle cell disorders will become a special case

• Parameters:

– CDC mortality file name

– A regular expression for diseases

– Start of the characters encoding causes of death

– End of the characters encoding causes of death

We need the last two parameters because the files for different years have slightly different formats.

• Returns:

– Number of deaths due to the specified diseases

– Total number of deaths

Code for the Function (in CDC_diseaseCount.py)

import re

def diseaseCount(filename, diseaseRE, start, end):
    # This function returns the number of deaths due to a disease
    # and the total deaths in a CDC file for any year.
    # The regular expression for the disease ICD code and
    # the start and end characters in a line for the causes
    # of death are part of the parameters.
    # start and end are used as slice bounds: line[start:end].
    count = 0       # deaths whose causes match the regular expression
    allCount = 0    # all deaths in the file
    infile = open(filename, "r")
    for line in infile:
        allCount += 1
        if re.search(diseaseRE, line[start:end]):
            count += 1
    infile.close()
    return count, allCount
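To try the function without downloading a full mortality file, one option is to run it on a tiny made-up file. In the sketch below the three records are synthetic (not CDC data): each is padded with 161 filler characters so that the causes start at character 162, as in the 1999 layout. The helper `make_record` is hypothetical, and the local counter is named `count` so that it does not share the function's name.

```python
import os
import re
import tempfile

def diseaseCount(filename, diseaseRE, start, end):
    """Return (deaths matching diseaseRE, total deaths) for one mortality file."""
    count = 0
    allCount = 0
    with open(filename, "r") as infile:
        for line in infile:
            allCount += 1
            if re.search(diseaseRE, line[start:end]):
                count += 1
    return count, allCount

def make_record(causes):
    # Hypothetical helper: 161 filler characters, then the causes field
    # padded to 140 characters (positions 162-301).
    return "x" * 161 + causes.ljust(140) + "\n"

records = [make_record("11D571 21I500"),   # synthetic record mentioning D57.1
           make_record("11I219 21I251"),   # synthetic record without D57
           make_record("11J439")]          # synthetic record without D57

with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.writelines(records)
    path = tmp.name

result = diseaseCount(path, "D57", 161, 301)
os.remove(path)   # clean up the temporary file
print(result)
```

A small check like this makes it easier to catch an off-by-one error in `start` or `end` before running against a file with millions of lines.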

Mortality Files for Different Years

• We will use the years 1999, 2002 and 2004, as used in the book

– For 1996, a different type of file opens after unzipping

• Additionally download the following files, unzip them and save in C:\Python33\

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/Mort2002us.zip

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/Mort2004us.zip

• Different years have the causes of the deaths at different characters in a line

1999: 162-301

2002: 163-302

2004: 165-304
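Because Python slices count from 0 and exclude the end index, the 1-based character ranges above translate to slice bounds by subtracting 1 from the start and keeping the end unchanged. A small sketch, with the dictionary name `CAUSE_COLUMNS` and the helper `slice_bounds` introduced only for illustration:

```python
# 1-based, inclusive character ranges for the causes of death, per year
CAUSE_COLUMNS = {1999: (162, 301), 2002: (163, 302), 2004: (165, 304)}

def slice_bounds(first_char, last_char):
    """Convert a 1-based inclusive range to Python slice (start, end)."""
    return first_char - 1, last_char

for year, (first, last) in CAUSE_COLUMNS.items():
    start, end = slice_bounds(first, last)
    print(year, start, end, end - start)   # the last value is the field width
```

Each range is 140 characters wide; only its offset within the line changes from year to year.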

Code for Counting Sickle Cell Rates (in CDC_diseaseCount.py)

def main():
    c, a = diseaseCount("Mort99US.dat", "D57", 161, 301)
    print("In the year 1999 there were", c, "deaths due to sickle cell anemia, which is", 100000*c/a, "per 100,000")
    c, a = diseaseCount("Mort02US.dat", "D57", 162, 302)
    print("In the year 2002 there were", c, "deaths due to sickle cell anemia, which is", 100000*c/a, "per 100,000")
    c, a = diseaseCount("Mort04US.dat", "D57", 164, 304)
    print("In the year 2004 there were", c, "deaths due to sickle cell anemia, which is", 100000*c/a, "per 100,000")

main()

Output

In the year 1999 there were 799 deaths due to sickle cell anemia, which is 33.362966105481256 per 100,000

In the year 2002 there were 827 deaths due to sickle cell anemia, which is 33.79930325208967 per 100,000

In the year 2004 there were 876 deaths due to sickle cell anemia, which is 36.47872074623137 per 100,000

Conclusions from the Output

• The rate is increasing according to our output (note that it is expressed per 100,000 deaths, not per 100,000 people)

• Death certificate data is not perfectly reliable due to:

– Doctor’s discretion in reporting the causes of death

– Errors in reporting

• But when the numbers are large, they usually reflect reality
