(Text 2, Chapters 13, 19 & 24)
HCA 741: Essential Programming for Health Informatics
Rohit Kate
• Text 2 Chapter 13
• The U.S. Centers for Disease Control and Prevention (CDC) prepares a dataset of de-identified records of almost every death in the U.S.
http://www.cdc.gov/nchs/deaths.htm
• One file for each year; each exceeds 1 GB
• Publicly available: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/
• Each line in the file is a record of one death
(Perhaps the saddest dataset)
• Each line is a sequence of characters
• The following documentation explains what each character means for the year 1999 (each year may have a slightly different format):
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/mortality/Mort99doc.pdf
This documentation is specific to the 1999 file; it is also at: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/mortality/Mort99us.zip
• In a prototypical death certificate, the causes of death are listed under item 27, parts I and II
• A list of causes is specified, each due to the next, ending with the ultimate underlying cause, for example:
– Bleeding of esophageal varices
– (due to) Portal hypertension
– (due to) Liver cirrhosis
– (due to) hepatitis B
• These are specified by the ICD codes
• These ICD codes for the causes of death appear at characters 162-302 on each line (for year 1999), separated by spaces
• Each ICD code for the cause is preceded by two characters corresponding to its position on the death certificate (which line & which cause number on the line); we will ignore this for our purpose
• Example: 11I219 21I251 61I500 62R54
– I219 (I21.9, Acute myocardial infarction unspecified)
– I251 (I25.1, Atherosclerotic heart disease)
– I500 (I50.0, Congestive heart failure)
– R54 (R54, Senility)
• Note that the “dots” of ICD codes are not included
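The example above can be parsed with a few lines of Python. This is a minimal sketch, not part of the original slides: the helper names parse_causes and add_dot are hypothetical, and it assumes every entry carries exactly two position characters before the ICD code.

```python
def parse_causes(field):
    """Split a causes-of-death field into ICD codes, dropping the
    two-character position prefix from each entry."""
    return [entry[2:] for entry in field.split()]


def add_dot(code):
    """Re-insert the dot after the third character (I219 -> I21.9).
    Three-character codes such as R54 have no dot."""
    return code if len(code) <= 3 else code[:3] + "." + code[3:]


example = "11I219 21I251 61I500 62R54"
codes = parse_causes(example)
dotted = [add_dot(c) for c in codes]
print(codes)   # codes as stored in the file, without dots
print(dotted)  # the same codes in the usual dotted notation
```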
• Use 1999 file: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/Mort1999us.zip
• Unzip it and save it as C:\Python33\Mort99us.dat
• Make a dictionary with ICD codes as keys and number of deaths as values
• Loop through each line, find the sequence of characters listing the causes
• Make a list of the causes
• For each cause, increment the corresponding dictionary value
• Display using the ICD code names (use the corresponding dictionary without dots, icd_nodot.py)
• We can save the output as a .csv (comma-separated values) file and manipulate it with Excel
• In general, save the output in a format that existing software can read, and then work with that software instead of coding everything yourself
# This program computes how many deaths were caused by which diseases
# using the CDC dataset and ICD codes.
# The output is saved in a .csv file which can be manipulated in Excel.
import icd_nodot

def main():
    di = {}  # dictionary of cause of death (ICD codes) as keys and
             # number of occurrences as values
    infile = open("Mort99us.dat", "r")
    for line in infile:
        causes = line[161:302]  # read the causes
        cause_list = causes.split()
        for cause in cause_list:  # for each cause
            a = cause[2:]  # drop the two position characters
            di[a] = di.get(a, 0) + 1
    infile.close()
    icddi = icd_nodot.ICD_dictionary()  # ICD dictionary with dots removed from codes
    outfile = open("CDC_out.csv", "w")
    for key in di:
        if key in icddi:
            # commas are removed from names so they do not split the .csv columns
            print(di[key], ",", icddi[key].replace(",", ""), file=outfile)
        else:
            print(di[key], ",", key, file=outfile)
    outfile.close()
(After Sorting in Excel)
412827 Atherosclerotic heart disease
352559 "Cardiac arrest unspecified"
273644 Congestive heart failure
244162 "Acute myocardial infarction unspecified"
210394 "Chronic obstructive pulmonary disease unspecified"
206996 "Pneumonia unspecified"
203906 Essential (primary) hypertension
176834 "Stroke not specified as hemorrhage or infarction"
162128 "Malignant neoplasm of bronchus or lung unspecified"
149777 Unspecified diabetes mellitus without complications
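The sorting step can also be done in Python before the file is written. The following is a sketch under stated assumptions rather than the course's method: it swaps in collections.Counter and the csv module from the standard library, uses a small hypothetical sample in place of the real records, and uses a hypothetical names dictionary in place of icd_nodot.ICD_dictionary().

```python
import csv
import io
from collections import Counter

# Hypothetical sample of cause fields, standing in for line[161:302]
# of real records.
sample_fields = [
    "11I219 21I251",
    "11I251 61I500",
    "11I251",
]

counts = Counter()
for field in sample_fields:
    for entry in field.split():
        counts[entry[2:]] += 1  # drop the two-character position prefix

# Hypothetical stand-in for icd_nodot.ICD_dictionary()
names = {
    "I219": "Acute myocardial infarction, unspecified",
    "I251": "Atherosclerotic heart disease",
    "I500": "Congestive heart failure",
}

# For a real run, replace io.StringIO() with
# open("CDC_out.csv", "w", newline="").
out = io.StringIO()
writer = csv.writer(out)
for code, n in counts.most_common():  # highest count first
    writer.writerow([n, names.get(code, code)])

csv_text = out.getvalue()
print(csv_text)
```

Because csv.writer quotes any field that contains a comma, the disease names do not need their commas removed in this variant.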
• Text 2 Chapter 19
• A good example of Medical Discovery from datasets
• African-Americans carry the alpha-1 antitrypsin disease gene variants, which are believed to play a role in emphysema, much less often than whites
• Hypothesis: Blacks should have fewer deaths due to emphysema than whites
• We will test it empirically from the CDC dataset
• We will use the 1999 dataset
• Characters 60-61 encode race
– 01 White
– 02 Black
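The documentation numbers characters from 1 and gives inclusive ranges, while Python slices count from 0 and exclude the end index. A small helper makes the conversion explicit; this is only a sketch, the function name field is hypothetical, and the record below is made-up filler rather than real data.

```python
def field(line, start, end):
    """Return characters start..end of a record, where both positions
    are 1-based and inclusive, as in the CDC documentation."""
    return line[start - 1:end]


# Hypothetical record: 59 filler characters, then the race code "02".
record = "x" * 59 + "02" + "x" * 20
race = field(record, 60, 61)  # same as record[59:61]
print(race)
```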
ICD codes for emphysema start with “J43” (from our ICD.txt file):
J43    Emphysema
J43.0  MacLeod's syndrome
J43.1  Panlobular emphysema
J43.2  Centrilobular emphysema
J43.8  Other emphysema
J43.9  "Emphysema, unspecified"
# This program computes the proportions of blacks and whites
# who died of emphysema
def main():
    infile = open("Mort99us.dat", "r")
    whitecount = 0
    blackcount = 0
    whiteemph = 0
    blackemph = 0
    for line in infile:  # for each record
        race = line[59:61]  # read the race
        if race == "01":
            whitecount += 1
        if race == "02":
            blackcount += 1
        causes = line[161:302]  # read the causes
        if causes.count("J43") > 0:  # emphysema was one of the causes
            if race == "01":
                whiteemph += 1
            if race == "02":
                blackemph += 1
    infile.close()
    print("Total Whites in file: ", whitecount)
    print("Total Blacks in file: ", blackcount)
    print("Total Whites with Emphysema: ", whiteemph)
    print("Total Blacks with Emphysema: ", blackemph)
    print("Percent Whites with Emphysema: ", 100*whiteemph/whitecount, "%")
    print("Percent Blacks with Emphysema: ", 100*blackemph/blackcount, "%")
Total Whites in file: 2064169
Total Blacks in file: 285276
Total Whites with Emphysema: 32595
Total Blacks with Emphysema: 2125
Percent Whites with Emphysema: 1.5790858209768677 %
Percent Blacks with Emphysema: 0.7448926653486448 %
• A smaller proportion of deaths among blacks than among whites listed emphysema as a cause (0.74% compared to 1.58%, in 1999)
• This does not prove that that particular gene variant increases the risk of emphysema
• But it supports the claim and warrants further investigation
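To get a feel for how large the difference is, the counts from the program output above can be compared directly. The sketch below computes the ratio of the two proportions (about 2.1) and, as an added check that is not part of the original analysis, a standard two-proportion z statistic with a pooled proportion. Treating the records as independent samples is an assumption made here for illustration.

```python
import math

# Counts taken from the program output above (1999 file).
white_total, white_emph = 2064169, 32595
black_total, black_emph = 285276, 2125

p_white = white_emph / white_total
p_black = black_emph / black_total
ratio = p_white / p_black

# Two-proportion z statistic with a pooled proportion (an added check,
# not part of the original analysis).
pooled = (white_emph + black_emph) / (white_total + black_total)
se = math.sqrt(pooled * (1 - pooled) * (1 / white_total + 1 / black_total))
z = (p_white - p_black) / se

print("Ratio of proportions:", round(ratio, 2))
print("z statistic:", round(z, 1))
```

A large z value only says the difference is unlikely to be chance; it says nothing about the gene variant itself, which is why further investigation is still needed.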
• Text 2, Chapter 24
• Sickle cell disorders are produced by an alteration of hemoglobin
• ICD codes start with “D57”
• We want to know whether deaths due to these disorders are increasing or decreasing
– There is no publication that informs us about this
– We will analyze CDC mortality datasets from a few years to find out ourselves
ICD codes for sickle-cell disorders start with “D57” (from our ICD.txt file):
D57    Sickle-cell disorders
D57.0  Sickle-cell anemia with crisis
D57.1  Sickle-cell anemia without crisis
D57.2  Double heterozygous sickling disorders
D57.3  Sickle-cell trait
D57.8  Other sickle-cell disorders
• We will write a general function to count the number of deaths due to diseases in any CDC file
– Sickle cell disorders will become a special case
• Parameters:
– CDC mortality file name
– A regular expression for diseases
– Start of the characters encoding causes of death
– End of the characters encoding causes of death
We need the last two parameters because the files for different years have slightly different formats.
• Returns:
– Number of deaths due to the specified diseases
– Total number of deaths
(in CDC_diseaseCount.py)
import re

def diseaseCount(filename, diseaseRE, start, end):
    # This function returns the number of deaths due to a disease
    # and the total deaths in a CDC file for any year.
    # The regular expression for the disease ICD code and
    # the start and end characters in a line for the causes
    # of death are part of the parameters.
    diseaseCount = 0
    allCount = 0
    infile = open(filename, "r")
    for line in infile:
        allCount += 1
        if re.search(diseaseRE, line[start:end]):
            diseaseCount += 1
    infile.close()
    return diseaseCount, allCount
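The function can be tried out without downloading a full mortality file. The sketch below is a variant rather than the course's version: disease_count is a hypothetical name, it uses a with statement so the file is closed even if an error occurs, and it runs against a made-up three-record file whose causes field starts at character 11 purely for illustration.

```python
import os
import re
import tempfile


def disease_count(filename, disease_re, start, end):
    """Variant of diseaseCount that compiles the pattern once and uses
    `with` so the file is closed even if an error occurs."""
    pattern = re.compile(disease_re)
    disease_deaths = 0
    all_deaths = 0
    with open(filename, "r") as infile:
        for line in infile:
            all_deaths += 1
            if pattern.search(line[start:end]):
                disease_deaths += 1
    return disease_deaths, all_deaths


# Hypothetical three-record file; each record has 10 filler characters
# followed by the causes field.
records = [
    "0123456789" + "11D571 21I500",
    "0123456789" + "11I219",
    "0123456789" + "11J439 21D570",
]
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("\n".join(records) + "\n")
    path = tmp.name

result = disease_count(path, "D57", 10, 150)
os.remove(path)  # clean up the temporary file
print(result)    # (deaths matching the pattern, total records)
```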
• We will use the years 1999, 2002 and 2004 as used in the book
– For 1996, a different type of file opens after unzipping
• Additionally download the following files, unzip them and save them in C:\Python33\
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/Mort2002us.zip
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/Mort2004us.zip
• Different years have the causes of the deaths at different characters in a line
1999: 162-301
2002: 163-302
2004: 165-304
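The ranges above are 1-based and inclusive, so each one has to be shifted before it is used as a Python slice. A minimal sketch of that conversion follows; the names CAUSE_COLUMNS and to_slice are hypothetical and only illustrate the arithmetic.

```python
# 1-based, inclusive character ranges from the table above.
CAUSE_COLUMNS = {1999: (162, 301), 2002: (163, 302), 2004: (165, 304)}


def to_slice(first, last):
    """Convert a 1-based inclusive range to Python slice bounds."""
    return first - 1, last


slices = {year: to_slice(*cols) for year, cols in CAUSE_COLUMNS.items()}
for year, (start, end) in sorted(slices.items()):
    print(year, start, end, "width:", end - start)
```

Keeping the per-year layout in one table means a new year can be added without touching the counting code.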
(in CDC_diseaseCount.py)
def main():
    c, a = diseaseCount("Mort99US.dat", "D57", 161, 301)
    print("In the year 1999 there were", c, "deaths due to sickle cell anemia, which is", 100000*c/a, "per 100,000")
    c, a = diseaseCount("Mort02US.dat", "D57", 162, 302)
    print("In the year 2002 there were", c, "deaths due to sickle cell anemia, which is", 100000*c/a, "per 100,000")
    c, a = diseaseCount("Mort04US.dat", "D57", 164, 304)
    print("In the year 2004 there were", c, "deaths due to sickle cell anemia, which is", 100000*c/a, "per 100,000")
In the year 1999 there were 799 deaths due to sickle cell anemia, which is 33.362966105481256 per 100,000
In the year 2002 there were 827 deaths due to sickle cell anemia, which is 33.79930325208967 per 100,000
In the year 2004 there were 876 deaths due to sickle cell anemia, which is 36.47872074623137 per 100,000
• According to our output, the rate is increasing (from about 33.4 to about 36.5 per 100,000 deaths between 1999 and 2004)
• Death certificate data are not perfectly reliable because of:
– Doctors' discretion in reporting the causes of death
– Errors in reporting
• But when the numbers are large, they usually reflect reality