Report of the study group on data and application security for health care

Paul Kantor, Gio Wiederhold, Bhavani Thuraisingham, Le Grunewald, Gail-Joon Ahn, Murat Kantarcioglu, Peng Liu

1 Overview

This discussion takes place in the context of the heightened federal emphasis on moving from a situation in which most medical records are handwritten physician's or nurse's notes to one in which electronic records are the primary means of storing and sharing information about individual health. The President has set this as a high priority, and it is widely believed that introducing such a system will bring great benefits in quality and efficiency, and will support the discovery of broad general facts about healthcare that are presently obscured by the difficulty of assembling the records.

At the same time, this transition poses very substantial risks to the perceived level of privacy and confidentiality currently associated with health records. As an example, it is reported that in the UK, where healthcare is nationalized, one may buy a person's health record (excluding people who use private healthcare services) for about £25 on the street. In the American context, this concern is exacerbated by a general mistrust of the government, and a desire not to have all of one's records available for examination by a government agency. There are laws in the United States which impede the ability of one agency to examine the records of another. On the other hand, one might take the position that “privacy is an illusion” and that all records are available to anyone who is willing to expend the effort to obtain them. In the end, how much privacy is desired should be a personal decision, and health care systems should record and respect that decision and provide the desired type and level of security.

Healthcare records have interesting technical characteristics. They contain both structured and unstructured material, and both forms are essential.
There is some semi-structured data, where, for example, a diagnosis may be selected from a controlled vocabulary and additional physician's notes are added. There are multimedia data, including many types of image records. In addition, there are time-series data: both slow time series, as in the monitoring of a suspicious area in a chest X-ray, and rapid series, such as those produced by new techniques for examining the beating heart. The intensity of updates varies greatly: infrequent updates during healthy periods of a patient's life, and massive updates during a hospitalization. The data sets have both temporal and spatial dimensions, with both dimensions varying enormously in scale.

There are many kinds of security concerns which drive this study.

1. Patients' records are maintained at specific hospitals, clinics, and physician offices, but they ought to be available, when needed, at any site. However, wide access increases the opportunities for information leaks.
2. Electronic devices can be used to support remote healthcare provided for patients at their homes. The required two-way communication must be reliable and secure.
3. Doctors seek to consult with other doctors, and one can imagine a future in which there are bulletin board broadcasts by physicians seeking help and information on a particularly challenging or puzzling case. Again, patient privacy is at risk here.
4. Cyber-physical systems such as heart monitors, implanted defibrillators, and computer-controlled infusion pumps for medications pose risks. Most of these devices involve embedded software, and parameters that are subject to external change. These devices represent unique challenges, as failures of and interference with these applications could cause harm or even death to the patient.
5.
Accessible electronic health records will facilitate data aggregation for research, which is essential for dealing with the variety and complexity of medical conditions, but such aggregations increase the risk of privacy violations.
6. The medical information system must guard against fraud and billing manipulation.

To be effective, the study of healthcare data applications and security must involve computer scientists working very closely with healthcare professionals, as well as with experts in policy, privacy and law.

2 Stories

We can form a grid to represent the main security issues, and place examples into such a grid.

            Current     Transition   Future
Integrity   6 aspects   6 aspects    Unknown
Privacy     6 aspects   6 aspects    Unknown
Fraud       6 aspects   6 aspects    Unknown

2.1 Deduplication

Story 1. The State of Oklahoma maintains a children's healthcare database, which includes records contributed by physicians at multiple hospitals. The state uses it to generate, or to try to generate, official state-level statistics. But the database cannot really generate correct statistics. For example, the same child may have multiple records, being known as “Baby A” in one and “Baby B” in another, or appearing under one last name and then under another, and so forth. So there is a database management problem here of a familiar kind, namely de-duplication, but it is complicated by the privacy considerations that are associated with healthcare records. In general, records that are collected for administrative purposes are incomplete and biased by the desire of medical staff to achieve compliance with minimal effort. Quality can only be gained if the information provides feedback to the providers so that they can monitor and improve quality.

2.2 Data control at Data Entry Time

Story 2. A patient visited his doctor one day and was told, to his surprise, “I see you've gained 50 pounds since your last visit”. The patient, who knew that at 50 pounds less he would be nothing but skin and bones, was astonished to hear this.
The problem is that the data had been entered incorrectly a year ago, with an “8” converted to a “3”. The underlying problem is that, because data entry is done by people different from those who are seeing the patient, there is no common sense check to prevent the entry of absolutely absurd information into the record. The corresponding research question is whether the systems can have built-in sanity checks, which raise a flag and question the data entry clerk, or the data entry physician, if the information being provided is too far out of line with reasonable expectations. In the case of healthcare, this is complicated by the fact that data which are out of line may be extremely salient. And yet a data clerk, embarrassed to be challenged by the system, might simply adjust the number until the system does not complain, with consequent harm to the patient.

In modern settings, if data are collected and entered while the patient is present, a screen should be available where the patient can inspect what is being entered and can prevent errors from being entered into the system in the first place. If that is not possible, then the patient should receive a record by email or on paper as soon as possible. Here again, privacy, if desired, must be dealt with. It should be the patient's decision whether the risk of data exposure is of greater concern to that patient than having errors in the record. The decision should not be made by some authority who fears legal consequences.

2.3 Retrospective Conversion: Humans and Computers

Story 3. This is in a way similar to Story 2, but it addresses the fact that getting from the present paper system to the electronic system requires two steps. First, electronic entry has to be available to every practitioner everywhere. Second, a reasonable amount of retrospective data conversion may be needed.
This is going to require some kind of new workforce, which will exist primarily for the purposes of the transition to electronic health records. Most retrospective data entry may only be done just prior to a scheduled appointment at a health care facility. Hospital records older than one year are best abstracted. Such an effort may temporarily double the number of people who are transcribing, and will not be done by the same people who currently work to convert physicians' notes into text. These new temporary workers will have to be intelligent and trainable, and they will need considerable patience in reading a variety of difficult handwriting, often in a cryptic shorthand. There are a great many people of high intelligence who used to work in the financial world and are currently seeking gainful employment, and a major WPA-style project which put them to work on this might benefit both them and the nation. There is also a computer science research question, which is to develop programs that can automatically match billing ICD-9 codes to the terms that are being entered, to provide some level of validation and reality check on the retrospective conversion.

2.4 Anonymous Posting to Seek Consultation; Authorization

Story 4. A patient's primary care physician is puzzled and realizes that some kind of consult is needed, but is not even sure who would be the most appropriate specialist. Such requests complement searches of the medical literature, which is less specific. We can imagine that, with electronic records, it would be possible to post a somewhat anonymized sketch of the presentation, and obtain comments and guidance on reaching a second opinion. The questions here are policy questions: how much should the systems disclose? How much is too much? When there is such an online forum, an indirect inference attack could succeed through attribute aggregation and correlation among related postings.
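The inference risk from aggregating attributes across related postings can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny `POPULATION` table stands in for whatever background knowledge an attacker might assemble, and the function names and the threshold `k` are hypothetical, not part of any existing system. The idea is simply that each posting may look harmless alone, while the union of attributes across linked postings narrows the set of matching people.

```python
from typing import Dict, List

Attributes = Dict[str, str]

# Hypothetical background knowledge an attacker might hold; every value
# here is invented for illustration.
POPULATION: List[Attributes] = [
    {"id": "p1", "age_band": "40-49", "region": "north", "occupation": "teacher"},
    {"id": "p2", "age_band": "40-49", "region": "north", "occupation": "farmer"},
    {"id": "p3", "age_band": "40-49", "region": "south", "occupation": "teacher"},
    {"id": "p4", "age_band": "60-69", "region": "north", "occupation": "teacher"},
]


def merge_postings(postings: List[Attributes]) -> Attributes:
    """Combine the attributes disclosed across several linked postings."""
    merged: Attributes = {}
    for post in postings:
        merged.update(post)
    return merged


def candidates(population: List[Attributes], disclosed: Attributes) -> List[Attributes]:
    """Return the people consistent with every disclosed attribute."""
    return [
        person
        for person in population
        if all(person.get(key) == value for key, value in disclosed.items())
    ]


def safe_to_post(
    population: List[Attributes],
    earlier: List[Attributes],
    new_post: Attributes,
    k: int = 2,
) -> bool:
    """Allow a new posting only if at least k people still match the
    combined disclosure of this posting and all earlier related ones."""
    merged = merge_postings(earlier + [new_post])
    return len(candidates(population, merged)) >= k


# Three postings about the same case, each disclosing one attribute.
first = {"age_band": "40-49"}
second = {"region": "north"}
third = {"occupation": "teacher"}
```

In this toy population each posting alone matches three people, the first two together match two, and all three together single out one person, so `safe_to_post(POPULATION, [first, second], third)` returns `False`. A real forum would not know the attacker's background knowledge, which is part of what makes the policy question hard.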
If something like this is going to be done, should the patient have “control” of this process? The natural inclination is to suggest that the default answer would be “yes”, and that the patient's judgment could be overridden by the physician only for particular kinds of delicate issues. Apart from the question of an attack, the research issues here seem to be primarily economic and social.

2.5 Purpose Driven Access Control

Story 5. For research purposes, it would be very useful if a provider could multicast data-driven requests to various federated partners. As a result, patients' records would be aggregated and then possibly used by researchers. Here, of course, there is a great potential privacy threat. We must understand how to accommodate patients' concerns during the data gathering phase. In particular, since we cannot know in advance which aspects of the data will be valuable to a subsequent research effort, it is impossible to know a priori what kind of release should be requested from the patients. The related complex questions here might be called “privacy aware patient record integration”, as well as the more familiar “patient record set anonymization”. We understand that what is needed is to bring the privacy risk to an acceptable level, as protection can never be perfect. The corresponding research issue could be called purpose driven access control (PDAC). We recognize also that the government, which is likely to have a central role in the electronic health record revolution, may have a very different set of purposes from researchers. And, as noted above, Americans are, at least officially, very unwilling to grant the government access to highly personal data. The research questions implied here seem to be essentially policy requirements, together with the technical problems of ensuring that permission is indeed given by the people who have been authorized to give it.

2.6 Regional Health Integration

Story 6.
Regional Health Information Organizations (RHIOs) are being promoted by federal and state governments to enable providers and practitioners to share patient records. There are many kinds of threats to privacy here, including threats to query content privacy, data location privacy, and patient location privacy. There are also problems of assuring the trustworthiness of the source, of the inquirer, and of the transmission method, to a level that will withstand a legal challenge if the information has been trusted, in good faith, by the recipient. So the technical question is how to construct privacy-preserving RHIO systems with adequate content to serve actual patient care needs. An initial focus on a high-demand population, such as the elderly, may be appropriate and effective, since that population makes intensive use of local facilities.

2.7 Provider or Payer Fraud; Over-Testing

Story 7. It is not unheard of for a doctor to double bill multiple insurance companies. Some occurrences are due to poor recordkeeping, while other instances are motivated by a combination of laziness and income enhancement. In some cases of misfeasance, a provider may amass a variety of unneeded information simply in order to maximize billing. The technical problem is appropriate fraud detection. Such a system must be immune to various kinds of collusion attack. Another way to think of it is as healthcare information system auditing.

There is a related kind of misfeasance which is not quite fraud in the legal sense, but which is a burden on the system. This includes charging for work, such as tests, that was actually done at a related site, or that could have been avoided if the available information were trusted and used. To reduce the cost of health care, physicians have to accept the results of tests done and documented previously by others, and avoid retesting, which is a major cost contributor. That can only happen if the documents can be trusted and are secure.
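As a starting point for the fraud detection problem raised in Story 7, the double-billing case can be sketched as a simple grouping audit. This is a minimal sketch under stated assumptions: the `Claim` fields and the function name are hypothetical, claims from different payers are assumed to be visible in one place, and a flagged group is only a candidate for review, since billing a primary and a secondary insurer for the same service can be legitimate. It does nothing against the collusion attacks mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    """A hypothetical, simplified billing claim."""
    claim_id: str
    provider: str
    patient: str
    service_date: str    # ISO date string, e.g. "2009-03-14"
    procedure_code: str
    payer: str


ServiceKey = Tuple[str, str, str, str]


def flag_multi_payer_claims(claims: List[Claim]) -> Dict[ServiceKey, List[Claim]]:
    """Group claims by (provider, patient, date, procedure) and return the
    groups in which the same service was billed to more than one payer."""
    groups: Dict[ServiceKey, List[Claim]] = defaultdict(list)
    for claim in claims:
        key = (claim.provider, claim.patient, claim.service_date, claim.procedure_code)
        groups[key].append(claim)
    return {
        key: group
        for key, group in groups.items()
        if len({claim.payer for claim in group}) > 1
    }


# Invented sample data: the first two claims bill two payers for one service.
sample_claims = [
    Claim("c1", "dr_a", "pt_1", "2009-03-14", "X100", "insurer_1"),
    Claim("c2", "dr_a", "pt_1", "2009-03-14", "X100", "insurer_2"),
    Claim("c3", "dr_a", "pt_2", "2009-03-14", "X100", "insurer_1"),
    Claim("c4", "dr_b", "pt_1", "2009-03-15", "X200", "insurer_1"),
]
flagged = flag_multi_payer_claims(sample_claims)
```

With the sample data, only the service billed as `c1` and `c2` is flagged. The design choice worth noting is that the audit needs claims from several payers in one view, which is itself a data sharing and privacy question of the kind this report raises.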
There is also the aspect of fear of legal assault when errors are made, but that will have to be dealt with by others. Generally, of course, giving lawyers fewer excuses can only help.

2.8 Mission Critical Medical Devices; Attack by Remote Control

Story 8. Some medical devices are mission critical in the very specific sense that people's lives depend on them: if there are errors in their programs, they could kill people. There have been documented examples where malfunction of a computer program resulted in a harmful dose of radiation during the course of what was supposed to have been beneficial radiation therapy. The corresponding research question is: as we move towards the possibility of remote healthcare, could a criminal or an attacker exploit deficiencies in the security design of the system to misuse the remote control channel and trigger harmful actions that would affect patients?

2.9 Data Tampering

Story 9. There are various reasons to want to tamper with data. One is the case in which a provider has misread or misinterpreted the data that were available, resulting in a diagnosis and treatment that were ineffective. Faced with the potential for some legal action, the provider might want to tamper with the data in order to cover up the error. The related research issue is to develop tamper-proof ways of storing data (such as compound checksum methods) and to ensure that software is “tamper proof” and cannot be used to modify existing data. In general, every item of data should be time stamped, not only at the source but also when it enters an accessible record system. Such records should never be overwritten, but only appended to.

2.10 Removing image-based Inferences to Personal Identity

Story 10. There are some kinds of very high tech imaging methods, particularly MRI imaging, from which it is considered possible to reconstruct a reasonably good image of the face of the patient.
Thus, if data of this type are shared without having been in some sense blurred, or made non-identifiable, the sharing represents a breach of patient privacy. At a more mundane level, most X-ray images contain a plate in the corner that has the patient's name in radio-opaque ink, to ensure a match of the image to the patient. Technology for removing this type of text-bearing sub-image from an image is available and could probably be built into image distribution software. [Wang, James Z., Gio Wiederhold, and Jia Li: "Wavelet-based Progressive Transmission and Security Filtering for Medical Image Distribution"; in Stephen Wong (ed.): Medical Image Databases; Kluwer publishers, 1998, pages 303-324.]

2.11 Man in the Middle Attacks on Remote Data

Story 11. For a number of reasons we might want streams of health care data collected in one region to be delivered in real time to another region. In anticipated models of remote health care, monitors send a stream of data to a remote doctor. Monitoring of composite data from a stream of geographically collocated sources could provide early warning of either naturally occurring epidemics or bio-terror attacks. As these data flow, correlation attacks could support inference as to a sensitive medical condition. For various kinds of data, time is a critical parameter, and the data, although privacy protected, should be able to support time series analysis. Such analysis is clearly more powerful if it is done “within patients” rather than “between patients”.

2.12 Pre Approval of Data Release

Story 12. A patient, sitting and talking with Doctor Bob at hospital A, would like to get her information from hospital B. However, the process at hospital B is very complicated. In the first instance, they have to confer with the physician who was treating the patient and who created the record.
If that physician is unavailable, which is likely to be the case for three-quarters of the hours in the week, the question will then be transferred to the legal office. It is much easier for that office to say “no” than to examine the reasons for possibly saying “yes”. This situation calls for some kind of complex conditional delegation or agency model, in which the patient and the physician acting together would specify a range of conditions under which data would automatically be released, upon request from the patient or from a recognized practitioner. Although, as a matter of law, the records belong to the patient, as a practical matter possession is nine-tenths of the law, and the records are held by the hospital. Typically a request to a hospital for the records of a patient's stay is interpreted as a prelude to a legal assault. The result is that hundreds of pages of poorly copied material are delivered at the very last minute permitted by law.

3 Adequate Software Infrastructure

Healthcare technology has become greatly dependent on complex software, and medical applications rely on the general infrastructure that is available. That infrastructure is not of a quality that instills confidence that security and privacy can be maintained. Straightforward technology improvements, if adopted, could greatly enhance trust in the infrastructure. For instance, a simple analysis showed that currently some 48% of all software attacks by intruding hackers involve buffer overflow. This highlights a need for developers of healthcare systems to demand better compiler technology. Secure systems must indeed assume that the underlying software is not perfect and can be penetrated. But if the software is as rotten as it seems to be, that effort becomes excessive. There is no reason, for instance, for people to accept software that ever has buffer overflows.
Dan Swinehart pointed out that Xerox PARC's software was free of those errors, and so was the PL/1 compiler developed at the Stanford Medical School in 1966 [see http://infolab.stanford.edu/pub/gio/1960s/ACMEcompiler1968IFIP.pdf ]. The modest performance hits are well compensated by the reduction of security costs and worries. Developers in the security area should insist on purchasing only software that includes such protection. If there are discriminating customers, the compiler folk will follow the money. Maybe even educators will change their tone. Today the prime metric is the speed of code, and the code is naively assumed to be used perfectly.
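The kind of protection discussed in this section amounts to checking every write against the bounds of its destination. The following is a minimal sketch of such a check written by hand in Python; the function name `checked_copy` is hypothetical, and the sketch only illustrates the check that a bounds-checking compiler or runtime would insert automatically. It is not a description of the Xerox PARC or Stanford systems mentioned above.

```python
def checked_copy(dest: bytearray, src: bytes, offset: int = 0) -> None:
    """Copy src into dest at offset, refusing any write that would fall
    outside the fixed-size destination buffer."""
    if offset < 0 or offset + len(src) > len(dest):
        raise IndexError(
            f"write of {len(src)} bytes at offset {offset} "
            f"exceeds buffer of {len(dest)} bytes"
        )
    # The slice and src have the same length here, so the buffer is
    # overwritten in place and its size does not change.
    dest[offset:offset + len(src)] = src


buffer = bytearray(8)               # fixed-size buffer of 8 zero bytes
checked_copy(buffer, b"abcd", 2)    # fits: occupies bytes 2..5

overflow_rejected = False
try:
    checked_copy(buffer, b"too long for this", 0)
except IndexError:
    overflow_rejected = True        # oversized input is refused, not written
```

The cost of the check is one comparison per write, which is the sort of modest performance hit referred to above; in exchange, the oversized input leaves the buffer and its neighbors untouched.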