
A semantic based methodology
to classify and protect sensitive
data in medical records
Flora Amato, Valentina Casola,
Antonino Mazzeo, Sara Romano
Dipartimento di Informatica e Sistemistica
Universita’ degli Studi di Napoli, Federico II
Naples, Italy
• Introduction to challenges in e-health;
• Motivation and open challenges;
• Proposal of access control policies;
• Methodology to extract the relevant
information to protect and apply the proper
security policy;
• A Case study;
• Conclusion and future works
The Electronic Health
• E-Health challenges:
– To provide value-added services to the healthcare actors
(patients, doctors, etc...);
– To enhance the efficiency and reduce the costs of
complex information systems.
• The term e-health covers many meanings; we focus on
those aspects of telemedicine that involve not only
technological aspects but also procedural ones;
• In particular, we are witnessing a gradual adoption of
innovative IT solutions for e-health but, at present, the major
open issue is the coexistence of two different domains:
The coexistence of old and new
systems from a security point of view…
1) Modern eHealth systems are designed to
enforce fine-grained access control policies, and
the medical records are well structured a priori
to properly manage the different fields, but…
2) eHealth is also applied in contexts where
new information systems have not been
developed yet but “documental systems” have, in
some way, been introduced. This means that today
documental systems give users the possibility
to access a digitalized version of a medical
record whose critical parts have not been
classified beforehand.
Unstructured Medical record data
and actors
Actors are not aware that structuring
data is important for data processing
and protection.
• Security Problem
• private data (critical part) can
be accessed by unauthorized actors
• It is not possible to enforce
fine-grained access control on
digitalized unstructured data
• Solution
• extract relevant information
from the records,
• enforce access control policies
Motivation and our proposal
• The problem: “Documental systems” allow
access to the digitalized version of a medical record
(unstructured data) whose critical parts have not been
classified beforehand.
• We propose a semantic-based method to locate
the resource being accessed and associate the
proper security rule to apply.
• The access control model is still based on fine-grained data classification.
Semantic method for resource
• Knowledge extraction by means of several text analysis
• Running example:
Step 1 - Text Preprocessing:
Tokenization and Normalization
• Goal:
– extraction of relevant units of lexical elements
• Text tokenization:
– segmentation of a sentence into minimal units of analysis (tokens):
• (i) disambiguation of punctuation marks, aiming at token separation;
• (ii) separation of continuous strings (i.e. strings that are not
separated by blank spaces) to be considered as independent tokens: for example, the Italian string “c’era” contains two
independent tokens (c’ + era).
– This segmentation can be performed by means of special tools,
called tokenizers, which include glossaries of well-known
expressions to be regarded as medical domain tokens and mini-grammars containing heuristic rules that regulate token combinations.
• Text normalization:
– variations of the same lexical expression should be represented in a
unique way:
• (i) words that take on a different meaning depending on whether they are written in lower-case or capital letters
• (ii) acronyms and abbreviations (“USA” or “U.S.A.”)
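The preprocessing step above can be illustrated with a minimal sketch. It is not the tokenizer used by the authors: the glossary, the acronym table, the case-sensitive word list and the regular expression are all assumptions made for illustration, standing in for the glossaries and mini-grammars mentioned on the slide.

```python
import re

# Hypothetical glossary of multi-word medical expressions kept as single tokens.
GLOSSARY = ["blood pressure", "heart rate"]

# Hypothetical table mapping acronym variants to one canonical spelling.
ACRONYMS = {"u.s.a.": "USA", "usa": "USA"}

# Hypothetical list of words whose meaning depends on capitalization.
CASE_SENSITIVE = {"AIDS"}


def tokenize(text):
    """Split text into tokens, keeping glossary expressions and acronyms whole."""
    # Longest glossary entries first, so longer expressions win over shorter ones.
    glossary = "|".join(re.escape(g) for g in sorted(GLOSSARY, key=len, reverse=True))
    pattern = re.compile(
        rf"(?:{glossary})\b"        # glossary expressions
        r"|(?:[A-Za-z]\.){2,}"      # dotted acronyms such as U.S.A.
        r"|\w+['’]"                 # elided forms such as c' in "c'era"
        r"|\w+"                     # ordinary words and numbers
        r"|[^\w\s]",                # any remaining punctuation mark
        re.IGNORECASE,
    )
    return pattern.findall(text)


def normalize(token):
    """Map variants of the same expression to one canonical form."""
    if token in CASE_SENSITIVE:
        return token  # capitalization carries meaning, keep it
    key = token.lower()
    # Acronyms get a fixed canonical spelling; everything else is lower-cased.
    return ACRONYMS.get(key, key)
```

For example, `tokenize("c'era")` separates the elided form from the verb, and `normalize` maps both “USA” and “U.S.A.” to the same string. A real tokenizer would need richer heuristic rules than this single regular expression.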
Step 2 - Morpho-syntactic analysis:
POS tagging and Lemmatization
• Goal:
– extraction of word categories.
• Part-of-speech (POS) tagging
– assignment of a grammatical
category (noun, verb, etc.) to
each lexical unit.
– word-category disambiguation:
the vocabulary of the
documents of interest is
compared with an external
lexical resource
• Key-Word In Context (KWIC)
• Lemmatization:
– Reducing the inflected forms to
the respective lemma
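A toy sketch of this step, assuming a tiny hand-written lexicon in place of the external lexical resource: each token is looked up to obtain its grammatical category and lemma, and a Key-Word In Context listing shows the neighbors an analyst (or a disambiguation rule) could inspect. The lexicon entries and function names are hypothetical, and a plain lookup does not by itself resolve ambiguous word categories.

```python
# Hypothetical external lexical resource: word form -> (POS tag, lemma).
LEXICON = {
    "patient": ("NOUN", "patient"),
    "patients": ("NOUN", "patient"),
    "shows": ("VERB", "show"),
    "showed": ("VERB", "show"),
    "fever": ("NOUN", "fever"),
    "the": ("DET", "the"),
}


def tag_and_lemmatize(tokens):
    """Return (token, POS, lemma) triples; unknown words get the tag 'UNK'."""
    result = []
    for token in tokens:
        pos, lemma = LEXICON.get(token.lower(), ("UNK", token.lower()))
        result.append((token, pos, lemma))
    return result


def kwic(tokens, keyword, window=2):
    """Key-Word In Context: list each keyword occurrence with its neighbors."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines
```

With the tokens `["The", "patient", "showed", "fever"]`, the inflected form “showed” is reduced to the lemma “show”, and `kwic(tokens, "showed", window=1)` returns its left and right neighbors.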
Step 3 - Relevant Terms
• Goal:
– identification of terms useful to characterize the
sections of interest.
• TF-IDF (Term Frequency - Inverse Document
Frequency): relevant lexical items are frequent
and concentrated in few documents.
$W_{t,d} = f_{t,d} \cdot \log(N / D_t)$
– term frequency (tf), $f_{t,d}$: the number of times a given term $t$
occurs in the resource $d$;
– inverse document frequency (idf), $\log(N / D_t)$: it concerns the term distribution within
all the sections of the medical records, where $N$ is the number of documents in the
corpus and $D_t$ the number of documents where the term $t$ occurs. It relies on the principle that term
importance is inversely proportional to $D_t$.
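The weighting formula can be sketched directly in Python. This is an illustrative implementation rather than the authors' code: it assumes each section has already been tokenized and lemmatized, and it uses the natural logarithm, since the slide does not state the base.

```python
import math
from collections import Counter


def tf_idf(sections):
    """Compute W[t, d] = f[t, d] * log(N / D[t]) for tokenized sections.

    sections: list of token lists, one per section (document).
    Returns one dict per section mapping term -> weight.
    """
    n_sections = len(sections)
    # D[t]: number of sections in which term t occurs at least once.
    doc_freq = Counter()
    for tokens in sections:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in sections:
        term_freq = Counter(tokens)  # f[t, d]: raw count of t in section d
        weights.append({
            term: count * math.log(n_sections / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

A term that occurs in every section gets weight 0 because log(N/N) = 0, while a term repeated within a single section gets a high weight. That matches the idea that relevant items are frequent and concentrated in few documents.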
Step 4 - Identification of Concepts
of Interest
• Goal:
– Cluster relevant terms
into synsets (sets of semantically
equivalent terms) in order
to associate the semantic
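As a rough illustration of grouping terms into synsets, the sketch below assumes a small hand-built synset table; the concept labels and member terms are hypothetical and are not taken from the case study. Relevant terms are grouped under the concept whose synset contains them.

```python
# Hypothetical synsets: concept label -> set of semantically equivalent lemmas.
SYNSETS = {
    "diagnosis": {"diagnosis", "diagnostic", "finding"},
    "therapy": {"therapy", "treatment", "cure"},
}


def group_by_concept(terms):
    """Group terms under the concept whose synset contains them."""
    # Invert the synset table once: lemma -> concept label.
    concept_of = {
        lemma: concept
        for concept, lemmas in SYNSETS.items()
        for lemma in lemmas
    }
    groups = {}
    for term in terms:
        concept = concept_of.get(term)
        if concept is not None:  # terms outside every synset are ignored
            groups.setdefault(concept, []).append(term)
    return groups
```

In practice the synsets would come from a lexical or domain resource rather than a literal dictionary.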
Security Policies
• At the end of the semantic analysis process, a
medical record can be seen as composed of
several sections (resources) that can be
properly protected;
• A security policy is a set of rules structured as triples
$(s_j, a_i, r_k)$
– $s_j \in S = \{s_1, \dots, s_m\}$, the set of actors;
– $a_i \in A = \{a_1, \dots, a_h\}$, the set of actions;
– $r_k \in R = \{r_1, \dots, r_h\}$, the set of resources;
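One possible way to represent such a policy is as a set of (actor, action, resource) triples. The actor, action and resource names below are hypothetical, and the default-deny behavior of `is_allowed` is an assumption of this sketch rather than something stated on the slide.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    subject: str   # s_j, an actor in S
    action: str    # a_i, an action in A
    resource: str  # r_k, a resource in R


# Hypothetical policy; names are illustrative, not from the case study.
POLICY = {
    Rule("doctor", "read", "diagnosis"),
    Rule("doctor", "write", "diagnosis"),
    Rule("nurse", "read", "therapy"),
}


def is_allowed(policy, subject, action, resource):
    """Grant a request only if a matching rule exists (assumed default deny)."""
    return Rule(subject, action, resource) in policy
```

Here `is_allowed(POLICY, "doctor", "read", "diagnosis")` holds, while a nurse writing to the therapy section is rejected because no rule covers it.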
Medical Record Policy (Use Case)
Action-actors identification
Given the policy and a resource
$r^* \in R$, it is easy to locate the set of all
rules allowed on it:
$L_{r^*} = \{(s_j, a_i, r^*) \mid r^* \in R,\ a_i \in A^* \subseteq A,\ s_j \in S^* \subseteq S\}$
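The selection of the rules that apply to a given resource can be sketched as a simple filter. The policy content and the function names are hypothetical; the sketch assumes the policy is held as a set of (actor, action, resource) triples.

```python
# Hypothetical policy as a set of (actor, action, resource) triples.
POLICY = {
    ("doctor", "read", "diagnosis"),
    ("doctor", "write", "diagnosis"),
    ("nurse", "read", "therapy"),
    ("patient", "read", "diagnosis"),
}


def rules_for_resource(policy, resource):
    """Return the set of rules in the policy that target `resource`."""
    return {rule for rule in policy if rule[2] == resource}


def actors_and_actions(policy, resource):
    """Return (S*, A*): the actors and actions appearing in those rules."""
    selected = rules_for_resource(policy, resource)
    return {s for s, _, _ in selected}, {a for _, a, _ in selected}
```

For the resource "diagnosis", the filter keeps three rules, with actors {doctor, patient} and actions {read, write}. A resource that no rule mentions yields an empty set.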
System behavior: an example
Conclusions and Future works
• We have proposed a semantic approach for classifying document
parts (resources) from a security point of view;
• It is useful to associate a set of security rules on the
• It is a promising method that can strongly help in facing
security issues that arise once data are made available
for new potential applications.
• Future works:
– To prove the methodology in other e-government
– To implement a system to on-line extract/classify and
enforce fine-grained policies with acceptable