TECHNICAL REPORT no. 09262011 , J.A. O’Donnell

advertisement
Page 1
TECHNICAL REPORT no. 09262011
Preprocessing messages posted by dentists to an Internet mailing list:
a report of methods developed for a study of clinical content
M. Kreinacke1, T. Bekhuis2, *, H. Spallek3, M. Song3, J.A. O’Donnell4
1
Institute of Business Taxation, Leibniz University of Hanover, Germany
2
Department of Biomedical Informatics, School of Medicine,
University of Pittsburgh, Pennsylvania, US
3
Center for Dental Informatics, Department of Dental Public Health, School of Dental Medicine,
University of Pittsburgh, Pennsylvania, US
4
Office of Education and Curriculum, School of Dental Medicine,
University of Pittsburgh, Pennsylvania, US
*
Corresponding author:
Tanja Bekhuis, PhD, MS, MLIS
Department of Biomedical Informatics, School of Medicine, University of Pittsburgh
5607 Baum Boulevard, Room 514, Pittsburgh, PA, US 15206
Telephone: +1 412-648-9324
E-mail: tcb24@pitt.edu
Preprocessing messages posted by dentists
Page 2
Summary
Objectives: Mining social media artifacts requires substantial processing before content analyses.
In this report, we describe our procedures for preprocessing 14,576 e-mail messages sent to a
mailing list of several hundred dental professionals. Our goal was to transform the messages into
a format useful for natural language processing (NLP) to enable subsequent discovery of clinical
topics expressed in the corpus.
Methods: Preprocessing involved message capture, database creation and import, extraction of
multipurpose Internet mail extensions, decoding of encoded text, de-identification, and cleaning.
We also developed a Web-based tool to identify signals for noisy strings and sections, and to
verify the effectiveness of customized noise filters. We tailored our cleaning strategies to delete
text and images that would impede NLP and in-depth content analyses. Before applying the full
set of filters to each message, we determined an effective filter order.
Results: Preprocessing messages improved effectiveness of NLP by 38%. Sources of noise
included personal information in the salutation, the farewell, and the signature block; names and
places mentioned in the body of the text; threads with quoted text; advertisements; embedded or
attached images; spam- and virus-scanning notifications; auto text parts; e-mail addresses; and
Web links. We identified 53 patterns of noise and delivered a set of de-identified and cleaned
messages to the NLP analyst.
Conclusion: Preprocessing electronic messages can markedly improve subsequent NLP to enable
discovery of clinical topics.
Keywords: Electronic mail; data processing; natural language processing; dental informatics
Preprocessing messages posted by dentists
Page 3
1. Introduction
Translation of the best clinical evidence for dentists at the point of care is a major challenge
because we know little about what it is they want or need to know (1). To answer this question,
social media artifacts, such as text-based messages, images, or videos, are potentially rich data
sources due to their currency and presumed content validity. For example, mining exchanges
among peers in a virtual community of practice (2, 3) is likely to be more fruitful than fielding a
survey, especially when the response rate is typically low and items are in a forced choice
format. Nevertheless, survey methodology is well developed, whereas mining social media is an
emergent field (4–6). Thus, procedural questions arise.
A first step in developing useful procedures is to consider the entire workflow in terms of
research informatics. In other words, consider which procedures need to be carried out to
complete various tasks and in what order to ensure the feasibility of a study.
For researchers interested in mining social media artifacts, one of the first tasks is to identify a
resource with artifacts worth analyzing. For our team, an overriding goal is to identify clinical
topics relevant to practicing dentists. We therefore elected to mine e-mail messages posted to a
private Internet discussion list of dental professionals, primarily dentists from North America.
This particular list has endured for several years, has a relatively stable number of subscribers,
and has a clinical focus.
To ensure the feasibility of labor-intensive content analyses, we devised a workflow to enable
our planned research. The larger project is described in (7, 8).
In this report, we describe how we preprocessed electronic messages. Substantial preprocessing
is essential before meaningful natural language processing (NLP) can be carried out. NLP, in
Preprocessing messages posted by dentists
Page 4
turn, is a precursor to subsequent qualitative content analyses. Thus, an initial task involves
transforming messages into a format useful for NLP to enable eventual discovery of potentially
relevant clinical topics (see Fig. 1).
Fig. 1. Workflow for preprocessing electronic messages
Additionally, we offer this report as a kind of primer on how to process e-mail, beginning with
very basic notions regarding message capture and format, and ending with procedures for
identifying noise and developing filters. Our intention is to encourage researchers to explore
electronic mail as an interesting data source.
2. Methods
Preprocessing messages posted by dentists
Page 5
2.1 Message capture and format
Collecting messages from a social media-based member exchange, such as a subscription
mailing list, requires permission from the owner or the administrator of the exchange, and in
some cases, community leaders. Automated screen scraping may be possible without permission,
but this raises ethical questions not addressed here (9). Additionally, investigators need to obtain
Institutional Review Board approval. For this study, we obtained approval from both the
administrator (the founder of the mailing list) and the University of Pittsburgh Institutional
Review Board (IRB PRO08040313). Once permission is granted and IRB approval attained,
collecting messages is easily achieved by subscribing to the list during the study interval. At the
end of the study interval, messages can be saved from the mail account.
Internet mail is stored in a defined file format as described by RFC 822 (10). Each message
consists of a header with meta-information about the message followed by a body, which is
usually textual although images are sometimes embedded or attached.
The header consists of the subject line, a date and time stamp, sender and recipient information,
as well as technical metadata usually not displayed to the user, such as a list of mail servers
involved in delivery. Each header line is a key value pair that typically looks like “Subject: <the
message topic>”. A single blank line separates the header from the body of the message. When
texts are written in English, they can be stored in ASCII format as is.
However, texts written in languages with accented characters usually require some form of
encoding. This is because e-mail is historically limited to 7-bit character sets. We briefly
describe the implications for decoding messages in section 2.4.
2.2 Database
Preprocessing messages posted by dentists
Page 6
Messages captured over time must be imported into a database. We used a MySQL database
server for data storage and developed all tools in the PHP programming language on a Linux
server. However, our procedures are generalizable to other environments.
Most modern mail server systems store each message in a separate file, usually in the Maildir
format (11). Thus, the import process must iterate over all files in a folder and process each one
as a message. The metadata is read first from the message header in a line-by-line manner until a
blank line appears. The rest of the file is the body part and can be read in one great chunk. The
metadata of interest can now be stored along with the body part in the database.
2.3 Computing sender statistics
We computed sender statistics by querying the metadata for the messages stored in the database
prior to de-identification. We also processed sender names and addresses to ensure that authors
could be identified even with changing e-mail addresses and variant spellings.
To identify senders, one must analyze the “From” field in the header, which contains the
sender’s e-mail address and often a name. The standard format is “First Last
<user@example.com>”; variants include “<user@example.com> (First Last)”,
“<user@example.com> First Last” or just the e-mail address without a name. Thus, the first step
involves matching names and e-mail addresses, and then saving these in separate database
columns.
Senders sometimes change the spelling of their names. Thus, it would seem straightforward to
count users based on their e-mail addresses rather than their names. However, users sometimes
change their e-mail addresses or use two addresses, e.g., one for the office and one for home.
Counting e-mail addresses therefore inflates the sender count. We corrected this by counting two
Preprocessing messages posted by dentists
Page 7
or more e-mail addresses as one if they were used along with the same sender name. This task
required trivial manual intervention. However, it could be carried out with SQL statements
within the database.
We created two intermediate database views with queries and subqueries that could be used to
identify senders with multiple addresses and/or multiple variants of their names. The result was a
translation table with three columns: the sender’s name, a list of addresses, and the ‘main’
address. The first two columns were populated with data from the database views of senders
having more than one address. We then assigned a main address. The last step was another
database view with two columns: one for unique addresses and one for the number of messages
sent from an address. This view applied the translation table so that each sender was listed only
once with his or her main address. We exported the data from this view into an Excel sheet to
compute sender statistics.
2.4 Decoding the message body
In the simplest case, the message body contains text as is. If so, no decoding is required.
However, some form of translation is necessary in a number of other cases. For example:
a.
The message text uses a character set other than the American standard US ASCII. This
is so for languages with accented characters, non-Latin characters or special characters not
contained in the US ASCII character set. Most notable are Latin 1 (Western European languages,
ISO 8859 1), Code Page 1252 (legacy Windows systems), and UTF 8 (Unicode). Unicode
character sets contain characters from a large number of languages and in this way combine the
other mentioned sets. With Unicode, it is possible to have English, Chinese, and Arabic text in
the same document. The name of the character set used in a message is noted in its message
Preprocessing messages posted by dentists
Page 8
header. We found as many as 10 different character sets in the messages. About 48% were in
Latin 1, 46% in US-ASCII, and 4% in UTF-8. In order to make the entire set of messages
useable for NLP, conversion into one uniform character representation was necessary.
b.
If a message contains long lines of more than 1,000 characters or non-ASCII characters,
it can ‘break’ some mail systems because of historical limitations. In this case, a content transfer
encoding transforms the data into an acceptable form (12). Current implementations can handle
content with 8-bit characters and lately even multibyte representations, e.g., UTF 8. Another
purpose of transfer encoding is to embed images and binary attachments. Still, a transfer
encoding is sometimes applied to text parts. The result is usually longer in bytes, but guaranteed
to fit the 7-bit restriction. Usual encodings are Quoted Printable, Base64, and UUencode. Note
that the applied encoding is specified in the message header so that the receiving e-mail client
can handle the content. Note also that the message headers themselves are always in 7-bit USASCII format with no encoding except for an optional inline encoding for the subject.
c.
If a message comprises more than one part, there must be some way to distinguish among
them. Multiple parts can be alternative representations of the same content. For example, when
the writer formats his or her text, the formatted text body is sent in HTML or RTF format along
with a plain text body for recipients unable to handle the formatting. Furthermore, formatted text
may contain inline images that make up additional parts just as attachments that the sender adds
to the message. The Multipurpose Internet Mail Extensions (MIME) define a standard for
appending multiple MIME parts with additional meta-information on how to handle them (13).
Preprocessing messages posted by dentists
Page 9
To enable subsequent natural language processing and content analyses, we removed e-mail
encodings. We also disregarded any formatting, dropped attachments and inline images, and
converted all messages to a uniform character representation in the UTF 8 character set.
Thus, the task of decoding the message body consisted of the following steps (order matters):
1.
Split up the MIME parts if more than one exists; detect the plain text part and omit all
others. If only a formatted text part is included, convert to a plain text representation.
2.
Decode the content transfer encoding if one is applied. The result is human readable text.
3.
Convert the applied character set to UTF 8.
4.
Store the text in the database.
2.5 De-identification and text cleaning
Preliminary de-identification took place when we imported the data. This involved saving
message dates and bodies in the database, and disregarding sender and recipient names, as well
as e-mail addresses. Note that we imported the e-mail addresses into a database used to compute
sender statistics, but not into the database used for NLP.
At this stage, messages from the mailing list were technically useable for NLP. However,
remaining noise seriously impeded discovery of clinical topics in subsequent analyses. (By noise
we mean data irrelevant to the task of finding clinical topics in the corpus of e-mail messages.) A
major source of noise is personal information, such as data found in the salutation, the farewell,
and the signature block at the bottom of the message, as well as names and places inside the
body of the text. The latter can be removed easily using a wordlist and a gazetteer available in
Preprocessing messages posted by dentists
Page 10
the Natural Language Toolkit (14, 15). However, signature blocks are more challenging as they
appear in many forms and therefore require tailored algorithms for detection and removal.
Common to most forms of noise is that a human reader can effortlessly identify them based on
his or her experience and from the context, whereas an automated tool may be much less
successful. This is partly because no consistent markers identify the message parts. Although
automated tools have been developed primarily to de-identify electronic health records and
clinical reports (16-19), to our knowledge, none is good enough for the task of processing
messages to be used as input for natural language processing. To cope with the sheer message
volume, we therefore decided on a semi-automated approach.
In general, we applied customized text filters based on regular expressions to filter out noise.
Regular expressions are text patterns written in a formal language that can be interpreted by a
processor (20, 21). They provide a concise way to find text patterns of interest. To write regular
expressions, we searched for words that signal noisy sections.
We created a number of regular expressions based on our experience reading messages from the
discussion list, as well as on signals we identified. After finding and replacing noisy strings, we
further examined individual messages and then added regular expressions to remove remaining
noise.
The process for finding noise and developing filters consisted of five steps:
1. We created a Web-based tool to facilitate fast manual inspection of individual messages. The
user can display randomly selected message(s) or search for signaling words within message(s).
Data for this tool are the decoded messages in plain text.
Preprocessing messages posted by dentists
Page 11
2. Using our Web-based tool, we identified typical patterns of noise and then created appropriate
text filters based on regular expressions to identify noisy strings or sections.
For example, a reply to an earlier message often starts with quoted text preceded by a marker
automatically inserted by the mail client. A typical marker consists of one or two lines of text
resembling the following: “On <date>, <name> wrote: ”. At times, a line break follows the
comma. The marker sometimes starts with “At” instead of “On” and may carry a number of
hyphens at the start. Similar patterns have the form “In article <reference> <name> wrote: ” or
“<name> wrote in message <reference>: ”. Common to all such patterns is that they directly
precede quoted text.
One goal was to find similar patterns of noise in a large number of messages with a small set of
filters. Thus, the filters were written to match several patterns. For example, for patterns where
the subject starts with either “On” or “At” with a possible comma between the date and the
name, the expression “(On|At) <date>,? <name>” matches 4 different strings.
3. In order to verify the effect of our text filters, we further developed the Web tool to display
simultaneously filtered and non-filtered messages. We carried out three tasks with the enhanced
Web tool:
a.
Verify that noisy strings or sections are successfully identified and that cleaning is
effective.
b.
Verify that relevant text about a clinical topic is not accidentally removed if it resembles
a pattern of noise.
c.
Search for remaining patterns of noise.
Preprocessing messages posted by dentists
Page 12
4. After interacting with the enhanced Web tool, we repeated steps 2 and 3 until no further noise
was observed.
5. Lastly, we applied the complete set of filters to each message and saved the cleaned messages
in the database.
Many messages contain two or three lines of advertisement from free e-mail providers displayed
at the bottom and others contain confidentiality notifications. Similar forms of noise are inline
notifications from spam- and virus-scanning software, as well as auto text parts from Microsoft
Outlook, such as “The message is ready to be sent with the following file …”, and inline images
in the old UUencode format.
Signatures are often prefaced with a blank line followed by a line of only two hyphens. However,
this pattern may occur in the middle of a message text if the author typed two hyphens, e.g.,
when trying to automate insertion of an em dash. Moreover, many e-mail clients use different
markers for signatures or no markers at all. In the latter case, signatures can be detected
automatically by matching a typical pattern inside the signature block itself, assuming a pattern
exists.
We removed remaining noise consisting of e-mail addresses, Web links, and dates and times
mentioned in the messages.
Some patterns occur within other patterns. For example, e-mail addresses can appear in a text
body as well as within a signature block. For these cases, the order in which text filters are
applied matters. For example, it is advisable to detect signatures first because the appearance of
e-mail addresses and links helps to identify the signature block.
Preprocessing messages posted by dentists
Page 13
The process for text cleaning consisted of six steps, in this order:
1.
Remove UUencoded inline images.
2.
Remove quoted text and software generated auto texts.
3.
Remove signatures, advertisements, and virus scanner notifications.
4.
Remove e-mail addresses, Web links, and dates and times.
5.
Repeat Steps 3 and 4 until no further matches are found.
6.
Remove fragments of cascading style sheets.
Signatures, advertisements, and virus scanner notifications can appear in any order at the end of
the message, but only the last pattern is detected at a time. This is because location is part of the
pattern. The filters in step 3 for text cleaning must thus be applied iteratively. Repetition along
with step 4 helps when signature detection fails, as can happen when the sender adds an e-mail
address or a link below his or her signature. Additionally, removal of quoted text must occur
before removal of signatures because quoted text may appear at the bottom of the message.
In sum, after preprocessing, the resultant corpus consisted of de-identified, plain text messages
relatively free of noise. Most importantly, the messages were now useful for meaningful NLP.
3. Results
3.1 Senders
Preprocessing messages posted by dentists
Page 14
Approximately 450 dental professionals subscribed to the discussion list during the study interval
from April 2008 to May 2009. Of the subscribers, 245 (54%) posted messages.
Just 20 dental professionals (8% of those who posted messages; about 5% of all subscribers) sent
7,288 (50%) of the messages. A small percentage of the senders (1%) were responsible for about
10% of the total traffic; 36% of the senders sent 90% of the messages (see Fig. 2). This
distribution is typical of other virtual communities of practice or discussion lists (3, 22). The
mean number of messages sent was 59 (range: 1 to 699).
800
700
N messages sent
600
500
400
300
200
100
Subscriber (ordered by N messages sent)
Fig. 2. Messages sent by subscriber
3.2 Messages and patterns of noise
241
231
221
211
201
191
181
171
161
151
141
131
121
111
91
101
81
71
61
51
41
31
21
11
1
0
Preprocessing messages posted by dentists
Page 15
We retrieved 14,576 messages posted to the mailing list by dental professionals. We identified
53 patterns of noise and cleaned 12,674 (87%) messages. Thirteen percent (1902) of the
messages were devoid of noisy strings or segments. Patterns included: 1 pattern for UUencoded
inline images; 18 patterns for quoted text; 3 patterns for software generated auto text; 22 patterns
for signatures, advertisements, and virus scanner notifications; 8 patterns for e-mail addresses,
Web links, and dates and times; and 1 pattern for fragments of cascading style sheets.
Most of the messages (n=10,490; 72%) contained quoted text in threads that we removed.
3.3 Effectiveness
We informally assessed the effectiveness of preprocessing by comparing the number of clinically
relevant collocated trigrams (triples of text strings or ‘words’) derived from text processed with
preliminary vs final procedures.
During NLP, we selected the top 300 collocated trigrams by sorting according to pointwise
mutual information. When the input text had been processed using preliminary procedures, NLP
yielded 103 (34%) clinically relevant collocated trigrams. Whereas when the input text had been
processed using final procedures, NLP yielded 142 (47%) clinically relevant collocated trigrams.
This is a 38% improvement in effectiveness. For details regarding NLP in the larger study, see
(8).
4. Discussion
To our knowledge, tools to process electronic mail for research endeavors are not publicly
available. However, Li, Cao, and Tang from Microsoft were recently awarded a patent for a data
Preprocessing messages posted by dentists
Page 16
processing system that cleans electronic mail (Patent No. US 7,590,608 B2; see (23)). The
motivation for their system is that cleaned messages can be used for automatic routing, SPAM
filtering, summarization, information extraction, discovery of trends in news groups, and more
(cf. section 5 in (23)). Until such tools become available, in-house development is necessary.
4.1 What is noise?
In carrying out this project, we discerned that defining noise in social media artifacts is essential.
Consider, for example, that the purpose of de-identification of medical records or clinical reports
is typically to preserve confidentiality, and perhaps to facilitate data sharing. In contrast,
de-identification here serves another very important function, i.e., to remove data about the
senders that could obscure discovery of clinical topics during NLP and content analyses. Hence,
personal information is noise in this context.
To expunge noise adequately implies that one knows what kind of data will impede fruitful
analysis. For example, our preliminary cleaning efforts were rather rudimentary. Consequently,
early NLP analyses were disappointing. We therefore had to reconsider what kinds of strings,
segments, images, and attachments were in fact noise as we defined it. Once we had a better idea
of what we were looking for, cleaning improved dramatically.
4.2 Reducing potential bias
The majority of messages include replies to earlier ones. Most of these consist of a new message
along with response(s) that might accrue in a thread or chain of messages. If we had retained the
responses in the threads, the relative frequencies of words and phrases, and therefore the
empirical probabilities, would not accurately characterize the clinical topics. Rather, frequencies
Preprocessing messages posted by dentists
Page 17
based on threads probably would be more important for discourse analysis. As we are concerned
with discovering clinical topics of interest to dentists instead of analyzing the conversations per
se, we took care to delete threaded responses. Note that in longer threads, the new message can
make up less than 10% of the e-mail. In our opinion, it is likely that retaining the threads could
have severely biased our natural language processing of messages. Bogg (24) makes a similar
point in terms of removing information that could otherwise skew the data in subsequent text
mining.
4.3 Limitations
To the extent that patterns of noise identified in this report typify e-mail, our preprocessing may
generalize to studies where the unit of analysis is an e-mail message. However, to the extent that
some patterns are unique, generalizability may be limited. For example, signature blocks are
particularly challenging, as some of them do not follow a clear pattern. We therefore had to add
customized signature filters for a small group of senders.
Detecting responses to threaded messages can also be problematic. Typically, each line of an
earlier message is preceded with “>”, but this is not the case when the sender uses the forward
function instead of the reply function. In some cases, a sender uses the forward function and then
replies within the thread. To mark his reply, he may use idiosyncratic formatting, such as bold
text, a different color, or even “>”. For these scenarios, pattern-matching algorithms cannot
distinguish a new message from earlier messages in the thread.
Other problems occur when quoted text is damaged and its marker is missing. For example:
> Here is quoted text that is wrapped,
Preprocessing messages posted by dentists
Page 18
but the marker is not repeated.
In this case, automated pattern matching will miss the second line.
4.4 Future research
In the future, we will adopt a more structured approach for finding patterns of noise and
evaluating accuracy (e.g., see (24)). Additionally, dentists interested in best clinical care
sometimes include excerpts from scientific articles in their messages. As these are not easily
identified with rule-based pattern matching, we will consider using statistical approaches to
identify such excerpts. The reason for considering article excerpts as potential sources of noise is
that the distribution of clinical topics may be distorted by the inclusion of formal scientific
language.
4.5 Conclusion
In this report, we described procedures for preprocessing electronic messages that other
biomedical researchers may find useful. We found that preprocessing markedly improved NLP
to enable discovery of clinical topics in a subsequent study of content.
5. Acknowledgments
We thank David S. Dodell, DMD, for permission to collect messages posted to the discussion
list. The Pittsburgh Biomedical Informatics Training Program 5T15LM007059 partially
supported this research.
Preprocessing messages posted by dentists
Page 19
6. References
1. Spallek H, Song M, Polk DE, Bekhuis T, Frantsve-Hawley J, Aravamudhan K. Barriers to
implementing evidence-based clinical guidelines: a survey of early adopters. J Evid Based Dent
Pract 2010; 10 (4): 195-206. PMID: 21093800. PMCID: PMC3011934.
2. Spallek H, Irwin JY, Schleyer T, Butler BS, Weiss PM. Supporting the emergence of dental
informatics with an online community. Int J Comput Dent 2007; 10 (3): 247-264.
3. Falkman G, Gustafsson M, Jontell M, Torgersson O. SOMWeb: a semantic web-based system
for supporting collaboration of distributed medical communities of practice. J Med Internet Res
2008; 10 (3): e25.
4. Sterne J. Text analytics for social media: evolving tools for an evolving environment. SAS
Institute, Inc., Cary, NC, n.d.
5. Chew C, Eysenbach G. Pandemics in the age of Twitter: content analysis of Tweets during the
2009 H1N1 outbreak. PLoS One 2010; 5 (11): e14118.
6. Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public
health informatics methods to analyze search, communication and publication behavior on the
Internet. J Med Internet Res 2009; 11 (1): e11.
7. Bekhuis T, Kreinacke M, Spallek H, Song M. Using the Natural Language Toolkit to reduce
the number of messages for in-depth content analyses: a case study. In: 2010 American Medical
Informatics Association Annual Symposium Proceedings, p.981; 2010 November 13-17;
Washington, DC.
Preprocessing messages posted by dentists
Page 20
8. Bekhuis T, Kreinacke M, Spallek H, Song M, O'Donnell JA. Using natural language
processing to enable in-depth analysis of clinical messages posted to an Internet mailing list: a
feasibility study. J Med Internet Res 2011 Nov 23;13(4):e98. doi:10.2196/jmir.1799.
PMID:22112583. PMCID:PMC3236668.
9. Eysenbach G, Till JE. Ethical issues in qualitative research on internet communities. BMJ.
2001 Nov 10;323(7321):1103-5. PMID:11701577.
10. Crocker DH. RFC 822 - Standard for the format of ARPA Internet text messages. 1982.
http://www.faqs.org/rfcs/rfc822.html (last accessed 06.21.2011).
11. Bernstein DJ. Internet mail: using maildir format. http://cr.yp.to/proto/maildir.html (last
accessed 06.21.2011).
12. Freed N, Borenstein N. RFC 2045 - Multipurpose Internet mail extensions (MIME) Part One:
format of Internet message bodies. http://www.faqs.org/rfcs/rfc2045.html (last accessed
06.21.2011).
13. Moore K. RFC 2047 - MIME (multipurpose Internet mail extensions) part three: message
header extensions for non-ASCII text. 1996. http://www.faqs.org/rfcs/rfc2047.html (last
accessed 06.21.2011).
14. Natural Language Toolkit. http://www.nltk.org/
15. Bird S, Klein E, Loper E. Natural Language Processing with Python. Sebastopol, CA:
O'Reilly; 2009.
Preprocessing messages posted by dentists
Page 21
16. Xiong L, Gardner J, Jurczyk P, Agichtein E, Lu JJ. Information Discovery on Electronic
Health Records, Vagelis Hristidis (Ed.). Chapman and Hall/CRC ; 2009. Privacy preserving
information discovery on EHRs; p.197-225.
17. Gardner J, Xiong L, Wang F, Post A, Saltz J, Grandison T. An evaluation of feature sets and
sampling techniques for de-identification of medical records. In: IHI '10 Proceedings of the 1st
ACM International Health Informatics Symposium; 2010 November 11-12; Arlington VA.
18. Beckwith BA, Mahaadevan R, Balis UJ, Kuo F. Development and evaluation of an open
source software tool for de-identification of pathology reports. BMC Medl Inform Decis Mak
2006; 6 (12): [electronic]. doi:10.1186/1472-6947-6-12.
19. Neamatullah I, Douglass MM, Lehman LH, Reisner A,Villarroel M, Long WJ, Szolovits P,
Moody GB, Mark RG, Clifford GD. Automated de-identification of free-text medical records.
BMC Medl Inform Decis Mak 2008; 8 (32).
20. Aho AV. Algorithms for finding patterns in strings. In: Handbook of Theoretical Computer
Science, Volume A: Algorithms and Complexity, Jan van Leeuwen (Ed.), 255-300. MIT Press
1994.
21. PHP: Regular expressions (Perl compatible). http://de.php.net/manual/en/book.pcre.php (last
accessed 06.21.2011).
22. Nonnecke B, Preece J. Lurker demographics: counting the silent. In: Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems. 2000 April 01-06; The Hague,
Netherlands.
Preprocessing messages posted by dentists
Page 22
23. Li HBC, Cao YBC,Tang ZBC. Electronic mail data cleaning. Patent No. US 7,590,608 B2.
Microsoft Corporation (Redmond, WA, US). 2009.
http://www.freepatentsonline.com/7590608.pdf (last accessed 06.21.2011).
24. Bogg, P. Pattern based approaches to pre-processing structured text: a newsfeed example. In:
Proceedings of the Third International Conference on Computational Science, ICCS. 2003 June
2-4; Mebourne, Australia and St. Petersburg, Russia.
Download