TECHNICAL REPORT no. 09262011 , J.A. O’Donnell

Page 1 TECHNICAL REPORT no. 09262011 Preprocessing messages posted by dentists to an Internet mailing list: a report of methods developed for a study of clinical content M. Kreinacke1, T. Bekhuis2, *, H. Spallek3, M. Song3, J.A. O’Donnell4 1 Institute of Business Taxation, Leibniz University of Hanover, Germany 2 Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pennsylvania, US 3 Center for Dental Informatics, Department of Dental Public Health, School of Dental Medicine, University of Pittsburgh, Pennsylvania, US 4 Office of Education and Curriculum, School of Dental Medicine, University of Pittsburgh, Pennsylvania, US * Corresponding author: Tanja Bekhuis, PhD, MS, MLIS Department of Biomedical Informatics, School of Medicine, University of Pittsburgh 5607 Baum Boulevard, Room 514, Pittsburgh, PA, US 15206 Telephone: +1 412-648-9324 E-mail: tcb24@pitt.edu Preprocessing messages posted by dentists Page 2 Summary Objectives: Mining social media artifacts requires substantial processing before content analyses. In this report, we describe our procedures for preprocessing 14,576 e-mail messages sent to a mailing list of several hundred dental professionals. Our goal was to transform the messages into a format useful for natural language processing (NLP) to enable subsequent discovery of clinical topics expressed in the corpus. Methods: Preprocessing involved message capture, database creation and import, extraction of multipurpose Internet mail extensions, decoding of encoded text, de-identification, and cleaning. We also developed a Web-based tool to identify signals for noisy strings and sections, and to verify the effectiveness of customized noise filters. We tailored our cleaning strategies to delete text and images that would impede NLP and in-depth content analyses. Before applying the full set of filters to each message, we determined an effective filter order. Results: Preprocessing messages improved effectiveness of NLP by 38%. Sources of noise included personal information in the salutation, the farewell, and the signature block; names and places mentioned in the body of the text; threads with quoted text; advertisements; embedded or attached images; spam- and virus-scanning notifications; auto text parts; e-mail addresses; and Web links. We identified 53 patterns of noise and delivered a set of de-identified and cleaned messages to the NLP analyst. Conclusion: Preprocessing electronic messages can markedly improve subsequent NLP to enable discovery of clinical topics. Keywords: Electronic mail; data processing; natural language processing; dental informatics Preprocessing messages posted by dentists Page 3 1. Introduction Translation of the best clinical evidence for dentists at the point of care is a major challenge because we know little about what it is they want or need to know (1). To answer this question, social media artifacts, such as text-based messages, images, or videos, are potentially rich data sources due to their currency and presumed content validity. For example, mining exchanges among peers in a virtual community of practice (2, 3) is likely to be more fruitful than fielding a survey, especially when the response rate is typically low and items are in a forced choice format. Nevertheless, survey methodology is well developed, whereas mining social media is an emergent field (4–6). Thus, procedural questions arise. A first step in developing useful procedures is to consider the entire workflow in terms of research informatics. In other words, consider which procedures need to be carried out to complete various tasks and in what order to ensure the feasibility of a study. For researchers interested in mining social media artifacts, one of the first tasks is to identify a resource with artifacts worth analyzing. For our team, an overriding goal is to identify clinical topics relevant to practicing dentists. We therefore elected to mine e-mail messages posted to a private Internet discussion list of dental professionals, primarily dentists from North America. This particular list has endured for several years, has a relatively stable number of subscribers, and has a clinical focus. To ensure the feasibility of labor-intensive content analyses, we devised a workflow to enable our planned research. The larger project is described in (7, 8). In this report, we describe how we preprocessed electronic messages. Substantial preprocessing is essential before meaningful natural language processing (NLP) can be carried out. NLP, in Preprocessing messages posted by dentists Page 4 turn, is a precursor to subsequent qualitative content analyses. Thus, an initial task involves transforming messages into a format useful for NLP to enable eventual discovery of potentially relevant clinical topics (see Fig. 1). Fig. 1. Workflow for preprocessing electronic messages Additionally, we offer this report as a kind of primer on how to process e-mail, beginning with very basic notions regarding message capture and format, and ending with procedures for identifying noise and developing filters. Our intention is to encourage researchers to explore electronic mail as an interesting data source. 2. Methods Preprocessing messages posted by dentists Page 5 2.1 Message capture and format Collecting messages from a social media-based member exchange, such as a subscription mailing list, requires permission from the owner or the administrator of the exchange, and in some cases, community leaders. Automated screen scraping may be possible without permission, but this raises ethical questions not addressed here (9). Additionally, investigators need to obtain Institutional Review Board approval. For this study, we obtained approval from both the administrator (the founder of the mailing list) and the University of Pittsburgh Institutional Review Board (IRB PRO08040313). Once permission is granted and IRB approval attained, collecting messages is easily achieved by subscribing to the list during the study interval. At the end of the study interval, messages can be saved from the mail account. Internet mail is stored in a defined file format as described by RFC 822 (10). Each message consists of a header with meta-information about the message followed by a body, which is usually textual although images are sometimes embedded or attached. The header consists of the subject line, a date and time stamp, sender and recipient information, as well as technical metadata usually not displayed to the user, such as a list of mail servers involved in delivery. Each header line is a key value pair that typically looks like “Subject: <the message topic>”. A single blank line separates the header from the body of the message. When texts are written in English, they can be stored in ASCII format as is. However, texts written in languages with accented characters usually require some form of encoding. This is because e-mail is historically limited to 7-bit character sets. We briefly describe the implications for decoding messages in section 2.4. 2.2 Database Preprocessing messages posted by dentists Page 6 Messages captured over time must be imported into a database. We used a MySQL database server for data storage and developed all tools in the PHP programming language on a Linux server. However, our procedures are generalizable to other environments. Most modern mail server systems store each message in a separate file, usually in the Maildir format (11). Thus, the import process must iterate over all files in a folder and process each one as a message. The metadata is read first from the message header in a line-by-line manner until a blank line appears. The rest of the file is the body part and can be read in one great chunk. The metadata of interest can now be stored along with the body part in the database. 2.3 Computing sender statistics We computed sender statistics by querying the metadata for the messages stored in the database prior to de-identification. We also processed sender names and addresses to ensure that authors could be identified even with changing e-mail addresses and variant spellings. To identify senders, one must analyze the “From” field in the header, which contains the sender’s e-mail address and often a name. The standard format is “First Last <user@example.com>”; variants include “<user@example.com> (First Last)”, “<user@example.com> First Last” or just the e-mail address without a name. Thus, the first step involves matching names and e-mail addresses, and then saving these in separate database columns. Senders sometimes change the spelling of their names. Thus, it would seem straightforward to count users based on their e-mail addresses rather than their names. However, users sometimes change their e-mail addresses or use two addresses, e.g., one for the office and one for home. Counting e-mail addresses therefore inflates the sender count. We corrected this by counting two Preprocessing messages posted by dentists Page 7 or more e-mail addresses as one if they were used along with the same sender name. This task required trivial manual intervention. However, it could be carried out with SQL statements within the database. We created two intermediate database views with queries and subqueries that could be used to identify senders with multiple addresses and/or multiple variants of their names. The result was a translation table with three columns: the sender’s name, a list of addresses, and the ‘main’ address. The first two columns were populated with data from the database views of senders having more than one address. We then assigned a main address. The last step was another database view with two columns: one for unique addresses and one for the number of messages sent from an address. This view applied the translation table so that each sender was listed only once with his or her main address. We exported the data from this view into an Excel sheet to compute sender statistics. 2.4 Decoding the message body In the simplest case, the message body contains text as is. If so, no decoding is required. However, some form of translation is necessary in a number of other cases. For example: a. The message text uses a character set other than the American standard US ASCII. This is so for languages with accented characters, non-Latin characters or special characters not contained in the US ASCII character set. Most notable are Latin 1 (Western European languages, ISO 8859 1), Code Page 1252 (legacy Windows systems), and UTF 8 (Unicode). Unicode character sets contain characters from a large number of languages and in this way combine the other mentioned sets. With Unicode, it is possible to have English, Chinese, and Arabic text in the same document. The name of the character set used in a message is noted in its message Preprocessing messages posted by dentists Page 8 header. We found as many as 10 different character sets in the messages. About 48% were in Latin 1, 46% in US-ASCII, and 4% in UTF-8. In order to make the entire set of messages useable for NLP, conversion into one uniform character representation was necessary. b. If a message contains long lines of more than 1,000 characters or non-ASCII characters, it can ‘break’ some mail systems because of historical limitations. In this case, a content transfer encoding transforms the data into an acceptable form (12). Current implementations can handle content with 8-bit characters and lately even multibyte representations, e.g., UTF 8. Another purpose of transfer encoding is to embed images and binary attachments. Still, a transfer encoding is sometimes applied to text parts. The result is usually longer in bytes, but guaranteed to fit the 7-bit restriction. Usual encodings are Quoted Printable, Base64, and UUencode. Note that the applied encoding is specified in the message header so that the receiving e-mail client can handle the content. Note also that the message headers themselves are always in 7-bit USASCII format with no encoding except for an optional inline encoding for the subject. c. If a message comprises more than one part, there must be some way to distinguish among them. Multiple parts can be alternative representations of the same content. For example, when the writer formats his or her text, the formatted text body is sent in HTML or RTF format along with a plain text body for recipients unable to handle the formatting. Furthermore, formatted text may contain inline images that make up additional parts just as attachments that the sender adds to the message. The Multipurpose Internet Mail Extensions (MIME) define a standard for appending multiple MIME parts with additional meta-information on how to handle them (13). Preprocessing messages posted by dentists Page 9 To enable subsequent natural language processing and content analyses, we removed e-mail encodings. We also disregarded any formatting, dropped attachments and inline images, and converted all messages to a uniform character representation in the UTF 8 character set. Thus, the task of decoding the message body consisted of the following steps (order matters): 1. Split up the MIME parts if more than one exists; detect the plain text part and omit all others. If only a formatted text part is included, convert to a plain text representation. 2. Decode the content transfer encoding if one is applied. The result is human readable text. 3. Convert the applied character set to UTF 8. 4. Store the text in the database. 2.5 De-identification and text cleaning Preliminary de-identification took place when we imported the data. This involved saving message dates and bodies in the database, and disregarding sender and recipient names, as well as e-mail addresses. Note that we imported the e-mail addresses into a database used to compute sender statistics, but not into the database used for NLP. At this stage, messages from the mailing list were technically useable for NLP. However, remaining noise seriously impeded discovery of clinical topics in subsequent analyses. (By noise we mean data irrelevant to the task of finding clinical topics in the corpus of e-mail messages.) A major source of noise is personal information, such as data found in the salutation, the farewell, and the signature block at the bottom of the message, as well as names and places inside the body of the text. The latter can be removed easily using a wordlist and a gazetteer available in Preprocessing messages posted by dentists Page 10 the Natural Language Toolkit (14, 15). However, signature blocks are more challenging as they appear in many forms and therefore require tailored algorithms for detection and removal. Common to most forms of noise is that a human reader can effortlessly identify them based on his or her experience and from the context, whereas an automated tool may be much less successful. This is partly because no consistent markers identify the message parts. Although automated tools have been developed primarily to de-identify electronic health records and clinical reports (16-19), to our knowledge, none is good enough for the task of processing messages to be used as input for natural language processing. To cope with the sheer message volume, we therefore decided on a semi-automated approach. In general, we applied customized text filters based on regular expressions to filter out noise. Regular expressions are text patterns written in a formal language that can be interpreted by a processor (20, 21). They provide a concise way to find text patterns of interest. To write regular expressions, we searched for words that signal noisy sections. We created a number of regular expressions based on our experience reading messages from the discussion list, as well as on signals we identified. After finding and replacing noisy strings, we further examined individual messages and then added regular expressions to remove remaining noise. The process for finding noise and developing filters consisted of five steps: 1. We created a Web-based tool to facilitate fast manual inspection of individual messages. The user can display randomly selected message(s) or search for signaling words within message(s). Data for this tool are the decoded messages in plain text. Preprocessing messages posted by dentists Page 11 2. Using our Web-based tool, we identified typical patterns of noise and then created appropriate text filters based on regular expressions to identify noisy strings or sections. For example, a reply to an earlier message often starts with quoted text preceded by a marker automatically inserted by the mail client. A typical marker consists of one or two lines of text resembling the following: “On <date>, <name> wrote: ”. At times, a line break follows the comma. The marker sometimes starts with “At” instead of “On” and may carry a number of hyphens at the start. Similar patterns have the form “In article <reference> <name> wrote: ” or “<name> wrote in message <reference>: ”. Common to all such patterns is that they directly precede quoted text. One goal was to find similar patterns of noise in a large number of messages with a small set of filters. Thus, the filters were written to match several patterns. For example, for patterns where the subject starts with either “On” or “At” with a possible comma between the date and the name, the expression “(On|At) <date>,? <name>” matches 4 different strings. 3. In order to verify the effect of our text filters, we further developed the Web tool to display simultaneously filtered and non-filtered messages. We carried out three tasks with the enhanced Web tool: a. Verify that noisy strings or sections are successfully identified and that cleaning is effective. b. Verify that relevant text about a clinical topic is not accidentally removed if it resembles a pattern of noise. c. Search for remaining patterns of noise. Preprocessing messages posted by dentists Page 12 4. After interacting with the enhanced Web tool, we repeated steps 2 and 3 until no further noise was observed. 5. Lastly, we applied the complete set of filters to each message and saved the cleaned messages in the database. Many messages contain two or three lines of advertisement from free e-mail providers displayed at the bottom and others contain confidentiality notifications. Similar forms of noise are inline notifications from spam- and virus-scanning software, as well as auto text parts from Microsoft Outlook, such as “The message is ready to be sent with the following file …”, and inline images in the old UUencode format. Signatures are often prefaced with a blank line followed by a line of only two hyphens. However, this pattern may occur in the middle of a message text if the author typed two hyphens, e.g., when trying to automate insertion of an em dash. Moreover, many e-mail clients use different markers for signatures or no markers at all. In the latter case, signatures can be detected automatically by matching a typical pattern inside the signature block itself, assuming a pattern exists. We removed remaining noise consisting of e-mail addresses, Web links, and dates and times mentioned in the messages. Some patterns occur within other patterns. For example, e-mail addresses can appear in a text body as well as within a signature block. For these cases, the order in which text filters are applied matters. For example, it is advisable to detect signatures first because the appearance of e-mail addresses and links helps to identify the signature block. Preprocessing messages posted by dentists Page 13 The process for text cleaning consisted of six steps, in this order: 1. Remove UUencoded inline images. 2. Remove quoted text and software generated auto texts. 3. Remove signatures, advertisements, and virus scanner notifications. 4. Remove e-mail addresses, Web links, and dates and times. 5. Repeat Steps 3 and 4 until no further matches are found. 6. Remove fragments of cascading style sheets. Signatures, advertisements, and virus scanner notifications can appear in any order at the end of the message, but only the last pattern is detected at a time. This is because location is part of the pattern. The filters in step 3 for text cleaning must thus be applied iteratively. Repetition along with step 4 helps when signature detection fails, as can happen when the sender adds an e-mail address or a link below his or her signature. Additionally, removal of quoted text must occur before removal of signatures because quoted text may appear at the bottom of the message. In sum, after preprocessing, the resultant corpus consisted of de-identified, plain text messages relatively free of noise. Most importantly, the messages were now useful for meaningful NLP. 3. Results 3.1 Senders Preprocessing messages posted by dentists Page 14 Approximately 450 dental professionals subscribed to the discussion list during the study interval from April 2008 to May 2009. Of the subscribers, 245 (54%) posted messages. Just 20 dental professionals (8% of those who posted messages; about 5% of all subscribers) sent 7,288 (50%) of the messages. A small percentage of the senders (1%) were responsible for about 10% of the total traffic; 36% of the senders sent 90% of the messages (see Fig. 2). This distribution is typical of other virtual communities of practice or discussion lists (3, 22). The mean number of messages sent was 59 (range: 1 to 699). 800 700 N messages sent 600 500 400 300 200 100 Subscriber (ordered by N messages sent) Fig. 2. Messages sent by subscriber 3.2 Messages and patterns of noise 241 231 221 211 201 191 181 171 161 151 141 131 121 111 91 101 81 71 61 51 41 31 21 11 1 0 Preprocessing messages posted by dentists Page 15 We retrieved 14,576 messages posted to the mailing list by dental professionals. We identified 53 patterns of noise and cleaned 12,674 (87%) messages. Thirteen percent (1902) of the messages were devoid of noisy strings or segments. Patterns included: 1 pattern for UUencoded inline images; 18 patterns for quoted text; 3 patterns for software generated auto text; 22 patterns for signatures, advertisements, and virus scanner notifications; 8 patterns for e-mail addresses, Web links, and dates and times; and 1 pattern for fragments of cascading style sheets. Most of the messages (n=10,490; 72%) contained quoted text in threads that we removed. 3.3 Effectiveness We informally assessed the effectiveness of preprocessing by comparing the number of clinically relevant collocated trigrams (triples of text strings or ‘words’) derived from text processed with preliminary vs final procedures. During NLP, we selected the top 300 collocated trigrams by sorting according to pointwise mutual information. When the input text had been processed using preliminary procedures, NLP yielded 103 (34%) clinically relevant collocated trigrams. Whereas when the input text had been processed using final procedures, NLP yielded 142 (47%) clinically relevant collocated trigrams. This is a 38% improvement in effectiveness. For details regarding NLP in the larger study, see (8). 4. Discussion To our knowledge, tools to process electronic mail for research endeavors are not publicly available. However, Li, Cao, and Tang from Microsoft were recently awarded a patent for a data Preprocessing messages posted by dentists Page 16 processing system that cleans electronic mail (Patent No. US 7,590,608 B2; see (23)). The motivation for their system is that cleaned messages can be used for automatic routing, SPAM filtering, summarization, information extraction, discovery of trends in news groups, and more (cf. section 5 in (23)). Until such tools become available, in-house development is necessary. 4.1 What is noise? In carrying out this project, we discerned that defining noise in social media artifacts is essential. Consider, for example, that the purpose of de-identification of medical records or clinical reports is typically to preserve confidentiality, and perhaps to facilitate data sharing. In contrast, de-identification here serves another very important function, i.e., to remove data about the senders that could obscure discovery of clinical topics during NLP and content analyses. Hence, personal information is noise in this context. To expunge noise adequately implies that one knows what kind of data will impede fruitful analysis. For example, our preliminary cleaning efforts were rather rudimentary. Consequently, early NLP analyses were disappointing. We therefore had to reconsider what kinds of strings, segments, images, and attachments were in fact noise as we defined it. Once we had a better idea of what we were looking for, cleaning improved dramatically. 4.2 Reducing potential bias The majority of messages include replies to earlier ones. Most of these consist of a new message along with response(s) that might accrue in a thread or chain of messages. If we had retained the responses in the threads, the relative frequencies of words and phrases, and therefore the empirical probabilities, would not accurately characterize the clinical topics. Rather, frequencies Preprocessing messages posted by dentists Page 17 based on threads probably would be more important for discourse analysis. As we are concerned with discovering clinical topics of interest to dentists instead of analyzing the conversations per se, we took care to delete threaded responses. Note that in longer threads, the new message can make up less than 10% of the e-mail. In our opinion, it is likely that retaining the threads could have severely biased our natural language processing of messages. Bogg (24) makes a similar point in terms of removing information that could otherwise skew the data in subsequent text mining. 4.3 Limitations To the extent that patterns of noise identified in this report typify e-mail, our preprocessing may generalize to studies where the unit of analysis is an e-mail message. However, to the extent that some patterns are unique, generalizability may be limited. For example, signature blocks are particularly challenging, as some of them do not follow a clear pattern. We therefore had to add customized signature filters for a small group of senders. Detecting responses to threaded messages can also be problematic. Typically, each line of an earlier message is preceded with “>”, but this is not the case when the sender uses the forward function instead of the reply function. In some cases, a sender uses the forward function and then replies within the thread. To mark his reply, he may use idiosyncratic formatting, such as bold text, a different color, or even “>”. For these scenarios, pattern-matching algorithms cannot distinguish a new message from earlier messages in the thread. Other problems occur when quoted text is damaged and its marker is missing. For example: > Here is quoted text that is wrapped, Preprocessing messages posted by dentists Page 18 but the marker is not repeated. In this case, automated pattern matching will miss the second line. 4.4 Future research In the future, we will adopt a more structured approach for finding patterns of noise and evaluating accuracy (e.g., see (24)). Additionally, dentists interested in best clinical care sometimes include excerpts from scientific articles in their messages. As these are not easily identified with rule-based pattern matching, we will consider using statistical approaches to identify such excerpts. The reason for considering article excerpts as potential sources of noise is that the distribution of clinical topics may be distorted by the inclusion of formal scientific language. 4.5 Conclusion In this report, we described procedures for preprocessing electronic messages that other biomedical researchers may find useful. We found that preprocessing markedly improved NLP to enable discovery of clinical topics in a subsequent study of content. 5. Acknowledgments We thank David S. Dodell, DMD, for permission to collect messages posted to the discussion list. The Pittsburgh Biomedical Informatics Training Program 5T15LM007059 partially supported this research. Preprocessing messages posted by dentists Page 19 6. References 1. Spallek H, Song M, Polk DE, Bekhuis T, Frantsve-Hawley J, Aravamudhan K. Barriers to implementing evidence-based clinical guidelines: a survey of early adopters. J Evid Based Dent Pract 2010; 10 (4): 195-206. PMID: 21093800. PMCID: PMC3011934. 2. Spallek H, Irwin JY, Schleyer T, Butler BS, Weiss PM. Supporting the emergence of dental informatics with an online community. Int J Comput Dent 2007; 10 (3): 247-264. 3. Falkman G, Gustafsson M, Jontell M, Torgersson O. SOMWeb: a semantic web-based system for supporting collaboration of distributed medical communities of practice. J Med Internet Res 2008; 10 (3): e25. 4. Sterne J. Text analytics for social media: evolving tools for an evolving environment. SAS Institute, Inc., Cary, NC, n.d. 5. Chew C, Eysenbach G. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLoS One 2010; 5 (11): e14118. 6. Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J Med Internet Res 2009; 11 (1): e11. 7. Bekhuis T, Kreinacke M, Spallek H, Song M. Using the Natural Language Toolkit to reduce the number of messages for in-depth content analyses: a case study. In: 2010 American Medical Informatics Association Annual Symposium Proceedings, p.981; 2010 November 13-17; Washington, DC. Preprocessing messages posted by dentists Page 20 8. Bekhuis T, Kreinacke M, Spallek H, Song M, O'Donnell JA. Using natural language processing to enable in-depth analysis of clinical messages posted to an Internet mailing list: a feasibility study. J Med Internet Res 2011 Nov 23;13(4):e98. doi:10.2196/jmir.1799. PMID:22112583. PMCID:PMC3236668. 9. Eysenbach G, Till JE. Ethical issues in qualitative research on internet communities. BMJ. 2001 Nov 10;323(7321):1103-5. PMID:11701577. 10. Crocker DH. RFC 822 - Standard for the format of ARPA Internet text messages. 1982. http://www.faqs.org/rfcs/rfc822.html (last accessed 06.21.2011). 11. Bernstein DJ. Internet mail: using maildir format. http://cr.yp.to/proto/maildir.html (last accessed 06.21.2011). 12. Freed N, Borenstein N. RFC 2045 - Multipurpose Internet mail extensions (MIME) Part One: format of Internet message bodies. http://www.faqs.org/rfcs/rfc2045.html (last accessed 06.21.2011). 13. Moore K. RFC 2047 - MIME (multipurpose Internet mail extensions) part three: message header extensions for non-ASCII text. 1996. http://www.faqs.org/rfcs/rfc2047.html (last accessed 06.21.2011). 14. Natural Language Toolkit. http://www.nltk.org/ 15. Bird S, Klein E, Loper E. Natural Language Processing with Python. Sebastopol, CA: O'Reilly; 2009. Preprocessing messages posted by dentists Page 21 16. Xiong L, Gardner J, Jurczyk P, Agichtein E, Lu JJ. Information Discovery on Electronic Health Records, Vagelis Hristidis (Ed.). Chapman and Hall/CRC ; 2009. Privacy preserving information discovery on EHRs; p.197-225. 17. Gardner J, Xiong L, Wang F, Post A, Saltz J, Grandison T. An evaluation of feature sets and sampling techniques for de-identification of medical records. In: IHI '10 Proceedings of the 1st ACM International Health Informatics Symposium; 2010 November 11-12; Arlington VA. 18. Beckwith BA, Mahaadevan R, Balis UJ, Kuo F. Development and evaluation of an open source software tool for de-identification of pathology reports. BMC Medl Inform Decis Mak 2006; 6 (12): [electronic]. doi:10.1186/1472-6947-6-12. 19. Neamatullah I, Douglass MM, Lehman LH, Reisner A,Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated de-identification of free-text medical records. BMC Medl Inform Decis Mak 2008; 8 (32). 20. Aho AV. Algorithms for finding patterns in strings. In: Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, Jan van Leeuwen (Ed.), 255-300. MIT Press 1994. 21. PHP: Regular expressions (Perl compatible). http://de.php.net/manual/en/book.pcre.php (last accessed 06.21.2011). 22. Nonnecke B, Preece J. Lurker demographics: counting the silent. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2000 April 01-06; The Hague, Netherlands. Preprocessing messages posted by dentists Page 22 23. Li HBC, Cao YBC,Tang ZBC. Electronic mail data cleaning. Patent No. US 7,590,608 B2. Microsoft Corporation (Redmond, WA, US). 2009. http://www.freepatentsonline.com/7590608.pdf (last accessed 06.21.2011). 24. Bogg, P. Pattern based approaches to pre-processing structured text: a newsfeed example. In: Proceedings of the Third International Conference on Computational Science, ICCS. 2003 June 2-4; Mebourne, Australia and St. Petersburg, Russia.

TECHNICAL REPORT no. 09262011 , J.A. O’Donnell

Related documents

Products

Support

TECHNICAL REPORT no. 09262011 , J.A. O’Donnell

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib