Enron Emails

Philip Saponaro
Background
• Founded in 1985, bankrupt in 2001
– Sustained finances by accounting fraud
• Federal Energy Regulatory Commission (FERC)
– Released emails to public as part of investigation
– Problems with original dataset
• Invalid email addresses
• Sensitive information of employees
• Duplicate employees, misspellings
Background
• Dataset was cleaned up and prepared
– SRI International
• A research and development organization
• Might hear “CALO dataset”
• Made data easier to parse
– Invalid email addresses converted to the form user@enron.com
• Removed duplicate employees
For the most part…
phanis-s is a misspelling of panus-s
whalley-l is a duplicate of whalley-g
Their emails are stored in separate folders, even though each pair is the same employee
Dataset Information
• Organized into folders by employee
– 150 employees listed, including the two duplicates
– Mostly management
• Not all employees are guilty of fraud
• 517,431 email messages
• Attachments removed to keep file size manageable
– Attachments available in original FERC dataset
– Electronic Discovery Reference Model (EDRM) has attachments
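As a rough illustration of the folder-per-employee layout described above, the following sketch counts message files under each employee folder. The function name `count_messages` and the `maildir_root` argument are hypothetical, and the sketch assumes every regular file below an employee folder is one email.

```python
import os
from collections import Counter


def count_messages(maildir_root):
    """Count files under each top-level (employee) folder.

    Assumption: every regular file below an employee folder is one email.
    """
    counts = Counter()
    for employee in sorted(os.listdir(maildir_root)):
        employee_dir = os.path.join(maildir_root, employee)
        if not os.path.isdir(employee_dir):
            continue  # skip stray files at the top level
        # os.walk visits the mailbox folders (inbox, sent, ...) recursively.
        for _dirpath, _dirnames, filenames in os.walk(employee_dir):
            counts[employee] += len(filenames)
    return counts
```

Summing the returned counts gives the total number of messages, and the number of keys gives the number of employee folders, which can be compared against the figures on this slide.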
Dataset Information
• Each directory has folders from mailbox
– Inbox, sent, deleted, etc.
• Each email has a header
– Starts with Message-ID: some_id
– Contains To:, From:, Subject:, Date:, etc.
• Dates made canonical, replacing raw date
– Also contains the character set used
• Content-Type tag
• ASCII, Latin-1, etc.
• Some folders contain duplicate email messages
– _sent_mail, all_documents, discussion_threads
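The header fields listed above can be read with Python's standard `email` package. This is a minimal sketch: the `SAMPLE` message is invented for illustration (it is not taken from the dataset), and `summarize` is a hypothetical helper name.

```python
from email import message_from_string

# Hypothetical message in the general shape described on this slide.
SAMPLE = """Message-ID: <12345.example@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: sender@enron.com
To: recipient@enron.com
Subject: Example
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Body text here.
"""


def summarize(raw):
    """Return the main header fields, the character set, and the body."""
    msg = message_from_string(raw)
    return {
        "message_id": msg["Message-ID"],
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        # The character set comes from the Content-Type header.
        "charset": msg.get_content_charset(),
        "body": msg.get_payload(),
    }


info = summarize(SAMPLE)
```

Because some folders hold duplicate copies of the same message, the Message-ID value is one plausible key for de-duplication, though that is an assumption rather than something the slides prescribe.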
Statistics
• All statistics from CALO
– Available on /m/blizzard/corpora/enron_emails
– Available online: http://www.cs.cmu.edu/~enron/
– Used two versions: original (“dirty”) and cleaned
• Removed long security key strings
• Removed weird email formatting
– *********TEXT********
• Removed any HTML tags and some header info
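The cleaning steps above could be approximated with a few regular expressions. This is only a sketch, not the procedure actually used for these slides: the patterns, the 40-character threshold for "long" strings, and the name `clean_text` are all assumptions.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML-like tags
STAR_RE = re.compile(r"\*{3,}")          # decorative runs of asterisks
LONG_TOKEN_RE = re.compile(r"\S{40,}")   # assumed cutoff for key-like strings


def clean_text(text):
    """Strip tags, asterisk decoration, and very long unbroken strings."""
    text = TAG_RE.sub(" ", text)
    text = STAR_RE.sub(" ", text)
    text = LONG_TOKEN_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())
```

Note that the tag pattern also removes anything else written in angle brackets, such as `<name@enron.com>`, so a real cleaning pass would likely need a narrower pattern.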
Statistics
• Token/Type ratio
– Dirty: 63.2
– Clean: 87.4
• Hapax Legomena
– Dirty: 41.8% of types, 0.66% of tokens
– Clean: 24.4% of types, 0.27% of tokens
• Average Word Length
– Dirty: 23.8, due to long ID strings and similar tokens
– Clean: 13.7, due to email addresses, website links, file names
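The statistics on this slide can be computed along the following lines. The tokenization (lowercased, whitespace-separated) is an assumption, as is averaging word length over tokens rather than types; the slides do not state either choice. `corpus_stats` is a hypothetical name.

```python
import re
from collections import Counter


def corpus_stats(text):
    """Token/type ratio, hapax legomena shares, and average word length."""
    tokens = re.findall(r"\S+", text.lower())
    if not tokens:
        raise ValueError("text contains no tokens")
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: types that occur exactly once in the corpus.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "token_type_ratio": n_tokens / n_types,
        "hapax_pct_of_types": 100.0 * hapaxes / n_types,
        "hapax_pct_of_tokens": 100.0 * hapaxes / n_tokens,
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens,
    }
```

For the toy input `"a a b c"` there are 4 tokens and 3 types, of which `b` and `c` are hapaxes, so the ratio is 4/3 and hapaxes make up two thirds of the types and half of the tokens.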
Statistics -- Pronouns
• Percentage of Pronouns
– Dirty: 2.2%
– Clean: 2.4%
– Google: 1.5%
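A pronoun percentage like the ones above can be estimated by matching tokens against a fixed word list. The list `PRONOUNS` below is a small assumed set of English personal pronouns, not the list used for these slides, and `pronoun_percentage` is a hypothetical name.

```python
import re

# Assumed pronoun list for illustration; the original list is not given.
PRONOUNS = {
    "i", "me", "my", "mine", "myself",
    "you", "your", "yours", "yourself",
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
    "it", "its", "itself",
    "we", "us", "our", "ours", "ourselves",
    "they", "them", "their", "theirs", "themselves",
}


def pronoun_percentage(text):
    """Percentage of word tokens that appear in PRONOUNS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in PRONOUNS)
    return 100.0 * hits / len(tokens)
```

The result depends heavily on the word list and tokenizer, so figures from this sketch would not be directly comparable to the percentages on the slide.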
Statistics -- Contractions
• Percentage of Contractions
– Dirty: 0.3%
– Clean: 0.33%
– Google: 0.0002%
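One way to estimate a contraction percentage is to match tokens that end in a common contraction suffix. The suffix pattern below is an assumption, and it also counts possessives such as "Enron's" because they share the `'s` ending; `contraction_percentage` is a hypothetical name.

```python
import re

# Assumed suffixes: 's, 't, 're, 've, 'll, 'd, 'm (also matches possessive 's).
CONTRACTION_RE = re.compile(r"[a-z]+'(?:s|t|re|ve|ll|d|m)")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def contraction_percentage(text):
    """Percentage of word tokens that look like contractions."""
    tokens = WORD_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if CONTRACTION_RE.fullmatch(t))
    return 100.0 * hits / len(tokens)
```

In `"I don't think it's ready"`, two of the five tokens (`don't` and `it's`) match, giving 40%.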
Publications
• Exploration of Communication Networks from the Enron Email Corpus
– Jana Diesner, Kathleen M. Carley
– CMU paper on social networks and cliques within Enron, with an analysis of the volume of email sent during the Enron crisis
• Graph Theoretic and Spectral Analysis of Enron Email Data
– Anurat Chapanond, Mukkai S. Krishnamoorthy, and Bülent Yener
– Uses graph theory and spectral analysis to discover social structures within Enron
Helpful Links
• EDRM – Enron Emails + attachments in PST format
– http://edrm.net/resources/data-sets/enron-data-setfiles
• CALO dataset – CMU website with descriptions and links
– http://www-2.cs.cmu.edu/~enron/
• Tools – Visualization, Database, hand-annotated sets
– http://bailando.sims.berkeley.edu/enron_email.html
The End?