Enron Emails
Philip Saponaro

Background
• Enron was founded in 1985 and went bankrupt in 2001
  – Sustained its finances through accounting fraud
• Federal Energy Regulatory Commission (FERC)
  – Released the emails to the public as part of its investigation
  – Problems with the original dataset
    • Invalid email addresses
    • Sensitive information of employees
    • Duplicate employees, misspellings

Background (continued)
• Dataset was cleaned up and prepared
  – By SRI International
    • A research and development organization
    • You may hear it called the “CALO dataset”
• Made data easier to parse
  – Invalid email addresses converted to the form user@enron.com
• Removed duplicate employees
  – For the most part…
  – phanis-s is a misspelling of panus-s
  – whalley-l is a duplicate of whalley-g
  – Their emails are stored separately, even though each pair is the same employee

Dataset Information
• Organized into folders by employee
  – 150 employees listed, including the two duplicates
  – Mostly management
• Not all employees are guilty of fraud
• 517,431 email messages
• No attachments, to keep the file size manageable
  – Attachments available in the original FERC dataset
  – The Electronic Discovery Reference Model (EDRM) version has attachments

Dataset Information (continued)
• Each employee directory contains the folders from that mailbox
  – Inbox, sent, deleted, etc.
• Each email has a header
  – Starts with Message-ID: some_id
  – Contains To:, From:, Subject:, Date:, etc.
    • Dates made canonical, replacing the raw date
  – Also contains the character set used
    • Content-Type tag
    • ASCII, Latin-1, etc.
• Some folders contain duplicate email messages
  – _sent_mail, all_documents, discussion_threads

Statistics
• All statistics computed on the CALO dataset
  – Available on /m/blizzard/corpora/enron_emails
  – Available online: http://www.cs.cmu.edu/~enron/
  – Used two versions: original (“dirty”) and cleaned
    • Removed long security key strings
    • Removed weird email formatting, e.g. *********TEXT********
    • Removed any HTML tags and some header info

Statistics (continued)
• Token/Type ratio
  – Dirty: 63.2
  – Clean: 87.4
• Hapax Legomena (types that occur only once)
  – Dirty: 41.8% of types, 0.66% of tokens
  – Clean: 24.4% of types, 0.27% of tokens
• Average Word Length
  – Dirty: 23.8, due to long ID strings and such
  – Clean: 13.7, due to email addresses, website links, file names

Statistics -- Pronouns
• Percentage of Pronouns
  – Dirty: 2.2%
  – Clean: 2.4%
  – Google: 1.5%

Statistics -- Contractions
• Percentage of Contractions
  – Dirty: 0.3%
  – Clean: 0.33%
  – Google: 0.0002%

Publications
• Exploration of Communication Networks from the Enron Email Corpus
  – Jana Diesner and Kathleen M. Carley
  – CMU paper on social networks, cliques, etc. in Enron, with an analysis of the volume of email sent during the Enron crisis
• Graph Theoretic and Spectral Analysis of Enron Email Data
  – Anurat Chapanond, Mukkai S. Krishnamoorthy and Bülent Yener
  – Uses graph theory and spectral analysis to discover social structures within Enron

Helpful Links
• EDRM
  – Enron emails + attachments in PST format
  – http://edrm.net/resources/data-sets/enron-data-setfiles
• CALO dataset
  – CMU website with descriptions and links
  – http://www-2.cs.cmu.edu/~enron/
• Tools
  – Visualization, database, hand-annotated sets
  – http://bailando.sims.berkeley.edu/enron_email.html

The End?
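
As a follow-up to the header layout and the statistics slides, here is a minimal Python sketch of how one message could be read and how figures such as the token/type ratio and hapax legomena could be computed. It is not the procedure used to produce the numbers on the slides: the sample message is made up, the tokenizer and the pronoun list are assumptions, counting any token with an apostrophe as a contraction is a crude stand-in, and the names `parse_message`, `tokenize`, and `corpus_stats` are hypothetical. Only the standard library (`email`, `re`, `collections`) is used.

```python
import re
from collections import Counter
from email import message_from_string

# Made-up message in the header layout the slides describe; not taken from the corpus.
SAMPLE = """\
Message-ID: <12345.example@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@enron.com
To: bob@enron.com
Subject: Meeting
Content-Type: text/plain; charset=us-ascii

We'll meet at noon. I will send the report to you, and you can send it on.
"""

# Hypothetical, deliberately small pronoun list; the slides do not say which list was used.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def parse_message(raw):
    """Return (selected headers, body text) for one raw message string."""
    msg = message_from_string(raw)
    headers = {name: msg[name]
               for name in ("Message-ID", "Date", "From", "To", "Subject")}
    # The character set is read from the Content-Type header.
    headers["charset"] = msg.get_content_charset()
    return headers, msg.get_payload()


def tokenize(text):
    """Lowercase, then split into runs of letters, digits and apostrophes (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def corpus_stats(bodies):
    """Compute simple lexical statistics over an iterable of message bodies."""
    counts = Counter(tok for body in bodies for tok in tokenize(body))
    tokens = sum(counts.values())
    if tokens == 0:
        raise ValueError("no tokens found")
    types = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    pronouns = sum(c for tok, c in counts.items() if tok in PRONOUNS)
    # Crude stand-in: any token containing an apostrophe counts as a contraction.
    contractions = sum(c for tok, c in counts.items() if "'" in tok)
    return {
        "tokens": tokens,
        "types": types,
        "hapax": hapax,
        "token_type_ratio": tokens / types,
        "hapax_pct_of_types": 100.0 * hapax / types,
        "hapax_pct_of_tokens": 100.0 * hapax / tokens,
        "avg_word_length": sum(len(t) * c for t, c in counts.items()) / tokens,
        "pronoun_pct": 100.0 * pronouns / tokens,
        "contraction_pct": 100.0 * contractions / tokens,
    }


if __name__ == "__main__":
    headers, body = parse_message(SAMPLE)
    print(headers["From"], headers["charset"])
    for name, value in corpus_stats([body]).items():
        print(name, value)
```

To run this over the real corpus, one would walk the per-employee folders and pass each file's text to `parse_message`. Skipping the duplicate-heavy folders named earlier (_sent_mail, all_documents, discussion_threads) is one possible way to avoid double counting, though the slides do not say whether that was done.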