Testing classification of SET-based documents using JGAAP

Social Engineering Toolkit Tests

Setup:
- I had to download the Python 'pefile' module and run the 'python setup.py install' command in a terminal, from within the folder extracted from the download.
- I had to install 'rar' via apt-get in order to use the .rar file extension in SET.

Social-Engineering Attacks

Spear-Phishing (involves deception, masquerading as a trusted entity): you can create a template (option 3), use your own payload (option 2), or use a wizard-style interface (option 1), which takes you through the process step by step.

Option 1 - wizard-style interface (testing with DLL hijacking (no. 1)):
o A list of exploits to choose from is shown.
o Testing with Wireshark within the Kali Linux virtual machine, I was able to ping the VM from the Windows host without issue. [Screenshots: Windows host; Kali Linux VM]
o An IP address to send the payload to is entered.
o An encoding for the payload is then selected (I selected no. 16).
o You can also select a port to send the data to, on the host with the previously selected network interface/IP address. I left the default option set (HTTPS - port 443).
o Finally, you are able to use the vulnerability/exploit selected at the very beginning: I chose Microsoft Powerpoint 2010 (option 8).
o You can then name the file (default: 'openthis') and choose its extension. 'Rar' was unavailable, so the application defaulted to a '.zip' file.
o You are then able to send the file out to a single email address or to multiple addresses (the selected option was phishing, after all). I selected a single email.
o I then selected option 1, to use a pre-defined template (no. 6), and tested it using a fake Gmail account: setest901@gmail.com, password setest019. I chose option 1.
o My avast antivirus installation caught the payload… :( At least it's working properly.
o After a retry, Gmail raised an exception stating that the email attachment/payload had been blocked, since its content presented "a potential security issue", which was correct.
To get around this, I had to use payload option 2 (Custom Written Document). I renamed the file to 'test.xc', and it was delivered via my normal Gmail account (post2base@gmail.com) to setest901@gmail.com. [Screenshots: the delivered message; the email itself]

I then parsed the text from the email (associated with the file 'test.xc') with JGAAP and its built-in Stanford POS tagger. I also created four other payloads, using different email templates. In total, text from the emails of these options was used:
- No. 9
- No. 6
- No. 3
- No. 5
- No. 7

Text from the emails of options 3, 6 and 9 was associated with author names taken from each email's subject line, whereas the email text of options 5 and 7 was marked as unknown.

Using a Linear SVM or a Gaussian (radial basis kernel) SVM with the Stanford Part of Speech Tagger resulted in:

--------------------------------------------------------
test_xc.txt C:\Users\Luther\Desktop\SET data\test_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0

test_xc.txt C:\Users\Luther\Desktop\SET data\test_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Order Confirmation 1.0
--------------------------------------------------------
The first file most closely matched the author (folder) containing that file, so this made sense.
--------------------------------------------------------
test2_xc.txt C:\Users\Luther\Desktop\SET data\test2_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. WOAAAA!!!!!!!!!! This is crazy... 2.0

test2_xc.txt C:\Users\Luther\Desktop\SET data\test2_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. WOAAAA!!!!!!!!!! This is crazy... 2.0
--------------------------------------------------------
The second file most closely matched the author (folder) containing that file, so this made sense.
--------------------------------------------------------
test3_xc.txt C:\Users\Luther\Desktop\SET data\test3_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0

test3_xc.txt C:\Users\Luther\Desktop\SET data\test3_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------
The third file most closely matched the author (folder) containing that file, so this made sense.
--------------------------------------------------------
test4_xc.txt C:\Users\Luther\Desktop\SET data\test4_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0

test4_xc.txt C:\Users\Luther\Desktop\SET data\test4_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------
The fourth file, which was marked as unknown, most closely matched the third author (Dan Brown's Angels & Demons). Why? More tests needed?
--------------------------------------------------------
test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0

test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------
The fifth file, also marked as unknown, most closely matched the third author (Dan Brown's Angels & Demons). Why? More tests needed?
--------------------------------------------------------

Since the fourth and fifth files were both matched with the third author (Dan Brown's Angels & Demons), does this mean they are similar to each other? To check, I associated the fourth file with an author named after its email's subject line, and then tested using the fifth file as the only unknown text file. The rest of the conditions were the same. Results:

--------------------------------------------------------
test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Status Report 4.0

test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Status Report 4.0
--------------------------------------------------------
My hypothesis was correct: once the fourth file had its own author ('Status Report'), the fifth file matched that author most closely, under both SVMs.
Spear Phishing - Option 3:

My next test was to compare text from an unrelated source, but from the same field (deception), with the SET-generated email texts, to see whether it was considered similar to the set of email texts used in the previous test. This was carried out using two techniques, each with the Stanford Part of Speech Tagger:
- A Gaussian SVM
- A Linear SVM

I selected the text from the Brennan-Greenstadt Adversarial Corpus (source: https://psal.cs.drexel.edu/index.php/Main_Page)

--------------------------------------------------------------------------------
Citations:
Michael Brennan, Sadia Afroz, and Rachel Greenstadt. 2012. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur. 15, 3, Article 12 (November 2012).
Michael Brennan and Rachel Greenstadt. 2009. Practical Attacks Against Authorship Recognition Techniques. In Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, California, July 2009.
--------------------------------------------------------------------------------

I used the texts that the corpus classifies as obfuscation attacks, and compared them to the SET-generated attacks. Every run reported the same settings, so they are listed once, followed by the top-ranked author for each document:

Files: D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET data\attacks_obfuscation\<document>
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim

Document             Analysis: Linear SVM                  Analysis: Gaussian SVM
a_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
b_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
c_obfuscation.txt    1. Dan Brown's Angels & Demons 3.0    1. Baby Pics 5.0
d_obfuscation.txt    1. Dan Brown's Angels & Demons 3.0    1. Baby Pics 5.0
e_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
f_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
g_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
h_obfuscation.txt    1. Dan Brown's Angels & Demons 3.0    1. Baby Pics 5.0
k_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
m_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
p_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0
s_obfuscation.txt    1. Order Confirmation 1.0             1. Baby Pics 5.0

(The author label 'Order Confirmation' is printed as 'Order Confirmati?on' in the raw JGAAP output for these runs.)

With a Linear SVM, documents c, d and h were classified as matching the 3rd author, 'Dan Brown's Angels & Demons', and the other nine documents were classified as 'Order Confirmation'. With a Gaussian SVM, all documents were classified as matching 'Baby Pics' - author 5.

These were the results for the imitation corpus:

Files: C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET data\attacks_imitation\<document>
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim

Document             Analysis: Gaussian SVM    Analysis: Linear SVM
a_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
b_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
c_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
d_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
e_imitation.txt      1. Baby Pics 5.0          1. Order Confirmation 1.0
f_imitation.txt      1. Baby Pics 5.0          1. Order Confirmation 1.0
g_imitation.txt      1. Baby Pics 5.0          1. Order Confirmation 1.0
h_imitation.txt      1. Baby Pics 5.0          1. Order Confirmation 1.0
k_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
m_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
p_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0
s_imitation.txt      1. Baby Pics 5.0          1. Dan Brown's Angels & Demons 3.0

When parsing with the Stanford POS tagger, all imitation documents matched the 5th author, 'Baby Pics', under a Gaussian SVM. Under a Linear SVM, documents e, f, g and h matched 'Order Confirmation', and the other eight imitation documents matched the 3rd author, 'Dan Brown's Angels & Demons'.

Overall, with a Gaussian SVM the only author matched was no. 5, 'Baby Pics', for both the obfuscation and the imitation documents. So a Gaussian SVM gives no way to determine whether the chosen SET documents are more likely to be obfuscating their text than imitating some other text, or vice versa.
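The per-author counts behind these summaries can be checked with a short tally. The sketch below is only an illustration: the top-ranked authors are transcribed by hand from the JGAAP output above (where f_imitation is also listed under 'Order Confirmation' for the Linear SVM), and the names DOCS, LINEAR, GAUSSIAN, tally and shares are made up for this example rather than taken from JGAAP.

```python
from collections import Counter

# Document prefixes used in both the obfuscation and imitation sets.
DOCS = "abcdefghkmps"

# Top-ranked author per document, transcribed by hand from the JGAAP output.
LINEAR = {
    "obfuscation": {d: "Order Confirmation" for d in DOCS},
    "imitation": {d: "Dan Brown's Angels & Demons" for d in DOCS},
}
for d in "cdh":
    LINEAR["obfuscation"][d] = "Dan Brown's Angels & Demons"
for d in "efgh":
    LINEAR["imitation"][d] = "Order Confirmation"

# The Gaussian SVM ranked 'Baby Pics' first for every document.
GAUSSIAN = {kind: {d: "Baby Pics" for d in DOCS} for kind in LINEAR}


def tally(results):
    """Return {attack type: Counter(author -> number of documents)}."""
    return {kind: Counter(by_doc.values()) for kind, by_doc in results.items()}


def shares(counter):
    """Convert document counts into fractions of the whole set."""
    total = sum(counter.values())
    return {author: count / total for author, count in counter.items()}


if __name__ == "__main__":
    for name, results in (("Linear SVM", LINEAR), ("Gaussian SVM", GAUSSIAN)):
        for kind, counts in tally(results).items():
            for author, share in shares(counts).items():
                print(f"{name:12} {kind:11} {author:28} {counts[author]:2d} {share:.0%}")
```

Keeping the transcription in one place makes it easy to re-run the tally if more documents or another analysis method are added later.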
In the case of the Linear SVM:
- Obfuscation:
  o c, d and h (3 of 12) were classified as matching the 3rd author, 'Dan Brown's Angels & Demons'
  o All other documents (9 of 12) were classified as matching 'Order Confirmation'
- Imitation:
  o e, f, g and h (4 of 12) matched 'Order Confirmation'
  o All others (8 of 12) matched 'Dan Brown's Angels & Demons'

As a hypothesis: since 75% of documents matched 'Order Confirmation' when obfuscating, you could research features of obfuscation, determine which are in use in the 'Order Confirmation' document, and then classify other similar SET documents using that feature set.
o The same could be applied to the imitation side of things, as 8 of the 12 documents (about 67%) were classified as closest to 'Dan Brown's Angels & Demons'.

The other funny thing is that the two authors swap roles: 'Dan Brown's Angels & Demons' (3rd) holds the remaining quarter of the classifications for obfuscation, and 'Order Confirmation' (1st) holds the remaining third for imitation.
o This means that, if creating a distance measure, a lower weighting could be placed on features relating to obfuscation in the first case (3rd author), and imitation in the second (1st author).
o Maybe roughly 75% of features could be used to detect obfuscation and 25% imitation for invoice/receipt/acknowledgement style documents, and roughly the reverse for imitation/persuasion style documents.

Create a test to determine whether SET documents have more of a focus on imitation or obfuscation.
o State the features which determined this
o Maybe research distance measures
  * Start with a test of my own emails against SET - ground truth
  * Create a distance measure with this
  * In future, ask people to create emails for testing which are in a similar vein to the SET/Kali Linux generated ones
  * Compare results to the distance measure
  * See which corpora, when applied to the distance measure, produce the best results over time
  * If approximately half of the corpora are suited to one distance measure and the other half to another, maybe try combining the two and adjusting the weightings of metrics/features, to make sure a correct result is obtained for all classifications of corpora, within a certain threshold

JGAAP terminology (Source: http://evllabs.com/jgaap/w/index.php/JGAAP#Events)

Canonicizer - A preprocessor for raw document text: something that puts the raw text of a document into canonical form (e.g. changing characters to lowercase, removing punctuation).

Events - Singletons generated from the text of the document, used to represent it to the learning algorithm (e.g. Words, Characters, Parts of Speech, Word BiGrams).

Cullers - Operate on the events resulting from processed documents. They remove events from those generated, creating a set that meets the culler's stipulations. E.g.
- Reduce to only the 50 most common events
- Reduce to the 100 least common events
- Reduce to only events that appear in every document

Analysis - Analysis methods use the event-based representations of documents to create models of their authors. They then use these models to evaluate the most likely author of unknown documents from the pool of known authors. E.g. Support Vector Machines, or Nearest Neighbour.
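To make the four stages concrete, here is a toy pipeline in plain Python. It is a sketch of the concepts only and does not use or reproduce JGAAP's implementation. Every function name is hypothetical, word bigrams stand in for part-of-speech events, each author is modelled by a single document, and the weighted mix of Manhattan and cosine distance is just one assumed way of "combining two distance measures" as suggested in the plan above.

```python
import math
import re
from collections import Counter


def canonicize(text):
    """Canonicizer: lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_bigrams(text):
    """Event driver: word bigrams, in the order they occur in the document."""
    words = text.split()
    return list(zip(words, words[1:]))


def cull_most_common(event_lists, n):
    """Event culler: keep only the n most common events across all documents."""
    totals = Counter(e for events in event_lists for e in events)
    keep = {e for e, _ in totals.most_common(n)}
    return [[e for e in events if e in keep] for events in event_lists]


def histogram(events):
    """Relative frequency of each event within one document."""
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return {e: c / total for e, c in counts.items()}


def manhattan(h1, h2):
    return sum(abs(h1.get(e, 0.0) - h2.get(e, 0.0)) for e in set(h1) | set(h2))


def cosine_distance(h1, h2):
    dot = sum(v * h2.get(e, 0.0) for e, v in h1.items())
    norm = math.sqrt(sum(v * v for v in h1.values())) * math.sqrt(
        sum(v * v for v in h2.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def combined_distance(h1, h2, weight=0.5):
    """Weighted mix of two distance measures; weight=1.0 is pure Manhattan."""
    return weight * manhattan(h1, h2) + (1.0 - weight) * cosine_distance(h1, h2)


def rank_authors(known, unknown_text, n_events=50, weight=0.5):
    """Analysis (nearest neighbour): list (author, distance), closest first.

    known is a list of (author, text) pairs, one text per author.
    """
    texts = [text for _, text in known] + [unknown_text]
    event_lists = cull_most_common(
        [word_bigrams(canonicize(t)) for t in texts], n_events
    )
    *known_hists, target = [histogram(events) for events in event_lists]
    scored = [
        (author, combined_distance(h, target, weight))
        for (author, _), h in zip(known, known_hists)
    ]
    return sorted(scored, key=lambda pair: pair[1])
```

Calling rank_authors with a list of (author, text) pairs and one unknown text returns the authors ordered from closest to furthest, which mirrors the ranked "1. <author>" lines in the JGAAP output. Changing weight shifts the balance between the two distance measures.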
JGAAP code

- An API is available, along with a command line interface (a Java class for each)
- Javadocs:
  o Instructions for using the JGAAP API:
    * First add documents, both known and unknown (via addDocument)
    * All other settings can be applied in any order. These are:
      - setLanguage
      - addCanonicizer
      - addEventDriver - used to generate a List of Events ordered in the sequence they are found in the document
      - addEventCuller
      - addAnalysisDriver
      - addDistanceFunction
    * Note: of these settings, only one EventDriver and one AnalysisDriver are required to run an experiment
  o The execute method is then used to start the experiment
  o Results are placed in the unknown documents
    * To access them, use the getUnknownDocuments method in the API
  o The results can be retrieved as a List<Pair<String, Double>> using the getRawResult method
    * This is a list sorted from most likely to least likely author, each followed by a score generated based on your settings
    * You can also get a Map of Maps of the raw results with the getRawResults method
      - Type is: Map<EventDriver, Map<AnalysisDriver, List<Pair<String, Double>>>>
  o They can also be retrieved as a string using either the getFormattedResult or getResult methods
  o For examples of how to use the API class, see the com.jgaap.ui package for a GUI example or the com.jgaap.backend.CLI class for a command line example

JGAAP Command Line Interface

Done using the JGAAP Experiment Engine (Source: http://evllabs.com/jgaap/w/index.php/Experiment_Engine)

The Experiment Engine is accessible through JGAAP's command line interface using the -ee flag and pointing it to a run file, i.e.

    java -jar jgaap.jar -ee ../experiment.run

A run file is a CSV file where the top row is a title for the experiment and each subsequent row is an experiment to be run.

A 'Corpus File' contains information on the authors and titles of documents that can be used in JGAAP to allow for the quick reproduction of experiments.
Corpus files can be generated within JGAAP after loading documents, via the menu option "File>Batch Documents>Save Documents".

Standard:

    Author,/path/to/file.txt,Title
    Author,/path/to/other/file.txt,Title
    ,/path/to/unknown/file.txt,Title

Example:

    A,/home/ryan/example/sampleA-01.txt
    A,/home/ryan/example/sampleA-02.txt
    B,/home/ryan/example/sampleB-01.txt
    B,/home/ryan/example/sampleB-02.txt
    C,/home/ryan/example/sampleC-01.txt
    C,/home/ryan/example/sampleC-02.txt
    ,/home/ryan/example/test-01.txt
    ,/home/ryan/example/test-02.txt

Titles are optional. If you don't assign one, a default title will be assigned. A row with an empty author field marks an unknown document.
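A corpus file in this layout can also be produced or inspected with a few lines of Python. This sketch is not part of JGAAP. The helper names write_corpus and read_corpus are hypothetical, and the use of Python's csv module assumes JGAAP reads plain comma-separated rows without special quoting, which has not been verified here.

```python
import csv
import io


def write_corpus(rows, handle):
    """Write (author, path, title) rows; an empty author marks an unknown document.

    The title column is left out when no title is given, since titles are optional.
    """
    writer = csv.writer(handle, lineterminator="\n")
    for author, path, title in rows:
        writer.writerow([author, path, title] if title else [author, path])


def read_corpus(handle):
    """Split corpus rows into (known, unknown) lists of (author, path, title)."""
    known, unknown = [], []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        author, path = row[0].strip(), row[1].strip()
        title = row[2].strip() if len(row) > 2 else ""
        (known if author else unknown).append((author, path, title))
    return known, unknown


if __name__ == "__main__":
    # Paths taken from the example corpus file shown above.
    rows = [
        ("A", "/home/ryan/example/sampleA-01.txt", ""),
        ("B", "/home/ryan/example/sampleB-01.txt", ""),
        ("", "/home/ryan/example/test-01.txt", ""),
    ]
    buffer = io.StringIO()
    write_corpus(rows, buffer)
    print(buffer.getvalue(), end="")
```

Reading the file back with read_corpus gives a quick check that every known document has an author and that the unknown documents are the ones intended for classification.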
