Testing classification of SET-based documents using JGAAP

Social Engineering Toolkit Tests


I had to download the Python ‘pefile’ module and run the ‘python setup.py install’ command in a terminal,
from within the folder extracted from the download.
I also had to install ‘rar’ via apt-get, in order to use the .rar file extension in SET.
Social-Engineering Attacks
Spear-Phishing (involves deception, masquerading as a trusted entity):
You can create a template (option 3), use your own payload (option 2), or use a wizard-style
interface (option 1), which takes you through the process step by step.

Option 1 - wizard-style interface (testing with DLL hijacking, no. 1):
o A list of exploits to choose from is shown.
o Testing with Wireshark within the Kali Linux virtual machine, I was able to ping the VM from
the Windows host without issue (screenshots: Windows host, Kali Linux VM).
o An IP address to send the payload to is entered.
o An encoding for the payload is then selected (I selected no. 16).
o You can also select a port to send the data to, on the host with the previously
selected network interface/IP address. I left the default option set (HTTPS, port 443).
o Finally, you are able to use the vulnerability/exploit selected at the very
beginning: I chose Microsoft PowerPoint 2010 (option 8). You can then name the file (default:
‘openthis’) and choose its extension.
‘Rar’ was unavailable, so the application defaulted to a ‘.zip’ file.
o You can then send the file to a single email address or to multiple addresses (the
selected option was phishing, after all). I selected a single email.
o I then selected option 1, to use a pre-defined template (no. 6), and tested it using a
fake Gmail account: setest901@gmail.com (password: setest019)
o I chose option 1.
My Avast antivirus installation caught the payload… :( At least it’s working properly.
After a retry, a Gmail exception was raised, stating that the email
attachment/payload had been blocked because its content presented “a potential security
issue”, which was correct.
To get around this, I had to use payload option 2 (Custom Written Document).
I renamed the file to ‘test.xc’, and it was delivered via my normal Gmail account
(post2base@gmail.com) to setest901@gmail.com, as shown below:
The email itself:
I then parsed the text from the email (associated with the file ‘test.xc’) with JGAAP and its
inbuilt Stanford POS tagger. I also created four other payloads, using different email
templates. In total, text from the emails of these template options was used:
o No. 9
o No. 6
o No. 3
o No. 5
o No. 7
The email texts of options 3, 6 and 9 were each assigned an author name taken from the email’s
subject line, whereas the email texts of options 5 and 7 were marked as unknown.
Using a Linear SVM or a Gaussian (radial basis function kernel) SVM with the Stanford Part of Speech Tagger
resulted in:
--------------------------------------------------------
test_xc.txt C:\Users\Luther\Desktop\SET data\test_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
test_xc.txt C:\Users\Luther\Desktop\SET data\test_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Order Confirmation 1.0
--------------------------------------------------------
The first file was attributed to the author (folder) containing that file, as expected.
--------------------------------------------------------
test2_xc.txt C:\Users\Luther\Desktop\SET data\test2_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. WOAAAA!!!!!!!!!! This is crazy... 2.0
test2_xc.txt C:\Users\Luther\Desktop\SET data\test2_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. WOAAAA!!!!!!!!!! This is crazy... 2.0
--------------------------------------------------------
The second file was attributed to the author (folder) containing that file, as expected.
--------------------------------------------------------
test3_xc.txt C:\Users\Luther\Desktop\SET data\test3_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
test3_xc.txt C:\Users\Luther\Desktop\SET data\test3_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------
The third file was attributed to the author (folder) containing that file, as expected.
--------------------------------------------------------
test4_xc.txt C:\Users\Luther\Desktop\SET data\test4_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
test4_xc.txt C:\Users\Luther\Desktop\SET data\test4_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------
The fourth file was attributed to the third author (Dan Brown's Angels & Demons). Why? More
tests needed?
--------------------------------------------------------
test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------
The fifth file was also attributed to the third author (Dan Brown's Angels & Demons). Why? More
tests needed?
--------------------------------------------------------
Since the fourth and fifth files were both attributed to the third author (Dan Brown's Angels &
Demons), does this mean they are similar to each other?
To check, I assigned the fourth file an author name equal to its email’s subject line, and then tested
with the fifth file as the only unknown text. The rest of the conditions were the same.
Results:
--------------------------------------------------------
test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Status Report 4.0
test5_xc.txt C:\Users\Luther\Desktop\SET data\test5_xc.txt
Canonicizers: Normalize ASCII, Normalize Whitespace
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Status Report 4.0
---------------------------------------------------------
This is consistent with my hypothesis: the fifth file was attributed to the fourth file's author (‘Status Report’).

Spear Phishing – Option 3:
My next test was to compare text from an unrelated source in the same field (deception) with these SET
generated email texts, to see whether it would be considered similar to the set of email texts used in
the previous test. This was carried out using two techniques:
o A Gaussian SVM
o A Linear SVM
Each was used with the Stanford Part of Speech Tagger.
I selected the text from the Brennan-Greenstadt Adversarial Corpus
(source: https://psal.cs.drexel.edu/index.php/Main_Page)
--------------------------------------------------------------------------------
Citations:
Michael Brennan, Sadia Afroz, and Rachel Greenstadt. 2012. Adversarial stylometry: Circumventing authorship recognition to preserve
privacy and anonymity. ACM Trans. Inf. Syst. Secur. 15, 3, Article 12 (November 2012).
Michael Brennan and Rachel Greenstadt. 2009. Practical Attacks Against Authorship Recognition Techniques. In Proceedings of the Twenty-First
Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, California, July 2009.
--------------------------------------------------------------------------------
I used the texts that the corpus classifies as obfuscation attacks, and compared them to the SET
generated emails. These were the results:
a_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\a_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
a_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\a_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
b_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\b_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
b_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\b_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
c_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\c_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
c_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\c_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
d_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\d_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
d_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\d_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
e_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\e_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
e_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\e_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
f_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\f_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
f_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\f_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
g_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\g_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
g_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\g_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
h_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\h_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
h_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\h_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
k_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\k_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
k_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\k_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
m_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\m_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
m_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\m_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
p_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\p_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
p_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\p_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
s_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\s_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
s_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\s_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
With a Linear SVM, documents c, d and h were attributed to the 3rd author, ‘Dan Brown's Angels &
Demons’, and the other nine documents to ‘Order Confirmation’.
With a Gaussian SVM, all documents were attributed to ‘Baby Pics’ (author 5).
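Reading the top-ranked author off each result block by hand is error-prone, so a small parser can help. The following is a minimal sketch, assuming the results are saved as plain text in the layout shown above (a line starting with the document name, an `Analysis:` line, then a `1.` line with the top-ranked author and its score). The function name `parse_results` and the regular expressions are my own and are not part of JGAAP.

```python
import re

# Line starting with a bare file name such as "c_obfuscation.txt", optionally
# preceded by a run of separator dashes, and followed by the file's path.
DOC_LINE = re.compile(r"^-*([^\s\\/:]+\.txt)\s+\S")
ANALYSIS_LINE = re.compile(r"^Analysis:\s*(.+?)\s*$")
# Top-ranked author line, e.g. "1. Baby Pics 5.0".
TOP_RESULT_LINE = re.compile(r"^1\.\s+(.+?)\s+(\d+(?:\.\d+)?)\s*$")


def parse_results(text):
    """Return a list of (document, analysis, top_author) tuples."""
    results = []
    doc = analysis = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = DOC_LINE.match(line)
        if match:
            doc = match.group(1)
            continue
        match = ANALYSIS_LINE.match(line)
        if match:
            analysis = match.group(1)
            continue
        match = TOP_RESULT_LINE.match(line)
        if match and doc and analysis:
            results.append((doc, analysis, match.group(1)))
    return results


# Two result blocks copied from the obfuscation run above.
SAMPLE = r"""
c_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\c_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
c_obfuscation.txt D:\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_obfuscation\c_obfuscation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
"""

for row in parse_results(SAMPLE):
    print(row)
```

The resulting tuples can be grouped by analysis method and counted (for example with `collections.Counter`) to check the per-author totals quoted in the text.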
These were the results for the imitation corpus:
--------------------------------------------------------------------------------------------------------------------------------------------------
a_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\a_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
a_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\a_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
b_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\b_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
b_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\b_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
c_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\c_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
c_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\c_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
d_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\d_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
d_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\d_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
e_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\e_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
e_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\e_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
--------------------------------------------------------------------------------------------------------------------------------------------------
f_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\f_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
f_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\f_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
--------------------------------------------------------------------------------------------------------------------------------------------------
g_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\g_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
g_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\g_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
--------------------------------------------------------------------------------------------------------------------------------------------------
h_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\h_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
h_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\h_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Order Confirmation 1.0
--------------------------------------------------------------------------------------------------------------------------------------------------
k_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\k_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
k_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\k_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
m_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\m_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
m_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\m_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
p_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\p_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
p_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\p_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
--------------------------------------------------------------------------------------------------------------------------------------------------
s_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\s_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Gaussian SVM
1. Baby Pics 5.0
s_imitation.txt C:\SD\Dropbox\Coding\3rd Yr Project\Project Docs\SET
data\attacks_imitation\s_imitation.txt
Canonicizers: none
EventDriver: Stanford Part of Speech tagginmodel : english-bidirectional-distsim
Analysis: Linear SVM
1. Dan Brown's Angels & Demons 3.0
With the Stanford POS tagger and a Gaussian SVM, all imitation documents were attributed to the 5th
author, ‘Baby Pics’. With a Linear SVM, imitation documents e, f, g and h were attributed to
‘Order Confirmation’ (f_imitation.txt is listed under ‘Order Confirmation’ in the results above),
and the remaining eight were attributed to the 3rd author, ‘Dan Brown’s Angels & Demons’.
Overall, with a Gaussian SVM, every document was attributed to author no. 5, ‘Baby Pics’, whether it was obfuscating or imitating.
A Gaussian SVM therefore gives no way to determine whether the chosen documents from the SET are more likely to be
obfuscating their text than imitating some other text, or vice versa.
In the case of the Linear SVM:
Obfuscation:
o c, d and h (3 of 12) were attributed to the 3rd author, ‘Dan Brown's Angels & Demons’
o All other documents (9 of 12) were attributed to ‘Order Confirmation’
Imitation:
o e, f, g and h (4 of 12) were attributed to ‘Order Confirmation’
o All others (8 of 12) were attributed to ‘Dan Brown’s Angels & Demons’
As a hypothesis, since 75% of documents matched ‘Order Confirmation’ when obfuscating, you
could research features of obfuscation, determine which are in use in the ‘Order Confirmation’ document, and then
classify other similar SET documents using that feature set.
o The same could be applied to the imitation side of things, as about two-thirds of documents (8 of 12) were classified
as closest to ‘Dan Brown’s Angels & Demons’.
The other notable thing is that each of the two authors, ‘Dan Brown’s Angels & Demons’ (3rd) and ‘Order Confirmation’ (1st),
received the remaining minority of classifications in the other corpus: a quarter for obfuscation and a third for imitation, respectively.
o This means that, if creating a distance measure, a lower weighting could be placed on features relating to
obfuscation in the first case (3rd author), and imitation in the second (1st author).
o Maybe roughly 75% of features could be used to detect obfuscation and 25% imitation for
invoice/receipt/acknowledgement style documents, and vice versa for imitation/persuasion style documents.
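The proportions discussed above can be checked by tallying the top-ranked author per document. The sketch below does this for the Linear SVM runs; the labels are transcribed by hand from the result blocks earlier in this document, while the variable and function names are my own.

```python
from collections import Counter

ORDER = "Order Confirmation"
DAN = "Dan Brown's Angels & Demons"

# Top-ranked author per document under the Linear SVM, transcribed from the
# obfuscation and imitation result listings above.
linear_svm = {
    "obfuscation": {
        "a": ORDER, "b": ORDER, "c": DAN, "d": DAN, "e": ORDER, "f": ORDER,
        "g": ORDER, "h": DAN, "k": ORDER, "m": ORDER, "p": ORDER, "s": ORDER,
    },
    "imitation": {
        "a": DAN, "b": DAN, "c": DAN, "d": DAN, "e": ORDER, "f": ORDER,
        "g": ORDER, "h": ORDER, "k": DAN, "m": DAN, "p": DAN, "s": DAN,
    },
}


def shares(labels):
    """Map each author to the fraction of documents attributed to them."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {author: count / total for author, count in counts.items()}


for corpus, labels in linear_svm.items():
    for author, share in sorted(shares(labels).items()):
        print(f"{corpus:12s} {author:30s} {share:.0%}")
```

With only twelve documents per corpus, these shares are coarse, so any feature weighting derived from them should be treated as a starting point rather than a measured value.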
Create a test to determine whether SET documents have more of a focus on imitation or
obfuscation.
o State the features which determined this
o Maybe research distance measures
▪ Start with a test of my own emails against SET (ground truth)
▪ Create a distance measure with this
▪ In future, ask people to create emails for testing which are in a similar vein to
the SET/Kali Linux generated ones.
▪ Compare the results to the distance measure.
▪ See which corpora, when applied to the distance measure, produce the best
results over time
▪ If approx. half of the corpora are suited to one distance measure, and
approx. the other half to another distance measure, maybe try
combining the two and adjusting the weightings of metrics/features, to
make sure a correct result is obtained for all classifications of
corpora, within a certain threshold.
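The idea of combining two distance measures with adjustable weightings could be prototyped as follows. This is only a sketch under my own assumptions: documents are represented as relative frequencies of events (for example POS tags), the two measures are Manhattan and cosine distance, and the single `weight` parameter is hypothetical. None of this is taken from JGAAP's own distance functions.

```python
import math
from collections import Counter


def relative_frequencies(events):
    """Turn a sequence of events into a dict of relative frequencies."""
    counts = Counter(events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def manhattan_distance(p, q):
    """Sum of absolute differences; lies in [0, 2] for two distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def cosine_distance(p, q):
    """1 minus cosine similarity; lies in [0, 1] for non-negative vectors."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def combined_distance(p, q, weight=0.5):
    """Weighted mix of the two measures, each scaled to the range [0, 1]."""
    scaled_manhattan = manhattan_distance(p, q) / 2.0
    return weight * scaled_manhattan + (1.0 - weight) * cosine_distance(p, q)


# Hypothetical POS tag sequences standing in for two tagged emails.
email_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
email_b = ["DT", "NN", "VBD", "IN", "DT", "NN"]
p, q = relative_frequencies(email_a), relative_frequencies(email_b)
print(round(combined_distance(p, q, weight=0.75), 3))
```

The weight could then be tuned against ground-truth emails, keeping whichever value classifies the most corpora correctly within the chosen threshold.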
JGAAP terminology
(Source: http://evllabs.com/jgaap/w/index.php/JGAAP#Events)
Canonicizer – A preprocessor for raw document text: something that puts the raw text of a
document into canonical form (e.g. changing characters to lowercase, removing punctuation).
Events – Singletons generated from the text of the document, used to represent it to the learning
algorithm (e.g. Words, Characters, Parts of Speech, Word BiGrams).
Cullers – Operate on the events resulting from processed documents. They remove events from
those generated, leaving a set that meets the culler's stipulations.
E.g.
o Reduce to only the 50 most common events
o Reduce to the 100 least common events
o Reduce to only events that appear in every document.
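To make the three culling examples concrete, here is a minimal sketch in Python. It assumes each document is simply a list of events; the function names are my own and do not correspond to JGAAP's culler classes.

```python
from collections import Counter


def _totals(documents):
    """Count every event across all documents."""
    return Counter(event for doc in documents for event in doc)


def most_common_events(documents, n):
    """Keep only the n most frequent events across all documents."""
    keep = {event for event, _ in _totals(documents).most_common(n)}
    return [[e for e in doc if e in keep] for doc in documents]


def least_common_events(documents, n):
    """Keep only the n least frequent events across all documents."""
    ranked = _totals(documents).most_common()
    keep = {event for event, _ in ranked[:-n - 1:-1]}
    return [[e for e in doc if e in keep] for doc in documents]


def events_in_every_document(documents):
    """Keep only events that occur at least once in every document."""
    if not documents:
        return []
    common = set(documents[0]).intersection(*map(set, documents[1:]))
    return [[e for e in doc if e in common] for doc in documents]


docs = [["a", "b", "a", "c"], ["a", "b", "d"]]
print(most_common_events(docs, 2))
print(least_common_events(docs, 2))
print(events_in_every_document(docs))
```

Note that ties between equally frequent events are broken here by first appearance, which may differ from how a real culler resolves them.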
Analysis – Analysis methods use the event-based representations of documents to create models of
their authors. They then use these models to evaluate the most likely author of unknown documents
from the pool of authors they know about.
E.g. Support Vector Machines, or Nearest Neighbour.
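The way these stages fit together can be shown with a toy pipeline: canonicize the text, generate events, then run a nearest-neighbour style analysis. This is an illustrative sketch only. It uses word bigrams and a Manhattan distance of my own choosing, with made-up sample texts, and it does not reproduce JGAAP's implementation or the SVM analyses used in the tests above.

```python
import re
import string
from collections import Counter


def canonicize(text):
    """Lowercase, strip ASCII punctuation and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def word_bigrams(text):
    """Event generation: consecutive word pairs, in document order."""
    words = text.split()
    return list(zip(words, words[1:]))


def histogram(events):
    """Relative frequency of each event (empty input gives an empty dict)."""
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return {event: n / total for event, n in counts.items()}


def manhattan(p, q):
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def rank_authors(known, unknown_text):
    """Rank authors by distance to the unknown text, nearest first.

    known maps an author name to a list of that author's texts.
    """
    target = histogram(word_bigrams(canonicize(unknown_text)))
    ranked = []
    for author, texts in known.items():
        events = [e for t in texts for e in word_bigrams(canonicize(t))]
        ranked.append((author, manhattan(histogram(events), target)))
    return sorted(ranked, key=lambda pair: pair[1])


# Made-up sample texts, purely for illustration.
known = {
    "A": ["The cat sat on the mat.", "The cat ate the fish."],
    "B": ["Stocks fell sharply on Monday.", "Markets fell on Friday."],
}
print(rank_authors(known, "The cat sat on the fish!"))
```

The sorted list of (author, score) pairs mirrors the shape of the ranked results JGAAP reports, although here a lower score means a closer match.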
JGAAP code
An API is available, along with a command line interface (a Java class for each).
Javadocs:
o Instructions for using the JGAAP API:
▪ First add documents, both known and unknown (via addDocument)
▪ All other settings can then be applied in any order:
   ▪ setLanguage
   ▪ addCanonicizer
   ▪ addEventDriver (used to generate a List of Events ordered in the
sequence they are found in the document)
   ▪ addEventCuller
   ▪ addAnalysisDriver
   ▪ addDistanceFunction
▪ Note: of these settings, only one EventDriver and one AnalysisDriver are
required to run an experiment
o The execute method is then used to start the experiment
o Results are placed in the unknown documents
▪ To access them, use the getUnknownDocuments method in the API
o The results can be retrieved as a List<Pair<String, Double>> using the getRawResult method
▪ This is a list of authors sorted from most likely to least likely, each followed by a score
generated based on your settings
▪ You can also get a Map of Maps of the raw results with the getRawResults
method
   ▪ Type is: Map<EventDriver, Map<AnalysisDriver, List<Pair<String,
Double>>>>
o They can also be retrieved as a string using either the getFormattedResult or
getResult methods.
o For examples of how to use the API class, see the com.jgaap.ui package for a GUI
example, or the com.jgaap.backend.CLI class for a command line example
JGAAP Command Line Interface
Command line runs are done using the JGAAP Experiment Engine
(Source: http://evllabs.com/jgaap/w/index.php/Experiment_Engine)
The Experiment Engine is accessible through JGAAP's command line interface using the -ee flag and
pointing it to a run file, i.e. java -jar jgaap.jar -ee ../experiment.run
A run file is a CSV file in which the top row is a title for the experiment and each subsequent row is an
experiment to be run.
A ‘Corpus File’ contains information on the authors and titles of documents that can be used in
JGAAP to allow for the quick reproduction of experiments.
Corpus files can be generated within JGAAP after loading documents via the menu option
“File>Batch Documents>Save Documents”
Standard:
Author,/path/to/file.txt,Title
Author,/path/to/other/file.txt,Title
,/path/to/unknown/file.txt,Title
Example:
A,/home/ryan/example/sampleA-01.txt
A,/home/ryan/example/sampleA-02.txt
B,/home/ryan/example/sampleB-01.txt
B,/home/ryan/example/sampleB-02.txt
C,/home/ryan/example/sampleC-01.txt
C,/home/ryan/example/sampleC-02.txt
,/home/ryan/example/test-01.txt
,/home/ryan/example/test-02.txt
Titles are optional. If you don’t assign one, a default title will be assigned.
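A corpus file in the format above could also be generated with a script rather than through the GUI. The sketch below is one way to do it, assuming known documents are stored in one sub-folder per author (as in my tests, where the folder name is the author) and unknown documents sit together in a separate folder. The function name, the folder layout and the use of the file stem as the title are my own assumptions. Whether JGAAP accepts quoted CSV fields is something I have not verified, so author names containing commas are best avoided.

```python
import csv
from pathlib import Path


def write_corpus_file(known_root, unknown_dir, out_path):
    """Write 'Author,path,Title' rows; unknown documents get an empty author.

    known_root  -- directory containing one sub-folder of .txt files per author
    unknown_dir -- directory of .txt files whose author is to be determined
    out_path    -- where to write the corpus file
    """
    rows = []
    author_dirs = sorted(p for p in Path(known_root).iterdir() if p.is_dir())
    for author_dir in author_dirs:
        for doc in sorted(author_dir.glob("*.txt")):
            rows.append([author_dir.name, str(doc), doc.stem])
    for doc in sorted(Path(unknown_dir).glob("*.txt")):
        rows.append(["", str(doc), doc.stem])
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

A call such as `write_corpus_file("known", "unknown", "corpus.csv")` (hypothetical folder names) would then produce rows like `A,known/A/sampleA-01.txt,sampleA-01` for known texts and `,unknown/test-01.txt,test-01` for unknown ones.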