
ScrapperMin
By John Kenedy
ScrapperMin is a scripting language similar to C, except that everything is a
function call, including variable assignment. A variable can contain a single string or an array of
strings.
SET('varA', 'I am John');
SET('varA', 'I am John';'My full name is John Kenedy');
The second sample assigns two strings to varA. Strings separated by a semicolon
become an array.
ScrapperMin specialises in Web Scraping, that is, sending POST/GET requests to
websites. It supports loops (for/while), conditional if, string operations, and web client operations,
including web client operations using OAuth. It is built on .NET Framework 4.0, which comes with
libraries for web scraping. A script can call ready-made functions to enable fast creation
of a Web Bot. Some samples of what it can do:
1. Sending posts to a website without using an API, such as a vBulletin forum.
2. Sending posts to a website that offers an API through Open Authentication (OAuth).
3. Sending a Private Message or Visitor Message to people in a forum.
4. Logging in to a website automatically and getting links from the site to download files directly via
the browser.
5. Uploading large files to a website and getting the link without using the API.
The art of Web Scraping lies in analysing the FORM tag of an HTML page to find the required inputs, filling
them in, and passing them through POST/GET to the action link. ScrapperMin easily gets the required
POST string from a form and lets you fill some parameters with your own information. Example:
SET('PG', WC_GetPage('https://www.furk.net', ''));
SET('PS', WC_GetPostStringRaw('https://www.furk.net', GET('PG'), '0',
'login={PARAM0}&pwd={PARAM1}', ''));
First, PG is assigned the value returned by WC_GetPage. WC stands for Web Client, which contains
all functions related to the web client, such as GetPage. The real class name behind WC is
CookieWebClient; you can read the CHM documentation of CookieWebClient to see which
functions it exposes.
The GetPage function has two parameters: the url and the referer. Above we ask the Web Client to
get https://www.furk.net without sending any referer. It returns the HTML page, which is
stored in the PG variable. We then call WC_GetPostStringRaw, passing the url followed by the
raw HTML (the PG variable). The index '0' selects the first form encountered in
the HTML, from which the function constructs the required POST string. While constructing it,
the function fills the input named login with {PARAM0}, which is the first
input from the user, and the input named pwd with {PARAM1}, which is the second user input.
The resulting POST string might also include the security token of the FORM (presumably it
has one). We don't need to care about it, since we only care about filling in what we need. Most
websites use such a security token to prevent Cross-Site Request Forgery (CSRF) attacks. Notice that
the last parameter is '', an empty string. It lists tag names to remove from the generated POST string,
such as a Preview tag, which tells the server to preview before posting. We remove nothing in this
sample.
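To make the idea concrete, here is a rough Python sketch of what a function like WC_GetPostStringRaw might do. It is not the real CookieWebClient code: the names FormInputCollector and build_post_string, the sample form and its token field are all invented for illustration, and the real function may handle more tag types.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormInputCollector(HTMLParser):
    """Collects (name, value) pairs of <input> tags, grouped per <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of (name, value) per form, in page order
        self._current = None  # inputs of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_post_string(html, form_index, overrides, remove=()):
    """Build a POST string from the form at form_index (hypothetical helper).

    overrides: input name -> value to send instead of the page's default.
    remove:    input names to leave out of the result.
    """
    parser = FormInputCollector()
    parser.feed(html)
    fields = [
        (name, overrides.get(name, value))
        for name, value in parser.forms[form_index]
        if name not in remove
    ]
    return urlencode(fields)


# Made-up login form with a hidden security token and a Preview button.
page = """
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="pwd">
  <input type="submit" name="preview" value="Preview">
</form>
"""

post_string = build_post_string(
    page, 0, {"login": "john", "pwd": "secret"}, remove=("preview",)
)
print(post_string)  # token=abc123&login=john&pwd=secret
```

The hidden token is carried over untouched, which is why the script does not need to know about it.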
The POST string is stored in the variable PS, and now it's time to send it back to the
server to do the login:
SET('PG2', WC_PostPage('https://www.furk.net/api/login/login', GET('PS')));
Here we call WC_PostPage, a method that sends POST information to the server. We can find
the url of the login by pressing F12 in Chrome or Firefox and watching which url gets called when we
click the login button. We pass the POST string, with the username and password filled in, to the
server, and the server's response is stored in PG2.
After this, we need to check whether PG2 contains a string that indicates we
have logged in successfully:
IF (GET('PG2'), 'CONTAINS', 'status":"ok"',
LOG('Successful Login Furk');
, EXIT('Invalid Furk Login'));
The IF syntax has 5 parameters: operand 1, the operator, operand 2,
the script to call if the evaluation is true, and the script to call if the evaluation is false.
The LOG function takes a string parameter and writes the text to the window. Similar
to LOG, the EXIT function writes text to the window, but it also terminates the script immediately.
Run ScrapperMinW.exe and select HorriblesFurk.txt. You will see two input boxes,
because the software detects two {PARAMx} placeholders inside the script and prepares one box for
each. Enter your username and password, then click Run. Whether it displays a successful login or
a failure depends on whether the username/password are correct for the Furk.net website. Assuming you
have an account, you will see the successful login message, and the script continues with the part
that follows the LOG success string.
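How the software detects the placeholders is not documented here. The following Python sketch shows one plausible approach using a regular expression; count_params and fill_params are hypothetical names, not part of ScrapperMin.

```python
import re

PARAM_PATTERN = re.compile(r"\{PARAM(\d+)\}")


def count_params(script_text):
    """Return how many distinct {PARAMn} placeholders the script contains."""
    return len(set(PARAM_PATTERN.findall(script_text)))


def fill_params(script_text, values):
    """Replace each {PARAMn} with values[n] (the n-th user input)."""
    return PARAM_PATTERN.sub(lambda m: values[int(m.group(1))], script_text)


script = "'login={PARAM0}&pwd={PARAM1}'"
print(count_params(script))                      # 2
print(fill_params(script, ["john", "secret"]))   # 'login=john&pwd=secret'
```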
SET('PGX', WC_GetPage('http://www.horriblesubs.info/lib/latest.php',''));
FOR('G', SO_TagMatch(GET('PGX'), '<div class="episode"', '</div></div></div>'),
SET('URLS', SO_TagMatch(GET('G'), 'http://nyaa.se/?page=download&tid=', '"'));
The next part of the script uses GetPage again, this time to fetch a page from the horriblesubs.info website.
The returned HTML is stored in PGX. Then we see this call:
SO_TagMatch(GET('PGX'), '<div class="episode"', '</div></div></div>')
SO_TagMatch belongs to the String Operations class, officially called StringOps.cs, which exposes
TagMatch as one of its functions. The full list of functions
can be read in the CHM documentation.
TagMatch takes 3 parameters: the full string, the start string, and the end string. It returns the list of
strings found between the start and end strings inside the full string.
<div class="episode"AAA</div></div></div><div class="episode"BBB</div></div></div>
If the full string is like the one above, it will return an array containing AAA and BBB.
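The behaviour described for TagMatch can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the StringOps.cs source, and tag_match is an illustrative name.

```python
def tag_match(full, start, end):
    """Return every substring of full found between start and the next end."""
    results = []
    pos = 0
    while True:
        begin = full.find(start, pos)
        if begin == -1:
            break
        begin += len(start)          # skip past the start marker
        finish = full.find(end, begin)
        if finish == -1:
            break                    # unterminated match: stop searching
        results.append(full[begin:finish])
        pos = finish + len(end)      # continue after the end marker
    return results


html = ('<div class="episode"AAA</div></div></div>'
        '<div class="episode"BBB</div></div></div>')
print(tag_match(html, '<div class="episode"', '</div></div></div>'))
# ['AAA', 'BBB']
```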
The return value of TagMatch is passed as the second parameter of the FOR statement (loop). The first
parameter of FOR is the name of the variable that holds the current string on each iteration over
the second parameter, and the third parameter is the script to execute.
FOR('G', SO_TagMatch(GET('PGX'), '<div class="episode"', '</div></div></div>'),
SET('URLS', SO_TagMatch(GET('G'), 'http://nyaa.se/?page=download&tid=', '"'));
The script above gets all episodes published on Horriblesubs.info, loops over them, and
stores each episode in G. Since G also contains other unwanted information, we TagMatch it again to get
only the nyaa.se releases. URLS is set to the list of tid values of the nyaa.se download urls found in each
episode, because each episode might contain 480p, 720p and 1080p releases.
SET('I', '');
FOR('H', GET('URLS'),
SET('I', JOIN('', GET('I'), '0'));
Next we set I to an empty string, because in this example we want only the last release
found in each episode, which is 1080p or sometimes 720p. We loop over URLS; each nyaa
tid is stored in H, and we increase the length of I by one character on every iteration using JOIN.
JOIN(',', 'a';'b');
The call above returns a,b. JOIN takes the separator as its first argument and the array
to concatenate with that separator as its second argument.
IF (LENGTH(GET('I')), '=', COUNT(GET('URLS')),
SET('URL', JOIN('', 'http://nyaa.se/?page=download&tid=', GET('H')));
, PASS());
Again we use IF, this time to check whether the LENGTH of I (1 after the first iteration) is equal to
the COUNT of the array URLS. COUNT returns the length of an array, while LENGTH returns the length of a
string. Since I grows by one character per iteration, the two are equal only at the last index of URLS,
and that is when the fourth parameter, the script to run when the criteria is met, is executed. The fifth
parameter, PASS(), does nothing; it is there because something is required in that position.
Once the loop reaches the last release (1080p), it creates the real URL by joining the
nyaa.se download url and the tid. Once it is set in the URL variable, we are ready to pass it to Furk:
IF (GET('URL'), '=', '', PASS(),
IF (SO_IsInFile('horriblesubs.txt', GET('URL')), '=', '0',
SET('PG4', WC_PostPageFormXml('https://www.furk.net/api/dl/add',
'url';'notify', GET('URL');'1', ''));
IF(GET('PG4'), 'CONTAINS', '"status":"ok"',
SO_AppendToFile('horriblesubs.txt', GET('URL'));
LOG(FORMAT('Success adding {0}', GET('URL')));
, LOG(FORMAT('Fail adding {0}', GET('URL'))));
, PASS());
);
The last part of the script checks whether URL is empty. If yes, it calls PASS(), which does nothing. If
not, it checks whether this URL already exists in horriblesubs.txt.
SO_IsInFile is a method published by StringOps.cs. It reads all lines from a text file and
checks whether the string is contained in any of the lines. It returns '1' if the string exists and '0'
if it is not found. In the IF we check for '0', because we don't want to pass a link to furk.net when
we have already processed it.
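The pairing of SO_IsInFile with SO_AppendToFile in the script is a simple "have I seen this before?" log. A Python sketch of the same pattern follows; is_in_file and append_to_file are stand-in names, the tid in the sample URL is made up, and returning '0' for a file that does not exist yet is an assumption.

```python
import os
import tempfile


def is_in_file(path, text):
    """Return '1' if any line of the file contains text, otherwise '0'."""
    if not os.path.exists(path):
        return "0"  # assumption: a missing file counts as "not found"
    with open(path, encoding="utf-8") as handle:
        return "1" if any(text in line for line in handle) else "0"


def append_to_file(path, text):
    """Append text as a new line at the end of the file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(text + "\n")


log_path = os.path.join(tempfile.mkdtemp(), "horriblesubs.txt")
url = "http://nyaa.se/?page=download&tid=12345"  # made-up tid

first = is_in_file(log_path, url)    # not processed yet
append_to_file(log_path, url)        # remember it after a successful add
second = is_in_file(log_path, url)   # processed: skip it next time
print(first, second)                 # 0 1
```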
The PG4 variable will contain the HTML returned by WC_PostPageFormXml, a method published by the Web
Client. This method sends a POST to a url, but the POST string is given as separate
keys and values. The first parameter is the url, the second parameter is the list of keys, the third
parameter is the list of values, and the fourth parameter is the referer.
We pass 'url';'notify' so that it becomes a list (array) containing two strings. The final POST string will
be
url=……&notify=1
where …… is the URL we found on horriblesubs. The script then checks whether Furk.net reports
that it accepted the link and started downloading the file. If yes, we append the
link to the file horriblesubs.txt using SO_AppendToFile and log that the link was added
successfully.
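Pairing a list of keys with a list of values can be sketched in Python as below. build_form_post is a hypothetical helper and the tid is made up. Note that urlencode percent-encodes the value; whether WC_PostPageFormXml does the same is an assumption.

```python
from urllib.parse import parse_qs, urlencode


def build_form_post(keys, values):
    """Pair each key with the value at the same position and encode the result."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return urlencode(list(zip(keys, values)))


body = build_form_post(
    ["url", "notify"],
    ["http://nyaa.se/?page=download&tid=12345", "1"],  # made-up tid
)
print(body)            # url=<percent-encoded link>&notify=1
print(parse_qs(body))  # decodes back to the original url and notify values
```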
Syntax List (non Library)
1. GET
Eg : GET('a');
Returns the value of the variable
2. SET
Eg : SET('a', 'I am John');
Sets the variable a to the value I am John
3. REM
Eg : REM('this is remarks')
To set remarks. Remarks must be added at the top of the script to indicate the purpose
of each input inside the script
4. PASS
Eg : PASS()
To do nothing; serves as a placeholder in IF and WHILE statements
5. EXIT
Eg : EXIT('Error login')
To print the text in the parameter and exit the script
6. OA_ConsumerKey
Eg : OA_ConsumerKey('key')
Sets the consumer key for OAuth authentication
7. OA_ConsumerSecret
Eg : OA_ConsumerSecret('secret')
Sets the consumer secret for OAuth authentication
8. OA_TokenKey
Eg : OA_TokenKey('key')
Sets the token key returned by the OAuth login
9. OA_TokenSecret
Eg : OA_TokenSecret('secret')
Sets the token secret returned by the OAuth login
10. LOG
Eg : LOG('Success login')
To print the text in the parameter
11. FORMAT
Eg : Format('upload_type=file&sess_id={0}&srv_tmp_url={1}&tos=1', GET('SID'),
GET('TMP'));
To format a string in the same way as String.Format in C#, replacing the placeholders {0} {1} … {n}
with the arguments after the first string
12. TRIM
Eg : TRIM(' aaaa ');
To trim the white space around the string
13. JOIN
Eg : JOIN(',', 'a';'b');
To join strings with the separator given in the first argument
14. SLEEP
Eg : Sleep('3')
To sleep for the number of seconds given in the first argument
15. LENGTH
Eg : LENGTH('abcdef');
Returns the length of the string
16. COUNT
Eg : COUNT('ab';'bc';'cd');
Returns the number of items in the array
17. SPLIT
Eg : SPLIT('a,b,c', ',');
Splits a string on the substring given in the second argument
18. REPLACE
Eg : REPLACE(GET('URL'), '&amp;', '&')
To replace a string with another string. The sample replaces &amp; with &
19. REMOVEAT
Eg : REMOVEAT('1', 'a';'b';'c');
To remove the array element at the index given in the first argument. The example returns 'a';'c'
20. PROCESSSTART
Eg : PROCESSSTART('notepad.exe');
To start a file using the Windows shell.
21. LOADSCRIPT
Eg : LOADSCRIPT('kaskuspm.txt', 'username', 'password', 'title', 'comment');
To run another script with the specified arguments.
22. FILTER
Eg : Filter('1', 'a';'b';'c');
To get the string at the index given in the first argument from the list of strings in the second
argument. Indexes start at 0, so the example returns 'b'.
Filter also supports ranges : Filter('0-2', 'a';'b';'c';'d'); returns 'a';'b';'c'
Filter also supports lists : Filter('1,3', 'a';'b';'c';'d'); returns 'b';'d'
A comma means return the selected indexes.
A dash (-) means return the indexes between the two numbers, inclusive.
23. IF
The IF syntax has 5 parameters: operand 1, the operator, operand 2,
the script to call if the evaluation is true, and the script to call if the evaluation is false.
The operator of IF can be
- CONTAINS
If the string contains the third argument as a substring
- =
If the string is equal to the third argument
- IN
If the string is contained in a list of strings
- !=
If the string is not equal to the third argument
- STARTSWITH
If the string starts with the third argument
- ENDSWITH
If the string ends with the third argument
- <
If the string is less than the third argument. If the string is an integer, it is compared as an
integer instead of as a string
- <=
If the string is less than or equal to the third argument. If the string is an integer, it is
compared as an integer instead of as a string
- >
If the string is greater than the third argument. If the string is an integer, it is compared as an
integer instead of as a string
- >=
If the string is greater than or equal to the third argument. If the string is an integer, it is
compared as an integer instead of as a string
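Two entries in the list above, the index specification of FILTER and the operators of IF, can be modelled with the short Python sketch below. It follows the descriptions in this list only; filter_items and evaluate are hypothetical names, and treating a comparison as numeric only when both sides are integers is an assumption.

```python
def filter_items(spec, items):
    """Pick items by an index spec such as '0-2' (range) or '1,3' (list)."""
    picked = []
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(n) for n in part.split("-"))
            picked.extend(items[first:last + 1])  # the dash range is inclusive
        else:
            picked.append(items[int(part)])
    return picked


def _as_int(text):
    """Return text as an int, or None when it is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


def evaluate(left, operator, right):
    """Evaluate an IF-style condition; right is a list of strings for IN."""
    simple = {
        "CONTAINS": lambda: right in left,
        "IN": lambda: left in right,
        "=": lambda: left == right,
        "!=": lambda: left != right,
        "STARTSWITH": lambda: left.startswith(right),
        "ENDSWITH": lambda: left.endswith(right),
    }
    if operator in simple:
        return simple[operator]()

    # Ordering operators: compare as integers when both sides are integers,
    # otherwise fall back to plain string comparison.
    a, b = _as_int(left), _as_int(right)
    if a is None or b is None:
        a, b = left, right
    ordering = {"<": a < b, "<=": a <= b, ">": a > b, ">=": a >= b}
    return ordering[operator]


letters = ["a", "b", "c", "d"]
print(filter_items("0-2", letters))   # ['a', 'b', 'c']
print(filter_items("1,3", letters))   # ['b', 'd']
print(evaluate("9", "<", "10"))       # True  (compared as integers)
print(evaluate("b9", "<", "b10"))     # False (compared as strings)
```

The last two lines show why the integer rule matters: as plain strings, "9" sorts after "10".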
Syntax List (Library)
1. SO_
Any syntax starting with SO_ uses reflection to invoke StringOps.cs functions. See the
CHM documentation for all methods published there
2. WC_
Any syntax starting with WC_ uses reflection to invoke CookieWebClient.cs functions.
See the CHM documentation for all methods published there
3. OA_
Any syntax starting with OA_ uses reflection to invoke OAuth.cs functions. See the CHM
documentation for all methods published there
ScrapperMin Files
1. ScrapperMin.exe
Runs scripts (contained in the Scripts folder) or datas (contained in the Datas folder) in console
mode. Parameters can be saved in a file, and the file name is passed when running
ScrapperMin.exe. It is meant for normal users who schedule ScrapperMin.exe in Task
Scheduler so that it periodically does Web Scraping.
Example : scrappermin.exe -scr=tusfiles.txt -args=param.txt
-scr means run the script named tusfiles.txt in the scripts folder. -args means run
the script and fill in its parameters from param.txt
Param.txt contains the parameter information in a format such as below
[param]
UserID
[param]
Password
[param]
E:\test.mkv
Each parameter is placed below a [param] line. The sample above has three parameters: UserID,
Password and a file path. They are passed to the tusfiles.txt script, which has {PARAM0}, {PARAM1} and
{PARAM2}. The example above uploads the file using the tusfiles.txt script.
This example is not really suitable for scheduling in Task Scheduler. A more
appropriate example would be the HorribleFurk.txt script,
which checks for new releases on the Horriblesubs.info website and passes the links to
Furk.net to download them.
ScrapperMin.exe (console mode) exists to let users automate the running of
scripts by preparing the arguments in a file.
2. ScrapperMinW.exe
Creates/runs scripts (contained in the Scripts folder) in a Windows Graphical User Interface.
This tool is for developers testing their scripts, creating new scripts or packaging scripts.
Using this tool is straightforward: every parameter is shown as a textbox at the top of the
UI, and the developer enters the information and clicks Run to test it.
This tool can package any created scripts into Data. Data is a package
containing one or more scripts, encrypted with a password (using the first section of the
Data tab of ScrapperMinW.exe). Later, developers can build their own software that
references ScrapperMinDLL.dll, loads the data by passing the password, and runs the
script.
Alternatively, the developer can package the Data into a separate program whose purpose is to
run the selected scripts, using the second section of the Data tab in ScrapperMinW.exe. Like
the first section of the Data tab, it combines multiple scripts into one encrypted file.
The second section also asks for a folder in which to save the output, which contains
ScrapperRun.exe for running the script (the developer must
enter the Entry file, which is the name of the file to be executed first once Start is clicked).
Sample scenario: a developer runs ScrapperMinW.exe and creates two scripts, a.txt and
b.txt. To ship both scripts to users, the developer goes to the Data tab, checks a.txt
and b.txt, and enters the Data filename such as Simple.Dat, a password, and an Entry File,
which is b.txt because the script starts from b.txt as the main function.
3. ScrapperRun.exe
Detects whether there is a datas folder in the current path; if yes, it loads the first .dat file
it encounters. Or, if there is a scripts folder in the current path, it loads the first file it
encounters.
Loading means it finds the number of textboxes required to run the script and builds a
User Interface with that many textboxes. There is a tab where the user clicks Start to run
the script. ScrapperRun.exe is suitable for shipping to normal users who run scripts through a GUI
without knowing programming.
4. ScrapperUpdater.exe
Running this file downloads the latest version of ScrapperMin and replaces your local
files with the new version.
5. ScrapperMinDLL.dll
The core library, which contains all of the functionality.
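The param.txt format described for ScrapperMin.exe above is simple enough to sketch a reader for it in Python. This is an illustration of the format only, not the program's actual parser; parse_params is a hypothetical name, and skipping blank lines is an assumption.

```python
def parse_params(text):
    """Return the values that follow each [param] marker, in order."""
    values = []
    expect_value = False
    for line in text.splitlines():
        line = line.strip()
        if line == "[param]":
            expect_value = True
        elif expect_value and line:
            values.append(line)      # the line right after a marker is the value
            expect_value = False
    return values


sample = "[param]\nUserID\n[param]\nPassword\n[param]\nE:\\test.mkv\n"
params = parse_params(sample)
for index, value in enumerate(params):
    print("{PARAM%d} = %s" % (index, value))
```

The position of each value decides which placeholder it fills: the first value goes to {PARAM0}, the second to {PARAM1}, and so on.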