ScrapperMin
By John Kenedy

ScrapperMin is a scripting language similar to C, except that everything is a function call, including variable assignment. A variable can contain a single string or an array of strings.

    SET('varA', 'I am John');
    SET('varA', 'I am John';'My full name is John Kenedy');

The second sample assigns two strings to varA. Two or more strings become an array simply by separating them with semicolons.

ScrapperMin specialises in Web Scraping, that is, sending POST/GET requests to websites. It supports loops (for/while), conditional IF, string operations, web client operations, and web client operations using OAuth. It is built on .NET Framework 4.0 and comes with libraries for web scraping. A script can call the ready-made functions to create a Web Bot quickly.

Some samples of what it can do:

1. Sending posts to a website that does not offer an API, such as a vBulletin forum.
2. Sending posts to a website that offers an API through Open Authentication.
3. Sending a Private Message or Visitor Message to people in a forum.
4. Logging in to a website automatically and getting links from the site to download files directly via the browser.
5. Uploading large files to a website and getting the link without using the API.

The art of Web Scraping lies in analysing the FORM tag of an HTML page to find the required inputs, filling in those inputs, and passing them through POST/GET to the action link. ScrapperMin easily gets the required POST string from a form and lets you fill some parameters with your own information.

Example

    SET('PG', WC_GetPage('https://www.furk.net', ''));
    SET('PS', WC_GetPostStringRaw('https://www.furk.net', GET('PG'), '0', 'login={PARAM0}&pwd={PARAM1}', ''));

First, PG is assigned the value returned by WC_GetPage. WC stands for Web Client, which contains all functions related to the web client, such as GetPage. The real class name of WC is CookieWebClient; you can read the CHM documentation of CookieWebClient to see which functions it exposes.
The GetPage function has two parameters: the url and the referer. Above, we ask the Web Client to get https://www.furk.net without sending any referer. It returns the HTML page, which we store in the PG variable.

We then call WC_GetPostStringRaw, passing the url and the raw HTML (the PG variable). We specify index '0', which is the first form encountered in the HTML, and the function constructs the required POST string from that form. While constructing the POST string, the function fills the input named login with {PARAM0}, which is the first input from the user, and the input named pwd with {PARAM1}, which is the second user input. The resulting POST string might also include the security token of the FORM (presumably it has one); we do not need to care about it, since we only fill in what we need. Most websites use such a security token to prevent Cross-Site Request Forgery attacks. Notice that the last parameter is '', an empty string. It is used for removing certain input names from the generated POST string, such as a Preview input that tells the server to preview before posting. We remove nothing in this sample.

The POST string is stored in the variable PS, and now it is time to send it back to the server to log in.

    SET('PG2', WC_PostPage('https://www.furk.net/api/login/login', GET('PS')));

Here we call WC_PostPage, which sends POST information to the server. We can find the login url by pressing F12 in Chrome or Firefox and watching which url gets called when we click the login button. We pass the POST string, filled with the username and password, to the server, and the server's response is stored in PG2.
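The form analysis described above can be pictured with a small Python sketch. It is not the implementation of WC_GetPostStringRaw, whose internals are documented only in the CHM file. The names FormInputCollector and build_post_string, their parameters, and the sample HTML are all hypothetical, and values are not URL-encoded here.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl


class FormInputCollector(HTMLParser):
    """Collects (name, value) pairs of <input> tags, grouped per <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of (name, value) pairs per form
        self._current = None  # inputs of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_post_string(html, form_index, overrides, remove=()):
    """Hypothetical helper: build a POST string from the form at form_index.

    overrides is a string such as 'login={PARAM0}&pwd={PARAM1}';
    remove lists input names to leave out of the result.
    """
    collector = FormInputCollector()
    collector.feed(html)
    fills = dict(parse_qsl(overrides, keep_blank_values=True))
    pairs = []
    for name, value in collector.forms[int(form_index)]:
        if name in remove:
            continue
        pairs.append(name + "=" + fills.get(name, value))
    return "&".join(pairs)


# Hypothetical login page, not the real furk.net markup.
page = """
<form action="/api/login/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="pwd">
  <input type="submit" name="preview" value="Preview">
</form>
"""

post_string = build_post_string(page, "0", "login={PARAM0}&pwd={PARAM1}", remove=["preview"])
print(post_string)  # token=abc123&login={PARAM0}&pwd={PARAM1}
```

The hidden token is carried over untouched, the two named inputs receive the placeholders, and the preview input is dropped, which mirrors the roles of the fourth and fifth parameters described above.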
After this, we need to check whether PG2 contains a string that indicates we have logged in successfully.

    IF (GET('PG2'), 'CONTAINS', 'status":"ok"',
        LOG('Successful Login Furk');
    ,
        EXIT('Invalid Furk Login'));

IF takes 5 parameters: the first operand, the operator, the second operand, the script to run if the evaluation is true, and the script to run if the evaluation is false. The LOG function takes a string parameter and writes the text to the window. Like LOG, the EXIT function writes text to the window, but it then terminates the script immediately.

Run ScrapperMinW.exe and select HorriblesFurk.txt. You will see two boxes for your input, because the software detects two {PARAMx} placeholders inside the script and prepares one box for each. Enter your username and password, and click Run. Whether it displays a successful or a failed login depends on entering a correct username/password for the Furk.net website. Assuming you have an account, you will see the successful login message, and the script continues with the part that follows the LOG call.

    SET('PGX', WC_GetPage('http://www.horriblesubs.info/lib/latest.php',''));
    FOR('G', SO_TagMatch(GET('PGX'), '<div class="episode"', '</div></div></div>'),
        SET('URLS', SO_TagMatch(GET('G'), 'http://nyaa.se/?page=download&tid=', '"'));

The next part of the script uses GetPage again, but this time it gets a page from the horriblesubs.info website. The returned HTML is stored in PGX. Then we see this call:

    SO_TagMatch(GET('PGX'), '<div class="episode"', '</div></div></div>')

SO_TagMatch is part of the String Operations class, officially called StringOps.cs, which exposes TagMatch as one of a list of functions; the full list can be read in the CHM documentation. TagMatch takes 3 parameters: the full string, the start string, and the end string. It returns the list of strings found between the start string and the end string inside the full string.
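To make the start/end matching concrete, here is a minimal Python sketch of the behaviour described for TagMatch. It is an assumption about how such a function can work, not the code of StringOps.cs; the name tag_match and the sample string are hypothetical.

```python
from typing import List


def tag_match(full: str, start: str, end: str) -> List[str]:
    """Return every substring of `full` that lies between `start` and `end`."""
    results = []
    pos = 0
    while True:
        begin = full.find(start, pos)
        if begin == -1:
            break                      # no further start string
        begin += len(start)            # skip past the start string itself
        finish = full.find(end, begin)
        if finish == -1:
            break                      # start string without a matching end
        results.append(full[begin:finish])
        pos = finish + len(end)        # continue searching after this match
    return results


sample = "<b>one</b> and <b>two</b>"
matches = tag_match(sample, "<b>", "</b>")
print(matches)  # ['one', 'two']
```

Because the search resumes after each end string, every non-overlapping occurrence is returned in document order, which is what makes the result usable as the list argument of a loop.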
    <div class="episode"AAA</div></div></div><div class="episode"BBB</div></div></div>

If the full string is like the one above, TagMatch returns an array containing AAA and BBB. The return value of TagMatch is passed as the second parameter of the FOR statement (loop). The first parameter of FOR is the name of the variable that receives each string found in the second parameter, and the third parameter is the script to execute for each string.

    FOR('G', SO_TagMatch(GET('PGX'), '<div class="episode"', '</div></div></div>'),
        SET('URLS', SO_TagMatch(GET('G'), 'http://nyaa.se/?page=download&tid=', '"'));

The script above gets all episodes published on Horriblesubs.info, loops over them, and stores each episode in G. Since G also contains other, unwanted information, we TagMatch it again to get only the nyaa.se releases. We set URLS to the list of releases found in the episode (TagMatch returns the tid part that follows each download url); it is a list because each episode might contain 480p, 720p and 1080p releases.

    SET('I', '');
    FOR('H', GET('URLS'),
        SET('I', JOIN('', GET('I'), '0'));

Next, we set I to an empty string, because in this example we want only the last release found in each episode, which is 1080p, or sometimes 720p. We loop over URLS; each nyaa tid is stored in H, and on each pass we increase the length of I by one using JOIN.

    JOIN(',', 'a';'b');

The call above returns a,b. JOIN takes the separator as its first argument and the array to concatenate with that separator as its second argument.

    IF (LENGTH(GET('I')), '=', COUNT(GET('URLS')),
        SET('URL', JOIN('', 'http://nyaa.se/?page=download&tid=', GET('H')));
    ,
        PASS());

Again we use IF, this time to check whether the LENGTH of I (1 on the first pass, one more on each later pass) equals the COUNT of the array URLS. COUNT returns the number of items in an array, while LENGTH returns the length of a string. The fourth parameter is the script to run when the condition is met, which happens at the last index of URLS. The fifth parameter, PASS(), does nothing; it is there because something must be filled in. When the loop reaches the last release (1080p), the script creates the real URL by appending the tid to the nyaa.se download url.
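The counter trick used to pick the last release can be restated in Python as a rough sketch. The function name pick_last_download_url and the sample tid values are hypothetical; the comments map each step back to the script.

```python
def pick_last_download_url(tids):
    """Return the download url for the last tid, using a growing counter string."""
    base = "http://nyaa.se/?page=download&tid="
    counter = ""                        # plays the role of variable I
    url = ""                            # plays the role of variable URL
    for tid in tids:                    # tid plays the role of H
        counter += "0"                  # SET('I', JOIN('', GET('I'), '0'))
        if len(counter) == len(tids):   # LENGTH(I) = COUNT(URLS)
            url = base + tid            # only true on the last pass
    return url


# Hypothetical tid values for the 480p, 720p and 1080p releases of one episode.
last_url = pick_last_download_url(["111", "222", "333"])
print(last_url)  # http://nyaa.se/?page=download&tid=333
```

The string counter stands in for an integer index: its length after each pass equals the number of items visited so far, so the comparison succeeds only on the final item.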
Once it is set in the URL variable, we are ready to pass it to furk.

    IF (GET('URL'), '=', '', PASS(),
        IF (SO_IsInFile('horriblesubs.txt', GET('URL')), '=', '0',
            SET('PG4', WC_PostPageFormXml('https://www.furk.net/api/dl/add', 'url';'notify', GET('URL');'1', ''));
            IF(GET('PG4'), 'CONTAINS', '"status":"ok"',
                SO_AppendToFile('horriblesubs.txt', GET('URL'));
                LOG(FORMAT('Success adding {0}', GET('URL')));
            ,
                LOG(FORMAT('Fail adding {0}', GET('URL'))));
        ,
            PASS());
    );

The last part of this script checks whether URL is empty. If it is, we PASS(), that is, do nothing. If it is not, we check whether this URL already exists in horriblesubs.txt. SO_IsInFile is a method published by StringOps.cs; it reads all lines from a text file and checks whether a string is contained in any of the lines. It returns 1 if the string exists and 0 if it is not found. In the IF we check for '0', because we do not want to pass a link to furk.net when we have already processed it.

The PG4 variable will contain the response returned by WC_PostPageFormXml, a method published by the Web Client. This method sends a POST to a url, but the POST string is given as separate keys and values. The first parameter is the url, the second parameter is the list of keys, the third parameter is the list of values, and the fourth parameter is the referer. We pass 'url';'notify', which becomes a list or array containing two strings, so the final POST string will be

    url=……&notify=1

where …… is the URL we found from horriblesubs. The script then checks whether Furk.net reports that it accepted the link and started downloading the file. If so, we append the link to the file horriblesubs.txt using SO_AppendToFile and log that the link was added successfully.

Syntax List (non Library)

1. GET
   Eg : GET('a');
   Returns the value of the variable.
2. SET
   Eg : SET('a', 'I am John');
   Sets the variable a to the value I am John.
3. REM
   Eg : REM('this is remarks')
   Sets a remark. Remarks must be added at the top of the script to indicate the purpose of each input inside the script.
4. PASS
   Eg : PASS()
   Does nothing. It serves as a placeholder for IF and WHILE statements.
5. EXIT
   Eg : EXIT('Error login')
   Prints the text in the parameter and exits the script.
6. OA_ConsumerKey
   Eg : OA_ConsumerKey('key')
   Sets the consumer key for OAuth authentication.
7. OA_ConsumerSecret
   Eg : OA_ConsumerSecret('secret')
   Sets the consumer secret for OAuth authentication.
8. OA_TokenKey
   Eg : OA_TokenKey('key')
   Sets the token key returned by the OAuth login for the script.
9. OA_TokenSecret
   Eg : OA_TokenSecret('secret')
   Sets the token secret returned by the OAuth login for the script.
10. LOG
   Eg : LOG('Success login')
   Prints the text in the parameter.
11. FORMAT
   Eg : Format('upload_type=file&sess_id={0}&srv_tmp_url={1}&tos=1', GET('SID'), GET('TMP'));
   Formats a string in the same way as String.Format in C#, replacing the placeholders {0} {1} … {n} with the arguments that follow the first string.
12. TRIM
   Eg : TRIM(' aaaa ');
   Trims the white space around the string.
13. JOIN
   Eg : JOIN(',', 'a';'b');
   Joins strings using the separator given in the first argument.
14. SLEEP
   Eg : Sleep('3')
   Sleeps for the number of seconds specified in the first argument.
15. LENGTH
   Eg : LENGTH('abcdef');
   Returns the length of the string.
16. COUNT
   Eg : COUNT('ab';'bc';'cd');
   Returns the number of items in the array.
17. SPLIT
   Eg : SPLIT('a,b,c', ',');
   Splits a string at each occurrence of the substring in the second argument.
18. REPLACE
   Eg : REPLACE(GET('URL'), '&amp;', '&')
   Replaces a string with another string. The sample replaces &amp; with &.
19. REMOVEAT
   Eg : REMOVEAT('1', 'a';'b';'c');
   Removes the array item at the index specified in the first argument. The example returns 'a';'c'.
20. PROCESSSTART
   Eg : PROCESSSTART('notepad.exe');
   Starts a file using the Windows shell.
21. LOADSCRIPT
   Eg : LOADSCRIPT('kaskuspm.txt', 'username', 'password', 'title', 'comment');
   Runs another script using the arguments specified.
22. FILTER
   Eg : Filter('1', 'a';'b';'c');
   Gets the string at the index specified in the first argument from the list of strings in the second argument. The example returns 'b', because indexes start at 0.
   Filter also supports ranges, like : Filter('0-2', 'a';'b';'c';'d'); which returns 'a';'b';'c'
   Filter also supports lists, like : Filter('1,3', 'a';'b';'c';'d'); which returns 'b';'d'
   A comma means return the selected indexes.
   A dash (-) means return the indexes between the two numbers.
23. IF
   IF takes 5 parameters: the first operand, the operator, the second operand, the script to run if the evaluation is true, and the script to run if the evaluation is false.
   The operator of IF can be
   - CONTAINS : If a string contains the string in the third argument
   - = : If a string is equal to the third argument
   - IN : If a string is contained in a list of strings
   - != : If a string is not equal to the third argument
   - STARTSWITH : If a string starts with the third argument
   - ENDSWITH : If a string ends with the third argument
   - < : If a string is less than the third argument. If the string is an integer, it is compared as an integer instead of a string
   - <= : If a string is less than or equal to the third argument. If the string is an integer, it is compared as an integer instead of a string
   - > : If a string is greater than the third argument. If the string is an integer, it is compared as an integer instead of a string
   - >= : If a string is greater than or equal to the third argument. If the string is an integer, it is compared as an integer instead of a string

Syntax List (Library)

1. SO_
   Any syntax starting with SO_ uses reflection to invoke StringOps.cs functions. See the CHM documentation for all methods published there.
2. WC_
   Any syntax starting with WC_ uses reflection to invoke CookieWebClient.cs functions. See the CHM documentation for all methods published there.
3. OA_
   Any syntax starting with OA_ uses reflection to invoke OAuth.cs functions. See the CHM documentation for all methods published there.

ScrapperMin Files

1. ScrapperMin.exe
   Runs scripts (contained in the Scripts folder) or datas (contained in the Datas folder) in console mode. Parameters can be saved in a file, and the file name is passed when running ScrapperMin.exe.
   It is meant for normal users who schedule ScrapperMin.exe in Task Scheduler so that it periodically does Web Scraping.
   Example :

       scrappermin.exe -scr=tusfiles.txt -args=param.txt

   -scr means run the script named tusfiles.txt in the scripts folder. -args means run the script and fill in its parameters from param.txt.
   Param.txt contains the parameter information in a format such as below

       [param]
       UserID
       [param]
       Password
       [param]
       E:\test.mkv

   Each parameter goes below a [param] line. The file above has three parameters, UserID, Password and a file, which are passed to the tusfiles.txt script, which has {PARAM0}, {PARAM1} and {PARAM2}. The example above uploads the file using the tusfiles.txt script. This example is not really suitable for scheduling in Task Scheduler. A more appropriate script to schedule would be HorribleFurk.txt, which checks for any new release on the Horriblesubs.info website and passes the link to Furk.net to download it. ScrapperMin.exe (console mode) exists to let users automate running a script by preparing its arguments in a file.
2. ScrapperMinW.exe
   Creates and runs scripts (contained in the Scripts folder) in a Windows Graphical User Interface. This tool is for developers testing their scripts, creating new scripts, or packaging scripts. Using this tool is straightforward, since every parameter is shown as a textbox at the top of the UI; the developer enters the information and clicks Run to test the script.
   This tool can package any created script as Data. Data is a package containing one or more scripts, encrypted with a password (using the first section of the Data tab of ScrapperMinW.exe). Later, developers can write their own software that references ScrapperMinDLL.dll, loads the data by passing the password, and runs the script. Alternatively, developers can package the Data into a separate program whose purpose is to run the selected scripts, using the second section of the Data tab in ScrapperMinW.exe.
   Just like the first section of the Data tab, which combines multiple scripts into one file and encrypts it, the second section also asks for a folder in which to save the output. The output contains ScrapperRun.exe, which serves the purpose of running the script (the developer must enter the Entry file, which is the name of the file to be executed first once Start is clicked).
   Sample scenario: a developer runs ScrapperMinW.exe and creates two scripts, a.txt and b.txt. The developer wants to ship both scripts to users, so he goes to the Data tab, checks a.txt and b.txt, and enters a Data filename such as Simple.Dat, a password, and an Entry File, which is b.txt because the script starts from b.txt as the main function.
3. ScrapperRun.exe
   Detects any datas folder in the current path; if one exists, it loads the first .dat file encountered. Or it detects any scripts folder in the current path; if one exists, it loads the first file encountered. Loading means it finds the number of text boxes required to run the script and builds a User Interface with that many text boxes. There is a tab where the user clicks Start to run the script. ScrapperRun.exe is suitable for shipping to normal users who run scripts using a GUI without knowing programming.
4. ScrapperUpdater.exe
   Running this file downloads the latest version of ScrapperMin and replaces your local files with the new version.
5. ScrapperMinDLL.dll
   The core library, which contains everything.
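The parameter file used by ScrapperMin.exe can be illustrated with a short Python sketch. It assumes that each value sits on the line directly below a [param] line and that {PARAMn} is replaced by the n-th value; the functions read_params and fill_params and the sample script line are hypothetical and do not come from ScrapperMinDLL.dll.

```python
def read_params(text):
    """Return the values that follow each [param] line, in file order."""
    params = []
    expect_value = False
    for line in text.splitlines():
        line = line.strip()
        if line == "[param]":
            expect_value = True        # the next non-empty line is a value
        elif expect_value and line:
            params.append(line)
            expect_value = False
    return params


def fill_params(script, params):
    """Replace {PARAM0}, {PARAM1}, ... in the script text with the values."""
    for index, value in enumerate(params):
        script = script.replace("{PARAM%d}" % index, value)
    return script


# Same content as the Param.txt example above.
param_file = r"""[param]
UserID
[param]
Password
[param]
E:\test.mkv
"""

params = read_params(param_file)
# Hypothetical script line that uses the first and third parameter.
filled = fill_params("SET('USER', '{PARAM0}'); SET('FILE', '{PARAM2}');", params)
print(params)
print(filled)
```

Keeping the values in a file rather than typing them into text boxes is what allows the console tool to run unattended from Task Scheduler.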