Web crawler with crawler4j

Basic Web Crawler Development with crawler4j

Development tools

- Eclipse IDE
- Java SDK
- Library "crawler4j", download >> https://code.google.com/p/crawler4j/

How a web crawler works (flowchart)

1. Start
2. Initialize frontier
3. Dequeue URL from frontier
4. Fetch page
5. Extract URLs and add to frontier
6. Store page
7. Done? Yes: stop. No: return to the dequeue step.

How a web crawler works (details)

- The first step collects the URLs of the web pages to crawl (Initialize Frontier), using breadth-first search (Breadth First Search):
  - it uses a first-in, first-out queue (First In First Out: FIFO)
  - it finds data along the shortest search path (Shortest Paths)
  - it searches breadth-first
- Next, a URL is taken from the queue (Dequeue) and handed to the component that requests or downloads the web page (Fetch Page).
- URLs are extracted (Extract URLs) from the fetched page to check whether it contains further URLs that still need to be processed. This determines the link depth (Link Depth). Any URLs found are recorded in the database.
- The process loops back and repeats until no URLs remain, at which point it is complete.

Step 1

- Download the latest crawler4j-x.x.zip file (it contains crawler4j-x.x.jar).
- The zip file also contains a log4j.properties file that you will need later to properly create a log4j appender.
- download >> https://code.google.com/p/crawler4j/

Step 2

- In Eclipse, create a new project, such as WebcrawlerTest.
- Right-click your new web crawler project and select the Build Path option.
- Select the "Add External Archives" option and select ALL of the jar files that you extracted from the zip files.
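Before moving on to crawler4j itself, the frontier loop described at the start (initialize, dequeue, fetch, extract, store, repeat) can be sketched in plain Java. This is only an illustration under stated assumptions: a small in-memory link map stands in for real page fetching, and the names FrontierSketch, LINKS and crawl are hypothetical and not part of crawler4j.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; none of these names come from crawler4j.
class FrontierSketch {

    // Stand-in for the web: each "page" maps to the pages it links to.
    static final Map<String, List<String>> LINKS = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"),
            "d", List.of());

    // Breadth-first crawl from a seed; returns pages in the order they were stored.
    static List<String> crawl(String seed, Map<String, List<String>> web) {
        Queue<String> frontier = new ArrayDeque<>(); // FIFO queue of URLs to visit
        Set<String> seen = new HashSet<>();          // URLs already queued
        List<String> stored = new ArrayList<>();     // "Store page" results

        frontier.add(seed);                          // Initialize frontier
        seen.add(seed);

        while (!frontier.isEmpty()) {                // Done? (empty frontier means stop)
            String url = frontier.remove();          // Dequeue URL from frontier
            // "Fetch page" and "Extract URLs": look up the outgoing links.
            List<String> outgoing = web.getOrDefault(url, List.of());
            for (String link : outgoing) {
                if (seen.add(link)) {                // add() is false for known URLs
                    frontier.add(link);              // Add new URL to frontier
                }
            }
            stored.add(url);                         // Store page
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a", LINKS));       // prints [a, b, c, d]
    }
}
```

Because the frontier is a FIFO queue, pages come out in order of increasing link depth from the seed. The seen set is what keeps the loop from revisiting pages that link back to each other.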
Step 3

- Create a Java class named "BasicCrawler".

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;
    import java.util.Set;
    import java.util.regex.Pattern;
    import org.apache.http.Header;

    public class BasicCrawler extends WebCrawler {

        // File extensions that should not be crawled (stylesheets, scripts, media, archives).
        private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    }

Step 3 (continued)

- Create a method named "shouldVisit" inside the BasicCrawler class.
- Implement this method to specify whether the given URL should be crawled or not (based on your crawling logic).

    @Override
    public boolean shouldVisit(Page page, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Skip filtered file types and follow only http:// URLs.
        return !FILTERS.matcher(href).matches() && href.startsWith("http://");
    }

Step 3 (continued)

- Create a method named "visit" inside the BasicCrawler class.
- This method is called when a page is fetched and ready to be processed by your program.
    @Override
    public void visit(Page page) {
        // Metadata about the URL of the fetched page.
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String domain = page.getWebURL().getDomain();
        String path = page.getWebURL().getPath();
        String subDomain = page.getWebURL().getSubDomain();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Domain: '" + domain + "'");
        System.out.println("Sub-domain: '" + subDomain + "'");
        System.out.println("Path: '" + path + "'");
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);

        // Only HTML pages carry text, markup and outgoing links.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            // Set matches the java.util.Set import in Step 3; older crawler4j
            // releases return a List here instead.
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }

        System.out.println("=============");
    }

Step 4

- Create a Java class named "BasicCrawlController".

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class BasicCrawlController {

    }

Step 4 (continued)

- Create the main method inside the BasicCrawlController class.

    public static void main(String[] args) throws Exception {
        // Folder where intermediate crawl data is stored.
        String crawlStorageFolder = "C:\\Storage\\";
        // Number of concurrent crawler threads.
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setPolitenessDelay(1000);      // milliseconds between requests
        config.setMaxDepthOfCrawling(2);      // follow links at most 2 levels from the seed
        config.setMaxPagesToFetch(1000);
        config.setResumableCrawling(false);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed URL: the crawl starts from this page.
        controller.addSeed("http://www.ics.uci.edu/");

        // Blocks until the crawl is finished.
        controller.start(BasicCrawler.class, numberOfCrawlers);
    }

Step 5 Run

- Select the Run → Run Configurations option, then select your project and BasicCrawlController as the main class.
- Click the Apply button and then Run to start your first web crawler.

    Deleting content of: C:\Storage\frontier
    INFO [main] Crawler 1 started.
    INFO [main] Crawler 2 started.

Step 5 Result

    Docid: 1
    URL: http://www.ics.uci.edu/~lopes/
    Domain: 'uci.edu'
    Sub-domain: 'www.ics'
    Path: '/~lopes/'
    Parent page: null
    Anchor text: null
    Text length: 2442
    Html length: 9987
    Number of outgoing links: 34
    =============

Step 5 Finish Crawl

    INFO [Thread-1] It looks like no thread is working, waiting for 10 seconds to make sure...
    INFO [Thread-1] No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
    INFO [Thread-1] All of the crawlers are stopped. Finishing the process...
    INFO [Thread-1] Waiting for 10 seconds before final clean up...
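If the crawl visits fewer pages than expected, it can help to test the Step 3 filter logic on its own, without starting a crawl. The sketch below copies the FILTERS pattern and the condition from shouldVisit into a plain Java class. The class name FilterCheck and the helper wouldVisit are hypothetical, and the helper takes a plain String instead of a crawler4j WebURL so that it runs without the library.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; FilterCheck and wouldVisit are not crawler4j names.
class FilterCheck {

    // Same pattern as FILTERS in BasicCrawler (Step 3).
    static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Mirrors the body of shouldVisit, but on a plain String URL.
    static boolean wouldVisit(String rawUrl) {
        String href = rawUrl.toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.ics.uci.edu/~lopes/",   // ordinary page
            "http://www.ics.uci.edu/logo.PNG",  // image; the extension is lower-cased first
            "https://www.ics.uci.edu/"          // https does not start with "http://"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + wouldVisit(s));
        }
    }
}
```

With the condition as written in Step 3, only the first sample is accepted. The https URL is rejected by the startsWith("http://") check, so seeds or links served over https need a correspondingly adjusted prefix test.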
