Natural language processing tools
Lê Đức Trọng

Crawler and Parser tools
• Crawler tools:
  • crawler4j: http://code.google.com/p/crawler4j/
  • HttpClient: http://hc.apache.org/httpclient-3.x/
• Parser tools:
  • HTML Parser: http://htmlparser.sourceforge.net/
  • Jsoup HTML parser: http://jsoup.org/
  • Neko HTML parser: http://nekohtml.sourceforge.net/

Vietnamese NLP – Tools
• JVnTextPro: http://sourceforge.net/projects/jvntextpro/
  • Sentence segmentation, sentence tokenization, word segmentation, POS tagging
• VnToolkit: http://www.loria.fr/~lehong/softwares.php
  • An automatic tagger for Vietnamese texts
  • A tokenizer for automatic word segmentation of Vietnamese texts
  • A sentence detector for automatically detecting sentences in Vietnamese texts
• VLSP Tools: http://vlsp.vietlp.org:8080/demo/?page=resources
  • Vietnamese chunking

NLP Toolkits
• LingPipe: http://alias-i.com/lingpipe/
  • Find the names of people, organizations or locations in news
  • Automatically classify Twitter search results into categories
  • Suggest correct spellings of queries
• Mallet (Machine Learning for Language Toolkit): http://mallet.cs.umass.edu/
  • Statistical natural language processing, document classification, clustering, topic modeling, information extraction
• Stanford NLP software: http://www-nlp.stanford.edu/software/
  • Word segmentation, part-of-speech tagging, named entity recognition, chunking, parsing, classification and coreference resolution
• NLTK: http://www.nltk.org/
  • Open-source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics
• OpenNLP: http://opennlp.apache.org/
  • Tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution

Machine learning libraries
• Conditional random fields (CRF)
  • CRF: http://crf.sourceforge.net/
• Maximum entropy (Maxent)
  • OpenNLP, Mallet
• Support vector machine (SVM)
  • libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • svmLight: http://svmlight.joachims.org/
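
Example (illustrative sketch only)
To make the roles of the tools above concrete, the following minimal sketch walks through the same steps on a tiny scale: extract text from HTML (the job of the parser tools), split it into sentences, and tokenize each sentence (the job of the NLP toolkits). It uses only the Python standard library and none of the listed tools. The names TextExtractor, html_to_text, split_sentences and tokenize are hypothetical helpers introduced here, and the regular-expression rules are naive assumptions, not how any listed toolkit works internally.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Parser step: reduce an HTML string to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def split_sentences(text):
    """Sentence segmentation (naive): break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Tokenization (naive): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


if __name__ == "__main__":
    page = "<html><body><p>Tools help. Try them!</p></body></html>"
    for sentence in split_sentences(html_to_text(page)):
        print(tokenize(sentence))
```

Whitespace-based tokenization like this is not sufficient for Vietnamese, where a single word may consist of several space-separated syllables. That is why dedicated word segmentation tools such as those listed under "Vietnamese NLP – Tools" are needed.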