Text Parsing in Python
Gayatri Nittala - Madhubala Vasireddy

Text Parsing
► The three W's!
► Efficiency and perfection

What is Text Parsing?
► A common programming task
► Extracting or splitting a sequence of characters

Why Text Parsing?
► Simple file parsing
    A tab-separated file
► Data extraction
    Extract specific information from a log file
► Find and replace
► Parsers - syntactic analysis
► NLP
    Extract information from a corpus
    POS tagging

Text Parsing Methods
► String functions
► Regular expressions
► Parsers

String Functions
► Python's built-in string methods
    Faster, easier to understand and maintain
► If you can do it with them, DO IT!
► Different built-in methods
    Find-Replace
    Split-Join
    startswith and endswith
    is-methods

Find and Replace
► find, index, rindex, replace
► Ex: replace a string in all files in a directory

    import fileinput
    import glob
    import sys

    # path, stext and rtext are defined elsewhere
    files = glob.glob(path)
    for line in fileinput.input(files, inplace=1):
        # find() returns -1 when the substring is absent
        if line.find(stext) >= 0:
            line = line.replace(stext, rtext)
        sys.stdout.write(line)

startswith and endswith
► Extract quoted words from the given text

    myString = "\"123\""
    if myString.startswith("\""):
        print "string with double quotes"

► Find whether a sentence is interrogative or exclamatory
    What an amazing game that was!
    Do you like this?

    endings = ('!', '?')
    sentence.endswith(endings)

is-methods
► Check for alphabets, numerals, character case, etc.

    m = 'xxxasdf '
    m.isalpha()        # False - the trailing space is not alphabetic

Regular Expressions
► A concise way to express complex patterns
► Amazingly powerful
► A wide variety of operations
► When you go beyond the simple, think about regular expressions!

Real-world Problems
► Match IP addresses, email addresses, URLs
► Match balanced sets of parentheses
► Substitute words
► Tokenize
► Validate
► Count
► Delete duplicates
► Natural language processing

RE in Python
► Unleash the power - the built-in re module
► Functions
    to compile patterns - compile
    to perform matches - match, search, findall, finditer
    to operate on match objects - group, start, end, span
    to substitute - sub, subn

Metacharacters

Compiling Patterns
► re.compile()
► Patterns for an IP address, from loose to strict
  (a runnable validation sketch follows the Tokenize slide):

    ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
    ^\d+\.\d+\.\d+\.\d+$
    ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
    ^([01]?\d\d?|2[0-4]\d|25[0-5])\.
     ([01]?\d\d?|2[0-4]\d|25[0-5])\.
     ([01]?\d\d?|2[0-4]\d|25[0-5])\.
     ([01]?\d\d?|2[0-4]\d|25[0-5])$

Compiling Patterns
► Patterns for matching parentheses, from loose to strict
  (a balance-checking sketch follows the Tokenize slide):

    \(.*\)
    \([^)]*\)
    \([^()]*\)

Substitute
► Perform several string substitutions on a given string
  (a usage sketch follows the Tokenize slide)

    import re

    def make_xlat(*args, **kwargs):
        adict = dict(*args, **kwargs)
        rx = re.compile('|'.join(map(re.escape, adict)))
        def one_xlate(match):
            return adict[match.group(0)]
        def xlate(text):
            return rx.sub(one_xlate, text)
        return xlate

Count
► Split and count words in the given text

    p = re.compile(r'\W+')
    len(p.split('This is a test for split().'))   # 7 - includes a trailing empty string

Tokenize
► Parsing and natural language processing

    s = 'tokenize these words'
    words = re.compile(r'\b\w+\b|\$')   # a word, or a literal $
    words.findall(s)
    ['tokenize', 'these', 'words']
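Compiling Patterns - Validation Sketch
A minimal sketch exercising the strictest IP-address pattern from the Compiling Patterns slide; the variable names and sample addresses are illustrative, not from the slides.

    import re

    # Strictest form from the slide: every octet is constrained to 0-255
    ip_pat = re.compile(r'^([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                        r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                        r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                        r'([01]?\d\d?|2[0-4]\d|25[0-5])$')

    for candidate in ('192.168.0.1', '256.1.1.1', '10.0.0'):
        if ip_pat.match(candidate):
            print candidate, '- valid'
        else:
            print candidate, '- invalid'

The looser patterns on the slide also accept 256.1.1.1; only the strict form limits each octet to the range 0-255.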
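Real-world Problems - Balanced Parentheses Sketch
The pattern \([^()]*\) matches only an innermost pair, and a plain re pattern cannot decide balance for arbitrary nesting on its own. A small sketch (sample strings invented for illustration) that deletes innermost pairs repeatedly until none remain:

    import re

    inner = re.compile(r'\([^()]*\)')   # an innermost balanced pair

    def is_balanced(text):
        # Strip innermost pairs until none are left; any parenthesis
        # that survives had no partner.
        while inner.search(text):
            text = inner.sub('', text)
        return '(' not in text and ')' not in text

    print is_balanced('a(b(c)d)e')   # True
    print is_balanced('a(b(c)d')     # False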
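Substitute - Using make_xlat
A short usage sketch for the make_xlat recipe on the Substitute slide; the mapping and the sample sentence are invented for illustration.

    # Assumes make_xlat from the Substitute slide is in scope
    translate = make_xlat({'cat': 'dog', 'dog': 'cat'})
    print translate('the cat chased the dog')   # -> the dog chased the cat

Because all substitutions happen in a single sub() pass, the swap works; chained replace() calls would turn both words into the same animal.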
Common Pitfalls
► For operations on fixed strings, a single character class, and no case-sensitivity issues, plain string methods are faster:
    re.sub() vs. string.replace()
    re.sub() vs. string.translate()
► match vs. search
► greedy vs. non-greedy

PARSERS
► Flat and nested texts
► Nested tags, programming language constructs
► Better to do less than to do more!

Parsing Non-flat Texts
► Grammar
► States
► Generate tokens and act on them
► Lexer - generates a stream of tokens
► Parser - generates a parse tree out of the tokens
► Lex and Yacc

Grammar vs. RE
► Floating point

    #---- EBNF-style description of Python floats ----#
    floatnumber   ::= pointfloat | exponentfloat
    pointfloat    ::= [intpart] fraction | intpart "."
    exponentfloat ::= (intpart | pointfloat) exponent
    intpart       ::= digit+
    fraction      ::= "." digit+
    exponent      ::= ("e" | "E") ["+" | "-"] digit+
    digit         ::= "0"..."9"

Grammar vs. RE
► The same floating-point definition as a verbose regular expression

    pat = r'''(?x)
    (                      # exponentfloat
      (                    # intpart or pointfloat
        (                  # pointfloat
            (\d+)?[.]\d+   # optional intpart with fraction
          | \d+[.]         # intpart with period
        )                  # end pointfloat
        | \d+              # intpart
      )                    # end intpart or pointfloat
      [eE][+-]?\d+         # exponent
    )                      # end exponentfloat
    | (                    # pointfloat
          (\d+)?[.]\d+     # optional intpart with fraction
        | \d+[.]           # intpart with period
      )                    # end pointfloat
    '''

PLY - The Python Lex and Yacc
► A higher-level and cleaner grammar language
► LALR(1) parsing
► Extensive input validation, error reporting, and diagnostics
► Two modules: lex.py and yacc.py

Using PLY - Lex
► Import the lex module
► Define a list or tuple 'tokens' naming the token types the lexer is allowed to produce
► Define the tokens, by assigning to specially named variables ('t_tokenName')
► Build the lexer

    mylexer = lex.lex()
    mylexer.input(mytext)   # the token stream is consumed by yacc

Lex

    t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

    def t_NUMBER(t):
        r'\d+'
        try:
            t.value = int(t.value)
        except ValueError:
            print "Integer value too large", t.value
            t.value = 0
        return t

    t_ignore = " \t"

Yacc
► Import the yacc module
► Get the token map from a lexer
► Define a collection of grammar rules
► Build the parser

    yacc.yacc()
    yacc.parse('x=3')

Yacc
► Grammar rules live in specially named functions with a 'p_' prefix
  (a complete runnable sketch appears at the end of the deck)

    def p_statement_assign(p):
        'statement : NAME "=" expression'
        names[p[1]] = p[3]

    def p_statement_expr(p):
        'statement : expression'
        print p[1]

Summary
► String functions
    A rule of thumb - if you can do it with them, do it.
► Regular expressions
    For complex patterns - anything beyond the simple!
► Lex and Yacc
    For parsing non-flat texts that follow some rules

References
► http://docs.python.org/
► http://code.activestate.com/recipes/langs/python/
► http://www.regular-expressions.info/
► http://www.dabeaz.com/ply/ply.html
► Mastering Regular Expressions by Jeffrey E. F. Friedl
► Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher
► Text Processing in Python by David Mertz

Thank You
Q&A
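Backup - Lex and Yacc End to End
A minimal runnable sketch (referenced from the Yacc slides) that combines the lex and yacc fragments above into a tiny assignment-and-addition language. It assumes PLY is installed as the 'ply' package; for simplicity it uses an explicit EQUALS token instead of the '"="' literal shown earlier, and the sample inputs are invented for illustration.

    import ply.lex as lex
    import ply.yacc as yacc

    # ---- Lexer ----
    tokens = ('NAME', 'NUMBER', 'EQUALS', 'PLUS')

    t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'
    t_EQUALS = r'='
    t_PLUS   = r'\+'
    t_ignore = ' \t'

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        print "Illegal character", t.value[0]
        t.lexer.skip(1)

    # ---- Parser ----
    names = {}                        # symbol table for assignments

    precedence = (('left', 'PLUS'),)  # resolve the ambiguity in '+'

    def p_statement_assign(p):
        'statement : NAME EQUALS expression'
        names[p[1]] = p[3]

    def p_statement_expr(p):
        'statement : expression'
        print p[1]

    def p_expression_plus(p):
        'expression : expression PLUS expression'
        p[0] = p[1] + p[3]

    def p_expression_number(p):
        'expression : NUMBER'
        p[0] = p[1]

    def p_expression_name(p):
        'expression : NAME'
        p[0] = names.get(p[1], 0)

    def p_error(p):
        print "Syntax error"

    lexer = lex.lex()
    parser = yacc.yacc()
    parser.parse('x = 3 + 4')   # stores 7 in names['x']
    parser.parse('x + 10')      # prints 17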