Introduction to Natural Language Processing April 2022 1.Introduction Mahmoud Daif 1 Self Introduction • • • • • • • • 2013 graduated from Cairo University 2020 graduated from Hosei University, Tokyo, Japan 2021 co-founded zData Also, currently working as an algorithm engineer in the autonomous vehicles industry. Company website: https://zdata.ai/ Linkedin: https://www.linkedin.com/in/mahmouddaif/ Youtube:https://www.youtube.com/channel/UCF99zMIai zUjoXUH6eA1j8w email: daif@zdata.ai 2 Course Contents • • • • • • • • • Intro to NLP Regular Expressions Intro to Machine Learning Language Modeling Text Categorization Sequence to Sequence Models Machine Translation Practical Exercises from NLP100 https://nlp100.github.io/en/ Paper Reading 3 Development Environment • • • OS, it’s recommended to use linux. Python 3, recommended to use Anaconda https://www.anaconda.com/products/distribution Editor, any, I personally prefer vscode https://code.visualstudio.com/download 4 What is NLP? ■ Natural Language Processing (NLP) is a field in Artificial Intelligence (AI) devoted to creating computers that use natural language as input and/or output. 5 Why NLP? ■ ■ To interact with computing devices using human (natural) languages. For example, ■ Building intelligent robots (AI). ■ Enabling voice-controlled operation. To access (large amount of) information and knowledge stored in the form of human languages quickly. 6 Early days of NLP: Machines that Can Speak ■ HAL 9000 in “2001: A Space Odyssey” 7 Machines that Can Speak (cont.) ▪ C3PO in Star Wars ■ KITT in Knight Rider 8 But Still a Sci-Fi… ■ “However, as 2001 approached it became clear that 2001's predictions in computer technology were far fetched. Natural language, lip reading, planning and plain common sense in computers were still the stuff of science fiction.” HAL 9000 - Wikipedia 9 Test for Intelligence – Turing Test ▪ Alan Turing (1950) proposed a test of a machine's capability to perform human-like conversation. ■ A human judge engages in a natural language conversation with two other parties, one a human and the other a machine; if the judge cannot reliably tell which is which, then the machine is said to pass the test. 10 Exercise: 00. Reversed string ▪ Obtain the string that arranges letters of the string “stressed” in reverse order (tail to head). 11 Exercise: 00. Reversed string solution txt = "stressed"[::-1] print(txt) 12 Early Conversational Programs https://web.njit.edu/~ronkowit/eliza.html • ELIZA (by Joseph Weizenbaum), 1966 – A psychotherapist – No real understanding; simple pattern-matching to respond to user input -> canned responses (my ?x depresses me) (why does your ?x depress you) 5 (life ?x) (why do you say it ?x) 3 (I could ?x) (you could ?x) 2 (because ?x) (that is a good reason) 3 (?x) (tell me more) 0 13 >> Hello. How do you do. Please state your problem. >> Men are all alike. In what way ? >> They're always bugging us about something or other. I'm not sure I understand you fully. >> They're always bugging us about something or other. Can you think of a specific example ? >> Well, my boyfriend made me come here. Your boyfriend made you come here ? >> He says I'm depressed much of the time. I am sorry to hear that you are depressed. 14 Modern NLP: NLP in the Days of Big Data Three trends: 1. An enormous amount of information is now available in machine readable form as natural language text (newspapers, web pages, medical records, financial filings, product reviews, discussion forums, etc.) 2. Conversational agents are becoming an important form of human-computer communication 3. Much of human-human interaction is now mediated by computers via social media 15 NLP Applications • Three prominent application areas: ⮚Text analytics/mining (from “unstructured data”) ▪ Sentiment analysis ▪ Topic identification ⮚Conversational agents ▪ Siri, Cortana, Amazon Alexa, Google Assistant ▪ Chatbots ⮚Machine translation 16 Text Analytics • Data-mining of weblogs, microblogs, discussion forums, user reviews, and other forms of user-generated media. 17 Text Analytics (cont.) • Typically this involves the extraction of limited kinds of information from texts – Entity mentions – Sentiment 18 Demo • Sentiment Analysis with Python NLTK Text Classification – http://text-processing.com/demo/sentiment/ • Tweet Sentiment Visualization Tool – https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/ 19 Conversational Agents • Combine – Speech recognition/synthesis – Question answering • From the web and from structured information sources (freebase, dbpedia, yago, etc.) – Simple agent-like abilities • • • • Create/edit calendar entries Reminders Directions Invoking/interacting with other apps 20 Mitsuku https://chat.kuki.ai/chat 21 Question Answering • Traditional information retrieval provides documents/resources that provide users with what they need to satisfy their information needs. • Question answering on the other hand directly provides an answer to information needs posed as questions. 22 IBM Watson https://www.youtube.com/watch?v=WFR3lOm_xhE 23 Machine Translation • • The automatic translation of texts between languages is one of the oldest non-numerical applications in Computer Science. In the past 15 years or so, MT has gone from a niche academic curiosity to a robust commercial industry. 24 Exercise: 01. “schooled” ▪ Obtain the string that concatenates the 1st, 3rd, 5th, and 7th letters in the string “schooled”. 25 Exercise: 01. “schooled” solution string = 'schooled' str_to_list = list(string) indices = [0, 2, 4, 6] target_char = [str_to_list[i] for i in indices] string = ''.join(target_char) print(string) 26 Exercise: 02. “shoe” + “cold” = “schooled” ▪ Obtain the string “schooled” by concatenating the letters in “shoe” and “cold” one after the other from head to tail. 27 Exercise: 02. “shoe” + “cold” = “schooled” solution string_1 = 'shoe' string_2 = 'cold' string_1 = list(string_1) string_2 = list(string_2) result = [] for i in range(len(string_1)): result.append(string_1[i]) result.append(string_2[i]) result = ''.join(result) print(result) 28 Text Mining Applications – Unsupervised • Text clustering • Trend analysis Trend for the Term “text mining” from Google Trends Cluster No. Comment Key Words 1 1, 3, 4 doctor, staff, friendly, helpful 2 5, 6, 8 treatment, results, time, schedule 3 2, 7 service, clinic, fast 29 Text Mining Applications – Supervised – Many typical predictive modeling or classification applications can be enhanced by incorporating textual data in addition to traditional input variables. • churning propensity models that include customer center notes, website forms, emails, and Twitter messages • hospital admission prediction models incorporating medical records notes as a new source of information • insurance fraud modeling using adjustor notes • sentiment categorization (next page) • stylometry or forensic applications that identify the author of a particular writing sample Sentiment Analysis • The field of sentiment analysis deals with categorization (or classification) of opinions expressed in textual documents Green color represents positive tone, red color represents negative tone, and product features and model names are highlighted in blue and brown, respectively. 31 NLP Tasks • NLP applications require several NLP analyses: – Word tokenization – Sentence boundary detection – Part-of-speech (POS) tagging • to identify the part-of-speech (e.g. noun, verb) of each word – Named Entity (NE) recognition • to identify proper nouns (e.g. names of person, location, organization; domain terminologies) – Parsing • to identify the syntactic structure of a sentence – Semantic analysis • to derive the meaning of a sentence 32 1. Part-Of-Speech (POS) Tagging • POS tagging is a process of assigning a POS or lexical class marker to each word in a sentence (and all sentences in a corpus). Input: Output: the lead paint is unsafe the/Det lead/N paint/N is/V unsafe/Adj 33 2. Named Entity Recognition (NER) • NER is to process a text and identify named entities in a sentence – e.g. “U.N. official Ekeus heads for Baghdad.” 34 3. Shallow Parsing • Shallow (or Partial) parsing identifies the (base) syntactic phases in a sentence. [NP He] [v saw] [NP the big dog] • After NEs are identified, dependency parsing is often applied to extract the syntactic/dependency relations between the NEs. [PER Bill Gates] founded [ORG Microsoft]. found nsubj Bill Gates dobj Dependency Relations nsubj(Bill Gates, found) dobj(found, Microsoft) Microsoft 35 4. Information Extraction (IE) • • • Identify specific pieces of information (data) in an unstructured or semi-structured text Transform unstructured information in a corpus of texts or web pages into a structured database (or templates) Applied to various types of text, e.g. – Newspaper articles – Scientific articles – Web pages – etc. 36 But NLP very is hard.. • Understanding natural languages is hard … because of inherent ambiguity • Engineering NLP systems is also hard … because of: – Huge amount of data resources needed (e.g. grammar, dictionary, documents to extract statistics from) – Computational complexity (intractable) of analyzing a sentence 37 Ambiguity (1) “Get the cat with the gloves.” 38 Ambiguity (2) Find at least 5 meanings of this sentence: “I made her duck” 1. 2. 3. 4. 5. I cooked waterfowl for her benefit (to eat) I cooked waterfowl belonging to her I created the (plaster?) duck she owns I caused her to quickly lower her head or body I waved my magic wand and turned her into undifferentiated waterfowl 39 Ambiguity (3) Some ambiguous headlines • • • • • Juvenile Court to Try Shooting Defendant Teacher Strikes Idle Kids Kids Make Nutritious Snacks Bush Wins on Budget, but More Lies Ahead Hospitals are Sued by 7 Foot Doctors 40 Ambiguity is Pervasive • Phonetics – – – – – – – – – – I mate or duck I’m eight or duck Eye maid; her duck Aye mate, her duck I maid her duck I’m aid her duck I mate her duck I’m ate her duck I’m ate or duck I mate or duck Sound like “I made her duck” 41 The Bottom Line • Complete NL Understanding (thus general intelligence) is impossible. • But we can make incremental progress. • Also we have made successes in limited domains. 42 The Big Picture Approach All of these applications operate by exploiting underlying regularities in human languages. Sometimes in complex ways, sometimes in pretty trivial ways. Language structure Formal models Practical applications 43 Exercise: 03. Pi Split the sentence “Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics”. into words, and create a list whose element presents the number of alphabetical letters in the corresponding word. 44 Exercise: 03. Pi solution string = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics' str_to_list = string.split(' ') def count_words(word): result = {} for i in range(len(word)): if word[i] not in result: result[word[i]] = 0 result[word[i]] += 1 return result char_count_for_each = [] for word in str_to_list: count = count_words(word) char_count_for_each.append(count) print(char_count_for_each 45 Homework • Problems 4-9 https://nlp100.github.io/en/ch01.html 46 Introduction to Natural Language Processing April 2022 Mahmoud Daif 2. Regular Expressions 47 Document Search • ‘Information Retrieval (IR)’ implies a query (e.g. search terms) – For a given query, relevant or similar documents are returned. • But most basic document retrieval technique is keyword/search term matching. – Retrieve all (or selected) documents which contain the search terms -by string matching – Python example: >>> s1 = 'public' >>> s2 = 'public' >>> s2 == s1 True https://drive.google.com/file/d/11SBASI_8WV3hd954XcwQ0CC0PqAYiqN/view?usp= sharing myword = “month python” with open("textfile.txt") as openfile: for line in openfile: if myword in line: print(line) 48 Regular Expressions and Text Searching • Regular expressions are a compact textual representation of a set of strings that constitute a language – In the simplest case, regular expressions describe regular languages • Here, a language means a set of strings given some alphabet. • Extremely versatile and widely used technology – Emacs, vi, perl, grep, etc. 49 Example • Find all the instances of the word “the” in a text. – /the/ – /[tT]he/ – /\b[tT]he\b/ 50 String Matching Using Patterns • • Often, we wish to find a substring which matches a pattern e.g. E-mail addresses: 1. Any number of alphanumeric characters and/or dots (not a dot at beginning or end) 2. @ 3. Any number of alphanumeric characters and/or dots (not a dot at beginning or end); must be at least one dot • Examples: – valid: tomuro@cs.depaul.edu, noriko.tomuro@gmail.com – Invalid: .tomuro@cs.depaul.edu, tomuro@depaul • But if you want to specify search words by patterns, regular expressions are commonly used. 51 Regular Expressions (1) Regular expression is an algebra for defining patterns. For example, a regular expression “a*b” matches with a string “aaaab”. But without going through the formal definitions, here is a (partial) summary. 1. Simple Patterns – – – – Characters match themselves. Note the chars are case-sensitive. Metacharacters – not to be used literally _as is_ .^$*+?{}[]\|() To use a metacharacter, a back-slash has to be given before it \. \^ \+ etc. Other special characters \t, \n, etc. 52 Regular Expressions (2) 2. Character classes – – – [abc] – a, b, or c [^abc] – any character except a, b, or c. [a-zA-Z] – a through z, or A through Z inclusive (range) 3. Predefined character classes – – – – – – – . (dot) – any character \d – a digit ([0-9]) \D – a non-digit ([^0-9]) \s – a whitespace character (e.g. space, \t, \n, \r) \S – a non-whitespace character \w – a word character ([a-zA-Z_0-9]) \W – a non-word character ([^\w]) 4. Boundary matchers – – ^ -- the beginning of a line $ -- the end of a line 53 Regular Expressions (3) 5. Greedy quantifiers – – – – – X? – X, once or not at all Z* -- X, zero or more times X+ -- X, one or more times X{n} – X, exactly n times X{n,m} – X, at least n but no more than m times 6. Logical operators – – – XY – X followed by Y X|Y – either X or Y (X) – X, as a capturing group 54 Regular Expression in Python (1) • • Regular expressions are in the ‘re’ package. Notation for patterns is slightly different from other languages – using raw string as an alternative to Regular string. Regular String "ab*" "\\\\section" "\\w+\\s+\\1" • Raw string r"ab*" r"\\section" r"\w+\s+\1" First compile an expression (into an re object). Then match it against a string. – >>> import re >>> p = re.compile('ab*') – >>> print('lang\tver\nPython\t3') – >>> print(r'lang\tver\nPython\t3') 55 Regular Expression in Python (2) • Matching a re object against a string is done in several ways. Method/Attribute match() search() findall() finditer() Purpose Determine if the RE matches at the beginning of the string. Scan through a string, looking for any location where this RE matches. Find all substrings where the RE matches, and returns them as a list. Find all substrings where the RE matches, and returns them as aniterator. 56 >>> import re >>> sent = "This book on tennis cost $3.99 at Walmart." >>> p1 = re.compile("ten") >>> m1 = p1.match(sent) >>> m1 >>> p2 = re.compile(".*ten.*") >>> m2 = p2.match(sent) >>> m2 <_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'> >>> m3 = re.search(p1,sent) >>> m3 <_sre.SRE_Match object; span=(13, 16), match='ten'> >>> m4 = re.search(p2,sent) >>> m4 <_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'> >>> pp1 = re.compile("is") >>> m5 = re.findall(pp1, sent) >>> m5 ['is', 'is'] >>> pp2 = re.compile("\\d") >>> m6 = re.search(pp2, sent) >>> m6 <_sre.SRE_Match object; span=(26, 27), match='3'> >>> pp3 = re.compile("\\d+") >>> m7 = re.search(pp3, sent) >>> m7 <_sre.SRE_Match object; span=(26, 27), match='3'> 57 >>> pp3 = re.compile("\\$\\d+\\.\\d\\d") >>> m8 = re.search(pp3, sent) >>> m8 <_sre.SRE_Match object; span=(25, 30), match='$3.99'> >>> pp4 = re.compile(r"\$\d+\.\d\d") >>> m9 = re.search(pp4, sent) >>> m9 <_sre.SRE_Match object; span=(25, 30), match='$3.99'> 58 Regular Expression in Python (3) • • Grouping – You can retrieve the matched substrings using parentheses. Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups: – – – – • ((A)(B(C))) (A) (B(C)) (C) Group zero always stands for the entire expression. 59 >>> ppp1 = re.compile("(\\w+) cost (\\$\\d+\\.\\d\\d)") >>> mm1 = re.search(ppp1, sent) >>> mm1 <_sre.SRE_Match object; span=(13, 30), match='tennis cost $3.99'> >>> mm1.group(0) 'tennis cost $3.99' >>> mm1.group(1) 'tennis' >>> mm1.group(2) '$3.99' 60 TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm61 Assignment • • NLP 100 20, 21, 22, 23 https://nlp100.github.io/en/ch03.html 62 63 Introduction to Python Lecture 3 Mahmoud Daif Overview ∙ ∙ ∙ ∙ History Installing & Running Python Names & Assignment Sequences types: Lists, Tuples, and Strings ∙ Mutability 64 Brief History of Python ∙ Invented in the Netherlands, early 90s by Guido van Rossum ∙ Named after Monty Python ∙ Open sourced from the beginning ∙ Considered a scripting language, but is much more ∙ Scalable, object oriented and functional from the beginning ∙ Used by Google from the beginning ∙ Increasingly popular 65 Python’s Benevolent Dictator For Life “Python is an experiment in how much freedom programmers need. Too much freedom and nobody can read another's code; too little and expressiveness is endangered.” - Guido van Rossum 66 http://docs.python.org/ 67 The Python tutorial is good! 68 69 Running Python The Python Interpreter ∙ Typical Python implementations offer both an interpreter and compiler ∙ Interactive interface to Python with a read-eval-print loop [finin@linux2 ~]$ python Python 2.4.3 (#1, Jan 14 2008, 18:32:40) [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> def square(x): ... return x * x ... >>> map(square, [1, 2, 3, 4]) [1, 4, 9, 16] >>> 70 Installing ∙ Python is pre-installed on most Unix systems, including Linux and MAC OS X ∙ The pre-installed version may not be the most recent one (2.6.2 and 3.1.1 as of Sept 09) ∙ Download from https://www.anaconda.com/products/distribution ∙ Python comes with a large library of standard modules ∙ There are several options for an IDE • IDLE – works well with Windows • Eclipse with Pydev (http://pydev.sourceforge.net/) • Spyder 71 IDLE Development Environment ∙ IDLE is an Integrated DeveLopment Environment for Python, typically used on Windows ∙ Multi-window text editor with syntax highlighting, auto-completion, smart indent and other. ∙ Python shell with syntax highlighting. ∙ Integrated debugger with stepping, persistent breakpoints, and call stack visi72 bility Running Interactively on UNIX On Unix… % python >>> 3+3 6 ∙ Python prompts with ‘>>>’. ∙ To exit Python (not Idle): • In Unix, type CONTROL-D • In Windows, type CONTROL-Z + <Enter> • Evaluate exit() 73 Python Scripts ∙ When you call a python program from the command line the interpreter evaluates each expression in the file ∙ Familiar mechanisms are used to provide command line arguments and/or redirect input and output ∙ Python also has mechanisms to allow a python program to act both as a script and as a module to be imported and used by another python program 74 Example of a Script #! /usr/bin/python """ reads text from standard input and outputs any email addresses it finds, one to a line. """ import re from sys import stdin # a regular expression ~ for a valid email address pat = re.compile(r'[-\w][-.\w]*@[-\w][-\w.]+[a-zA-Z]{2,4}') for line in stdin.readlines(): for address in pat.findall(line): print(address) 75 results python> python email0.py <email.txt bill@msft.com gates@microsoft.com steve@apple.com bill@msft.com python> 76 Getting a unique, sorted list import re from sys import stdin pat = re.compile(r'[-\w][-.\w]*@[-\w][-\w.]+[a-zA-Z]{2,4}’) # found is an initially empty set (a list w/o duplicates) found = set( ) for line in stdin.readlines(): for address in pat.findall(line): found.add(address) # sorted() takes a sequence, returns a sorted list of its elements for address in sorted(found): print(address) 77 results python> python email2.py <email.txt bill@msft.com gates@microsoft.com steve@apple.com python> 78 Simple functions: ex.py """factorial done recursively and iteratively""" def fact1(n): ans = 1 for i in range(2,n): ans = ans * n return ans def fact2(n): if n < 1: return 1 else: 79 Simple functions: ex.py 671> python Python 2.5.2 … >>> import ex >>> ex.fact1(6) 1296 >>> ex.fact2(200) 78865786736479050355236321393218507…000000L >>> ex.fact1 <function fact1 at 0x902470> >>> fact1 Traceback (most recent call last): File "<stdin>", line 1, in <module> 80 NameError: name 'fact1' is not defined 81 The Basics A Code Sample (in IDLE) x = 34 - 23 # A comment. y = “Hello” # Another one. z = 3.45 if z == 3.45 or y == “Hello”: x=x+1 y = y + “ World” # String concat. print(x) print(y) 82 Enough to Understand the Code ∙ Indentation matters to code meaning • Block structure indicated by indentation ∙ First assignment to a variable creates it • Variable types don’t need to be declared. • Python figures out the variable types on its own. ∙ Assignment is = and comparison is == ∙ For numbers + - * / % are as expected • Special use of + for string concatenation and % for string formatting (as in C’s printf) ∙ Logical operators are words (and, or, not) not symbols ∙ The basic printing command is print() 83 Basic Datatypes ∙ Integers (default for numbers) z = 5 / 2 # Answer 2, integer division ∙ Floats x = 3.456 ∙ Strings • Can use “” or ‘’ to specify with “abc” == ‘abc’ • Unmatched can occur within the string: “matt’s” • Use triple double-quotes for multi-line strings or strings than contain both ‘ and “ inside of them: “““a‘b“c””” 84 Whitespace Whitespace is meaningful in Python: especially indentation and placement of newlines ∙ Use a newline to end a line of code Use \ when must go to next line prematurely ∙ No braces {} to mark blocks of code, use consistent indentation instead • First line with less indentation is outside of the block • First line with more indentation starts a nested block ∙ Colons start of a new block in many constructs, e.g. function definitions, then clauses 85 Comments ∙ Start comments with #, rest of line is ignored ∙ Can include a “documentation string” as the first line of a new function or class you define ∙ Development environments, debugger, and other tools use it: it’s good style to include one def fact(n): “““fact(n) assumes n is a positive integer and returns facorial of n.””” assert(n>0) return 1 if n==1 else n*fact(n-1) 86 Assignment ∙ Binding a variable in Python means setting a name to hold a reference to some object • Assignment creates references, not copies ∙ Names in Python do not have an intrinsic type, objects have types • Python determines the type of the reference automatically based on what data is assigned to it ∙ You create a name the first time it appears on the left side of an assignment expression: x=3 ∙ A reference is deleted via garbage collection after any names bound to it have passed out of scope ∙ Python uses reference semantics (more later) 87 Naming Rules ∙ Names are case sensitive and cannot start with a number. They can contain letters, numbers, and underscores. bob Bob _bob _2_bob_ bob_2 BoB ∙ There are some reserved words: and, assert, break, class, continue, def, del, elif, else, except, exec, finally, for, from, global, if, import, in, is, lambda, not, or, pass, print, raise, return, try, while 88 Naming conventions The Python community has these recommended naming conventions ∙ joined_lower for functions, methods and, attributes ∙ joined_lower or ALL_CAPS for constants ∙ StudlyCaps for classes ∙ camelCase only to conform to pre-existing conventions ∙ Attributes: interface, _internal, __private 89 Assignment ∙ You can assign to multiple names at the same time >>> x, y = 2, 3 >>> x 2 >>> y 3 This makes it easy to swap values >>> x, y = y, x ∙ Assignments can be chained >>> a = b = x = 2 90 Accessing Non-Existent Name Accessing a name before it’s been properly created (by placing it on the left side of an assignment), raises an error >>> y Traceback (most recent call last): File "<pyshell#16>", line 1, in -toplevely NameError: name ‘y' is not defined >>> y = 3 >>> y 3 91 92 Sequence types: Tuples, Lists, and Strings Sequence Types 1. Tuple: (‘john’, 32, [CMSC]) ∙ A simple immutable ordered sequence of items ∙ Items can be of mixed types, including collection types 2. Strings: “John Smith” • Immutable • Conceptually very much like a tuple 3. List: [1, 2, ‘john’, (‘up’, ‘down’)] ∙ Mutable ordered sequence of items of mixed types 93 Similar Syntax ∙ All three sequence types (tuples, strings, and lists) share much of the same syntax and functionality. ∙ Key difference: • Tuples and strings are immutable • Lists are mutable ∙ The operations shown in this section can be applied to all sequence types • most examples will just show the operation performed on one 94 Sequence Types 1 ∙ Define tuples using parentheses and commas >>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’) ∙ Define lists are using square brackets and commas >>> li = [“abc”, 34, 4.34, 23] ∙ Define strings using quotes (“, ‘, or “““). >>> st >>> st >>> st string = “Hello World” = ‘Hello World’ = “““This is a multi-line that uses triple quotes.””” 95 Sequence Types 2 ∙ Access individual members of a tuple, list, or string using square bracket “array” notation ∙ Note that all are 0 based… >>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’) >>> tu[1] # Second item in the tuple. ‘abc’ >>> li = [“abc”, 34, 4.34, 23] >>> li[1] # Second item in the list. 34 >>> st = “Hello World” >>> st[1] # Second character in string. ‘e’ 96 Positive and negative indices >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) Positive index: count from the left, starting with 0 >>> t[1] ‘abc’ Negative index: count from right, starting with –1 >>> t[-3] 4.56 97 Slicing: return copy of a subset >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) Return a copy of the container with a subset of the original members. Start copying at the first index, and stop copying before second. >>> t[1:4] (‘abc’, 4.56, (2,3)) Negative indices count from end >>> t[1:-1] (‘abc’, 4.56, (2,3)) 98 Slicing: return copy of a =subset >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) Omit first index to make copy starting from beginning of the container >>> t[:2] (23, ‘abc’) Omit second index to make copy starting at first index and going to end >>> t[2:] (4.56, (2,3), ‘def’) 99 Copying the Whole Sequence ∙ [ : ] makes a copy of an entire sequence >>> t[:] (23, ‘abc’, 4.56, (2,3), ‘def’) ∙ Note the difference between these two lines for mutable sequences >>> l2 = l1 # Both refer to 1 ref, # changing one affects both >>> l2 = l1[:] # Independent copies, two refs 100 The ‘in’ Operator ∙ Boolean test whether a value is inside a container: >>> t >>> 3 False >>> 4 True >>> 4 False = [1, 2, 4, 5] in t in t not in t ∙ For strings, tests for substrings >>> a = 'abcde' >>> 'c' in a True >>> 'cd' in a True >>> 'ac' in a False ∙ Be careful: the in keyword is also used in the syntax of for loops and list comprehensions 101 The + Operator The + operator produces a new tuple, list, or string whose value is the concatenation of its arguments. >>> (1, 2, 3) + (4, 5, 6) (1, 2, 3, 4, 5, 6) >>> [1, 2, 3] + [4, 5, 6] [1, 2, 3, 4, 5, 6] >>> “Hello” + “ ” + “World” ‘Hello World’ 102 The * Operator ∙ The * operator produces a new tuple, list, or string that “repeats” the original content. >>> (1, 2, 3) * 3 (1, 2, 3, 1, 2, 3, 1, 2, 3) >>> [1, 2, 3] * 3 [1, 2, 3, 1, 2, 3, 1, 2, 3] >>> “Hello” * 3 ‘HelloHelloHello’ 103 104 Mutability: Tuples vs. Lists Lists are mutable >>> li = [‘abc’, 23, 4.34, 23] >>> li[1] = 45 >>> li [‘abc’, 45, 4.34, 23] ∙ We can change lists in place. ∙ Name li still points to the same memory reference when we’re done. 105 Tuples are immutable >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) >>> t[2] = 3.14 Traceback (most recent call last): File "<pyshell#75>", line 1, in -topleveltu[2] = 3.14 TypeError: object doesn't support item assignment ∙ You can’t change a tuple. ∙ You can make a fresh tuple and assign its reference to a previously used name. >>> t = (23, ‘abc’, 3.14, (2,3), ‘def’) ∙ The immutability of tuples means they’re faster than lists. 106 Operations on Lists Only >>> li = [1, 11, 3, 4, 5] >>> li.append(‘a’) # Note the method syntax >>> li [1, 11, 3, 4, 5, ‘a’] >>> li.insert(2, ‘i’) >>>li [1, 11, ‘i’, 3, 4, 5, ‘a’] 107 The extend method vs + ∙ + creates a fresh list with a new memory ref ∙ extend operates on list li in place. >>> li.extend([9, 8, 7]) >>> li [1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7] ∙ Potentially confusing: • extend takes a list as an argument. • append takes a singleton as an argument. >>> li.append([10, 11, 12]) >>> li [1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7, [10, 108 11, 12]] Operations on Lists Only Lists have many methods, including index, count, remove, reverse, sort >>> li = [‘a’, ‘b’, ‘c’, ‘b’] >>> li.index(‘b’) # index of 1st occurrence 1 >>> li.count(‘b’) # number of occurrences 2 >>> li.remove(‘b’) # remove 1st occurrence >>> li [‘a’, ‘c’, ‘b’] 109 Operations on Lists Only >>> li = [5, 2, 6, 8] >>> li.reverse() >>> li [8, 6, 2, 5] # reverse the list *in place* >>> li.sort() >>> li [2, 5, 6, 8] # sort the list *in place* >>> li.sort(some_function) # sort in place using user-defined comparison 110 Summary: Tuples vs. Lists ∙ Lists slower but more powerful than tuples • Lists can be modified, and they have lots of handy operations and methods • Tuples are immutable and have fewer features ∙ To convert between tuples and lists use the list() and tuple() functions: li = list(tu) tu = tuple(li) 111 Assignment Delivery ● ● ● ● TA Eng. Wafaa w.momen@fci-cu.edu.eg TA Eng. Ibrahim I.gomaa@fci-cu.edu.eg 3 assignments in total you will get 4, but you can choose to only deliver 3 ● Assignments will be uploaded to the blackboard. 112 Lecture 4: Sentiment Analysis by NLTK Mahmoud Daif AFINN-111 https://github.com/fnielsen/afinn/blob/m aster/afinn/data/AFINN-111.txt A list of words rated between (-5) negative to (5) positive! Lec 5