Uploaded by omaryansona

ERP all lectures +mid

advertisement
Introduction to
Natural Language Processing
April 2022
1.Introduction
Mahmoud Daif
1
Self Introduction
•
•
•
•
•
•
•
•
2013 graduated from Cairo University
2020 graduated from Hosei University, Tokyo, Japan
2021 co-founded zData
Also, currently working as an algorithm engineer in the
autonomous vehicles industry.
Company website: https://zdata.ai/
Linkedin: https://www.linkedin.com/in/mahmouddaif/
Youtube:https://www.youtube.com/channel/UCF99zMIai
zUjoXUH6eA1j8w
email: daif@zdata.ai
2
Course Contents
•
•
•
•
•
•
•
•
•
Intro to NLP
Regular Expressions
Intro to Machine Learning
Language Modeling
Text Categorization
Sequence to Sequence Models
Machine Translation
Practical Exercises from NLP100
https://nlp100.github.io/en/
Paper Reading
3
Development Environment
•
•
•
OS, it’s recommended to use linux.
Python 3, recommended to use Anaconda
https://www.anaconda.com/products/distribution
Editor, any, I personally prefer vscode
https://code.visualstudio.com/download
4
What is NLP?
■
Natural Language Processing (NLP) is a field in
Artificial Intelligence (AI) devoted to creating
computers that use natural language as input
and/or output.
5
Why NLP?
■
■
To interact with computing devices using human
(natural) languages. For example,
■ Building intelligent robots (AI).
■ Enabling voice-controlled operation.
To access (large amount of) information and
knowledge stored in the form of human languages
quickly.
6
Early days of NLP:
Machines that Can Speak
■
HAL 9000 in “2001: A Space Odyssey”
7
Machines that Can Speak (cont.)
▪ C3PO
in Star Wars
■
KITT
in Knight Rider
8
But Still a Sci-Fi…
■
“However, as 2001 approached it became clear that
2001's predictions in computer technology were far
fetched. Natural language, lip reading, planning
and plain common sense in computers were still the
stuff of science fiction.”
HAL 9000 - Wikipedia
9
Test for Intelligence – Turing Test
▪ Alan Turing (1950) proposed a
test of a machine's capability to
perform human-like
conversation.
■
A human judge engages in a natural language
conversation with two other parties, one a human
and the other a machine; if the judge cannot
reliably tell which is which, then the machine is said
to pass the test.
10
Exercise: 00. Reversed string
▪
Obtain the string that arranges letters of the
string “stressed” in reverse order (tail to head).
11
Exercise: 00. Reversed string solution
txt = "stressed"[::-1]
print(txt)
12
Early Conversational Programs
https://web.njit.edu/~ronkowit/eliza.html
• ELIZA (by Joseph Weizenbaum), 1966
– A psychotherapist
– No real understanding; simple pattern-matching
to respond to user input -> canned responses
(my ?x depresses me) (why does your ?x depress you) 5
(life ?x) (why do you say it ?x) 3
(I could ?x) (you could ?x) 2
(because ?x) (that is a good reason) 3
(?x) (tell me more) 0
13
>> Hello.
How do you do. Please state your problem.
>> Men are all alike.
In what way ?
>> They're always bugging us about something or other.
I'm not sure I understand you fully.
>> They're always bugging us about something or other.
Can you think of a specific example ?
>> Well, my boyfriend made me come here.
Your boyfriend made you come here ?
>> He says I'm depressed much of the time.
I am sorry to hear that you are depressed.
14
Modern NLP:
NLP in the Days of Big Data
Three trends:
1. An enormous amount of information is now
available in machine readable form as natural
language text (newspapers, web pages, medical
records, financial filings, product reviews, discussion
forums, etc.)
2. Conversational agents are becoming an important
form of human-computer communication
3. Much of human-human interaction is now mediated by
computers via social media
15
NLP Applications
• Three prominent application areas:
⮚Text analytics/mining (from “unstructured data”)
▪ Sentiment analysis
▪ Topic identification
⮚Conversational agents
▪ Siri, Cortana, Amazon Alexa, Google Assistant
▪ Chatbots
⮚Machine translation
16
Text Analytics
• Data-mining of weblogs, microblogs, discussion forums,
user reviews, and other forms of user-generated media.
17
Text Analytics (cont.)
• Typically this involves the extraction of limited kinds of
information from texts
– Entity mentions
– Sentiment
18
Demo
• Sentiment Analysis with Python NLTK Text Classification
– http://text-processing.com/demo/sentiment/
• Tweet Sentiment Visualization Tool
– https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
19
Conversational Agents
• Combine
– Speech recognition/synthesis
– Question answering
• From the web and from structured information sources (freebase,
dbpedia, yago, etc.)
– Simple agent-like abilities
•
•
•
•
Create/edit calendar entries
Reminders
Directions
Invoking/interacting with other apps
20
Mitsuku
https://chat.kuki.ai/chat
21
Question Answering
• Traditional information retrieval provides
documents/resources that provide users with what they
need to satisfy their information needs.
• Question answering on the other hand directly provides
an answer to information needs posed as questions.
22
IBM Watson
https://www.youtube.com/watch?v=WFR3lOm_xhE
23
Machine Translation
•
•
The automatic translation of texts between languages is one of the
oldest non-numerical applications in Computer Science.
In the past 15 years or so, MT has gone from a niche academic
curiosity to a robust commercial industry.
24
Exercise: 01. “schooled”
▪
Obtain the string that concatenates the
1st, 3rd, 5th, and 7th letters in the string
“schooled”.
25
Exercise: 01. “schooled” solution
string = 'schooled'
str_to_list = list(string)
indices = [0, 2, 4, 6]
target_char = [str_to_list[i] for i in indices]
string = ''.join(target_char)
print(string)
26
Exercise: 02. “shoe” + “cold” = “schooled”
▪
Obtain the string “schooled” by concatenating the letters in “shoe” and
“cold” one after the other from head to tail.
27
Exercise: 02. “shoe” + “cold” = “schooled” solution
string_1 = 'shoe'
string_2 = 'cold'
string_1 = list(string_1)
string_2 = list(string_2)
result = []
for i in range(len(string_1)):
result.append(string_1[i])
result.append(string_2[i])
result = ''.join(result)
print(result)
28
Text Mining Applications – Unsupervised
• Text clustering
• Trend analysis
Trend for the Term “text mining” from Google Trends
Cluster
No.
Comment
Key Words
1
1, 3, 4
doctor, staff,
friendly, helpful
2
5, 6, 8
treatment, results,
time, schedule
3
2, 7
service, clinic, fast
29
Text Mining Applications – Supervised
– Many typical predictive modeling or
classification applications can be
enhanced by incorporating textual data in
addition to traditional input variables.
• churning propensity models that include
customer center notes, website forms, emails, and Twitter messages
• hospital admission prediction models
incorporating medical records notes as a
new source of information
• insurance fraud modeling using adjustor
notes
• sentiment categorization (next page)
• stylometry or forensic applications that
identify the author of a particular writing
sample
Sentiment Analysis
•
The field of sentiment analysis deals with categorization (or
classification) of opinions expressed in textual documents
Green color represents positive tone, red color represents negative tone, and
product features and model names are highlighted in blue and brown, respectively.
31
NLP Tasks
• NLP applications require several NLP analyses:
– Word tokenization
– Sentence boundary detection
– Part-of-speech (POS) tagging
• to identify the part-of-speech (e.g. noun, verb) of each word
– Named Entity (NE) recognition
• to identify proper nouns (e.g. names of person, location,
organization; domain terminologies)
– Parsing
• to identify the syntactic structure of a sentence
– Semantic analysis
• to derive the meaning of a sentence
32
1. Part-Of-Speech (POS) Tagging
• POS tagging is a process of assigning a POS or lexical
class marker to each word in a sentence (and all
sentences in a corpus).
Input:
Output:
the lead paint is unsafe
the/Det lead/N paint/N is/V unsafe/Adj
33
2. Named Entity Recognition (NER)
• NER is to process a text and identify named entities in a
sentence
– e.g. “U.N. official Ekeus heads for Baghdad.”
34
3. Shallow Parsing
•
Shallow (or Partial) parsing identifies the (base) syntactic phases in
a sentence.
[NP He] [v saw] [NP the big dog]
•
After NEs are identified, dependency parsing is often applied to
extract the syntactic/dependency relations between the NEs.
[PER Bill Gates] founded [ORG Microsoft].
found
nsubj
Bill Gates
dobj
Dependency Relations
nsubj(Bill Gates, found)
dobj(found, Microsoft)
Microsoft
35
4. Information Extraction (IE)
•
•
•
Identify specific pieces of information (data) in an
unstructured or semi-structured text
Transform unstructured information in a corpus of texts
or web pages into a structured database (or templates)
Applied to various types of text, e.g.
– Newspaper
articles
– Scientific
articles
– Web pages
– etc.
36
But NLP very is hard..
• Understanding natural languages is hard …
because of inherent ambiguity
• Engineering NLP systems is also hard …
because of:
– Huge amount of data resources needed (e.g.
grammar, dictionary, documents to extract
statistics from)
– Computational complexity (intractable) of
analyzing a sentence
37
Ambiguity (1)
“Get the cat with the gloves.”
38
Ambiguity (2)
Find at least 5 meanings of this sentence:
“I made her duck”
1.
2.
3.
4.
5.
I cooked waterfowl for her benefit (to eat)
I cooked waterfowl belonging to her
I created the (plaster?) duck she owns
I caused her to quickly lower her head or body
I waved my magic wand and turned her into
undifferentiated waterfowl
39
Ambiguity (3)
Some ambiguous headlines
•
•
•
•
•
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
40
Ambiguity is Pervasive
• Phonetics
–
–
–
–
–
–
–
–
–
–
I mate or duck
I’m eight or duck
Eye maid; her duck
Aye mate, her duck
I maid her duck
I’m aid her duck
I mate her duck
I’m ate her duck
I’m ate or duck
I mate or duck
Sound like
“I made her duck”
41
The Bottom Line
• Complete NL Understanding (thus general
intelligence) is impossible.
• But we can make incremental progress.
• Also we have made successes in limited domains.
42
The Big Picture Approach
All of these applications operate by exploiting
underlying regularities in human languages.
Sometimes in complex ways, sometimes in pretty trivial
ways.
Language
structure
Formal
models
Practical
applications
43
Exercise: 03. Pi
Split the sentence “Now I need a drink, alcoholic of
course, after the heavy lectures involving quantum
mechanics”. into words, and create a list whose
element presents the number of alphabetical letters in
the corresponding word.
44
Exercise: 03. Pi solution
string = 'Now I need a drink, alcoholic of course, after the
heavy lectures involving quantum mechanics'
str_to_list = string.split(' ')
def count_words(word):
result = {}
for i in range(len(word)):
if word[i] not in result:
result[word[i]] = 0
result[word[i]] += 1
return result
char_count_for_each = []
for word in str_to_list:
count = count_words(word)
char_count_for_each.append(count)
print(char_count_for_each
45
Homework
•
Problems 4-9 https://nlp100.github.io/en/ch01.html
46
Introduction to
Natural Language Processing
April 2022
Mahmoud Daif
2. Regular Expressions
47
Document Search
•
‘Information Retrieval (IR)’ implies a query (e.g. search terms)
– For a given query, relevant or similar documents are returned.
•
But most basic document retrieval technique is keyword/search
term matching.
– Retrieve all (or selected) documents which contain the search terms -by string matching
– Python example:
>>> s1 = 'public'
>>> s2 = 'public'
>>> s2 == s1
True
https://drive.google.com/file/d/11SBASI_8WV3hd954XcwQ0CC0PqAYiqN/view?usp=
sharing
myword = “month python”
with open("textfile.txt") as openfile:
for line in openfile:
if myword in line:
print(line)
48
Regular Expressions and Text
Searching
• Regular expressions are a compact textual
representation of a set of strings that constitute a
language
– In the simplest case, regular expressions describe regular
languages
• Here, a language means a set of strings given some alphabet.
• Extremely versatile and widely used technology
– Emacs, vi, perl, grep, etc.
49
Example
• Find all the instances of the word “the” in a text.
– /the/
– /[tT]he/
– /\b[tT]he\b/
50
String Matching Using Patterns
•
•
Often, we wish to find a substring which matches a pattern
e.g. E-mail addresses:
1. Any number of alphanumeric characters and/or dots (not a dot at
beginning or end)
2. @
3. Any number of alphanumeric characters and/or dots (not a dot at
beginning or end); must be at least one dot
•
Examples:
– valid: tomuro@cs.depaul.edu, noriko.tomuro@gmail.com
– Invalid: .tomuro@cs.depaul.edu, tomuro@depaul
•
But if you want to specify search words by patterns, regular
expressions are commonly used.
51
Regular Expressions (1)
Regular expression is an algebra for defining patterns. For example, a
regular expression “a*b” matches with a string “aaaab”.
But without going through the formal definitions, here is a (partial)
summary.
1. Simple Patterns
–
–
–
–
Characters match themselves. Note the chars are case-sensitive.
Metacharacters – not to be used literally _as is_
.^$*+?{}[]\|()
To use a metacharacter, a back-slash has to be given before it
\. \^ \+ etc.
Other special characters
\t, \n, etc.
52
Regular Expressions (2)
2. Character classes
–
–
–
[abc] – a, b, or c
[^abc] – any character except a, b, or c.
[a-zA-Z] – a through z, or A through Z inclusive (range)
3. Predefined character classes
–
–
–
–
–
–
–
. (dot) – any character
\d – a digit ([0-9])
\D – a non-digit ([^0-9])
\s – a whitespace character (e.g. space, \t, \n, \r)
\S – a non-whitespace character
\w – a word character ([a-zA-Z_0-9])
\W – a non-word character ([^\w])
4. Boundary matchers
–
–
^ -- the beginning of a line
$ -- the end of a line
53
Regular Expressions (3)
5. Greedy quantifiers
–
–
–
–
–
X? – X, once or not at all
Z* -- X, zero or more times
X+ -- X, one or more times
X{n} – X, exactly n times
X{n,m} – X, at least n but no more than m times
6. Logical operators
–
–
–
XY – X followed by Y
X|Y – either X or Y
(X) – X, as a capturing group
54
Regular Expression in Python (1)
•
•
Regular expressions are in the ‘re’ package.
Notation for patterns is slightly different from other languages –
using raw string as an alternative to Regular string.
Regular String
"ab*"
"\\\\section"
"\\w+\\s+\\1"
•
Raw string
r"ab*"
r"\\section"
r"\w+\s+\1"
First compile an expression (into an re object). Then match it
against a string.
– >>> import re
>>> p = re.compile('ab*')
– >>> print('lang\tver\nPython\t3')
– >>> print(r'lang\tver\nPython\t3')
55
Regular Expression in Python (2)
•
Matching a re object against a string is done in several ways.
Method/Attribute
match()
search()
findall()
finditer()
Purpose
Determine if the RE matches at the beginning of the
string.
Scan through a string, looking for any location where this
RE matches.
Find all substrings where the RE matches, and returns
them as a list.
Find all substrings where the RE matches, and returns
them as aniterator.
56
>>> import re
>>> sent = "This book on tennis cost $3.99 at Walmart."
>>> p1 = re.compile("ten")
>>> m1 = p1.match(sent)
>>> m1
>>> p2 = re.compile(".*ten.*")
>>> m2 = p2.match(sent)
>>> m2
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> m3 = re.search(p1,sent)
>>> m3
<_sre.SRE_Match object; span=(13, 16), match='ten'>
>>> m4 = re.search(p2,sent)
>>> m4
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> pp1 = re.compile("is")
>>> m5 = re.findall(pp1, sent)
>>> m5
['is', 'is']
>>> pp2 = re.compile("\\d")
>>> m6 = re.search(pp2, sent)
>>> m6
<_sre.SRE_Match object; span=(26, 27), match='3'>
>>> pp3 = re.compile("\\d+")
>>> m7 = re.search(pp3, sent)
>>> m7
<_sre.SRE_Match object; span=(26, 27), match='3'>
57
>>> pp3 = re.compile("\\$\\d+\\.\\d\\d")
>>> m8 = re.search(pp3, sent)
>>> m8
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
>>> pp4 = re.compile(r"\$\d+\.\d\d")
>>> m9 = re.search(pp4, sent)
>>> m9
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
58
Regular Expression in Python (3)
•
•
Grouping – You can retrieve the matched substrings using
parentheses.
Capturing groups are numbered by counting their opening
parentheses from left to right. In the expression ((A)(B(C))), for
example, there are four such groups:
–
–
–
–
•
((A)(B(C)))
(A)
(B(C))
(C)
Group zero always stands for the entire expression.
59
>>> ppp1 = re.compile("(\\w+) cost (\\$\\d+\\.\\d\\d)")
>>> mm1 = re.search(ppp1, sent)
>>> mm1
<_sre.SRE_Match object; span=(13, 30), match='tennis cost $3.99'>
>>> mm1.group(0)
'tennis cost $3.99'
>>> mm1.group(1)
'tennis'
>>> mm1.group(2)
'$3.99'
60
TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm61
Assignment
•
•
NLP 100 20, 21, 22, 23
https://nlp100.github.io/en/ch03.html
62
63
Introduction to
Python
Lecture 3
Mahmoud Daif
Overview
∙
∙
∙
∙
History
Installing & Running Python
Names & Assignment
Sequences types: Lists, Tuples, and
Strings
∙ Mutability
64
Brief History of Python
∙ Invented in the Netherlands, early 90s
by Guido van Rossum
∙ Named after Monty Python
∙ Open sourced from the beginning
∙ Considered a scripting language, but is
much more
∙ Scalable, object oriented and functional
from the beginning
∙ Used by Google from the beginning
∙ Increasingly popular
65
Python’s Benevolent Dictator For Life
“Python is an experiment in
how much freedom
programmers need. Too much
freedom and nobody can read
another's code; too little and
expressiveness is endangered.”
- Guido van Rossum
66
http://docs.python.org/
67
The Python tutorial is good!
68
69
Running
Python
The Python Interpreter
∙ Typical Python implementations offer
both an interpreter and compiler
∙ Interactive interface to Python with a
read-eval-print loop
[finin@linux2 ~]$ python
Python 2.4.3 (#1, Jan 14 2008, 18:32:40)
[GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def square(x):
... return x * x
...
>>> map(square, [1, 2, 3, 4])
[1, 4, 9, 16]
>>>
70
Installing
∙ Python is pre-installed on most Unix systems,
including Linux and MAC OS X
∙ The pre-installed version may not be the most
recent one (2.6.2 and 3.1.1 as of Sept 09)
∙ Download from https://www.anaconda.com/products/distribution
∙ Python comes with a large library of standard
modules
∙ There are several options for an IDE
• IDLE – works well with Windows
• Eclipse with Pydev (http://pydev.sourceforge.net/)
• Spyder
71
IDLE Development Environment
∙ IDLE is an Integrated DeveLopment
Environment for Python, typically used on
Windows
∙ Multi-window text editor with syntax
highlighting, auto-completion, smart indent
and other.
∙ Python shell with syntax highlighting.
∙ Integrated debugger
with stepping, persistent breakpoints,
and call stack visi72
bility
Running Interactively on UNIX
On Unix…
% python
>>> 3+3
6
∙ Python prompts with ‘>>>’.
∙ To exit Python (not Idle):
• In Unix, type CONTROL-D
• In Windows, type CONTROL-Z + <Enter>
• Evaluate exit()
73
Python Scripts
∙ When you call a python program from the
command line the interpreter evaluates each
expression in the file
∙ Familiar mechanisms are used to provide
command line arguments and/or redirect
input and output
∙ Python also has mechanisms to allow a
python program to act both as a script and as
a module to be imported and used by another
python program
74
Example of a Script
#! /usr/bin/python
""" reads text from standard input and outputs any email
addresses it finds, one to a line.
"""
import re
from sys import stdin
# a regular expression ~ for a valid email address
pat = re.compile(r'[-\w][-.\w]*@[-\w][-\w.]+[a-zA-Z]{2,4}')
for line in stdin.readlines():
for address in pat.findall(line):
print(address)
75
results
python> python email0.py <email.txt
bill@msft.com
gates@microsoft.com
steve@apple.com
bill@msft.com
python>
76
Getting a unique, sorted list
import re
from sys import stdin
pat = re.compile(r'[-\w][-.\w]*@[-\w][-\w.]+[a-zA-Z]{2,4}’)
# found is an initially empty set (a list w/o duplicates)
found = set( )
for line in stdin.readlines():
for address in pat.findall(line):
found.add(address)
# sorted() takes a sequence, returns a sorted list of its elements
for address in sorted(found):
print(address)
77
results
python> python email2.py <email.txt
bill@msft.com
gates@microsoft.com
steve@apple.com
python>
78
Simple functions: ex.py
"""factorial done recursively and iteratively"""
def fact1(n):
ans = 1
for i in range(2,n):
ans = ans * n
return ans
def fact2(n):
if n < 1:
return 1
else:
79
Simple functions: ex.py
671> python
Python 2.5.2 …
>>> import ex
>>> ex.fact1(6)
1296
>>> ex.fact2(200)
78865786736479050355236321393218507…000000L
>>> ex.fact1
<function fact1 at 0x902470>
>>> fact1
Traceback (most recent call last):
File "<stdin>", line 1, in <module> 80
NameError: name 'fact1' is not defined
81
The Basics
A Code Sample (in IDLE)
x = 34 - 23
# A comment.
y = “Hello”
# Another one.
z = 3.45
if z == 3.45 or y == “Hello”:
x=x+1
y = y + “ World” # String concat.
print(x)
print(y)
82
Enough to Understand the Code
∙ Indentation matters to code meaning
• Block structure indicated by indentation
∙ First assignment to a variable creates it
• Variable types don’t need to be declared.
• Python figures out the variable types on its own.
∙ Assignment is = and comparison is ==
∙ For numbers + - * / % are as expected
• Special use of + for string concatenation and % for
string formatting (as in C’s printf)
∙ Logical operators are words (and, or,
not) not symbols
∙ The basic printing command is print()
83
Basic Datatypes
∙ Integers (default for numbers)
z = 5 / 2 # Answer 2, integer division
∙ Floats
x = 3.456
∙ Strings
• Can use “” or ‘’ to specify with “abc” == ‘abc’
• Unmatched can occur within the string: “matt’s”
• Use triple double-quotes for multi-line strings or
strings than contain both ‘ and “ inside of them:
“““a‘b“c”””
84
Whitespace
Whitespace is meaningful in Python: especially
indentation and placement of newlines
∙ Use a newline to end a line of code
Use \ when must go to next line prematurely
∙ No braces {} to mark blocks of code, use
consistent indentation instead
• First line with less indentation is outside of the block
• First line with more indentation starts a nested block
∙ Colons start of a new block in many constructs,
e.g. function definitions, then clauses
85
Comments
∙ Start comments with #, rest of line is ignored
∙ Can include a “documentation string” as the
first line of a new function or class you define
∙ Development environments, debugger, and
other tools use it: it’s good style to include one
def fact(n):
“““fact(n) assumes n is a positive
integer and returns facorial of n.”””
assert(n>0)
return 1 if n==1 else n*fact(n-1)
86
Assignment
∙ Binding a variable in Python means setting a name to
hold a reference to some object
• Assignment creates references, not copies
∙ Names in Python do not have an intrinsic type,
objects have types
• Python determines the type of the reference automatically
based on what data is assigned to it
∙ You create a name the first time it appears on the left
side of an assignment expression:
x=3
∙ A reference is deleted via garbage collection after
any names bound to it have passed out of scope
∙ Python uses reference semantics (more later)
87
Naming Rules
∙ Names are case sensitive and cannot start
with a number. They can contain letters,
numbers, and underscores.
bob
Bob
_bob
_2_bob_
bob_2
BoB
∙ There are some reserved words:
and, assert, break, class, continue,
def, del, elif, else, except, exec,
finally, for, from, global, if,
import, in, is, lambda, not, or,
pass, print, raise, return, try,
while
88
Naming conventions
The Python community has these
recommended naming conventions
∙ joined_lower for functions, methods and,
attributes
∙ joined_lower or ALL_CAPS for constants
∙ StudlyCaps for classes
∙ camelCase only to conform to pre-existing
conventions
∙ Attributes: interface, _internal, __private
89
Assignment
∙ You can assign to multiple names at the
same time
>>> x, y = 2, 3
>>> x
2
>>> y
3
This makes it easy to swap values
>>> x, y = y, x
∙ Assignments can be chained
>>> a = b = x = 2
90
Accessing Non-Existent Name
Accessing a name before it’s been properly
created (by placing it on the left side of an
assignment), raises an error
>>> y
Traceback (most recent call last):
File "<pyshell#16>", line 1, in -toplevely
NameError: name ‘y' is not defined
>>> y = 3
>>> y
3
91
92
Sequence types:
Tuples, Lists, and
Strings
Sequence Types
1. Tuple: (‘john’, 32, [CMSC])
∙ A simple immutable ordered sequence of
items
∙ Items can be of mixed types, including
collection types
2. Strings: “John Smith”
• Immutable
• Conceptually very much like a tuple
3. List: [1, 2, ‘john’, (‘up’, ‘down’)]
∙ Mutable ordered sequence of items of
mixed types
93
Similar Syntax
∙ All three sequence types (tuples,
strings, and lists) share much of the
same syntax and functionality.
∙ Key difference:
• Tuples and strings are immutable
• Lists are mutable
∙ The operations shown in this section
can be applied to all sequence types
• most examples will just show the
operation performed on one
94
Sequence Types 1
∙ Define tuples using parentheses and commas
>>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
∙ Define lists are using square brackets and
commas
>>> li = [“abc”, 34, 4.34, 23]
∙ Define strings using quotes (“, ‘, or “““).
>>> st
>>> st
>>> st
string
= “Hello World”
= ‘Hello World’
= “““This is a multi-line
that uses triple quotes.”””
95
Sequence Types 2
∙ Access individual members of a tuple, list, or
string using square bracket “array” notation
∙ Note that all are 0 based…
>>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> tu[1]
# Second item in the tuple.
‘abc’
>>> li = [“abc”, 34, 4.34, 23]
>>> li[1]
# Second item in the list.
34
>>> st = “Hello World”
>>> st[1]
# Second character in string.
‘e’
96
Positive and negative indices
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Positive index: count from the left, starting with 0
>>> t[1]
‘abc’
Negative index: count from right, starting with –1
>>> t[-3]
4.56
97
Slicing: return copy of a subset
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Return a copy of the container with a subset of
the original members. Start copying at the first
index, and stop copying before second.
>>> t[1:4]
(‘abc’, 4.56, (2,3))
Negative indices count from end
>>> t[1:-1]
(‘abc’, 4.56, (2,3))
98
Slicing: return copy of a =subset
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Omit first index to make copy starting from
beginning of the container
>>> t[:2]
(23, ‘abc’)
Omit second index to make copy starting at first
index and going to end
>>> t[2:]
(4.56, (2,3), ‘def’)
99
Copying the Whole Sequence
∙ [ : ] makes a copy of an entire sequence
>>> t[:]
(23, ‘abc’, 4.56, (2,3), ‘def’)
∙ Note the difference between these two lines for
mutable sequences
>>> l2 = l1 # Both refer to 1 ref,
# changing one affects both
>>> l2 = l1[:] # Independent copies, two
refs
100
The ‘in’ Operator
∙ Boolean test whether a value is inside a container:
>>> t
>>> 3
False
>>> 4
True
>>> 4
False
= [1, 2, 4, 5]
in t
in t
not in t
∙ For strings, tests for substrings
>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
∙ Be careful: the in keyword is also used in the syntax
of for loops and list comprehensions
101
The + Operator
The + operator produces a new tuple, list, or
string whose value is the concatenation of its
arguments.
>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> “Hello” + “ ” + “World”
‘Hello World’
102
The * Operator
∙ The * operator produces a new tuple, list, or
string that “repeats” the original content.
>>> (1, 2, 3) * 3
(1, 2, 3, 1, 2, 3, 1, 2, 3)
>>> [1, 2, 3] * 3
[1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> “Hello” * 3
‘HelloHelloHello’
103
104
Mutability:
Tuples vs. Lists
Lists are mutable
>>> li = [‘abc’, 23, 4.34, 23]
>>> li[1] = 45
>>> li
[‘abc’, 45, 4.34, 23]
∙ We can change lists in place.
∙ Name li still points to the same memory
reference when we’re done.
105
Tuples are immutable
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in -topleveltu[2] = 3.14
TypeError: object doesn't support item assignment
∙ You can’t change a tuple.
∙ You can make a fresh tuple and assign its
reference to a previously used name.
>>> t = (23, ‘abc’, 3.14, (2,3), ‘def’)
∙ The immutability of tuples means they’re faster
than lists.
106
Operations on Lists Only
>>> li = [1, 11, 3, 4, 5]
>>> li.append(‘a’) # Note the method
syntax
>>> li
[1, 11, 3, 4, 5, ‘a’]
>>> li.insert(2, ‘i’)
>>>li
[1, 11, ‘i’, 3, 4, 5, ‘a’]
107
The extend method vs +
∙ + creates a fresh list with a new memory ref
∙ extend operates on list li in place.
>>> li.extend([9, 8, 7])
>>> li
[1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7]
∙ Potentially confusing:
• extend takes a list as an argument.
• append takes a singleton as an argument.
>>> li.append([10, 11, 12])
>>> li
[1, 2, ‘i’, 3, 4, 5, ‘a’, 9, 8, 7, [10,
108
11, 12]]
Operations on Lists Only
Lists have many methods, including index, count,
remove, reverse, sort
>>> li = [‘a’, ‘b’, ‘c’, ‘b’]
>>> li.index(‘b’) # index of 1st occurrence
1
>>> li.count(‘b’) # number of occurrences
2
>>> li.remove(‘b’) # remove 1st occurrence
>>> li
[‘a’, ‘c’, ‘b’]
109
Operations on Lists Only
>>> li = [5, 2, 6, 8]
>>> li.reverse()
>>> li
[8, 6, 2, 5]
# reverse the list *in place*
>>> li.sort()
>>> li
[2, 5, 6, 8]
# sort the list *in place*
>>> li.sort(some_function)
# sort in place using user-defined comparison
110
Summary: Tuples vs. Lists
∙ Lists slower but more powerful than tuples
• Lists can be modified, and they have lots of
handy operations and methods
• Tuples are immutable and have fewer
features
∙ To convert between tuples and lists use the
list() and tuple() functions:
li = list(tu)
tu = tuple(li)
111
Assignment Delivery
●
●
●
●
TA Eng. Wafaa w.momen@fci-cu.edu.eg
TA Eng. Ibrahim I.gomaa@fci-cu.edu.eg
3 assignments in total
you will get 4, but you can choose to only
deliver 3
● Assignments will be uploaded to the
blackboard.
112
Lecture 4: Sentiment
Analysis by NLTK
Mahmoud Daif
AFINN-111
https://github.com/fnielsen/afinn/blob/m
aster/afinn/data/AFINN-111.txt
A list of words rated between (-5) negative to (5)
positive!
Lec 5
Download