PythonStringsCh8

PYTHON STRINGS
CHAPTER 8 FROM THINK PYTHON: HOW TO THINK LIKE A COMPUTER SCIENTIST
STRINGS
A string is a sequence of characters. You may access the
individual characters one at a time with the bracket operator.
>>> name = 'Simpson'
>>> FirstLetter = name[0]    # FirstLetter is 'S'
name[0] is 'S', name[1] is 'i', name[4] is 's'
name[-1] == name[6] == name[len(name)-1]; all three are 'n', the last character
Also remember that len(name) is 7    # number of characters
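A small sketch of the indexing rules above. The slides use the Python 2 print statement; this sketch uses the print(...) call form with a single argument so that it also runs on Python 3. It checks that the last character can be reached in three equivalent ways.

```python
name = 'Simpson'

first = name[0]               # 'S': indexing starts at 0
last = name[len(name) - 1]    # 'n': the last valid index is len(name) - 1

# A negative index counts from the right end, so -1 is the last character.
same_last = (name[-1] == name[6] == last)

print(first)
print(last)
print(same_last)
```

Asking for name[7] would raise an IndexError, because the valid indices run from 0 to 6.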
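TRAVERSING A STRING SEVERAL WAYS
name = 'Richard Simpson'

# with a while loop and an index
index = 0
while index < len(name):
    letter = name[index]
    print letter
    index = index + 1

# with a for loop over the indices
for i in range(len(name)):
    letter = name[i]
    print letter

# with a for loop over the characters
for char in name:
    print char

# indexing directly inside the loop
for i in range(len(name)):
    print name[i]
Does this make sense?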
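When both the index and the character are needed, the built-in enumerate function supplies them together. This is not one of the traversals shown on the slide, only a possible fifth variant; the list called pairs is a name invented for the illustration.

```python
name = 'Richard Simpson'

# enumerate yields (index, character) pairs, so no manual counter is needed.
pairs = []
for i, char in enumerate(name):
    pairs.append((i, char))

print(pairs[0])     # the first pair
print(pairs[-1])    # the last pair
```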
CONCATENATION
# The + operator is used to concatenate
# two strings together
first = 'Monty'
second = 'Python'
full = first + second
print full               # prints MontyPython
# Reversing a string
word = 'Hello Monty'
rev_word = ''
for char in word:
    rev_word = char + rev_word    # put each new char in FRONT of the result
print rev_word           # prints ytnoM olleH
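The reversing loop works because each character is attached to the front of the result. As a cross-check, a slice with a step of -1 should give the same answer; that shortcut is not on the slide and is shown here only as an assumption-free built-in alternative. The print(...) form is used so the sketch also runs on Python 3.

```python
word = 'Hello Monty'

# The loop from the slide: prepend each character to the result so far.
rev_loop = ''
for char in word:
    rev_loop = char + rev_loop

# The slice alternative: start at the end and step backwards by one.
rev_slice = word[::-1]

print(rev_loop)
print(rev_slice == rev_loop)
```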
STRING SLICES
A slice is a contiguous segment (a substring) of a string.
s = 'Did you say shrubberies? '
a = s[0:7]    # slice from 0 to 6 (not including 7); a is 'Did you'
b = s[8:11]   # slice from 8 to 10; b is 'say'
c = s[12:]    # slice from 12 to the end (returns a suffix); c is 'shrubberies? '
d = s[:3]     # slice from 0 to 2 (returns a prefix); d is 'Did'
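The slices above can be checked directly. Note that s[12:] runs to the very end of the string, so it keeps the question mark and the trailing space. The name rebuilt is introduced only for this sketch; it shows that a prefix and the matching suffix always add back up to the original string.

```python
s = 'Did you say shrubberies? '

a = s[0:7]     # characters 0 through 6
b = s[8:11]    # characters 8 through 10
c = s[12:]     # character 12 through the end (a suffix)
d = s[:3]      # characters 0 through 2 (a prefix)

# Splitting at any index and concatenating the two halves restores s.
rebuilt = s[:12] + s[12:]

print(repr(c))         # repr makes the trailing space visible
print(rebuilt == s)
```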
STRINGS ARE IMMUTABLE
You can only build new strings; you CANNOT modify an existing one. You can,
however, assign a new string to the same name. For example:
name = 'Superman'
name[0] = 's'            # generates an error (TypeError)
name = 's' + name[1:]    # this works: it builds a new string
print name               # prints superman
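To see the error without stopping the program, the assignment can be wrapped in try/except. This is a sketch; the names error_seen and new_name are invented for the illustration, and the print(...) form is used so it runs on Python 3 as well.

```python
name = 'Superman'

try:
    name[0] = 's'          # strings do not support item assignment
    error_seen = False
except TypeError:
    error_seen = True

# Build a new string from a new first letter plus a slice of the old one.
new_name = 's' + name[1:]

print(error_seen)
print(new_name)
print(name)                # the original string is unchanged
```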
METHODS VS FUNCTIONS
type.do_something()         # here do_something is a method
'Hello'.upper()             # returns 'HELLO'
value.isdigit()             # returns True if all characters are digits (and there is at least one)
name.startswith('Har')      # returns True if name begins with 'Har'
do_something(type)          # here do_something is a function
Examples:
len("TATATATA")             # returns the length of a string
math.sqrt(34.5)             # returns the square root of 34.5 (requires import math)
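The calls above, gathered into one runnable sketch. The values '2024' and 'Harry' are made up for the example (the slide leaves value and name undefined), and math must be imported before math.sqrt can be used.

```python
import math

word = 'Hello'
value = '2024'     # hypothetical value for the isdigit example
name = 'Harry'     # hypothetical value for the startswith example

# Methods: called on the object with dot notation.
print(word.upper())
print(value.isdigit())
print(name.startswith('Har'))

# Functions: the object is passed in as an argument.
print(len('TATATATA'))
root = math.sqrt(34.5)
print(round(root, 3))
```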
STRING METHODS
Methods are similar to functions, except that they are called with a
different syntax: dot notation.
word = 'rabbit'
uword = word.upper()
Here word is the string, upper is the method, and the empty parentheses show
that it takes no arguments. It returns a copy of the string in upper case.
There are a lot of string methods. Here is another:
string.capitalize() returns a copy of the string with its first
character capitalized and the rest lowercased.
FIND(): A STRING METHOD
string.find(sub[, start[, end]])
Return the lowest index in the string where substring sub is found, such
that sub is contained in the range [start, end]. Optional arguments start and
end are interpreted as in slice notation. Returns -1 if sub is not found.
statement = "What makes you think she's a witch? Well she turned me into a newt"
index = statement.find('witch')
print index              # prints 29
index2 = statement.find('she')
print index2             # prints 21
index3 = statement.find('she', index)
print index3             # prints 41
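The optional start argument makes it possible to walk through every occurrence of a substring: search, remember the position, then search again starting one character later. The helper find_all below is a hypothetical name written for this sketch, not a built-in string method. The statement is written with double quotes because it contains an apostrophe.

```python
statement = "What makes you think she's a witch? Well she turned me into a newt"

def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    positions = []
    index = text.find(sub)
    while index != -1:                      # -1 means no further match
        positions.append(index)
        index = text.find(sub, index + 1)   # resume just past the last hit
    return positions

print(find_all(statement, 'she'))      # [21, 41]
print(find_all(statement, 'witch'))    # [29]
print(find_all(statement, 'dragon'))   # []
```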
THE IN OPERATOR WITH STRINGS
The word in is a boolean operator that takes two strings and returns True
if the first appears as a substring in the second.
>>> 'a' in 'King Arthur'
False
>>> 'Art' in 'King Arthur'
True
What does this function do?
def mystery(word1, word2):
    for letter in word1:
        if letter in word2:
            print letter
It prints the letters of word1 that also occur in word2.
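A variant of the mystery function that returns the shared letters as a string instead of printing them, which makes the result easier to check. The name letters_in_both and the two test words are made up for this sketch.

```python
def letters_in_both(word1, word2):
    """Return the letters of word1 that also appear in word2, in order."""
    shared = ''
    for letter in word1:
        if letter in word2:          # substring test on a single character
            shared = shared + letter
    return shared

print(letters_in_both('apples', 'oranges'))   # aes
print('a' in 'King Arthur')                   # False: the test is case sensitive
print('Art' in 'King Arthur')                 # True
```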
MORE EXAMPLES
>>> 'TATA' in 'TATATATATATA'
True
>>> 'AA' in 'TATATATATATATATA'
False
>>> 'AC' + 'TG'
'ACTG'
>>> 5 * 'TA'
'TATATATATA'
>>> 'MNKMDLVADVAEKTDLS'[1:4]
'NKM'
>>> 'MNKMDLVADVAEKTDLS'[8:-1]
'DVAEKTDL'
>>> 'MNKMDLVADVAEKTDLS'[-5:-4]
'K'
>>> 'MNKMDLVADVAEKTDLS'[10:]
'AEKTDLS'
>>> 'MNKMDLVADVAEKTDLS'[5:5]
''
>>> 'MNKMDLVADVAEKTD'.find('LV')
5
STRING COMPARISONS
The relational operators also work on strings.
if word == 'bananas':
    print 'Yes I want one'
Put words in alphabetical order:
if word1 < word2:
    print word1, word2
else:
    print word2, word1
NOTE: in Python all upper case letters come before all lower case
letters, i.e. 'Hello' comes before 'hello'.
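Because upper case letters sort before lower case ones, a capitalized word can land ahead of a word that should follow it alphabetically. One common workaround, sketched here, is to compare lower-cased copies. The function name in_order and the sample words are assumptions made for the illustration.

```python
def in_order(word1, word2):
    """Return the two words as a tuple in alphabetical order, ignoring case."""
    if word1.lower() < word2.lower():
        return (word1, word2)
    else:
        return (word2, word1)

# A plain comparison puts 'Zebra' first, because 'Z' comes before 'a'.
print('Zebra' < 'apple')
# Comparing lower-cased copies gives the dictionary order instead.
print(in_order('Zebra', 'apple'))
```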
LET'S DOWNLOAD A BOOK AND ANALYZE IT
Go to http://www.gutenberg.org/ and download the first edition
of On the Origin of Species by Charles Darwin. Be sure to
download the plain text version and save it as oots.txt
(http://www.gutenberg.org/files/1228/1228.txt)
This little program will read in the file and print it to the screen.
file = open('oots.txt', 'r')    # open for reading
print file.read()
NOTE: read() pulls the entire file into the computer's memory as one
string; the name file refers to the open file object.
See: http://www.pythonforbeginners.com/systemsprogramming/reading-and-writing-files-in-python/
I DON'T WANT THE WHOLE FILE!
The readline() method reads from a file one line at a time (rather
than pulling the entire file in at once).
Use readline() when you want to get the first line of the file;
subsequent calls to readline() return successive lines.
Each call reads a single line from the file and returns a
string containing the characters up to and including \n.
# prints first line
file = open('newfile.txt', 'r')
print file.readline()
# prints first line, storing it in a variable first
file = open('newfile.txt', 'r')
line = file.readline()
print line
# prints first 100 lines
file = open('oots.txt', 'r')
for i in range(100):
    print file.readline()
# prints entire file using the in operator
file = open('oots.txt', 'r')
for line in file:
    print line
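The behaviour of readline() can be tried without downloading anything by using io.StringIO, which wraps a string in a file-like object. The three-line text is made up for this sketch. The key detail is that readline() returns an empty string once the end of the file is reached, which gives a while loop something to test.

```python
import io

# An in-memory stand-in for a text file (made-up contents).
fake_file = io.StringIO(u'first line\nsecond line\nthird line\n')

first = fake_file.readline()     # includes the trailing newline
second = fake_file.readline()    # the next call returns the next line

# Keep reading until readline() returns '', which signals end of file.
rest = []
line = fake_file.readline()
while line != '':
    rest.append(line)
    line = fake_file.readline()

print(repr(first))
print(len(rest))
```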
DOES THE ORIGIN HAVE THE WORD EVOLUTION IN IT?
# Searches for the word 'evolution' in the file. It checks every
# line individually. This saves space over
# reading the entire book into memory.
file = open('oots.txt', 'r')
for line in file:
    if line.find('evolution') != -1:    # find returns -1 if not in line
        print line
print 'done'
# Is this true of the 6th edition? Check it out.
What if we want to know which line the string occurs in?
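One way to answer the line-number question is to keep a counter that goes up by one for every line read. The sketch below uses a short made-up list of lines in place of the book so that it runs on its own; the function name lines_containing is also invented for the example. Since a for loop treats an open file and a list of strings the same way, an open file object could be passed in instead of the list.

```python
def lines_containing(lines, target):
    """Return the 1-based line numbers of every line that contains target."""
    hits = []
    line_number = 0
    for line in lines:
        line_number = line_number + 1
        if line.find(target) != -1:     # find returns -1 when target is absent
            hits.append(line_number)
    return hits

# Made-up sample text, not taken from the book.
sample = [
    'species change slowly\n',
    'a theory of evolution\n',
    'no match on this line\n',
    'evolution appears again\n',
]

print(lines_containing(sample, 'evolution'))   # [2, 4]
```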
LET'S DOWNLOAD SOME DNA
Where do we get DNA? Well,
http://en.wikipedia.org/wiki/List_of_biological_databases
contains a nice list.
Let's use this one: http://www.ncbi.nlm.nih.gov/
Under Nucleotide, type in Neanderthal and download
KC879692.1 (it was the fifth one in my search). This is the
entire mitochondrial sequence for a Neanderthal found in
Denisova Cave in the Altai Mountains.
Here it is:
http://www.ncbi.nlm.nih.gov/nuccore/KC879692.1
Here is the Denisovan mitochondrial sequence.
STRIPPING THE ANNOTATION INFO
The annotation info for a GenBank file is everything written above the
ORIGIN line. Let's get rid of this stuff using a flag variable.
file = open('neanderMito.gb', 'r')
fileout = open("stripNeander.txt", "w")
# This code strips all lines above and including the ORIGIN line
# It uses a flag variable called originFlag
originFlag = False
for line in file:
    if originFlag == True:
        print line,                    # the comma suppresses the line feed
        fileout.write(line)
    if line.find('ORIGIN') != -1:      # once ORIGIN has been seen, start printing
        originFlag = True              # to the output file
fileout.close()    # an absolute requirement, to flush the buffer
STRIPPING THE ANNOTATION INFO 2
The annotation info for a GenBank file is everything written above the
ORIGIN line. Here is another method.
file = open('neanderMito.gb', 'r')
fileout = open("stripNeander.txt", "w")
line = file.readline()
while not line.startswith('ORIGIN'):    # skip up to ORIGIN
    line = file.readline()              # startswith is another string method. Look it up!
line = file.readline()                  # move past the ORIGIN line itself
while not line.startswith('//'):        # '//' marks the end of the record
    print line,
    fileout.write(line)
    line = file.readline()
fileout.close()    # an absolute requirement, to flush the buffer
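The two approaches can be combined: a flag to start keeping lines after ORIGIN, and a startswith('//') test to stop at the end of the record. The sketch below works on a made-up, heavily shortened GenBank-style record held in a list, so it runs without the downloaded file; sequence_lines is a hypothetical name. Because it uses a for loop, it simply ends at the last line if no ORIGIN line is present, whereas a while loop that waits for ORIGIN would never finish on such a file.

```python
def sequence_lines(lines):
    """Return the lines strictly between the ORIGIN line and the '//' line."""
    kept = []
    in_sequence = False
    for line in lines:
        if line.startswith('//'):       # end-of-record marker: stop here
            break
        if in_sequence:
            kept.append(line)
        if line.startswith('ORIGIN'):   # everything after this line is sequence
            in_sequence = True
    return kept

# Made-up header lines followed by two short sequence lines.
sample = [
    'LOCUS       SAMPLE\n',
    'DEFINITION  made-up header line for illustration\n',
    'ORIGIN\n',
    '        1 gatcacaggt ctatcaccct\n',
    '       21 attaaccact cacgggagct\n',
    '//\n',
]

kept = sequence_lines(sample)
print(len(kept))
print(kept[0].strip())
```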
WHAT WE HAVE NOW
1 gatcacaggt ctatcaccct attaaccact cacgggagct ctccatgcat ttggtatttt
61 cgtctggggg gtgtgcacgc gatagcattg cgagacgctg gagccggagc accctatgtc
121 gcagtatctg tctttgattc ctgccccatc ctattattta tcgcacctac gttcaatatt
181 acagacgagc atacctacta aagtgtgtta attaattaat gcttgtagga cataataata
241 acgattaaat gtctgcacag ccgctttcca cacagacatc ataacaaaaa atttccacca
301 aacccccccc ctccccccgc ttctggccac agcacttaaa catatctctg ccaaacccca
361 aaaacaaaga accctaacac cagcctaacc agatttcaaa ttttatcttt tggcggtata
421 cacttttaac agtcaccccc taactaacac attattttcc cctcccactc ccatactact
481 aatctcatca atacaacccc cgcccatcct acccagcaca caccgctgct aaccccatac
541 cccgagccaa ccaaacccca aagacacccc ccacagttta tgtagcttac ctcctcaaag
We want to get rid of the numbers and spaces. How does one do this?
What types of characters are left in this file?
Digits, a, c, t, g, spaces, and CRs (line endings).
SO LET'S STRIP EVERYTHING BUT A, C, T, G
file = open("stripNeander.txt", "r")
fileout = open('neanderMitostripped.txt', 'w')
# This code strips all characters but a,c,t,g
for line in file:
    for char in line:
        if char in ['a','c','t','g']:    # I'm using a list here
            fileout.write(char)
fileout.close()
What is in the output file now?
One very long line, i.e. there are NO spaces or CRs.
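The same filtering can be tried on a single line in memory. This sketch swaps the nested loop and list for a generator expression with ''.join and tests membership against the string 'acgt'; that is a different technique from the slide's, though it keeps the same characters. The helper name keep_bases is made up.

```python
def keep_bases(text):
    """Return only the a, c, g and t characters of text, in order."""
    return ''.join(char for char in text if char in 'acgt')

# One numbered sequence line in the style of the stripped file.
line = '        1 gatcacaggt ctatcaccct attaaccact\n'

print(keep_bases(line))    # gatcacaggtctatcaccctattaaccact
```

Digits, spaces and the newline are all dropped, which is why the output file ends up as one very long line.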
WHAT IF WE WANT TO DO ALL THIS ON A LOT OF FILES?
The easiest way is to turn the previous processing into a function. Then we can call the
function on each file.
# This function strips all characters but a,c,t,g from the named file and returns them as a string
def stripGenBank(name):
    file = open(name, "r")
    sequence = ''
    originFlag = False
    for line in file:
        if originFlag == True:
            for char in line:
                if char in ['a','c','t','g']:    # I'm using a list here
                    sequence = sequence + char   # attach the new char on the end
        if line.find('ORIGIN') != -1:
            originFlag = True
    file.close()
    return sequence
print stripGenBank('neanderMito.gb')
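Once the processing is a function, handling many files is a loop over their names. The sketch below writes two tiny made-up GenBank-style files into a temporary folder so that it is self-contained; with real data the list of names would hold the downloaded .gb files instead. The function strip_genbank is a variant written for this example: it uses a with block to close the file and startswith to detect the ORIGIN line, both of which are choices made here rather than the slide's.

```python
import os
import tempfile

def strip_genbank(name):
    """Return only the a, c, g, t characters that follow the ORIGIN line."""
    sequence = []
    origin_seen = False
    with open(name, 'r') as handle:          # the file is closed automatically
        for line in handle:
            if origin_seen:
                sequence.extend(char for char in line if char in 'acgt')
            elif line.startswith('ORIGIN'):
                origin_seen = True
    return ''.join(sequence)

# Create two small made-up files to stand in for real downloads.
folder = tempfile.mkdtemp()
names = []
for label, bases in [('one', 'gatc acag'), ('two', 'ttga ccat')]:
    path = os.path.join(folder, label + '.gb')
    with open(path, 'w') as out:
        out.write('LOCUS  ' + label + '\nORIGIN\n        1 ' + bases + '\n//\n')
    names.append(path)

# Apply the same function to every file and collect the results by file name.
results = {}
for path in names:
    results[os.path.basename(path)] = strip_genbank(path)

print(results['one.gb'])
print(results['two.gb'])
```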
LET'S COMPARE NEANDERTHAL WITH DENISOVAN
neander = stripGenBank('neanderMito.gb')
denison = stripGenBank('denosovanMito.gb')
# Assumes the sequences differ somewhere in the first 10000 bases;
# otherwise index is never set.
for i in range(10000):
    if neander[i] != denison[i]:
        print '\nFiles first differ at location ', i
        index = i + 1
        print 'Neanderthal is ', neander[i], ' and Denisovan is ', denison[i]
        break
print neander[:index]    # dump up to where they differ
print denison[:index]
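The search for the first difference can be packaged as a function that also copes with short or identical sequences. This is a sketch: first_difference is a hypothetical helper, the two short strings are made up rather than taken from the real genomes, and returning -1 when no difference is found mirrors the convention used by find.

```python
def first_difference(seq1, seq2):
    """Return the first index where the sequences differ, or -1 if the
    overlapping part is identical."""
    for i in range(min(len(seq1), len(seq2))):   # never index past either end
        if seq1[i] != seq2[i]:
            return i
    return -1

# Made-up sequences that differ at index 4.
a = 'gatcacag'
b = 'gatcgcag'

where = first_difference(a, b)
print(where)
print(a[:where + 1])    # dump up to and including the differing position
print(b[:where + 1])
```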
THE OUTPUT OF THIS COMPARISON
Files first differ at location 145
Neanderthal is c and Denisovan is t
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtatt
ttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcac
cctatgtcgcagtatctgtctttgattcctgccc
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtatt
ttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcac
cctatgtcgcagtatctgtctttgattcctgcct