PYTHON STRINGS
CHAPTER 8 FROM THINK PYTHON: HOW TO THINK LIKE A COMPUTER SCIENTIST

STRINGS
A string is a sequence of characters. You can access the individual characters one at a time with the bracket operator.

    >>> name = 'Simpson'
    >>> FirstLetter = name[0]

For name = 'Simpson':

    name[0] is 'S'
    name[1] is 'i'
    name[4] is 's'
    name[-1] == name[6] == name[len(name)-1], which is 'n'

Also remember that len(name) is 7 (the number of characters).

TRAVERSING A STRING SEVERAL WAYS

    name = 'Richard Simpson'

    # 1. while loop with an explicit index
    index = 0
    while index < len(name):
        letter = name[index]
        print letter
        index = index + 1

    # 2. for loop over the indices
    for i in range(len(name)):
        letter = name[i]
        print letter

    # 3. for loop over the characters
    for char in name:
        print char

    # 4. for loop over the indices, printing directly
    for i in range(len(name)):
        print name[i]

Does this make sense?

CONCATENATION

    # The + operator is used to concatenate two strings
    first = 'Monty'
    second = 'Python'
    full = first + second
    print full          # MontyPython

    # Reversing a string
    word = 'Hello Monty'
    rev_word = ''
    for char in word:
        rev_word = char + rev_word
    print rev_word      # ytnoM olleH

STRING SLICES
A slice is a contiguous subsegment (substring) of a string.

    s = 'Did you say shrubberies? '
    a = s[0:7]    # slice from 0 to 6 (not including 7); a is 'Did you'
    b = s[8:11]   # slice from 8 to 10; b is 'say'
    c = s[12:]    # slice from 12 to the end (a suffix); c is 'shrubberies? '
    d = s[:3]     # slice from 0 to 2 (a prefix); d is 'Did'

STRINGS ARE IMMUTABLE
You can only build new strings; you CANNOT modify an existing one. You can, however, rebind the name to a new string. For example:

    name = 'Superman'
    name[0] = 's'            # generates an error (TypeError)
    name = 's' + name[1:]    # this works
    print name               # superman

METHODS VS FUNCTIONS

    type.do_something()      # here do_something is a method
    'Hello'.upper()          # returns 'HELLO'
    value.isdigit()          # returns True if all characters are digits
    name.startswith('Har')   # returns True if name starts with 'Har'
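Several of the claims above (indexing, slicing, immutability, and method calls) can be checked with a short script. This is only a minimal sketch, assuming Python 3, so print is a function rather than the print statement used on the slides. The sample values '2024' and 'Harold' are illustrative and do not come from the slides.

```python
# Sketch only: restates the slide examples so they can be checked (Python 3 assumed).
name = 'Simpson'

# Indexing: negative indices count from the end of the string.
first_letter = name[0]
last_letter = name[-1]
same_last = (name[-1] == name[len(name) - 1])

# Slicing returns a new string; the end index is excluded.
s = 'Did you say shrubberies? '
first_words = s[0:7]
prefix = s[:3]

# Strings are immutable: item assignment raises TypeError.
hero = 'Superman'
try:
    hero[0] = 's'
    assignment_failed = False
except TypeError:
    assignment_failed = True
hero = 's' + hero[1:]   # build a new string instead

# Methods are called with dot notation on the string itself.
shout = 'Hello'.upper()
all_digits = '2024'.isdigit()           # '2024' is an illustrative value
starts = 'Harold'.startswith('Har')     # 'Harold' is an illustrative value

print(first_letter, last_letter, first_words, prefix, hero, shout)
```

Catching TypeError here is just a way to let the script keep running; on the slide, the same assignment simply stops the program with an error.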
    do_something(type)       # here do_something is a function

Examples:

    len("TATATATA")          # returns the length of the string
    math.sqrt(34.5)          # returns the square root of 34.5

STRING METHODS
Methods are similar to functions except that they are called with a different syntax: dot notation.

    word = 'rabbit'
    uword = word.upper()     # upper() is a string method; it takes no arguments
                             # and returns the string in upper case

There are a lot of string methods. Here is another:

    string.capitalize()

Returns a copy of the string with only its first character capitalized.

FIND(): A STRING METHOD

    string.find(sub[, start[, end]])

Returns the lowest index in the string where the substring sub is found, such that sub is contained in the range [start, end]. The optional arguments start and end are interpreted as in slice notation. Returns -1 if sub is not found.

    >>> statement = "What makes you think she's a witch? Well she turned me into a newt"
    >>> index = statement.find('witch')
    >>> print index
    29
    >>> index2 = statement.find('she')
    >>> print index2
    21
    >>> index3 = statement.find('she', index)
    >>> print index3
    41

THE IN OPERATOR WITH STRINGS
The word in is a boolean operator that takes two strings and returns True if the first appears as a substring in the second.

    >>> 'a' in 'King Arthur'
    False

What does this function do?
    def mystery(word1, word2):
        for letter in word1:
            if letter in word2:
                print letter

It prints the letters of word1 that also occur in word2.

The in operator is case sensitive, so:

    >>> 'Art' in 'King Arthur'
    True

MORE EXAMPLES

    >>> 'TATA' in 'TATATATATATA'
    True
    >>> 'AA' in 'TATATATATATATATA'
    False
    >>> 'AC' + 'TG'
    'ACTG'
    >>> 5 * 'TA'
    'TATATATATA'
    >>> 'MNKMDLVADVAEKTDLS'[1:4]
    'NKM'
    >>> 'MNKMDLVADVAEKTDLS'[8:-1]
    'DVAEKTDL'
    >>> 'MNKMDLVADVAEKTDLS'[-5:-4]
    'K'
    >>> 'MNKMDLVADVAEKTDLS'[10:]
    'AEKTDLS'
    >>> 'MNKMDLVADVAEKTDLS'[5:5]
    ''
    >>> 'MNKMDLVADVAEKTD'.find('LV')
    5

STRING COMPARISONS
The relational operators also work on strings.

    if word == 'bananas':
        print 'Yes I want one'

Put words in alphabetical order:

    if word1 < word2:
        print word1, word2
    else:
        print word2, word1

NOTE: In Python all upper case letters come before lower case letters, i.e. 'Hello' comes before 'hello'.

LET'S DOWNLOAD A BOOK AND ANALYZE IT
Go to http://www.gutenberg.org/ and download the first edition of On the Origin of Species by Charles Darwin. Be sure to download the plain text version and save it as oots.txt (http://www.gutenberg.org/files/1228/1228.txt).

This little program reads in the file and prints it to the screen:

    file = open('oots.txt', 'r')   # open for reading
    print file.read()

NOTE: file.read() reads the entire file into the memory of the computer at once, as a single string!
See: http://www.pythonforbeginners.com/systemsprogramming/reading-and-writing-files-in-python/

I DON'T WANT THE WHOLE FILE!
The readline() method reads from a file line by line (rather than pulling the entire file in at once). Use readline() when you want the first line of the file; subsequent calls to readline() return successive lines. Each call reads a single line from the file and returns a string containing the characters up to and including \n.
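The readline() behaviour described above can be observed without downloading anything. The following is a minimal sketch, assuming Python 3; the temporary file and its three lines of content are hypothetical stand-ins for a real text file such as oots.txt.

```python
import os
import tempfile

# Create a small three-line file to read back (hypothetical sample content).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    path = tmp.name

with open(path, 'r') as f:
    line1 = f.readline()    # first call returns the first line, '\n' included
    line2 = f.readline()    # next call returns the following line
    rest = f.read()         # read() returns whatever has not been read yet
    at_end = f.readline()   # an empty string signals the end of the file

os.remove(path)             # clean up the temporary file

print(repr(line1), repr(line2), repr(rest), repr(at_end))
```

Because each returned line keeps its trailing \n, printing a line adds a second line break unless the newline is stripped or suppressed.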
    # prints first line
    file = open('newfile.txt', 'r')
    print file.readline()

    # prints first line
    file = open('newfile.txt', 'r')
    line = file.readline()
    print line

    # prints first 100 lines
    file = open('oots.txt', 'r')
    for i in range(100):
        print file.readline()

    # prints entire file using the in operator
    file = open('oots.txt', 'r')
    for line in file:
        print line

DOES THE ORIGIN HAVE THE WORD EVOLUTION IN IT?

    # Searches for the word 'evolution' in the file, checking every line
    # individually. This saves space over reading the entire book into memory.
    file = open('oots.txt', 'r')
    for line in file:
        if line.find('evolution') != -1:   # find returns -1 if not in line
            print line
    print 'done'

Is this true of the 6th edition? Check it out.
What if we want to know which line the string occurs in?

LET'S DOWNLOAD SOME DNA
Where do we get DNA? http://en.wikipedia.org/wiki/List_of_biological_databases contains a nice list. Let's use this one: http://www.ncbi.nlm.nih.gov/

Under Nucleotide, type in Neanderthal and download KC879692.1 (it was the fifth one in my search). This is the entire mitochondrial sequence for a Neanderthal found in the Denisova cave in the Altai mountains. Here it is: http://www.ncbi.nlm.nih.gov/nuccore/KC879692.1

Here is the Denisovan mitochondria.

STRIPPING THE ANNOTATION INFO
The annotation info for a GenBank file is everything written above the ORIGIN line.
Let's get rid of this stuff using a flag variable.

    file = open('neanderMito.gb', 'r')
    fileout = open("stripNeander.txt", "w")
    # This code strips all lines above and including the ORIGIN line.
    # It uses a flag variable called originFlag.
    originFlag = False
    for line in file:
        if originFlag == True:
            print line,                  # the comma suppresses the line feed
            fileout.write(line)
        if line.find('ORIGIN') != -1:    # once ORIGIN is found, start writing
            originFlag = True            # to the output file
    fileout.close()                      # an absolute requirement to flush the buffer

STRIPPING THE ANNOTATION INFO 2
Another method:

    file = open('neanderMito.gb', 'r')
    fileout = open("stripNeander.txt", "w")
    line = file.readline()
    # startswith is another string method. Look it up!
    while not line.startswith('ORIGIN'):   # skip up to ORIGIN
        line = file.readline()
    line = file.readline()
    while not line.startswith('//'):
        print line,
        fileout.write(line)
        line = file.readline()
    fileout.close()                        # an absolute requirement to flush the buffer

NOW WE HAVE

      1 gatcacaggt ctatcaccct attaaccact cacgggagct ctccatgcat ttggtatttt
     61 cgtctggggg gtgtgcacgc gatagcattg cgagacgctg gagccggagc accctatgtc
    121 gcagtatctg tctttgattc ctgccccatc ctattattta tcgcacctac gttcaatatt
    181 acagacgagc atacctacta aagtgtgtta attaattaat gcttgtagga cataataata
    241 acgattaaat gtctgcacag ccgctttcca cacagacatc ataacaaaaa atttccacca
    301 aacccccccc ctccccccgc ttctggccac agcacttaaa catatctctg ccaaacccca
    361 aaaacaaaga accctaacac cagcctaacc agatttcaaa ttttatcttt tggcggtata
    421 cacttttaac agtcaccccc taactaacac attattttcc cctcccactc ccatactact
    481 aatctcatca atacaacccc cgcccatcct acccagcaca caccgctgct aaccccatac
    541 cccgagccaa ccaaacccca aagacacccc ccacagttta tgtagcttac ctcctcaaag

We want to get rid of the numbers and spaces. How does one do this? What type of characters are left in this file?
Digits, a, c, t, g, spaces, and CRs.

SO LET'S STRIP EVERYTHING BUT A,C,T,G

    file = open("stripNeander.txt", "r")
    fileout = open('neanderMitostripped.txt', 'w')
    # This code strips all characters but a,c,t,g
    for line in file:
        for char in line:
            if char in ['a','c','t','g']:   # I'm using a list here
                fileout.write(char)
    fileout.close()

What is in fileout now? One very long line, i.e. there are NO spaces or CRs.

WHAT IF WE WANT TO DO ALL THIS ON A LOT OF FILES?
The easiest way is to turn the previous processing into a function. Then we can use the function on each of the files.

    # This function strips all characters but a,c,t,g from the file called name
    # and returns the result as a string.
    def stripGenBank(name):
        file = open(name, "r")
        sequence = ''
        originFlag = False
        for line in file:
            if originFlag == True:
                for char in line:
                    if char in ['a','c','t','g']:   # I'm using a list here
                        sequence = sequence + char  # attach the new char on the end
            if line.find('ORIGIN') != -1:
                originFlag = True
        file.close()
        return sequence

    print stripGenBank('neanderMito.gb')

LET'S COMPARE NEANDERTHAL WITH DENISOVAN

    neander = stripGenBank('neanderMito.gb')
    denison = stripGenBank('denosovanMito.gb')
    index = 0   # stays 0 if no difference is found
    # Only compare positions that exist in both sequences
    for i in range(min(len(neander), len(denison))):
        if neander[i] != denison[i]:
            print '\nFiles first differ at location ', i
            index = i + 1
            print 'Neanderthal is ', neander[i], ' and Denisovan is ', denison[i]
            break
    print neander[:index]   # dump up to where they differ
    print denison[:index]

THE OUTPUT OF THIS COMPARISON

    Files first differ at location 145
    Neanderthal is c and Denisovan is t
    gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtatt
    ttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcac
    cctatgtcgcagtatctgtctttgattcctgccc
    gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtatt
    ttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcac
    cctatgtcgcagtatctgtctttgattcctgcct
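The first-difference search above can also be packaged as a small reusable function. This is a minimal sketch rather than the method used on the slides: it assumes Python 3, the name first_difference is hypothetical, and the two short sequences are made-up toy data, not the real mitochondrial sequences.

```python
def first_difference(seq_a, seq_b):
    """Return the index of the first position where the two sequences differ,
    or -1 if they agree over the length of the shorter one."""
    # zip stops at the end of the shorter sequence, so no index can run past the end.
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            return i
    return -1


# Hypothetical toy sequences, not the real Neanderthal or Denisovan data.
toy_neander = 'gatcacaggtc'
toy_denisovan = 'gatcacaagtc'

pos = first_difference(toy_neander, toy_denisovan)
if pos != -1:
    print('Sequences first differ at location', pos)
    print(toy_neander[:pos + 1])     # dump up to and including the difference
    print(toy_denisovan[:pos + 1])
```

Returning -1 when no difference is found mirrors the convention of the find() string method, so the caller can test the result in the same way.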