Practical tools for Machine Learning Programming with Python
Lecture 6: File I/O
Malek Smaoui

So far …
• All input is entered by a user on the keyboard
  • Use the input() function
• All output is displayed on the screen and can only be read by the user at that point
  • Use the print() function
• Implications:
  • Human-application interaction is significant
  • Long-term storage of results is not possible

More realistically …
• Consider an application performing a geophysical simulation to detect the presence of oil in the ground
  • The application requires as input “imagery” data of multiple ground layers over large areas
  • The application will produce an estimate of the distribution of oil in the different portions
• The current I/O mode is not sustainable for applications requiring a significant amount of input data and producing a significant amount of output
  • Entering input data at the keyboard can take a very long time, or be impossible because of a tricky format
  • The data most probably already exists on the computer as a digital file
  • The user may not be able to consult and use all the output in one sitting, as soon as it appears on screen
  • It is important to keep records of multiple simulations for comparison, archiving, etc.

Files
• Data can be stored “permanently” on the secondary storage of computer systems
• The basic unit of long-term data storage is the file
• The file system is the part of the computer’s operating system responsible for managing files on the different secondary storage devices (local hard drive, network drive, removable drive, …)

File I/O
• Programs can communicate with the file system to
  • Create new files on a drive
  • Read from existing files
  • Modify existing files
• A program can:
  • Read input from one or multiple files
  • Write output to one or multiple files
  • Write to and read from temporary files at different stages of solving the problem
• A set of objects and functions is available to embed these operations in programs

Accessing a file via a file object
• From the program's point of view, a file can be abstracted as a stream of bytes
• The simplest files are (raw) text files, where each byte is interpreted as a character
• The function open(filename[, mode[, buffering]]) returns an object that is an abstraction (representation) of a file
• The file is designated by its name (if it is in the working directory) or by its path, and only when it is opened
• Once the object is created, it can be used to:
  • Read the existing stream of characters, for instance with the read(…) method
  • Write to the stream of characters at a specific position, for instance with the write(…) method
• “Read” and “write” operations can only be performed on a file object obtained by an “open” operation

Opening a file
• file = open(filename[, mode[, buffering]])
• filename: a string containing the name or path of the file
• mode: an optional string specifying whether the file is to be opened for reading, writing or appending
• buffering: an optional integer setting the buffering policy. 0 disables buffering (binary mode only), 1 selects line buffering (text mode only), and a larger value sets the buffer size in bytes. With buffering, written data may be held back until the stream is flushed or closed

Open mode
• 'r' ('rb'): the default mode
  • Reading: OK; writing => error
  • File exists: open it; file does not exist => error
  • Initial I/O position: beginning
• 'w' ('wb')
  • Reading => error; writing: OK
  • File exists: open and empty it; file does not exist: create a new empty file
  • Initial I/O position: 0, because the file is initially empty
• 'a' ('ab')
  • Reading => error; writing: OK
  • File exists: open as is; file does not exist: create a new empty file
  • Initial I/O position: end; ALWAYS writes at the end of the file (append)
• '[m]+', where m is any of the above
  • Both reading and writing (or appending)
  • The rules for file existence and initial I/O position apply as above
• b: read/write in binary mode (not text)

Basic reading and writing
• s = file.read([size])
  • Returns a string of length size read from the file
  • If the number of characters until the end of the file n < size, then the returned string is of length n.
  • If size is not specified or negative, returns a string with all the characters in the file until the end-of-file (eof)
• file.write(content)
  • content must be a string

Closing a file
• Before a program terminates, or when no more reading or writing is needed, the file should be closed using file.close()
• Closing the file is important: it makes sure that pending I/O operations complete safely (buffered data is actually written to the file)
• A closed file object cannot be read from or written to any more, but the variable that referred to it can be reassigned to a new file object returned by another call to open()
• Reassigning the variable to a new file without first closing the file currently associated with it results in:
  • Losing access to the currently associated file object
  • Previous write operations possibly not completing properly

Exercises
1. Write a program that copies the contents of the file “input.txt” to the file “output.txt”
2. Write a program that reads a set of integers from a file, then appends their sum to that file

Accessing files on the file system
• Acquiring access to a specific file requires knowledge of its location in the file system (disk)
• Most file systems organize files in a hierarchy or tree
  • The root directory of the tree is the storage device
  • The branches / internal nodes represent the different subdirectories
  • Files are the leaves of the tree
• The location of a file is specified via its path

Absolute vs relative path
• Absolute path: a slash-separated sequence of directories starting at the file system root and specifying the hierarchy down to a given file (or directory)
• Working directory: the directory the program is currently running in; this is often, but not always, the directory containing the module
• Relative path: a slash-separated sequence of directories starting at the working directory and specifying the hierarchy down to a given file
• Files in the working directory can be opened via their filename only
  • The filename is in fact the relative path
• Files in other directories can be opened via their absolute path or their relative path

Read/write cursor
• Determines at which character (position) the next read/write operation will start
• Is initially set to 0 (beginning of the file) when the file is opened in 'r' or 'w' mode, and to the end-of-file when the file is opened in 'a' mode
• Is updated after each read/write operation to the position at which the operation ended
• file.tell(): returns the current read/write position
• file.seek(offset[, ref]): moves the read/write position to offset, counted from ref
  • ref can be 0 (default) for the beginning of the file
  • ref can be 1 for the current position; offset has to be 0 (for text files)
  • ref can be 2 for the end-of-file; offset has to be 0 (for text files)

Exercise
• Write a program that opens the file twinkle.txt, then reads and prints to the screen exactly the ten characters at the middle of the file.
  • You can only read ten characters
  • Use the methods tell and seek to position the read/write cursor at the right character, then read the ten characters using the read method (with its argument set to 10)

Reading from a file
• s = file.read([size])
  • If size is not specified or negative, returns a string with all the characters in the file until the end-of-file (eof)
• s = file.readline([size])
  • Returns a string containing one line (up to and including the newline character) from the file, with at most size characters
• L = file.readlines([size])
  • Returns a list of strings, each of which is a line from the file
• For read and readline, size is the maximum number of characters to be read. If fewer characters are left until the end of the line or the end of the file, a shorter string is returned
• For readlines, size is only a hint: whole lines are always returned, and no further lines are read once the total number of characters read so far exceeds size

Reading from a file
• for line in file: …
  • Iterates through the lines using a for loop
• All read operations start from the current cursor position.
• The cursor is incremented by the number of characters read

Exercise
• Write a program that prints to the screen the lines of the file twinkle.txt in reverse order:
  • Output:
    Like a teatray in the sky.
    Up above the world you fly,
    How I wonder what you're at!
    Twinkle, twinkle, little bat!
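
The reading methods and the cursor behaviour described on the slides above can be sketched as follows. This is only a minimal illustration, not part of the original lecture: the content written to twinkle.txt is an assumption reconstructed from the expected output of the exercise, the file is created in a temporary directory, and the helper names make_sample_file and demo_reading are hypothetical.

```python
import os
import tempfile

# Assumed content of twinkle.txt, reconstructed from the exercise output.
TEXT = (
    "Twinkle, twinkle, little bat!\n"
    "How I wonder what you're at!\n"
    "Up above the world you fly,\n"
    "Like a teatray in the sky.\n"
)


def make_sample_file():
    """Create twinkle.txt in a temporary directory and return its path."""
    path = os.path.join(tempfile.mkdtemp(), "twinkle.txt")
    f = open(path, "w")
    f.write(TEXT)
    f.close()
    return path


def demo_reading(path):
    """Call each reading method once and record what it returns."""
    results = {}
    f = open(path, "r")
    results["start"] = f.tell()             # cursor is at 0 in 'r' mode
    results["first7"] = f.read(7)           # exactly 7 characters
    results["rest_of_line"] = f.readline()  # from the cursor to the end of line 1
    results["remaining"] = f.readlines()    # all lines that are left, as a list
    results["at_eof"] = f.read()            # nothing left: empty string
    f.seek(0)                               # back to the beginning of the file
    results["line_count"] = 0
    for line in f:                          # iterate through the lines
        results["line_count"] += 1
    f.close()
    return results


path = make_sample_file()
results = demo_reading(path)
print(repr(results["first7"]))        # 'Twinkle'
print(repr(results["rest_of_line"]))  # ', twinkle, little bat!\n'
print(len(results["remaining"]))      # 3
print(results["line_count"])          # 4
```

Each call starts where the previous one stopped, which is why readline() returns only the rest of the first line and readlines() returns three lines rather than four.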
Writing to a file
• file.write(content)
  • content must be a string
• file.writelines(list_of_content)
  • list_of_content must be a list of strings
• All write operations start from the current cursor position.
• The cursor is incremented by the number of characters written
• If the read/write cursor is not at the end-of-file, the write operation overwrites the bytes at the cursor position

Exercise
• Write a program that writes the lines of the file twinkle.txt in reverse order to a new file

Files and functions
• A file can be opened, read or written to, then closed in a function
  • The file object is a local variable
  • The function can obtain the file name/path as an argument
• A function can also perform operations on a file object provided as an argument
  • Any file operations performed in the function are reflected in the file
  • Changes to the read/write cursor position caused by these operations are visible through the argument and should be taken into consideration in the calling code
  • It is possible to check whether the file is readable() or writable() before attempting the corresponding operation, or to use try-except to catch potential exceptions
  • Files opened in 'a' and 'a+' modes are writable, but the write position is always at the end.

Madlibs
• In the 1960s, entertainer Steve Allen often played a game called madlibs as part of his comedy routine. Allen would ask the audience to supply words that fit specific categories (a verb, an adjective, or a plural noun, for example) and then use these words to fill in blanks in a previously prepared text that he would then read back to the audience. The results were usually nonsense, but often very funny nonetheless.
• In this exercise, your task is to write a program that plays madlibs with the user. The text for the story comes from a text file that includes occasional placeholders enclosed in angle brackets. Suppose the input file is the attached carroll.txt.
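
Looking back at the “Writing to a file” and “Files and functions” slides, the behaviour they describe can be sketched as follows. This is a hedged illustration under stated assumptions rather than code from the lecture: the file name notes.txt, the sample strings and the helper write_lines are all hypothetical, and the file is created in a temporary directory.

```python
import os
import tempfile


def write_lines(f, lines):
    """Write each string in lines to the open file object f, one per line."""
    if not f.writable():
        raise ValueError("file object is not open for writing")
    # writelines() does not add line breaks, so they are added here.
    f.writelines([line + "\n" for line in lines])


path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # hypothetical file

# 'w' mode: the file is created empty and the cursor starts at 0.
f = open(path, "w")
can_read = f.readable()            # False: 'w' mode does not allow reading
write_lines(f, ["alpha", "beta"])
position_after_write = f.tell()    # the call above moved the cursor
f.close()

# 'r+' mode: the cursor starts at 0, so writing overwrites existing characters.
f = open(path, "r+")
f.write("ALPHA")
f.close()

# 'a' mode: the write goes to the end, even after seeking to the beginning.
f = open(path, "a")
f.seek(0)
f.write("gamma\n")
f.close()

f = open(path, "r")
content = f.read()
f.close()
print(content)
```

The calling code sees the cursor change made inside write_lines, which is why position_after_write is no longer 0 after the function returns.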
Madlibs
• Your program must prompt the user for the input file name (path), read the file, and prompt the user for words or phrases to fill in the placeholders. The program then prints the resulting text (after replacing the placeholders with the user input) to the screen and stores it in an output file.
• Note that the number, location and content of the placeholders vary from one input file to another, i.e. the program has to extract the user prompts from the input file

Madlibs
• Sample run based on carroll.txt