Files in Python: The Basics

Why use files?
• Very small amounts of data: just hardcode them into the program
• A few pieces of data: ask the user to input them
• More than this: you need an external file stored on secondary storage

External data files
• Handle large amounts of data
• Data is independent of the program, so the program can change without changing the data
• It is easier to edit data in an editor than during a run of the program (can’t go back!)
• The same data can be used as input to different programs
• Output files can be saved for later use
• Output of one program can be used as input to another

Text files versus binary files
• Text files are created by editors and stored as character codes such as ASCII
• Binary files are stored as raw binary numbers and have to be handled differently
• Text files are normally processed sequentially
• Binary files can be processed sequentially or randomly (we will not do binary files in this class)

Creating a text data file
• This is done just like creating any other text file
• You can use Notepad
• You can use the editor of the IDE you write Python in
• You can use a word processor like Word if you are careful to save as plain text
• Store the text file in the same folder as your source code

Delimiters
• The \n (newline) character is a very important symbol in text files
• It delimits what Python calls a ‘line’ in the file
• It gets put into the file whenever you press Enter at the end of a line
• A blank line is represented by two newlines together: \n\n
• It matters whether you press Enter at the end of the last line of the file: some methods in Python will treat the last line differently because of the \n character

Files in Python: Buffers

Why a buffer?
• Computer equipment runs at different speeds. The hard drive, and secondary storage in general, is MUCH slower than RAM and the CPU
• This creates a bottleneck where the faster pieces have to wait for the slower ones to deliver the action, service, or data that is needed
• Buffers help with this bottleneck! They let the OS bring a bigger chunk of data into RAM and hold it until the program asks for it, which means fewer trips to the hard drive

What’s a buffer?
• A buffer is an area of RAM allocated by the OS to manage these bottlenecks
• Every file you open in your program will have a buffer
• There are also buffers for the keyboard, screen, network connections, modems, printers, etc.
• You the programmer do not have to worry about this happening; it’s automatic!

Buffer for input file
• When you read from a file, the buffer associated with the file is checked first. If there’s still data in the buffer, that’s what your program gets
• If the buffer is exhausted, the OS is told to get more data from the file on the hard drive and put it in the buffer
• This process continues until the program ends or the file has no more data to read
• Think of a pantry in a house: it’s a buffer between the people in the house and the supermarket

Buffer for output file
• Your program writes to an output file
• The data does NOT go directly and immediately to the hard drive, but to an output buffer
• The OS monitors this buffer. When it is full, its contents are written to the hard drive all at one time
• Think of a garbage can in a house: it is a buffer to hold trash until it can all be taken to the landfill at one time

Why do I care about buffers?
• Most of the action on a buffer is automatic from the point of view of most programmers
• BUT! If you forget to close your file when you are finished with it, the file can be left in an “unfinished” state!
• Some OSs are bad about cleaning things up when your program is over. They should close all files automatically, but sometimes they don’t!

Why do I care?
• A file in an “unfinished” state may be one of those files you run across after an application has crashed. If you try to erase it, the OS says “no, that file is still busy”, even though it’s not.
• Especially for output files, if you forget to close the file, the file on the hard drive may not get the last buffer of data that you thought your program wrote. The file will be missing data, or may even be empty if the file was small.

[Figure slides: Before the open happens; After the open; After one readline(); After two more readlines]

Don’t forget!
• Don’t forget to close your files!
  – The close statement must look like this:
        infile.close()
  – No arguments in the parentheses, but the parentheses must be there!

Files in Python: Opening and Closing

Big picture
• To use a file in a programming language
  – You have to open the file
  – Then you process the data in the file
  – Then you close the file when you are done with it
• This is true for input files and output files

Opening a file
• To use a file, you first have to open it
• In Python the syntax is
      infile = open("xyz.txt", "r")        # for input (read)
  or
      outfile = open("mydata.txt", "w")    # for output (write)
• This creates a link between that variable name in the program and the file known to the OS

Processing in general
• “Processing” is a general term
• In Python there are at least four ways to read from an input file
• And two ways to write to an output file
• They all use loops in one way or another
• See other talks for details

Closing a file
• When you are finished with the file (usually when you reach the end of the data, if it is an input file), you close the file
• In Python the syntax is
      infile.close()
• This works for input or output files
• Note: no arguments, but you MUST have the ()! Otherwise the method is not actually called!
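
A minimal sketch of the open, process, close pattern described above. The temporary folder and the values written to the file are assumptions made for illustration, so that the sketch can run anywhere without touching your own files.

```python
import os
import tempfile

# Hypothetical location: a scratch folder so the sketch is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "mydata.txt")

# Open for output, write two lines, then close.
outfile = open(path, "w")
outfile.write("first line\n")   # goes to the output buffer first
outfile.write("second line\n")
outfile.close()                 # the close sends any buffered data to the file

# Open the same file for input, read two lines, then close.
infile = open(path, "r")
first = infile.readline()
second = infile.readline()
infile.close()                  # note the (): without them nothing is called

print(repr(first))
print(repr(second))
print(infile.closed)
```

Writing infile.close without the parentheses is not an error in Python; it simply names the method without calling it, so the file stays open.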
Files in Python: Input techniques

Input from a file
• The data you get from a file is always a string or a list of strings
• Two ways of reading are what I call “bulk reads”, because one statement totally exhausts the file. There is no more to read after that!
• The other two ways read one line at a time from the file
• Files are objects, so most of these are methods called with dot notation as usual

read()
• The read method is called like this:
      datastr = infile.read()
• What does it do? It reads the entire file into one string variable
• The newlines and other whitespace in the file are stored in the string like every other character
• Be aware that reading a LARGE file this way may take some time and a lot of RAM!
• This is convenient if you do not particularly care where the newlines are in the file
• This is a BULK read

readlines()
• The syntax:
      datalst = infile.readlines()
• This method reads ALL the data from the file and uses \n as a delimiter to break the data into a list of strings
• There is nothing more to read in the file after you execute one readlines call
• This is convenient if you know the data in the file is organized by lines, i.e. each line needs to be processed by itself
• This is a BULK read

readline()
• Note that this is a different method from readlines. Note the s!
• Syntax:
      datastr = infile.readline()
• Semantics: it reads the next line of data from the file, up to and including the next newline
• Returns a string which has the data and a \n character at the end
• Useful when you don’t want to read ALL the data at one time, or when you have more data than RAM to hold it
• Usually used inside a while loop
• Signals the end of the data by returning an empty string. Note that this is different from an empty or blank line in the file, which is returned as "\n"

Files in Python: Caution about readlines vs. read and split

You would think that
      lines = infile.readlines()
and
      line = infile.read()
      lines = line.split('\n')
would give the same result in the variable lines, that is, a list of strings from the file, delimited by the newline characters.

You would be surprised!
• readlines() gives you a list of strings, each with a \n at the end
• Except! If you did not press Enter on the last line of the data file, the last string in the list will not have a \n in it
• read() followed by split('\n') gives a list of strings, yes, but none of them will have \n in them (remember that split removes the delimiters from its results)

And another surprise!
• If you did press Enter on the last line of the data file, readlines still works properly. The last string in the list will have a \n character just like all the others
• BUT the same file read with the read/split combination will have one extra entry, an empty string at the end of the list
• This is something you need to be aware of while processing your data. Many programs crash because they assume that every string will be the same length, for example.

Files in Python: Output techniques

Outputting to a file
• There are two ways to do this in Python
  – print (more familiar, more flexible)
  – write (more restrictive)

Using print to output to a file
• You add one argument to the print function call. At the end of the argument list, put file= followed by the name of the file object you have opened for output
• Example:
      print("hi", a, c*23, end="", file=outfile)
• You can use anything in this print that you would use when printing to the screen: end=, sep=, escaped characters, etc.
• The defaults for end= and sep= still apply, so you get a newline at the end of every print unless you give end= a different value
• Note that it says file=outfile, NOT file="abc.txt"

Using write to output to a file
• write is a method, similar to the Text object in the graphics package
• It is called on the output file object (dot notation)
• It is allowed ONE and only one STRING argument, so you have to convert numbers to strings and concatenate strings together to make one argument
• Example:
      outfile.write("hi" + str(ct) + "\n")
• It does NOT output a newline automatically. If you want one, you have to put it in the string

Files in Python: When does it crash?

How a file can make a program crash
• For input files, several things can happen that cause a program to crash
• Some are avoidable with some care, some are not
  – the file you are trying to open does not exist
  – trying to read past the end of the file
  – the data in the file is not laid out as the program expects
  – the file exists but is empty

Output files
• Opening an output file is both constructive and destructive
  – If the file you are opening to write to does NOT exist, it is created
    • Note that if you gave the path to the folder as part of the file name, the open will NOT create folders!
    • In other words,
          outfile = open("c:\\My Documents\\cs115\\file1.txt", "w")
      will only work if the path already exists and you have permission to write to it
  – If the file you are opening to write to DOES exist already, all its data is destroyed
    • The open tells the OS to set the length of the file to zero bytes!
• If you try to write to a medium that is full, your program will crash
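
Several of the behaviors described in the output and crash sections can be checked with one short sketch. The scratch folder, the file name scores.txt, and the values written are all assumptions made for illustration; catching FileNotFoundError is one possible way to handle a missing input file, not the only one.

```python
import os
import tempfile

# Hypothetical scratch folder and file name, so nothing real is overwritten.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "scores.txt")

# 1. Opening a file that does not exist for input raises FileNotFoundError.
try:
    infile = open(path, "r")
    infile.close()
    missing = False
except FileNotFoundError:
    missing = True

# 2. Opening for output creates the file. print and write both work.
outfile = open(path, "w")
print("hi", 23, file=outfile)          # print adds the newline itself
outfile.write("bye" + str(7) + "\n")   # write takes one string, newline included
outfile.close()

# 3. readlines keeps each \n; read plus split drops them and, because the
#    file ends with a newline, leaves an extra empty string at the end.
infile = open(path, "r")
lines = infile.readlines()
infile.close()

infile = open(path, "r")
pieces = infile.read().split("\n")
at_end = infile.readline()             # nothing left: returns an empty string
infile.close()

# 4. Reopening an existing file with "w" destroys its old contents.
outfile = open(path, "w")
outfile.close()
size_after = os.path.getsize(path)

print(missing, lines, pieces, repr(at_end), size_after)
```

In this sketch, reading past the end of the file does not itself raise an error; the trouble usually comes later, when the program treats the empty string as if it were a real line of data.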