Python for Informatics – Chapter 7 Files



Welcome to chapter 7, Python
for Informatics Exploring Information. I'm Charles Severance,
I'm the author of the book and your host. And as always this
is brought to you by, no I'm sorry, it is all
copyright Creative Commons Attribution, the audio, the
video, the slides, and even the book. So here we go, oh
and so frankly where we've been working
all along, is we have been writing code
and talking to the CPU, hang on let me go get my CPU and
stuff hang on, be right back. Ah. OK here we go, here we go. Here's all that stuff. Remember that stuff
from the first lecture? There we go with that. Remember the motherboard
from the first lecture? This is kind of a
picture of what's on the screen, the motherboard,
the CPU plugs in here, memory plugs in here. And remember how the CPU
is sort of the brains, as much brains as there
is for the operation. The CPU is asking what
next, the instructions come in through
these little pins. There's data inside
and it stores sort of semi permanent data. Variables are all stored
pretty much here in RAM. And we write our programs,
and so your Python programs they're sitting here
in this RAM and they're being fed to this CPU,
through those chips, through those pins
right the pins, I mean it doesn't really
connect like that, and so frankly up
to now everything that we've been doing is
just the Python programming language. And so the only place we've
really been operating is here. We have been putting
Python into the main memory and– the main memory, and we
have been effectively feeding instructions to the CPU, the
central processing unit as it needed them and the
program would stop. And everything
we've done so far, everything is just sort
of fiddling around here. We have never escaped it. So now we are finally
going to escape from the central processing
unit and the memory. We'll still write programs
and have variables in here, but now we're going to use a
disk, the secondary storage, the permanent media. So if I go grab my Raspberry
Pi, that just goes right there, here's my Raspberry Pi. So here we've got
Raspberry Pi which is the small version, which
of course has a CPU, memory, and graphics processor all on
this little chip right here, but the secondary
memory is this little SD card; that
is the secondary memory for the Raspberry Pi. So the structure
of the Raspberry Pi is exactly the same structure
as any other personal computer, it's just smaller
and less expensive. And so in the
Raspberry Pi if you're programming the
Raspberry Pi, you're sort of finally escaping. All your programs are in
here your CPU's in here, and that's pretty much
how far you've got to run. But now, of course when
you save your files you save them to here. But now we're going
to start looking at data on the disk drive. And so it's time to escape
to the secondary memory. OK time to escape to
the secondary memory. Oh Raspberry Pi you
can go right there. OK? So it's time to find
some data to mess with. So a lot of what we've
been doing so far is just kind of the pre
work to get to the point where we can do this. And in here we're going
to have data files. Now we've been
making data files, you've been writing– every
Python program that you write on your computer
gets saved as a file, then Python reads
the file and runs it. But, now we're actually going
to start messing with some data. And so files are where we're
going to be working with. And so, one of the things
about secondary memory, is it's much larger. And this main memory of
computer is pretty large, it's just not large
enough to hold everything that the computer is
capable of holding. So the files that we're going to
work with, no we're not talking about image files or
Quicktime movies or things like that, we're going
to work with text files. Because the theme of this
course is digging through text. Sometimes we'll pull
it off the internet, sometimes we'll
read files, but it's digging through and using
all the things that we've learned so far looping and
strings and all those things to make sense of a
sequence of information. OK? Now to access file
information we have to do this thing
called opening the file. We can't just say,
yo the information is just omnipresent,
because there is so much data that you
can't have Python sort of know all the data. You literally have
hundreds of thousands of files on your
computer's hard drive and which one are
you going to read. So there's a step
that you have to do, where you call this built-in
function, called open, and say oh this is the
file I want to work with, of the hundreds of
thousands, and then once you do you've kind of got
this little connector into it. And open is a built-in
function inside Python, so let's say goodbye to that. The open function is a built-in
function in Python and it takes two parameters. The first parameter is the name
of the file, like mbox.txt, and then the second is how
you're going to read it. Are you going to
read it are you going to write it, et cetera,
now most of the time we'll be reading our files. So we call the open
function, and pass it in the name of the
file we want to open, and then how we want to read it. Now you can leave this
second parameter off and it assumes that you're
going to want to read the file. Now, when the open
is successful, it doesn't actually
read all of the data, because the memory is small,
small compared to a hard drive so you have to sort of step
through the data you'll tell it when to read it. So the act of opening it is not
actually reading all the data it is creating kind of like a
connection between the memory and the data that's
on the hard drive, right, it's connecting between–
that's going to fall down. Are you going to
stand up that way? Should come up with a
way to make that stand. So it's a connection,
so that your programs kind of running in here and the
file handle is just sort of a, it's like a phone call between
your memory and your disk drive. It's not the actual
data, the actual data is still sitting
on the disk drive. OK, so a graphical way
to take a look at this, is this file handle, the
thing that comes back from the open request, the open
goes and finds the file out on the disk drive
yada yada yada, and then the handle
is something that lives in the memory that is
sort of like the thing that maintains its connection
to where all the data is on the disk or on the
SD RAM that's in it. So the handle is
not all the data, but it is a mechanism that you
can use to get at the data. So if you print
it out, it doesn't have all the data from the file,
it says I am a file handle, that's opened this file
and we're in read mode. So it doesn't
actually have the data even though this is the
data that's in the file. And then we have
operations that we do to the handle like open
it, close it, read it, write it, so we do things to those. So the handle, and then
through the handle, it actually changes
what's on the disk or reads what's on the
disk, so the handle is kind of a thing
that's not there. If you attempt to open a file,
and the name of the file– now the way we're
going to do these is, these need to be in the
same folder on your computer as your Python code. Now there are trickier
ways to do it, but we're going
to keep it simple. This is the name of a
file in the same folder, as the Python code
that you're running. And if it's not, then we
get of course a traceback, and we're used to
reading tracebacks by now: no such file or directory,
stuff.txt. Oh, of course, I forgot to save it
or I typed it wrong. So the next thing
we have to learn is the notion of the
newline character. We haven't seen this
so far, but there's a special character
in files that is used to indicate
the end of a line, because these text files
that we've been writing, including the Python
programs that you have, are organized into lines. Each line has a variable
length and there's a special non printing character
that you just don't see. Now you see it, because you
see a line, multiple lines, but you don't see
the character itself. So it turns out that this
character is very important, because the data is just a
stream of characters on disk and then it's
punctuated by new lines, to tell it when it's
time to end a line. So, if we are building
a string, the constant for a newline is backslash n, written \n. And so when we make
a string that we want to have a new
line in it, we'll say 'Hello\nWorld'. And then if you
print it out one way you actually see
the backslash n, but then if you use the
print to print it out you see sort of like
the– it moves back down to the left margin and down. So sometimes you see the
slash n and sometimes it's shown as movement
right, it moves it. The other thing
that's important is even though we represent this as
two characters, the backslash n is represented as two
characters in a string, it's actually one character. So if we print it out
we see X, newline, Y, and if we ask how many
characters are in stuff, which is this string 'X\nY', it says three. That's important. OK there is one, two, three, the
new line is a single character. This is just a syntax that
we used to sort of encode a new line in a string. OK. So even though these are just
a long sequence of characters punctuated by new
lines visually, text editors and
operating systems show these files to us
as a sequence of lines. And it doesn't take very long to
just start thinking about them as a sequence of lines. As a matter of fact,
maybe you wish I never told
you about new lines. But when we start
reading files, we're going to have to deal
with these new lines, so the way that we sort of
have to mentally visualize of what these text
files look like, is they have a new line that
punctuates the end of the line. Now in reality if
we look at this, this r really comes
right after it. Right, this is all a
bunch of characters and the new lines
are punctuation. OK to say this is first line,
second line, third line, fourth line. So you've got to think
that each of these things is here sitting at
the end of the line. And so the number of
characters in this line includes that newline. Now the newline
is one character. OK? So, how do we read these files? Well we've already talked
about doing an open into xfile. And I'm just,
this xfile, again, that's just a mnemonic
name that I made up. This is a handle, remember
it's not all the data, but the handle is the way
that we can read the data, we can use it as
an access point. The coolest way to
read a file if it's a text file in
multiple lines, is to use a definite loop, a
for loop, for cheese in xfile. So this, remember we would
put a list of numbers or string here. Now we've put a
file handle here. Python knows automatically
that each time we're going to run this
loop, is going to go to the next line of the file. Automatically. For, a cheese is just a stupid
name that I came up with, it probably would be better
to call it line rather than cheese, but for
cheese in, and then it goes line by line and then stops
when it has read the whole file. So this line will print
out every line in the file. That's how you do it. These three lines open a file
and read every line in the file. So a file handle itself, is
a special kind of a sequence, much like a list of
numbers or a string is a sequence of characters. So one of the things we can do,
to combine one of our counting idioms, is count the
number of lines in a file. OK and so how we would do that
is we would open the file, set a counter to zero,
this time I'll use a mnemonic variable called
count, then for line in fhand, that says, run this
indented text once for each line in the file,
and each time through do count equals count plus
1, when the for loop is done, print the count. Pretty straightforward,
very few other languages are capable of
writing that program in as quick and as dense, as
succinct a way as Python is. Python does a really,
really nice job of this. OK so that's how
you count the lines. Open it write a for
loop and then add one. Now we can't just say, so what
you can't do, and this gives you a sense, you can't
say len(fhand). And that's because this
isn't really the data, it's sort of, you have to
like pull, pull it and read it to get the data out of
it, though we'll see another way of reading it later. OK so that's counting
the lines in a file. It turns out you can also
read the entire file. Now if you read the entire file
it's not broken into lines. You're getting all the
characters punctuated by new lines and
you get everything. Now you don't want to
read this if it's too big. So it's going to all try to read
into the memory of the computer and if the memory's
not big enough you just slow down to a crawl. But it's a real tiny file,
this works just fine. And so we have sort of
real– we open a file and we say f hand dot read. This is basically saying
hey, dear fhand, read it all and return it to me as a string. So that's a string with all the
lines of the file concatenated together with new
lines, which is actually exactly what's in the file. It's the raw data. That for loop sort of
looks for the new line and does all the
stuff automatically for us that's quite nice. So then we can like because
inp is a string at this point, we can just print
the length of it, we can say oh, there's
94,626 characters that came from that file. It reads the whole thing, whole
file, reads the whole file. We can also do things
like slice it now and so this is the first
20 characters from zero up to but not including 20. So this is our file. OK. So that's reading
through the whole file. So let me go back a little bit. This is the file that
we're going to play with, this file here that we're going
to play with in this class is a mailbox file. And this is actual real data
and these are real people and these are real dates,
having to do with an open source project that I worked
on called Sakai. I actually have a tattoo of
Sakai here on my shoulder. Maybe in an upcoming lecture
I'll have a short sleeve shirt and show you my tattoo but
for now I can't because I've got clothes on. So but this is real data, it's
the mbox.txt and mbox-short.txt file. So, so that's the
file that we're going to use for most of
the next few assignments. It'll be the same file,
you'll get tired of it, and you get to know all these
people Steven and Chen Wen and all the people in the file. So we can search for
lines that have a prefix. This is kind of find pattern
from the looping lecture. So we're going to go through
a list of lines in a file and we're going to only
print out the ones that match a certain thing. So again, we open
the file up, we're going to write a for
loop that's going to say for each line in the file. If the line, and then we
can call a utility function inside of string because
line is a string, if line.startswith('From:'),
print it out. So this means it's going
to loop through all of the lines of
the file and it's going to print the
ones that start with the string 'From:'. OK again four lines,
a complete Python program to read this file and
print the lines that have a prefix of 'From:'. So if you run this program,
and I suggest that you do, this is what the output's
going to look like. And it's like wait a second. I'm seeing the lines, seeing
the lines, that have the froms but then I get
these blank lines. And why is that? Why are these blank lines there? If I look at the program, I mean
I'm not printing blank lines, I'm only printing lines
that start with from. I'm not doing that, so why? What do you think? Give you a second. I've certainly done enough
foreshadowing in this lecture. Well it turns out these
new lines are the problem. So it turns out that
the print, we've been doing this all
along you just didn't, we didn't make a fuss about it. The print adds a new line
at the end of everything that it prints. So these yellow new
lines are coming from the print statement. But when we read the file,
each line ends in a new line. So these green new lines
are actually from the file. They're the ones from the file. So what's happening is
we're seeing two newlines, and so that turns
into a blank line. So how do we deal with that? Well, we've got
a string function that conveniently
solves that problem. OK. And that is we're going
to call rstrip. If you recall we had
lstrip, rstrip and strip to strip whitespace on
one side, on the other side or on both sides. So in this one we're
going to use rstrip. We're going to say,
we're going to read the line, and this line is
going to have a newline in it. rstrip says pull whitespace off the right side,
and the newlines also count as whitespace. Blanks or newlines
are white space. And then we're going to replace
this with no new line in it, then we're going to ask
if it starts with a from and then we're
going to print out. And then we go and we're going
to see exactly what we're looking for in this file. And there's no new lines. Now they're, so the new
line that's coming out here, is the one from the print,
not the one from the file because the one
from the file got wiped out by that
particular line. OK. So another general pattern
of these file based loops that we have been doing
is a skipping pattern. Now you can do–
the non skipping pattern is where
you're saying, I'm going to look for lines
that start with from and do something to them. Sometimes you want to
do something to all, you want to say here's a
bunch lines I'm going to skip, and then I'm going
to do something. So the skipping
pattern uses continue. And so the first few
lines here are the same. We open a file, we read
each line in the file then we're going to strip
off the white space. You're going to get tired
typing these three lines, because you're going
to do it a lot. Open the file, start
reading the file, strip the white
space for each line. And you can make it so that
you can look for some fact. In this case, I'm going to say
if not line.startswith('From:'), which means this is true for all the
lines that don't start with From:, then continue. If you remember
continue goes up. So the continue says, I'm
done, it finishes the iteration and it doesn't do
anything down here. And so this is– and
then we can do something. So I've kind of
flipped this, where I said, these are the
things I'm interested in, that's lines that
start with from, so I'm going to skip the lines
that don't, so I'm going to use continue. Either way you can do it,
depending on the complexity or how much– often when
you're– this is a good pattern when you have lots of
lines of code down here, that you're going to do
a lot of cool stuff with. You can also use things like
in, to select lines, right? So I'm going to look
for lines that have @uct.ac.za in them. So again I'm going to
open it up, go through each line
in the file, I'm going to strip the
whitespace off, and if not '@uct.ac.za' in line, if
this string is not in line, then I'm going to continue. So it's a way for me to
skip all of the lines that don't have this string in them. So this line does,
that one has it too, and then we're going
to print it out. Then we'll print out the
ones that make it past here. OK. So, but in is another
way to do searching. Like startswith, et cetera. So one more thing that
you might want to try is, so we can count right? Now and this is a pattern for
prompting for a file name. And so you'll get tired
of sort of changing your code every time you want
to open a different file, because you probably want to
run the program with mbox.txt, then mbox-short.txt,
just so you can test it with different sets of data. So here's just another pattern. We add this line to say
raw_input('Enter the file name: '), and there you go, we'll
type in the file name, and then the thing that we
open is whatever we entered as the file name, and
then the rest of it is pretty much yada yada. So here I'm, reading
the whole file, if the line starts with 'Subject:',
count equals count plus one, and then there were 1797
subject lines in mbox.txt. There were 27 subject lines
in mbox-short.txt. OK. So that's prompting
for the file names. Now open– the open
statement fails if the file name doesn't exist,
so you might want to add a try and except around that. If you want to, if you're
just writing code for yourself and you know it is OK, then you
don't have to write try except, but if you want to catch it,
and catch a bad file name, then you take the open and
turn it into these four lines. So this is the code that
we think might blow up and it's going to blow up we
know it's going to blow up. If they enter a bad file
name like na na boo boo, right, this is going to blow up. So what do we do? We use try and except. We put try around that,
we're going to take out some insurance on
that particular line, and then if it fails we're
going to print this message and then say exit, to get out. So if you get a good file, if
you get a good file it works, skips the except then runs the
thing, prints out the count. That's what's happening here. If on the other hand you
get a bad file, comes here, open blows up, runs the except
prints this out and then quits. So that's how this one
works with a bad file and now no traceback. Right. So we are, it's kind
of a short lecture. We're done with Chapter Seven. We open a file,
we read the file, we take out whitespace
at the end with rstrip, we use string functions. So this is kind of
putting it all together and it's kind of short little
programs now, so it's not– and you know,
starting now, we're going to start putting
these things together and start actually
doing work because now we have from the
first few chapters we have basic
capabilities of Python, now we have some
data to work with. Going forward we're going to
do increasingly sophisticated things with that data. So I can't wait to see
you in the next lecture.
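
The pieces covered in this chapter (open, a for loop over the file handle, rstrip, startswith, the continue skipping pattern, and try/except around open) can be put together in one short sketch. This is an illustration, not code taken from the lecture. It is written for Python 3, where print is a function and raw_input has become input, whereas the lecture uses the Python 2 forms. The function name count_prefixed_lines, the sample text, and the file name sample.txt are made up for the example; the course's real data files are mbox.txt and mbox-short.txt.

```python
import os
import tempfile


def count_prefixed_lines(filename, prefix):
    """Count the lines in filename that start with prefix.

    Returns None (after printing a message) if the file cannot be opened.
    """
    try:
        # Second argument left off, so the file is opened for reading.
        fhand = open(filename)
    except OSError:
        print('File cannot be opened:', filename)
        return None

    count = 0
    with fhand:
        # The handle hands back one line per iteration, newline included.
        for line in fhand:
            line = line.rstrip()            # drop the newline from the file
            if not line.startswith(prefix):
                continue                    # skipping pattern
            count = count + 1
    return count


# Made-up sample data standing in for a mailbox file (an assumption).
sample = ('From: alice@example.com\n'
          'Subject: hello\n'
          'From: bob@example.com\n'
          'Subject: re: hello\n')
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write(sample)

print(count_prefixed_lines(path, 'From:'))      # 2
print(count_prefixed_lines(path, 'Subject:'))   # 2
print(len('X\nY'))                              # 3: the newline is one character
print(count_prefixed_lines('na na boo boo', 'From:'))  # message, then None
```

Returning None for a bad file name is a choice made for this sketch so the function can be reused; the version described in the lecture prints a message and exits instead.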

18 Comments

  1. Kostas Nikolouts said:

    Is it possible to use open in writing mode to use it as a logging system?
    So this file will have all the executions of the program.

    June 26, 2019
  2. Daniel said:

    LMAO cheese

    June 26, 2019
  3. Nazmus Salehin said:

    Sir can you make lectures on object oriented programming ?

    June 26, 2019
  4. Lily Golden said:

    Love it! Thank you.

    June 26, 2019
  5. ali veli said:

    i downloaded the mbox.txt file via the given link in the book. when i ran counting line program, it returned as 132044', despite the count of line is 132045 in the book. so if you guys have this small issue like me just open the text file you have downloaded and press enter to place the indicator to downline, you can add a few white spaces if you want.
    the thing is the last line of the text does not include '\n' so you can give one more newline character by going down one more line. i hope this helps because it may seem confusing to beginners like myself.

    June 26, 2019
  6. John Richmond said:

    Where can I find the text file from this lecture?

    June 26, 2019
  7. Mikle Talalaevskiy said:

    Thanks, though would you fix audio at 6th minute for other viewers?

    June 26, 2019
  8. thisguy315 said:

    does anyone else get no audio for this video?

    June 26, 2019
  9. Captain Radd said:

    I'm really getting into these lectures now, and feel like I'm making progress. Fantastic!

    June 26, 2019
  10. Mars Williams said:

    You're a great teacher! I learned more in five minutes of your video than I did in several days of tutorials from Learn Python the Hard Way. I love that I can write Python in a way that is more familiar to me (from previously learning Ruby), and can write loops more succinctly now that I have watched this video. 

    June 26, 2019
  11. sreejith kc said:

    I awlasy get this error, not sure why? this file is located inside the python directory but still get this error,please let me know can i fix it

    >>> test = open('C:/Python30/greece3.txt')
    >>> for cheese in test:
    print cheese

    SyntaxError: invalid syntax (<pyshell#67>, line 2)

    June 26, 2019
  12. sreejith kc said:

    >>> test = open('NEWS.txt', 'w')
    >>> print test
    SyntaxError: invalid syntax (<pyshell#59>, line 1)
    >>> test = open('C:/Python30/greece3.txt')
    >>> print test
    SyntaxError: invalid syntax (<pyshell#61>, line 1)

    any idea why its not working?

    June 26, 2019
  13. Yanni Phone said:

    Outstanding lecture.

    June 26, 2019
  14. Arundhati Bakshi said:

    os.chdir() sometimes comes in handy when navigating directories.

    June 26, 2019
  15. Kennedy Sanchez said:

    Hi Chuck. How i could read just only the last line of a big file, and extract it's value?

    June 26, 2019
  16. Felipe Diaz said:

    Amazing! thanks from Chile

    June 26, 2019
  17. Chuck Severance said:

    I tweaked the instructions for 7.1, and 7.2 – and tested both and they worked swimmingly. I added a note about the file to open.

    June 26, 2019
  18. MrBlackbisounours said:

    we can t do exercices 7.1 and 7.2 : they are not working on the course website…..can u fix it please so that we can go on in Learning chapter 9-10-11 thxs

    June 26, 2019