Use Python to Load & Prepare Data Analytics



Welcome to lesson five, building additional skills for data analysis. In this lesson we build more skills in data extraction, transformation, and analysis. We'll learn more defaultdict skills, which work to pivot and accumulate data. We'll also use defaultdict to reverse a one-to-many mapping. Next we'll explore the csv reader and how to partially consume iterators. We'll also review Python's core looping idioms and the effective use of assertions. At the end of the lesson you will be fluent with Python's rich toolset for loading and preparing data for analytics.

What we're going to do next is apply k-means to an analysis of voting blocs in Congress. There's some preparation that we need to do, so we'll take a look at the tools that will come up in that exercise. In this lesson we'll prepare that tooling. There's a handful of things that job will need, and the first one is collections.defaultdict. We've already used it for grouping. Now we'll use it in a couple of other ways: for accumulating data, which is sometimes called tabulating, and for reversing a one-to-many mapping, which is a core data analysis skill. We'll take a look at the glob module and at how to read files with an encoding. We'll look at consuming iterators using next, and we'll use the csv reader. I'll go over the standard Python looping idioms for high-level Python. We'll revisit Counter and show how to increment using a counter, a very straightforward procedure, and then how to use assertions in code. Some of this will be review for you, but most of it ought to be material that shows you how to put together high-level Python. Once we've got the tools out of the closet, in the next lesson we'll snap them all together and analyze all of the voting data for the 114th U.S.
Congress.

Let's begin: from collections import defaultdict. What we learned before is that defaultdicts are used for grouping, but they can be used for another purpose: accumulating data. So I'll make a dictionary that's a defaultdict, and we can accumulate information about people, a series of characteristics for each of them. This will be very useful for us when we tabulate our votes. It will be a defaultdict of list, and the idea is that we're altering the direction of our tabulation. Perhaps we have data that starts out by giving Raymond a characteristic, a favorite color of red; then it gives the favorite color of Rachel, which of course is blue; and the next person is Matthew, who comes in at yellow. The contents of the defaultdict at this point are a list of characteristics for each person, and if we take a look at that defaultdict, we see that each person maps to a corresponding color. This is best seen using pprint. In fact, whenever you're doing non-trivial data analysis, pprint is one of your essential tools.

So we've tabulated one color for each person. But now the data set loops back over the people and tells us another characteristic: what kind of computer they have. I clearly have a Mac. Rachel is a Windows programmer, so she's got a PC. And Matthew is just a little boy who has a little VTech computer. Notice that the information comes in ordered by person, always Raymond, Rachel, and Matthew; however, we're now getting a new characteristic for each of the people: the Mac, the PC, and the VTech. If we run pprint once again, we see that we're accumulating all of the characteristics of each person. I could go through several more rounds, and each column in this accumulation would represent a different characteristic.

The defaultdict is mainly useful during the accumulation phase because it builds the initial list, but after that's done we don't actually need a
defaultdict, and we can convert it to a regular dictionary, since we no longer need its defaulting behavior once all of the data is accumulated. Some people find that a little easier to look at, and a little more interchangeable with other functions that expect regular dictionaries instead of defaultdicts. In other words, it's very common to use a defaultdict for accumulation and then convert it to a regular dictionary for normal use after the accumulation phase.

So far we've seen two uses of defaultdict: it can be used for grouping, which we did in a previous lesson, and in this lesson we use it for accumulation. This is particularly important when your data doesn't come in the order in which you want to process it later, and that will apply in our congressional example.

There's one other use case for defaultdicts that I really like, and I think it's a core data analysis skill: how to invert, or reverse, a one-to-many mapping. Let's take a look at how we model a one-to-many mapping. We model one-to-many with a dictionary where the key is the one and the value is a list of the many. Just like we had up above, the list is the value and the key is a scalar.

Let's look at a sample one-to-many dictionary. I've got an English-to-Spanish dictionary: one is uno (showing off my very impressive knowledge of Spanish), two is dos, and three is tres. The English word trio doesn't have a direct equivalent in Spanish that I know of, so we'll also map it to tres. Further, we'll include the English word free, which has two translations: the Spanish word libre, which means available, and gratis, which means free of charge. To model one-to-many, the many is always a list, so each of these will have a list inside it. This is the normal pattern for a one-to-many relationship: the one is the scalar key and the many is a list, even in the cases where we only have one instance of the many listed. And if I
pretty print this (e2s, with width 40), we can show the structure of it. The dictionary has rearranged the order of all of the entries, and we have libre and gratis going first. This is actually a very common arrangement of data. In fact, if we go out to Google and search for "define free", the word free has multiple meanings, as you can see. This is quite a common way to think about data: we've got a scalar as the key and then a list as our multiple values.

The question then becomes, how do you invert a one-to-many dictionary? The result will also have a one-to-many relationship, because three and trio each map to the same value. We can loop over the English-to-Spanish dictionary, getting each English word, which maps to possibly one or more Spanish words, from e2s.items(). First I want to build my base dictionary: s2e is a defaultdict of list, which we'll use to model a one-to-many relationship. We loop over the English-to-Spanish dictionary, getting the English word and all of its Spanish words. For every English word there are possibly multiple Spanish words; that's a list, and we can loop over them: for each Spanish word in the Spanish words. We need two for loops because it's a two-dimensional data structure. Then, in s2e, we look up that Spanish word (the Spanish word is the key and the value is always a list) and append the corresponding English word to that list. So now we have our Spanish-to-English dictionary and our English-to-Spanish dictionary.

The learning points are these. A way to model a one-to-many relationship is to put the one as the key and the many as a list. The technique for building up such dictionaries is to use a defaultdict mapping to a list; that's a very common way to represent one-to-many data. The technique for inverting it is to flatten the data structure by looping over the first dimension and then looping over the second dimension. That will give us one English word for one Spanish word.
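The inversion technique just described can be sketched as follows. The dictionary entries follow the example in the talk; the loop variable names (eng, spanwords, span) are assumptions chosen for illustration.

```python
from collections import defaultdict
from pprint import pprint

# English-to-Spanish: the key is the "one", the value is a list of the "many".
e2s = {
    'one': ['uno'],
    'two': ['dos'],
    'three': ['tres'],
    'trio': ['tres'],
    'free': ['libre', 'gratis'],
}

# Invert by flattening the two-dimensional structure:
# the outer loop walks the entries, the inner loop walks each list.
s2e = defaultdict(list)
for eng, spanwords in e2s.items():
    for span in spanwords:
        s2e[span].append(eng)

# The defaulting behavior is no longer needed once the data is built.
s2e = dict(s2e)
pprint(s2e, width=40)
```

The two nested loops reduce every entry to a pair of scalars, and the defaultdict then regroups those pairs under the Spanish word.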
Given that the Spanish word might have multiple English values, we append to the list, and so we can see tres mapped to three and trio, and free appearing under both libre and gratis.

I hope you enjoyed that. Once you've got the hang of this and you've seen it done once or twice, it's actually very easy to invert a dictionary that has a one-to-many relationship.

However, there's a simpler case: what if it's just a one-to-one relationship, what mathematicians call a bijection, or one-to-one and onto? If we have a simpler dictionary, we can invert it directly using a dictionary comprehension. So let me go back to the simpler case: we have a dictionary where one is uno, two is dos, and three is tres. At this point we don't have a many relationship, so there's not a list on the right-hand side, and the same value is not used more than once. This is a bijection: each word in English has exactly one word in Spanish, and each word in Spanish has exactly one word in English. So how do we invert that? It's quite easy. We make a for loop over the English word and the Spanish word in the items. This is simpler than before; last time we had to loop over all of the Spanish words. Then we build a dictionary using a dictionary comprehension, where this time the Spanish word is the key and the English word is the value.

So now you have two techniques. One technique inverts bijections very simply, with a dictionary comprehension that swaps the key and the value. The other technique handles the more complicated situation where you have one-to-many: we create a defaultdict, we flatten the two-dimensional data structure by looping over the outer dimension and then the inner dimension, and once we have a pair of scalars we accumulate it in the defaultdict.

So now you know three use cases for a defaultdict: there's grouping, there's accumulation, and there's reversing a one-to-many mapping. I hope you're enjoying the skill building. Once you know these patterns and idioms, it becomes
somewhat easy to express big ideas with only a little code in Python, and so this becomes one of your building blocks: how do I group, how do I accumulate, how do I reverse a one-to-many mapping? Essentially all of them use a defaultdict with a list or a set.

The next tool that we're going to look at is glob. The glob module seems like it doesn't have a particularly great name. If I go to glob.glob, the outer glob is the module and the inner glob is the function. glob says: expand the wildcards. If the Congress data files are in the current directory, all of the *.txt files, for example, would be listed here.

So glob is kind of interesting. The question is, where does its name come from? The name glob means global wildcard expansion, and when this function was created it was exactly the right name. There was a time when Python was Unix-only, and every Python programmer had Unix experience and had learned bash, and in bash, what it's called when you expand a wildcard is globbing. In fact, the part of the bash reference manual that mentions this is called globbing. So when this was introduced to Python, if it had been called anything other than glob, people would have come to Python and said, "I've looked everywhere and I can't find it. How do you glob in Python?" But a lot of years have passed and not many people know that word anymore, so if it were being named today it would probably be called os.expand_wildcards. In modern times we tend to spell out names a little more fully, we tend to use an underscore in the name, we keep all the characters lowercase, and we don't use odd terms like glob. So I contend that glob was well named at the time of introduction, but the name no longer makes a lot of sense.

Our next topic is how to open a file that has an encoding. It'll become more and more common to have files that have Unicode inside. We've gotten used to every file being just plain text, but in fact more and
more people are using smart quotes, trademark symbols, em dashes, and en dashes, or they have non-ASCII characters in their names. So let's look at such a file. I'll try to open it normally: I open the data that I've downloaded from Congress, where we have the congressional votes for a Senate bill. If I open that file and go to print it, notice that there is a UnicodeDecodeError. What's going on is that there's a non-ASCII character inside. This is fairly easy to cope with: when we open the file, we can also specify the encoding. Let me clear my screen a little bit so it's clear what we're doing. You can see that if you pass the encoding, it'll translate the file into Unicode. That's a fairly straightforward operation and something you should get used to, because more and more files will come with some kind of encoding, and you don't want the file open to fail.

The next one up is how to use next or islice to remove elements from an iterator. Suppose I have a simple iterator over a string. One way to get a piece of data out of the iterator is to call next on it. One of the interesting things about next is that it not only fetches the next value, it also consumes one element of the iterator. I can consume another element here, which means that the iterator is now pointing at the letter c. Another tool that consumes an iterator, but consumes it completely, is list. Interestingly, the list doesn't start at the beginning: two elements of the iterator have been consumed already, so it picks up at the c and lists all the way through the end.

So a good intermediate-level Python skill is how to consume some of an iterator and then pass it along to another function. This is really handy when you have some header information in an iterator: you want to pull out the headers and then pass the rest over to a loop that goes over the main body of data. Being able to pass around partially consumed iterators is a really good skill, and it's remarkably easy to do. We run iter to get an iterator, we
run next to consume some values, and then we hand it over to a high-volume function (a for loop, or list, or tuple, or sorted, or whatnot) that will consume the iterator all the way to the end.

Next up is how to parse data like this up here. We can do it by using split on commas, but that's not nearly as robust as using the csv module. The csv module is remarkably easy to use: you wrap a file object in a csv reader. So import csv, and now when we open the file of congressional information, I can loop over all of the rows, for row in csv.reader(f), and print the row. Notice that what it has done for us is split each row into a list, broken into the appropriate fields. It treated two consecutive delimiters as indicating an empty field. The first couple of rows here are, first, some identification information (this is Senate vote twenty on the sanctions enforcement act) and then all of the headers describing what the various fields are. So the csv reader takes care of the parsing for us. It's written in Python, but it also has a C extension behind it, so the csv reader tends to run very, very fast, and it's very flexible compared to just running split on a comma. It can also handle quoting inside the CSV file, which is important if someone has a comma in one of the fields, such as the name; the quoting separates that comma from the field separator itself.

The next skill, which is a Python fundamental, is tuple packing and unpacking. We can build a tuple quite easily with parentheses and commas on the right-hand side of the equals sign. The data type here is a tuple, and taking these individual elements and putting them together is called tuple packing. We're putting all of our data together in a suitcase, and the suitcase has contents of length four. Unpacking is where the commas are on the left-hand side of the equals sign: we can bring out the fields, the first name, last name, age, and email address, and unpack the person tuple into
separate fields. So on the right side of the equals sign the commas pack, and on the left side of the equals sign the commas unpack.

The next exercise is fun. Those of you who've watched some of my videos on YouTube will already know a little bit about the core Python looping idioms. They're somewhat straightforward. I'll set up a traditional data set that I like to use. I have a set of names: Raymond, Rachel, Matthew. This is a common way to enter a list, which is to type a string and then split it; it's a little easier to type and a little easier to edit later on. We'll have some associated colors: Raymond of course is red, Rachel is blue, Matthew is yellow. And then I'll put in the names of some cities, Austin, Dallas, and Austin among them. These will be the data that we'll use to cover Python's core looping idioms.

Our first task is simply to loop over the names and print them in title case or upper case. Here's the old-fashioned way of doing it, and the hint that you're doing it wrong is the square brackets. Python does support indexing; that said, it's not particularly fast at indexing, and it's a very low-level way of thinking. The more elegant way to write this code is to loop over the list directly. Compare and contrast the two ways: the second way is shorter, the second way is clearer, and the second way is faster.

The next task is to loop over all of the names and show their position in the list. A traditional way to do this is to loop over the indices, showing that Raymond is the first element in the list and Matthew is the third. But conceptually what we're doing is enumerating, so there's a higher-level construct for that: for i, name in enumerate(names, start=1), telling it to start the count from 1, since otherwise it defaults to 0, and then print i and the name. The second way is clearer about what it does, the second way is faster in Python, and the second way, because it names enumerate, lets you know precisely what you're trying to do. I also like that you can control the start
argument for enumerate.

The next task is to print out all of the colors, but in reverse order. There's a traditional way to do that: for i in range, and we need the indices going from the length of the colors minus one, looping down to minus one (excluding minus one), and decrementing by a step of one each time. Finally we can show the ith color. If we've done this correctly, yellow should show first. This code is awful. It's fairly hard to get right: an amazing number of people will put the second argument at 0, which will omit the final color, or they'll forget the minus one. And yet this used to be the idiom for how you loop backwards. The more modern idiom is simply to loop directly: for color in reversed(colors), print the color. It's much clearer about what it does, and interestingly it's faster than the other way. The hint that you're doing it wrong is, again, the use of square brackets.

The next task is to bring the names together with the colors pairwise. If there are more names than colors, omit the unpaired items; if there are more colors than names, omit the unpaired items. Here is the traditional way to do it: n, the number of times we loop, will be the minimum of the number of names and the number of colors. We loop over the indices up to the smaller of those two values and print out that the ith name is associated with the ith color. This is called lockstep iteration, and it brings the values together pairwise, associating Raymond with red, Rachel with blue, and Matthew with yellow. But there's a better way, and the better way uses zip. We can zip together the names and colors and print the name and the color. The second way is clear about its intention, it's shorter than the first, and it's also faster. There's a lot to like about it. zip has been around for a very long time; it was present in early versions of Lisp, although it had an unusual name. I think it was called mapcar, which is not nearly as clear as zip, which suggests bringing together items
like a zipper, although a real zipper interleaves, whereas Python's zip brings elements together head to head: Raymond with red, Rachel with blue, Matthew with yellow.

The next task is to show all of the colors, but in alphabetical order: for color in sorted(colors), print the color. I think this reads very nicely: show me the sorted colors. As you learned in a previous lesson, you can also apply key functions. A key function is a function that takes one argument and transforms one of these elements into a key that will be used as the sort key. An example would be the key function len: I would like to sort all of these colors by their length. This is the shortest color, at three characters, and this is the longest color, at six characters.

The last task is to loop over all the cities without duplicates. Currently our cities list looks like this, and we have duplicates. The tool for eliminating duplicates in Python is set: for city in set(cities), print the city. What's interesting about all these looping idioms is that they compose together very nicely. For instance, Chicago is out of alphabetical order; we can fix this by using sorted. For those of you who know SQL, this would parallel the query SELECT city FROM cities ORDER BY city, where we also want to make them distinct. The way you say DISTINCT in Python is set, the way you say ORDER BY in Python is sorted, and these tools compose really nicely. We could reverse all of those and put them in reverse order, like this, so that Houston goes first. Further, we can count them using enumerate, and that will give us a position number for each city, by default counting from zero unless we set the start argument to a higher value. We could go further and, somewhere in here, perhaps put in a map with the unbound method str.upper. This style of programming is called functional programming, where we pass the output of one function into the next, and the glue that holds them all together is that each one of them consumes an iterable and emits an iterable. This shows that all the
Python core looping idioms can be composed together rather nicely. It would be quite rare and unusual to actually put this many together, but you could. And whenever you need to snap two pieces together to compose them, it's easier to do that than to try to break it out into smaller pieces.

A little review of collections.Counter is in order. Let me clear up our screen here. import collections. What is a counter good for? It's good for counting things. How many reds have we seen so far? A regular dictionary would give a KeyError here, but a counter is suitable for incrementing a count. How many blues have we seen? Plus-equals one. And we see another red, so the state of the counter is that we have two reds and one blue. So counters are wonderful for counting things, hence the name Counter. The technique is to make an instance of it and then look up values and increment them by one. Some nice features of the counter that we discussed before: the first is the most_common method, which answers what is the one most common pair, or the two most common; by default it gives you everything. Another tool is elements, which lists out all of the elements, expanding them by their multiplicity, showing both of the two reds and the one blue.

The very last topic is assertions. An assertion takes a statement that's supposed to be true, verifies that it's true, and then doesn't complain about it. However, if you give it an incorrect assertion, it complains by raising an exception. So assertions are used for checkpointing your program: when you believe that certain assumptions are supposed to be true at that point and you just want to check them, you put in an assertion. The more complex your manipulations, the more likely it is that you're going to need assertions to help you debug the program.

And that wraps up this lesson. We now have all the tools we need: our looping idioms; knowing several ways to use defaultdicts, using them for grouping, for tabulation, and for one-to-many mappings;
using glob to expand wildcards; reading files with an encoding; packing and unpacking tuples; how to use a csv reader; how to increment instances of Counter; and how to use assertions. These tools together will allow us to express really big ideas in Python with a small amount of code, and in the next lesson we'll use them to analyze that congressional data set. I look forward to seeing you then.
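To make the accumulation pattern and the bijection inversion from this lesson concrete, here is a minimal sketch. The people and characteristics follow the talk; the shape of the incoming data (a list of rounds, each a list of pairs) and the variable names are assumptions made for illustration.

```python
from collections import defaultdict
from pprint import pprint

# Assumed input shape: each round reports one characteristic per person,
# always in the same person order.
rounds = [
    [('raymond', 'red'), ('rachel', 'blue'), ('matthew', 'yellow')],  # favorite color
    [('raymond', 'mac'), ('rachel', 'pc'), ('matthew', 'vtech')],     # computer
]

# Accumulate (tabulate): the defaultdict creates the empty list on first use.
d = defaultdict(list)
for round_of_data in rounds:
    for person, characteristic in round_of_data:
        d[person].append(characteristic)

# Convert to a regular dict once the accumulation phase is over.
d = dict(d)
pprint(d)

# Simpler case: a one-to-one mapping (bijection) is inverted by swapping
# key and value in a dictionary comprehension.
e2s = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
s2e = {span: eng for eng, span in e2s.items()}
pprint(s2e)
```

Each position in a person's list corresponds to one round, so the first column holds colors and the second holds computers.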
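The file-handling tools from this lesson (glob, open with an encoding, next on an iterator, and the csv reader) fit together roughly as in the sketch below. The real congressional files are not reproduced here: the file name, the title row, the column names, and the rows are all hypothetical sample data written to a temporary directory so that the example runs on its own.

```python
import csv
import glob
import os
import tempfile

# Hypothetical sample data: an identification row, a header row, then the body.
sample = (
    'Example vote 1 - Sample Act\n'
    'person,state,vote,name\n'
    'S001,TX,Yea,"Peña, José"\n'
    'S002,CA,Nay,Ann Smith\n'
    'S003,NY,,Bob Jones\n'
)

with tempfile.TemporaryDirectory() as dirname:
    path = os.path.join(dirname, 'votes_example.csv')
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(sample)

    # glob.glob expands the wildcard into a list of matching paths.
    filenames = glob.glob(os.path.join(dirname, '*.csv'))

    rows = []
    for filename in filenames:
        # Naming the encoding avoids a decode error on non-ASCII characters.
        with open(filename, encoding='utf-8', newline='') as f:
            reader = csv.reader(f)
            title = next(reader)     # consume the identification row
            headers = next(reader)   # consume the header row
            rows.extend(reader)      # the rest of the iterator is the body

print(title)
print(headers)
for row in rows:
    print(row)
```

The quoted name keeps its embedded comma as part of a single field, and the two consecutive commas in the last row produce an empty string for the missing vote.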
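The core looping idioms reviewed in this lesson can be summarized in one short sketch. The names and colors follow the talk; the exact contents of the cities list are an assumption, chosen so that it contains duplicates and includes the cities mentioned.

```python
names = 'raymond rachel matthew'.split()
colors = 'red blue yellow'.split()
cities = 'austin dallas austin chicago houston dallas'.split()  # assumed contents

# Loop directly over the list instead of indexing.
for name in names:
    print(name.title())

# enumerate gives positions; start=1 overrides the default of 0.
for i, name in enumerate(names, start=1):
    print(i, name)

# reversed replaces range(len(colors) - 1, -1, -1).
for color in reversed(colors):
    print(color)

# zip gives lockstep iteration and stops at the shorter input.
for name, color in zip(names, colors):
    print(name, color)

# sorted accepts a key function; here the colors are ordered by length.
for color in sorted(colors, key=len):
    print(color)

# The idioms compose: distinct, ordered, reversed, upper-cased, numbered.
composed = list(enumerate(map(str.upper, reversed(sorted(set(cities))))))
for i, city in composed:
    print(i, city)
```

Reading the composed expression from the inside out mirrors the SQL analogy in the talk: set plays the role of DISTINCT and sorted plays the role of ORDER BY.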
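Finally, tuple packing and unpacking, Counter, and assertions can be sketched together. The person record below is a hypothetical placeholder, not data from the talk; the counter values follow the red and blue example.

```python
from collections import Counter

# Tuple packing: commas on the right-hand side build the tuple.
person = ('Rachel', 'Example', 30, 'rachel@example.com')  # hypothetical record

# Tuple unpacking: commas on the left-hand side pull the fields back out.
first_name, last_name, age, email = person

# Counting: a plain dict would raise KeyError on the first increment.
c = Counter()
c['red'] += 1
c['blue'] += 1
c['red'] += 1

top_one = c.most_common(1)         # the single most common (value, count) pair
expanded = sorted(c.elements())    # each element repeated by its count

# Assertions checkpoint the assumptions; they are silent when true
# and raise AssertionError when false.
assert len(person) == 4
assert c['red'] == 2 and c['blue'] == 1
assert sum(c.values()) == len(expanded)

print(first_name, last_name, age, email)
print(top_one)
print(expanded)
```

Looking up a key that has never been counted returns zero without raising, which is what makes the increment safe.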

4 Comments

  1. nareshgb1 said:

    After watching this video, I am finally going to remember defaultdict and csv reader without having to google them (well, almost)

    July 12, 2019
  2. Luiz Ferraz said:

    The algorithm shown to reverse a one-to-many is actually a many-to-many reverse. A one-to-many is a subset of many-to-many, so it also works for that.

    July 12, 2019
  3. Iaroslav Karandashev said:

    Raymond is awesome! Best promotion of Python 🙂

    July 12, 2019
  4. Safa Safa said:

    How can i make my it clearer to you? do i need to btw?

    July 12, 2019