Regular Expressions in Python - Lecture Notes

Hey there, how's it going everybody? In this video, we're going to learn how to work with regular expressions in Python using the built-in re module. Sometimes you'll hear these called regexes. I recently released a video on regular expressions as a standalone topic, because they can be used independently of any one programming language; you can use them to search for text patterns within text editors and things like that. This video will use a lot of the same examples that I used in that video, except here we're going to see a lot more that is Python specific.

If you've never used regular expressions, they basically allow us to search for and match specific patterns of text. They can look extremely complicated, but that's mainly because there's just so much that we can do with them. You can create a regular expression for just about any pattern of text that you can think of. So let's look at some examples and see what this looks like.

I have a Python module here that we're going to use to search for some simple patterns to start out. First of all, to use regular expressions within Python, we need to import the re module, which stands for regular expressions, and I'm doing that up here on line one. The text that we're going to be searching is this multi-line string here. The variable name is text_to_search, and we have a lot of things in this string: at the top we have some lowercase letters, some uppercase letters, some digits, some special characters, a URL, phone numbers, and some names. I also have a simple sentence here that we're going to use in an example as well.

Before writing our first regular expression, we need to know what a raw string is, because we're going to be using these a lot throughout this video. A raw string in Python is just a string prefixed with an r, and that tells Python not to handle backslashes in any special way. For example, normally backslashes are used to specify tabs or
newlines and things like that. So if I were to print out the string \tTab, that is, a backslash t followed by the text Tab, and run this, then we can see that Python replaced our \t with an actual tab. But a raw string will just interpret the string literally. So if we put an r in front of this string and run it, then we can see that the backslash is no longer handled in any special way. That's important for us because we want our regular expressions to interpret the strings we're passing in, and not have Python do anything to them first. We'll be using these raw strings throughout the video, so just be aware of what they are.

Okay, so let's write some expressions and search for some patterns. To write our patterns, I'm going to use the re.compile function. Compiling will allow us to separate out our patterns into a variable, and it will also make it easier to reuse that variable to perform multiple searches. To use this, we can just say that our pattern is equal to re.compile, and now we can specify our pattern. First, let's just create a pattern that matches some literal characters. If we just wanted to search for the literal text abc, then we could say r for our raw string and then just pass in the literal text abc.

Now that we have that pattern specified, let's actually search through our text with it. I'm going to create a variable here called matches, and I'm going to set this equal to pattern.finditer. I'll explain this more in just a second, but what we want to search is our very long multi-line string up here at the top called text_to_search, so I'm going to paste that in. Now, to print out all of our matches, I'm just going to do a for loop and say: for match in matches, print out the match. If we save that and run it, then we can see that this finditer method returns an iterator that contains all of the matches. We're going to look at more regular expression methods
later in the video, but I think that finditer is one of the best for gathering all of the matches in an easy-to-read format. Each of these match objects shows us the span and the match itself. The span is the beginning and end index of the match. So when we searched this text with this pattern using the finditer method, it only found one match of abc, and it found it in our alphabet from indexes 1 to 4. These indexes are useful because they allow us to use the string slicing functionality in Python, where we can just plug in these values and get the exact match. So if I print out text_to_search from index 1 to 4 and run that, then we can see that we got abc, and that is the exact value that we matched. If you've never seen string slicing before and would like to see more of what you can do with it, I do have a separate video on that specific topic, and I'll put a link to that in the description section below. But for now, I'll just go ahead and remove that.

Now, if I make my screen a little larger here and we look at our text_to_search, then this is the literal abc that it matched right here. But notice it didn't match this capital ABC, and that's because the search is case sensitive. This search is also looking specifically for abc in that order, so if we were to instead come down here and search for cba and run that, then you can see that we don't get any matches.

Now, if we scroll up here and look at this metacharacter section, I have some examples of characters that need to be escaped. Say, for example, that I wanted to search for a literal period. If I change my pattern so that it is only searching for a period and I run this, then we can see that it does this weird thing where it matches almost everything. That's because the dot is a special character in regular expressions, and we'll see what
that is in just a second. But if we want to actually search for a period, then we have to escape it, and we can escape these characters with a backslash. So within my pattern here, I'm going to put a backslash before the period. If I run that and look at all of our matches here, then all of them are actual literal periods from our text. Escaping goes for any of these characters here in the metacharacters section that I say need to be escaped.

One practical example of this might be a URL. I have a URL here that is coreyms.com. If we wanted to match this exact URL, then we would need to escape the dot in that URL. So I'll change my search pattern here and search for the literal coreyms, then a backslash to escape the dot, so \., and then com. If I save that and run it, then we can see that we did match that URL.

Okay, so a literal search isn't too exciting, because that's something that we probably already know how to do within Python. Really, we want to use regular expressions to search for patterns, and to do this we're going to use some of those metacharacters that we were just escaping. In my snippets file here, I have a list of values where we can see the types of characters that we can match, and I'm going to split these into two columns so that it stays in view while we're going over them. I'll just move this over here and then go back to the simple example file that we were just looking at.

Now let's walk through each of these and see what we can match using regular expressions. The dot, which we looked at a little bit ago, matches any character except a newline. So if I come in here and just put a dot for our regular expression with no backslash and run this, then we can see that our matches include just about any character, but not newlines. So it's matching all of the
characters up here in our text_to_search that are not a newline, so you can see that we have a lot of those matches.

Now, \d matches any digit between 0 and 9. If I put in a \d for our pattern here and run that, then you can see that all of our matches are digits between 0 and 9. There are not as many, but you can see that they're all digits. The uppercase \D matches anything that is not a digit, and you'll see that this is a common theme with these special characters: the capital \W matches not a word character, which we haven't gone over yet, and the capital \S matches not whitespace. So the capital letters basically negate whatever the lowercase version means. Capital \D is not a digit, so if we put that in as our pattern and save and run it, then you can see that we have a lot of matches here, but none of them are digits.

The lowercase \w is a pretty common search. This is for a word character, and a word character is anything that is a lowercase letter, an uppercase letter, a digit, or an underscore. If we search for \w and run that, then you can see that all of these are going to be lowercase letters, uppercase letters, digits, which we have some of here, or underscores. I don't think I have any underscores in our text_to_search. Just like I was saying before, the uppercase \W will match anything that is not a word character, so it's basically the opposite. If I search for \W and run that, then we can see that these are everything that is not a word character: no lowercase letters, no uppercase letters, no digits, and no underscores.

The \s here matches any whitespace, which consists of spaces, tabs, and newlines. If I put in a lowercase \s for our search and run that, then you can see that we get all spaces and newlines, which are these \n characters, and I don't think I have any tabs in
here, but it would match tabs as well. The uppercase \S matches anything that's not whitespace, so anything that's not a space, a tab, or a newline. If we search for that and run it, then you can see that we have a lot of matches in here, but none of these are spaces, tabs, or newlines.

Now, the ones a little bit lower here are a little bit different. These are called anchors. They don't actually match any characters, but rather invisible positions before or after characters, and we can use them in conjunction with other patterns when searching. First, we have \b, which is a word boundary. Word boundaries are indicated by whitespace or a non-alphanumeric character. For example, if I scroll up here into our text_to_search, you can see that I have these three Ha's right here. Let's say that we wanted to match all of the Ha's that have a word boundary directly before them. If we go down here to our pattern, I can search for that by saying \b, which is our word boundary, and then the literal text that we're searching for, so I'll just search for Ha. If I save that and run it, then we can see that it matched two of these. I'm going to scroll up here and show you which ones, because it's not entirely clear just by glancing at the spans. It's matching the first one because the start of the line is a word boundary, and it's also matching the second one because this space is also a word boundary. It's not matching the last one because there's no word boundary before it; it is in the middle of a word.

Now, if instead of a lowercase \b we use an uppercase \B and save and run that, then you can see that we have one match, because \B is matching all of the Ha's that do not have a word boundary before them. So if we scroll up here to the top, it's basically just the opposite of our last search: it's matching this last one because it is the only one that does not have a word
boundary before it.

Okay, so moving on, we have this caret and this dollar sign. The caret will match a position that is the beginning of a string, and the dollar sign will match a position that is the end of a string. If I scroll down here to our pattern, I have a small sentence variable that I'm going to use to demonstrate this. Let me spread this over a little bit so that it fits on one line again. Now I'm going to change the text that we're searching to this sentence instead of text_to_search, so I'll paste that in there.

First, we'll look at the caret, which is the anchor for the beginning of a string. If I search with a pattern that is a caret followed by the literal text start and run that, then we can see that it's searching for the literal text start at the beginning of that sentence. If I change this to something that is not at the start and run that, then you can see that we get no match. This pattern is actually in our sentence, but the caret says that it has to be at the start. If I undo that, then we can see that it matched our start string because it is at the beginning of the string.

We can do something similar to match the end of the string using the dollar sign. If I search for the literal text end and follow that with a dollar sign, then save and run it, you can see that it matched this text that's at the end of our string, because we have the dollar sign there. If I replace this with something that is in our string but is not at the end of it, and run that, then you can see that even though we have a's in this string, it doesn't match, because it's not at the end.

Okay, so now I'm going to change our search back to text_to_search, and we're going to look at some practical examples. I'm just going to move this snippets file back to one panel, and we will keep looking at that
throughout the video. Then I'll go back to my simple.py here.

Let's say that we wanted to match these phone numbers here within our multi-line string, so let's write some regular expressions to do this. We can't just type in a literal search, because they're all different. They have a similar pattern, but they're not all the same digits, so in this case we need to use the metacharacters instead of literal characters. The pattern is three digits, then a dash, or in this one a period, then three more digits, then a dash or a period, and then four digits. Let's go ahead and create the pattern to match this, and I'll try to fit both of these on the screen at once.

We know that we can match any digit with \d. That will match any single digit in the text, like we saw before, but with a phone number we can start off by matching three digits in a row, so I'm going to put in three \d's there. If we run it at this point, then you can see that our matches are all of the runs of three digits in a row in our text_to_search. If I go up here to the top, then you can see that first it matched 123, then 456 as the next one, then 789, and then it moves down to the three-digit groups of our phone numbers. That's a little bit closer, but we need to complete this pattern.

Now that we're matching those first three digits, we're getting to the point in our phone number where we can match either a dash or a dot. For now, let's just match any character. If we remember from our snippets, the dot will match any character, so if I put this in after our three digits, it should match the hyphen or the dot, but it will also match anything else. Let's continue and match the next three digits, so just three more \d's, and then we get to the point where we're going to match another dash or
a dot, so we'll just put in another dot to match any character there. Now we're going to match four more digits, so four \d's. If I save that and run it, and let me make this a little smaller here, then we can see that it matched both of our phone numbers from our text up here. So now we're starting to see how this could be pretty powerful.

I have a data.txt file here, and it has a bunch of fake names, numbers, addresses, and emails. Let's open this file in Python, run our regular expression against its contents, and see if we can parse out the phone numbers from this text file. Let me scroll down here to the bottom and put this in. First we want to open the file, and we can do that with: with open, then data.txt, which is in the same directory so we can just spell it out like that without a full path, and then we want to read this file, as f. If you're unfamiliar with file objects, I do have a separate video on that as well, and I'll leave a link to that in the description section below. Now let's read in the contents of this file. I'm going to create a new variable here called contents, and set it equal to f.read(), which will read in all of the contents.

To search the contents of that file for our pattern, we can just copy our previous match, so I'm going to copy in this line to get our matches and paste that in. Also, I'm going to cut this out, comment that out for now, and paste that in down here. Okay, so what we're doing here is reading in the contents of that file, and now we're using the same pattern that we used up here to match the numbers in our text. We want to search contents instead of text_to_search, and then we're just going to print out all of those matches, and we've got these other matches commented out, so
those shouldn't show anything. So I will go ahead and run this. Okay, it looks like we are getting a UnicodeDecodeError. I'm not sure why that's the case, because everything in that data.txt file should be ASCII characters. To fix this really fast, I'm just going to put in encoding equals utf-8. I will fix this by the time that I put these files up, so you shouldn't have to do this part. I'm not sure why it is seeing a Unicode character in there, but that should solve it. If I save that and run it, now we can see that we're getting all of our phone number matches from that data.txt file. So we can already see how using these regular expressions can be extremely useful for parsing information from our data.

Okay, so now going back to our text within this simple file, I'm going to comment out the matches from the data file and uncomment our matches from this file. Let's say that we only wanted to match a phone number if it had a dash or a dot. Right now this pattern will match any separator, because we are using the period in our pattern, which matches any character. So if I put another number in here that uses a different separator, for example an asterisk, and save that and run it, then you can see that our pattern is currently grabbing that phone number as well.

To only match the dash or the dot, we can use something called a character set. A character set uses square brackets with the characters that we want to match. So I'm going to replace this period, which used to match any character, with square brackets, and inside them put the characters that we want to match in this position, which are a dash and a dot. Now if I save this and run it, then you can see that we're still matching our first two numbers here that have a dash and a dot, but we're not
matching this third number that has the asterisk, because that is not in our character set. You probably also noticed that we didn't need to escape our dot within a character set, and that's because character sets have some slightly different rules. You can escape these characters if you'd like, but it just makes it a little bit more difficult to read.

Now, even though the character set has multiple characters in it, it's still only matching one character in our text: a character that is either a dash or a dot. If I were to put two dashes up here in one of these numbers and save and run it, then you can see that it doesn't match, because it only matches the first dash and then moves right on to looking for another digit. That's something that can throw people off when they first start working with regular expressions, because if you look at these character sets, you can see a lot of different characters in them. In this one we only have two characters, and we'll see some sets later that are much larger, but no matter how many characters are in the set, it still only matches one character in our text. So just keep that in mind. Now let me make sure that that number is back to normal and save that.

To look at another example of a character set, let's say that we only wanted to match 800 and 900 numbers. Let me copy two of these numbers here, separate them with dashes, and make this one an 800 number and this one a 900 number. I'll save that and go down here to the bottom. To match 800 or 900 numbers, we're going to have to change our first three digits. The first digit is going to be either an 8 or a 9, so that's a good use for a character set. So we'll create a character
set and match either an 8 or a 9. The next two digits are both going to be literal zeros, so I'll put literal zeros in there. Save that and run it, and you can see that it printed out only the 800 and 900 numbers. Now let's perform that same search on the data file that we used before. I'll comment out the matches there that we're looping through and uncomment our data.txt section. This is still using the same pattern, and we just changed that pattern to match 800 and 900 numbers, so if I run this, that same code should now only print out the 800 and 900 numbers from that file, and you can see from our matches here that that's what we get. So that's pretty cool, to be able to match these more detailed patterns. Now I'm going to remove this file section, since I believe that's the last time we're going to use it, and uncomment that loop there.

Within a character set, the dash is actually a special character as well. When it is put at the beginning or the end, it will just match the literal dash character, but when placed between values, it specifies a range of values. For example, we know that \d matches any digit, but what if we only wanted to match digits between 1 and 5? To do that, we could change our entire pattern to be a character set and put in 1-5. With the dash between those values, that's now going to specify a range. If I save that and run it and we look at our matches, all of them are digits between 1 and 5.

We can use this for letters as well. If we wanted to match lowercase a through z, then we can just put in a lowercase a, a dash, and a lowercase z. If I save that and run it, you'll see that all of our matches down here are lowercase letters. If we wanted to match uppercase and lowercase letters, then we can just put these ranges back to back: right after the lowercase a-z, I can also put in a range of uppercase A-Z. So if I
save that and run it, now you can see that all of our matches down here are either uppercase letters or lowercase letters. You could keep adding to that, and add digits onto there as well if you'd like.

Another special character in a character set is the caret. I mentioned before that outside of a character set the caret matches the beginning of a string, but if you put a caret at the beginning of a character set, it negates the set and matches everything that is not in that character set. For example, when I put the caret at the start of this character set here, it's now going to match everything that is not a lowercase or uppercase letter. If we run this, then we can see that we get a lot of matches: newlines, digits, spaces, and things like that, but none of these are lowercase or uppercase letters.

Let's add to our text_to_search here, and let's say that we wanted to match the words cat, mat, pat, and all other three-letter words that end in at, but we don't want to match the word bat. To write a regular expression for that, instead of specifying all the characters except for a b, we can just use the negated character set and say that we want everything that is not a b, followed by a literal at. If we run that, then we can see that we matched cat, mat, and pat, but we did not match bat, and that is because the caret negates the character set containing only the b character. Okay, so now I'm going to remove those from our text.

Everything that we've looked at so far has involved single characters. For example, this is saying: match any single character that isn't a b, followed by an a, followed by a t. But we can use things called quantifiers to match more than one character at once. Let's go back to our original phone number expression from earlier, and we'll match any character for the separator for now. I'm going to fill in that pattern: we want three digits, just a period to match any character
for a separator, three more digits, a period for any separator, and then four digits. Just to make sure we typed that correctly, let me save that and run it, and you can see that we're still matching these numbers just fine. But you can see that we're searching for all of our digits one character at a time, and it's easy to make mistakes when you have a lot of these to type out. We can use something called a quantifier to match multiple characters at a time. Let me open up my snippets file again, split this into two columns again and move this over, scroll down to the quantifiers, and go back to my simple example here so that we can see both.

Okay, so for our first quantifier here we have an asterisk, which will match zero or more of the pattern that we're looking for. A plus sign will match one or more. The question mark will match zero or one. If we use curly braces with a number inside, that will match that exact number of the pattern. And if we use curly braces with two numbers separated by a comma, that will match a range of counts, where the first number is the minimum and the second number is the maximum.

For our phone number over here, this would be a good case to use exact numbers. Instead of writing all of these out, I can go back to the first \d that we want to match, put in curly braces, and say that I want to match three of those digits. I can do the same thing for the second section of digits, and then for the last section we want to match four digits. I can save that and run it, and you can see that we still get the same result. That allows us to specify the number of digits that we're looking for without needing to type them all out and possibly making a mistake along the way.

Here we're matching exact numbers, but sometimes we don't know the exact number and we'll need to use these other quantifiers. For example, down here at the
bottom of our text we have these names. Some lines start with the prefix Mr, some start with Ms, and some start with Mrs. Let's say that we wanted to write a pattern that would match these prefixes and the entire name that comes afterward. To start off easy, let's first just match the names that start with Mr. We can see that some of these have a period after the prefix and some do not; Mr Smith does not have a period here. Let's see if I can fit all of this in here. I'm going to cut off our loop, but that's okay.

To write our pattern, we want to search for Mr followed by an escaped period. If we run this right now, then you see that we have two matches, but there are three Mr's up here. It's matching both of the Mr's that have the period after the prefix, but this other one is not currently matching. To match that, we need to say that the period after the prefix is optional, and we can use the question mark quantifier to do this, which tells our pattern that we want to match either zero or one of those characters. If I put a question mark after our period there and rerun that, then we can see that it is now matching the Mr's without the period too.

Now to complete this pattern. After that optional period we have a space, and then we're running into uppercase letters. To match uppercase letters, we can use a character set like we looked at before, and pass in a capital A through a capital Z. If we run this, then we can see that we are matching up to the first letter of the last name for all of these names.

At this point we have a decision to make. After the first uppercase letter, we've completely matched the name for Mr.
T that we can see here, but we still need to match the rest of our other names. We could say that we will match any word character after that uppercase letter, and we can do that with a \w after that first uppercase letter. Now we need to decide what quantifier we want to use for our word characters. We could use the plus sign quantifier, which would match one or more of these word characters. If I put in a plus sign here and run this, we can see that it matches Mr. Schafer and Mr Smith, but it doesn't match Mr. T, because Mr. T doesn't have a word character after that first uppercase character. So a better solution here might be to use the asterisk quantifier, which allows us to match zero or more word characters following that first uppercase letter. If I save that and run it, you can see that now that it's matching zero or more, it includes Mr. T in there along with the other names.

I know that we've covered a lot so far, but we've just got a few more concepts to go, and then we'll look at some examples that wrap everything together.

We still haven't matched our Ms or Mrs names up here, so how would we do that? You might think that we should use a character set that matches either an r or an s after the M, and there are probably some ways that we could do that and get it to work, but it would be a bit ugly, because then we'd have to match an optional s after that for the Mrs as well. I think a better solution here would be to use a group. We haven't looked at groups yet, but groups allow us to match several different patterns. To create a group, we use parentheses, and within the parentheses we can match some patterns. Let's say that we wanted to match a literal r after the M. Then we can use this vertical bar character, which is basically an or, so we can say r, or a literal s, and then another vertical bar and a literal rs. So now we have three different patterns here: capital M followed by either
an r, an s, or an rs. With that little change, if I save that and run it and pull this up a little bit, you can see that we are now matching all of our names. I was just saving some characters by doing it this way. If you think it's easier to read, then we could have also put the M within this group. I could have just put an entire group here at the beginning and said Mr, or Ms, or Mrs. It adds more characters, but it's also a little bit clearer exactly what those alternatives are matching. If we move that M within the group there and save that and run it, then we can see that we still get the same results. These groups can actually be used to capture sections of your matched text, and that's something that we'll look at in just a minute.

But for now, let's do a quick recap of everything that we've learned so far by looking at some examples that incorporate all of these things together. I'm going to open up this emails.py file here, and I might make the text just a little smaller so that we can fit everything in one window. I've got a file here where I have this emails variable, and we have three different email addresses within this string that are fairly different from one another. Let's try to write a regular expression that will match all of these emails.

Let's match the first email address first. Let's come down here into our pattern where we are doing the re.compile, and we want to put our regular expression pattern within here. First, let's just match everything before the @ symbol. We can see that everything before the @ symbol there is just uppercase and lowercase letters. To match uppercase and lowercase letters, we can use a character set with a lowercase a through a lowercase z, followed by an uppercase A through an uppercase Z. Now we want to match one or more of those until we hit the @ symbol. To match one or more we can use the plus sign, and we want to match one
or more of those all the way up until we hit the @ symbol, and we can just put in a literal @ symbol for that. Now, after the @ symbol we have only lowercase letters, but let's go ahead and put uppercase letters in that character set as well. So we'll do lowercase letters with a lowercase a through z and an uppercase A through Z, and again we'll put in a plus sign to match one or more of those. And then finally, we'll match those all the way up until we hit this .com itself. For now we can just put in a literal match for that, so I could say backslash dot to literally match that dot there, and then just com for the com. So if we save that and run it, then we can see that we matched that first email address.

Now, just to get some more space here, I think we're done with the snippets file for now, so I'm going to bring this back over here, go to View, and make this a single column again, and go back to my emails file. So now that we're matching the first address here, let's build this up so that it matches the other two as well.

Now, it looks like it's not matching the second address because we need to allow a period in the first part of the expression, because we have a period right here. So we can add to the character set of the first part just by putting a dot in that first part of the character set there. And another thing that's different is that this ends with a .edu instead of a .com. So to match that, we could just match a group here. I will wrap this com in parentheses to create a group, and we can say that we want to match com, and use the vertical bar as an or, and say or edu. So if we save that and run it, you can see that now we are matching that second address as well. Okay, so good. We're building this up a little bit at a time.

So finally, to match our final address, it looks like we need to allow numbers and a hyphen in the characters before the @ symbol. So again, we can just add this to our character set. We can come here to our ranges and add the digits 0
through 9 onto the end of that, and we also want to put a hyphen at the end of that character set. And it looks like we also have a hyphen in our domain here, and this is the character set for our domain, so we can add a hyphen in there as well. And lastly, instead of a .com or a .edu, this ends in .net, so let's add that to our group here. We'll just put in another vertical bar and add in net. So if we save that and run it, then you can see that now we're matching all three of our email addresses.

Now, with something like email addresses, it can be pretty tough writing your own regular expression from scratch, but there are a lot of these available online, and once we learn how to write regular expressions, then we should be able to read them and figure out what they'll match. Now, I've always found reading other people's regular expressions to be a lot harder than writing them, but let's take a look at one and see if we can do this. So there is a regular expression that I pulled from online that matches email addresses, and I have this here in my snippets file at the bottom. So let me copy this over and paste it in as our expression, and let's walk through this. First of all, just let me save it and run it to make sure that it still matches all three of our email addresses, and it does.

Now, this looks a little intimidating, but really these are just some large character sets here. So first we have a character set that matches all lowercase letters, all uppercase letters, and all digits, then it's matching an underscore, a period, a plus sign, or a hyphen. And then we have a plus sign that will match one or more of any of those characters in the character set, and it matches those all the way up until it hits our @ symbol. Now, after the @ symbol, for the domain, we have another large character set here, and this matches any lowercase, any uppercase, any digits, or any hyphens. Now, I don't know a lot about email addresses, but I'm assuming, since they left out the underscore, the period, and the plus
sign, that these aren't valid for a domain. So then that is followed by a plus sign here that will match one or more of any of these characters in the character set, all the way up until it hits the last period, and that period is escaped with a backslash. And then after that dot, we have another character set which will match any lowercase, uppercase, digits, hyphens, or another dot, and that plus sign will match one or more of any of those characters. So reading regular expressions written by other people is probably one of the hardest parts of all this, but if you walk through it bit by bit, then you should be able to break it down like that for just about any pattern.

Okay, so the last concept that I'd like to look at in this video is how to capture information from groups. Now, we've already seen how to match groups, but we can actually use the information captured from those groups. So to show an example of this, I'm going to open up this file here called urls.py, and again, let me take this down just a little bit here so that we can fit everything. Okay, so with these URLs here, we can see that some of these URLs are http, some are https, some of them have www for the domain and some do not, so they're pretty inconsistent. So let's say that for each of these URLs we only wanted to grab the domain name followed by the top-level domain. So in this case it would be, for example, google.com, or coreyms.com, or youtube.com, or nasa.gov, and we just wanted to ignore everything else. So let's see how we can do this.

First, let's write an expression that actually matches these URLs. So in our re.compile here, we're creating our regular expression pattern, and we can say http to match a literal http. Now, some of these have https and some do not, so we want to match an s but put a question mark after it, because if you remember, the question mark matches 0 or 1, so basically it makes that s optional. So then, after that optional s, we want to match a colon, forward slash, forward
slash. And now here we have an optional www: some have www and some do not. Now, you may be thinking that we could use a character set or something here, but it's actually going to be best to use a group, because within a group we can just say that the entire www with the dot (and remember, we have to escape that with a backslash, so a backslash dot), that entire group is optional. So just to make sure we're on the correct path, if we save this and run it now, you can see that we're all the way up to the domain name on all of these matches so far. So let's go ahead and continue.

Now, to match the domain name, we can just match any word character, so we'll just put a backslash w to match a word character and then a plus to match one or more of those. And we want to match one or more word characters all the way up to the dot, so we will put in a backslash dot there to match that. Now we want to match the top-level domain, which is com or gov, so we can just match word characters again, one or more of those, after that period. So if we save that and run it, then you can see that now we are matching the entire URL for all of these URLs that we have listed.

Okay, but remember that the point here was to use our groups to capture some information from the URLs. So let's capture the domain name and the top-level domain, and by the top-level domain, again, I mean the .com or the .gov. So to capture these sections, we can just put them in a group by surrounding them with parentheses. For example, this section here was our match for the domain name, so I'll just surround that in parentheses there. And then for our top-level domain, we want to include the dot, so this would be the .com or the .gov, so we'll put the parentheses before the dot and then here at the very end. Since we just added groups to an existing expression, that shouldn't actually change our results, so if we rerun this, we can see that we still get the same results there, but now we actually have three different groups.
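The URL pattern built up above can be sketched roughly like this. This is a minimal sketch, not the exact file from the video: the `urls` string is hypothetical sample data (the transcript only names the four domains, so which ones use https or www is an assumption), and the pattern is a reading of the spoken description.

```python
import re

# Hypothetical sample data; the exact contents of urls.py are an assumption.
urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""

# http, an optional s, ://, an optional "www." group,
# then the domain name group and the top-level domain group (dot included).
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# finditer yields one match object per URL found in the text.
matches = pattern.finditer(urls)
for match in matches:
    print(match)
```

Adding the parentheses does not change what the pattern matches; it only records which part of each match belongs to which group.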
The first group is the optional www, the second group is the word characters that make up the domain name, and the third group is the top-level domain, which is the .com or the .gov. Now, there's also a group 0, and group 0 is everything that we captured, so in this case it's just the entire URL. So this would be group 0 here.

To show this, our match object down here that we are iterating through actually has a group method, and we can pass in the index of the group that we want to see. So we can say .group to use that group method, and now we can print out group 0 by just passing in 0. If we print that out, then like I said, group 0 is the entire match, so that's just the entire URL. Now, if we printed group 1, this should be the optional www. So if we save that and run it, then we can see that the URLs that have a www. print that out as the group, and the ones that don't just print out None values. Our group 2 here should be our domain name, so if I print out group 2 and run that, then we can see that that's what we got: google, coreyms, youtube, and nasa. Now group 3 should be our top-level domains, so if we print out group 3 and run that, then you can see that we get .com, .com, .com, and .gov.

Now, we can use something called a back reference to reference our captured group, and it's basically just a shorthand for accessing these group indexes. The regular expression module has a sub method that we can use to perform a substitution, so let's see what this looks like, and we can substitute in these back references which reference the groups. Let's just show an example and that will become more clear. So I can create a subbed_urls variable here, and I will set that equal to pattern.sub, and now we want to pass in the substitution. The substitution that we want to use is these back references that reference these groups. We wanted to replace these URLs with the domain name and the top-level domain, so the domain name was group 2, and we
use these back references with a backslash and then the number of the group. So we want to replace these with a backslash 2, which is the domain name, and then a backslash 3, which is the top-level domain. And now we need to pass in the text that we want to replace.

So let me walk through this one more time, just because that can be a little confusing. Here we are creating a pattern, and this pattern, as we saw, matches all of our URLs here in our string. And then in our subbed_urls, we are using that pattern to substitute in group 2 and group 3 for all of our matches in urls. So every time it finds a match, it will replace that match with group 2, which is the domain name, and then group 3, which is the top-level domain. So now, just to show how that worked, let's print out that subbed_urls, and save that and run it, and scroll up here, and we can see that that returned a new string with all those substitutions made. So if you had a large document of things that you wanted to reformat like this, then learning how to do this with regular expressions could save you a ton of time and allow you to do that within just a couple of minutes.

Okay, so we're really close to being finished up here. We should now have a pretty good understanding of working with regular expressions in Python, but we've been using this finditer method throughout the whole video, and that's because I think it does the best job of showing all the matches and the location of those matches. But there are other methods that we can use for different purposes, so let's take a quick look at some of those.

So first we have the findall method. With the finditer method that we were using, it returns match objects with extra information and functionality, but findall will just return the matches as a list of strings. Now, if it's matching groups, then it will only return the groups. So in the example, we're currently using a pattern where we match those names, and we have a group here at the beginning for the prefix, so this is
only going to match that group. So if I save this and run it, then you can see that it only prints out that first group. And if there are multiple groups, then it would return a list of tuples, and the tuples would contain all of the groups. Now, if there are no groups, then it would just return all of the matches in a list of strings. So if I was to change this pattern here to our previous phone number example, I'll do a digit with a quantifier of three, then just a dot to match any character, a digit with a quantifier of three, any character, and a digit with a quantifier of four. If we save that and run it, then you can see that it prints out just a list of all of our phone numbers. So that's one way to print out all of your matches, but personally I like the finditer method a little bit more because it comes with that extra functionality of that match object.

So next we have the match method. Now, match will determine if the regular expression matches at the beginning of the string. So for example, let's change our pattern to search for our simple sentence here, and we'll just search for the literal string of Start, and we want to search that sentence variable, so instead of that text_to_search we'll put sentence in there, and instead of findall we want to see what the match method does. So let's save that and run it. We got an error here because match doesn't return an iterable like finditer or findall. It just returns the first match, and if there isn't a match, then it returns None. So instead of looping through our result, we can just print out that matches variable. We'll just print out matches there, and save that and run it, and we can see that it returns that match object with our match.

Now, this only matches things at the beginning of strings. So if we were to search for something else that is in this sentence, so I'll search for this sentence pattern right here, if I save that and run it, then we can see that it returns None, because it's only seeing if this is
matching at the beginning of that string. If we want to search for matches within the entire string, then we can use the search method. Now, I'm not sure why they have a match method when regular expressions themselves have the caret to specify matches at the beginning of strings, but I'm sure that there's probably some reason that I don't know of. So if we wanted to search the entire string for that pattern, then we can use the search method instead, and just like match, this only returns the first match that it finds. So if I save that and run it, then we can see that it printed out that match object with the match there. Now, if we search for something that doesn't match, then this just returns None as well. So if we search for something like DNE, for does not exist, and we save that and run it, you can see that it just returns None, since it didn't find that pattern in our sentence.

Okay, so the very last thing that I want to cover in this video, and we'll cover it very quickly, is flags. We can use flags to make our lives a bit easier when working with regular expressions in Python, and you may see some of these at some point when you start using them more often, so let's go ahead and take a look. For example, let's say that we wanted to match a word, but match it whether it was in uppercase or lowercase or a mixture of both. So if I wanted to match the word start in our sentence, but each letter could be uppercase or lowercase, then normally to create a pattern like this you would have to do something like a character set with an uppercase S or a lowercase s, followed by an uppercase T or a lowercase t, and then an uppercase A or a lowercase a, and you kind of get the point. But since that's kind of a pain, instead we can just search for that literal text, and I'll just put those in all lowercase, and then we can just add a flag to our pattern here. For this we want to use the ignore case flag, so we can either write this out, so
this is going to be re dot, and this is going to be all caps here, IGNORECASE. And if we save that and run it, then you can see that even though our pattern here has a lowercase s and this has an uppercase S, it still matches that pattern because we have our IGNORECASE flag here. And there are shorthands for these flags as well, so instead of writing out IGNORECASE, I could just put a capital I there. If I save that and run it, then you can see that we get the same result.

Now, there are several different flags and we won't go over them all, but there's a multi-line flag that allows us to use the caret and the dollar sign to match the beginning and end of each line in a multi-line string, rather than just the beginning or end of the string. There is also a verbose flag that allows you to add whitespace and comments directly within your pattern, which could help you break up complicated patterns into easy-to-understand segments. Now, there are more flags, but those are probably the most common, and I think I'll cover flags further in a more advanced video.

Now, there are a lot of advanced features that we could go over with regular expressions in Python, and judging from my last regular expressions video, there seems to be a big interest in learning advanced expressions, so I'll be sure to put together an advanced video covering those topics in the near future. But hopefully now you feel pretty comfortable with being able to read and write these regular expressions within Python. If anyone does have any questions about what we covered in this video, then feel free to ask in the comment section below and I'll do my best to answer those. And if you enjoy these tutorials and would like to support them, then there are some ways you can do that. The easiest way is to simply like the video and give it a thumbs up. Also, it's a huge help to share these videos with anyone who you think would find them useful. And if you have the means, you can contribute through Patreon, and there's a
link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
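As a recap, the email pattern, the substitution with back references, the other methods, and the ignore case flag from this video can be sketched together as follows. This is only a sketch under stated assumptions: all of the sample strings (`emails`, `urls`, `phones`, `sentence`) are hypothetical stand-ins for the files used in the video, and the patterns are a reading of the spoken descriptions rather than the exact source.

```python
import re

# Hypothetical sample data standing in for the files used in the video.
emails = """
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
"""
urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""
phones = "321-555-4321 123.555.1234"
sentence = "Start a sentence and then bring it to an end"

# Email pattern as built up step by step: letters, digits, dot, and hyphen
# before the @, letters and hyphen for the domain, then com, edu, or net.
email_pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')
matched_emails = [m.group(0) for m in email_pattern.finditer(emails)]

# findall returns only the group when the pattern contains a single group.
email_endings = email_pattern.findall(emails)

# With no groups, findall returns the full matches as a list of strings.
phone_numbers = re.compile(r'\d{3}.\d{3}.\d{4}').findall(phones)

# sub replaces each matched URL with group 2 (domain name) + group 3 (TLD).
url_pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = url_pattern.sub(r'\2\3', urls)

# match only checks the beginning of the string; search scans all of it.
at_start = re.compile(r'Start').match(sentence)
not_at_start = re.compile(r'sentence').match(sentence)  # None
found_anywhere = re.compile(r'sentence').search(sentence)
missing = re.compile(r'DNE').search(sentence)  # None

# IGNORECASE (shorthand: re.I) lets a lowercase pattern match "Start".
ignore_case = re.compile(r'start', re.IGNORECASE).search(sentence)

print(matched_emails)
print(email_endings)
print(phone_numbers)
print(subbed_urls)
print(at_start, not_at_start, found_anywhere, missing, ignore_case)
```

Both match and search return either a single match object or None, so their results are printed directly instead of being looped over.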
