What are Regular Expressions, and Why Do We Use Them?
When I first delved into web scraping, I came across the term regular expression, commonly referred to as regex. Several videos I'd viewed used regular expressions alongside BeautifulSoup. Being a curious cat, I couldn't resist looking it up on our wonderful world wide web. But what I saw was gnarly!
There were cryptic symbols strewn across the page. My brain refused to make sense of it all because it was all too hard! It waved that white flag of defeat before it even did any work to understand it. That was my first experience with regular expressions: rather cowardly, I know! Fast forward 365-plus days, and I'm ready to redeem myself. This blog is my redemption, hahaha...
To be honest, like many things in life, tasks that come across as impossible are not so challenging once we decide to commit and take action. All the built-up anxiety and dread are dispelled as you start working on the very thing that's inducing those feelings. Regular expressions are not as daunting as my brain back then led me to think.
This blog is not just about my redemption. I've chosen to write about this topic because regular expressions are a handy and powerful tool to know and learn. They have many applications, and web scraping is one of them.
So, in this blog, I will cover the following:
- What are Regular Expressions?
- Why Do We Use Regular Expressions?
- Regular Expression Basics
- Commonly Used Regular Expression functions
- Regular Expression Exercises
What are Regular Expressions?
A regular expression is formally defined as a search pattern used to search for things in a string. You can think of it like Ctrl+F in a Word document, but many times more powerful and precise.
A regular expression is neither a programming language nor a Python library. That said, regular expressions can be used in many programming languages, with slight variations in syntax. From what I've read, the core syntax remains largely the same across languages. I'm writing this blog with regular expressions in Python in mind.
Why Do We Use Regular Expressions?
There are several use cases for regular expressions:
- Web scraping: One of the key tasks in web scraping is to find and locate the content of interest to scrape, so regular expression is naturally well suited for this purpose.
- Manipulating strings/code: regular expressions are super handy when searching for and manipulating strings (text) or code, just like Ctrl+F and Replace in Microsoft Word and Excel.
- Verification: in computing, a common task is to check a text against a pattern. For example, ever wondered how a program knows when we entered the wrong email address? Often, it uses a regular expression to check.
Regular Expression Basics
I want to preface that what I'm about to share in this section can easily be located on the internet and can come across as rather abstract. Still, I want to share it for your convenience so you don't have to look for it.
Assuming you intend to learn regex, I suggest you review this section. Please, please try your very best to overlook this abstraction. In the Regular Expression Exercises section, I'll share an excellent tutorial on regex, and I promise you, it'll make a lot of sense then. Strings of light bulbs will light up in your brilliant brain!💡-💡-💡
Before we move into the abstract bit, let's start with the simple but important concept. Here is a string of text:
With regex, I could type the word sun as the pattern, and it will match the word. Even though our regex matched the word I want: "sun", the way it operates is not at a word level but at a character level. It first looks for s, then u, then n. When we have all three characters matching, then we have a match. So, the key takeaway concept is that regex operates at the character level, not the word level.
This has an implication. It means that the search pattern sun would be a match for the following:
Regex is case-sensitive. If my search pattern is Sun, none would be a match. This is another important concept to remember.
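To see both ideas in action, here is a minimal sketch in Python. The word list is made up purely for illustration (it is not the example text from above), and re.search is used because it scans the whole string for the pattern.

```python
import re

# Hypothetical sample words, all lowercase, chosen only for illustration.
words = ["sun", "sunny", "sunshine", "unsung"]

# The pattern "sun" matches wherever the characters s, u, n appear in a row,
# even in the middle of a longer word such as "unsung".
lowercase_hits = [w for w in words if re.search("sun", w)]
print(lowercase_hits)  # ['sun', 'sunny', 'sunshine', 'unsung']

# Regex is case-sensitive by default, so "Sun" finds nothing here.
capital_hits = [w for w in words if re.search("Sun", w)]
print(capital_hits)  # []

# The re.IGNORECASE flag switches case sensitivity off.
ignore_case_hits = [w for w in words if re.search("Sun", w, re.IGNORECASE)]
print(ignore_case_hits)  # ['sun', 'sunny', 'sunshine', 'unsung']
```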
That's the simple stuff. Ready for the stuff that's less straightforward, but not so bad? What I'm about to share are essentially three cheat sheets with examples:
- Special Characters, aka metacharacters
- Special Sequences
- Sets
Special Characters (Metacharacters)
Metacharacters are characters with a special meaning.
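Here is a minimal sketch of a few commonly used metacharacters in Python. It is not a complete list, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "cat cot cut coat ct"

# .  matches any single character (except a newline)
any_char = re.findall(r"c.t", text)        # ['cat', 'cot', 'cut']

# *  matches zero or more of the preceding character
zero_or_more = re.findall(r"co*t", text)   # ['cot', 'ct']

# +  matches one or more of the preceding character
one_or_more = re.findall(r"co+t", text)    # ['cot']

# ?  matches zero or one of the preceding character
zero_or_one = re.findall(r"co?t", text)    # ['cot', 'ct']

# ^  anchors the match to the start of the string
at_start = re.findall(r"^c.t", text)       # ['cat']

# $  anchors the match to the end of the string
at_end = re.findall(r"ct$", text)          # ['ct']

# |  means "either or"
either = re.findall(r"cat|cut", text)      # ['cat', 'cut']

print(any_char, zero_or_more, one_or_more, zero_or_one, at_start, at_end, either)
```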
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and it has a special meaning, as you shall see. These characters are case-sensitive.
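As a rough illustration, here are a handful of special sequences tried out in Python. The sample sentence is invented, and only a few sequences are shown. Notice how \d and \D do opposite jobs, which is why the case of the letter matters.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Room 101 is on floor 3"

# \d matches a digit; \D (uppercase) matches anything that is NOT a digit
digits = re.findall(r"\d+", text)        # ['101', '3']
non_digits = re.findall(r"\D+", text)    # ['Room ', ' is on floor ']

# \w matches a "word" character: a letter, a digit or an underscore
words = re.findall(r"\w+", text)         # ['Room', '101', 'is', 'on', 'floor', '3']

# \s matches a whitespace character
spaces = re.findall(r"\s", text)         # five single spaces

# \b matches the boundary between a word and a non-word character
whole_word_on = re.findall(r"\bon\b", text)  # ['on']

print(digits, non_digits, words, len(spaces), whole_word_on)
```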
Sets
A set is a group of characters inside a pair of square brackets, and it has a special meaning: it matches any one of the characters listed inside the brackets.
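A short sketch of how sets behave in Python, using a made-up sentence. These are just three common forms: a list of characters, a range, and a negated set.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Gray or grey? Call 555-0199."

# [Gg] matches either G or g; [ae] matches either a or e
spellings = re.findall(r"[Gg]r[ae]y", text)      # ['Gray', 'grey']

# [0-9] is a range: any digit from 0 to 9
numbers = re.findall(r"[0-9]+", text)            # ['555', '0199']

# A ^ at the start of a set negates it: anything that is NOT a letter,
# a digit or a space
punctuation = re.findall(r"[^a-zA-Z0-9 ]", text)  # ['?', '-', '.']

print(spellings, numbers, punctuation)
```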
Commonly Used Regular Expression functions
A module called re needs to be imported using an import statement to use regular expressions in Python. Here is a list of the common functions:
- re.findall(): this function searches for all occurrences of a regex pattern within a string and returns a list of all the matching substrings.
- re.search(): this function searches for a regex pattern within a string and returns a match object if the pattern is found.
- re.match(): this function tries to match a regex pattern at the beginning of a string and returns a match object if the pattern is found. I'm not sure why this function was created. It seems redundant when we can use special sequences and metacharacters to complete this type of search.
- re.split(): this function splits a string into a list of substrings based on a regex pattern.
- re.sub(): this function searches for a regex pattern within a string and replaces it with a specified replacement string.
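Here is a minimal sketch that runs each of the five functions on the same made-up string, so you can compare what each one gives back. The string and patterns are my own assumptions for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "apple, banana; cherry"

# re.findall returns a list of every match
all_words = re.findall(r"[a-z]+", text)       # ['apple', 'banana', 'cherry']

# re.search returns a match object for the first match anywhere in the string
found = re.search(r"banana", text)
found_text = found.group()                    # 'banana'
found_at = found.start()                      # 7

# re.match only looks at the beginning of the string
match_at_start = re.match(r"apple", text)     # a match object
no_match = re.match(r"banana", text)          # None, "banana" is not at the start

# re.split cuts the string wherever the pattern matches
parts = re.split(r"[,;]\s*", text)            # ['apple', 'banana', 'cherry']

# re.sub replaces every match with the replacement string
replaced = re.sub(r"[,;]", " and", text)      # 'apple and banana and cherry'

print(all_words, found_text, found_at, no_match, parts, replaced)
```

Both re.search and re.match return None when nothing is found, so it is worth checking the result before calling .group() on it.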
Regular Expression Exercises
I've created a few regular expression exercises for you to test your newly acquired knowledge. But before you attempt them, please watch this tutorial video by Corey Schafer. In the last few days, I've watched several tutorial videos. I think this one is by far the best: it covers not only the basics but also touches on some advanced topics of regular expressions.
Now that you've finished the video, give these two exercises a go. There are many solutions to the exercises, and my answers are just one set of many.
Exercise #1
Write a regular expression pattern that will allow you to get the Name and Age in the following text:
NameAge = "Alexander is 44, Theo is 33, Janice is 21 and Charlie is 29."
Solution
Names: [A-Z][a-z]*
Ages: \d{1,2}
Let's look at Names first. The first bracket pair matches an uppercase letter as the first character of the name. The second bracket pair matches lowercase letters, and because the names can be of varying lengths, I used the quantifier *. Instead of *, I could also use + and still get the names.
For Ages, \d matches a digit, and in the text, we can see that no age has more than 2 digits, so I used curly brackets, a type of quantifier that here allows a match of 1 to 2 digits.
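If you'd like to check the solution yourself, here is a small sketch that runs both patterns with re.findall. Pairing the names and ages with zip is my own addition, not part of the exercise.

```python
import re

NameAge = "Alexander is 44, Theo is 33, Janice is 21 and Charlie is 29."

# An uppercase letter followed by zero or more lowercase letters
names = re.findall(r"[A-Z][a-z]*", NameAge)
print(names)  # ['Alexander', 'Theo', 'Janice', 'Charlie']

# One or two digits in a row
ages = re.findall(r"\d{1,2}", NameAge)
print(ages)   # ['44', '33', '21', '29']

# Optional extra: pair each name with its age (ages are still strings here)
people = dict(zip(names, ages))
print(people)
```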
Exercise #2
Here is a list of emails; some are obviously erroneous.
Emails = ''' Sailor_sally@gmail.com melanie@com GandT_123@happyhour. '''
Write a regular expression pattern that will retrieve the correct email address (i.e. Sailor_sally@gmail.com).
Solution
correctEmail= \w+@[a-z]+\.[a-z]{3}
Let's break this down.
- \w+ matches 'Sailor_sally'.
- @ matches the '@' in the email.
- [a-z]+ matches the domain: 'gmail'.
- \. matches the '.'.
- [a-z]{3} matches the top-level domain: 'com'.
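Here is a quick sketch that runs the pattern against the list with re.findall. The variable name valid_emails is my own.

```python
import re

Emails = ''' Sailor_sally@gmail.com melanie@com GandT_123@happyhour. '''

# Word characters, an @, a lowercase domain, a literal dot,
# then exactly three lowercase letters
correctEmail = r"\w+@[a-z]+\.[a-z]{3}"

valid_emails = re.findall(correctEmail, Emails)
print(valid_emails)  # ['Sailor_sally@gmail.com']
```

Keep in mind this pattern is tailored to the exercise. It would miss perfectly valid addresses with, say, a two-letter ending like .io or a dot before the @, so treat it as practice rather than a general-purpose email checker.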
That's it, folks. Let me know how you went with the exercises.