To RegEx or Not To RegEx? (Part I)

A comprehensive tutorial on regular expression and how to use the Python module re to enhance your machine learning workflow

Steven Yan
8 min read · Jan 6, 2021

Photo by Cynthia Smith on Unsplash

2021 could not have arrived soon enough. America, the beacon of democracy, faces an existential crisis in the middle of an unprecedented pandemic, not unlike the one Prince Hamlet grappled with in his soliloquy. On a much less significant level, my cohort members and I at the Flatiron School Data Science bootcamp have ruminated and pontificated on the significance of RegEx in our data science journey.

In such discussions, the general consensus is that RegEx is a skill everyone wants to understand better, but no one is particularly motivated to learn because of all our competing demands. It has permeated various aspects of the curriculum, from web scraping with APIs and cleaning and manipulating data in Pandas to Natural Language Processing.

^(1)?\s?\(?(\d{3})[\s.\-\)]?(\d{3})[\s.-]?(\d{4})$

Whether we wanted to learn RegEx or not, it was thrust upon us for our morning code exercise, in anticipation of learning about NLP. We were given the task of developing a RegEx to match a more generalized case of telephone numbers, and this was the expression I came up with for the exercise. Right now, this might just seem like a bunch of random symbols, but hopefully by the end of this blog, you will have the tools to interpret the above expression for yourself.

This will be a two-part series that provides a comprehensive overview of regular expressions and their syntax, the functionality that Python offers through its re module, and some coding examples from machine learning. This blog will focus on regular expressions as a language, which includes understanding the basic concepts and syntax. So let’s just jump into the metaphorical RegEx pool and get our feet wet with some examples.

Brrrring, brrrrring!

Photo by Paweł Czerwiński on Unsplash
^[2-9]\d{2}-\d{3}-\d{4}$

Explaining Anchors:

  • ^: “Hey, look at me, look at me! Nothing comes before me (usually).” The unassuming caret anchors the match to the beginning of the string (or to the beginning of each line in multiline mode). It takes on a different meaning inside square brackets, which we will see shortly. (\A also denotes the beginning of the string, regardless of mode.)
  • $: “Don’t forget to pay upon exiting. We hope you enjoyed your ride here at RegEx.” The dollar sign anchors the match to the end of the string (in Python, it also matches just before a trailing newline, and at the end of each line in multiline mode). The caret and the dollar sign are what we call anchors. (\Z denotes only the very end of the string.)
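To see the anchors in action, here is a minimal sketch using Python's re module. The two-line sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical two-line sample, one short number per line.
text = "555-1234\n555-9876"

# Without flags, ^ and $ refer to the start and end of the whole string,
# so a pattern describing a single line does not match a two-line string.
whole_string = re.findall(r"^\d{3}-\d{4}$", text)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^\d{3}-\d{4}$", text, re.MULTILINE)

# \A and \Z ignore re.MULTILINE and always refer to the whole string.
string_start = re.findall(r"\A\d{3}-\d{4}", text, re.MULTILINE)
string_end = re.findall(r"\d{3}-\d{4}\Z", text, re.MULTILINE)

print(whole_string)   # []
print(per_line)       # ['555-1234', '555-9876']
print(string_start)   # ['555-1234']
print(string_end)     # ['555-9876']
```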

Explaining Character Class and Meta Sequences:

  • []: “I am monogamous and am a one character kind of guy or gal, but I can be super fickle or uber-inclusive.” These square brackets denote a character class, which can be as specific as 1 digit or character or as extensive as the whole range of alphanumeric characters. Ultimately, no matter how many characters go in between the square brackets, it matches just one character. (If it is coupled with something called a quantifier, we match that character class as many times as the quantifier indicates.)
  • \d: synonymous with the character class [0-9], matching any single digit (technically any Unicode digit in Python)
  • \D (also [^0-9]): matching any non-digit character. For each of these backslash sequences, the capital-letter version is the inverse of the lowercase version. Here, we also see an alternate use of the caret, which serves as a not operator when it is the first character inside brackets.
  • \w (also [a-zA-Z0-9_]): any alphanumeric word character, which includes letters, digits, and underscore (in Python, also ideograms, e.g. Chinese characters)
  • \W (also [^a-zA-Z0-9_]): any non-word character. Note that every \d is also a \w, and every \W is also a \D
  • \s (also [ \t\n\r\f\v]): matching any whitespace character, which includes space, tab, newline, carriage return, formfeed, vertical tab
  • \S (also [^ \t\n\r\f\v]): any non-whitespace character, especially useful in capturing everything up until a whitespace, i.e. the end of a word (\S+ will stop capturing upon encountering whitespace)
  • .: matching any single character except for newline \n
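A quick way to build intuition for these classes is to run each one over the same string with re.findall. This is only a sketch; the sample string and variable names are invented for the demonstration.

```python
import re

# Hypothetical sample containing letters, digits, punctuation, and a tab.
sample = "Room 42B, floor_3!\tcall 9-5"

digits = re.findall(r"\d", sample)           # one match per digit
word_runs = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
non_space_runs = re.findall(r"\S+", sample)  # everything up to the next whitespace
any_char = re.findall(r"9.5", sample)        # the dot stands in for any one character

print(digits)          # ['4', '2', '3', '9', '5']
print(word_runs)       # ['Room', '42B', 'floor_3', 'call', '9', '5']
print(non_space_runs)  # ['Room', '42B,', 'floor_3!', 'call', '9-5']
print(any_char)        # ['9-5']
```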

Detailed walkthrough:

Now that we have laid the groundwork, let’s see how some of that applies to this simplified (not all-inclusive) RegEx for phone numbers:

  • ^ : anchors the match to the beginning of the string
  • [2-9]: brackets define a character class, so the first character must be a digit in the range from 2 to 9; essentially it’s saying that area codes cannot begin with either 0 or 1
  • \d{2}: a digit followed by a quantifier of 2, which means exactly two digits. Together with the initial digit, this covers the three digits of the area code. We will formally discuss quantifiers in the next section.
  • -: exact match with a hyphen, so our regex assumes only a hyphen will separate the area code from the number for the sake of the exercise.
  • \d{3}: followed by a digit with a quantifier of 3, so exactly 3 digits
  • -: another hyphen
  • \d{4}: followed by 4 digits
  • $: anchors the match to the end of the string
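The walkthrough above can be checked with a few lines of Python. The candidate numbers below are fictitious and chosen only to exercise each part of the pattern.

```python
import re

simple_phone = re.compile(r"^[2-9]\d{2}-\d{3}-\d{4}$")

# Fictitious candidates: one valid number and three near misses.
candidates = [
    "212-555-0187",   # fits the pattern
    "112-555-0187",   # area code starts with 1
    "212.555.0187",   # periods instead of hyphens
    "212-555-01877",  # one digit too many before the end anchor
]

results = {c: bool(simple_phone.match(c)) for c in candidates}

for number, ok in results.items():
    print(number, ok)
```

Only the first candidate is accepted; each of the others trips over exactly one piece of the expression.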

Hopefully you are starting to get a feel for regex, some basic syntax, as well as the terminology used to describe the different components involved. Buckle your seatbelts though because we have a few more exercises that will increase in difficulty and illustrate how Regex will enhance your role as a data scientist.

Let’s recognize of course that there are different variations of phone numbers that may include parentheses, periods, or even a country code. How do we account for such variations in phone numbers? What happens if it’s not at the beginning of a string but in the middle?
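As a rough sketch of one answer, the generalized expression from the top of this post already tolerates several of these variations, and dropping the anchors lets re.search find a number in the middle of a string. The sample numbers and variable names are made up, and the unanchored variant is my own illustration rather than part of the original exercise.

```python
import re

# The generalized expression shown at the top of this post.
general_phone = re.compile(r"^(1)?\s?\(?(\d{3})[\s.\-\)]?(\d{3})[\s.-]?(\d{4})$")

variations = [
    "212-555-0187",
    "212.555.0187",
    "(212)555-0187",
    "1 212 555 0187",
    "2125550187",
]
all_match = all(general_phone.match(v) for v in variations)

# A limitation: ")" followed by a space is two separator characters,
# but the character class after the area code allows at most one.
paren_space = general_phone.match("(212) 555-0187")

# Illustrative variant without ^ and $, so the number may sit mid-sentence.
embedded_phone = re.compile(r"(1)?\s?\(?(\d{3})[\s.\-\)]?(\d{3})[\s.-]?(\d{4})")
found = embedded_phone.search("Call 212-555-0187 before noon.")
area_code, prefix, line_number = found.groups()[1:]

print(all_match)                       # True
print(paren_space)                     # None
print(area_code, prefix, line_number)  # 212 555 0187
```

Reading the capturing groups, rather than the whole match, sidesteps the optional leading whitespace that the unanchored pattern may pick up.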

Extracting Email Addresses

Photo by Stephen Phillips — Hostreviews.co.uk on Unsplash

Who can forget that characteristic sound of the original 56K modem and watching with anticipation as GIFs and JPEGs would slowly reveal themselves on the webpage? In case you don’t remember because you forgot or are too young, you can experience the ear-grating cacophonous melody for yourself:

(?i)\b([A-Z0-9._+-]+)@(?:[A-Z0-9-]+\.)+[A-Z]{2,}\b

Explaining Quantifiers:

Back to RegEx: in the last section, we briefly mentioned the quantifier, denoted by curly braces {}. In the forms below, m and n are non-negative integers with m no greater than n. Here are some of the permutations we might encounter:

  • a{m}: matching the letter a exactly m times
  • a{m,n}: matching the letter a m times at least and n times at most
  • a{m,}: matching the letter a m or more times

Here are some additional quantifiers:

  • +: one or more times; it can follow any of the special characters mentioned, but it always refers to whatever is immediately to its left
  • ?: zero or one times
  • *: zero or more times
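One way to compare the quantifiers side by side is to test each against strings of increasing length. This is a small sketch; the helper which_match and the sample list are hypothetical names introduced just for this demonstration.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def which_match(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

exactly_two = which_match(r"a{2}")
two_to_three = which_match(r"a{2,3}")
two_or_more = which_match(r"a{2,}")
one_or_more = which_match(r"a+")
zero_or_one = which_match(r"a?")
zero_or_more = which_match(r"a*")

print(exactly_two)    # ['aa']
print(two_to_three)   # ['aa', 'aaa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa']
print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']
print(zero_or_one)    # ['', 'a']
print(zero_or_more)   # ['', 'a', 'aa', 'aaa', 'aaaa']
```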

Greedy vs. Lazy:

By default, the quantifiers above are greedy (matching as many as possible, aka longest match) and docile (giving back characters if needed to allow the rest of the pattern to match).

  • [+*?{}]?: lazy (which means to match as few as possible, aka shortest match), when any quantifier is followed by ?, e.g. +?, *?, ??, {m,n}?
  • [+*?{}]+: possessive (which means to match as many as possible and never give characters back, even if that causes the rest of the pattern to fail), when any quantifier is followed by +, e.g. ++, *+, ?+. Note that Python's re module supports possessive quantifiers only from version 3.11.
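The difference between greedy and lazy is easiest to see on a string with repeated delimiters. Here is a minimal sketch with a made-up HTML snippet; possessive quantifiers are left out because they depend on the Python version.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the end of the string, then backs up to the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can reach.
lazy = re.findall(r"<.+?>", html)

# Docile: the greedy \d+ first grabs all four digits, then gives one back
# so that the trailing \d still has something to match.
docile = re.match(r"(\d+)(\d)", "2021").groups()

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(docile)  # ('202', '1')
```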

Hodgepodge of terms:

  • \b: matches any word boundary, while \B matches any non-word boundary
  • \: escapes the metacharacter that follows it and nullifies its RegEx superpowers, i.e. for the following characters: {}[]()*+?|^$.\ , we must insert a backslash to match the literal character
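Both terms can be tried out in a few lines. The sentence and price strings below are invented for illustration; re.escape is the standard-library helper that inserts the backslashes for us.

```python
import re

sentence = "The cat scattered the catalog."

anywhere = re.findall(r"cat", sentence)          # every occurrence of the letters
whole_word = re.findall(r"\bcat\b", sentence)    # only the standalone word
inside_word = re.findall(r"\Bcat\B", sentence)   # only the one buried in "scattered"

prices = "3.50 or 3150"
unescaped = re.findall(r"3.50", prices)    # the dot matches any character
escaped = re.findall(r"3\.50", prices)     # the dot is now a literal period
auto_escaped = re.escape("3.50")           # lets Python add the backslash

print(len(anywhere), len(whole_word), len(inside_word))  # 3 1 1
print(unescaped)     # ['3.50', '3150']
print(escaped)       # ['3.50']
print(auto_escaped)  # 3\.50
```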

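To match https:\\www.youloveregex.com (not a real website), and more specifically the two backslashes in the middle of the link, the pattern needs \\\\ , one escaping backslash for each literal backslash. In an ordinary Python string literal, every one of those backslashes would have to be doubled yet again. We can see how having to escape can become cumbersome, so to escape the escape: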

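# str.extract returns one column per capturing group; the four new column names are illustrative
df[['date', 'month', 'day', 'year']] = df['column'].str.extract(r'((\d+)/(\d{2})/(\d{4}))')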

  • r'...' : raw string notation is kryptonite to Python's own backslash escapes, not to the regex metacharacters. Inside a raw string, Python leaves every backslash exactly as typed, so \d or \\ reaches the regex engine intact and we do not have to double each backslash. The metacharacters keep all of their superpowers.
  • (): indicates a group of characters or a capturing group
  • (?: ... ): indicates a matching or non-capturing group. We will discuss groups in a subsequent section.

() can be used for grouping and capturing, and we can use the above expression to extract the date from within a string, i.e. the month, day, and year. You will learn more about capturing groups in the subsequent blog when the functions in the re module are introduced.
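Here is a small sketch of raw strings and groups using only the re module (the pandas call above follows the same group logic). The sample sentence and variable names are made up for illustration.

```python
import re

# A raw literal and a doubled-backslash literal produce the same pattern text.
same_pattern = r"\d+" == "\\d+"

# Two literal backslashes in the text need four backslashes in the raw pattern.
url = r"https:\\www.youloveregex.com"
two_backslashes = re.search(r"\\\\", url).group(0)

# Capturing groups: the outer pair grabs the full date, the inner pairs its parts.
note = "Invoice dated 1/06/2021, due 2/05/2021."
date_pattern = r"((\d+)/(\d{2})/(\d{4}))"
full_date, month, day, year = re.search(date_pattern, note).groups()

# A non-capturing group still groups, but is left out of the results.
day_and_year = re.search(r"(?:\d+)/(\d{2})/(\d{4})", note).groups()

print(same_pattern)                 # True
print(len(two_backslashes))         # 2
print(full_date, month, day, year)  # 1/06/2021 1 06 2021
print(day_and_year)                 # ('06', '2021')
```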

Detailed Walkthrough:

  • (?i): the inline flag for case insensitivity (the other flags include s, m, and x; please refer to the documentation for elaboration), so the capital letters in the pattern match both upper and lowercase letters
  • \b: matching a word boundary at the beginning, i.e. the zero-width position between a word character and a non-word character (or the start of the string), rather than the whitespace itself
  • ([A-Z0-9._+-]+): the parentheses indicate a capturing group; the brackets [] indicate a character class that matches any one of the characters inside, which includes letters, digits, underscore, period, plus and minus signs; and the plus after the brackets indicates one or more instances of this character class, which corresponds to the username

This username allows for the use of the uppercase and lowercase letters (A-Z and a-z), the digits 0–9, as well as ._+- , which accounts for usernames like firstname.lastname@email.com.

  • @: matches with the literal at symbol
  • (?:[A-Z0-9-]+\.)+: (?: ... ) indicates a non-capturing group, the brackets [] once again define a character class containing letters, digits, and hyphens, where the plus sign + indicates that there are one or more of such characters, and \. indicates the characters will be followed by a literal period
  • + afterwards indicates that there could be one or more of these non-capturing groups to account for any subdomain names, e.g. @abc.xyz.com or @xyz.com

Domain names adhere to stricter guidelines, so we only can include letters, digits, and hyphens.

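  • [A-Z]{2,}: followed by no fewer than 2 letters for the top-level domain (e.g. com, net, org, de, in)
  • \b: followed by a word boundary, i.e. the position where the last letter meets a non-word character such as whitespace or punctuation, or the end of the string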
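Putting the walkthrough to work, the sketch below runs the email expression over a made-up sentence. The addresses and variable names are hypothetical.

```python
import re

email_pattern = re.compile(r"(?i)\b([A-Z0-9._+-]+)@(?:[A-Z0-9-]+\.)+[A-Z]{2,}\b")

# Hypothetical text with one plain domain and one subdomain.
text = "Write to firstname.lastname@email.com or Help.Desk@mail.example.org today."

# group(0) is the whole match, regardless of which groups capture.
addresses = [m.group(0) for m in email_pattern.finditer(text)]

# With a single capturing group, findall returns only that group: the username.
usernames = email_pattern.findall(text)

print(addresses)  # ['firstname.lastname@email.com', 'Help.Desk@mail.example.org']
print(usernames)  # ['firstname.lastname', 'Help.Desk']
```

Because the domain part sits in a non-capturing group, findall hands back just the usernames, while finditer still gives access to the full address.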

Next Steps

  • RegexOne: for familiarizing yourself with the syntax by working through lessons and exercises
  • HackerRank or CodeWars: for achieving mastery through code challenges in a variety of programming languages, including RegEx and Python
  • Regex101: for working through and debugging any expressions you are developing for your machine learning projects

Here is the explanation of the code I provided earlier for the coding exercise from RegexOne. I used Regex101 to debug my code, and as you can see, it will prove super useful in your journey to becoming a RegEx master.

We have now covered all the basics of regular expressions that you will need before moving on to the various functions of the re module and seeing how RegEx is applied in the context of machine learning. Stay tuned for the next installment!

Steven Yan

Data Scientist for Social Good. Former MCAT Tutor and Content Writer. Pianist and Linguaphile. UChicago and Flatiron Alum.