Regular expressions (or regex) are incredibly helpful tools to have at your disposal as a software developer, but they’re often dangerous tools. In this post, I’m going to focus on the times when you shouldn’t use regex, and then I’ll go over some strategies and features to use when you do write regular expressions. Lastly, I’ll point to a few other tools and resources that will help when you’re working with regular expressions.
If you’re unfamiliar with the basics of regular expressions, I recommend reading this great overview from Mozilla before continuing; it explains them much better than I ever could.
When Not to Use Regex
Regular expressions can be a good tool, but if you try to apply them to every situation, you’ll be in for a world of hurt and confusion down the line. Using them is essentially like embedding a small programming language inside another programming language. That is a significant amount of complexity to introduce when all you need is a bit of text parsing.
There are quite a few situations where it’s best to reach for another tool instead of regex. For example:
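- Regex isn’t suited to parsing HTML because HTML isn’t a regular language: tags can nest to arbitrary depth, which a regular expression can’t keep track of.
- Regex probably won’t be the tool to reach for when parsing source code. Dedicated lexers and parsers are better at producing tokenized output.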
- I would avoid parsing a URL’s path and query parameters with regex. Most standard libraries have mature tools for pulling a URL apart into its corresponding parts.
- Email addresses are another example of a complicated data format that isn’t well suited to regex. Here is an example of the hoops you have to jump through with regex to parse most (not all) valid email addresses. Again, I would recommend a dedicated parsing library for this purpose.
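As a rough illustration of reaching for a library instead, here is a minimal Python sketch that uses only the standard library: `urllib.parse` to pull a URL apart and `html.parser` to collect links from a snippet of HTML. The URL, the HTML string, and the `LinkCollector` class are made up for this example, and other languages have their own equivalents.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL, used only for illustration.
url = "https://example.com/articles/regex-tips?tag=regex&tag=parsing&page=2"

parts = urlsplit(url)
path_segments = [segment for segment in parts.path.split("/") if segment]
query = parse_qs(parts.query)  # maps each key to a list of its values


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it encounters (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical HTML: the nested tag and uneven attribute spacing are the
# kind of detail that tends to trip up a hand-written pattern.
collector = LinkCollector()
collector.feed('<p>See <a class="ref"  href="/docs"><b>the docs</b></a>.</p>')

print(path_segments)    # ['articles', 'regex-tips']
print(query)            # {'tag': ['regex', 'parsing'], 'page': ['2']}
print(collector.links)  # ['/docs']
```

Neither half needs a pattern at all, and details like repeated query keys or nested tags are handled by code that has already been through years of edge cases.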
How to Regex Better
Now that we’ve covered some situations where it’s best to avoid regular expressions, let’s review a few examples of my favorite features and techniques to use when you do write regular expressions.
Capturing groups
Capturing groups are among my favorite tools to use with regex. They allow you to refer back to particular sections of the matched text. Below, I’ve shown a very simple regex for parsing a phone number with an area code. The capture groups here are indicated by the unescaped parentheses:
\((\d{3})\)[ -](\d{3})[ -](\d{4})
In most regex engines, group 0 always refers to the entire matched text, and the numbered capture groups start at 1. Here, group 1 is the area code, and groups 2 and 3 make up the rest of the phone number. Capturing groups can also be used as backreferences (for example, \1) to refer to an earlier portion of the match from within the pattern itself.
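To make the group numbering concrete, here is a small Python sketch using the standard `re` module. It assumes a version of the phone pattern that captures only the area-code digits and separates the parts with a space or dash. The sample strings and the `DOUBLED` pattern are invented for illustration.

```python
import re

# Assumed pattern: area code in literal parentheses, then two digit groups,
# each preceded by a single space or dash.
PHONE = re.compile(r"\((\d{3})\)[ -](\d{3})[ -](\d{4})")

match = PHONE.search("Call (555) 867-5309 today")
whole = match.group(0)                # the entire matched text
area, prefix, line = match.groups()   # groups 1, 2, and 3

print(whole)               # (555) 867-5309
print(area, prefix, line)  # 555 867 5309

# Backreference: \1 must repeat whatever group 1 captured,
# so this pattern finds an accidentally doubled word.
DOUBLED = re.compile(r"\b(\w+) \1\b")
doubled = DOUBLED.search("explaining what what each section is")

print(doubled.group(0))    # what what
```

Note that `match.groups()` leaves out group 0, so its first element lines up with group 1.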
Troubleshooting and documenting
If you’re working on a regular expression and you’re struggling to understand it as you’re creating it, you should probably look at alternatives or break it up into smaller pieces. Complicated regular expressions that aren’t well documented will just lead to confusion the next time you or another developer on your team comes across them.
Many regex engines allow you to specify a flag to ignore whitespace (often called extended or verbose mode). With this flag enabled, you can put one logical section of the regex on each line, and you can include a comment on each line explaining what that section does.
Breaking down the regex like this can make it a lot easier to see one character out of place or an incorrectly escaped sequence. Below, I’ve broken down the previous example into logical sections:
\( # Opening parenthesis (literal)
(\d{3}) # Area code (captured)
\) # Closing parenthesis (literal)
[ -] # Separating space or dash
(\d{3}) # First three digits of phone number (captured)
[ -] # Separating space or dash
(\d{4}) # Last four digits of phone number (captured)
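In Python, for instance, the whitespace-ignoring flag is `re.VERBOSE`. The sketch below is one way the commented breakdown could be written there, checked against a compact one-line version of the same pattern. The names `PHONE_VERBOSE` and `PHONE_COMPACT` and the sample number are assumptions made for this example.

```python
import re

# re.VERBOSE ignores whitespace in the pattern (except inside a character
# class) and treats everything from an unescaped # to end of line as a comment.
PHONE_VERBOSE = re.compile(
    r"""
    \(        # Opening parenthesis (literal)
    (\d{3})   # Area code (captured)
    \)        # Closing parenthesis (literal)
    [ -]      # Separating space or dash
    (\d{3})   # First three digits of phone number (captured)
    [ -]      # Separating space or dash
    (\d{4})   # Last four digits of phone number (captured)
    """,
    re.VERBOSE,
)

# The same pattern on one line, for comparison.
PHONE_COMPACT = re.compile(r"\((\d{3})\)[ -](\d{3})[ -](\d{4})")

sample = "(555) 867-5309"  # hypothetical phone number
verbose_groups = PHONE_VERBOSE.fullmatch(sample).groups()
compact_groups = PHONE_COMPACT.fullmatch(sample).groups()

print(verbose_groups)                    # ('555', '867', '5309')
print(verbose_groups == compact_groups)  # True
```

How whitespace inside a character class is treated in this mode varies between engines, so it is worth checking the documentation for the one you are using before relying on a literal space there.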
Regex Tools & Resources
Regex101 is my favorite tool to use for troubleshooting, understanding, and developing regular expressions. The site breaks down the regular expression you’ve entered character by character and tells you exactly how it’s being interpreted.
This is very useful when you’ve forgotten to escape a character or you’re trying to understand a regex you found buried in your codebase. After you’ve entered the regular expression you want to test, there’s an area to insert the string or strings you would like to test it against. The site will highlight all of the text that the expression matches and break down each component of the match, including the capture groups and the character positions of each part of the match in the string. Lastly, there’s a reference section at the bottom for a refresher on the syntax, as well as tools to export your regular expression to a few different languages.
Regular expressions can be powerful in the right situation. But they have plenty of theoretical and practical limitations, and you should be aware of these before trying to apply them to every situation that looks like a string needing parsing.
I’ve included a few links below for sharpening your regex skills before you put them to the test.