Regular Expressions (regex) are one of my favorite programming tools. When I first came across them, I was a bit intimidated, as I imagine many people are. To the untrained eye, they look like complete gibberish. If you are unfamiliar with regular expressions, they are basically a tool that allows you to find specific patterns of characters within text.
For example, here’s a Regular Expression for matching an IP address:
/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
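To make that less abstract, here is a minimal sketch that runs the pattern above against a few inputs. The sample strings and the `ipv4?` helper name are made up for illustration. Note that in Ruby `^` and `$` anchor to lines rather than to the whole string, so stricter validation would likely use `\A` and `\z` instead.

```ruby
# The pattern from above, unchanged.
IPV4 = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

# Hypothetical helper: true when the candidate string looks like an IPv4 address.
def ipv4?(candidate)
  IPV4.match?(candidate)
end

# Made-up sample inputs: two plausible addresses and two broken ones.
["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3"].each do |candidate|
  verdict = ipv4?(candidate) ? "valid" : "invalid"
  puts "#{candidate}: #{verdict}"
end
```

Each octet alternative handles one numeric range: `25[0-5]` covers 250 to 255, `2[0-4][0-9]` covers 200 to 249, and `[01]?[0-9][0-9]?` covers everything below 200.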
At some point I finally caved and decided to invest some time into learning the ancient art of regex. The more I learned, the more excited I got, and I began to find places where I could use them to solve otherwise difficult problems. After using regular expressions a handful of times, I remember having the thought that it would be an interesting project to make a regular expression library that allowed you to define regular patterns using plain readable text. Unfortunately, I never investigated whether a project like that already existed.
The Verbal Expression Project
Just recently I saw a tweet about something called Verbal Expressions, and it sparked my curiosity. Verbal Expressions is an open-source project that aims to simplify the task of writing difficult, complex regular expressions. It does this by providing a means of specifying regular expression patterns using easily readable text. Verbal Expressions has been ported to many programming languages.
Let’s take a look at a couple of examples of how Verbal Expressions can be used in Ruby.
Say we have a text file containing the following list of contact information:
- Jason Bourne – 911 Assassin Alley, Grand Rapids, MI, 49512 – 735 724 4563
- Buddy The Elf – 1225 Candy Cane Lane, Anchorage, AK, 54212 – 534 457 5678
- Bono – 777 Rock Road, Nashville, TN, 12345 – 248 456 6474
To extract the name, address, and phone number from each entry in the list we could do the following:
require 'verbal_expressions'

addr_list = [
  "Jason Bourne - 911 Assassin Alley, Grand Rapids, MI, 49512 - 735 724 4563",
  "Buddy The Elf - 1225 Candy Cane Lane, Anchorage, AK, 54212 - 534 457 5678",
  "Bono - 777 Rock Road, Nashville, TN, 12345 - 248 456 6474"
]

exp = VerEx.new do
  start_of_line
  begin_capture 'name'
  anything_but '-'
  end_capture
  find '-'
  add('\s+')
  begin_capture 'address'
  anything_but '-'
  end_capture
  find '-'
  add('\s+')
  begin_capture 'telephone'
  anything
  end_capture
  end_of_line
end

puts "Generated Regular Expression: #{exp.source}"
puts "Matched Contact Information"

addr_list.each do |line|
  line.match(exp) do |match|
    puts "Name: #{match['name']}"
    puts "Address: #{match['address']}"
    puts "Telephone: #{match['telephone']}"
  end
end
This produces the following output:
Generated Regular Expression: ^(?<name>(?:[^\-]*))(?:\-)\s+(?<address>(?:[^\-]*))(?:\-)\s+(?<telephone>(?:.*))$
Matched Contact Information
Name: Jason Bourne
Address: 911 Assassin Alley, Grand Rapids, MI, 49512
Telephone: 735 724 4563
Name: Buddy The Elf
Address: 1225 Candy Cane Lane, Anchorage, AK, 54212
Telephone: 534 457 5678
Name: Bono
Address: 777 Rock Road, Nashville, TN, 12345
Telephone: 248 456 6474
Pretty neat, right? Actually, the idea of Verbal Expressions is great, but I found the Ruby implementation a little frustrating to use. The biggest problems are that it doesn’t utilize blocks enough, several important methods are missing, and the documentation needs a lot of work.
My Fixes
The beautiful thing about open source projects is that if you don’t like something, you can fix it. So I forked the repo and got to work.
First of all, the implementation needed a method to consume white space.
# Any whitespace character
def whitespace
  add('\s')
end
Next, it needed a way of specifying multiple occurrences of things. I added the one_or_more and zero_or_more methods. Notice how they take a block; this lets the caller specify any sequence of things they want repeated.
# At least one of something
def one_or_more
  add("(?:")
  yield
  add(")+")
end

# Zero or more of something
def zero_or_more
  add("(?:")
  yield
  add(")*")
end
Next, I thought it would be helpful to have methods for specifying a digit and an integer (many digits). The digit method is simple, and for integer we can use our new one_or_more method!
# Any integer (multiple digits)
def integer
  one_or_more { digit }
end

# A single digit
def digit
  add('\d')
end
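To see how these block-taking methods compose, here is a self-contained sketch. `MiniBuilder` is a hypothetical stand-in written for this post, not the library's actual `VerEx` class; it only assumes that the builder appends fragments to a string and evaluates the block in its own context.

```ruby
# Hypothetical stand-in for the library's builder, for illustration only.
class MiniBuilder
  def initialize(&block)
    @source = String.new
    # Run the DSL block with this builder as self, so bare method
    # names such as `integer` resolve to the methods below.
    instance_eval(&block)
  end

  # Compile the accumulated fragments into a Regexp.
  def to_regexp
    Regexp.new(@source)
  end

  # Append a raw regex fragment.
  def add(fragment)
    @source << fragment
  end

  # Wrap whatever the block adds in a non-capturing group repeated 1+ times.
  def one_or_more
    add("(?:")
    yield
    add(")+")
  end

  # A single digit
  def digit
    add('\d')
  end

  # Any run of digits
  def integer
    one_or_more { digit }
  end
end

DIGITS = MiniBuilder.new { integer }.to_regexp
puts DIGITS.source        # prints (?:\d)+
puts "zip 49512"[DIGITS]  # prints 49512
```

Because the block is only yielded to between the opening and closing fragments, anything the caller adds inside it lands inside the group, which is what lets `integer` be a one-liner.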
Verbal Expressions in Ruby: After
After I submitted a pull request with the above changes, the owner of the original repository merged them in. This is how the above example could be rewritten using my additions.
# One or more words, each optionally followed by whitespace
def multiple_words
  one_or_more do
    word
    zero_or_more { whitespace }
  end
end

exp = VerEx.new do
  start_of_line
  capture('name') { multiple_words }
  find '-'
  whitespace
  # House number, then the street name
  capture('street_address') do
    integer
    whitespace
    multiple_words
  end
  find ','
  whitespace
  capture('city') { multiple_words }
  find ','
  whitespace
  capture('state') { word }
  find ','
  whitespace
  capture('zip') { integer }
  whitespace
  find '-'
  whitespace
  capture('telephone') { anything }
  end_of_line
end
Conclusion
Using Verbal Expressions does not make your code shorter. If you’re looking to save characters, this is not the tool for you. On the other hand, using Verbal Expressions does make your code more readable to a larger audience. One really nice feature is the ability to output the actual regular expression produced by the Verbal Expression. This could be very helpful to someone just learning regular expressions.
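As a rough illustration of that last point, the generated source can be dropped into plain Ruby with no library involved. The pattern below is what I would expect the first example to produce, assuming the named captures come out as `(?<name>...)` groups. The `CONTACT` constant and the sample line are just for this sketch.

```ruby
# Assumed output of exp.source for the first example, pasted in as a string.
CONTACT = Regexp.new(
  '^(?<name>(?:[^\-]*))(?:\-)\s+(?<address>(?:[^\-]*))(?:\-)\s+(?<telephone>(?:.*))$'
)

line = "Bono - 777 Rock Road, Nashville, TN, 12345 - 248 456 6474"
match = CONTACT.match(line)

# The first two groups keep the space before each dash, hence the strip.
puts "Name: #{match[:name].strip}"
puts "Address: #{match[:address].strip}"
puts "Telephone: #{match[:telephone]}"
```

Reading the builder code and the raw pattern side by side is a decent way to learn what each piece of regex syntax does.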
Hey, I just released http://github.com/krainboltgreene/hexpress
This looks good, where can we find your fork with these nifty changes?
Thanks. Actually the owner of the original repo merged them back into his copy. You can get it here: https://github.com/ryan-endacott/verbal_expressions
Cool, thanks.
What’s the difference between this project you provided a link for and the one in the first comment?
Comment link: https://github.com/krainboltgreene/hexpress
Your link: https://github.com/ryan-endacott/verbal_expressions
I’m not sure. I haven’t used hexpress. You’ll have to ask the author of that project.
Nifty!
A minor nitpick: your definition of “integer” might surprise some folks, since it doesn’t match negative integers, scientific notation, or common notations for hexadecimal, octal, etc. Why not just call it “digits”?