Experimenting with Verbal Expressions

Regular Expressions (regex) are one of my favorite programming tools. When I first came across them, I was a bit intimidated, as I imagine many people are. To the untrained eye, they look like complete gibberish. If you are unfamiliar with regular expressions, they are basically a tool that allows you to find specific patterns of characters within text.

For example, here’s a Regular Expression for matching an IP address:

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
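
To see a pattern like this at work, here is a minimal sketch in plain Ruby, with the separators written as escaped dots (`\.`) so that they only match a literal period. The constant `IP_PATTERN` and the helper `ip_address?` are names made up for this illustration.

```ruby
# The IP address pattern as a Ruby Regexp constant.
# The separator is an escaped dot (\.) so it matches only a literal period.
IP_PATTERN = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

# Hypothetical helper: true when the text looks like a dotted-quad IPv4 address.
def ip_address?(text)
  IP_PATTERN.match?(text)
end

puts ip_address?("192.168.1.1")   # true
puts ip_address?("256.1.1.1")     # false: 256 is out of range
puts ip_address?("1.2.3")         # false: only three parts
puts ip_address?("192x168x1x1")   # false: separators must be periods
```

In Ruby, `^` and `$` match at line boundaries, so `\A` and `\z` are the stricter choice when validating a whole string.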

At some point I finally caved and decided to invest some time into learning the ancient art of regex. The more I learned, the more excited I got, and I began to find places where I could use them to solve otherwise difficult problems. After using regular expressions a handful of times, I remember having the thought that it would be an interesting project to make a regular expression library that allowed you to define regular patterns using plain readable text. Unfortunately, I never investigated whether a project like that already existed.


The Verbal Expression Project

Just recently I saw a tweet about something called Verbal Expressions, and it sparked my curiosity. Verbal Expressions is an open-source project that aims to simplify the task of writing difficult, complex regular expressions. It does this by providing a means of specifying regular expression patterns using easily readable text. Verbal Expressions has been ported to many programming languages.

Let’s take a look at how Verbal Expressions can be used in Ruby.

Say we have a text file containing the following list of contact information:

  • Jason Bourne – 911 Assassin Alley, Grand Rapids, MI, 49512 – 735 724 4563
  • Buddy The Elf – 1225 Candy Cane Lane, Anchorage, AK, 54212 – 534 457 5678
  • Bono – 777 Rock Road, Nashville, TN, 12345 – 248 456 6474

To extract the name, address, and phone number from each entry in the list we could do the following:

require 'verbal_expressions'
 
addr_list = [
    "Jason Bourne - 911 Assassin Alley, Grand Rapids, MI, 49512 - 735 724 4563",
    "Buddy The Elf - 1225 Candy Cane Lane, Anchorage, AK, 54212 - 534 457 5678",
    "Bono - 777 Rock Road, Nashville, TN, 12345 - 248 456 6474"]
 
exp = VerEx.new do
    start_of_line
    begin_capture 'name'
    anything_but "-" 
    end_capture
    find '-'
    add('\s+')
    begin_capture 'address'
    anything_but '-'
    end_capture
    find '-'
    add('\s+')
    begin_capture 'telephone'
    anything
    end_capture
    end_of_line
end
 
puts "Generated Regular Expression: #{exp.source}"
 
puts "Matched Contact Information"
addr_list.each do |line| 
    line.match(exp) do |match|
        puts "Name: #{match['name']}"
        puts "Address: #{match['address']}"
        puts "Telephone: #{match['telephone']}"
    end
end

This produces the following output:

Generated Regular Expression: ^(?<name>(?:[^-]*))(?:-)\s+(?<address>(?:[^-]*))(?:-)\s+(?<telephone>(?:.*))$

Matched Contact Information
Name: Jason Bourne
Address: 911 Assassin Alley, Grand Rapids, MI, 49512
Telephone: 735 724 4563
Name: Buddy The Elf
Address: 1225 Candy Cane Lane, Anchorage, AK, 54212
Telephone: 534 457 5678
Name: Bono
Address: 777 Rock Road, Nashville, TN, 12345
Telephone: 248 456 6474
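
For comparison, the same extraction can be sketched with a hand-written pattern and Ruby's named captures, without the library. This is an approximation of what the builder expresses, not the gem's literal output. `CONTACT_PATTERN` and `parse_contact` are hypothetical names used only here.

```ruby
# Hand-written counterpart of the builder: three named groups separated by
# hyphens. An approximation for illustration, not the gem's exact output.
CONTACT_PATTERN = /^(?<name>[^-]*)-\s+(?<address>[^-]*)-\s+(?<telephone>.*)$/

# Hypothetical helper: returns a hash of trimmed fields, or nil when the
# line does not have the "name - address - telephone" shape.
def parse_contact(line)
  match = CONTACT_PATTERN.match(line)
  return nil unless match

  # named_captures gives a Hash keyed by group name (as strings).
  match.named_captures.transform_values(&:strip)
end

contact = parse_contact("Bono - 777 Rock Road, Nashville, TN, 12345 - 248 456 6474")
puts contact["name"]       # Bono
puts contact["address"]    # 777 Rock Road, Nashville, TN, 12345
puts contact["telephone"]  # 248 456 6474
```

The `strip` call is there because `[^-]*` also swallows the space in front of each hyphen.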

Pretty neat, right? Actually, the idea of Verbal Expressions is great, but I found the Ruby implementation a little frustrating to use. The biggest problems are that it doesn’t utilize blocks enough, several important methods are missing, and the documentation needs a lot of work.

My Fixes

The beautiful thing about open source projects is that if you don’t like something, you can fix it. So I forked the repo and got to work.

First of all, the implementation needed a method to consume whitespace.

# Any whitespace character
def whitespace()
    add('\s')
end

Next it needed a way of specifying multiple occurrences of things. I added the one_or_more and zero_or_more methods. Notice how they take a block as a parameter; this allows the caller to wrap any number of pattern elements in the repetition.

# At least one of something
def one_or_more()
  add("(?:")
  yield
  add(")+")
end
 
# Zero or more of something
def zero_or_more()
  add("(?:")
  yield
  add(")*")
end
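
To make the block mechanics concrete, here is a self-contained sketch of how such methods could behave inside a builder. `TinyBuilder` is a made-up stand-in written for this illustration, not the library's actual class. It only collects regex source text, and its `find` escapes input with `Regexp.escape`, which is an assumption about how a builder might sanitize literals.

```ruby
# Hypothetical stand-in for a verbal-expression builder (not the real gem).
class TinyBuilder
  attr_reader :source

  def initialize(&block)
    @source = String.new
    # Run the block with the builder as self so bare calls like `digit` work.
    instance_eval(&block) if block
  end

  # Append a raw regex fragment.
  def add(fragment)
    @source << fragment
    self
  end

  # Wrap whatever the block adds in a non-capturing group repeated 1+ times.
  def one_or_more
    add("(?:")
    yield
    add(")+")
  end

  # Same idea, repeated 0+ times.
  def zero_or_more
    add("(?:")
    yield
    add(")*")
  end

  def digit
    add('\d')
  end

  # Match literal text (escaped so special characters lose their meaning).
  def find(text)
    add("(?:#{Regexp.escape(text)})")
  end

  def to_regexp
    Regexp.new(@source)
  end
end

DECIMAL = TinyBuilder.new do
  one_or_more { digit }
  find "."
  zero_or_more { digit }
end

puts DECIMAL.source                     # (?:\d)+(?:\.)(?:\d)*
puts DECIMAL.to_regexp.match?("3.14")   # true
```

Because the block runs between the opening `(?:` and the closing quantifier, anything it adds ends up inside the repeated group.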

Next, I thought it would be helpful to have methods for specifying a digit() and an integer() (many digits). The digit method is simple, and for integer we can use our new one_or_more() method!

# Any integer (multiple digits)
def integer
  one_or_more { digit }
end
 
# A single digit
def digit
  add('\d')
end
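
Following the definitions above, integer expands to the fragment `(?:\d)+`. The sketch below anchors that fragment in plain Ruby to show what it accepts and what it does not. `INTEGER_SOURCE`, `WHOLE_INTEGER`, and `plain_integer?` are names invented for this example.

```ruby
# The fragment produced by one_or_more { digit }, per the definitions above.
INTEGER_SOURCE = '(?:\d)+'

# Anchor it so the entire string has to consist of digits.
WHOLE_INTEGER = Regexp.new("\\A#{INTEGER_SOURCE}\\z")

# Hypothetical helper: true only for strings made up of one or more digits.
def plain_integer?(text)
  WHOLE_INTEGER.match?(text)
end

puts plain_integer?("49512")  # true
puts plain_integer?("-42")    # false: no sign handling
puts plain_integer?("1e5")    # false: no scientific notation
puts plain_integer?("0x2A")   # false: no hexadecimal
```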

Verbal Expressions in Ruby: After

After I submitted a pull request with the above changes, the owner of the original repository merged them in. This is how the above example could be rewritten using my additions.

def multiple_words
    one_or_more do
        word
        zero_or_more { whitespace }
    end
end
 
exp = VerEx.new do
    start_of_line
    capture('name') { multiple_words }
    find '-'
    whitespace
    # Group names must be word characters, so no spaces
    capture('street_address') {
        integer
        multiple_words
    }
    find ','
    whitespace
    capture('city') { multiple_words }
    find ','
    whitespace
    capture('state') { word }
    find ','
    whitespace
    capture('zip') { integer }
    whitespace
    find '-'
    whitespace
    capture('telephone') { anything }
    end_of_line
end

Conclusion

Using Verbal Expressions does not make your code shorter. If you’re looking to save characters, this is not the tool for you. On the other hand, using Verbal Expressions does make your code more readable to a larger audience. One really nice feature is the ability to output the actual regular expression produced by the Verbal Expression. This could be very helpful to someone just learning regular expressions.
 


6 Comments

  1. Posted September 5, 2013 at 9:25 am
  2. Alex
    Posted September 5, 2013 at 9:47 am

    This looks good, where can we find your fork with these nifty changes?

  3. Posted September 5, 2013 at 2:22 pm

    Nifty!

    A minor nitpick: your definition of “integer” might surprise some folks, since it doesn’t match negative integers, scientific notation, or common notations for hexadecimal, octal, etc. Why not just call it “digits”?