Pattern Matching without Regex! – Introducing the Rosie Pattern Language

Recently, I was able to attend Jamie Jennings’s talk on the Rosie Pattern Language (RPL/“Rosie”) at Strange Loop 2018. I had not previously been aware of Rosie, but after learning about it, I am extremely excited about the prospect of never writing another regular expression again.

The Rosie Pattern Language is an alternative to traditional regular expressions. Rosie patterns are based on Parsing Expression Grammars (PEGs), which are more expressive than formal regular expressions (regexes): a PEG can describe nested, recursive structures that a formal regular expression cannot. The trade-off is that PEG matching can require more memory.

Rosie has several benefits over traditional regexes, including the ability to parse recursive structures like HTML and JSON, to create new patterns by combining other patterns, and to name patterns. You can combine these named patterns into libraries which you can import and use elsewhere.
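To make the point about recursive structures concrete, here is a minimal sketch in Python (not Rosie code) of a recursive matcher for nested brackets, the kind of nesting that JSON arrays and HTML elements share. The function names are illustrative only; the point is that the rule refers to itself, which a formal regular expression cannot do.

```python
def match_nested(text, pos=0):
    """Match one balanced bracket group starting at text[pos].

    PEG-style rule: nested <- "[" nested* "]"
    Returns the index just past the match, or None on failure.
    """
    if pos >= len(text) or text[pos] != "[":
        return None
    pos += 1
    # Consume zero or more nested groups by calling the rule recursively.
    while True:
        next_pos = match_nested(text, pos)
        if next_pos is None:
            break
        pos = next_pos
    if pos < len(text) and text[pos] == "]":
        return pos + 1
    return None


def is_balanced(text):
    """True when the whole string is exactly one balanced bracket group."""
    return match_nested(text) == len(text)


print(is_balanced("[[][[]]]"))  # True
print(is_balanced("[[]"))       # False
```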

Rosie itself ships with a library of common useful patterns, and it is available as a C library that’s compatible with many different languages (via libffi). It can output matches in a variety of formats, including JSON, which makes it easy to integrate with tools that consume structured data.

For me, Rosie’s pattern transparency and re-usability are the most exciting aspects.

Pattern Transparency & Re-Usability: Regex vs. Rosie

I use regexes frequently, and I am always frustrated by their opacity. When I come across a regex in code, I need to spend several minutes working out what it is meant to match. With Rosie, I can name patterns, so I can choose a name that is descriptive and easy to recall later.

For example, if I wanted to use a regex to find IPv4 addresses, I might need to write something like:

([0-9]{1,3}\.){3}[0-9]{1,3}

To print the ifconfig lines that match this regex, I can pass it to grep:

jk@GERTY-MK-VI ~  $ ifconfig | grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}"
    inet 127.0.0.1 netmask 0xff000000
    inet 192.168.0.13 netmask 0xffffff00 broadcast 192.168.0.255
    inet 33.0.123.1 netmask 0xffffff00 broadcast 33.0.123.255

(If you aren’t familiar with regexes, this one is tame. Try reading a regex for an e-mail address.)
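For readers who would rather experiment in a script than in a shell, here is a minimal Python sketch of the same grep-style filter using the standard `re` module. The sample lines are copied from the ifconfig output above. It also shows one thing the regex hides: it only checks the shape of the address, not the range of each component.

```python
import re

# The same expression passed to grep -E above.
IPV4_REGEX = re.compile(r"([0-9]{1,3}\.){3}[0-9]{1,3}")

# Sample lines copied from the ifconfig output shown above.
lines = [
    "lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384",
    "    inet 127.0.0.1 netmask 0xff000000",
    "    inet 192.168.0.13 netmask 0xffffff00 broadcast 192.168.0.255",
]

# Keep only the lines that contain a match anywhere, like grep does.
matching = [line for line in lines if IPV4_REGEX.search(line)]
for line in matching:
    print(line)

# The pattern checks only the shape of each component, not its range,
# so it also accepts strings that are not valid addresses.
print(bool(IPV4_REGEX.fullmatch("999.999.999.999")))  # True
```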

With Rosie, defining the IP address pattern may seem a bit verbose. However, once it is defined, it can be re-used anywhere—easily and without any doubt about what we are matching.

The following example shows how to define an IPv4 address. It comes from the net library which ships with Rosie:

local alias ipv4_component = [:digit:]{1,3}
local alias ip_address_v4 = { ipv4_component {"." ipv4_component}{3} }
ipv4 = ip_address_v4
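For comparison, the closest a plain regex workflow usually gets to this kind of naming is string interpolation. The sketch below mirrors the three RPL definitions in Python. The variable names are borrowed from the RPL above purely for illustration, and the regex is an approximation rather than an exact translation of Rosie's semantics.

```python
import re

# Names mirror the RPL definitions above; the Python variables are
# illustrative only and are not part of Rosie.
ipv4_component = r"[0-9]{1,3}"
# Doubled braces produce a literal {3} quantifier inside the f-string.
ip_address_v4 = rf"{ipv4_component}(?:\.{ipv4_component}){{3}}"
ipv4 = re.compile(ip_address_v4)

print(ip_address_v4)
print(ipv4.search("inet 192.168.0.13 netmask 0xffffff00").group(0))
```

The composition works, but it depends on string pasting and careful brace escaping, whereas in RPL the names are part of the pattern language itself.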

Using the Rosie CLI, we can easily match for IPv4 addresses in a much clearer fashion:

jk@GERTY-MK-VI ~  $ ifconfig | rosie grep 'net.ipv4'
    inet 127.0.0.1 netmask 0xff000000
    inet 192.168.0.13 netmask 0xffffff00 broadcast 192.168.0.255
    inet 33.0.123.1 netmask 0xffffff00 broadcast 33.0.123.255

We can even take this a step further, using the PEG ordered choice operator (/) to match the interface name lines as well:

jk@GERTY-MK-VI ~  $ ifconfig | rosie grep '{[:alpha:]+ [:digit:] ":" [:space:]} / net.ipv4'
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    inet 127.0.0.1 netmask 0xff000000
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    inet 192.168.0.13 netmask 0xffffff00 broadcast 192.168.0.255
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
awdl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1484
en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000
vboxnet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    inet 33.0.123.1 netmask 0xffffff00 broadcast 33.0.123.255
utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380

In this case, the pattern first tries to match one or more alphabetic characters followed by a digit, a colon, and a whitespace character. If that fails, it tries to match an IPv4 address.
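The "try the first alternative, and only if it fails try the second" behavior is what makes the PEG choice ordered. Here is a minimal Python sketch of that idea. The helper names (`regex`, `choice`, `find`) are hypothetical, and the two regexes are rough stand-ins for the Rosie patterns, not exact equivalents.

```python
import re


def regex(pattern):
    """Wrap a regex as a matcher: (text, pos) -> end index, or None on failure."""
    compiled = re.compile(pattern)

    def matcher(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None

    return matcher


def choice(*alternatives):
    """PEG ordered choice: try alternatives left to right, keep the first success."""

    def matcher(text, pos):
        for alternative in alternatives:
            end = alternative(text, pos)
            if end is not None:
                return end
        return None

    return matcher


def find(matcher, text):
    """Return the first substring matched anywhere in text, or None."""
    for start in range(len(text)):
        end = matcher(text, start)
        if end is not None:
            return text[start:end]
    return None


# Rough regex stand-ins for the two Rosie patterns (assumed equivalents).
interface = regex(r"[A-Za-z]+[0-9]:\s")
ipv4 = regex(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")
ifconfig = choice(interface, ipv4)

print(repr(find(ifconfig, "en0: flags=8863<UP,BROADCAST> mtu 1500")))  # 'en0: '
print(repr(find(ifconfig, "    inet 127.0.0.1 netmask 0xff000000")))   # '127.0.0.1'
print(repr(find(ifconfig, "no match on this line")))                   # None
```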

I can then create my own .rpl pattern file to store this for future use:

rpl 1.1

package jk
import net

local alias interface = {[:alpha:]+ [:digit:] ":" [:space:]}
ifconfig = interface / net.ipv4

Then I can invoke:

jk@GERTY-MK-VI ~  $ ifconfig | rosie -f ~/jk.rpl grep jk.ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    inet 127.0.0.1 netmask 0xff000000
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    inet 192.168.0.13 netmask 0xffffff00 broadcast 192.168.0.255
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
awdl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1484
en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000
vboxnet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    inet 33.0.123.1 netmask 0xffffff00 broadcast 33.0.123.255
utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380

While I have yet to make use of Rosie’s C library or other interfaces in Ruby or Python, I’m very excited for the possibilities that Rosie’s readability and ease of comprehension bring to application development.
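Short of the library bindings, one low-effort way to try Rosie from application code is to shell out to the CLI. The sketch below does this from Python using only the `rosie grep` and `-f` invocations shown earlier. The function names are hypothetical, and this is not Rosie's Python interface, just a wrapper around the command line.

```python
import shutil
import subprocess


def rosie_grep_command(pattern, rpl_file=None):
    """Build the argument list for the `rosie grep` invocations shown above."""
    command = ["rosie"]
    if rpl_file is not None:
        # subprocess does not expand "~"; pass an already expanded path.
        command += ["-f", rpl_file]
    command += ["grep", pattern]
    return command


def rosie_grep(text, pattern, rpl_file=None):
    """Pipe text through `rosie grep`.

    Returns the output lines, or None when the rosie executable is not on PATH.
    """
    if shutil.which("rosie") is None:
        return None
    result = subprocess.run(
        rosie_grep_command(pattern, rpl_file),
        input=text,
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.splitlines()


print(rosie_grep_command("jk.ifconfig", "jk.rpl"))
# ['rosie', '-f', 'jk.rpl', 'grep', 'jk.ifconfig']
```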

Has anyone else made extensive use of Rosie? I’d be interested in chatting with you and learning more.