Ruby String#split

Taking a string and splitting it with a delimiter is a very common task in Ruby. The official documentation states that String#split divides str into substrings based on a delimiter, returning an array of these substrings.

The delimiter itself can be a string or regular expression:

# string delimiter
"hello".split('')                # => ["h", "e", "l", "l", "o"]
"hello".split('ll')              # => ["he", "o"]

# regular expression delimiter
"hello".split(//)                # => ["h", "e", "l", "l", "o"]
"hello".split(/l+/)              # => ["he", "o"]

String#split takes an optional second parameter representing a limit. From the String#split documentation:

If the limit parameter is omitted, trailing null fields are suppressed.

If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array).

If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.

Here are some examples of how the limit parameter works:

# omitting the limit
"1234 1234".split('4')           # => ["123", " 123"]

# positive limit
"1234 1234".split('4', 1)        # => ["1234 1234"]
"1234 1234".split('4', 2)        # => ["123", " 1234"]

# negative limit
"1234 1234".split('4', -1)       # => ["123", " 123", ""]
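The limit rules quoted above tend to matter most in two everyday situations: keeping a value that itself contains the delimiter, and preserving empty trailing fields. The snippet below is a small illustration of both; the sample strings and variable names are made up for this example.

```ruby
# A positive limit stops splitting once that many fields exist,
# so the second field keeps any further "=" characters.
pair = "name=a=b".split("=", 2)

# With no limit, trailing empty fields are dropped.
fields_default = "a,b,,".split(",")

# A negative limit keeps the trailing empty fields.
fields_all = "a,b,,".split(",", -1)

p pair            # ["name", "a=b"]
p fields_default  # ["a", "b"]
p fields_all      # ["a", "b", "", ""]
```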

These examples have been short and simple, and in my experience most common uses of String#split are. When you start to go beyond the short and simple you will find some behavioral oddities with String#split, and they will always involve regular expression delimiters.

Below is a simple example of a behavioral oddity that String#split may seem to have:

"1234".split(/1/)                  # => ["", "234"]

It seems like the expected result of the above example would be [“234”] since it is splitting on the 1, but instead we’re getting an unexpected empty string.

How String#split works

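Internally String#split treats almost every delimiter as a regular expression. The one exception is a single-space string, which splits on runs of whitespace and ignores leading whitespace. Any other string delimiter behaves as if it were escaped for a regular expression and then turned into a regular expression: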

"1234".split("1")
# is really the same as
"1234".split(Regexp.new(Regexp.escape("1")))

For the remainder of this article when I refer to delimiter I am referring to a regular expression delimiter since internally that is what String#split uses.
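To see why the escaping step matters, here is a small sketch that splits on a string containing a regular expression metacharacter. The sample text and variable names are invented for illustration; the comparison only shows that the results agree, not how any particular Ruby version implements it.

```ruby
text = "1.5+2.5"

by_string = text.split(".")                             # "." as a plain string
by_regexp = text.split(Regexp.new(Regexp.escape(".")))  # same delimiter, escaped by hand
unescaped = text.split(/./)                             # "." as a metacharacter matches every character

p by_string  # ["1", "5+2", "5"]
p by_regexp  # ["1", "5+2", "5"]
p unescaped  # []

# The single-space string is the special case: it is not equivalent to / /.
p " a  b".split(" ")  # ["a", "b"]
p " a  b".split(/ /)  # ["", "a", "", "b"]
```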

String#split keeps track of five important pieces of information:

  • the string itself
  • a results array which is returned
  • the position marking where to start matching the string against the delimiter. This is the start position and is initialized to 0.
  • the position marking where the string matched the delimiter. This is the matched position and is initialized to 0.
  • the position marking the offset immediately following where the string matched the delimiter

String#split operates in a loop. It continues to match the string against the delimiter until there are no more matches that can be found. It performs the following steps on each iteration:

  • from the start position match the delimiter against the string
  • set the matched position to where the delimiter matched the string
    • if the delimiter didn’t match the string then break the loop
  • create a substring using the start and matched positions of the string being matched. Push this substring onto the results array
  • set the start position for the next iteration
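The loop described above can be sketched in plain Ruby. This is a simplified illustration, not Ruby's actual implementation: simple_split is a hypothetical helper, and it assumes that the delimiter never matches an empty string, that it contains no subexpressions, and that no limit is given. It relies on Regexp#match accepting a starting position, which is available from Ruby 1.9 on.

```ruby
# Hypothetical helper that mirrors the steps above (not Ruby's real code).
# Assumes a non-empty-matching delimiter, no subexpressions, and no limit.
def simple_split(string, delimiter)
  results = []
  start = 0

  # Match the delimiter against the string, beginning at the start position.
  while (match = delimiter.match(string, start))
    matched = match.begin(0)                    # where the delimiter matched
    results << string[start, matched - start]   # text between start and the match
    start = match.end(0)                        # offset immediately after the match
  end

  results << string[start..-1]                  # remainder of the string
  results.pop while results.last == ""          # trailing empty fields are suppressed
  results
end

p simple_split("1234", /1/)       # ["", "234"]
p simple_split("1234", /2/)       # ["1", "34"]
p simple_split("1234 1234", /4/)  # ["123", " 123"]
```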

With this knowledge let’s discuss how String#split handles the previous example of:

"1234".split(/1/)                  # => ["", "234"]
  • the first loop
    • the start position is initialized to 0
    • the delimiter is matched against the string “1234”
    • the first match occurs with the first character, “1” which is at position 0. This sets the matched position to 0.
    • a substring is created using the start and matched positions and pushed onto our results array. This gives us string[start, matched - start], which translates to "1234"[0, 0] and returns an empty string.
    • the start position is reset to position 1
  • the second loop
    • start is now 1
    • the delimiter is matched against the remainder of our string, “234”
    • no match is found so the loop is finished
  • a substring is created using the start position and remainder of the string and pushed onto the results array
  • the results array is returned
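The same walkthrough can be reproduced by hand with Regexp#match and String#[]. This is only a trace of the steps listed above; the variable names are chosen for this illustration and do not come from Ruby's source.

```ruby
string = "1234"
delimiter = /1/

# First loop: the delimiter matches at position 0.
first_match = delimiter.match(string, 0)
matched = first_match.begin(0)          # 0
leading = string[0, matched - 0]        # zero-length substring, i.e. ""
start = first_match.end(0)              # 1

# Second loop: nothing left to match in "234".
second_match = delimiter.match(string, start)

# The remainder of the string becomes the last field.
remainder = string[start..-1]

results = [leading, remainder]
p results  # ["", "234"]
```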

Given how String#split works it is easy to see why we have that unexpected empty string in our results array. You should note that this only occurred because the regular expression matched our string at the first character. Below is an example where the delimiter doesn’t match the first character and there is no empty string:

"1234".split(/2/)                  # => ["1", "34"]

Working With Subexpressions

Subexpressions reveal even more oddities in split’s behavior. Here are some examples:

"1234".split(/.(.)/)              # => ["", "2", "", "4"]
"1234".split(/(.)(.)/)            # => ["", "1", "2", "", "3", "4"]
"1".split(/(((.)))/)              # => ["", "1", "1", "1"]
"12 456789".split(/(..).*/)       # => ["", "12"]
"12 456789".split(/(..)(.*)/)     # => ["", "12", " 456789"]

A Quick Subexpression Primer

The parentheses inside the regular expressions in the above examples are subexpressions. A subexpression captures the part of the string that matches the pattern inside the parentheses. Subexpressions can be nested. When a regular expression contains a left or right parenthesis, the regular expression engine does not look for a literal parenthesis. In order to match a literal parenthesis you have to escape it with a backslash.
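A short sketch of the difference between grouping parentheses and escaped, literal ones. The patterns and the sample string "f(x)" are invented for this illustration.

```ruby
grouping = /f(x)/      # parentheses group and capture; they match no literal character
literal  = /f\(x\)/    # escaped parentheses match "(" and ")" themselves
both     = /f\((.)\)/  # literal parentheses around a capturing subexpression

p grouping.match("f(x)")          # nil, because the pattern is looking for "fx"
p grouping.match("fx").captures   # ["x"]
p literal.match("f(x)")[0]        # "f(x)"
p both.match("f(x)").captures     # ["x"]
```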

The below example uses two subexpressions to match the user name and the domain name of a given email address. Subexpressions are numbered in left-to-right order by their opening parenthesis. In Ruby the captured text is stored in special variables named $1, $2, and so on, which are local to the current thread and method. A more object-oriented way to access the matching groups is to use the String#match method, which returns a MatchData object:

# not using a MatchData object
"dennis@atomicobject.com" =~ /(.+)@(.+)/
$1                          # => "dennis"
$2                          # => "atomicobject.com"

# using a MatchData object
md = "dennis@atomicobject.com".match(/(.+)@(.+)/)
md.captures                 # => ["dennis", "atomicobject.com"]
md.captures[0]              # => "dennis"
md.captures[1]              # => "atomicobject.com"

Back to Behavioral Oddities

String#split doesn’t discard what subexpression groups capture. It still splits on the entire match, but it also saves each captured group and pushes it onto the results array that is returned. This is why the below example works the way it does:

# without subexpressions
"1234".split(/..../)             # => []

# with subexpressions
"1234".split(/(.)(.)(.)(.)/)     # => ["", "1", "2", "3", "4"]

Notice how we still get the empty string in the above example. This is for the same reasons as the example in the previous section.
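If grouping is needed but the captured text should stay out of the results, one option is a non-capturing group, written (?:...). The sketch below compares the two forms; the sample string is made up for this illustration.

```ruby
# A capturing group: the matched digits are pushed onto the results.
capturing = "a1b2c".split(/(\d)/)

# A non-capturing group: the digits act only as delimiters.
non_capturing = "a1b2c".split(/(?:\d)/)

p capturing      # ["a", "1", "b", "2", "c"]
p non_capturing  # ["a", "b", "c"]
```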

Here is a final example:

"1".split(/(((.)))/)               # => ["", "1", "1", "1"]
  • there are three nested subexpressions
  • the regular expression matches on the very first character. So an empty string is pushed onto our results array
  • the innermost subexpression is matched to the character “1”.
  • the middle subexpression is matched to the contents of the innermost subexpression which had the contents of “1”.
  • the outermost subexpression is matched to the contents of the middle subexpression which had the contents of “1”.
  • the outermost subexpression contents are pushed onto the results array
  • the middle subexpression contents are pushed onto the results array
  • the innermost subexpression contents are pushed onto the results array
  • the results array is returned, giving [ “”, “1”, “1”, “1”]
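The capture behavior can be added to a hand-written version of the split loop. As before, this is a sketch and not Ruby's actual implementation: split_with_captures is a hypothetical name, and the code assumes the delimiter never matches an empty string, every subexpression takes part in each match, and no limit is given.

```ruby
# Hypothetical helper (not Ruby's real code): a simplified split loop that
# also pushes captured subexpressions, in left-to-right order.
# Assumes non-empty matches, all groups participating, and no limit.
def split_with_captures(string, delimiter)
  results = []
  start = 0

  while (match = delimiter.match(string, start))
    results << string[start, match.begin(0) - start]  # text before the match
    results.concat(match.captures)                    # outermost group first
    start = match.end(0)                              # continue after the match
  end

  results << string[start..-1]                        # remainder of the string
  results.pop while results.last == ""                # drop trailing empty fields
  results
end

p split_with_captures("1", /(((.)))/)         # ["", "1", "1", "1"]
p split_with_captures("1234", /(.)(.)(.)(.)/) # ["", "1", "2", "3", "4"]
p split_with_captures("1234", /..../)         # []
```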

Final Thoughts

The empty string behavior exists in the Ruby 1.8.x series. It also exists in the Ruby 1.9 series up to trunk revision 13760 (the latest available revision as of this writing).

Using subexpressions with String#split can throw off any unsuspecting developer. If you remember that String#split keeps whatever the subexpressions capture and adds it to the results, then you’ll be fine. For more information on regular expressions, the book Mastering Regular Expressions by Jeffrey Friedl is by far one of the best resources available today.