Diff, not That

Do you know about the --ignore-matching-lines option to GNU diff? I recently found a great use for it…

Difference Engine No. 2 counting wheels

We were tasked with comparing large sets of XML documents for a web service product. The project called for a SoapUI testsuite using Groovy scripts to compare complete SOAP responses against files containing the expected responses.

My first thought was of fragility: this is not the best way to validate behavior of a web service. Better to bind these XML documents into meaningful data objects and use high-level business-specific language assertions, à la Cucumber. SoapUI has powerful XPath- and XQuery-based assertion capabilities, but we needed the expectations to be complete XML document samples, not discrete SoapUI assertions. We had to compare entire documents because these samples were intended as an end-user specification for the web service under test. The documented samples had to match actual responses, and the best documentation is an executable test suite that doubles as documentation.

We quickly set up a test harness in SoapUI Pro. Our first attempt simply compared the response with the contents of the file, using Groovy. We hit a problem right away.

The web service responses didn’t match exactly. Our web service schema had a couple of fields that were dynamic and difficult to compare byte-by-byte with saved versions: timestamps and GUIDs. The content of these fields wasn’t important to our business goals; as long as the GUID was there and well-formed, we could disregard its value. In fact, a GUID should never be the same as the recorded one, but we didn’t need to validate that here. The same went for timestamps.

What saved us was GNU diff. We briefly considered implementing an XML diff routine within the Groovy script in SoapUI. That was going to be expensive, and it was not in scope. We next reached for our favorite multi-tool, Ruby, but found integration to be cumbersome. Even with high-level scripting languages and modern XML toolkits, the sort of surgical exceptions we needed to add to an otherwise whole-file diff were not going to be cheap enough to justify within our project schedule.

Then we found this:

# compare files, ignoring differences in GUIDs.
diff -I "[0-9a-f]\{32\}" foo.xml bar.xml

The --ignore-matching-lines or -I option to GNU diff is described this way in the man page:

Ignore changes whose lines all match RE.
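
To make that sentence concrete, here is a minimal sketch, assuming GNU diff is on the PATH. The file names and XML content are invented for illustration. It shows that a hunk is suppressed only when every changed line, removed and added alike, matches the expression.

```shell
#!/bin/sh
# Sketch only: file names and XML content are invented for illustration.
tmp=$(mktemp -d)
printf '<r>\n<id>0123456789abcdef0123456789abcdef</id>\n<name>a</name>\n</r>\n' > "$tmp/expected.xml"
printf '<r>\n<id>fedcba9876543210fedcba9876543210</id>\n<name>a</name>\n</r>\n' > "$tmp/got.xml"
printf '<r>\n<id>not-a-guid</id>\n<name>a</name>\n</r>\n' > "$tmp/malformed.xml"

# Basic regular expression: 32 consecutive lowercase hex digits.
guid_re='[0-9a-f]\{32\}'

# Plain diff reports the changed GUID line (exit status 1).
plain=0
diff -q "$tmp/expected.xml" "$tmp/got.xml" > /dev/null || plain=$?

# With -I the only hunk is ignored, because both the old and the new line match (exit status 0).
ignored=0
diff -q -I "$guid_re" "$tmp/expected.xml" "$tmp/got.xml" > /dev/null || ignored=$?

# A malformed id is still reported: the added line does not match the expression (exit status 1).
malformed=0
diff -q -I "$guid_re" "$tmp/expected.xml" "$tmp/malformed.xml" > /dev/null || malformed=$?

echo "plain=$plain ignored=$ignored malformed=$malformed"
rm -rf "$tmp"
```

One caveat follows from the same rule: the expression is matched against whole lines, so any other difference that shares a line with a GUID is ignored along with it.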

This was perfect: we could use a robust existing diff tool instead of rolling our own, and we could tell it to exclude specific changes via a regular expression (albeit a primitive one). We quickly changed our Groovy script to shell out to diff, telling it to ignore GUIDs and the particular date format we were using. The result worked well, and efficiently too: we used this technique for over 100 document comparisons across our suite without the run time becoming unacceptably slow.
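
As a rough sketch of how two ignore patterns combine, again assuming GNU diff and using made-up sample documents: repeating -I lets each changed line match either expression, and the exit status tells the caller whether any real difference remains.

```shell
#!/bin/sh
# Sketch only: the documents and element names below are hypothetical.
tmp=$(mktemp -d)
cat > "$tmp/expected.xml" <<'EOF'
<response>
  <id>0123456789abcdef0123456789abcdef</id>
  <created>1/2/2012</created>
  <status>OK</status>
  <amount>10</amount>
</response>
EOF
# Same document with a fresh GUID and date only.
sed -e 's/0123456789abcdef0123456789abcdef/fedcba9876543210fedcba9876543210/' \
    -e 's|1/2/2012|11/30/2013|' "$tmp/expected.xml" > "$tmp/got_same.xml"
# Fresh GUID and date, plus a real change in <amount>.
sed -e 's/<amount>10</<amount>99</' "$tmp/got_same.xml" > "$tmp/got_changed.xml"

# Compare a response with the expected file, ignoring lines that hold a GUID or a date.
compare() {
  diff -bq --strip-trailing-cr \
       -I '[0-9a-f]\{32\}' \
       -I '[0-9]*/[0-9]*/[0-9]\{4\}' \
       "$tmp/expected.xml" "$1" > /dev/null
}

if compare "$tmp/got_same.xml"; then same=match; else same=differ; fi
if compare "$tmp/got_changed.xml"; then changed=match; else changed=differ; fi

echo "got_same: $same, got_changed: $changed"
rm -rf "$tmp"
```

GNU diff exits with 0 when no differences remain after the ignore rules are applied and with 1 when some do, which is what makes a plain exit-status check sufficient for a pass/fail assertion.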

Here’s all the Groovy we needed to accomplish this (note that context and messageExchange are provided to the script):

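// In each testcase script assertion:
//
import filecompare.FileCompare
FileCompare.compareWithExpected(context, messageExchange)


// In our Script Library:
//
package filecompare
class FileCompare {
  // Saves the actual response next to the expected file, then fails the
  // assertion if diff finds differences other than GUIDs and dates.
  def static compareWithExpected(context, mex) {
    def testCase = context.testCase.name.toLowerCase()
    def testSuite = context.testCase.testSuite.name
    def projectDir = new File(context.testCase.testSuite.project.path).parent
    def expectedFilename = "$projectDir/expected_responses/${testSuite}/${testCase}"
    def gotFilename = "${expectedFilename}_got.xml"
    new File(gotFilename).write(mex.responseContentAsXml)

    // Basic (grep-style) regular expressions: 32 hex digits, and dates such as 1/2/2012.
    def ignoreGUIDs = '[0-9a-f]\\{32\\}'
    def ignoreDates = '[0-9]*/[0-9]*/[0-9]\\{4\\}'
    // Pass the arguments as a list: String.execute() splits on whitespace without
    // interpreting quotes, so quoted patterns can reach diff with the quotes attached.
    def diffCmd = ['diff', '-bq', '--strip-trailing-cr', '-I', ignoreGUIDs, '-I', ignoreDates,
                   "${expectedFilename}.xml", gotFilename]
    def results = diffCmd.execute()
    results.waitFor()

    // diff exits with 0 when no non-ignored differences remain, 1 when the files differ.
    assert results.exitValue() == 0
    results.destroy()
  }
}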