Faster, better, cheaper! TDD wins in a simple experiment
Earlier this year I had the unusual opportunity to work on a project with another developer (I’ll call him Dave – not his real name) in which each of us was free to choose our own method of application development.
This was certainly not an ideal situation. At first I considered following the development methods of the client (Dave’s company) just for consistency, but I ultimately decided to follow TDD for my portions of the project for several reasons:
- I had discussed it with the PM and Dave and they were interested in seeing how this “TDD Thing” they had been hearing about worked.
- The features of the application that the two of us were working on were easily differentiated.
- I was interested in seeing first-hand the results of TDD vs non-TDD literally side-by-side.
I also invited Dave to pair with me, even if only on occasion, and see for himself how TDD works and to see what he thought of it. He agreed it would be interesting, but it never happened.
The project started off with the two of us sitting down with an Excel sheet filled with the new features to be added to an application that had been started about a year earlier, but had languished for the last several months in its not-yet-usable state. We spent the better part of the day discussing the features and estimating them.
I had been successful in selling the idea of point-based estimating (as opposed to direct time estimates) and we chose a point scale a bit higher than what I would normally have chosen – the ‘average’ story was set at 200 points. Based on this we ended up defining just over 11,000 points of stories over the course of the project.
The original plan when my contract started was for Dave to work 30 hours per week on our project while I was to work 40. Soon after we started, however, it became clear that Dave’s extra-project responsibilities were going to take up much more of his time. In the end, Dave’s hours on feature implementation work totaled 92 to my 217. This roughly corresponded to the number of points each of us completed: Dave was assigned and completed 3550 points of work, while I was credited with 6950.
On the face of it, it would appear that Dave was slightly more efficient than me. After all, I worked 2.36 times as many hours but completed only 1.96 times as much work. To put it another way, he had a development velocity of 38.6 story points per development hour and I had a development velocity of 32.0. However, the hours outside of feature development tell the real story.
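The raw comparison is simple arithmetic. A minimal Python sketch reproduces it from the hours and points quoted in this post; the function and variable names are made up for illustration.

```python
def velocity(points, hours):
    """Story points completed per hour worked."""
    return points / hours

# Feature-implementation hours and completed points, as quoted in the post.
dave_hours, dave_points = 92, 3550
scott_hours, scott_points = 217, 6950

# How much more Scott worked, and how much more he delivered.
hours_ratio = scott_hours / dave_hours      # about 2.36
points_ratio = scott_points / dave_points   # about 1.96

# Development-only velocity, before any bug-fixing time is counted.
dave_velocity = velocity(dave_points, dave_hours)     # about 38.6
scott_velocity = velocity(scott_points, scott_hours)  # about 32.0

print(f"hours ratio:    {hours_ratio:.2f}")
print(f"points ratio:   {points_ratio:.2f}")
print(f"Dave velocity:  {dave_velocity:.1f} points/hour")
print(f"Scott velocity: {scott_velocity:.1f} points/hour")
```

These figures count feature development time only, which is why they favor Dave.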

With each week’s release build we’d detail the new features that we had completed. Then the testing team would give it a good run-through, logging bugs in another Excel spreadsheet. The contents of this sheet tell an interesting story. Dave ended up with 33 entries in the list. I had a total of 22. Not a bad ratio – I had one third fewer issues while doing almost twice the amount of work. But wait – it gets even better: Of the 22 items in the issue log, five were actually additional features requested that I had time to complete due to the short bug list, so my real number of bug entries was 17.
Once the amount of time spent fixing bugs is factored into our development velocities, the relationship between Dave’s and my velocity changes dramatically. I spent 100.75 hours working on items in the testing log. If we assume that the time it took to fix Dave’s 33 bugs was proportionately similar, then we can estimate that 151 hours were spent working on these items (100.75 hours for my 22 log entries, scaled by 33/22).
So if you add this to the time spent in primary development you get:
Scott: 217 + 100 = 317 hours for 6950 points, or 22 points per hour
Dave: 92 + 151 = 243 hours for 3550 points, or 14.6 points per hour

That’s a 50% improvement in overall productivity – pretty good, I’d say.
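The adjusted figures can be checked the same way. The sketch below is only an illustration with made-up variable names. It relies on the post's own assumption that fix time scales with the number of entries in the testing log, and it uses the unrounded 100.75 hours, so the totals differ slightly from the rounded sums above.

```python
# Figures quoted in the post.
scott_dev_hours, scott_points = 217, 6950
dave_dev_hours, dave_points = 92, 3550
scott_log_entries, dave_log_entries = 22, 33
scott_fix_hours = 100.75

# Assumption taken from the post: Dave's fix time is proportional to his
# share of log entries. His actual fix hours were not available.
dave_fix_hours = scott_fix_hours * dave_log_entries / scott_log_entries

# Total effort = feature development plus time spent on the testing log.
scott_total_hours = scott_dev_hours + scott_fix_hours
dave_total_hours = dave_dev_hours + dave_fix_hours

scott_overall = scott_points / scott_total_hours  # about 22 points/hour
dave_overall = dave_points / dave_total_hours     # about 14.6 points/hour

# Relative difference in overall velocity (about 0.5, i.e. 50%).
improvement = scott_overall / dave_overall - 1

print(f"Dave's estimated fix hours: {dave_fix_hours:.1f}")
print(f"Scott overall: {scott_overall:.1f} points/hour")
print(f"Dave overall:  {dave_overall:.1f} points/hour")
print(f"Improvement:   {improvement:.0%}")
```

The result depends entirely on the estimate of Dave's fix time. Changing `dave_fix_hours` shows how sensitive the 50% figure is to that assumption.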
It was nice to see some hard numbers backing up what we have been preaching.


GBGames Says:
January 5th, 2010 at 11:25 AM
“If we assume that the time it took to fix Dave’s 33 bugs was proportionately similar, then we can estimate that 151 hours were spent working on these items.”
Er, didn’t Dave also track the time spent here? Why are we estimating? Now the “hard” numbers are based partially on an estimate. Am I missing something?
Michael Kay Says:
January 5th, 2010 at 11:27 AM
Any data is better than no data, but this is well short of proving anything statistically significant. After all, we know that there will be vast variations between two different programmers using the same methods!
Rob Says:
January 5th, 2010 at 11:30 AM
That’s a really interesting yardstick to work to. I guess there are a couple of things that you didn’t address, which I’d be interested to hear your views on.
1. Are you simply a more capable developer than Dave? (Don’t answer that if you don’t want to.)
2. Dave was working fewer hours than you. Did this cause him to have additional context switching which distracted him or lowered the quality of his work?
3. Objectively, would you say that TDD was the biggest contributing factor to the difference in productivity?
Scott Miller Says:
January 5th, 2010 at 12:00 PM
GBGames: We tracked our development time in the spreadsheet. The bug fixing time was tracked only in our respective companies’ time tracking methods, so I don’t have access to that. Based on my observations during the project I think the estimate is a fair one.
Michael Kay: You’re right – this is anecdotal evidence from a single experiment. I don’t believe I made any grandiose claims of proofs.
Scott Miller Says:
January 5th, 2010 at 12:03 PM
Rob: I will contend that the TDD practices were, in fact, the biggest contributing factor, but of course not the only one.
Nayan Hajratwala Says:
January 5th, 2010 at 12:06 PM
I totally agree that TDD is the right way to go, but I’m not sure your experiment is valid. It’s quite possible that you’re simply a better developer than Dave, who is able to write more bug-free code, or your features turned out to be easier, or …
There are too many variables.
Aaron Says:
January 5th, 2010 at 12:26 PM
I found this humorous:
“I also invited Dave to pair with me, even if only on occasion, and see for himself how TDD works and to see what he thought of it. He agreed it would be interesting, but it never happened.”
Most people think pairing is a great idea, but not for them. Very few are willing to give it a fair shake.
Adam Williams Says:
January 5th, 2010 at 12:42 PM
Fascinating. Though there are too many variables to call this a ‘valid’ experiment, I do think it proves that some developers are worth more than others, and that cannot be determined by lines of code or features produced!
Carl Erickson Says:
January 5th, 2010 at 01:53 PM
I bet the QA people saved time as well. Each bug discovered takes time to chase down, make repeatable, figure out that it’s not a duplicate, document for the developer, and enter into a tracking system. I don’t suppose we can measure this, but I think it would accentuate the advantage you show above.
Justin Hunter Says:
January 5th, 2010 at 04:05 PM
Great post! I love empirical evidence like this and sincerely wish that more people did exactly what you have done here and shared the results (with their colleagues, in blogs, in articles, etc.). As you say, any data is better than none and I like your thoughtful responses to the valid questions posters have raised.
In the area of software testing, I have conducted several dozen such experiments and found that, on average, pairwise and similar combinatorial testing methods (an approach to test case identification) have consistently doubled the number of defects found per tester hour. I co-published the findings in IEEE Computer here:
https://www.hexawise.com/Combinatorial-Softwar-Testing-Case-Studies-IEEE-Computer-Kuhn-Kacker-Lei-Hunter.pdf
Thanks for the post! Keep up the good work and please keep sharing data that you find.
Incidentally, also on a similar theme: In Praise of Data-Driven Management… http://hexawise.wordpress.com/2009/08/18/learning-using-controlled-experiments-for-software-solutions/
Me Says:
January 6th, 2010 at 01:06 PM
What about Dave’s development skills? Maybe he’s not a good programmer…
Pekka Enberg Says:
January 8th, 2010 at 02:21 AM
The most interesting metric for TDD vs. non-TDD is not necessarily “development time” but the cost to change the software. That’s where the test-driven design and the regression test suite (a by-product) should really start to pay off.
james peckham Says:
January 9th, 2010 at 08:32 PM
i’ve never done “official pairing”. it’s usually “hey man would you take a break from that come watch me as i go through this sticky bit of code, make sure i don’t do anything stupid? cool thanks”