It's not rocket science
Jack Ganssle gave a great talk at XP West Michigan last night. The theme of his talk was famous software failures and lessons learned. His personal mission seems to be raising awareness of the root causes of failures to help the embedded community improve its craft. The astounding thing I learned from Jack’s talk was the mundane nature of the common root causes of these failures.
Many of the failures Jack highlighted were from the space industry. His collection ranged from lesser-known ones, like satellite launches that ended in the sea, to famous ones such as the many Mars missions that have gone wrong. Of course many of these failures were later traced to a tiny fault in a large system. The brittleness of software in this respect compared to natural systems is well-understood but nonetheless maddening.
What I found most interesting and shocking was the commonality of root causes and their mundane nature. Post-mortem studies and investigations of the failures identified these underlying problems:
- lack of source control
- putting untested code into production
- working people beyond their sustainable pace
- dictating impossible schedules
Jack claims 40% of the firmware teams he visits aren’t using source control. Lost files, confusion between versions of files, and uncoordinated edits contributed to many of the failures he highlighted. Is this laziness? Ignorance? Lack of professional pride? A lack of discipline on something so trivial from teams responsible for projects with lives and hundreds of millions of dollars on the line is simply astounding.
Lack of testing, failure to regression test, or failure to test systems in the circumstances they’d experience in production was another big theme of the failures. Testing is hard, so this isn’t quite as appalling as lack of source control. But given the budget and schedule pressures these projects seem to have in common, it sure makes the agile practice of test-driven development and test-infected developers look attractive.
The last two common causes are really the same. “They” tried to reduce cost on a project by working the engineers for 60-80 hours per week for months on end to hit a schedule, or “they” refused to allocate funds or time for testing. (Seems like we should just fire “them” and save a lot of projects; the track record of success for missions to Mars is about 50%.) As individual craftsmen we only have absolute control over one thing: ourselves. If you’re working on a project that violates the simplest and most basic practices of our profession, if you’re expected to consistently work beyond your sustainable pace, and if you’re asked to suspend your better judgment and believe magic will happen, then your choice should be obvious, if not simple: quit and find different work.
Evidently rocket science isn’t the hard part of building rockets.


gvb Says:
April 29th, 2009 at 10:45 AM
Jack also had a very interesting scatter plot of project complexity vs. schedule. As expected, it formed a cloud whose average could be modeled as a linear plot of the line x = y (linear, through the origin). Jack also indicated the success and failure of each project in the scatter plot and then drew a line 30% below the average line. Every point below the line was a failed project. Only a few points above the line were failed projects.
The conclusion: compressing schedule by more than 30% GUARANTEES failure.
Carl Erickson Says:
April 29th, 2009 at 01:32 PM
Thanks for describing the schedule vs complexity graph. That was indeed a compelling way to illustrate the danger of pushing a project schedule. I don't remember what these projects were, or how failure was defined. Did you catch that?
gvb Says:
April 29th, 2009 at 02:02 PM
The graph did not have specific projects labeled, but I suspect it was NASA/space related given the area of discussion at the time.
In terms of how failure was defined, there were three dominant themes...
- becoming lots of small pieces in the ocean,
- becoming lots of small pieces on a planet, or
- running out of fuel and drifting endlessly through space because the watchdog timer was not enabled.
;-)
Allen Moore Says:
May 1st, 2009 at 01:49 PM
I attended one of Jack's seminars a few years ago (out of my own pocket--not management's), and everything he says is golden. You were lucky to be able to attend. If only software developers could get "them" to attend as well. Unfortunately the "them" you refer to are generally in management, don't go to software development seminars, and are rather difficult to fire by the rank-and-file.