It's not rocket science

Jack Ganssle gave a great talk at XP West Michigan last night. The theme of his talk was famous software failures and lessons learned. His personal mission seems to be raising awareness of the root causes of failures to help the embedded community improve its craft. The astounding thing I learned from Jack’s talk was the mundane nature of the common root causes of these failures.

Many of the failures Jack highlighted were from the space industry. His collection ranged from lesser-known ones, like satellite launches that ended in the sea, to famous ones, such as the many Mars missions that have gone wrong. Many of these failures were later traced to a tiny fault in a large system. The brittleness of software in this respect, compared to natural systems, is well understood but nonetheless maddening.

What I found most interesting and shocking was the commonality of root causes and their mundane nature. Post-mortem studies and investigations of the failures identified these underlying problems:
  • lack of source control
  • putting untested code into production
  • working people beyond their sustainable pace
  • dictating impossible schedules

Jack claims 40% of the firmware teams he visits aren’t using source control. Lost files, confusion between versions of files, and uncoordinated edits contributed to many of the failures he highlighted. Is this laziness? Ignorance? Lack of professional pride? A lack of discipline on something so trivial from teams responsible for projects with lives and hundreds of millions of dollars on the line is simply astounding.

Lack of testing, failure to regression test, or failure to test systems in the circumstances they’d experience in production was another big theme of the failures. Testing is hard, so this isn’t quite as appalling as lack of source control. But given the budget and schedule pressures these projects seem to have in common, it sure makes the agile practice of test-driven development and test-infected developers look attractive.

The last two common causes are really the same. “They” tried to reduce cost on a project by working the engineers 60-80 hours per week for months on end to hit a schedule, or “they” refused to allocate funds or time for testing. (Seems like we should just fire “them” and save a lot of projects; the track record of success for missions to Mars is about 50%.) As individual craftsmen we only have absolute control over one thing: ourselves. If you’re working on a project that violates the simplest and most basic practices of our profession, if you’re expected to consistently work beyond your sustainable pace, and if you’re asked to suspend your better judgment and believe magic will happen, then your choice should be obvious, if not simple: quit and find different work.

Evidently rocket science isn’t the hard part of building rockets.



4 Responses to “It's not rocket science”

  1. gvb Says:

    Jack also had a very interesting scatter plot of project complexity vs. schedule. As expected, it formed a cloud whose average could be modeled as the line y = x (linear, through the origin). Jack also marked the success or failure of each project in the scatter plot and then drew a line 30% below the average line. Every point below that line was a failed project. Only a few points above the line were failed projects.

    The conclusion: compressing schedule by more than 30% GUARANTEES failure.

  2. Carl Erickson Says:

    Thanks for describing the schedule vs complexity graph. That was indeed a compelling way to illustrate the danger of pushing a project schedule. I don't remember what these projects were, or how failure was defined. Did you catch that?

  3. gvb Says:

    The graph did not have specific projects labeled, but I suspect it was NASA/space related given the area of discussion at the time.

    In terms of how failure was defined, there were three dominant themes...

    1. becoming lots of small pieces in the ocean,
    2. becoming lots of small pieces on a planet, or
    3. running out of fuel and drifting endlessly through space because the watchdog timer was not enabled.
    ;-)

  4. Allen Moore Says:

    I attended one of Jack's seminars a few years ago (out of my own pocket--not management's), and everything he says is golden. You were lucky to be able to attend. If only software developers could get "them" to attend as well. Unfortunately the "them" you refer to are generally in management, don't go to software development seminars, and are rather difficult for the rank-and-file to fire.

