Planning for failure while shooting for the moon

I studied the Challenger shuttle disaster as part of my Master’s degree in Applied Ethics. I am no stranger to the project management – and, as it turns out, occasional project mismanagement – of NASA and its contractors. Yet in researching a recent blog post, I also came to appreciate how seriously they endeavoured to learn from, and anticipate, their mistakes.[1] NASA’s conservative approach, and the success of that approach, is testimony to the benefits of planning for failure.[2] After each incident, NASA did not limit its response to hardware redesign – it also redesigned its very organisation. The mission guidelines for Mercury – one of the stepping-stones to Apollo – reflected a zero tolerance for scope creep and an outright rejection of unnecessary risk.

“Existing technology and off-the-shelf equipment should be used wherever practical, the simplest and most reliable approach to system design would be followed, an existing launch vehicle would be employed to place the spacecraft into orbit, and a progressive and logical test program would be used.”

The subsequent Gemini program saw trial dockings that, once perfected, would be used in the Apollo missions when the lunar and command modules reconnected. NASA could have had confidence in their theories, in the simulated expertise of their astronauts, and simply crossed their fingers. Instead, they repeatedly trialled dockings in low orbit – with cheaper rockets, fewer lives on the line, and less political kudos at risk if things went wrong. This testing paid off – the Lunar Module was the only component of the Apollo/Saturn systems that did not fail in a way that affected its mission, and in fact saved the lives of the Apollo 13 crew.

Consider also how thoroughly the emergency procedures, and their infrastructure, were tested in the lead-up to the Apollo 11 mission. During this time, the pressure was on – the Space Race (and the Cold War) were in full swing. In any other organisation, there would have been an enormous expectation to simply ‘bypass safety protocols and engage the warp core.’ Each pre-landing Apollo mission was an experiment that integrated lessons learned from the previous missions, introduced new variables, and then tested concepts in new conditions to ensure their trustworthiness.

These seven Apollo missions (the unmanned Apollo 4, 5 and 6 missions, and the manned Apollo 7 to 10 missions, none of which actually touched the surface of the moon) were testimony to an incredible conservatism and regard for human life. NASA spent an enormous portion of its budget double-checking systems that it hoped it would never use, for example by simulating an abort of the Apollo Lunar Module descent.[3]

What makes this even more noteworthy is that most of these test flights and confidence builders required the same (incredibly expensive) payload delivery system: the Saturn V rocket.

Such thoroughness seems – particularly in hindsight – excessive.[4] However, contrast this with the Soviet manned lunar program, whose rocket of choice failed catastrophically in one instance, destroying the launch complex and delaying the program for two years. Consider also that the Shuttle program has claimed more lives than the Gemini and Apollo programs, which managed to deliver a dozen people far beyond low orbit to the surface of the moon, and back again.

End notes

[1] “Immediately after the fire, NASA convened the Apollo 204 Accident Review Board to determine the cause of the fire. Although the ignition source was never conclusively identified, the astronauts' deaths were attributed to a wide range of lethal design and construction flaws in the early Apollo Command Module. The manned phase of the project was delayed for 20 months while these problems were corrected.” – Wikipedia

[2] Of course, NASA focused its preoccupation with testing where it would be most relevant. For instance, Apollo 7, the first manned mission after Apollo 1, lasted 10 days. Yet it was not as if their entire staff and all their contractors, in all of their buildings, had to perform a two-week fire drill.

[3] In the context of a large-scale software system rollout, for instance, this would equate to making a simulated rollback (to the legacy system) an essential part of implementation.

[4] Apollo 10, for instance, was described as a ‘dress rehearsal’, where they did everything except actually land on the moon, and came within roughly 50,000 feet (about 15 km) of the surface.