Building Better Software By Learning From Accidents.

As some of you might have noticed, on 18th October 2009 Twitter posted a short and simple message on its status website. The message read:

'We are currently investigating a problem whereby some Twitter clients in widely scattered network locations are unable to connect to twitter.com'.

Put simply, Twitter was down yet again.

When the problem was resolved and Twitter came back up, the same message that had announced the outage was updated to reflect that the problem had been fixed. The update read:

'Network should be back to normal now. We are investigating the cause of the outage and will update this with more information when we know more.'

If you are an avid Twitter user, this probably no longer comes as surprising news.

When Twitter goes down, we, the faithful Twitter users, wait for it to come back up. The world moves on, the sky continues to be blue, the birds keep chirping, and once Twitter is back up again we get back to using it.

I love Twitter. Don't get me wrong, dear reader. The point of this post is not to pick on Twitter's downtime.

We have enough of you out there doing that already.

As far as Twitter's downtime is concerned, my stand is simple: when they go down I wait for them to come back up, and when they are up I get back to using their service. Yes, I love Twitter, but my world does not revolve around it, and their going down once in a while does not upset me the way it upsets the most passionate Twitter users.

Bertrand Meyer, on the other hand, has a different perspective on Twitter or any other public website going down. He explains:

Software disasters continue; they attract attention when they arise, and inevitably some kind of announcement is made that the problem is being corrected, or that a committee will study the causes; almost as inevitably, that is the last we hear of it.

In the latest issue of Risks alone, you can find several examples.

In the past months, breakdowns at Skype, Google and Twitter made headlines; we all learned about the failures, but have you seen precise analyses of what actually happened?

Bertrand compares software development to a disciplined and much more structured industry, aviation, and talks about 'learning by accidents'. He explains:

Airplanes today are incomparably safer than 20, 30, 50 years ago: 0.05 deaths per billion kilometers. That’s not by accident.
Rather, it’s by accidents. What has turned air travel from a game of chance into one of the safest modes of traveling is the relentless study of crashes and other mishaps. In the US the National Transportation Safety Board has investigated more than 110,000 accidents since it began its operations in 1967.
Any accident must, by law, be investigated thoroughly; airplanes themselves carry the famous “black boxes” whose only purpose is to provide the evidence in the case of a catastrophe. It is through this systematic and obligatory process of dissecting unsafe flights that the industry has made almost all flights safe.
Now consider software. No week passes without the announcement of some debacle due to “computers” — meaning, in most cases, bad software. The indispensable Risks forum and many pages around the Web collect software errors; several books have been devoted to the topic.
A few accidents have been investigated thoroughly; two examples are Nancy Leveson’s milestone study of the Therac-25 patient-killing medical device, and Gilles Kahn’s analysis of the Ariane 5 crash (which Jean-Marc Jézéquel and I used as a basis for our 1997 article).

Both studies improved our understanding of software engineering. But these are exceptions. Most of what we have elsewhere is made of hearsay and partial information, and plain urban legends (like the endlessly repeated story about the Venus probe that supposedly failed because a period was typed instead of a comma — most likely a canard).

Bertrand, in his post, goes so far as to suggest that we start lobbying for a Software Incident Full Disclosure Law. He explains:

Progress in software engineering will come from many sources. Research is critical, including on topics which today appear exotic. But if anyone is looking for one practical, low-tech idea that has an iron-clad guarantee of improving software engineering, here it is: pass a law that requires extensive professional analysis of any large software failure.

The details are not so hard to refine. The initiative would probably have to start at the national level; any industrialized country could be the pioneer. (Or what about Europe as whole?) The law would have to define what constitutes a “large” failure; for example it could be any failure that may be software-related and has resulted in loss either of human life or of property beyond a certain threshold, say $50 million.
In the latter case, to avoid accusations of government meddling in private matters, the law could initially be limited to cases involving public money; when it has shown its value, it could then be extended to private failures as well.
Even with some limitations, such a law would have a tremendous effect. Only with a thorough investigation of software projects gone wrong can we help the majority of projects to go right.
We can no longer afford to let the IT industry get away with covering up its failures. Lobbying for a Software Incident Full Disclosure Law is the single most important step we can take today to make the world’s software better.

If you have been reading this blog for a while, you probably know my take on rules and laws. I am a firm believer in flocking free and yet flocking responsibly. Besides my own personal dislike of rules, government intervention in Google or Twitter seems to defeat the whole user-driven democratization of the internet.

The internet, dear reader, has a government, and unlike most governments out there, this one works.

Remember the users, who decide whether they want their account on Twitter or FriendFeed?

I think we, as users, are fully capable of voting a site up by giving it our time and attention, or voting a site down by not giving a damn about it.

The model works.

It is why nobody uses MSN, why users picked Google over Yahoo, and why AltaVista was sent to gather dust in utter silence.

For projects involving lives and public money, sure. But for a private organization offering a free service, like Google or Twitter, I would rather be in the camp that lobbies 'against' the Incident Full Disclosure 'Law' than in the one that lobbies 'for' it.

Having said that, the gist of what Bertrand says is pristine and pure: talk about your failures, investigate the root cause, and share that information.

Take the guys at 37Signals, for example. These are guys who genuinely believe in 'Publicizing Your Screwups'. Their famous Getting Real book describes the idea rather articulately:

If something goes wrong, tell people. Even if they never saw it in the first place.
For example, Basecamp was down once for a few hours in the middle of the night. 99% of our customers never knew, but we still posted an "unexpected downtime" notice to our Everything Basecamp blog. We thought our customers deserved to know.
Here's a sample of what we post when something goes wrong: "We apologize for the downtime this morning — we had some database issues which caused major slowdowns and downtimes for some people. We've fixed the problem and are taking steps to make sure this doesn't happen again...Thanks for your patience and, once again, we're sorry for the downtime."
Be as open, honest, and transparent as possible. Don't keep secrets or hide behind spin. An informed customer is your best customer.

Even a company as open and transparent as 37Signals seems to stop at 'we-had-some-database-issues'. I wonder if someone at 37Signals considers these questions every time their service is down:

Can we explain what this 'database issue' was?
Do the users deserve to know what the issue was and what we did, or are doing, to prevent the same problem from surfacing again?
Are the root cause and the fix worth sharing, maybe on a separate technical blog?

Put simply, does the screw-up being advertised, as minor as it was, create knowledge that the development community in general can benefit from?

If yes, why not share that knowledge with the rest of the software development world as well?

If there were a blog where a project-path developer talked about his screw-ups and shared his insights on the fixes, or what he learnt from the screw-ups, I would be its first regular reader.

If folks at Twitter talked about what takes Twitter down and how they resolve the issues, I would love to learn from the issues and challenges they face.

If someone at Google talked about how an invalid configuration file that pretty much brought the entire internet to a stop was pushed to their production servers, and what they did to avoid the issue in the future, I would shut-the-f@#ck-up, listen, and learn from their mistakes.

Would you not like to learn from the largest and most remarkable mistakes out there too?

Oh, and while we are at it, what was your biggest screw-up?

Did you post it live, along with the lessons it taught you, dear reader?

Discuss.
