Posted On: Friday, 17 December 2010 by Rajiv Popat

Software Disaster Virgins.

When it comes to programming, they say that you are a virgin till the time you have seen at least one nasty COLOSSAL F@#CKUP. I'm not talking about the "Oh, we are really sorry about the 15 minute downtime last evening" kind of f@#ckup here. I am talking about the f@#ckups which have the potential of costing you your job, your project and at times even winding up your organization.

A DISASTER! That's what I'm talking about. That's what this series of posts is about.

Now, the thing with these software disasters is that they are not exactly out there looking for you. Most developers are going to see just one or two of these in their entire lifespan, and if you are a consultant, let's face it, you are going to sense the disaster coming, upload your resume on a job portal, send out a few mail blasts, run to find a new job and, as soon as you do, never look back.

Let's face it, most programmers haven't seen even one of these f@#ckups in their entire life. Which is why, when they bump into one, they are so utterly confused and lost that they either blank out or do something stupid, like try to blame it all on the other programmer who also happens to be one of their best buddies. Most managers, it seems, have not seen these f@#ckups either, and my theory on why is that their resumes are usually the first to hit the job portals when things begin to heat up. Talk about leadership... managers are notoriously famous for leading from the front when it comes to scrambling away like rats when things start going wrong.

For all those of you little boys and girls out there who have not yet lost their developer virginity, I thought I'd give you a list of things not to do, a list of things to do and, most importantly, a quick guide to the various mindsets that you will need to decipher and deal with if you find yourself in the middle of a software disaster. Now, if you are one of those who go running to a job portal or a placement consultant at the first sign of trouble (I interview about a dozen of these every couple of months), this article is not for you. For you, we have job portals like this one.

However, if you are someone who understands that surviving one makes you a better developer, a better manager, a much nicer person, creates long term memories and fundamentally changes the way you see your team, your work and your work life, sit back, relax and read on. There is going to be way too much information, so make sure you are awake and drawing your own conclusions; all I have for you today are some random thoughts, and you are going to have to do the hard work of gathering them into a coherent stream and making sense of them yourself.

So, things you should know, things you should do and things you should not do during a software disaster. Before we get to that bit, however, it's important you understand a typical software disaster really well, primarily because when you understand what you are dealing with, you automatically tend to get better at dealing with it. Let's talk about the disaster itself and how it typically shows up. Ready? Let's begin!

Dissection Of A Software Disaster.

It typically shows up as a simple support phone call. Usually during the evening. You are sipping your coffee, the music playing low, deep in the flow, when the phone rings. You don't want to get it. You ignore it. A couple of minutes later a simple support email drops into your inbox. Some user is not able to log in to your application. "He's probably forgotten his password or something," you tell yourself. You are about to brush it aside when you blink. Something tells you there is something wrong. You decide to take a look. You try to log in with a test user. No luck. Maybe it's just a bad configuration which has locked everyone out. The problem has your attention now. You take a look. Everything in the configuration seems fine. You hit the database server. A quick select from the User table on your production database. Then you sit there for a few minutes, wondering what just happened, staring at a screen with zero rows and a blank user table.

F@#CK! - is all you can say or think of.

It takes you about 15 seconds to realize what just happened here. The data is gone. Zip! Just like that. No users in the user table. A tiny little parallel thread spawns in your head. You are using that thread to replay your argument with your business analyst, where you were telling him that administrators should not be allowed to delete a user if data for that user exists, but he insisted on deleting all of a user's data when the user is deleted. You wonder, maybe (just maybe) your argument would have prevented this. Another parallel thread. This one is thinking about the day your DBA added that Cascade Delete option after telling your Business Analyst that he wasn't happy doing it and strictly recommended that it not be done. But it was done, and everyone had celebrated the release of a new version with all requirements met.
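To make that Cascade Delete option concrete, here is a minimal sketch of what a cascading foreign key does. The table and column names are hypothetical, and SQLite (via Python) is used only because it needs no server:

# A minimal sketch of the cascade described above. Table and column names are
# hypothetical; SQLite is used only because it needs no server.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""
    CREATE TABLE Orders (
        Id     INTEGER PRIMARY KEY,
        UserId INTEGER NOT NULL
               REFERENCES Users(Id) ON DELETE CASCADE,  -- the option the DBA reluctantly added
        Total  REAL
    )""")

conn.execute("INSERT INTO Users VALUES (1, 'alice')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 49.99)")

# One delete on the parent table...
conn.execute("DELETE FROM Users WHERE Id = 1")

# ...and every child row goes with it. No warning, no confirmation.
print(conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0])  # prints 0

That is the whole point of the option, of course; the trouble starts when the parent rows are deleted by someone who was never supposed to be able to delete them.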

The Backup! The primary thread in your brain shouts out loud. Your daily automated incremental backups and the log backups you run every two hours. The sheer effort the team had put into the backup strategy is going to pay off now.

You snap out of your panic mode, try to shut down all the other parallel threads in your brain and focus, but you can't. You are calculating the damage. Even with the backup you are going to lose a couple of hours of data. You are going to say sorry to your users, maybe send an email. It's going to be fine. Thoughts like these race through your head as you log into the database backup server. A fifteen minute restore operation is going to set everything back to normal, if only you can find the incremental backup for today... you scroll through the files... to find the latest incremental backup... which is... about a couple of MONTHS OLD!

Turns out your automated backup process ran out of disk space a couple of months ago. It would have sent you an email notification when that happened, but the IT guys changed the IP address of the email server without bothering to tell anyone, and the DBA didn't update the SMTP configuration in the automated backup job. So the email code had been sitting there quietly, logging failures to your Event Manager, while you were expecting an email if the backup didn't happen.
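A minimal, hypothetical sketch of the kind of alerting code that paragraph describes; the server address and mailboxes are made up, and Python's smtplib stands in for whatever the backup job actually used:

# Hypothetical sketch: a backup job's alert routine that swallows SMTP errors
# and merely logs them, so a dead mail server turns "backup failed" into silence.
import logging
import smtplib
from email.message import EmailMessage

log = logging.getLogger("backup-job")

SMTP_HOST = "10.0.0.25"  # hypothetical: the mail server's OLD address, never updated

def notify_backup_failure(reason: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "Nightly backup FAILED"
    msg["From"] = "backup-job@example.com"
    msg["To"] = "dba@example.com"
    msg.set_content(reason)
    try:
        with smtplib.SMTP(SMTP_HOST, timeout=10) as smtp:
            smtp.send_message(msg)
    except (OSError, smtplib.SMTPException) as exc:
        # The alert's only trace ends up in a log nobody is watching,
        # which is exactly how a dead backup stays invisible for months.
        log.error("Could not send backup alert: %s", exc)

The defensive counterpart is equally small: an independent check that fails loudly when the newest backup file is older than it should be, instead of trusting that an alert email will always make it out.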

F@#ck - is all you want to say now, but you lack the words or the processing power in your brain to say it out loud. The primary thread in your brain is fumbling for alternative recovery options. It's trying to think of an answer... frantically.

The parallel threads in your brain are all busy doing damage calculations now. What you do not know is that this is just the beginning. From this point on, Murphy's Law is going to kick in. Colloquially speaking, from this point on everything that can go wrong... will.

The phone rings again. You glance at your phone's display. It's your manager calling. You realize that he had called a few minutes ago when you blissfully ignored his call because you were in the flow. He is probably getting countless calls from your users by now. You want to focus on fixing the problem. You do not want to pick up the phone. But you know you have to... and you do.

The Disaster that can change you forever has begun.

I have witnessed a few of these in my life and have been a part of more than one of them myself. During my consulting days I have also played the firefighter whom people hired to clean up their shit when these disasters occurred. If there is one thing I have learned by observing all of these disasters, it is that all of them (every single one of them) have one thing in common: they are like airplane crashes.

Airplane disasters, the ones which cost lives and millions of dollars, are typically not just the result of pilot error. If you have spent any time reading up on airplane disasters, what you will learn is that most of them are caused by a series of small events. Some company somewhere manufacturing spares decides to work on the brink of the lowest quality allowed by safety standards, some purchase officer who wants to maximize profit orders those parts, some maintenance officer doesn't inspect the parts seriously enough between flights, one of the latches gives, the pilot panics, someone uses mitigated speech while communicating with the control tower... the list of things that go wrong in any given incident is almost endless. At any given point there are at least eight or nine things which are wrong, none of which would have been capable of causing a disaster single-handedly, but collectively they do. That right there is how airplane disasters work... and software disasters (if you have been reading my description above) are no different.

A manager thinks that he can push the programmers, programmers think they can work on the brink of quality and get away with it, QA thinks it's OK to work under the assumption that the programmer must have done some basic sanity testing. Your backups fail, configurations change... add a few more tiny mishaps into the picture and one small broken window turns into a torn down warehouse or a bashed up car over time, slowly and steadily. By the time you find out, the disaster has already happened. A young college kid, just for the kick of it, sneaked up on you, ran a quick SQL injection attack and wiped out all your users. You are now heading towards the ground in your first spiral dive and you have no clue how bad things can get from here.
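For anyone who has not seen the trick, here is a minimal sketch of the kind of injection being described. The login query, table names and payload are hypothetical, and SQLite via Python stands in for whatever database the application in the story runs on:

# Classic SQL injection sketch: string-built SQL lets input rewrite the query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Id INTEGER PRIMARY KEY, Login TEXT, PasswordHash TEXT);
    INSERT INTO Users VALUES (1, 'alice', 'x'), (2, 'bob', 'y');
""")

def find_user_vulnerable(username: str) -> None:
    # Never do this: the user's input becomes part of the SQL text. executescript()
    # stands in for drivers and servers that happily run batched statements.
    conn.executescript(f"SELECT * FROM Users WHERE Login = '{username}'")

def find_user_safe(username: str):
    # Parameterized query: the input can only ever be data, never SQL.
    return conn.execute("SELECT * FROM Users WHERE Login = ?", (username,)).fetchall()

# The college kid's payload: close the string, append a second statement.
find_user_vulnerable("nobody'; DELETE FROM Users; --")
print(conn.execute("SELECT COUNT(*) FROM Users").fetchone()[0])  # prints 0 -- the users are gone

One string concatenation, one cascading delete, one stale backup, one broken alert email: no single item on that list would have sunk you on its own.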

While airplane disasters can be attributed to a series of small things going wrong in a precise order, and are almost never caused by a single individual's mistake, a pilot's way of reacting to the disaster means the difference between crashing and recovering. Inexperienced pilots either choke or panic. The ones who have seen a spiral dive before do not choke or panic, and they tend to survive.

Do you think you can survive a disaster of this sort? Do you think your organization can? Do you think your team can? The answer really depends on how closely you know these disasters, what to do in them, what not to do in them and, above all, how well you understand how these disasters work and how people react when they are in the middle of them. Which brings us to the first item on the checklist: stuff you should do when you find yourself in the middle of the disaster... a perfect topic for the next post.

Stay tuned.

