Sunday, April 26, 2009

Kobayashi Maru - Dealing with downtime


One of the worst things that can happen to an online business is downtime. It doesn't matter whether the servers crashed, the network connection "broke", or the data center was compromised; the result is the same: your users can't access your service.

Theory vs. Practice
In theory, there is no difference between theory and practice - in practice, there is.

Ideally, a website should never suffer any downtime. In reality, downtime can't be avoided; it will happen, whether planned or unplanned. Unlike brick-and-mortar structures, which are designed to withstand storms and earthquakes, online businesses are fragile, with many single points of failure: DNS, loss of power, failed hardware, the easily cut T-1 or OC-3 cable, etc.

Implementing a contingency plan isn't an easy task, and your plan will only work well when systems fail the way you expected them to fail. You not only have to figure out what you're going to do under certain circumstances, you also have to rehearse it on a regular basis - this is what the military refers to as "exercise training". That doesn't mean exercise such as pull-ups or push-ups, but rather wargaming the scenarios that are most likely to occur and then actually executing the planned response.

Contingency Planning At Adjix
Adjix is a small company with limited resources, so we've implemented a simple solution to keep links running if our servers go down. We do this by relying on Amazon's web servers.

Every time our users create a shortened Adjix or ad.vu link, we implement it as a meta-refresh web page instead of the industry's more common HTTP redirect (sometimes referred to as a 301 or 302 redirect). A meta-refresh page is just a static HTML file, so it can be served by Amazon's web servers without any help from ours. Should our servers go down, we can make a quick DNS change, which takes about five minutes to propagate throughout the Internet. We believe broken links are a bad thing, and although we may not be able to capture detailed link click data when this happens, our links will continue to work. We've tested this plan at Adjix a few times without skipping a beat - of course, that's no guarantee that it will work perfectly in a future crisis, but it does lower risk while increasing confidence.
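To make the idea concrete, here is a minimal sketch of what generating such a page might look like. It is not Adjix's actual code: the function name `meta_refresh_page`, the zero-second delay, and the fallback "Continue" link are all assumptions made for illustration. The point is only that the whole redirect lives in a static file that any static file host could serve.

```python
from html import escape


def meta_refresh_page(target_url: str, delay_seconds: int = 0) -> str:
    """Return a static HTML page that sends the browser to target_url.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    # Escape the URL so characters such as & and " are safe inside attributes.
    safe_url = escape(target_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        f'<meta http-equiv="refresh" content="{delay_seconds};url={safe_url}">\n'
        "<title>Redirecting</title>\n"
        "</head>\n"
        "<body>\n"
        # Plain link as a fallback for clients that ignore meta refresh.
        f'<p><a href="{safe_url}">Continue</a></p>\n'
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Example target URL; in practice this would be the long link being shortened.
    print(meta_refresh_page("http://example.com/some/long/path?a=1&b=2"))
```

Because the output is plain HTML, the same file can be written once when the link is created and copied to a second host ahead of time, which is what makes a DNS-only failover possible.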

Serving up shortened URLs in this manner is not the industry norm, but we believe it gives us the security that, should bad things happen, we can keep our links working. No plan is perfect - it's foreseeable that both our servers and Amazon's servers could go down at the same time - but we believe it's a solid plan commensurate with our budget.