Wednesday, February 20, 2008

Can you achieve 100% up-time?

As an online registration provider, we strive for 100% availability of our event management software. This is very difficult to achieve in practice, and we occasionally have outages ranging from a few minutes on all servers to intermittent performance for hours affecting only specific customers. Our customers' reaction to these hopefully infrequent episodes ranges from understanding ("oh well, I have a lot of other things to do right now - I'll check back later.") to complete outrage ("I'm losing paid registrations! I have to get this report to the hotel now!").

The "New Normal"?

This article, On-Demand Outages the "New Normal"?, takes the interesting view that as people become more familiar with daily use of Software as a Service (SaaS), such as our online registration technology, they will realize that these outages are inevitable and will adjust their expectations and actions during the occasional downtime. No one is immune to the problem - the article references recent and periodic performance troubles at major SaaS companies such as Salesforce, Blackberry, and Google Apps.

My Old Friend, Who I Dearly Do Not Miss

I remember the early days of Windows when the "blue screen of death" was an unwelcome part of my daily routine. Reboot, stand up and stretch, log in, get some water, wait for the programs to open, pick back up where I left off. The blue screens probably did me more good ergonomically, by interrupting long sessions slouched over my PC, than they caused me harm - that is, once I learned to control my temper.

Getting Better, but We're Not There Yet

I'm not excusing poor performance or saying that we won't get better (as we have in the past). But maybe we as users need to begin to consider intermittent outages of SaaS as unavoidable and expected, and thus we should create contingency plans for what we will do when they happen (besides freaking out).

2 comments:

Anonymous said...

Nice, Rick! I wouldn't mind forwarding this to some of our customers. However, considering they have 40-plus users on the system at any one time, I think that slowness and outages have a greater effect on them, and they might not agree.

Rick Borry said...

Regarding Bob's comment about customers with a large number of critical users -

I agree with them that we should continually improve our uptime and strive for 100% availability outside scheduled maintenance windows. But the point of the article is that businesses shouldn't treat SaaS outages as "unexpected". They should plan for these periods as part of their normal contingency planning, their IT staff should work to minimize the outage's effect (e.g. by having offline options available), and they should educate their staff on what to do when it happens.

For example, a policy could be:

1. When an outage happens, first check a major site such as www.yahoo.com or www.google.com to make sure that the problem isn't with your Internet connection

2. Then check with one colleague to make sure the problem is general

3. If this is the first time the outage has happened in the past 24 hours, then take a break or do something else and check back in 5 minutes or later

4. If the SaaS is still unavailable and access is critical, then contact the IT person or SaaS support to find out whether they are aware of the problem, what their diagnosis is, and how long they expect it to take to resolve. Note that companies should have a single contact who communicates with the SaaS vendor and then spreads applicable news to all affected employees internally.

5. After 30 minutes, staff who critically need access should implement offline options as prepared by their IT group in advance, e.g. use offline software and overnight backups of data

etc.
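
For IT staff who want to turn a policy like this into a checklist tool, the steps above could be sketched roughly as follows. This is only an illustration: the `OutageState` fields, the `next_action` function, and the wording of each action are hypothetical names made up for this example, and the advice for non-critical users is an assumption, since the list above only covers staff who critically need access. The 5-minute and 30-minute thresholds come from steps 3 and 5.

```python
from dataclasses import dataclass


@dataclass
class OutageState:
    """Hypothetical snapshot of what a staff member knows during an outage."""
    internet_ok: bool         # step 1: does a major site such as www.google.com load?
    colleague_affected: bool  # step 2: does one colleague see the same problem?
    first_in_24h: bool        # step 3: is this the first outage in the past 24 hours?
    access_critical: bool     # steps 4-5: does this person critically need access?
    minutes_elapsed: float    # minutes since the outage was first noticed


def next_action(state: OutageState) -> str:
    """Return the next step suggested by the example policy."""
    # Step 1: rule out a problem with the local Internet connection.
    if not state.internet_ok:
        return "check your own Internet connection"
    # Step 2: confirm the problem is general, not limited to one machine.
    if not state.colleague_affected:
        return "troubleshoot your own machine or account"
    # Step 3: on the first outage of the day, wait at least 5 minutes.
    if state.first_in_24h and state.minutes_elapsed < 5:
        return "take a break and check back in 5 minutes"
    # Assumption: staff without a critical need simply keep waiting.
    if not state.access_critical:
        return "do other work and check back later"
    # Step 4: escalate through the single internal contact.
    if state.minutes_elapsed < 30:
        return "ask the single point of contact to check with SaaS support"
    # Step 5: after 30 minutes, fall back to the prepared offline options.
    return "switch to the offline options prepared by IT"
```

For instance, `next_action(OutageState(True, True, True, True, 45))` walks past the first four steps and lands on the offline fallback from step 5.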

Overall, the benefits of using a SaaS greatly outweigh the costs of planning around outages, and all reputable vendors are going to have occasional outages and are going to improve over time.