Posted by: Jeff Timmins | October 16, 2010

The right and wrong way to communicate outages

More on communication and outages.  Note – pulling this out of my Drafts so some of this is a little old.

Intuit – http://www.zdnet.com/blog/btl/intuit-services-back-online-after-extended-outage/35981

Yikes!  Service down for 36 hours in June because a power failure affected both Intuit’s primary and backup systems?  Major whoops; it feels like they were not prepared to be an online services/software provider.  Overall this looked like the wrong way to communicate.

(from the CEO of Intuit) http://community.intuit.com/letter-from-ceo-brad-smith

(comments on this page are interesting) http://smallbusiness.intuit.com/blog/where-small-is-now/2010/06/intuit-service-update-4-were-back-online.html

WordPress – http://edition.cnn.com/2010/TECH/web/06/11/wordpress.outage/ and http://www.eweek.com/c/a/Web-Services-Web-20-and-SOA/Facebook-Outage-Triggered-By-Database-Software-Error-792897/

Service was down for most people for an hour in June, but for some it took 12 hours to become available again.  Four days after the event WordPress published this blog entry on what happened (high-level).  A little late (I would have liked to see it within 2 days at most), but well written and informative.  I’ll call this the “OK” way to communicate.  I especially liked the part about the Cloud being a learning process and “We’ll be using our newfound experience to keep WP.com a safe, stable, and robust place to hang your hat and have your blog call home.”

(from Matt – sorry, don’t know who “Matt” is) http://en.blog.wordpress.com/2010/06/14/downtime/

Facebook – http://www.mobilespace.in/facebook-back-up-after-worst-outage-in-four-years/

Facebook was down for 2.5 hours in September due to a database “error”.  The detail of what went wrong in a post on their blog was great, but ending with “the problematic system had been turned off and a permanent fix was being sought” doesn’t leave me with tons of warm fuzzies, as the process that failed was the one replacing invalid data with “updated information.”  That process sounds like a good thing to have, so I’d like to know more about how easy it would be to fix, or the risk of keeping the invalid data around for a while.

As for the timing, a double bonus for Facebook.  They announced the status as available again via a Tweet (always a good idea to have a 3rd party available for status updates) and they reported the problems on the previously mentioned blog within hours of resolution.

Too bad they didn’t publish a follow-up on the status of the permanent fix.  Oh well, cannot have everything I guess. =}

(Update – forgot the recent Microsoft BPOS outages!) Microsoft http://www.computerworld.com/s/article/9184440/Microsoft_apologizes_for_hosted_service_outages

Three outages covering late August and early September got an official apology one day after the last outage (I didn’t see other apologies on the same blog for the earlier problems).  This can be easily summarized as late communication: some real information, but for the most part nothing to help us understand the root cause, and a sterile apology that would work well between countries in conflict.  Thanks but no thanks; we as customers depending on an external party for our day-to-day services need something better than that.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a general view of what should be communicated during an outage, Joe Panettieri at mspmentor.net has some good ideas in his post “How to Over Communicate During a SaaS Outage.”

http://www.mspmentor.net/2010/05/19/how-to-over-communicate-during-a-saas-outage/
