Lessons from ClearBooks failure

by admin on September 1, 2010

in Cloud Computing/SaaS

Earlier today I learned that ClearBooks had ‘unexpected downtime.’ Whenever you see that phrase you should substitute: ‘total cock up.’ To its credit, ClearBooks tried to communicate the issue on GetSatisfaction but in doing so exposed fundamental weaknesses of which buyers should be aware. From the GS trail:

We are currently experiencing a problem with our database server.

Downtime is currently unforeseeable and could be for several hours.

Unfortunately, this has caught us out with only one system admin on site. Two system admins were typically on the road for a meeting this morning and are now retracing their steps fast to assist and get this issue resolved.

We had two local power cuts last night in our datacentre which is located in Aldershot, Hampshire. On both occasions the backup generators kicked in to keep serving up the site. This may or may not be related to the problems we are having now.

Worst case scenario as it currently stands is that we will have to set up a new db server and restore backups from two days ago.

I have previously called out ClearBooks for overstepping the marketing hype machine. Now it seems hype meets reality. This is a disaster scenario. Andrew Taylor immediately responded:

As a sys admin I’m not used to worrying inappropriately so we’ll wait to see how this plays out, but, this part of your update concerns me:

“Worst case scenario as it currently stands is that we will have to set up a new db server and restore backups from two days ago.”

Surely you have more recent backups than 2 days ago; I’m surprised you don’t have a replicated system which you can fail-over to but I’m shocked that your most up-to-date backup is 48 hours old.

I use Clearbooks because I don’t have a book-keeper and need something which helps me get my admin and invoicing done quickly; I don’t have the time to re-do all that work again.

I can live with the site being unavailable, I can’t live with losing my work.

He is right to express such concerns. The original problem was notified via GS at around 11:00 am (according to my timeline). As of 40 minutes ago, the situation was:

The RAID rebuild is about 75% complete – we can’t be sure if this will work, so we are also still working on plan B to restore from our off site backups. Our preferred option is to restore from the RAID rebuild as it should result in more data being recovered. This looks like it will finish around midnight, although we can’t be certain. If this route fails, then we will use the backups from Sunday (taken between 00:00 a.m. and 06:00 a.m.).

I wouldn’t recommend anyone stays up to wait for this tonight – but hopefully it should be ready working for you in the morning.

Rephrased: we think we know the issue but in fact we don’t know whether our current fix proposal will work. Customers do not yet know whether data has been lost or the extent to which the database is damaged.

So let’s do some analysis here. SaaS should ‘just work.’ Failure rates are generally an order of magnitude lower than on-premise systems because SaaS providers have to build in fault tolerance and resilience from the get-go. BUT – if the SaaS provider doesn’t fully understand the implications of what they are building then any large scale failure wipes everyone out. That WILL happen from time to time. The question becomes: what is the provider doing to ensure minimal disruption, and what about potential data loss?

I recall FreshBooks had a catastrophic failure. It communicated the problems to users in clear terms, explaining that at worst, there would be a 32 minute data loss in the window when things went pear shaped. Customers lived with that and praised FreshBooks for both communicating and understanding the issues. Right now, no-one knows how much data has been lost from this ClearBooks issue. Hence the concerns of those who are commenting on a long ClearBooks GS thread. Right now, ClearBooks is looking a lot like a bunch of fake SaaS amateurs.

There are technical underpinnings to the ClearBooks story that need understanding. However, I think the more important point comes in the question of standards. In recent times I have been involved with discussions around the proposed Cloud Industry code of practice. I have advocated strongly for industry business standards while at the same time calling to account CIF for muffing its efforts, largely on the grounds it is a vendor-led initiative. They are trying to fix that and for that I give them a partial pass. As the ClearBooks story unfolds, it is interesting to note two tweets:

Roan Lavery – FreeAgent talking about the ClearBooks problem:

@DuaneJackson Problem is that reflects badly on the whole industry.

…and Duane Jackson, Kashflow responding:

@roanlavery agreed. Makes accreditation schemes more appealing. I was a skeptic until today. (cc @garyturner )

It is especially gratifying to see Duane see the value in establishing standards. Customers need to draw comfort and industry standards can provide that. BUT – they require end user input. CIF understands this and is the only body attempting to get something positive done. Regardless of their past failures, they are listening and trying to act. Bottom line: if you’re not part of the conversation and are not performing qualified due diligence then don’t bellyache when things go wrong.

For more on this topic consider attending the ICAEW Cloud Computing for Accountants meeting on 24th September. I shall be speaking on these topics as part of my selection criteria talk. This incident has now become part of my presentation because as my good friend Frank Scavo says: ‘Just cuz you’re SaaS doesn’t mean you get a pass.’

Comments have been disabled for this post.

My VAT return is due on Monday, and I have just wasted 6 hours on my computer virus scanning, debugging, rebooting and so on. My question is: why did Clear Books give no warning this was coming, or even put a notice on their home page saying 'this is downtime, we will be back up and running in x time'? I have wasted most of my day. Do I get a refund? And will Clear Books pay my fine when my VAT is not in on time?

I did not respond to this article at the time as our sole priority was returning to normal service and engaging with our customers.

Yes we messed up.

Having gone through the ordeal, I don't believe that there are many SaaS providers more wary of the pitfalls of an untested Data Recovery procedure than Clear Books. Believe me, it is a gut wrenching experience and one we never want repeated.

We assumed we had appropriate systems in place and they simply weren't good enough.

Despite everything, it was heartening to see that our users actually rallied around us. I believe their incredible support stemmed from the fact that we openly admitted the problem and outlined the action we were taking to prevent a similar event repeating itself.

Well that plan of action is now in place and here it is:

http://www.clearbooks.co.uk/blog/2010/11/17/what-w...
http://www.clearbooks.co.uk/blog/2010/11/17/protec...
http://www.clearbooks.co.uk/blog/2010/11/17/steps-...
http://www.clearbooks.co.uk/blog/2010/11/17/steps-...
http://www.clearbooks.co.uk/blog/2010/11/17/steps-...
http://www.clearbooks.co.uk/blog/2010/11/17/steps-...
http://www.clearbooks.co.uk/blog/2010/11/17/steps-...

It does require risk assessment because you'll tend to find the vendor answers often raise more questions: Example: "We're SAS70 Type II certified." Great - but what does the certification actually cover? Who did it? When? When's the next review? On data centre stuff: so your stuff is secured in (say) Rackspace place? OK - what does that SLA look like?

In this case they anticipated a database restore. That didn't happen. People lost 3 days' data. What happened? Again, understanding procedures requires something of a technical eye.

Finally - as someone who is recommending systems, are you prepared to take the risk of something going down and then the client knocking on your door for potential compensation? If so then how will you indemnify? How will you cross indemnify back to the vendor? What about rebates?

See what I mean?

I'm not holding my breath for an industry standard, Dennis.

In my experience such standards tend to be conceived more with endowing their issuing bodies with credibility and authority than actually serving their intended signatories or the customer. And I suspect any resulting standard will likely be compromised down and diluted to a lowest common denominator of fulfilment, to maximise its vendor adoption. Happy to be proven wrong.

Checking and validating the disaster recovery credentials of a SaaS provider is something that most people might not automatically think about for two reasons. First, it's not something customers were ever required to do with classic software, so it's not instinctive; and second, over the last ten years we have collectively credited the web with enormous levels of trust (and therefore, over the years, complacency) from broad and generally fault-free usage of services from personal email to Amazon. On the web things just tend to work, and a few moments of downtime once in a blue moon is a small price to pay for the amazing convenience of online shopping and communications we now all come to expect. But the consequences for business critical data are significantly more acute than missing out on a two-for-one offer at Amazon or harvesting your daily crops in Farmville.

At Xero, we have a high degree of confidence that our systems, processes and people respect the critical importance of protecting business data for our customers. Not just because the damage to reputation and trust incurred by a vendor that fumbles such a core competency is so painful - we're not motivated by the fear of failure - but that seeking to build a product that customers will love using requires the highest levels of care, planning and execution at every level from design to delivery, not just the pretty pictures at the surface level. You must care as deeply about your back end processes, the steps and measures you have in place to mitigate and respond to risk and failure as you do about the exact colours and pixel perfect precision of your front end. In other words, it's as much about culture as it is about well defined processes and architecture. Indeed, you can't have the latter without the former.

We've spoken about our approach to infrastructure on our blog before, and we'd be happy to lay out our philosophy, architecture and processes in more detail.

Update: Craig Walker, Xero's CTO, has just written about Xero's hosting infrastructure and disaster recovery systems on our blog at http://blog.xero.com/2010/09/xero-operations/

@gary: I suspect you may be right given that history often repeats in the IT world. But SaaS is such a radical departure that those who are attempting to place standards are at least recognizing the problem, however badly thought through the initial approach. It is one more reason for end user representation IMO.

Checking out credentials may not be second nature yet but it sure is one heck of a FUD talking point. I'd much rather vendors get out in the open on this one than leave it to fester.

As to your points re: Xero - I have no reason to dispute what you say but that would not prevent me from taking the due diligence steps I would normally pursue in any competitive bid situation.

Having now read through the thread http://gsfn.us/t/1csex I'm afraid that I'm astonished. Dennis is right about the need for SaaS providers to clarify their back-up procedures publicly and the comments from Kashflow about industry wide standards are key. This is about the credibility of the Cloud for accounting - mission critical stuff. At present we have no clients with data loss - that's down to luck and the fact that on a half decent weekend they were all outside drinking wine not posting invoices. It's not the same for everyone else!

Good article Dennis. Firstly must commend the Clearbooks guys on trying to be transparent through a situation that is clearly (no pun intended) going to have long-term business impact.

The issue of demonstrable instant failover is v interesting, especially at critical times around deadlines, i.e. the VAT returns due in. It's an issue that's going to become more prevalent as SaaS becomes used for more mission critical systems.

It reminds me of back when I was an auditor looking at IT systems. One test would be on a company's disaster recovery plan, but the audit test was failed if the DR hadn't been fully tested recently.

Just two things.

1. Why don't CB have a more recent backup? Need to know more before faith can be restored.

2. Accreditation may not be necessary if vendors explain their backup and restore process, the maximum amount of data loss they plan for, the maximum downtime they plan for, etc, which will provide transparency, which in turn allows experts in the field to compare vendors processes and inform the rest of us.
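
Point 2 could be made concrete with a small comparison. The sketch below is a hypothetical illustration in Python: the `Disclosure` record, the vendor names and every figure are invented for the example and do not describe any real vendor. It only shows how published maximum data loss and maximum downtime figures might be checked against what a customer can tolerate.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Disclosure:
    """What a vendor publishes about its backup and restore process (hypothetical)."""
    vendor: str
    max_data_loss: timedelta   # worst-case window of lost work the vendor plans for
    max_downtime: timedelta    # worst-case outage the vendor plans for


def acceptable(d: Disclosure, loss_limit: timedelta, downtime_limit: timedelta) -> bool:
    """True if both published figures fall within the customer's own limits."""
    return d.max_data_loss <= loss_limit and d.max_downtime <= downtime_limit


# Invented figures, for illustration only.
disclosures = [
    Disclosure("Vendor A", max_data_loss=timedelta(minutes=30), max_downtime=timedelta(hours=4)),
    Disclosure("Vendor B", max_data_loss=timedelta(days=2), max_downtime=timedelta(hours=12)),
]

# A customer who can live with a day offline but not with losing more than an hour of work.
shortlist = [
    d.vendor
    for d in disclosures
    if acceptable(d, loss_limit=timedelta(hours=1), downtime_limit=timedelta(days=1))
]
print(shortlist)
```

With figures like these on the table, the comparison becomes a matter of setting your own limits rather than taking a vendor's reassurance on trust.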

Given this failure, I'd like to see the likes of FreeAgent and Kashflow, etc, go on record and tell us about their own current disaster recovery procedures. It would certainly give us all peace of mind and reassurance that they are covering themselves and their customers.

That's an excellent point. Kashflow has provided me with information on this issue but it is not quite as straightforward as you might think. If you want to understand the pros/cons of different scenarios then consider engaging me for risk assessment.

'Clouds' deal in dataflows, but there are no standards for data flows - unlike in many process industries where joining the dots, understanding the corporate DNA, and seeing precisely how things flow is often a legal requirement.

If I was considering a move to the 'cloud' I would want to know the answers to three questions about my business:

1. Which IT assets or resources support a particular business process or service - allowing the question, “Which parts of the business will be directly affected should this IT System, or part thereof, fail?” to be answered

2. The value of those business processes to the company operation - allowing the question “What would be the financial impact should an IT system, or component thereof, fail?” to be answered

3. How data flows between the IT Systems that enable the business services to operate - which, critically, allows an assessment to be made of “Which parts of the business will be indirectly affected should this IT asset fail?”
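
As a rough illustration of the three questions, here is a minimal Python sketch. Everything in it is an assumption made for the example: the asset names, the process names, the flows between them and the daily values are invented, and a real business would need to document its own. It maps assets to the processes they support (question 1), attaches a value to each process (question 2), and follows the data flows to find what is hit indirectly (question 3).

```python
from collections import deque

# Question 1 (hypothetical map): each IT asset -> business processes it directly supports.
supports = {
    "db-server": ["invoicing", "vat-return"],
    "mail-server": ["customer-notifications"],
}

# Question 2 (invented figures): value of each process per day of operation.
daily_value = {
    "invoicing": 1000,
    "vat-return": 500,
    "cash-collection": 2000,
    "payroll": 1500,
    "customer-notifications": 100,
}

# Question 3 (hypothetical flows): process -> processes that consume its data.
flows = {
    "invoicing": ["cash-collection"],
    "cash-collection": ["payroll"],
}


def impact(asset):
    """Return (directly affected, indirectly affected) processes if `asset` fails."""
    direct = set(supports.get(asset, []))
    seen = set(direct)
    queue = deque(direct)
    # Breadth-first walk along the data flows, starting from the direct hits.
    while queue:
        process = queue.popleft()
        for downstream in flows.get(process, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return direct, seen - direct


direct, indirect = impact("db-server")
exposure = sum(daily_value[p] for p in direct | indirect)
print(sorted(direct), sorted(indirect), exposure)
```

The point of the sketch is only that once the flows are written down, the direct and indirect impact of a failure can be worked out mechanically instead of guessed at.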

Unless these issues are addressed, and standards introduced to show precisely how data flows through the assets of the business, such incidents will be repeated.

Some of us think the recent 'flash crash' occurred for the same reason - insufficient understanding of how data (i.e. money) flows through the business. Business clarity has been sacrificed for speed and profit. Of course the global financial system falling over is only slightly more worrying than the above.

It's later than a lot of people think. (apologies for the spam tweet Dennis, I'm a newbie there, need to learn the ropes)

You say: "Unless these issues are addressed, and standards introduced to show precisely how data flows through the assets of the business, such incidents will be repeated." That's not a proven fact or even a sensible supposition. Explain how you got from A to B on this one. I'm intrigued.

I come from an Oil and Gas background where understanding the flow through pipes is critical for many reasons … but here are my top 4: - 1) you don’t want to cut into a pipe without knowing what is flowing through it – the effects are not nice. 2) If you can understand the flows through components you can understand the (financial) risks of component failure and mitigate against loss. 3) When you understand the flows you can optimise the way you work. 4) Accurate risk assessments of adopting new ways of working can be made. It’s a tried and tested approach which works throughout manufacturing and standards have been adopted to ensure everyone sings from the same hymn sheet.

That’s the A. How do we get to B?

Let’s apply the same logic to other forms of flow which are critical to the business, such as the flow of data. If we understood how data flowed through Business and IT components we’d know the impact to the business should any component or system fail. And systems will fail regardless of SaaS or Cloud or in-house operations – that’s just what happens.

Things go wrong in manufacturing too. But at least everything has been documented to an international standard. Risk assessments have been made and plans formulated. Impact assessments have been made – people can die if they aren’t.

I’m sure you’ll agree from your work in risk assessment and mitigation that many business people (or IT people) don’t have a clue where to start assessing financial risks when migrating to new systems. My assertion is still that unless standards are introduced to show precisely how data flows through the assets of the business, proper risk assessment is too onerous for many to adopt and such incidents will be repeated.

Documenting flow would give a great starting point to assess failures in business terms.

An excellent and impartial story Dennis. Good to see that something positive (cf. FreeAgent & Kashflow comments) is coming from this.
