After what must have been a horrific week for ClearBooks, the company has come clean about what went wrong. In a nutshell, they didn’t know what they were doing. If that sounds harsh then look at what they say:
Then we suffered a major hardware failure on our single database server. Having a single database server was a mistake although such a severe hardware failure felt like bad luck (but then you have to expect the worst). Customers working the bank holiday weekend lost their weekend’s worth of data as we had to resort to restoring offsite backups. We immediately reacted by introducing real time replication of our database server (master/slave) to ensure data would be replicated and safe in the future.
Data losses are inexcusable for SaaS/cloud players. They happen in the on-premise world and will no doubt happen in the cloud world, but there is a fundamental difference. If my on-premise system goes pear-shaped and I lose data, then I have a problem. If a cloud player loses data, every one of its customers has a problem. ClearBooks’ problem at that time was twofold:
- They were unclear whether the solution they were attempting to apply would work
- Communications were poor
At the time, I compared what happened with a similar problem at FreshBooks. The primary concern for the outside world is always appropriate communication, but in the FreshBooks case only 32 minutes of data was lost. ClearBooks fell into a trap that has caught many businesses. It didn’t know how to communicate, so when it attempted to be transparent it ended up making matters worse. That’s meat and drink for naysayers and for people like myself who have advised on this topic for many years.
On the technical side, the admission that there was no redundancy is astonishing. This is IT 101. Any system, whether on-premise or cloud, must have redundancy. Even in my wee business I have multiple redundant systems. Years of losing data, failing to have proper backups and trusting to luck taught me the hard way that in the modern world, redundancy isn’t an option, it’s the cost of doing business. That’s even more important in the cloud, where you are trusting your data to a third party. Unfortunately, in ClearBooks’ case, it gets worse. From the same post:
There was single point of failure with our NFS server. For the techies amongst you full details are provided here by Senior System Architect at CatN, Mark Sutton. CatN has also apologised to its customers and outlined the changes they are making here in a post by CatN’s commercial director, Joe Gardiner.
As I observed last weekend, everything at CatN was out for a period of time. The current explanation offered by CatN provides plenty of reasons to be concerned, not least:
During the process of unmounting/remounting the storage, we found that the filesystem was overdue a consistency check.
This is because we have been concentrating on building our new vCluster platform with 2N redundancy rather than upgrading the existing system. Only having a single NFS server meant that we were unable to take the storage down to perform a file system check.
Again, this is a cloud infrastructure 101 failure brought about by poor management and inadequate processes. The company says that it is using agile methods to achieve its goals but adds:
For all the benefits of agile working it does increase short term risks of failure. We have put our product in “Beta” for this reason while we work on large platform development projects and ideas. When it is mature to a commercial grade we will celebrate and remove this label but until then we also look to communicate openly and swiftly what is going on if we encounter problems.
There are several problems with this statement.
- A commercial grade application service cannot reasonably expect to be 100% reliable (or even close to it) when the hosting provider is straying from tried and trusted methods of data center build. There are no short cuts.
- Agile does not necessarily translate to delivery – as is obviously the case.
- Providing a commercial service as beta is perfectly acceptable provided that customers fully understand the risks they are taking. Accounting data is one of those types of data that few sane people will want to put at risk. You therefore have to ask yourself whether the totality of what is being said amounts to a set of risks you are willing to take.
In this case, the answer has to be a qualified ‘no.’ All systems fail. It’s the degree to which failures cause disruption that raises concerns. In this case it doesn’t matter that the outages have been relatively infrequent or, for that matter, that outages were persistent during a particular time frame. It is the underlying problems to which the companies are now admitting that crush confidence. This is made all the more troublesome by virtue of the invested relationship that exists between ClearBooks and Fubra.
For its part, ClearBooks says it is a start-up and that mistakes will be made but that it is learning. Even the very best software companies make mistakes, and some of those can look equally foolish. The difference is that those same companies do not try to pretend they are something they are not. It is only after three years that ClearBooks acknowledges it is in beta, with all the risks that implies. That’s borderline misrepresentation for a service that is taking your money.
The other difference is that ClearBooks has not learned from its past communications failures. This is always difficult because once burnt, companies often retreat into their shell. That’s the wrong approach as it only encourages speculation. I suspect that in part ClearBooks has had a lot of trouble understanding its technical problems and so does not know what to say. Its dependency upon another service which also has deep issues does not help. I am betting they had internal communications issues that clouded ClearBooks’ ability to say much that was useful. This was all compounded by the way in which it chose to communicate. As far as I can tell, customers were not receiving email updates. Instead, we saw the odd Tweet and blog post. Much as I am a social media fan for communications, email is still your best friend in these situations.
I also suspect that ClearBooks has not thought through contingency planning for when this type of failure arises. For what it is worth, access problems are not uncommon. ClearBooks had an alternative way for customers to gain access, but it was not communicated at the right time. That’s another reason why customers were frustrated.
The good news is that we all now know that ClearBooks is a beta solution.
Too often, customers accept without question what they are told about IT. This is never a good idea even though it is often difficult for customers to understand what they are looking at. The industry has done a poor job in outlining the risks and benefits of cloud solutions. It is only when we see situations like those that have arisen at ClearBooks that customers become aware just how much they are gambling on beta services.
As I have said before, I’m not convinced HMRC would be as forgiving as some ClearBooks customers if it came knocking. It is an unfortunate fact of life that it is the user who is responsible for their books in the eyes of HMRC. We have yet to discover whether HMRC is willing to accept hosting failure as a reasonable excuse. When understood in those terms, it is essential that buyers of these services understand what they are letting themselves in for.