We are a New Zealand company with a large global user base for our web-based project management software, ProWorkflow.com.
All our SaaS servers are located at the LayeredTech/Fastservers data center in Chicago. They have given us 6-7 years of solid performance, with plenty of staff available onsite for 24/7 support.
I get asked regularly why I don’t use New Zealand based hosting providers for our global SaaS company. The reason is simple.
I’m yet to find one in New Zealand that actually takes it seriously, has full onsite 24/7 support and a level of transparent communication at least similar to most US data centers.
I want to see:
- Multiple support staff onsite 24/7
- Redundancy that works
- Clear open communication with customers
What I’ve seen often (and lately) is:
- Staff go home at 5:30 and monitor remotely
- Redundancy that fails
- Lack of / confusing communication with customers
So what caused this blog post?
Most NZ techos would be aware that a few days ago a data center had a major infrastructure failure that resulted in about 24 hours of complete downtime.
I believe this affected many thousands of websites and a massive amount of email. A few of our unrelated websites and our email are hosted with a well-known New Zealand based hosting provider.
The outage came at a time when we were dealing with some important issues, and due to the lack of email our company had to pull all-nighters calling customers and monitoring our DB-based backup support system.
We run a 24/7 SaaS model with US customers awake while NZ sleeps, which is why we need to know our servers have onsite support. If they go down, companies literally stop working. We don’t host websites; our SaaS app is a core business system for many thousands of users globally.
To reiterate, the letter below is not from our server provider in the US; it’s from a hosting provider here in NZ that we just use for email and some small websites. During the outage, this company was very non-communicative with its customers. People were freaking out, didn’t know what was happening, and were trashing the provider on Twitter and other places.
New Zealand hosting providers I’ve talked to and worked with seem to forget a basic fact that a large number of people and companies actually work late, through the night or have global user bases. Even if they do know this, a comment below shows that they seem to put less priority on issues occurring ‘Outside of work hours’ in New Zealand.
Read the letter below we received AFTER the massive outage with little communication through the event. In comparison, our other server provider had some issues lately and updated their status page regularly so customers knew what was happening.
The part that really brassed me off was the part that said:
“During more sociably acceptable hours…” they contacted help to source hardware. So thousands of people, companies and websites suffer whilst they have their breakfast waiting for the ‘shop’ to open.
In addition, their phones and support numbers weren’t accessible, the website was down and they weren’t being transparent or communicative on Twitter. Customers were mostly in the dark. At the very least they should have a blog or status website hosted separately from their facility to keep customers updated.
In contrast, the data center company we deal with in the US stores replacement hardware onsite for almost everything needed in case of emergency. They also have a separately hosted status website for updating customers.
Where was the risk analysis?
When designing the data center, did anyone ask the question “What happens if our firewalls fail?”
Take this as a lesson. Kiwi hosting providers and data centers really need to lift their game if they want global software co’s to host here.
- Tip 1. Don’t go home at 5:30
- Tip 2. Keep customers informed
- Tip 3. Have a status/blog site independently hosted
- Tip 4. Use ‘people language’ not ‘Geekspeak’ (we’re not all techos)
- Tip 5. Keep replacement hardware/software onsite for key services
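To make Tips 2 to 4 a bit more concrete, here is a minimal sketch of the kind of check that could run on a machine outside the data center and produce a status update in plain words. Everything in it is an assumption for illustration: the hostnames and ports are placeholders, the function names are made up, and a real status page would need far more than a simple connection test.

```python
import socket
from datetime import datetime, timezone

# Hypothetical services to watch; hostnames and ports are placeholders.
SERVICES = {
    "Website": ("www.example.co.nz", 443),
    "Email": ("mail.example.co.nz", 25),
}


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def check_all(services, probe=is_reachable):
    """Probe every service and return {service name: True/False}."""
    return {name: probe(host, port) for name, (host, port) in services.items()}


def plain_language_status(results, checked_at):
    """Turn {service name: bool} into a short update in everyday words."""
    stamp = checked_at.strftime("%d %b %Y %H:%M UTC")
    down = sorted(name for name, ok in results.items() if not ok)
    if not down:
        return f"{stamp}: All services are working normally."
    return (
        f"{stamp}: We are having problems with {', '.join(down)}. "
        "Our team is working on it and we will post another update soon."
    )


if __name__ == "__main__":
    # Run this from a host that does not depend on the facility being checked.
    print(plain_language_status(check_all(SERVICES), datetime.now(timezone.utc)))
```

The point of the design is simply that the checker and the page it feeds live somewhere else, so an outage at the facility cannot take the status updates down with it.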
That’s the rant for the day – time to get back to dealing with the email backlog.
PS: I’m not angry, just a little frustrated and being totally honest here about how NZ providers need to lift their game.
Following Tuesday night’s reported outage (21.30-24.00) which was attributed to a core switch intermission failure, last night the same symptoms occurred (commencing 19.30). Clearly this highlighted that the corrective action of the previous night i.e. the replacement of both core switches deferred the issue rather than provided a permanent resolution.
Last night the fault was again identified by our network management software and the team reassembled consisting of the CTO, Sys-Admins and management. The issue was immediately escalated to our external maintenance support teams (CheckPoint firewall provider and hardware provider) as is standard practice for an outage of this significance. This identified that the fault appeared to be within the Checkpoint firewall clustering software (dual redundancy).
With the assistance of Checkpoint engineers the decision was made to split the firewall cluster and run them as individual stand alone units to resurrect the network. This appeared to temporarily solve the issue at 00.15. For context the firewall servers are running at 15-20% whilst not clustered i.e. with very low levels of utilisation for the spec of the equipment.
At 02.45 the network failed again. The team were still on-site monitoring the network. Our firewall maintenance providers were again called who arranged for patches to be downloaded. At 05.10 the patches were installed and the firewall management server reconfigured to accommodate the patch upgrade. This did not provide a permanent fix.
During more sociably acceptable hours we reached out to our friends in XXXX to help source checkpoint firewall hardware and to provide ‘men on the ground’ to help support our technical team that had worked through the night. In addition to this a decision was taken to move some core applications to the old network (ASA) that was still functioning as was not reliant on the check point firewalls. These include XXXXXXX.co.nz, Email (inbound and outbound) and XXXXXXX.co.nz. However the core network was re-established without the need to deploy this second network with the core applications migrated.
Low level analysis with the assistance of Checkpoint engineers in the USA identified high volumes of fragmented packets originating from one of our shared virtual hosting servers to be the root cause of the issue. These packets were flooding the firewalls and causing the outage. The source of these packets was identified and blocked at 13:50. The checkpoint firewalls then returned to normal service which finally brought the network back on line at approximately 14:00 hours.
Like all hosting companies, we do not exercise strict control over the content that customers upload to their websites. It appears that one customer site was compromised, which in turn caused the flood of malformed packets to the firewalls. Our internal network analysis software did not identify these packets as they were not ‘standard’ TCP/IP traffic.
In order to prevent this level of disruption in future we intend to move all shared virtual hosting customers behind a separate firewall that is isolated from the rest of our networks. This will ensure that should there be any re-occurrence the offending server is quarantined, and does not cause the kind of outage we have just experienced.
We do sincerely apologise for this outage. These problems are extraordinarily difficult to diagnose, and we are grateful for the assistance provided by CheckPoint engineers in the USA, and local XXXXX network engineers who have complemented the efforts of our own technical team.
Should you require a more technical update, please contact XXXXXX our CTO or please contact me on my email or directly on my cell (xxx xxxx xxxx).
Once again our apologies for this critical issue and thank you for your continued support.
About The Author:
Julian Stone is the CEO of ProActive Software, developers and creators of the leading web based project management software http://www.proworkflow.com