RFO 150320: Reason For Outage – 20/March/2015

Astutium would like to apologise for the unplanned outage to some services earlier this morning.

What went wrong?

At approximately 8.56am, a management card in a switch failed and the switch then stopped switching and passing traffic, which affected some customers:

  • on single-homed connections
  • on our shared hosting platform
  • on the legacy OVZ virtual private servers
  • on the Xen-based virtual dedicated server systems
  • on some dedicated servers
  • in co-location with a single upstream connection

Clients using BGP, clients with multiple connections to multiple switches, and clients for whom we provide transit and routing were not affected.
The problem was limited to equipment utilised by customers in rack rows C, D, and E in Global Switch 2.
All other customers in Global Switch 2 as well as in Telehouse and our overseas data centres were not affected.

Whilst the network remained fully operational and accessible worldwide during this time, the affected switch was not correctly passing traffic on to some racks, so the devices within those racks were uncontactable.

By 9.18am engineers were on site and had diagnosed that the problem was with the switch, at which point they decided to replace the management card. The existing card was removed and the new card inserted.

Software was loaded onto the switch by 9.25am and operations started to resume, although it took between 7 and 12 minutes for different servers, customers' ARP requests, and customer switches and hardware to pick up the new system and start taking traffic again.

By 9.41am all operations should have reverted to normal.
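
For anyone reconciling their own monitoring logs against this timeline, the elapsed time at each stage can be worked out with a short Python sketch. The times are those quoted above (the start time is approximate); the event labels and the function name are illustrative only and are not taken from any Astutium system.

```python
from datetime import datetime

# Times as quoted in this report for 20 March 2015; labels are illustrative.
EVENTS = [
    ("management card failed (approximate)", "08:56"),
    ("engineers on site, fault diagnosed", "09:18"),
    ("software loaded onto the switch", "09:25"),
    ("all operations expected back to normal", "09:41"),
]


def minutes_since_start(events):
    """Return (label, whole minutes elapsed since the first event) pairs."""
    parsed = [(label, datetime.strptime(clock, "%H:%M")) for label, clock in events]
    start = parsed[0][1]
    return [
        (label, int((when - start).total_seconds() // 60))
        for label, when in parsed
    ]


for label, elapsed in minutes_since_start(EVENTS):
    print(f"+{elapsed:2d} min  {label}")
```

On these figures the longest any affected service should have been unreachable is about 45 minutes, with most recovering somewhat earlier.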

What can we do about this in the future?

Unfortunately unplanned events do happen and hardware does fail.

Whilst the Astutium policy is to replace all equipment before the end of its projected working life, as well as to actively monitor all servers and services throughout the network, it is unfortunate that some things just blow up when not expected to and can take some time to fix.

There are some things that can be done at the customer end in order to minimise the impact of a single item failing, such as a BGP connection from multiple routers, or multiple cables to diverse switches to duplicate connections, but these are not always practical or cost effective.

If you would like to talk about the options for a Higher Availability service to take you…
• from 99% to 99.5% (or to 99.95%)
• from 99.5% to 99.9% (or to 99.95%)
• from 99.95% to 99.99%
• or even to a full 99.999%
please contact the Sales Team to discuss the options.
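
To put the availability levels above into context, the following Python sketch converts each percentage into the maximum downtime it allows. It assumes availability is measured over a 365-day year, which is an assumption made for illustration and not a statement of how any Astutium service level is actually measured; the function and constant names are likewise illustrative.

```python
# Assumption: availability is measured over a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def max_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime, in minutes, permitted in the period at this availability."""
    return period_minutes * (100 - availability_percent) / 100


# The availability levels listed above.
for level in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    minutes = max_downtime_minutes(level)
    print(f"{level}% -> {minutes:.1f} minutes per year ({minutes / 60:.1f} hours)")
```

Under that assumption, 99% allows roughly 87.6 hours of downtime a year, 99.99% roughly 52.6 minutes, and 99.999% roughly 5.3 minutes.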