COMPLETED: Scheduled Maintenance: Astutium Ltd US network

Scheduled Maintenance: Astutium Ltd US network.

What: Software upgrades on the network routers and switches.

Affects: US Website Hosting, US Dedicated Servers, US Virtual/Cloud Servers and US (All) and UK (Multicraft access only) Minecraft Gaming Services.

When: Thursday, August 22, 2013 from 7:00 a.m. to 11:00 a.m. GMT

 

Planned maintenance on the Astutium Ltd US network will occur on 22/August/2013, which will have some impact on all US Website Hosting, US Dedicated Servers, US Virtual/Cloud Servers, and US (All) and UK (Multicraft access only) Minecraft Gaming Services.

During this maintenance window, we will be performing necessary software upgrades on the network routers and switches.

 

Network downtime of approximately 15 minutes is anticipated after the software uploads during this four-hour maintenance window, plus a 5-10 minute outage of connectivity to each server in turn as it is moved to the new switching fabric.

 

No servers are being powered off; this is network cabling and software maintenance only.

 

No data loss is expected; our US Network Engineering Team will be taking all necessary precautions to mitigate any issues during the maintenance window.

 

Thank you for your understanding and patience as we continue to improve systems in order to provide you with the best possible services and highest possible uptime.

COMPLETED: Emergency Maintenance: cPanel5

Emergency Maintenance: Hosting cPanel5 (DNS)

What: DNS propagation/distribution issues for domains on cPanel5

Affects: Resellers/customers with websites and email services on cPanel5

When: 2013/07/09 10:00

Update 1 (2013/07/09 12:02): The issue has now been resolved. If you are still having problems with the service please raise a support ticket.

COMPLETED: Emergency Maintenance: Plesk1 (Duck)

Emergency Maintenance: Hosting Plesk 1

What: Unexpected Disk Failure of Hosting Server Plesk-1

Affects: All resellers/customers with websites and email services on Plesk 1

When: 2013/06/20 10:10

Engineers are onsite investigating the extent of the problems.

It is currently expected that a new server will need to be deployed (ETA 12 hours) and hosting/email accounts recreated.

Update 1 (20/06/2013 11:40): Engineers are now confident they will be able to restore the data to replacement disks in the same server; as yet there is no firm ETA for the restore, so we are standing by the 12-hour estimate.

Update 2 (20/06/2013 14:12): Data from the latest stable backup point (19/06/2013 00:00) has been restored. If you are still having problems with the service please raise a support ticket.

COMPLETED: Scheduled Maintenance: hSphere Mail Cluster

Disk Array Maintenance: hSphere Mail Cluster

We have identified a failing member of the disk array behind the hSphere Mail Cluster.  Unfortunately, due to the nature of the shared storage, the mail service must be interrupted to replace this member and rebuild the data to prevent any loss of stored e-mail.  Affected customers will be unable to send or receive e-mail during the window; any mail sent to your addresses will be queued on the sending server and retried according to the sending servers’ policies.

When: 07/June/2013 20:00 to 09/June/2013 20:00

Affects: Mail Services for hSphere users – Business/Personal-[1234]-hSphere plans.

What: Replacement of failing disk member and array rebuild.

Outage: Potentially the entire window, all mail services (SMTP, POP3 and IMAP4) for hSphere customers.

Update 1 – 08/06/2013 13:00 (Final):  This maintenance has been completed ahead of schedule; all affected services have been restored.  If you are still experiencing difficulties, please contact support via the usual channels.

2013-May RIPE Atlas Probe Report

The latest report from our RIPE Atlas Probe has just arrived …
This is your monthly availability report for probe 4521.

Calculation interval    : 2013-05-01 00:00:00 - 2013-06-01 00:00:00
Total Connected Time    :  31d 00:00
Total Disconnected Time :   0d 00:00
Total Availability      :    100.00%

Kind regards,
RIPE Atlas Team

Another month of 100% network availability over IPv4 and IPv6 🙂
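
For reference, the percentage in the report is simply the connected time as a share of the whole calculation interval. A minimal sketch of that arithmetic (durations in seconds; the function name is ours, purely for illustration):

    # availability = connected / (connected + disconnected), as in the probe report
    def availability(connected_secs, disconnected_secs):
        return 100.0 * connected_secs / (connected_secs + disconnected_secs)

    print(availability(31 * 24 * 3600, 0))   # May 2013: 31 days connected, 0 disconnected -> 100.0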

COMPLETED: Scheduled Maintenance – Migration of Virtual Private Servers (VPS-OVZ) to new Hardware

Migration to new OVZ/VPS Hardware

As part of our ongoing processes of QA, Monitoring and Upgrading of equipment, it is time to “retire” some of the Virtual Private Server (VPS/OVZ) systems (including several from Mergers-&-Acquisitions) and move those client VPS onto new machines.

Additionally, ALL OpenVZ Virtual Server clients will be able to access the improved and extended management tools in the Client Portal, with realtime CPU/Disk/Transfer usage details.

This will all be done automatically for affected machines.

In a few rare instances we will need to migrate selected virtual machines to a new IP range (where they are still on a legacy IP from an acquired provider), so you _may_ need to update config files, DNS zones etc. with a new IP address. If this is the case you will be notified individually, and it will all be done for you if you have our server management services.

When: 24/May/2013 18:00 to 28/May/2013 18:00

Affects: All OpenVZ Virtual Private Servers

What: Transfer of VPS to new Hardware

Outage: Approx 5 minutes at some point during the window per 10GB of VPS storage used

Maintenance Plan (a rough command-level sketch follows the list below):

In VPS-ID order…
* cleanly shutdown vps through internal ovz management tools
* copy of containers to new hardware
* update of IP address (where applicable)
* initialise vps on new hardware [ tests disk / container / os starts ok ]
* check pings in and out [ tests internet connectivity working ]
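
As a purely illustrative sketch of those steps, assuming standard OpenVZ vzctl tooling and an rsync of the container’s private area (the destination node name, paths and function are ours, not our actual migration scripts):

    import subprocess

    NEW_NODE = "ovz-node-new.example.net"   # hypothetical destination node
    PRIVATE_DIR = "/vz/private"             # default OpenVZ private area

    def migrate(ctid, current_ip, new_ip=None):
        # 1. cleanly shutdown vps through internal ovz management tools
        subprocess.run(["vzctl", "stop", str(ctid)], check=True)

        # 2. copy of container to new hardware
        #    (the container config is assumed to be in place on the new node already)
        subprocess.run(["rsync", "-a", f"{PRIVATE_DIR}/{ctid}/",
                        f"root@{NEW_NODE}:{PRIVATE_DIR}/{ctid}/"], check=True)

        # 3. update of IP address (where applicable)
        if new_ip:
            subprocess.run(["ssh", f"root@{NEW_NODE}", "vzctl", "set", str(ctid),
                            "--ipdel", current_ip, "--ipadd", new_ip, "--save"],
                           check=True)

        # 4. initialise vps on new hardware [ tests disk / container / os starts ok ]
        subprocess.run(["ssh", f"root@{NEW_NODE}", "vzctl", "start", str(ctid)],
                       check=True)

        # 5. ping the container (a simplified stand-in for the in/out connectivity checks)
        subprocess.run(["ping", "-c", "3", new_ip or current_ip], check=True)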

For the majority of the Astutium VPS/OVZ users, there will simply be a 10 minute (or less) period of downtime while your data is copied to the new hardware platform.

For those moving to a new IP range, you will be contacted individually with a modified plan specific to your VPS.

*Update 1*
All UK OVZ Clients migrated successfully to new hardware (and new IPs in some cases)

COMPLETED: Scheduled Maintenance: Hosting Plesk 9

Scheduled Maintenance: Hosting Plesk 9

What: Replacement of Hosting Server Plesk-9

Affects: All resellers/customers with websites and email services on Plesk 9

When: 2013/05/19 23:00 – 2013/05/20 11:00

The hosting server Plesk 9 is due for replacement following an intermittent fault with the drive array controller.

A new server has been set up.
Access to the existing server will be stopped at 11pm.
Data will be migrated to the replacement server (takes approx 12 hours).

During the maintenance window you will not be able to send/receive emails, access/update websites or manage/query mysql databases.

We apologise in advance for the disruption, but it is necessary to replace hardware when it has failed or appears about to fail, in order to ensure ongoing services.

*Update-1*
A new server has been set up and technicians are working on the restore and import of the data to the new systems.
Estimated time for availability of services: ~4 hours (15:45 UK Time)

*Update-2*
All ecommerce sites have now been restored; working through the business hosting sites now.
Estimated time for availability of services: ~2 hours (18:10 UK Time)

*Update-3-FINAL*
All sites have now been restored.  You may have DNS issues whilst your connectivity/broadband provider’s caches update.  We apologise for the extended outage due to the difficulty in retrieving data from the failing components.

Reason For Outage (RFO) April 29th 2013

Astutium Ltd try to be as informative and open as possible regarding any service-affecting problems and scheduled maintenance by posting on the Astutium Status Blog for general [not client-specific] issues.

Breakdown of the problems experienced on 29/04/2013

At approximately 10am we received several reports of difficulty accessing our own website and some customer servers.

Neither of the two external monitoring systems was showing any issues, nor was anything actually down according to our constant internal checks.

However, clients had reported an issue, and despite being able to access both our sites and theirs from our US servers, London Docklands systems and over BT FttC connections, technicians investigated, collated traceroutes and worked on identifying the problem.

It was determined that there were reproducible issues, external to our network, involving some “odd” routing paths. Traceroutes in and out from the network to a variety of locations and out through each border router were looked into, and some went through far more networks and longer routes than would be expected from the start point.
(for those interested, traceroute.org is an excellent resource)

As part of the UK internet backbone, we control our *outbound* connectivity; no ISP controls the route *inbound* to its network, as that is someone else’s outbound control. Outbound was working, but not completely as expected.

At the same time, we were receiving a *lot* of route updates from one upstream tier-1 – far more than usual.

When a route changes on the internet (at any given time roughly 20% of all IP addresses are uncontactable), the old route has to be removed and a new “best path” selected from the routes remaining available. This is done constantly by the BGP routers of every AS (Autonomous System) that makes up the internet.

BGP recalculations take around 2 minutes, and during that time any Point A may be unable to reach any Point B if the traffic would traverse the affected route. This is quite normal and expected, and TCP/IP protocols are very resilient to these regular “pauses”, which happen thousands of times per day for everyone.

On any given day we log around 2Gb of routes and route changes. Between 10am and 11am we logged over 36Gb – roughly 18 times an average full day’s volume in a single hour.
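
A surge like that is trivially visible from the route-update logs alone. As a minimal, purely illustrative sketch of how such a spike can be flagged (the threshold and units are assumptions, not our actual monitoring):

    from collections import deque

    ALERT_FACTOR = 5            # flag when an hour sees 5x the recent hourly average
    history = deque(maxlen=24)  # route-update log volume per hour (trailing day)

    def record_hour(volume):
        # Returns True when the latest hour's route churn is abnormally high compared
        # with the trailing average; 29/04/2013 hit ~18x a whole normal day in one hour,
        # so it clears any sane threshold by a wide margin.
        abnormal = bool(history) and volume > ALERT_FACTOR * (sum(history) / len(history))
        history.append(volume)
        return abnormal

    # example: three quiet hours followed by the surge (arbitrary units)
    for v in [80, 90, 85, 36000]:
        print(record_hour(v))   # -> False, False, False, True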

The upstream producing the bulk of these add-and-withdraw route messages was disconnected during an emergency router maintenance (lasting 10 minutes at 11am), and the network stabilised again over the remaining tier-1, tier-2 and multiple peering connections. The upstream NOC have been informed and are investigating.

Later in the day, at around 2.30pm, we saw through our monitoring and client reports that an identical issue was starting again.

Investigation into the problem showed that the symptoms were the same (constant recalculation of routes); however, this time it appeared to be generated internally rather than from an upstream.

Network technicians attended our main London datacentre, as remote diagnosis was impossible, and we saw that our VRRP gateway routers were passing the responsibility of routing back and forth in a “see-saw” pattern – gateway3 failing to gateway4, gateway5 to gateway6, then gateway6 to gateway5, gateway4 to gateway3, etc.

In-depth analysis took place, and the problem was identified: the switches interconnecting the routers were repeatedly stopping and restarting. While a switch was down, each pair of gateway routers could not see all the others, so one became the primary for each vlan; when the switch came back up, all the vlans switched back over.

This was happening when certain amounts of traffic with unusually sized packets were received, so either the switches are faulty or, more likely, the switch software/firmware has a fault under certain conditions. A temporary fix was installed to the existing setup, and the status of the network and routing was monitored closely.
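
The “see-saw” itself is easy to quantify by counting VRRP master/backup transitions per gateway over a short window. A rough sketch of that check (the log-line format below is an assumption, not our routers’ actual output):

    import re
    from collections import Counter

    # assumed log format, e.g. "14:31:02 gateway3 VRRP vlan120 transition MASTER -> BACKUP"
    TRANSITION = re.compile(r"^\S+\s+(?P<gw>gateway\d+)\s+VRRP\s+\S+\s+transition")

    def flap_counts(log_lines):
        # Count state transitions per gateway: a healthy pair shows a handful per day,
        # while the see-saw pattern shows dozens per hour.
        counts = Counter()
        for line in log_lines:
            m = TRANSITION.match(line)
            if m:
                counts[m.group("gw")] += 1
        return counts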

Alternative switches were added to the routing infrastructure, installed and configured. Migration of routers to the new switch fabric started at approx 20.00 – all the cabling between each router was moved/replaced and connected to the new switches.

Services were brought up on the gateway routers in the following order:

  • colocated servers
  • dedicated servers
  • cloud servers
  • shared hosting
  • management and monitoring
  • internal servers
  • development / non production servers

By 21.45 all services were restored to standard service levels, and have remained up since.

The similarity of the two issues (the same effect for the end customer – a temporary inability to connect to email, websites, ftp, ssh, rdp, giving an up/down appearance as routes changed, were redistributed, then changed and were redistributed again) implies that they are somehow inter-related.

Whether that is a new type of attack (probably not, from the profile), a specific switch/software bug (most likely), or some “global” change or new protocol over tcp/ip not previously handled remains to be seen (we saw very strange results some years ago when P2P traffic became very prevalent, which impacted router performance worldwide for a while).

We will now be setting up a replica system in our lab, and working with the switch manufacturers on a longer-term solution than simply adding more switches and changing cabling.

Additionally, if we see stability from our upstreams, the external router setup will revert back to its previous settings; if that is not immediately forthcoming, alternative upstreams will be connected.

Clients with dual-bonded EtherChannel connections to multiple redundant switches would not have been affected until the switch replacements in the evening.

BGP downstream clients would have seen an increase in the morning of route-changes, but remained 100% available throughout the day.

All clients where we handle the routing and/or provide the final gateway before accessing the outside world (our own systems included) will have seen approximately 11 outages of between 2 and 10 minutes during the day, and around 2 hours of downtime during the hardware replacements.

COMPLETED: Emergency Maintenance: SW-Routed1

Emergency Maintenance: Switch Routed 1

What: Intermittent Failure of Switch Routed 1

Affects: All routed services

When: 2013/04/29 – Various

Engineers are onsite preparing a replacement.

A further brief outage will be required to replace/swap the failed component.

Update 1: 29/04/2013 22:36 – The affected network component has now been fully replaced.  We don’t expect any further problems related to this issue.

COMPLETED: Emergency Maintenance: Hosting Plesk 8

Emergency Maintenance: Hosting Plesk 8

What: Failure of Hosting Server Plesk-8

Affects: All resellers/customers with websites and email services on Plesk 8

When: 2013/04/24 04:32

Engineers are onsite investigating the extent of the problems.

It is currently expected that a new server will need to be deployed (ETA 12 hours) and hosting/email accounts recreated.

Update 1 2013/April/24 08:10:  Copy of data to new server is underway, estimated time to complete: 630 minutes

Update 2 2013/April/24 19:01:  Service has been restored – if you are still experiencing issues please raise a support ticket.