COMPLETED: Scheduled Maintenance: US Minecraft Servers

Scheduled Maintenance: US Minecraft servers

What: Migrating US Minecraft servers to new hardware.

Affects: US Minecraft servers

When: Monday, 26th August 2013, 12:00 – 18:00

 

As part of our review of the Astutium (inc Othello and EliteHosting) Minecraft Hosting Services following the launch of 1.6.x, we are going to be migrating our US Minecraft servers to new hardware.

This affects our US servers only; EU servers have already been upgraded.

This will allow “in-place” upgrades to larger and more powerful plans to be applied instantly through our client portal, and will give you the fastest response and highest uptime possible for your Minecraft server(s).

Due to the size of some worlds and the number of plugins and players, we expect the actual time to copy your Minecraft server to a new machine to be around 30 minutes; this will be undertaken on Monday 26th August.

We already have a backup copy of the server content on our Cloud platform in London, so any migration can be “undone” as quickly as possible if the copy fails for any reason (it is not expected to, but we like to be prepared).

On Monday afternoon we will …

  • shut down your Minecraft Server Virtual Machine
  • manually copy over all your files and settings to the new servers
  • confirm the copy has completed correctly
  • assign your new IP and update Multicraft
  • email you the access details for your Minecraft Server
  • restart the MC server on the new hardware.

Your IPv4 address(es) will be changing, due to the way they’re allocated by the Multicraft Control Panel.

If you are using one of our sub-domains to access your MC-Server, this will be updated automatically.

If you are accessing by IP and port, you will need to put the new details into your Minecraft launcher.
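
If you want to confirm that the new details are live before reconnecting, a quick DNS and port check is enough. The sketch below is purely illustrative and uses placeholder values (yourserver.example.com and the default Minecraft port), not your actual connection details:

    # Purely illustrative: check which IP a Minecraft sub-domain currently
    # resolves to, and that the game port accepts connections.
    # The hostname and port are placeholders, not your real details.
    import socket

    host = "yourserver.example.com"   # placeholder sub-domain
    port = 25565                      # default Minecraft port

    try:
        ip = socket.gethostbyname(host)
        print(f"{host} currently resolves to {ip}")
        with socket.create_connection((ip, port), timeout=5):
            print(f"Port {port} on {ip} is accepting connections")
    except OSError as exc:
        print(f"Lookup or connection failed: {exc}")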

These steps are being taken to improve the performance of your service, to allow for the changes needed by 1.6.2 (now available in your control panel) and for the extra RAM and processing requirements of the 1.6.x branch.

Additionally, we are now officially retiring all variations of the 1.4.x branch of Minecraft (1.6.x is the current version), so if you are currently using the old 1.4.7+ versions, we advise you to upgrade immediately to a newer version through your control panel.

Finally, if you need a new Minecraft Server, or simply want a much bigger MC Server, we currently have high-performance fully-dedicated MC servers available, and a 16% discount on Server Upgrades to your current package – contact us for more details.

COMPLETED: Scheduled Maintenance: Astutium Ltd US network

Scheduled Maintenance: Astutium Ltd US network.

What: Software upgrades on the network routers and switches.

Affects: US Website Hosting, US Dedicated Servers, US Virtual/Cloud Servers and US (All) and UK (Multicraft access only) Minecraft Gaming Services.

When: Thursday, August 22, 2013 from 7:00 a.m. to 11:00 a.m. GMT

 

Planned maintenance on the Astutium Ltd US network will occur on 22/August/2013, which will have some impact on all US Website Hosting, US Dedicated Servers, US Virtual/Cloud Servers, US (All) and UK (Multicraft access only) Minecraft Gaming Services.

During this maintenance window, we will be performing necessary software upgrades on the network routers and switches.

 

A network downtime of around 15 minutes is anticipated after the software uploads during this four-hour maintenance window, plus a 5-10 minute outage of connectivity to each server in turn as it is moved to the new switching fabric.

 

No servers are being powered off; this is network cabling and software maintenance only.

 

No data loss is expected; our US Network Engineering Team will be taking all necessary precautions to mitigate any issues during the maintenance window.

 

Thank you for your understanding and patience as we continue to improve systems in order to provide you with the best possible services and highest possible uptime.

COMPLETED: Emergency Maintenance: cPanel5

Emergency Maintenance: Hosting cPanel5 (DNS)

What: DNS propagation/distribution issues for domains on cPanel5

Affects: Resellers/customers with websites and email services on cPanel5

When: 2013/07/09 10:00

Update 1 (2013/07/09 12:02): The issue has now been resolved. If you are still having problems with the service please raise a support ticket.

COMPLETED: Emergency Maintenance: Plesk1 (Duck)

Emergency Maintenance: Hosting Plesk 1

What: Unexpected Disk Failure of Hosting Server Plesk-1

Affects: All resellers/customers with websites and email services on Plesk 1

When: 2013/06/20 10:10

Engineers are onsite investigating the extent of the problems.

It is currently expected that a new server will need to be deployed (ETA 12 hours) and hosting/email accounts recreated.

Update 1 (20/06/2013 11:40): Engineers are now confident they will be able to restore the data to replacement disks in the same server; as yet there is no firm ETA for the restore, so we are standing by the 12-hour estimate.

Update 2 (20/06/2013 14:12): Data from the latest stable backup point (19/06/2013 00:00) has been restored. If you are still having problems with the service please raise a support ticket.

COMPLETED: Scheduled Maintenance: hSphere Mail Cluster

Disk Array Maintenance: hSphere Mail Cluster

We have identified a failing member of the disk array behind the hSphere Mail Cluster. Unfortunately, due to the nature of the shared storage, mail service must be interrupted to replace this member and rebuild the data to prevent any loss of stored e-mail. Affected customers will be unable to send or receive e-mail during the window; any mail sent to your addresses will be queued on the sending server and retried according to the sending servers’ policies.

When: 07/June/2013 20:00 to 09/June/2013 20:00

Affects: Mail Services for hSphere users – Business/Personal-[1234]-hSphere plans.

What: Replacement of failing disk member and array rebuild.

Outage: Potentially the entire window, all mail services (SMTP, POP3 and IMAP4) for hSphere customers.

Update 1 – 08/06/2013 13:00 (Final):  This maintenance has been completed ahead of schedule; all affected services have been restored.  If you are still experiencing difficulties, please contact support via the usual channels.

2013-May RIPE Atlas Probe Report

The latest report from our RIPE Atlas Probe has just arrived …
This is your monthly availability report for probe 4521.

Calculation interval    : 2013-05-01 00:00:00 - 2013-06-01 00:00:00
Total Connected Time    :  31d 00:00
Total Disconnected Time :   0d 00:00
Total Availability      :    100.00%

Kind regards,
RIPE Atlas Team

Another month of 100% network availability over IPv4 and IPv6 🙂

COMPLETED: Scheduled Maintenance – Migration of Virtual Private Servers (VPS-OVZ) to new Hardware

Migration to new OVZ/VPS Hardware

As part of our ongoing processes of QA, Monitoring and Upgrading of equipment, it is time to “retire” some of the Virtual Private Server (VPS/OVZ) systems (including several from Mergers-&-Acquisitions) and move those client VPS onto new machines.

Additionally, ALL OpenVZ Virtual Server clients will be able to access the improved and extended management tools in the Client Portal, with real-time CPU/Disk/Transfer usage details.

This will all be done automatically for affected machines.

In a few rare instances we will need to migrate selected virtual machines to a new IP range (where still on a legacy IP from an acquired provider), so you _may_ need to update config files, DNS zones etc. with a new IP address – if this is the case you will be notified individually, and it will all be done for you if you have our server management services.

When: 24/May/2013 18:00 to 28/May/2013 18:00

Affects: All OpenVZ Virtual Private Servers

What: Transfer of VPS to new Hardware

Outage: Approx 5 minutes at some point during the window per 10GB of VPS storage used

Maintenance Plan:

In VPS-ID order…
* cleanly shut down the VPS through the internal OVZ management tools
* copy the container to the new hardware
* update the IP address (where applicable)
* initialise the VPS on the new hardware [ tests disk / container / OS start OK ]
* check pings in and out [ tests internet connectivity – see the illustrative sketch after this list ]
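
As a purely illustrative sketch of that final connectivity check (the addresses below are placeholders rather than real VPS IPs), a minimal ping test might look like this:

    # Purely illustrative: confirm a (placeholder) VPS address answers ICMP
    # echo using the standard Linux "ping" tool; running the same check from
    # inside the VPS towards an external host tests the outbound path.
    import subprocess

    def answers_ping(ip: str, count: int = 3) -> bool:
        """Return True if the host replies to ping (2 second wait per probe)."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", "2", ip],
            capture_output=True,
        )
        return result.returncode == 0

    for ip in ("203.0.113.10", "198.51.100.25"):  # placeholder addresses only
        print(ip, "reachable" if answers_ping(ip) else "unreachable")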

For the majority of the Astutium VPS/OVZ users, there will simply be a 10 minute (or less) period of downtime while your data is copied to the new hardware platform.

For those moving to a new IP range, you will be contacted individually with a modified plan specific to your VPS.

*Update 1*
All UK OVZ Clients migrated successfully to new hardware (and new IPs in some cases)

COMPLETED: Scheduled Maintenance: Hosting Plesk 9

Scheduled Maintenance: Hosting Plesk 9

What: Replacement of Hosting Server Plesk-9

Affects: All resellers/customers with websites and email services on Plesk 9

When: 2013/05/19 23:00 – 2013/05/20 11:00

The hosting server Plesk 9 is due for replacement following an intermittent fault with the drive array controller.

A new server has been set up.
Access to the existing server will be stopped at 11pm.
Data will be migrated to the replacement server (this takes approx. 12 hours).

During the maintenance window you will not be able to send/receive emails, access/update websites or manage/query mysql databases.

We apologise in advance for the disruption, but it is necessary to replace hardware when it has failed, or appears about to fail, in order to ensure ongoing services.

*Update-1*
The new server has been set up and technicians are working on the restore and import of the data to the new systems.
Estimated time for availability of services: ~4 hours (15:45 UK time)

*Update-2*
All ecommerce sites are now restored; we are working through the business hosting sites now.
Estimated time for availability of services: ~2 hours (18:10 UK time)

*Update-3-FINAL*
All sites have now been restored.  You may have DNS issues whilst your connectivity/broadband provider’s caches update.  We apologise for the extended outage due to the difficulty in retrieving data from the failing components.

Reason For Outage (RFO) April 29th 2013

Astutium Ltd try to be as informative and open as possible regarding any service-affecting problems and scheduled maintenance by posting general [not client-specific] issues on the Astutium Status Blog.

Breakdown of the problems experienced 29/04/2013

At approximately 10am we received several reports of difficulty accessing our own website and some customer servers.

Neither of the two external monitoring systems was showing any issues, nor was anything actually down according to our constant internal checks.

However, clients had reported an issue, and despite being able to access both our sites and theirs from our US servers, London Docklands systems and BT FTTC connections, technicians investigated, collated traceroutes and worked on identifying the problem.

It was determined that there were reproducible issues, external to our network, involving some “odd” routing paths. Traceroutes in and out of the network to a variety of locations, and out through each border router, were examined; some went through far more networks and took longer routes than would be expected from the start-point.
(For those interested, traceroute.org is an excellent resource.)

As part of the UK internet backbone, we control our *outbound* connectivity; no ISP controls the route *inbound* to their network, as that is someone else’s outbound. Our outbound was working, but not completely as expected.

At the same time, we were receiving a *lot* of route updates from one upstream tier-1 – far more than usual.

When a route changes on the internet (at any given time around 20% of the total IP addresses are uncontactable), the old route has to be removed and a new “best path” selected from the routes remaining available. This is done constantly by the BGP routers of every AS (Autonomous System) that makes up the internet.

BGP recalculations take around 2 minutes, and during that time any point A may be unable to reach any point B if the traffic would go over the affected route. This is quite normal and expected, and TCP/IP protocols are very resilient to these regular “pauses”, which happen thousands of times per day for everyone.
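
As a highly simplified illustration of that reselection step (real BGP also compares local-preference, origin, MED and more; here only AS-path length is used, and all prefixes and AS numbers are invented):

    # Toy model of a BGP route withdrawal followed by best-path reselection.
    # Real BGP compares local-preference, AS-path length, origin, MED, etc.;
    # this sketch uses AS-path length only, with made-up example data.
    routes = {
        "192.0.2.0/24": [
            {"next_hop": "upstream-A", "as_path": [64512, 64520]},
            {"next_hop": "upstream-B", "as_path": [64513, 64530, 64520]},
        ],
    }

    def best_path(prefix):
        # Pick the shortest AS path among whatever routes remain available.
        candidates = routes.get(prefix, [])
        return min(candidates, key=lambda r: len(r["as_path"])) if candidates else None

    def withdraw(prefix, next_hop):
        # Remove the withdrawn route; the next best_path() call reselects.
        routes[prefix] = [r for r in routes[prefix] if r["next_hop"] != next_hop]

    print(best_path("192.0.2.0/24"))        # via upstream-A (shorter AS path)
    withdraw("192.0.2.0/24", "upstream-A")  # the upstream withdraws its route
    print(best_path("192.0.2.0/24"))        # traffic reselects via upstream-B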

On any given day we log around 2Gb of routes and route changes. Between 10am and 11am we logged over 36Gb – approximately 18 times an average day’s volume in a single hour.

The upstream producing the bulk of these add-and-withdraw route messages was disconnected in an emergency router maintenance (lasting 10 minutes at 11am), and the network stabilised again over the remaining tier-1, tier-2 and multiple peering connections. The upstream NOC have been informed and are investigating.

Later in the day, at around 2.30pm, we saw through our monitoring and client reports that an identical issue was starting again.

Investigation into the problem showed that the symptoms were the same (constant recalculation of routes); however, this time the problem appeared to be generated internally rather than by an upstream.

Network technicians attended our main London datacentre as remote diagnosis was impossible, and we saw that our VRRP gateway routers were passing the responsibility of routing back and forth in a “see-saw” pattern – gateway3 failing to gateway4, gateway5 to gateway6, then gateway6 to gateway5, gateway4 to gateway3, etc.

In-depth analysis took place, and the problem was identified as the switches interconnecting the routers to each other repeatedly stopping and restarting (during which time each pair of gateway routers could not see all the others, so one became the primary for each VLAN; then, when the switch came back up, all the VLANs switched back over).

This was happening when certain volumes of traffic with unusually sized packets were received, so either the switches are faulty or, more likely, the switch software/firmware has a fault under certain conditions. A temporary fix was installed on the existing setup, and the status of the network and routing monitored closely.

Alternative switches were added to the routing infrastructure, installed and configured. Migration of routers to the new switch fabric started at approx. 20:00 – all the cabling between each router was moved/replaced and connected to the new switches.

Services were brought up on the gateway routers in the following order:

  • colocated servers
  • dedicated servers
  • cloud servers
  • shared hosting
  • management and monitoring
  • internal servers
  • development / non production servers

By 21:45 all services were restored to standard levels of service, and have remained up since.

The similarity of the two issues (the same effect for the end customer: temporary inability to connect to email, websites, FTP, SSH, RDP), each leading to an up/down appearance as routes changed and were redistributed, then changed and redistributed again, implies that they are somehow inter-related.

Whether that is a new type of attack (probably not, from the traffic profile), a specific switch/software bug (most likely), or some “global” change or new protocol over TCP/IP not previously handled, remains to be seen (we saw very strange results some years ago when P2P traffic became very prevalent, which impacted router performance worldwide for a while).

We will now be setting up a replica system in our lab, and working with the switch manufacturers on a longer term solution than simply adding more switches and changing cabling.

Additionally, if we see stability from our upstreams, the external router setup will revert to its previous settings; if that is not immediately forthcoming, alternative upstreams will be connected.

Clients with dual-bonded EtherChannel connections to multiple redundant switches would not have been affected until the switch replacements in the evening.

BGP downstream clients would have seen an increase in the morning of route-changes, but remained 100% available throughout the day.

All clients where we handle the routing and/or provide the final gateway before accessing the outside world (our own systems included) will have seen approximately 11 outages of between 2 and 10 minutes during the day, and around 2 hours during the hardware replacements.

COMPLETED: Emergency Maintenance: SW-Routed1

Emergency Maintenance: Switch Routed 1

What: Intermittent Failure of Switch Routed 1

Affects: All routed services

When: 2013/04/29 – Various

Engineers are onsite preparing a replacement.

A further brief outage will be required to replace/swap the failed component.

Update 1: 29/04/2013 22:36 – The affected network component has now been fully replaced. We do not expect any further problems related to this issue.