Apr 30

Reason For Outage (RFO) April 29th 2013

Astutium Ltd try to be as informative and open as possible regarding any service affecting problems and scheduled maintenance by posting on the Astutium Status Blog for general [not client specific] issues.

Breakdown of the problems experienced 29/04/2013

At approximately 10am we received several reports of difficulty accessing our own website and some customer servers.

Neither of the 2 external monitoring systems was showing any issues, nor was anything actually down according to our constant internal checks.

However clients had reported an issue, and despite being able to access both our sites and theirs from our US servers, London Docklands systems and over BT FttC connections, technicians investigated, collated traceroutes and worked on identifying the problem.

It was determined that there were reproducible issues, external to our network, involving some “odd” routing paths. Traceroutes in and out of the network to a variety of locations, and out through each border router, were examined. Some of them passed through far more networks, and took longer routes, than would be expected from the start-point.
(For those interested, traceroute.org is an excellent resource.)

As part of the UK internet backbone, we control our *outbound* connectivity. No ISP controls the route *inbound* to their network, as that’s someone else’s outbound control. Outbound was working, but not completely as expected.

At the same time, we were receiving a *lot* of route updates from one upstream tier-1 – far more than usual.

When a route changes on the internet (at any given time around 20% of the total IP addresses are uncontactable), the old route has to be removed and a new “best path” selected from the routes that remain available. This is done constantly by the BGP routers of every AS (Autonomous System) that makes up the internet.

BGP recalculations take around 2 minutes, and during that time any Point-A may be unable to access any Point-B if the traffic would have gone over the route being recalculated. This is quite normal and expected, and TCP/IP protocols are very resilient to these regular “Pauses” that happen thousands of times per day for everyone.
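The remove-and-reselect step described above can be sketched as follows. This is a deliberately simplified illustration: it picks the route with the shortest AS path only, whereas real BGP best-path selection considers several more attributes. The class names, upstream names, AS numbers and prefix are all hypothetical (documentation-range values), not taken from our network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # destination network
    upstream: str    # neighbour the route was learned from (hypothetical names)
    as_path: tuple   # sequence of AS numbers towards the destination


class RouteTable:
    """Toy routing table: keeps every known route and picks a best path."""

    def __init__(self):
        self.routes = {}  # prefix -> {upstream: Route}

    def announce(self, route):
        self.routes.setdefault(route.prefix, {})[route.upstream] = route

    def withdraw(self, prefix, upstream):
        # Remove the old route; a missing route is silently ignored.
        self.routes.get(prefix, {}).pop(upstream, None)

    def best(self, prefix):
        # Simplification: shortest AS path wins. Returns None if no route remains.
        candidates = self.routes.get(prefix, {}).values()
        return min(candidates, key=lambda r: len(r.as_path), default=None)


table = RouteTable()
table.announce(Route("203.0.113.0/24", "upstream-a", (64496, 64499)))
table.announce(Route("203.0.113.0/24", "upstream-b", (64497, 64498, 64499)))

before = table.best("203.0.113.0/24")
table.withdraw("203.0.113.0/24", "upstream-a")   # the route change
after = table.best("203.0.113.0/24")             # new best path from what remains

print(before.upstream, "->", after.upstream)
```

Every add-and-withdraw message triggers this kind of reselection, which is why a flood of updates from a single upstream can keep routers recalculating continuously.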

In any given day we log around 2Gb of routes and route changes. Between 10am and 11am we logged over 36Gb, approximately 18 times an average day’s volume within a single hour.
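To put those figures in proportion, here is the arithmetic as a small sketch. The only inputs are the two volumes quoted above; the hourly comparison additionally assumes that a normal day's logging is spread evenly over 24 hours, which is an assumption rather than a measured figure.

```python
normal_per_day = 2       # volume logged on an average day (figure quoted above)
incident_in_hour = 36    # volume logged between 10am and 11am (figure quoted above)

# One incident hour compared with a whole average day.
ratio_vs_day = incident_in_hour / normal_per_day

# One incident hour compared with one average hour
# (assumes the normal daily volume is spread evenly over 24 hours).
ratio_vs_hour = incident_in_hour * 24 / normal_per_day

print(f"{ratio_vs_day:.0f}x an average day, {ratio_vs_hour:.0f}x an average hour")
```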

The upstream producing the bulk of these add-and-withdraw route messages was disconnected during emergency router maintenance (lasting 10 minutes at 11am), and the network stabilised again over the remaining tier-1, tier-2 and multiple peering connections. The upstream NOC have been informed and are investigating.

Later in the day at around 2.30pm we saw through our monitoring and client reports an identical issue starting again.

Investigation into the problem showed that the symptoms were the same (constant recalculation of routes); however, this time it appeared to be generated internally rather than from an upstream.

Network technicians attended our main London datacentre as remote diagnosis was impossible, and we saw that our VRRP gateway routers were passing the responsibility of routing back and forth in a “see-saw” pattern: gateway3 failing to gateway4, gateway5 to gateway6, then gateway6 to gateway5, gateway4 to gateway3, etc.

In-depth analysis took place, and the problem was identified as the switches interconnecting the routers repeatedly stopping and restarting. While a switch was down, each pair of gateway routers couldn’t see all the others, so one became the primary for each vlan; when the switch came back up, all the vlans switched back over.
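The see-saw effect can be illustrated with a toy model. This is only a sketch of the general VRRP behaviour, assuming preemption is enabled and using made-up priority values; it is not the actual configuration of the gateway routers.

```python
def elect_masters(priorities, link_up):
    """Return the set of routers that consider themselves master.

    With the interconnecting switch up, routers hear each other's
    advertisements and only the highest priority remains master.
    With it down, no router hears a peer, so each promotes itself.
    """
    if link_up:
        return {max(priorities, key=priorities.get)}
    return set(priorities)


# Hypothetical priorities for one gateway pair.
priorities = {"gateway3": 200, "gateway4": 100}

# The switch repeatedly stopping and restarting (True = up, False = down).
switch_states = [True, False, True, False, True]

history = [elect_masters(priorities, up) for up in switch_states]

# Count how often the routing responsibility changed hands.
transitions = sum(1 for a, b in zip(history, history[1:]) if a != b)

for up, masters in zip(switch_states, history):
    print("switch up  " if up else "switch down", sorted(masters))
print("role changes:", transitions)
```

Each role change forces the vlans to move between routers, so a switch that restarts every few minutes produces a matching series of short interruptions.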

This was happening when the switches received certain volumes of traffic made up of unusually sized packets, so either the switches are faulty or, more likely, the switch software/firmware has a fault under certain conditions. A temporary fix was applied to the existing setup, and the status of the network and routing monitored closely.

Alternative switches were added to the routing infrastructure, installed and configured. Migration of routers to the new switch fabric started at approximately 20.00: all the cabling between each router was moved/replaced and connected to the new switches.

Services were brought up on the gateway routers in the following order:

  • colocated servers
  • dedicated servers
  • cloud servers
  • shared hosting
  • management and monitoring
  • internal servers
  • development / non production servers

By 21.45 all services were restored to standard levels of service, and have remained on/up since.

The similarity of the 2 issues implies that they’re somehow inter-related. Both had the same effect for the end-customer: a temporary inability to connect to email, websites, ftp, ssh and rdp, giving an up/down appearance as routes changed, were redistributed, then changed and were redistributed again.

Whether that’s a new type of attack (probably not, from the profile), a specific switch/software bug (most likely), or some “global” change or new protocol over tcp/ip not previously handled remains to be seen (we saw very strange results some years ago when P2P traffic became very prevalent, which impacted router performance worldwide for a while).

We will now be setting up a replica system in our lab, and working with the switch manufacturers on a longer term solution than simply adding more switches and changing cabling.

Additionally, if we see stability from our upstreams, the external router setup will revert to its previous settings; if that is not immediately forthcoming, alternative upstreams will be connected.

Clients with dual-bonded etherchannel connections to multiple redundant switches would not have been affected until the switch replacements in the evening.

BGP downstream clients would have seen an increase in route changes in the morning, but remained 100% available throughout the day.

All clients where we handle the routing and/or provide the final gateway before accessing the outside world (our own systems included) will have seen approximately 11 outages of between 2 and 10 minutes during the day, and around 2 hours during the hardware replacements.
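For a sense of the total impact on those clients, the figures above can be added up as follows. This is a rough estimate using only the numbers quoted in the previous paragraph, all of which are approximate.

```python
outages = 11                      # short outages during the day
shortest, longest = 2, 10         # minutes per outage (lower and upper bound)
replacement_minutes = 2 * 60      # around 2 hours during the hardware replacements

# Best case: every short outage lasted 2 minutes; worst case: 10 minutes.
low_total = outages * shortest + replacement_minutes
high_total = outages * longest + replacement_minutes

print(f"Estimated total downtime: {low_total} to {high_total} minutes")
```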