South East Scoobies  

Forum: Non Scooby / Non Car Related (Anything Non-Scooby related)

#1 | 10-02-2016, 07:02 PM
thepieman (Platinum Member)
Join Date: May 2011 | Location: home | Posts: 1,984
Heart servers

Blooming Heart servers...

Nice to see SES back up and running
#2 | 10-02-2016, 07:05 PM
scooby doo (Platinum Member)
Join Date: Jul 2010 | Location: hastings | Posts: 2,906

Yes, been down too long
#3 | 10-02-2016, 07:07 PM
thepieman (Platinum Member)
Join Date: May 2011 | Location: home | Posts: 1,984

This is the status update that PPP sent me earlier. Looks like they had some trouble, and there must have been a good few sites down!



System Status
Welcome to our system status page. This page is updated with any planned maintenance or current system issues.
Please check this page before contacting customer services.
Current System Status
Power outage

Wed, 10 Feb 2016 15:05:44 +0000
We suffered a brief interruption in power during works in one of our data centre halls. This has now been restored and all servers are coming back online. We will provide further updates every 20 minutes as work progresses.
Wed, 10 Feb 2016 15:50:00 +0000
While keeping your websites and services online is a top priority, our highest priority is the safety of our engineers. When working with equipment such as UPS units (massive battery arrays), generators, switching equipment and multiple mains feeds, we ensure our team are safe. In this case, a voltage sensor was due for replacement and we had switched over to bypass the unit; however, for a reason as yet unknown, a safety warning was triggered, which resulted in the shutdown of power for about 9 minutes. The full cause is not yet known; our expert data centre team, as well as the equipment providers, are working to understand it fully. Right now we can assure you that power is flowing to the data centre as it should be.
Wed, 10 Feb 2016 16:01:00 +0000
Currently our priority is to restore service to infrastructure servers; this is in progress now and will allow us to begin work restoring service to the shared platform.
Wed, 10 Feb 2016 16:07:00 +0000
Infrastructure servers are now showing as fully online. Work ongoing on the shared platform and the current ETA for restoration of service is 17:00
Wed, 10 Feb 2016 16:14:00 +0000
Hybrid VPS servers are now being powered on. A tentative ETA for restoration of hybrid servers is 17:30
Wed, 10 Feb 2016 16:38:00 +0000
VPS KVM host servers are now being powered on. Guests are still currently unavailable
Wed, 10 Feb 2016 17:08 +0000
Premium hosting front end webservers are now coming online. Databases are currently still powered down
Wed, 10 Feb 2016 17:37 +0000
Premium hosting database servers are now being powered on.
Planned System Maintenance
There is currently no planned maintenance.
#4 | 10-02-2016, 07:55 PM
Ginola (Moderator)
Join Date: Mar 2011 | Location: Steyning | Posts: 4,233

Sounds like a regular day in the office!
__________________
Somewhere drinking Coffee

Mapped and maintained by FBTuning 07595 493581.
#5 | 10-02-2016, 10:24 PM
Steve_PPP (Admin, Meets/Events Organiser)
Join Date: Sep 2008 | Location: Burgess Hill, Sussex | Posts: 13,280

Yes, it sounds like our hosts (Heart Internet) had a fairly major balls-up this afternoon. Thanks to those who called/texted - and apologies to those I didn't reply to - but it was completely out of our hands.
#6 | 11-02-2016, 09:27 AM
Rdlangy1 (Moderator)
Join Date: Mar 2014 | Location: Isle of Sheppey - Kent | Posts: 4,924

Nice to see we all survived ha ha
__________________
Subaru Impreza WRX STi Prodrive Type UK 2002:
"Tinkered With!"

Maintained and Modified by Super Jules
&
RM Performance http://www.rmperformance.co.uk
#7 | 11-02-2016, 06:04 PM
iPond (Bronze Member)
Join Date: Apr 2013 | Location: surrey | Posts: 265

Did anyone else have issues today? Was getting 503 errors most of the day and unable to access?
#8 | 11-02-2016, 07:00 PM
Steve_PPP (Admin, Meets/Events Organiser)
Join Date: Sep 2008 | Location: Burgess Hill, Sussex | Posts: 13,280

Quote:
Originally Posted by iPond View Post
Did anyone else have issues today? Was getting 503 errors most of the day and unable to access?
Yes, the hosts have had more problems today. Their latest update:

Current Summary Of Issues (11/02/16 - 17:30)

Webmail - The webmail service appears to be back up and running again now, and you should be able to access your mailboxes through this correctly.

VPS & Hybrid Servers - We currently have two VPS 'kvmhosts' down - these are "kvmhost248" and "kvmhost249". Sadly it will likely take most of the morning to resolve this. All other kvmhosts for VPS and Hybrid Servers are online with all servers started. However if your VPS or Hybrid Server is definitely still down and a reboot does not fix this, please contact support.

Linux Shared Hosting - Unfortunately an issue still exists on the following web servers: web75, web76, web77 and web78. It did appear that this was resolved, but further investigation has shown that there is another issue here. A separate issue has also been identified on web servers 175-183, and is under investigation now.

Premium Hosting - The Premium Hosting service appears to be back up and running again now, and the issue with MySQL database creation has also been addressed and resolved.


SES runs on server web183. I wouldn't be surprised if we see a few more hiccups over the next 24hrs before it settles down but hopefully not.
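As an aside (not from Heart's updates): if anyone wants to script a check on the site while things settle down, a minimal probe that retries on 503 with exponential backoff might look like this. The URL and retry numbers are made up, and the `opener` argument only exists so the function can be exercised without a live server.

```python
import time
import urllib.request
import urllib.error

def probe(url, attempts=5, base_delay=1.0, opener=urllib.request.urlopen):
    """Return the final HTTP status seen, retrying on 503 (Service
    Unavailable) with exponential backoff between attempts."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            with opener(url) as resp:
                return resp.status      # server answered normally
        except urllib.error.HTTPError as e:
            if e.code != 503:
                return e.code           # a different error: retrying won't help
            if attempt == attempts - 1:
                return 503              # still down after all attempts
            time.sleep(delay)           # back off before the next try
            delay *= 2

# e.g. probe("http://www.southeastscoobies.co.uk/")  # hypothetical usage
```

The backoff doubling is just good manners towards a server that is already struggling.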

As an IT infrastructure guy, I have a lot of sympathy for them - it sounds like they had electrical engineers working in their data centre on *supposedly* non-disruptive jobs, but they screwed something up and dropped power to (or spiked) a bunch of racks, leaving their IT guys to pick up the mess. They posted an update earlier today:

Quote:
We are still working our very hardest to resolve the remaining kvmhost and webserver (web75, web76, web77 and web78) issues, many of our staff have been in the office since 8am yesterday and are still here now.
Looks like they were pulling an all-nighter to try and sort it out. Been there, done that. It's a right pain in the ass.

Quote:
Originally Posted by Rdlangy1 View Post
Nice to see we all survived ha ha
Yeah, you spoke too soon mate
#9 | 17-02-2016, 04:10 PM
Steve_PPP (Admin, Meets/Events Organiser)
Join Date: Sep 2008 | Location: Burgess Hill, Sussex | Posts: 13,280

Had the following email from Heart Internet about the downtime we experienced last week. It's quite a read, but I'll highlight some of the key bits in bold for you in case you're feeling lazy.

I think they've had a bad week

Quote:
Dear customers,

As you may be aware, we recently suffered the worst single incident in our history due to a power outage at our Leeds data centre on Wednesday afternoon.

Emergency maintenance work was being carried out on the load transfer module, which feeds power from our external energy supplies to the data centre hall that holds the majority of our servers. The data centre has two dual-feed uninterruptible supplies, both backed by diesel generators in case of National Grid outages.

Unfortunately, a safety mechanism within the device triggered incorrectly, and resulted in a power outage of less than 9 minutes. Subsequently, this caused approximately 15,000 servers to be hard booted. Short of a fire, this is the worst possible event that a hosting company can face. A full post-mortem is currently being carried out to determine how power was lost on both supplies despite working with the external engineer from the hardware manufacturer.

What happens when servers hard reboot?

Web servers and virtual servers typically perform database transactions at a very high rate, meaning that the risk of database or file system corruption is quite high when a hard reboot occurs.

Following the restoration of power, our first priority was to get our primary infrastructure boxes back online, then our managed and unmanaged platforms. Our managed platforms are built to be resilient, so although we lost a number of servers in the reboot, the majority of our platforms came up cleanly. We faced some issues with our Premium Hosting load balancers, which needed repairing, so some customer sites were off for longer than we would have hoped. We are adding additional redundant load balancers and modifying the failover procedure over the next 7 days as an extra precaution for us and our customers.

On our shared hosting platform, a number of NAS drives, which sit behind the front-end web servers and hold customer website data, crashed and could not be recovered. However, they are set up in fully redundant pairs and the NAS drives themselves contain 8+ disk RAID 10 arrays. In every case but one, at least one server in each pair came back up cleanly, or in an easily repairable state, and customer websites were back online within 2-3 hours.

In a single case, the cluster containing web 75-79, representing just under 2% of our entire shared platform, both NAS drives failed to come back up. Following our disaster recovery procedure, we commenced attempts to restore the drives, whilst simultaneously building new NAS drives should they be required. Unfortunately, the servers gave a strong, but false, indication that they could be brought back into a functioning state, so we prioritised attempts to repair the file system.

Regrettably, following a ‘successful’ repair, performance was incredibly poor due to the damage to the file system, and we were forced to proceed to the next rung of our disaster recovery procedure. The further we step into the disaster recovery process, the greater the recovery time, and here we were looking at a total 4TB restore from on-site backups to new NAS drives. (For your information the steps following that are to restore from offsite backup and finally restore from tape backup although we did not need to enact these steps.) At this point, it became apparent that the issue would take days rather than hours to resolve, and the status page was updated with an ETA. We restored sites to the new NAS drives alphabetically in a read-only state and the restoration completed late Sunday afternoon.

A full shared cluster restore from backups to new NAS is a critical incident for us, and we routinely train our engineers on disaster recovery steps. Our disaster recovery process functioned correctly, but because the event did not occur in isolation, we were unable to offer the level of individual service that we really wanted to, and that you would expect from us (e.g. individual site migration during restoration).

Given the magnitude of this event, we are currently investigating plans to split our platform and infrastructure servers across two data centre halls, which would allow us to continue running in the event of complete power loss to one. This added reliability is an extra step that we feel is necessary to put in place to ensure that this never happens again for our customers.

VPS and Dedicated Servers

For our unmanaged platforms (VPS and Dedicated Servers), the damage was more severe, as by default these servers are not redundant or backed up. In particular, one type of VPS was more susceptible to data corruption in the event of a power loss due to the type of caching the host servers use. We have remedied this issue on all re-built VPS involved in the outage, and no active or newly built VPS now suffer from this issue.

We did lose two KVM hosts (the host servers that hold VPS, approximately 60-80 servers per VPS KVM host, 6-12 servers per Hybrid KVM host). The relatively good news was that the underlying VPS data was not damaged, although further to this, we also lost two KVM network switches which needed to be swapped out, which did result in intermittent network performance on other VPS during the incident.

To bring the VPS back online, the KVM hosts needed to have replacements built and VPS data copied from each before being brought back online. For every other VPS, the host servers were back up and running within 2 hours, but in many cases, the file systems or databases of the virtual machines on those servers were damaged by the power loss. For these VPS, by far the quickest course of action for customers to get back up and running immediately was a rebuild and restore from backups (either offsite or via our backup service).

However, we realised quickly that many of the affected VPS customers did not have any backups (irrespective of whether the backup was with us), and the only copy of the server’s data was held in a partially corrupted form on our KVM hosts so we took steps to attempt to get customers back online. For every affected VPS we ran an automated fsck (file system check) in an effort to bring the servers back online in an automated fashion. This would not, however, fix issues with MySQL, which would be the most common issues due to high transaction rate. Tables left open during a power loss are likely to result in corrupted data, so we provided a do-it-yourself guide to try and get MySQL into a working state.

We provided the option for us to attempt a repair, which typically takes 2-3 hours per server with an expected success rate of approximately 20%. We currently have a backlog of servers we have agreed to attempt to recover, but given the time per investigation, this is likely to take most of the week. This is roughly equivalent to the total loss of our NAS pair and is where disaster recovery steps (server rebuild and backup restoration) should be followed.

As these servers are unmanaged, there is no disaster recovery process in place by default. I know this isn’t the answer many of you want to hear, and most of all we want to ensure that this can never happen to you again. All VPS hosts are now set to be far more resilient in the event of a sudden power loss.

Support and Communications

During this incident, we have worked our hardest to ensure that our entire customer base was kept informed of our progress through our status page.

Given the scale of the issue, the load on our Customer Services team was far in excess of normal levels. On a standard day, we handle approximately 800 support tickets, which can rise to 1600 during a fairly major incident. At absolute capacity, we can handle approximately 2000 new tickets per day.

This event was unprecedented, so during and following the incident we received in excess of 5000 new support tickets every day (excluding old tickets that were re-opened), and the ticket complexity was far higher than usual. Our admin system was not set up to handle this number of requests (being poll heavy to give our team quick updates on our ticket queue). This heavily impacted the performance of our control panel and ticketing system until we made alterations to make it far less resource intensive.


After this, we took immediate steps to ameliorate the incredible support load via automated updates to affected customers, but most of the tickets required in-depth investigation and server repairs that require a high level of technical capability, so could only be addressed by our second line and sysadmin staff. It will take some time to clear our entire ticket backlog and restore normal ticket SLAs.

We had planned to go live with a brand new Heart Internet customer specific status page on the day of the outage, as it would allow us to provide greater detail for direct customers without the requirement that messages be white labelled and generic.

We did not push this live during the incident as we needed all hands on to fix the live issues, but we have just made it live at status.heartinternet.uk (it will later also be available at http://heartstatus.uk using external DNS). The service allows for subscription via email, SMS, and RSS, so you will be kept up-to-date during any major incident. Past events are also archived and remain fully visible. We will also use this page to inform you of any changes to the platform or scheduled work.

Most of all we’d like to apologise to you, and to your customers. We know as much as anyone how important staying online is to your business. The best thing we can do to regain your trust is to offer good, uninterrupted service long into the future, and that is now our utmost priority.
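The corruption they describe is the classic hard-reboot failure mode: a write is half on disk when power drops. Journaled stores survive this by logging each write and forcing it to disk before acknowledging, then replaying the log at startup and discarding any torn final record. Here's a toy sketch of that idea - purely illustrative, nothing like Heart's actual setup or a real database engine:

```python
import os

class WalStore:
    """Toy key-value store with a write-ahead log: every write is
    appended to the log and fsynced before being acknowledged, so a
    hard reboot can be recovered by replaying complete records."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        # Crash recovery: re-apply every complete record from the log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                line = line.rstrip("\n")
                if "=" not in line:
                    continue  # torn final write: ignore the partial record
                key, value = line.split("=", 1)
                self.data[key] = value

    def put(self, key, value):
        # "Write-ahead": log first, force to disk, then update memory.
        with open(self.log_path, "a") as f:
            f.write(f"{key}={value}\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)
```

A cache that skips the fsync step is exactly the "type of caching" weakness the email describes for those VPS hosts: fast, but the acknowledged write may not actually be on disk when the power goes.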
#10 | 17-02-2016, 04:19 PM
Rdlangy1 (Moderator)
Join Date: Mar 2014 | Location: Isle of Sheppey - Kent | Posts: 4,924

Wow - at least they have done the right thing in offering a full explanation and seem to have done their best throughout what must have been a nightmare scenario!
__________________
Subaru Impreza WRX STi Prodrive Type UK 2002:
"Tinkered With!"

Maintained and Modified by Super Jules
&
RM Performance http://www.rmperformance.co.uk
#11 | 17-02-2016, 08:09 PM
Ginola (Moderator)
Join Date: Mar 2011 | Location: Steyning | Posts: 4,233

Got to feel sorry for them really; we had something similar at work a few years back when a data centre got flooded.

Nicely written explanation of it all, though...

"Have you tried turning it off and turning it on again?"
__________________
Somewhere drinking Coffee

Mapped and maintained by FBTuning 07595 493581.
#12 | 17-02-2016, 11:53 PM
Steve_PPP (Admin, Meets/Events Organiser)
Join Date: Sep 2008 | Location: Burgess Hill, Sussex | Posts: 13,280

Quote:
Originally Posted by Ginola View Post
Got to feel sorry for them really
Me too mate, I've had a similar thing in my previous job on a much smaller scale (we're talking only 5 racks of kit). I was the first one in one morning and the lot was off: the power had gone off overnight, the generator failed to start, and the UPS ran down to 10% and began a graceful shutdown of most of the kit. The rest got dropped when it ran out. It still took me the best part of half a day just to get that lot back online, starting everything up in the right order.
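That "right order" is really just dependency ordering: nothing starts until everything it relies on is up. A quick sketch of the idea, with entirely hypothetical service names:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map: each service lists what must be up first.
deps = {
    "network": [],
    "storage": ["network"],
    "database": ["storage"],
    "app": ["database", "network"],
    "web": ["app"],
}

def boot_order(deps):
    """Return a start-up order in which every service comes after
    everything it depends on (a topological sort of the graph)."""
    return list(TopologicalSorter(deps).static_order())
```

Writing the map down in advance is the difference between half a day of guesswork and a checklist you can hand to whoever is first in the door.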





Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2024, vBulletin Solutions Inc.
