How our resilience held up on January 30, 2017


On January 30, 2017, we faced a failure unlike any we had encountered since our founding. Here we explain openly what happened and what we did to keep it from happening again.

In this article we describe in detail what happened on January 30, 2017, and what we did to keep it from happening again.

A little history: we started in 2015 thanks to an opportunity to offer a specialized hosting solution for software developed for the largest DirecTV distributor in Colombia (Communicate Ltda.). We were able to solve a problem that HostGator, GoDaddy, and Amazon Web Services (AWS) had not solved: optimization. That software went from several outages per week and absurdly slow response times to no outages and the speed required by a corporate environment in which hundreds of technicians, advisers, and administrators serve tens of thousands of customers.

Since we opened to the public on June 30, 2015, our customer base has grown quickly, which pushed us to bring our infrastructure under our own control. In August 2015 we moved from renting VPS servers to using dedicated servers, with control over the cabinet, routers, and peers, and we are now running our own VPS platform. Since our founding we have worked hard on software that makes hosting more efficient, so that it runs faster on the same hardware resources.

Now, 18 months later, hardware has improved: new Intel Xeon processors, faster DDR4 RAM, faster SSDs, and better RAID controllers. Although our goal has always been to optimize through software, we cannot fall behind on hardware, which always helps. For that reason we decided to invest in more robust servers with modern hardware, and we spent a month planning the migration. We relied on KVM virtualization and LVM storage to make everything easier to administer, expand, and migrate.

We had to migrate dozens of customer VPSs, as well as the VPSs that run our own hosting and DNS services. All of them migrated without trouble, each taking 5 to 10 minutes. One of the most important remained: a VPS with more than 400 hosting accounts. This VPS had a special configuration, part of which gave it direct access to the processor. Its migration began at 2:15 a.m. on January 30 and finished at 4:07 a.m. on the new server. The CPU there was a different, latest-generation model. According to the documentation this should not have been a problem, since it was a migration from a 64-bit Intel Xeon to a 64-bit Intel Xeon.

However, because the VPS had direct access to the CPU, it entered what could colloquially be called a state of dementia, as if its brain had been swapped from one moment to the next. This is a bug that does not always occur. In that state the files became corrupted: more than 30% were damaged, and we recorded it. We nevertheless tried to repair the file system and to reinstall the affected components (cPanel, Yum, RPM, and specific files). We kept at this for about 5 hours, until 8:57 a.m., when we concluded that the system was beyond repair and decided to do what is commonly known as formatting.

During this period we answered your requests by chat, landline and cell phone calls, email, Skype, and Facebook Messenger, both on our Page and even on the personal profiles of our team. We were clear with our users that a fatal error had occurred in the hosting service and that we were in the process of recovering. At 12:45 p.m. the installation and configuration of the system was finished. cPanel had not backed up some of our custom optimizations, so a lot of manual work was required, which is why this step took almost 4 hours.

We take daily backups with 30-day retention, which lets users restore any file, folder, mail account, or database from the previous 30 days through cPanel. To do this efficiently, the backups are incremental; that is, only the changes are copied. But this kind of backup was not the most appropriate for a total loss of data, because the copies are kept on external servers. Even though the network runs at 1,000 Mbit/s (in Colombia the average home connection is 3 Mbit/s, and the largest datacenter, in Bogotá, offers up to 100 Mbit/s), copying almost 4 million files over a network is not fast; some accounts, with more than 200,000 files, took up to 27 minutes to restore. That is why the restoration that began at 12:45 p.m. on January 30 finished at 4:48 p.m. on January 31. The oldest backup restored was from January 28 at 12:53 a.m., 25 hours 28 minutes before the failure. The information from that period could not be recovered, although some accounts had backups taken only minutes before the failure, so they suffered no major data loss.
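To illustrate what "only the changes are copied" means, here is a minimal sketch in Python. It is not the tool we use; the function names `snapshot` and `incremental_copy` are hypothetical, and it assumes a file counts as changed when its size or modification time differs from the previous run. It does not handle deleted files.

```python
import os
import shutil


def snapshot(root):
    """Map each file's path (relative to root) to its (size, modification time)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            state[os.path.relpath(full, root)] = (info.st_size, info.st_mtime_ns)
    return state


def incremental_copy(root, dest, previous):
    """Copy only files that are new or changed since `previous`.

    Returns the new snapshot (to pass in on the next run) and the
    sorted list of relative paths that were copied.
    """
    current = snapshot(root)
    copied = []
    for rel, signature in current.items():
        if previous.get(rel) != signature:  # new file, or size/mtime differs
            target = os.path.join(dest, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(os.path.join(root, rel), target)
            copied.append(rel)
    return current, sorted(copied)
```

A daily run of this kind is cheap because few files change from one day to the next. A full restore has to handle every file individually, which is where the time goes when there are millions of them.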

Over the previous month we had been working on many improvements to the service and on automation, as well as planning the migration. On January 29, 30, and 31 we did not rest for a minute; it was continuous work to solve the problem as quickly as possible, and then some. We even set up a priority queue: customers who contacted us were placed next in line for restoration. We host some very critical applications, for example clinics that depend on the system to care for patients (surgeries…), and because of the availability those accounts demand, they were given priority.
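A priority queue of this kind can be sketched in a few lines of Python. This is only an illustration, not the system we used: the class name `RestoreQueue`, the account names, and the priority levels (0 for critical accounts such as clinics, 1 for customers who contacted us, 2 for everyone else) are assumptions made for the example.

```python
import heapq
from itertools import count


class RestoreQueue:
    """Accounts waiting to be restored; a lower priority number is restored first."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker: first come, first served within a priority

    def add(self, account, priority):
        heapq.heappush(self._heap, (priority, next(self._arrival), account))

    def next_account(self):
        _priority, _arrival, account = heapq.heappop(self._heap)
        return account

    def __len__(self):
        return len(self._heap)


# Hypothetical accounts and priority levels, for illustration only.
queue = RestoreQueue()
queue.add("blog.example", 2)     # no request received
queue.add("clinic.example", 0)   # critical application
queue.add("shop.example", 1)     # customer contacted support
queue.add("clinic2.example", 0)  # critical application, arrived later
order = [queue.next_account() for _ in range(len(queue))]
print(order)
```

The arrival counter keeps the order fair among accounts with the same priority, so two critical accounts are restored in the order they were queued.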

Finally, today, February 1, in the morning hours and with everything back in order, we put in place everything necessary to keep the problems we identified from happening again:

  1. Use virtualization in a better way, without giving direct access to the CPU, for better compatibility across migrations. These happen every 1 to 2 years, and for the next one we do not expect a disk corruption like the one we had this time.
  2. Add a second backup for occasions like this one: a block-level copy of the disk (an LVM snapshot) that can be used to restore the whole system in about 70 minutes (447 GB over a 1 Gb/s network with an effective transfer rate of 910 Mbit/s, due to TCP checksums and OpenSSL encryption) instead of the 28 hours 3 minutes it took to restore the incremental backup.
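The 70-minute estimate can be checked with a short calculation. The sketch below assumes that the 447 GB are binary gigabytes (GiB) and that 910 Mbit/s means decimal megabits per second; with decimal gigabytes the result would be closer to 65 minutes. The function name `transfer_minutes` is made up for the example.

```python
def transfer_minutes(size_gib, rate_mbit_s):
    """Minutes needed to move size_gib (binary gigabytes) at rate_mbit_s (decimal megabits per second)."""
    size_bits = size_gib * 2**30 * 8
    seconds = size_bits / (rate_mbit_s * 1_000_000)
    return seconds / 60


snapshot_minutes = transfer_minutes(447, 910)  # block-level restore of the whole disk
incremental_minutes = 28 * 60 + 3              # the 28 h 3 min the file-by-file restore took

print(f"snapshot restore: about {snapshot_minutes:.0f} minutes")
print(f"roughly {incremental_minutes / snapshot_minutes:.0f} times faster than the incremental restore")
```

The difference comes from copying one large stream of blocks instead of handling almost 4 million files individually.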

In simpler words, we have put measures in place so that a disk corruption does not happen again. If it does, we will be able to restore everything in about one hour, and it would most likely happen in the early hours of a Sunday, a time of little traffic.

Our greatest commitment is that our service remains the fastest, with website load times under one second, secure, and close to 100% availability.

Twitter: MyServletHosting
Facebook: MyServletHosting
Email: MyServletHosting