Exercises are good for your health

tbowan & aryliin
August 16th 2021

Spoiler: To sell backup systems, vendors claim that the world is divided into two groups: those who have lost data and those who will. That overlooks another divide, between those who take care of their health and those who expose themselves to problems... As with health in general, we have to move from the “it would be good to do it” stage to the “we do it” stage, and so plan those workouts and exercises. They cost little, and the benefits are real.

Probably because we are always called in after disasters, we hear the same complaint ~~very~~ too often:

If only I had checked that it worked!

A victim

You probably can't imagine the anger and frustration we encounter when customers who have spent a substantial budget on a magic box that is supposed to protect them finally discover that an error somewhere made the solution ineffective, and that they have lost everything...

So, for those who have not yet had the (bad) luck to go through this mourning, we offer a few short stories to give you a scare. And since we are not going to leave you in anxiety, we also offer a solution to regain confidence.

If we consider a network infrastructure as a set of "connected stuff", it is a matter of maintenance or technical inspection, as with cars, where manufacturers encourage the former (otherwise the warranty is lost) and some states require the latter (in France, without this safety check, you’ll pay a €135 fine and the car may be immobilized).

Personally, because of our little demiurge mindset, we prefer to see our infrastructure as a living being that we have created, that lives and that evolves. So we are talking about a lack of training and exercises.

Little stories to scare yourself

As always, these stories are drawn from our experience and, as professional secrecy requires, we have anonymized and adapted them to protect the participants and the companies (and their reputations).

Ransomware pierces defenses

Sylvain is a system administrator who has been in charge of backing up his company's data for several years when management offers him a promotion to the new position of CISO (Chief Information Security Officer). Before officially taking up the position, he is relieved of his current duties (responsibility for backups passes to a colleague) and begins a year of work-study training.

While he is learning his new role, the company suddenly falls victim to ransomware... All company data is encrypted and, until the computer system is rebuilt and the data restored, employees will have to cope and work as in the old days. Sylvain must put his training on hold to save what can still be saved and rebuild what has been destroyed.

Unfortunately, when he finally finishes installing a new file server, he realizes that the backups he was counting on have not been made since he handed them over to his colleague... His successor, already very busy with his own tasks, had not considered them a priority and had "postponed" them. Six months of production went up in smoke.

After rebuilding the computer system, Sylvain was fired for gross misconduct. It won't bring back the data or pay any ransom, but it saves face: it was the CISO's fault, not that of the company, an unfortunate victim of circumstances.

Cascading problems

Charlène is a nationally renowned architect, to the point of having set up her own firm and hired other architects to handle the many projects entrusted to her. As she does not consider herself competent in IT and does not have the budget to hire a permanent administrator, she called on a specialized company to manage, among other things, her file server (with two disks in RAID1, mirroring each other) and its two backup boxes (including one at her home).

For several years, everything went well: Charlène paid for the maintenance and the company set up all her machines, handled a move to new offices and, when one of the server disks failed, replaced it quickly.

Until the new disk also fails and Charlène discovers that, despite her best efforts, she will not be able to recover her data...

The failure is physical: the read heads hit the platters, which are destroyed; even in a clean room, the data is unrecoverable,
When the disk was replaced earlier, the RAID1 was not reconfigured: the new disk was used alone, without mirroring to the second one, which therefore contains nothing newer than that intervention,
During the move to the new offices, the backup was not adapted to the new network settings: it never ran again, and the box on the premises contains no files more recent than the move,
The box installed at her house was in fact never configured and does not contain any files.

Of course, the technician who carried out these operations has since left the outsourcing company, and his manager is just as surprised to discover the minefield he left behind, which finally exploded in the client's face.

The case is now in the hands of lawyers and IT experts (with insurers as guest stars). In a few years, a judge will be able to determine who is responsible and the amount of damages owed to one party or the other. In the meantime, none of this will bring back the three years of lost files.

Do exercises

We could have told you more stories of the same kind. Each time, you would have seen that, after setting up a backup system, the valiant heroes tend to leave it unattended in its corner. Businesses then consider themselves safe from problems thanks to this foolproof system (after all, that's what the white hats promised them).

Plan

In all of our examples, the damage could have been avoided if someone had taken the trouble to check that everything was working as expected. But, as so often, this task is considered secondary and is postponed ad vitam aeternam...

And we can understand it. Caught in the uninterrupted flow of tasks to do, we don't see how to free up our time. And since we consider these exercises daunting, our brains find plenty of other things to do instead and eventually forget about them.

If you have trouble making this a habit and forcing yourself to do these exercises, the most effective approach is still to schedule them formally. Whether in your calendar or your ticket manager, it is easy to create recurring tasks (e.g. with Kanboard, which we use, but also with Nextcloud or Thunderbird).

You can of course adapt the frequency to how active your infrastructure is. The more things change, the more often you have to check that everything still works.

Seen another way, for data backups, the date of your last successful exercise is the most recent point from which you know data can be recovered after a disaster. So don't put it off for too long.
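To make that idea measurable, here is a minimal Python sketch that reports how old the newest file in a backup directory is. It assumes the backups land as ordinary files under a mounted path and that file modification times reflect when each backup ran; the function names and the one-day threshold are hypothetical, chosen for illustration only.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_days(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age, in days, of the most recently modified file under backup_dir.

    Returns None when the directory holds no files at all, which is a
    finding in itself (a backup box that was never configured).
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400  # 86400 seconds in a day


def is_stale(age_days: Optional[float], max_age_days: float = 1.0) -> bool:
    """A backup is stale when it is missing or older than the allowed age."""
    return age_days is None or age_days > max_age_days
```

A check like this only shows that files are still arriving. It does not replace an actual restore, which remains the point of the exercise.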

Proceed

During an exercise, the goal is to simulate a problem to check that the protective mechanisms (automatic or manual) are effective. Here are some examples:

Log in to the backup system and restore a file as of an arbitrary date,
Stop a primary server and check that the secondary takes control,
Disconnect the main internet connection and check that the backup connection is working,
From outside the network, connect a VPN client to your infrastructure, including during an outage of the main internet access.
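For the first exercise in the list, restoring a file is only half of the check: the restored copy also has to match what was saved. The sketch below compares SHA-256 digests of a reference file and a restored file. It assumes you kept a reference copy of the file as it was on the chosen date (or recorded its checksum); the function names and paths are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(reference: str, restored: str) -> bool:
    """Return True when the restored file exists and has the same content."""
    if not Path(restored).is_file():
        return False
    return sha256_of(reference) == sha256_of(restored)
```

For example, `restore_matches("/srv/reference/plan.dwg", "/tmp/restore/plan.dwg")` (both paths invented here) returns False if the restore silently produced nothing or produced a different version.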

And since we are mainly talking about redundancy, also check the human redundancy; if an administrator has implemented a solution, the exercise must be performed by someone else.

Even if the administrator is present to deal with any problems, the exercise should be conducted as if he were absent.

Hence the value of writing formal procedures and updating them at each exercise. If the administrator is unavailable, this document will help anyone resolve these issues. Everybody wins: employees gain skills and the company gains resilience.

Evolve

Ideally, each exercise goes smoothly: the mechanisms work as expected, the procedure is adequate and everything goes well.

In reality, these exercises almost always reveal a problem somewhere. And that is the whole point. Once you have encountered a problem, you can apply these two management rules from GTD (Getting Things Done):

If the task takes less than 2 minutes, do it right away,
Otherwise, create a ticket in your task manager and schedule it.

Some corrections will be urgent ("the exercise broke everything") and therefore made immediately. Others will be less so ("the protection is not as effective as expected") and can be planned for later.

In any case, at the end of each exercise, you gain a better vision of the resilience of your infrastructure and the opportunity to improve it even more.

Once it has become routine, this continuous improvement is like Zen.

At first, I found it painful. And then, little by little, I got a taste for it; now I'm a running addict.

A long-distance runner

At the arsouyes

To organize, synchronize and avoid forgetting important tasks, we use Kanboard and create tickets for whatever we need or want to do.

And in the midst of all these tasks, we have created a recurring "PRA Test" task (PRA being the French acronym for a disaster recovery plan), which we perform once a month and which includes, among others, the following subtasks:

While an IPv4 and IPv6 ping is running, disconnect the fiber, then restart the main firewall,
If the VPN was not used in the previous month, connect to it from outside,
Restore a file, via the backup server.

Over time, these exercises have allowed us to detect and correct a few problems...

Once we were no longer connected to ADSL and had moved the emergency connection to our 4G phones, we realized that the WISP box and the procedure were no longer suitable. We wrote an article about it.
The backups are mostly configured by Aryliin; having Tbowan run the exercise forced us to write a procedure, which is regularly updated. There are plans to involve the children next time.
After a hard disk failed on the server (in RAID5, so without consequence), we replaced it, bought an additional disk and, above all, prepared a CD-ROM that allows the backed-up data to be restored from any PC. This CD-ROM is used in one out of four restore exercises.
When we saved the data extracted from a seal to the NAS (~1TB), the backup machine did not have enough disk space and the backups stopped running. We now exclude from the backup the "DoNotSave" directories, which are intended to hold this kind of large data that does not need remote backup.
When we deleted a "DoNotSave" directory, it ended up in the recycle bin, which was not excluded from the backup... Once again the VM ran out of space, and you know the consequences. We now exclude the recycle bin from the backups.
When the TLS certificate on our AD expired, we were unable to connect to our services (they validate the LDAPS connection). To prevent this from happening again, we have since added verification of all our certificates to the monthly exercises.
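The certificate check from the last item can be partly automated. The following Python sketch reads the expiry date of the certificate presented by a TLS service and converts it to a number of remaining days. It is an illustration under assumptions, not our actual tooling: the host name is invented, port 636 is the usual LDAPS port, and the machine running the check is assumed to trust the issuing CA (otherwise, or if the certificate has already expired, the handshake raises an `ssl.SSLError`, which is itself a useful signal).

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a certificate 'notAfter' string into a number of remaining days.

    not_after uses the format returned by getpeercert(),
    for example "Jan  1 00:00:00 2022 GMT".
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host: str, port: int = 636, timeout: float = 5.0) -> str:
    """Connect to a TLS service and return its certificate's 'notAfter' field."""
    context = ssl.create_default_context()  # verifies the chain and the host name
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Hypothetical host; replace with each service listed in the exercise.
    remaining = days_until_expiry(fetch_not_after("ad.example.lan"))
    if remaining < 30:  # arbitrary threshold, roughly one monthly exercise ahead
        print(f"Certificate expires in {remaining:.0f} days: renew it now")
```

With a monthly exercise, a warning threshold of at least one month gives you one full cycle to renew before anything breaks.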

To get an idea of the "cost" of these exercises, we time each of the subtasks (Kanboard takes care of this automatically when we check the box). Last month, it cost us 0.44 hours (or 26 minutes, including 15 spent waiting for duplicate to list remote files).

And now?

Over a year, our monthly exercises cost us less than 2 days (of 7 hours), or 1% of a full-time year (of 218 days). I'll let you do the math in euros or dollars.
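For those who want to redo the calculation with their own figures, here is the arithmetic as a small Python sketch. The 7-hour day and the 218-day year come from the text; the function names are ours, and using last month's 0.44 hours for every month is only an assumption, since some exercises take longer.

```python
HOURS_PER_DAY = 7             # length of a working day, as used in the text
WORKING_DAYS_PER_YEAR = 218   # full-time year, as used in the text


def yearly_cost_days(hours_per_exercise: float, exercises_per_year: int = 12) -> float:
    """Yearly cost of the exercises, expressed in working days."""
    return hours_per_exercise * exercises_per_year / HOURS_PER_DAY


def share_of_full_time(days: float) -> float:
    """Fraction of a full-time year represented by a number of working days."""
    return days / WORKING_DAYS_PER_YEAR


if __name__ == "__main__":
    typical = yearly_cost_days(0.44)  # 12 months at last month's duration
    print(f"{typical:.2f} days, {share_of_full_time(typical):.2%} of a full time")
    print(f"upper bound: 2 days, {share_of_full_time(2):.2%} of a full time")
```

Twelve exercises of 0.44 hours add up to 5.28 hours, about three quarters of a day. Even the upper bound of 2 days stays just under 1% of a full-time year.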

It’s not much, especially when compared to the benefits in terms of experience gained and reduced consequences in the event of failure. It will take well over two days to rebuild what can be rebuilt, and to mourn the rest.

In exchange for these few hours, we are much more relaxed about our ability to resist and survive a big blackout.