2017-06-14 Data Extinction Event Investigation Report

Introduction

My laptop's home partition uses Btrfs together with Snapper. Snapper creates periodic volume snapshots, which I used as a backup. This was well suited to guarding against accidental file deletion. However, I recently put some bulky files (8x ~2GB) on my home partition. Somehow this led to occasional 100% CPU usage by one of the Btrfs-related processes. I decided to clear my volume snapshots to stop this.

I decided to manually delete all the snapshots by running the following commands:

cd /home/.snapshots
for i in *; do btrfs subvolume delete $i/snapshot; done
rm -rf *

The problem was that the currently active default subvolume was mounted inside a folder within /home/.snapshots too! My files were gone by the time I realised what had happened.
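
The dangerous step in the commands above was the final rm -rf *, which recurses into anything still present in the directory. A more cautious cleanup can be sketched as below. It only prints the commands it would run, so that they can be reviewed first. The function name plan_snapshot_cleanup is made up for illustration, and the layout (numbered directories, each holding a snapshot subvolume and an info.xml file) is an assumption about how Snapper arranges /home/.snapshots.

```shell
#!/bin/sh
# Print, without executing, the commands that would remove Snapper-style
# snapshots below a directory. Assumed layout: <dir>/<number>/snapshot
# plus <dir>/<number>/info.xml. plan_snapshot_cleanup is a hypothetical name.
plan_snapshot_cleanup() {
    dir=$1
    for d in "$dir"/*; do
        name=${d##*/}
        case $name in
            ''|*[!0-9]*) continue ;;      # ignore anything not named with digits only
        esac
        [ -d "$d/snapshot" ] || continue  # ignore directories without a snapshot
        printf 'btrfs subvolume delete %s\n' "$d/snapshot"
        printf 'rm -f %s\n' "$d/info.xml"
        # rmdir refuses to remove a non-empty directory, so leftover data is never touched
        printf 'rmdir %s\n' "$d"
    done
}
```

Running plan_snapshot_cleanup /home/.snapshots and reading the output before executing anything gives a chance to spot a snapshot that should be kept. Running btrfs subvolume get-default /home beforehand shows which subvolume is the default, and Snapper's own snapper delete command is another option.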

Background

Well, it is clear that I know how to do backups, as evidenced by this ¹⁾ and this ²⁾. I also introduced my mum to Unison ³⁾. The problem was that I had been too lazy to back things up. I think it is important to analyse what led to the decision not to back up my home partition properly, as this incident has had a mild impact on the progress of my PhD.

To understand why I decided not to do backups, I think I need to look at my historical and current data handling practices and their consequences. So I will start by looking at similar events that happened in the past. I will also include an interesting Bitcoin-related story that I heard during my time in York.

Historical extinction level events

Around 2007, my family's IBM Thinkpad R51 failed to perform a partition move/resize. This led to the loss of my family's photo archive. At that time, we only had one external hard drive, and the photo archive wasn't on it. The crisis was partially resolved using one of those file recovery tools, which worked by dredging through whatever was left on the hard disk and looking for JPEG headers.
In 2014, when I was commissioning my Lenovo Thinkpad T440p, I encountered a memory corruption bug. Both superblock copies of the Btrfs root partition got wiped. At that time, I didn't have a separate home partition. This meant that I lost the system image (“Cortana”) which I had been using for two years. No data was recovered, as I had just graduated from York.

Notable near-misses

At some point, I managed to delete all the LVM partitions on this server (gabriel.fangfufu.co.uk) while attempting to resize a partition. Luckily I managed to recover the LVM partition configuration from the backups in /etc/lvm/backup/. Had I rebooted the server before recovering the configuration files, all data on this server would have been lost.
There have also been numerous occasions on which I managed to delete a whole partition and ended up having to use TestDisk ⁴⁾ to scan the whole hard disk and recover the partition table.

Gary's Bitcoin story

When I was a third-year undergraduate at the University of York, Ian Gray told us that his friend (I believe the name is Gary?) managed to delete his Bitcoin wallet, which contained a significant amount of Bitcoin (in the range of £600 or $600). He ended up writing a Python script to dredge his hard drive for the signature of his Bitcoin wallet. This story interested me because I had learnt how to write a parser in my second year, and because he could not have used existing software such as TestDisk for that particular task: none of the existing file recovery software had the Bitcoin wallet's signature built in. I suppose the point of this story is that I am not the only one who lives dangerously.
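
Both this story and the JPEG recovery earlier come down to scanning raw bytes for a known signature. A minimal sketch of that idea follows, assuming GNU grep. The helper name find_signature_offsets is hypothetical, and the example uses the JPEG start-of-image bytes FF D8 FF, because the wallet's signature is not given here and should not be guessed.

```shell
#!/bin/sh
# Print the byte offset of every occurrence of a signature in a file or
# raw device. Assumes GNU grep; find_signature_offsets is a hypothetical name.
# $1: the signature as raw bytes (must not contain NUL or newline)
# $2: the file or device to scan
find_signature_offsets() {
    # -o: print only the match, -b: prefix it with its byte offset,
    # -a: treat binary data as text, -F: fixed string rather than a regex.
    # LC_ALL=C makes every byte count as a valid single character.
    LC_ALL=C grep -obaF -e "$1" "$2" | LC_ALL=C cut -d: -f1
}

# JPEG files start with the bytes FF D8 FF (octal 377 330 377).
jpeg_signature=$(printf '\377\330\377')
```

Calling find_signature_offsets "$jpeg_signature" disk.img, where disk.img is a placeholder for an image of the disk, lists candidate offsets. Carving the actual files out from those offsets is a separate step, and scanning a copy of the disk rather than the live disk avoids making things worse.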

Dangerous cultural practices

It is clear that I understand the danger of losing data while performing risky operations. However, it seems that I always get away with it, in the sense that the mission-critical files always have outdated backups somewhere.

I think I basically grew complacent. I believe I have picked up a lot of bad habits. Rather than changing them, I built myself layers of defences against them.

Rather than giving up shift-delete, I set up volume snapshots so that I could delete files liberally. Rather than backing up data before changing the partition layout, I rely on the fact that it is pretty easy to revert changes to LVM partitions by using the LVM configuration backup.

I seem to have been ignoring the danger of losing data, because the benefits of getting things done quickly have blinded me.

Backup solutions that are being used

The following backup solutions are currently being used:

Mission-critical source code (for both software and documents) is checked into repositories on fangfufu.co.uk. The only problem is that sometimes I am too lazy to commit and push. Quite often I feel I have not made enough progress to warrant a commit.
Btrfs volume snapshots are being used to avoid accidental file deletion.
Old archival files are synchronised to fangfufu.co.uk using Resilio Sync. The plan was to synchronise them with a Raspberry Pi in China; however, that Raspberry Pi went offline.
The same old archival files are also pushed to Google Drive using the Google Drive CLI Client ⁵⁾.
Ad-hoc copies of critical documents that are not often used are stored in Google Drive.

Backup solutions that can be considered

My local postdoc has two hard drives in his workstation, and he uses rsync to copy files from the main hard drive to the backup hard drive on a daily basis. My T440p will have three hard drives when it comes back from repair.
Rather than using rsync, I can just send the Btrfs volume snapshot diffs to my 1TB secondary hard drive. I am already taking Btrfs volume snapshots anyway.
btrbk seems to be a bit better than Snapper, in the sense that it actually supports automatically sending the snapshots away.
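
The snapshot-diff idea above can be sketched with btrfs send and btrfs receive. This is only a sketch under assumptions: all paths are placeholders, the destination must be a mounted Btrfs filesystem, backup_snapshot is a made-up function name, and with DRY_RUN=1 it only prints what it would run.

```shell
#!/bin/sh
# Take a read-only snapshot of a subvolume and send it to a second Btrfs
# filesystem, incrementally when a parent snapshot is given.
# backup_snapshot is a hypothetical name; all paths are placeholders.
backup_snapshot() {
    source_subvol=$1   # subvolume to back up, e.g. /home
    new_snap=$2        # where to create the new read-only snapshot
    dest=$3            # mounted Btrfs filesystem on the backup drive
    parent=${4:-}      # optional: earlier snapshot already present on both sides

    if [ -n "$parent" ]; then
        set -- -p "$parent" "$new_snap"   # incremental: send only the difference
    else
        set -- "$new_snap"                # first run: send the full snapshot
    fi

    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'btrfs subvolume snapshot -r %s %s\n' "$source_subvol" "$new_snap"
        printf 'btrfs send %s | btrfs receive %s\n' "$*" "$dest"
        return 0
    fi

    # btrfs send only accepts read-only snapshots, hence -r.
    btrfs subvolume snapshot -r "$source_subvol" "$new_snap" || return 1
    btrfs send "$@" | btrfs receive "$dest"
}
```

For example, DRY_RUN=1 backup_snapshot /home /home/.backup/new /mnt/backup /home/.backup/old prints the two commands for an incremental run without touching anything. btrbk automates this kind of workflow, so the sketch is mainly useful for understanding what happens underneath.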

Future action plan

~~Commit and push source files more frequently and more diligently.~~
Expand the coverage of Resilio Sync.
Push more data into Google Drive; after all, I have unlimited Google Drive storage space as a York alumnus.
~~Investigate btrbk.~~