On Prismo data loss

How i tricked myself into thinking everything is safe and well configured.

Sooooo - it happened. Nightmare of all the developers on this planet. Entire production database of prismo.news has been accidentally wiped out and it was fully my fault. In this article, i'm gonna try to analyze what happened, how i got here and what can i do to prevent such stuff from happening ever again. This post is gonna be totally technical so if you don’t care or don’t have time to read that, long story short is - make damn sure your backups are properly configured!

As a side note - i'm definitely not a writer nor a native speaker so please be polite. That's probably the longest thing i wrote since high school :)


Server setup

Back in the days (about a year ago) when Primo was just a dozen of days old, I decided to go for hosting it on DigitalOcean droplet. I was using DO for everything hosting-related so far so that was a natural choice.

I’ve set up a VPS, provisioned first Primo deployment, bought an account on wasabi for blob storage and, last but not least, configured DO’s automatic daily backups. Everything was working perfectly well for month or two, there was a shiny new backup waiting to be restored in cause of any failure every morning.

Then, out of nowhere, completely unnecessary change occurred - I just discovered a hosting company called UpCloud (btw. pretty popular in the fediverse afaik). Pricing tier and VPS params were pretty much the same but they had this super intriguing custom SSD solution for fast IO I wanted to test badly. And I did that - I decided to move primo.news to UpCloud. There was nothing really bad about that decision apart from the fact i fu*ked up the operation by forgetting to take the crucial step.

I moved the files, I cloned the database, I migrated the secrets but…I forgot to setup daily backups for my new VPS in UpCloud panel. How is that possible? How could i forgot such thing? That question will remain unanswered as I have absolutely no idea.

Git workflows

After several months of prismo.news working totally fine, another milestone has been reached. Till that day, primo development was based on a simplest workflow possible. There was a master branch with the most current version of codebase and merge requests with all the features being implemented.

Such workflow is fine when there is one developer working on a repo and if there is no much traction in the project + you don't need to think a lot about versioning and hotfixing.

After short discussion, we decided it would be a good idea to move our workflow to a thing called Git flow. I'm not gonna elaborate on that as there are a lot of better sources of knowledge about that on the internet but, in short, instead of having one single main (master) branch, repo has two main branches

  • master for current stable codebase
  • dev for semi-stable more up-to-date version of the codebase.

Additionally, merge requests are being merged into dev while merges from dev to master are reserved for version releases.

Git flow

When we switched to git flow, Prismo 0.4.0 has been already released and i was working on a 0.5.x feature called Design system. At this point, i already knew it will take me a lot of time to complete it so there was no sense in changing the branch from which prismo.news instance was being deployed from. 0.4.0 was on master, prismo.news was deployed from master and i was standing in front of a potentially month-or-so-long feature so i did not bother to switch production instance to dev.

Docker setup

Prismo's officially supported way of hosting and deploying is using a provided docker-compose.yml config file. For simplicity's sake.

Every technical dependency of Prismo is handled by that file - which is Redis and Postgresql at the time of writing this. In order to make the data of such dependency persistent between deploys and restarts, user needs to mount the data volume of such container into the host filesystem. That's an optional commented-out line in the docker-compose.yml file both for redis and postgresql services (it's commented out as each person would probably want to mount it in his own path of choice).

Sadly, the default value for Prismo docker-compose config file is ./postgres for the PostgreSQL (PostgreSQL holds entire data of everything posted on Prismo) which resolves to the directory this file is in - which is a root of the application.

As prismo.news is being deployed from the git master (instead of prebuilt docker images), entire application directory on production environment is controlled by git. Including the ./postgres directory. Well, at least that's how it was back when i was provisioning Prismo on a DigitalOcean for the very first time. Back then, i saw ./postgres directory is not listed in .gitignore file so git is listing it as untracked directory - i hotfixed it directly on the server, added the missing line and, it seems so, i forgot to commit that change to the main repo. So the .gitignore change was living only on production server.

0.5.0.rc1 💥

After a month of struggle with the new custom frontend framework feature - i finally merged it into dev just like the new git workflow told us to do. I marked this release as 0.5.0.rc1 and wanted to deploy it to production as fast as possible so users could test it and report any bugs.

I ssh-ed into the server, checked what's going on in there and found out that, just like i wrote earlier, prismo.news is still using master branch. Without any second thoughts (i had backups configured through the hosting company UI right? what can go wrong?), i triggered git pull and git checkout dev and...that's precisely when all hard work done by prismo.news users through a year has been wiped. In a blink of an eye.

I don't fully understand why that happened (will investigate it more, i promise) but, as i wrote, ./postgres directory containing entire data of prismo.news has been added to .gitignore locally on master branch so it was properly ignored since i did that fix. However, after running git checkout dev, codebase has been changed to the version pulled from the repo and there were no .gitignore entry about the ./postgres directory locally anymore. That's the part i don't fully get but it seems the checkout removed the directory for some reason and running docker-compose up after that recreated the directory from scratch. If there is anyone reading this who understands what happened or what might happened, i would be more than happy to read what you have to say regarding that incident.

I was sure this situation can be fixed by restoring a morning backup but...it turned out i forgot to migrate the backup settings from DigitalOcean to UpCloud months ago while being totally sure they are working fine.

Summary

What i screwed up:

  • I forgot to configure backups on UpCloud
  • I did not made a manual backup of database directory before the deploy/branch switch
  • I forgot to commit ./postgres to .gitignore
  • I kept the database directory mounted along all the git-controlled application files

Upcoming difficulties

There are two main difficulties on our way right now. First one is a problem of duplicated ids of prismo.news posts and comments. They were just a plain incremented integers propagated to entire fediverse. Starting from scratch means the ids are starting from 0 again which creates a duplication across the fediverse and might cause data desynchronization on the entire network. Simplest solution would be to bump the database indexes to not start from 0 but from the last known ID but that does not solve the core of this issue. Such incident can happen to anyone in the future, people are forgetting things, databases are being deleted by accident - Prismo needs to be prepared for that. That's why we're gonna move from integer IDs to GUIDs which are a unique sequence of 32 chars (read more here). This way, even after starting from scratch, no duplication could happen.

Second difficulty, much heavier than the first, is a loss of your trust and loss of valuable content posted prismo.news through this year. I deeply regret of what happened and i hope we can work together and rebuild this cool place without looking back. I'm so sorry this happened.

PS. Proper backup solution is gonna be the first thing i will configure on the new setup. PS2. prismo.news is not gonna be revived until this ID/GUID issue is resolved.