This is a really exciting post for me to write. For the first time in quite a while, I’m writing about something technical! At HubTran, we’ve found supportability to be an incredibly important part of writing software. We have a BackupIntegrationFile model that stores every file we receive and send as part of our third party integrations. This includes flat files, JSON documents and anything else that’s passed back and forth. We’ve found it to be a life saver when troubleshooting issues with our partners. The downside of this model is that it’s huge! We have hundreds of millions of records stored. We had a purge process that would clear out old records, but recently it just stopped working. Figuring out why was a lot of fun!
For a while, we just ignored the failure because it wasn’t that critical. We could still operate, we just had more data than we needed. Recently, we noticed that searching for specific files was getting slower which pushed us to finally investigate. In theory, this purge process should be quite simple. All it needs to do is to go through the table in batches and delete any purgeable data. We made all sorts of changes to the Ruby code trying to optimize it to no avail. Finally, I changed the batch size used to fetch data to an incredibly small number and noticed something that would send me in the right direction. Even deleting just 10 records was slow. How could this be?
It turned out that there was a foreign key on a model that referenced our backup_integration_files table. On every delete, the database had to go out and query this key to see if there was a record that references the row to be deleted This by itself isn’t that big a deal, but there was a problem. In the referenced table, the backup_integration_file_id column wasn’t indexed. That meant that every delete led to a full table scan of a table with almost 1,000,000 rows. When we first built this process, that table was small enough where the scans weren’t noticeable. Over time, however, the scans made the whole process slow to a crawl.
While this was a bit of a challenge to debug due to the hidden nature of the slowdown, the fix was simple. I added an index on the foreign key column and our purge performance returned to normal. It felt good to dig into a technical issue again.