
Fixing a Stability Issue in the Blog Server

The biggest fans of this blog (or just the people usually browsing between 6:00-7:00 UTC) may have noticed a frustrating issue where the site occasionally loads really slowly. Or, in the worst-case scenario, refuses to load at all and returns only an error page with a message about a failing database connection.

Well, at least the message is short and to the point.

Investigating the Issue

This issue started to occur sporadically in August and became consistent in October. And I started to consider fixing it in November. This kind of relaxed response time is common for hobby projects. The first obvious step was to check what was happening on the server when load times got longer. Once I noticed that the site was slowing down, I checked the monitoring stats. The graphs showed that both CPU usage and disk reads were spiking: CPU was peaking at 80%, and disk reads stayed over 100 MB/s for more than 15 minutes. The 7-day monitoring graph showed that this kind of spiking was happening almost daily.

Not every day though, and some spikes are taller than others.

Investigating the system log and comparing it with the time stamps of the peaks revealed the following cycle:

  1. One of the two daily apt package manager upgrade services gets started
  2. The CPU and disk activity starts ramping up
  3. The system starts heavy swapping and the website load times get longer
  4. About 15-20 minutes after the apt service starts, the OOM (out-of-memory) killer kicks in and kills MySQL. A few other services may time out or get killed in this phase as well.
  5. MySQL restarts and the blog works again

I started investigating why the daily apt services seemed to consistently cause the server to run out of memory. The first of the two services downloaded the packages for upgrading, and the second one installed the downloaded upgrades. After trying out a few different things, I realized that just installing or removing a package caused the server to randomly run out of memory if either of the apt services had started a few minutes earlier. It’s fun to do tests like this on a live server.
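For the curious, the digging was done with the standard systemd tooling. A rough sketch of the commands, assuming the usual apt-daily units found on Ubuntu/Debian:

    # List the timers that trigger the two daily apt services
    systemctl list-timers apt-daily.timer apt-daily-upgrade.timer

    # See when the services actually ran and what they logged
    journalctl -u apt-daily.service -u apt-daily-upgrade.service --since "2 days ago"

    # Look for OOM killer activity in the kernel log around the same timestamps
    journalctl -k --since "2 days ago" | grep -i "out of memory"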

Fix Attempt 1: Installing System Upgrades

Some further investigation into the daily apt services revealed that the unattended upgrades had been failing for a long time. It seemed like the MySQL apt repository was missing keys, causing the apt update to fail. Also, it seemed like some upgrades required input from the user to configure packages. So I took a server backup and started installing the upgrades manually.

This isn’t foreshadowing. At least yet. Let’s see in three months.
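For reference, checking why the unattended upgrades kept failing looked roughly like this, assuming the stock unattended-upgrades setup and its default log location:

    # A missing repository signing key shows up when refreshing the package lists
    sudo apt update

    # See what the unattended upgrades have (not) been doing
    less /var/log/unattended-upgrades/unattended-upgrades.log

    # Dry-run the upgrade with verbose output to see where it bails out
    sudo unattended-upgrade --dry-run --debug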

Out of 141 packages, 127 wanted an update, which is “quite many” (to put it lightly). Fortunately, I have made no promises about the availability of this site, so I could liberally reboot the server as much as I needed for the upgrades. I was hoping that installing these pending upgrades would clear some cache and reduce the RAM usage of the apt services. And in the worst-case scenario, it wouldn’t fix the issue, but I would at least get an up-to-date server, so upgrading seemed like a win-win.
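The manual round itself was nothing fancy, just something along these lines spread over a few reboots:

    # See what is pending
    apt list --upgradable

    # Refresh the package lists and install everything, answering the
    # occasional package configuration prompt by hand
    sudo apt update
    sudo apt full-upgrade

    # Reboot whenever an upgrade asks for it
    sudo reboot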

In addition to the upgrades I also installed an improved DigitalOcean monitoring service. This actually revealed something that should have been quite obvious from the beginning. The new monitoring service monitored RAM usage (the old one did not), and I could see that the server was using 90% of the RAM when it was idle. In hindsight, checking the RAM usage and monitoring how it gets consumed should have been the very first step when investigating an OOM issue.
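In other words, the whole investigation could have started with a couple of standard one-liners:

    # Overall memory situation: used vs. cached, and whether swap is already in use
    free -h

    # Watch swapping activity over time (the si/so columns)
    vmstat 5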

Needless to say, 90% RAM usage is not good. I guess this happens because I’m running this blog on a low-end instance that doesn’t have much RAM (I actually checked the minimum requirements of the OS, and the instance barely meets even those). However, before investigating the insufficient RAM, I wanted to first see if the upgrades would fix the original OOM issue. They did not.

So, the problem started to look like a case of insufficient RAM. To fix this kind of issue, there are usually two options: scale the server up or scale the services down. In other words, throw money at the problem, or try to optimize the server. Being a cheapskate, I chose the latter option. I usually work with embedded things, so “just adding more RAM” feels like cheating. Also, considering that on average I have about 20 daily visitors, beefing up the server seemed like the wrong direction.

Fix Attempt 2: Optimizing RAM Usage

I used top to check the biggest memory consumers and found two RAM gluttons: MySQL and Apache. Both are required for the well-being and existence of WordPress (the platform this blog runs on), but perhaps they could be optimized. At least they used to work on this server before, so perhaps they could be configured to work once again.
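top did the job interactively, but the same snapshot can be pulled with ps, sorted by memory usage:

    # Processes sorted by resident memory usage (header plus the top ten)
    ps aux --sort=-%mem | head -n 11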

In the case of MySQL, there was a single mysqld daemon consuming plenty of RAM. Some googling revealed that disabling the performance schema could help lower memory consumption. It seems to be a feature that collects performance statistics about the MySQL server itself. Considering that I’m using WordPress and hope to write zero direct queries against the database, it seemed non-essential. Perhaps such stats could be useful when developing new software that uses MySQL. Disabling the performance schema lowered the mysqld RAM consumption from 39% to 19%.
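For reference, the change itself is a single line in the MySQL server configuration (the path below is the usual one on Ubuntu; it may differ depending on how MySQL was installed), followed by a restart of the mysql service:

    # /etc/mysql/mysql.conf.d/mysqld.cnf (path may vary by installation)
    [mysqld]
    # Turn off the performance schema instrumentation; WordPress doesn't need
    # the statistics it collects, and it costs a surprising amount of RAM
    performance_schema = OFF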

In the case of Apache, there were ten workers, each consuming about 5%-8% of RAM. If my math is correct, in the last month I had about 0.00083 concurrent visitors on average. With that in mind, ten workers felt a bit excessive, and I scaled their number down. I think it could be lowered even more, but I wanted to keep enough workers in case there’s a sudden influx of readers.

Aaaany day now.
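The worker change itself is again a config tweak. A sketch assuming the mpm_prefork module that mod_php typically requires, with illustrative values rather than the exact ones I settled on:

    # /etc/apache2/mods-available/mpm_prefork.conf (illustrative values)
    <IfModule mpm_prefork_module>
            StartServers             2
            MinSpareServers          2
            MaxSpareServers          4
            MaxRequestWorkers        5
            MaxConnectionsPerChild   1000
    </IfModule>

Apache then needs a restart to apply the new worker limits.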

Conclusion

These actions took the idle RAM usage from 90% down to 60%. Since that drop, I haven’t seen the OOM killer get activated in the past seven days, so I hope the issue is fixed. 60% is still a bit more than I’d like, but as long as the server stays stable and the performance doesn’t notably degrade, I think that’s an acceptable percentage. Also, sticking with the cheaper virtual machine saves me $6 a month!

The root cause of the increased RAM usage is still a bit of a mystery. I suspect that installing WordPress plugins caused it, because I was installing SEO plugins around the time the issue became more prevalent. If there’s one thing I’ve learnt from this, it’s that updates should be checked manually every now and then, and system resource consumption should be monitored constantly.