Hello, Jamie Spittal here, I'm the engineer in charge of our infrastructure and hosting and I'd like to go into some details about the operational problems that occurred on March 2nd. We'll provide a quick overview, go over the timeline of what happened, what the immediate fix was, and future solutions. Let's take a look:
On March 2nd around 8pm UTC our database CPU went up to 100% for an extended period of time which caused our services to be extremely slow for some users and to throw errors for others.
A bit of context: SpringboardVR uses MySQL databases managed by Google Cloud, this service is named CloudSQL. CloudSQL has a feature that allows for us to create what is known as a 'read replica'. Functionally this means we have one database that we write changes to and multiple databases that we read entries from. CloudSQL automatically keeps these databases in sync with each other allowing us to handle higher load.
We believe that the software that keeps these multiple databases in sync may have failed because a critical cron job was running at an incorrect time which also used a significant amount of database resources.
Timeline (in UTC)
Jordan, one of our founders, reports in our tech-support slack channel that an operator is having issues with the Monitor.
Early in SpringboardVR's history the development team experienced what we referred to as 'growing pains'. Those pains caused quite a bit of downtime at peak hours. To address those issues we took major steps and one of those steps was implementing various alerts and early warning systems.
Since we implemented these alerts it has become increasingly rare that an operator reports an issue before the dev team is actively mitigating risk. So right off the bat we knew this was an exceptional event...
Triage begins: Matthew our CTO performs the baseline "does SpringboardVR still work?" check.
The first gut check for any reported downtime is to load up the website. In Matthew's case, the website loaded and he was able to get access to the station monitor. From there he asked the tech support team for more details on what operators were experiencing.
Dylan from the support team provides a detailed account of the issues.
Using this new information Matthew ascertains what the issue is. It's the database, seems like the CPU has been hit harder than usual.
This is why our alerts weren't being triggered and safety mode wasn't activated on stations. The Database CPU was hitting a limit, which means the application was going through operations slowly, but it wasn't fully down. Some of our operators were seeing errors because if any operation takes more than 60 seconds, the application will throw error, this is called a 'timeout'. The 'timeout' errors are also why some operators were reporting downtime, and some weren't. We were also seeing issues where the read databases couldn't keep up with the write database so whenever a new Booking was created it would throw an error because the read database couldn't find it after it was created. At this point the database was trying it's hardest to get everything done, but it just simply didn't have the power.
Matthew attempts to clear long running operations, and investigates if there's any code that would cause extra stress on the database. After a few solutions are tried he resorts to restarting the database, which cuts service for all operators for around 1 minute.
Before we attempt any fix that would cause more service disruption we try to exhaust any options that will address the issue and keep all services alive. Sometimes this isn't possible so we resort to the tried-and-true method of turning the database off and turning it back on again.
Database comes back alive. Seems to be stable. Matt keeps a keen eye on the database.
Matt observes that the database is once again having CPU issues. Matt begins creating another 'read replica'.
A quick and easy way to give a MySQL database more power is to create a clone of it (called a 'read replica') and balance connections betweens those clones. In this case Matt was creating the database's third read replica.
The new read replica is not helping... Matt tries to restart the database once again. While this is going on he proceeds to create a brand new database with significantly more CPU power, as a failsafe.
Matt creates a brand new Google Cloud SQL database with more raw power then migrates over the existing database. He updates the applications to connect to the new database.
After many hours of stability, and the peak hours come to a close we are confident the issue is over. Jamie begins the audit to find out exactly what happened and why.
What Happened and Why?
Since our move to Google Cloud we have state of the art monitoring and logging. This gives me extremely detailed historical insight into everything that is going on with our servers. Our first step in any audit is to check out our server monitoring dashboard to see if anything really pops out as anomalous. From those anomalies we define a rough time frame for when we think the error started then dive deep into our server event logs to look for the exact moment when there was failure.
The dashboard and logging made things extremely clear: a resource intensive cron job running at an inopportune moment, causing the the CPU of our database to hit 100% usage.
Related to the CPU hitting 100% it caused the database replication between our "read" and "write" databases to fall out of sync. This caused most of our platform to appear to be working properly but anytime anything was created or edited (such as creating a new booking in the Monitor) an error was thrown.
How did this happen?
If you're reading this the question you might be asking yourself is "Why didn't you see this coming?". The answer is that we knew that our database was our bottleneck for speed and stability (more on that in a bit) in our setup. Knowing this we projected based on our growth that we would have to upgrade our database power sometime in the middle of this year.
So, now you might be wondering "Why didn't you upgrade the database preemptively?". We monitor the statistics on all of our services regularly but during the last few weekends - and the incredibly busy weekends over the Christmas holidays - our database was well within our expected range with it rarely going above 50% CPU or memory usage.
How we are solving this in the long term?
Technology companies in general have been trying to solve database bottleneck problems for a long time. These issues effect lots of large online companies - include Twitter, Reddit, etc
Almost all of our services will automatically scale based on traffic but for many, many reasons this is very hard to achieve with MySQL databases. It mostly has to with the fact that we need to maintain data integrity, and automatic scaling of databases can often lose data, which we certainly don't want.
What we're doing right now is actively looking for solutions that would allow for automatic scaling of our database. We've looked at solutions like using ProxySQL to horizontally scale our MySQL database which is more of a permanent solution than read replicas. We've looked at Vitess which is a self hosted solution for scalable MySQL, this product is used by massive companies like YouTube. We're even considering moving our database to Google Cloud Spanner which is a speedy, globally deployed database, the downside with this solution is that it isn't MySQL compliant, which would be a massive change for our product.
When it comes to the cron job running at an inopportune moment, we've already implemented more logging and alerting specifically for cron operations. This will give us more insight and will alert us to take action earlier if this ever happens again.
We are also looking into our status reporting and the activation of Safety Mode to make sure that we can catch these types of issues faster in the future.
Because the solution for this issue was creating a brand new database there was a need to migrate data from the old database to the new. Migrating data to and from databases is a slow and difficult process, as it means transferring gigabytes of data, this can often take several minutes to accomplish. The major caveat here is that the data being transferred over reflects the state of SpringboardVR at a single point of time. While this migration to the new database is in progress, 'new data' is still being created on the old database. This 'new data' was not part of the migration and will not be reflected in the new database. This is why some operators experienced data loss.
We were also having issues with our databases not being in sync with each other so we had to decide which database to use as a starting place for our data migration. Because of the issues each database had slightly different information but all of them had a LOT of bookings - shown below. With the Monitor being online but having issues everyone was constantly creating new bookings at a much higher quantity than usual.
We always strive to provide maximum performance, and reliability. We know that the database is in need of a more permanent solution and it's a high priority task for us. So, rest assured we are doing everything we can to make sure an event like this never happens again.