At 12:35 PM on October 28th, our main application platform began returning errors at an elevated rate. One of our application servers was producing HTTP 500 responses and Nginx 504 (Gateway Timeout) errors, which caused some requests to fail unexpectedly.
Our initial investigation found a memory spike on this server: usage jumped almost instantly from a steady baseline of 6-8% to 99.99%, at approximately the same time the increased error rate was first noticed.
Upon further investigation, we found that disk usage on the server had reached 97%; historically it had hovered around 45%. At this point we suspected that the two issues (memory and disk space) were related and set out to determine which files and directories were responsible for the disk usage. We identified two abnormally large items: an application log file and a cache directory. We removed both and confirmed that disk usage was back within its normal range. Shortly thereafter, memory usage on the server also returned to its normal range and no further errors were encountered.
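For readers who want to run a similar check, the following is a minimal Python sketch of how the largest files and directories under a given path could be found. It is an illustration only, not the tooling we used during the incident; the function names `disk_usage_percent`, `tree_size`, and `largest_entries` are hypothetical, and only the Python standard library is assumed.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def tree_size(path):
    """Return the total size in bytes of all files under `path` (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_entries(root, top_n=10):
    """Return up to `top_n` (size_in_bytes, path) pairs for the largest
    immediate children of `root`, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = tree_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top_n]
```

Calling `largest_entries` on a suspect directory and then on its largest child narrows the search one level at a time, while `disk_usage_percent` can confirm that usage has dropped after cleanup.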
Both the application log and the cache directory are enabled by optional configuration settings, and neither is needed in our production environment. The cache directory allows our deployments to complete faster, and the application log is redundant because logging is handled by a third-party provider. Moving forward, all deployments will disable caching by default and no application logs will be written to the servers.