
Downtime Report: Detail of downtime, October 24-29

Earlier in the week: Systems had been attempting to upgrade our original virtual machine host, and this was taking significant time. There had been reports of intermittent 503s, but these were put down to our having only a single front end server. Our firewall was adjusted to keep sending requests to the front end machine even when it was not responding quickly enough.

Saturday 24th October, 11PM UTC: We notice strange load patterns on our firewall, and reports of 503s start coming in. We contact our ISP to check that there are no external issues and that we are not seeing significant traffic from any single host.

Our database load is varying in a similar way to our firewall traffic. Note: in these early graphs we can see that throughput is oscillating between 450,000 cpm and 750,000 cpm; a more normal load at this time is about 550,000 cpm. More importantly, our average response time is as high as 15ms; our system normally has an average response time of 0.6ms and anything over 1.75ms is likely to cause significant issues.

Fanlore and other internal services were also affected, which led us to suspect an issue with the front end or the database, as those are the components shared between services.

Sunday 25th October:

After experimenting to find the current maximum capacity of our servers, we increased the number of application processes running on each of them, so that more resources would be available if a page was delayed by the database taking longer than expected to return a result. The connection limits of supporting systems such as memcached and MySQL then needed to be raised to match. As the new processes need time to warm up, it was hard to tell immediately whether this increased the stability of the system.
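As a rough sketch of the kind of adjustment this involves (the commands, flags and numbers below are illustrative assumptions, not our actual settings): every extra application process holds its own connections, so the supporting services must accept at least processes-per-server times number-of-app-servers connections, plus headroom.

    # Illustrative only: raise connection limits to match the larger process count.
    # memcached: -c sets the maximum number of simultaneous client connections
    memcached -d -m 1024 -c 4096
    # MySQL: raise max_connections so every application process can connect,
    # leaving headroom for replication, monitoring and admin sessions
    mysql -e "SET GLOBAL max_connections = 1000;"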

We logged a support ticket with our MySQL support company to ask for help isolating a potential issue in the database.

We isolated a number of long-running queries, and our contractor of our own started to chase down some of their causes and look for solutions.

We installed MONyog, which is included in our MySQL support contract, and examined our databases with that tool to see whether anything stood out. A number of minor issues were picked up and we made those changes.

We decided to turn the tag wrangling interface off at peak times to reduce the strain on the database server.

We performed a transparent failover of our firewall servers, rebooting the systems in case there was an issue on the firewalls that we had not identified.

We brought the old firewall back into service; as the front end IP addresses had changed, this took some time. Once the old firewall was reconfigured we put both the new and the old systems back into service, load balancing traffic between the two in case the new firewall was struggling with the load. The new firewall uses haproxy, which is an application proxy, while the old system uses reverse NAT, so the two are likely to have different failure modes. However, the problem still persisted, which made the firewalls a much less likely source of the issue.
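For illustration, a minimal sketch of the new firewall's role as an application proxy; the names, addresses and options are assumptions made for the sketch, not our actual configuration. The old firewall, by contrast, rewrites packets with reverse NAT and never terminates the HTTP connection itself.

    # Sketch of the new firewall terminating HTTP and proxying to the front end.
    # Names and addresses are illustrative.
    frontend ao3_http
        bind *:80
        default_backend ao3_front_end

    backend ao3_front_end
        option httpchk GET /
        server front01 10.0.0.11:80 check   # the single front end machine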

We adjusted the load between our different MySQL servers and tried a number of different configurations. We wrote a script to collect diagnostic information for our MySQL support company and sent them the results. We made further attempts to separate the read and write workloads across our database servers.

We stopped serving pages to clients we classify as bots, both to see whether that helped the issue and to improve service for our corporeal users.
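A hedged sketch of how such a rule can be expressed at the haproxy layer; the patterns, and whether this is exactly where we applied the block, are assumptions for the example.

    # Illustrative only: refuse page requests from clients whose User-Agent
    # matches common bot patterns; real bot classification is more involved.
    frontend ao3_http
        bind *:80
        acl is_bot hdr_sub(User-Agent) -i bot crawler spider
        http-request deny if is_bot
        default_backend ao3_front_end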

Our single front end machine was periodically hitting a gigabyte a second, which is the capacity of its external interface (and over 5 times our nominal maximum capacity).

At 1am I went to sleep, having spent about 12 hours investigating the issue.

Monday 26th October:

Although our main systems person has a full-time job, this day had been booked as holiday because it was half term.

AM: The system seemed stable and therefore hard to debug. A number of minor issues were picked up by MONyog and we made the changes suggested.

PM: Continued to discuss the issue with our database support.

Evening: Our contractor of our own created a pull request to stop counting a number of stats that are not used on the front page for logged-in users. We repurposed one of the worker nodes as a front end server in case the issue was related to the front end servers being overloaded. We noted that the issue appeared to occur quite soon after a deploy, so our contractor of our own checked the code in that deploy in case there were issues we had missed in code review; none were found. The new front end machine appeared to improve matters, but the problem was not solved; work finished around 1am.

Tuesday 27th:

Also a holiday, but one of my sons' birthdays, so the daytime was not available.

AM (around 6am): A MySQL server upgrade was available, and the database servers were upgraded with no additional interruption to service. More work was done steering and balancing database load as recommended by MySQL support, and balancing load over our application servers, trying to shape it as peaks arose.

Evening: The traffic spikes every 5 minutes appeared to have returned. We continued to balance database and application traffic; the lack of true analytics (we will not send our traffic data to Google, and commercial analytics are unaffordable) makes finding out which pages are being viewed harder than it is for most sites. We worked on rate limiting traffic to different parts of the archive at different rates, at limits that would not affect most users. Our contractor of our own worked on a pull request to cache bylines, which should help to reduce the load on the servers (this has yet to be fully tested and deployed), and also coded an important security fix to reduce the likelihood of a successful phishing attempt.
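As an illustration of limiting different parts of the site at different rates, a minimal haproxy-style sketch follows; the paths, table size and thresholds are assumptions chosen for the example, not the limits we actually applied.

    # Track request rate per client IP and apply a stricter limit to
    # particularly expensive sections of the site (values are illustrative).
    frontend ao3_http
        bind *:80
        stick-table type ip size 200k expire 10m store http_req_rate(60s)
        http-request track-sc0 src
        acl expensive_page path_beg /works/search /bookmarks/search
        acl over_strict_limit  sc0_http_req_rate gt 30
        acl over_general_limit sc0_http_req_rate gt 300
        http-request deny if expensive_page over_strict_limit
        http-request deny if over_general_limit
        default_backend ao3_front_end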

Wednesday 28th:

A normal working day.

Evening: We checked on the rate limits, ensuring AJAX calls were not affected, and adjusted the limits on particularly slow pages. We tried each of the firewalls in turn in case there was an issue there that we were missing. We deployed the fixes that had been developed over the last few days (http://archiveofourown.org/admin_posts/4007). We discussed whether the peaks we see in the database load are symptoms or causes of the issue. We tried shaping traffic from the new firewalls to the front end servers so as not to overwhelm them. We adjusted kern.ipc.soacceptqueue on the new firewalls to ensure that requests are not dropped. Work finished at 12:40am.
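kern.ipc.soacceptqueue is the FreeBSD sysctl that caps a socket's listen (accept) queue; when the queue fills, new connection attempts are dropped. Raising it looks roughly like the following, where the value is an illustrative assumption rather than the figure we settled on.

    # Check the current maximum listen-queue length, raise it, and make the
    # change persistent across reboots (value is illustrative).
    sysctl kern.ipc.soacceptqueue
    sysctl kern.ipc.soacceptqueue=4096
    echo 'kern.ipc.soacceptqueue=4096' >> /etc/sysctl.conf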

Thursday 29th:

A normal working day.

AM (5:30am to 7:00am): We repurposed the old servers ao3-app09, 10 and 99 to run the application, rather than using these machines for background jobs or serving pages to bots.

PM: While accessing a virtual server we noticed that the system seemed particularly slow. We checked the performance of the disk, as network access to this system was not via our normal firewalls. Access to the disk was very slow. Once we noticed this it was clear that we had been working under a misapprehension: the system was performing badly intermittently because both the front end web servers and the back end application processes were blocked on access to the shared disk. Given that the tag feeds are stored in the shared area, this would explain the periodic slowdown as various feed readers checked in. The slowdown was also variable because other virtual machines were accessing the same underlying disk. It is worth noting that we had purchased a second machine to host virtual servers and are in the process of trying to upgrade the cluster to a more recent version of the hypervisor. The second machine, which was hosting the NAS virtual machine, has more disk capacity and a better I/O system than the original machine. Once we were sure of the issue we copied the skins and essential assets over to local disk on each machine. Additionally, a banner was added to the site warning that feeds would be disabled and that downloads could well be disabled as well. It was noted that http://media.ao3.org/ might need to be disabled for a while.
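For illustration, copying the shared assets onto local disk on each machine might look like this; the paths are assumptions rather than our actual layout.

    # Illustrative only: copy skins and other essential assets from the shared
    # NAS mount onto local disk on each application / front end server.
    rsync -a /mnt/nas/shared/skins/  /var/www/otwarchive/public/skins/
    rsync -a /mnt/nas/shared/images/ /var/www/otwarchive/public/images/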

Evening: Work started at 5:07 PM; it took some time to disable feeds. At 5:45 PM both http://media.ao3.org/ and downloads were disabled. The NAS had to be unmounted a number of times (as our configuration management system kept remounting it). By 8:45 PM the NAS was properly unmounted and most systems were working with reduced functionality, enough that we could see we were fixing the right issue this time. By 9:10 PM http://media.ao3.org/ was available once more. By 10 PM tag feeds and downloads were working. After watching the system for an hour, sleep was definitely needed.

Friday 30th: Normal work day.

Evening: Returned the borrowed servers ao3-app09, 10 and 99 to normal service running background jobs. Rebalanced the tasks running on each background task server. Logged a ticket with the virtual machine software vendor to investigate the slow performance. Started writing this document.

Saturday 31st: Weekend.

We discovered that certain operations around skins were failing, so we populated the skins directory on all the application servers with the standard set of skins. We also investigated collection summaries, which were producing errors, and fixed the error; however, further investigation is needed to see whether this area of code needs refactoring. The rest of the day was spent writing this document.

Sunday 1st: Weekend.

We noted similar issues with less common skins on the application servers and recached all the skins on them. We received warnings that the memory used by our application processes was running the application servers nearly out of memory; given that we had greatly increased the number of processes, this was no surprise. We reduced the maximum amount of memory a process can use before being reaped, so that we do not exceed the capacity of the servers. An issue with collection signups meant adding an additional share from the front end system.