Actions

Work Header

The Technical Architecture of the Archive of Our Own

Work Text:

I am James. I have been a member of the Systems committee for 6 years, and for 4 years a member of the group responsible for coding the Archive, which is called Accessibility, Design & Technology, or AD&T for short.

My professional experience is in high-throughput computing, which means I have some experience of systems at scale, and that is useful because the Archive has on average over 270 pageviews a second. However, I am not a professional web developer and I have never been paid to ensure that a popular site stays up reliably, so if anyone in the audience is, I would be interested in hearing what you have to say.

On that note, please do ask questions at any time and I will do my best to answer.

If something goes wrong with the Archive, I am the person most likely to be woken up in the middle of the night.

This is not heroic; each time it happens is a failure. We try to design systems so that various components can fail without the users of the service noticing.

In this case, the day before we had fixed an issue caused by the number of records that store all our users' reading history. We had to change the index size from 4 bytes to 8 bytes, which made the database bigger by about three percent.
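A change like that is essentially a primary-key migration from a 4-byte INT to an 8-byte BIGINT. As a minimal sketch of the idea (the table name here is illustrative, not necessarily the Archive's actual schema), a Rails migration might look like:

    class WidenReadingIds < ActiveRecord::Migration
      def up
        # A 4-byte signed INT tops out at roughly 2.1 billion rows;
        # BIGINT (8 bytes) removes that ceiling at the cost of wider indexes.
        execute "ALTER TABLE readings MODIFY id BIGINT NOT NULL AUTO_INCREMENT"
      end

      def down
        execute "ALTER TABLE readings MODIFY id INT NOT NULL AUTO_INCREMENT"
      end
    end

Wider keys mean wider indexes, which is where the extra few percent of database size comes from.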

The next day we had an issue. We have a number of servers which generate pages for the Archive, and three of them hit the same type of failure at once: they ran out of memory (all those extra 4 bytes add up). We use spare memory on the servers as a cache, so when those three machines failed we lost both that cache and the processes generating the pages. That was a large enough part of our estate that the Archive stopped responding, on a Sunday night, our busiest time.

That night I was out being a taxi driver for my eldest son, collecting him from a college trip where they had come second in the UK student robotics competition, so fixing things took a little longer than usual.

The Organization for Transformative Works, "the org", is one of the core reasons for the stability of the Archive. It is an American not-for-profit founded in 2007 with the goal of legitimising fan works.

The org and its projects are fan run, and the vast majority of the fans involved are women. The org was set up at a time when there was a worry about companies monetising fan content.

Because the hosting companies relied on advertising, it was fairly easy to convince them to destroy content which some people believed was abhorrent. One of the main tenets is therefore to "own the servers", to ensure that the Archive is a safe place for content creators, while tags help ensure that readers can find the works they want and avoid unwanted content. As the terms of service say: "You understand that using the Archive may expose you to material that is offensive, triggering, erroneous, sexually explicit, indecent, blasphemous, objectionable, grammatically incorrect, or badly spelled." And if you are lucky, all in the same work.

The first work on the Archive was posted on the 13th of September 2008. The Archive entered open beta on 14 November 2009. My account was created in September 2010.

There are a number of committees that provide the infrastructure of the org:

Communications: creates the annual report and some news posts, and runs some social media accounts. Think public relations.

Development and Membership: runs the two annual fundraisers, communicating with members and potential donors.

Elections: runs the election process for the Board of Directors. In the past we have had too many years where not enough people ran for the board, but that has not been an issue in the last 3 years.

Finance: responsible for accounting, budgeting, payments and tax filings.

Strategic Planning: works on big-picture strategic questions and vision.

Systems: Systems is one of the two committees I am a member of. We are responsible for specifying and procuring the servers that the org has, and for installing both the operating systems and the software systems, for example the databases that the Archive uses and the software behind Transformative Works and Cultures (the Open Journal System). We run at least 3 different types of database, along with mail systems, web servers, and so on.

The Board of Directors: responsible for setting high-level goals and org-wide policy, for financial well-being, and for following federal and state law as well as our own rules. It provides support and guidance to other committees.

Translation: this is the committee that translates many of the news posts on the Archive and the FAQ entries. They also supply translations for support tickets raised in languages other than English. In time we will get to translating the Archive's interface.

Volunteers & Recruiting: Manage new starts and leavers and ensures the right people have access to the right tools. While there are many tasks that Vol Com does not do that a HR department does they are the nearest the org has to a HR department.

Webs: responsible for the maintenance of the org's websites (transformativeworks.org, elections, and Open Doors).

Transformative Works and Cultures is a peer-reviewed, open-access academic journal which publishes articles about transformative works, media studies, and the fan community.

Legal: provides guidance whenever needed for internal legal questions. The Legal committee is dedicated to helping fans who have questions about their own fanworks and to engaging in public advocacy on legal issues likely to impact fans and fanworks.

Fanlore: A wiki about fandom history and fan culture, written by fans.

Abuse: fields the complaints that come in about content uploaded to the Archive, whether that is works or comments.

Accessibility, Design & Technology: AD&T is the second committee I am a member of  it   coordinates design and development for the Archive. This means that they maintain and update the Archive code, are the first in the line of fire when emergencies happen, and generally make decisions regarding what new features are coming next, what needs prioritising and fixing on the Archive from the technical side.

Archive Documentation: writes, edits and revises AO3 documentation, including FAQs and tutorials.

Open Doors: is dedicated to preserving fanworks for the future. In practice this involves importing a large number of works from fragile websites or database backups into new collections created on the Archive.

Support: helps to resolve technical problems experienced by users, and passes on users' feedback about the Archive to coders, testers, and tag wranglers.

Tag Wrangling: works within the Archive to ensure that the varied forms of user-generated tags are sorted and filterable, without changing any of the specific tags users choose, so that Archive users have an easier time finding the type of content they are looking for at any given time.

There is a great difference in how you design a system to support a small number of users and how you design a system for a much larger number of users.

According to free comparison services, the Archive is one of the top 400 most popular websites in the world.

Back in 2012 we doubled the number of pageviews over the year, from 18 million to 38 million pages a month. However, the percentage increase has been slowing: as of last week this year's rise was only 36 million pages a month, compared with an increase of 38 million the year before. I would guess that the first month with over a billion pageviews will be either December 2018 or January 2019.

The following images come from the system we use to help us find slow parts of the code, and to help us find parts of the Archive that are causing issues.

This page shows us how many users we have and how much downtime we have had; the tool marks the site as down both when it is actually down and when it is unreachable from a number of probing points.

We can see what proportion of the time is taken up by each part of the process of rendering a page. Here we can see that, as all the servers for the Archive are located in a single location, the network time is significant.

Document Object Model processing is the time between the web browser receiving the first part of the HTML and receiving the last.

By looking at the displays we can see where the issue is.

For about 5 minutes we can see there was a doubling of how long it took to deliver a page to the user.

Looking at the application view we can see most of the time is spent queuing, which means there were not enough processes available to generate pages, which in turn was caused by each page taking a lot longer to generate.

Now, removing the queuing, we can see that the delay is in "web external", which for us means Elasticsearch, which powers our tagging system.

Without web external everything looks normal, so we would look at the Elasticsearch logs to see what the issue was.

When doing maintenance I try not to think about how many people are affected by the work I am doing.

While most of our users are on mobile, we don't have the resources to dedicate to app development. While we do want to write a stable API for the Archive, this is not something I can see happening in the immediate future.

It is worth noting that if an unofficial app requests your login and password, you are trusting that app with both your reading history and the power to comment, leave kudos, and create and delete works.

Even though we essentially just serve text, we still use a considerable amount of bandwidth: about half a petabyte a year. When we do eventually start serving media (video, audio and pictures) we will need significantly more bandwidth, as our current capacity is only equivalent to about 60 streams.
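As a back-of-envelope check on what half a petabyte a year means as a sustained rate (pure arithmetic, not a measurement of our links):

    # Rough average transfer rate implied by half a petabyte per year.
    bytes_per_year   = 0.5 * 10**15
    seconds_per_year = 365 * 24 * 3600
    avg_mb_per_sec   = bytes_per_year / seconds_per_year / 10**6
    # => roughly 16 MB/s, or about 130 Mbit/s averaged over the whole year;
    #    peak traffic is of course much higher than the average.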

We currently have servers in three data centers.

I am going to call the sites Alpha, Omega and "Landing Strip One". The org has a general policy of not naming suppliers. We get enough emails to our ISP as it is, either cease-and-desist letters or people trying to convince our ISP that we are evil.


We use RouterOS systems at all three sites as VPN servers because they are cost effective; at Landing Strip One and Omega the RouterOS systems are also used as firewalls.

At our main site (Alpha), we use the RouterOS system only as an intersite VPN server and as a firewall so that the Fanlore server is isolated from the other machines, while a pair of pfSense firewalls serve as our main ingress and our Brocade switches act as the main layer 3 router.

We have started to migrate our out-of-band management to a separate VLAN.

Currently our front-end machines have an inbound and an outbound network; I would like to do the same later for our Galera cluster.

The network at Omega is particularly simple, with the VPN just connecting to Alpha.

The network at Landing Strip One is slightly more complex, as we host developer machines there, access to which is given out relatively freely, and we need to ensure that access to these systems does not grant any additional privilege.

Omega is a relatively new data center for us; we have had it for about a couple of years. It is mainly used for a DNS server and a MySQL secondary server for the Archive and Fanlore. It does use obsolete hardware that still has some use left in it. This year I hope to move some more recent equipment from Alpha to Omega.

Landing Strip One has a large number of virtual machines. The hardware is over 4 years old, and we plan to move it over to a NAS role and buy new servers to take over the compute component.

For just over two years we have had a single rack in Alpha. This does make things easier for us; when we were spread over multiple racks I often worried that strange problems were being caused by low bandwidth between the ISP's switches in each of the different racks on which our VLAN was hosted.


These are the specifics of the hardware we are currently using. If and when we get to hosting media I am expecting to have to add additional 25Gb/s switches and update the firewalls and front-end machines.

This is an example of our current low-power machine. We use the NVMe drive as the /var partition because, while this system only relays email from the systems that generate it, we send over half a million emails a day and that causes a lot of IO.

We have two servers running a large number of VMs, including machines for Fanlore and the journal.

We also use these servers to provide machines for non-time-critical features of the Archive, such as generating downloads and serving pages to web spiders such as Google.

We have our own GitLab instance (a Git server with a web interface). Git is a source control system; we use our instance both to hold copies of the Archive source when we want to collaborate on security fixes that we don't want to be public before they are deployed, and for our Ansible and CFEngine repositories.

CFEngine is the IT infrastructure automation framework we have been using since 2012; before that, all changes on the servers we had (all three of them) were made entirely by hand. I chose CFEngine at the time as it was the system I used at work. It is not the most pleasant system and has some sharp corners, but it has meant that deploying new servers is relatively pain free.

Ansible is a much more modern system, and I am not the only member of Systems who can use it. We are slowly migrating functionality out of CFEngine and into Ansible.

FAI is the system we use to install the operating system on our servers. It boots the system from the network and installs Debian and CFEngine. From there CFEngine runs multiple times and the server should be usable (assuming it is running a role we already have configuration for).

We try to ensure that any system whose failure would take the Archive offline is replicated, and ideally running active/active. After the firewalls come the web front ends, whose tasks are SSL termination, serving static assets (such as CSS and images), and some full-page caching.

More interestingly, they send different kinds of requests to different sets of application servers; for example, we use slower VMs for spiders and downloads. We have a special pool of application servers set aside for writes such as work creation, comment creation and kudos giving, so that these are prioritised.


Redis is a NoSQL database; we use it to hold non-relational data that we want to access quickly, for example autocomplete. We also use Redis to temporarily store information before it is written out to MySQL; an example of this is the reading history. Our MySQL servers couldn't support 15 million writes a day to a single table just for the reading history, so we temporarily store the information in Redis and then write it to MySQL in batches of 1,000 records every 15 minutes.
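A minimal sketch of that buffering pattern, assuming a Redis list called reading_buffer and an illustrative bulk-insert helper rather than the Archive's actual code:

    require "redis"

    REDIS = Redis.new

    # On each page view: push a cheap record into Redis instead of writing to MySQL.
    def record_reading(user_id, work_id)
      REDIS.rpush("reading_buffer", [user_id, work_id, Time.now.to_i].join(":"))
    end

    # Every 15 minutes, from a scheduled job: drain the buffer in batches of 1,000
    # and write each batch to MySQL in one go.
    def flush_readings
      loop do
        batch = REDIS.lrange("reading_buffer", 0, 999)
        break if batch.empty?
        Reading.bulk_insert(batch.map { |entry| entry.split(":") })  # hypothetical bulk-insert helper
        REDIS.ltrim("reading_buffer", batch.size, -1)
      end
    end

Because new entries are pushed onto the tail of the list and the flusher trims from the head, the two sides do not race each other.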

We also use Resque, which allows jobs to be run asynchronously. That is, a page you visit might trigger a process that could take several seconds or minutes to run, which makes it impractical to have the webpage wait for it. In that case a job is created using Resque and run later. Redis is used to provide the Resque service.
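A hedged sketch of the Resque pattern (the job class and method names are illustrative):

    require "resque"

    # A job that may take minutes, so it must not block a web request.
    class GenerateDownloadJob
      @queue = :downloads

      def self.perform(work_id)
        work = Work.find(work_id)
        work.generate_download_files   # hypothetical long-running method
      end
    end

    # In the request path: enqueue the job and return to the user immediately.
    Resque.enqueue(GenerateDownloadJob, work.id)

A pool of Resque workers then pops jobs off the Redis-backed queue and runs them in the background.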

As Redis is a single-threaded system, we run multiple Redis processes to make it less likely that we overwhelm a single process. We have Redis instances for kudos, Resque, rollout (which would allow us to have features in test for a subset of users), and a general Redis database.
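Roughly, the split looks like this (hosts and ports here are purely illustrative):

    require "redis"
    require "resque"
    require "rollout"

    # One Redis process per concern, so a hot workload cannot starve the others.
    KUDOS_REDIS   = Redis.new(host: "redis1", port: 6379)
    RESQUE_REDIS  = Redis.new(host: "redis1", port: 6380)
    ROLLOUT_REDIS = Redis.new(host: "redis1", port: 6381)
    GENERAL_REDIS = Redis.new(host: "redis1", port: 6382)

    # Point the libraries that need Redis at their dedicated instance.
    Resque.redis = RESQUE_REDIS
    ROLLOUT = Rollout.new(ROLLOUT_REDIS)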

It is worth noting that Redis is not replicated with failover, as the Jepsen report on Sentinel was too worrying. If we plan maintenance on Redis then we take the Archive offline to do it.

We have recently moved Redis from an old MySQL server, which was only serving Redis, to a newer MySQL server in our cluster, so we could redeploy the old server as an application server. If Redis stops, the number of errors users see spikes tremendously.

Elasticsearch is what we now use to power our tagging. While all the state of the Archive is canonically stored in MySQL, we often use Elasticsearch to display the data. Unpleasant things happen if the data gets out of sync; for example, your work may not turn up in searches or on a tag page.

The version of Elasticsearch we use is ancient for a software project; it is coming up to its fourth birthday. The reason it hasn't been upgraded is that the gem (library) we use to access it is no longer supported, and changing the core code around this area is scary. We have agreed with a contracting company that we have used before to refactor this code into shape, using a modern gem and a much more modern version of Elasticsearch.

For Jepsen this is pretty positive; for us, however, it means we often get complaints when works are missing from a tag or fandom listing. We don't use Elasticsearch as a canonical data source. We don't routinely rebuild indexes, though documents do get reindexed as a side effect of some processes. Things will get better when we upgrade to a more modern gem and a current version of Elasticsearch.

We have been using Percona XtraDB Cluster (Galera Cluster) since about October 2014.

This has significantly reduced the amount of downtime that we have had due to database issues because:

  • Three machines can handle a lot more load than a single machine ( not three times the load though).
  • If we need to take a server down for either a hardware fault or for a software update then we have two other machines which can keep the site running while the third machine is being worked on.

We have three servers in our MySQL cluster, and as these are the most expensive machines we buy, I tend not to buy all three at the same time. By buying two servers one year and a single one the next, we smooth out the cost of the systems.

Our old database servers generally get recycled into application servers running the Archive code itself. The oldest  database server was bought in 2015 and I expect to retire it from database work any month now.

It's worth noting two things here: these machines have plenty of RAM and very fast disks. The MySQL database process sits at 465GB on these machines, and the whole MySQL data area takes 449GB of disk space, 407GB of which is just for the Archive (Fanlore, for example, is 10GB).

The NVMe, for example, runs at about 1.2GB/s, which is fairly fast.

These are the machines we bought last year; they look very similar to the previous year's, except that the RAM was increased by 50% and the CPU is one family later.

With Galera Cluster you can apply writes to any node in the cluster. Using a rather complex system it ensures that the same data is written to all the nodes in the cluster. The idea is that if there is a conflict (a certification failure) then all nodes will agree on that conflict. If you ensure that all data is written to a single node then there is less opportunity for certification failures.

We run an instance of MaxScale on all our MySQL clients, set to steer all write queries to a single server (generally the eldest). This reduces the likelihood of certification failures and means that the system dealing with the important writes (kudos, comments and works) is not being bothered by the read workload.

Having a MaxScale instance on each client is not entirely usual. We do this to move complexity to a place where it is easily understood (by me at least). Traditionally you would have a floating IP address which migrates around the three servers, controlled by Pacemaker; to me that is a complex system. Given that we already have to have a method of installing and configuring software packages on each machine (CFEngine at present, Ansible in the future), I find it simpler to use that mechanism and have the MySQL client on each machine talk to its local instance of MaxScale.
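In practice that just means the application's database configuration points at localhost rather than at a database host; a hedged sketch (the port, database name and credentials are illustrative):

    require "active_record"

    # Each application server talks to the MaxScale instance running beside it;
    # MaxScale then steers writes to the designated Galera node.
    ActiveRecord::Base.establish_connection(
      adapter:  "mysql2",
      host:     "127.0.0.1",
      port:     3307,
      database: "otwarchive",
      username: "archive",
      password: ENV["ARCHIVE_DB_PASSWORD"]
    )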

We are currently using MySQL 5.6, and I am considering whether we should migrate to MySQL 5.7 or MariaDB 10. The amount of data we are storing just keeps increasing; since I started writing this talk in January, the amount of disk space used has gone up by about 10%.

We already have three copies of the database on three separate machines in our data center, but we also use MySQL replication to keep copies in Omega and Landing Strip One. This means that if there were a zombie apocalypse that only affected Alpha, we would still have a replica off site. It is worth noting that the system in Omega is so old it can't keep up with the IO load of replicating the information, and it slowly falls behind the other systems until I have to reinitialise the replica by hand (at this point it is about 48 days behind the main system, which can be useful for Abuse, for example).

We use Percona xtrabackup to create online backups of our MySQL databases. We create weekly full backups. For the Archive these are created on one of our main servers and each of our replicas.

From that backup we could use the mysql binary logs to roll forward to any later time if there was an emergency.

We have a number of systems running as application servers, bought over a few years. The first two machines were bought in 2012, and I expect to move them to Omega in a couple of months so they can give a few more years' service before they stop being useful.

The set of four machines is a Supermicro FatTwin, that is, four servers in 2U; these were bought in 2014 and are also becoming a bit old.

The last machine is an old database server, bought in 2014.

In the next year I expect us to purchase a new database server, allowing an Intel server with 56 threads to be repurposed as an application server. I also plan to buy a new application server (I suspect an AMD EPYC 7551, with 128 threads).

The three machines marked as asynchronous workers do work such as sending email to users, matching for exchanges, and the inner workings of the tag wrangulator.

Here we are going to trace out the path of a request and how it is processed.

After the client has done all its DNS resolution and so on, it will send a request to one of the IP addresses for our firewall pair (104.153.64.122).

Next the packet has to pass through the firewall rules.

Now the packet is passed to an instance of HAProxy running on the firewall itself. This instance of HAProxy is only used to load balance and to allow us to take front ends in and out of service.

We are now on to the front ends.

Nginx is used as the front-end web server, and among its tasks are:

Access control: the first example here blocks a client based on arbitrary facts about the client, while the second block denies access based on IP address, in this case an ISP that advertised dubious hosting and was being used by a group who were proxying the Archive and inserting adverts into the pages.

These lists should be cleared periodically, as IP address assignments do change over time.

Routing the request: here are some examples where the URL determines how long the page can take to generate.

Note the try_files line: if a filename matches the URL then that file is sent, and we check for the existence of maintenance.html; if it exists then we return that page. When we want to take the site down we move the file nomaintenance.html to maintenance.html and wait a minute for the filesystem cache to flush; then all requests get the sad Archive page and we can do our work happy that no users can make requests (there could still be Resque workers or cron jobs around, so if there need to be no changes at all then those need dealing with as well).

Compression and SSL termination: these are pretty standard stanzas. We compress anything that is not an image, as images are already compressed. We get about two-thirds compression, so that currently our maximum line rate is about 200MB/s where it would otherwise be about 500MB/s.

We do support HTTPS for the Archive; there are a number of places where we will drop down to HTTP, although we have a pull request in to fix this. This is mostly a standard stanza, however the SSL ciphers line does need to be updated periodically. I generally run the SSL Labs quality checker every so often to see if any changes are needed.

Caching content: while we are upgrading Rails and changing our user authentication gem from Authlogic to Devise, we do a lot less full-page caching than we have in the past. We used to use Squid for full-page caching; however, given that it is possible to do this in nginx, we now use nginx caching for the Archive. Fanlore, for example, still uses Squid, as MediaWiki can communicate with Squid for cache invalidation.

At present we are only using nginx caching for autocomplete.

URL rewriting: we have three kinds of example here.

  • We have a number of domains that resolve to the Archive, and we ensure that the URL is changed to archiveofourown.org.
  • We support old URLs for our FAQ items, which have changed over time.
  • The last example is part of the configuration used for clients we have decided are web spiders. Here we ensure that the URL has the flag needed so that the spider is not asked whether it wants to see an adult work.

Google PageSpeed is a module available for both Apache and nginx. It does a number of rather (to me) complex operations, for example taking an image such as the logo in the top left-hand corner of the Archive's pages and turning it into a data statement in the HTML, or taking CSS files and either inlining them or combining them together.

Nginx then talks to a backend instance of HAProxy.

The two obvious reasons for using HAProxy behind the free version of nginx are the web status display seen here, and that it is easy to programmatically take servers out of the load balancer, which is useful when we are deploying new versions of the Archive.


We use god as a process manager; its task is to start processes and ensure they continue to run.

Unicorn is the web server that runs on each of the application servers. We have been using Unicorn since at least 2010, so while other options are available we see no need to change.

The Archive is a Ruby on Rails application. Rails has a philosophy of convention over configuration, which means there is usually a natural way to do things in Rails. The model holds the data and the behaviour; the views are the representation of the data that the user sees; and the controller takes input from the user, updates the data using the model, and prepares data for the view.

ActiveRecord is the component used to persist objects and data while allowing programmers to use natural (to programmers) forms. For example, here I look up my account and set the number of failed login attempts to 0, then look up how many kudos and comments I have left and how many works I have read while logged in. I then look up how many guest kudos have been left from our IP address.
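Roughly what those lookups might look like in ActiveRecord (the model and column names are illustrative, not necessarily the Archive's actual schema):

    user = User.find_by(login: "james")
    user.update(failed_login_count: 0)

    kudos_left    = Kudo.where(user_id: user.id).count
    comments_left = Comment.where(user_id: user.id).count
    works_read    = Reading.where(user_id: user.id).count

    # Guest kudos are recorded against an IP address rather than a user.
    guest_kudos = Kudo.where(user_id: nil, ip_address: "203.0.113.7").count

Each of these lines turns into a SQL statement without the programmer writing any SQL.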

Here we see some code associated with storing readings.

update_or_create checks that the user is logged in, has their history enabled, and is not the author of the work, then creates an entry in Redis that will later be stored in MySQL.

mark_to_read_later, on the other hand, creates an entry in MySQL immediately.
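Sketches of how those two paths might look, reusing the Redis buffer from earlier (the association and flag names are illustrative, not the Archive's actual code):

    class Reading < ActiveRecord::Base
      # Buffered path: only record history for logged-in users who have history
      # enabled and who are not an author of the work; the MySQL write happens later.
      def self.update_or_create(work, user)
        return unless user && user.preference.history_enabled?
        return if work.pseuds.map(&:user).include?(user)

        REDIS.rpush("reading_buffer", [user.id, work.id, Time.now.to_i].join(":"))
      end

      # Immediate path: "mark for later" writes straight to MySQL.
      def self.mark_to_read_later(work, user)
        reading = find_or_initialize_by(user_id: user.id, work_id: work.id)
        reading.toread = true
        reading.save!
      end
    end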

This is part of the controller for readings. Before anything is called we check that the user is logged in, that they own the data they are trying to access, and that they have their history enabled, and then we load the data for the user.

For the index page we then load up the data that the view is going to use.
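Put together, a hedged sketch of such a controller (the filter names and ordering column are illustrative):

    class ReadingsController < ApplicationController
      before_action :users_only             # must be logged in
      before_action :load_user              # sets @user from the URL
      before_action :check_ownership        # may only see their own history
      before_action :check_history_enabled

      def index
        # Load the data the view will use, newest first, a page at a time.
        @readings = @user.readings.order("last_viewed DESC").limit(20)
      end
    end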

This is part of the view; here we go through each of the readings and render a blurb for each one.
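As a rough illustration (shown here as ERB; the real templates may differ), the view is little more than a loop over @readings:

    <ol class="reading index">
      <% @readings.each do |reading| %>
        <li>
          <%# Each reading is displayed using the shared work blurb partial. %>
          <%= render partial: "works/blurb", locals: { work: reading.work } %>
        </li>
      <% end %>
    </ol>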

This shows that each application server uses a local instance of haproxy for reliability and to have local queuing where that is needed.

This is a very quick overview of the development process.

We have an issue tracker which is used to log both faults and feature requests. It is worth noting that both bugs and features can take a long time before anyone starts work on them; for example, this issue was created over five years ago. I have done the work to migrate admin users, but we still have to complete the work for standard users (this is a complex issue, which involves changing the library we use for authentication and authorisation to a better supported and more standard library).

At some point a developer will do the work and add appropriate tests for the new code, and all the automated tests will need to succeed. This is Travis, our primary continuous integration system.

The developer now believes that their code is good and fit to be merged into the Archive's code. The code is then peer reviewed by the other coders in the group, and a member of AD&T can approve or request changes to the pull request (the changes the coder wishes to make).

AD&T meet weekly and among the things discussed is which of the open pull requests which are ready to be merged will go into the next release. The decision is based on site integrity ( is it a security or an important performance fix ), requests from support, abuse, open doors or legal to help them in their work, Is it a feature that we believe our users would really appreciate.

After a release has been deployed, and we are happy that no emergency fixes are likely to be required, we merge the next set of changes into the master branch on GitHub.

This causes our second continuous integration system (Codeship) to start, and each change is tested again; at the end of the tests Codeship automatically deploys the new code to our staging servers.

Our QA (quality assurance) team now tests each issue by hand, updating the issue tracker. If there are problems then a developer (usually the original one) will try to write a fix and make a new pull request; if no one can easily fix the issue then the new feature or bug fix will be reverted (the change undone).

Once all the issues have been verified then the new release is deployed to beta ( the Archive ).

Any changes to the database tables are made in advance of the deployment; given the size of the database, we have some plans to make migrations less disruptive.

Changes to the CSS are done by hand at the end of the deploy.

And finally.

There are a large number of systems that the archive relies on.

I need to thank everyone who has created a work for the site.

And we do get plenty of thanks sent to support which is collected by Anne-Li who records it and ensures that the coders know how much they are appreciated.

My colleagues in systems.

  • Amanda
  • Elz
  • Karen
  • Matthew
  • Puckling
  • Seamus
  • Tom

And the members of AD&T.

  • Ariana
  • Bingeling
  • Elz
  • Enigel
  • Lady Oscar
  • Mumble
  • Naomi
  • Sarken
  • Scott
  • And Katie