Docker at Lyst
We don’t have a long history of using Docker at Lyst. On June 9th 2014 Docker 1.0 was released. On June 16th we started working on moving our entire platform to it, and by entire platform we mean everything. Development, test and production. We’re not quite finished rolling it out to all our services but we’ve learnt a lot of lessons and it’s had a big impact on how we work.
Many discussions of Docker talk about how to deploy micro-services and other simple applications. This isn’t one of those. This is how we deploy half a million lines of our own code written over almost 5 years. Many times a day. Using almost identical tooling from our laptops to our production servers. It’s also the story of how in the space of 2 months we went from deploying twice a month, to deploying twice an hour or more.
Lyst is not-small
Our stack is more or less your usual Django website on AWS. We use Python 2.7 running inside uwsgi, Django 1.5 and PostgreSQL 9.3. We also depend on over 100 third-party libraries, ranging from Boto to SciPy, and communicate with at least 10 different data systems. On top of that we merge over 35 pull requests a day. With all this going on, deployment was non-trivial and it was often hard to keep track of exactly what was going out. At the very least we wanted a way to get rid of the Fortran compiler needed to build SciPy that was living on all our web servers.
Avoiding build dependencies in your container image is a bit tricky with Python applications. The obvious approach is to use wheels to install binaries of everything instead of building from source packages. Except PyPI doesn't host Linux wheels yet, so we have to compile and store all our binaries in our own pip-compatible index server.
To do this we use devpi and a dedicated build container running devpi-builder to ensure we have the right wheels available. Our main application container then installs its dependencies direct from this package server. This also has the advantage of making the container build process very fast and more tolerant of PyPI outages.
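Concretely, the application containers just need pip pointed at the internal index so they pull pre-built wheels instead of compiling from source. A sketch of the pip.conf involved (the hostname and index path are illustrative, not our real ones):

```ini
; Point pip at the internal devpi index so `pip install` resolves
; dependencies to the wheels our build container has already compiled.
; devpi.internal and root/prod are placeholder names.
[global]
index-url = http://devpi.internal:3141/root/prod/+simple/
```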
One rough edge to this process is that getting any kind of configuration into the container build environment is a bit tricky. Dockerfiles don't provide a lot of flexibility, so we use a Makefile to manage the build context. This is a long way from being as nice as it could be, particularly as Docker currently uses both a file's content hash and its modification time in cache keys. This means every git checkout or file copy invalidates our build cache. To work around this we use a script that scans the git log to fix the timestamps:
touch -t $(git log -n1 --pretty=format:%ci $FILENAME | sed -e 's/[- :]//g' -e 's/\+.*$//' -e 's/\([0-9][0-9]\)$/.\1/') $TOUCH_FILENAME
It's not pretty but it does the job. It looks like Docker 1.4 will fix this, which will make our build scripts much tidier.
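For reuse across many files, the date mangling in that one-liner can be pulled out into small helper functions. The sed expressions below are lifted straight from the script above; the function names are ours, for illustration only:

```shell
# Convert a git %ci date ("2014-11-20 12:34:56 +0000") into the
# CCYYMMDDhhmm.SS format that `touch -t` expects.
git_date_to_touch() {
    echo "$1" | sed -e 's/[- :]//g' -e 's/\+.*$//' -e 's/\([0-9][0-9]\)$/.\1/'
}

# Reset a file's mtime to its last commit date so the Docker build
# cache survives a fresh checkout (illustrative wrapper).
git_touch() {
    touch -t "$(git_date_to_touch "$(git log -n1 --pretty=format:%ci "$1")")" "$1"
}
```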
As well as the build configuration we also wanted to tidy up our application configuration. We adopted the pattern advocated by designs like 12factor: move as much of the configuration as possible into environment variables. This meant making things smarter in places so that we reduced the total number of variables that needed setting, but it was definitely worth it. We now have one good default for development that is overridden for production and testing.
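The settings pattern boils down to a tiny helper: read from the environment, fall back to the development default. A minimal sketch (the variable names here are illustrative, not our real settings):

```python
import os


def env(name, default):
    """Read a setting from the environment, falling back to a
    development default. Production and test environments override
    these by setting the variables on the container."""
    return os.environ.get(name, default)


# One good default for development; production overrides via env vars.
DEBUG = env('DJANGO_DEBUG', 'true').lower() == 'true'
DATABASE_URL = env('DATABASE_URL', 'postgres://localhost/app_dev')
```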
The other remaining wart in our build process is static assets. Our containers install quite a lot of software that is only used to compile static assets, like optipng (via imagemin) and Uglify.js. The Django app doesn't care about these details and it would be quite nice to make this step more independent. Unfortunately getting the output of our static build into the Django container is quite hard right now. The technique we use is to build a Dockerfile that compiles our current assets and then docker cp's the compiled data out of a temporary container. This gives us some of the benefits of the build cache but requires 4 separate steps to actually get the data into the final image.
We support both Linux and Mac OS X development environments. On OS X we do this using boot2docker. It mostly works, but we are considering moving to a Vagrant-based host using a commercial VM for better performance. To get local code editing working, boot2docker mounts the entire /Users folder inside the host VM. We then use Docker volume mounts to expose the live source code to the container.
Getting the development containers up and running is currently done using a Makefile. We originally planned to use Fig for this but at the time there was a bug that caused it to have no output in Jenkins. You should probably just use Fig though. It’s much nicer and actually understands the dependency tree between your containers.
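If you do go with Fig, a minimal fig.yml for a setup like ours might look something like this (the service names, ports and uwsgi command are illustrative, not our actual configuration):

```yaml
web:
  build: .
  command: uwsgi --ini uwsgi.ini
  volumes:
    - .:/code          # live source code mounted into the container
  ports:
    - "8000:8000"
  links:
    - db               # Fig starts db first and wires up the hostname
db:
  image: postgres:9.3
```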
As part of standardising our application we also replaced the Django runserver with uwsgi. This meant we needed to figure out how to cope with code reloading and static files nicely. For static files we turned to the Whitenoise WSGI middleware and hooked it into the reload API in Django. This gave us a pretty close approximation of runserver but running on uwsgi instead of raw Python. This makes it easy to test changes to our WSGI configuration and gives us lots of confidence that it will work in production.
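WhiteNoise does the heavy lifting for us in practice, but the core idea is just a WSGI wrapper that serves compiled assets from disk and passes everything else through. A stripped-down sketch of that pattern (the class and paths are illustrative; the real middleware adds caching headers, security checks and more):

```python
import mimetypes
import os


class StaticFiles(object):
    """Serve files under a URL prefix straight from the WSGI layer,
    the way WhiteNoise does, so uwsgi needs no special static config."""

    def __init__(self, app, prefix, root):
        self.app = app        # the wrapped Django WSGI application
        self.prefix = prefix  # e.g. '/static/'
        self.root = root      # directory holding the compiled assets

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '')
        if path.startswith(self.prefix):
            name = path[len(self.prefix):].lstrip('/')
            filename = os.path.join(self.root, name)
            if os.path.isfile(filename):
                ctype = (mimetypes.guess_type(filename)[0]
                         or 'application/octet-stream')
                with open(filename, 'rb') as f:
                    body = f.read()
                start_response('200 OK', [('Content-Type', ctype),
                                          ('Content-Length', str(len(body)))])
                return [body]
        # Not a static file: hand off to Django.
        return self.app(environ, start_response)
```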
Like almost everyone we use Jenkins to run our testing pipeline. With the aid of a custom web service that talks to GitHub we run tests on every pull request both before and after merging. Docker makes creating identical build environments for this easy to manage. Particularly as we now need to run large numbers of tests concurrently and also test against a complex array of services.
Whenever we get a clean build of our main branch we push updated images to our private registry tagged with the git commit id. This makes our deploys very transparent as it’s easy to tie specific Docker images back to the list of changes that went into them.
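The tagging step itself is simple. A hedged sketch of what the post-build job does, with the registry hostname and image name as placeholders rather than our real ones:

```shell
# Build the image tag from the current git commit id so every image
# can be traced back to an exact set of changes.
image_tag() {
    echo "registry.example.com/lyst/app:$1"
}

# In the Jenkins job, roughly:
#   docker build -t "$(image_tag "$(git rev-parse HEAD)")" .
#   docker push  "$(image_tag "$(git rev-parse HEAD)")"
```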
In production we currently deploy the two main services that make up our website: the main Django application and the supporting Celery workers. We deploy to Docker using a simple Python script that finds Docker hosts via EC2 instance tags. It talks to our nginx servers to carry out a rolling restart of the application without interrupting users' traffic.
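Our actual deploy script is more involved, but the rolling-restart idea can be sketched as a pure function that plans the steps batch by batch, so nginx always has healthy upstreams left in rotation. The step names and batching are ours, for illustration, not the real script's:

```python
def rolling_restart_plan(hosts, batch_size=1):
    """Plan a rolling restart: drain, restart and re-enable hosts in
    small batches so traffic is never interrupted."""
    plan = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        plan.append(('drain', batch))    # nginx stops routing to these hosts
        plan.append(('restart', batch))  # pull the new image, restart containers
        plan.append(('enable', batch))   # hosts rejoin the upstream pool
    return plan
```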
Thanks to recent additions like the "exec" command, troubleshooting has become a lot easier. There's still lots of room to improve, though: for example, the lack of good socket information inside /proc can make tools like lsof and netstat quite frustrating to use.
While there are still a lot of sharp edges we are happy with Docker and the direction it seems to be going in. It’s changing fast but more often than not that pace is solving our problems, not adding to them. It’s helped us get a large system under control and given us a much more flexible development process for the future.
Now we’ve had some time to let Docker settle into our environment we’ve started thinking about how things could be better. We are looking forward to finding solutions to the issues we still have with Docker.
The private registry is not great. It’s slow and has taken more tuning and nginx configuration than it should. It also has no built in authentication which makes exposing it to our remote developers yet another thing to work around. We’ve heard rumours of some big changes coming here though.
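The workaround for the missing authentication is the usual one: put nginx in front of the registry and bolt on basic auth. A rough sketch of the relevant server block (hostname, port and file paths are illustrative):

```nginx
# Front the registry with nginx to add the basic auth it lacks.
server {
    listen 443 ssl;
    server_name registry.example.com;

    auth_basic           "Docker Registry";
    auth_basic_user_file /etc/nginx/registry.htpasswd;

    location / {
        proxy_pass http://localhost:5000;  # the registry container
        proxy_set_header Host $host;
    }
}
```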
Docker host daemons have no authentication so you have to do everything using either firewall rules or SSH. Hopefully 1.4 should start to solve this via the new libtrust library.
Dockerfiles are great for simple cases but need a lot of external help to use in more complex situations. Even simple variable expansion would be helpful.
The default Docker host logging setup works well but has some pretty glaring problems. For example it never rotates the log files and they grow forever. We'd much prefer to have Docker send our logs out over something like the syslog protocol. Logging drivers sound like they might solve this problem.
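Until something like logging drivers lands, one stop-gap is rotating the per-container JSON log files from outside Docker, for instance with logrotate. A sketch using Docker's default log layout (copytruncate avoids having to restart containers, at the cost of possibly losing a few lines during rotation):

```conf
/var/lib/docker/containers/*/*.log {
    daily
    rotate 7
    compress
    missingok
    copytruncate
}
```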
Devicemapper will do strange things in some configurations. For example, on our Jenkins servers we have several times had containers become impossible to remove. This doesn't make our ops team happy. Our workaround is to just use AUFS instead, which has some performance issues when files become invalidated but is at least reliable.