Microservices at Lyst
What we’ve built
About 2 years ago we started developing a more microservice-based approach to systems at Lyst. Our main aim was simply to help us get more software shipped in less time. We’ve learnt a lot since then.
Our approach with microservices was fundamentally about standardising how we ship software. Removing certain decisions from the process of building things makes it easier for people to focus on the work that really delivers value to our users. We’re not done on this front, but our platform isn’t getting in the way anymore.
We now have over 20 individual microservices running within a single cluster, each monitored, maintained and deployed by a small team using a standardised set of tools. We use Amazon ECS, with Empire acting as a management interface, to handle deployment. A combination of New Relic, Sentry and Grafana handles all of our monitoring requirements, and an assortment of other AWS services fills in the gaps for things like databases and message queues.
Overall, I think there are 3 key things we’ve learnt:
- You can always make it easier for things to talk to each other.
- Mean time to repair (MTTR) is usually more important than mean time between failures (MTBF).
- Your monitoring is never good enough.
Microservices are all about APIs
When we started out we went down the path of making our microservices as familiar and easy to write as possible. And because everyone was already familiar with the monolith, we basically ended up building lots of tiny versions of it.
Our Django-based approach was great at getting our team moving with writing new services, but it doesn’t do a lot to make using those services a good experience. Django REST Framework solves some of our problems, but it still results in a lot of error-prone manual labour around writing clients. It also doesn’t help very much with the problem of backwards and forwards compatibility between services.
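To make that concrete, here’s a rough sketch of the kind of hand-written client each consumer ends up maintaining (the service, endpoint and fields are made up for illustration): every URL, field name and status-code check is written, and kept in sync with the server, by hand.

```python
# Hypothetical hand-rolled client for a "product" microservice. None of this
# is machine-checked: the path, the expected fields and the error handling
# all have to be kept in sync with the server manually.
import requests


class ProductServiceError(Exception):
    pass


class ProductClient:
    def __init__(self, base_url, timeout=2.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def get_product(self, product_id):
        response = requests.get(
            "%s/api/v1/products/%s/" % (self.base_url, product_id),
            timeout=self.timeout,
        )
        if response.status_code == 404:
            return None
        if response.status_code != 200:
            raise ProductServiceError(response.status_code)
        data = response.json()
        # If the server renames or drops a field, this only fails at runtime.
        return {
            "id": data["id"],
            "name": data["name"],
            "price_pence": data["price_pence"],
        }
```

Multiply that by every consumer of every service and the opportunities for clients and servers to quietly drift apart add up quickly.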
REST microservices work, but without machine-readable API descriptions you are going to find yourself making a lot of mistakes.
The interactions between services have (perhaps predictably) become one of our main problems. Each of our services has unit tests, usually with decent test coverage, but we still hit integration bugs more often than we’d like:
- APIs not quite doing what people thought they did.
- Apparently minor API changes causing breaking behaviour for some use cases.
- Performance regressions in servers.
- Clients changing their behaviour and using up too many resources.
- Follow-on failures once one service starts to have availability problems.
MTTR is usually more important than MTBF
Fortunately, most of the time these issues are detected and resolved very quickly. The occasional rollback or server reboot isn’t ideal, but we can do them so quickly it’s not too painful. On the other hand, when we do miss a problem it can lie undetected for weeks. RESTful microservices don’t seem to have made us much better at detecting bugs.
For us there are a few key tools that have helped keep our MTTR so low:
- New Relic APM monitors every transaction to every microservice we run. This gives us rapid and accurate feedback whenever things start misbehaving.
- Our CircleCI build cluster makes building a new release reliable and easy.
- Empire + Amazon ECS makes deploying a fix trivial for any member of our team.
It’s the difference between a fix taking 15 minutes and a fix taking hours.
We could build an integration suite that enables us to test all kinds of different combinations and failures between services, but our experience has been that this just isn’t necessary to meet the availability requirements of the business.
Monitor everything
That said, it’s not perfect. When a bug does slip through our monitoring, it can take days to even notice it needs fixing. You need logs, but more than that you need tools that help you get useful information out of them. This is the area where we are probably weakest right now. Getting actionable data about services out of logs and business intelligence platforms is much harder than it needs to be.
Where next?
Having collected data about how this approach has been working for the past 2 years, we’ve started investigating some ways to improve things. In particular, we’ve been trying out gRPC.
gRPC is a multi-language API framework from Google. It uses protobuf to define messages and the various types of API call, and code generation to get your server up and running quickly and to produce clients in a wide range of languages.
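As a rough sketch of what that looks like in Python (the Products service, its messages and the products_pb2 / products_pb2_grpc modules below are hypothetical, assumed to have been generated from the .proto shown in the comment):

```python
# Hypothetical products.proto, compiled with grpcio-tools:
#
#   service Products {
#     rpc GetProduct (GetProductRequest) returns (Product);
#   }
#   message GetProductRequest { int64 id = 1; }
#   message Product { int64 id = 1; string name = 2; int64 price_pence = 3; }
#
# Code generation gives us typed messages, a servicer base class to implement
# and a ready-made client stub.
from concurrent import futures

import grpc

import products_pb2
import products_pb2_grpc


class ProductsServicer(products_pb2_grpc.ProductsServicer):
    def GetProduct(self, request, context):
        # request.id is already a typed field; no manual JSON parsing or validation.
        return products_pb2.Product(id=request.id, name="boots", price_pence=12000)


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    products_pb2_grpc.add_ProductsServicer_to_server(ProductsServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


# Client side: a generated stub instead of a hand-written HTTP wrapper.
def fetch_product(product_id):
    channel = grpc.insecure_channel("localhost:50051")
    stub = products_pb2_grpc.ProductsStub(channel)
    return stub.GetProduct(products_pb2.GetProductRequest(id=product_id))
```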
It’s been pretty good. We get multi-language clients at the click of a button. We get type checking in our API layers for free. It has a well-defined backwards compatibility story for API design. It’s simpler than deploying an entire Django app. And features like streams, built-in async support and protobuf’s compact serialisation give us some interesting performance opportunities, even in Python services.
The clear and well-enforced API definitions we get from protobuf seem like they could be valuable in helping to reduce some of the inter-service bugs we experience.
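As a small, hypothetical illustration of why (reusing the made-up Product message from above): adding a field only needs a new tag number, and a consumer still built against the old definition keeps working, because protobuf parsing simply skips tags it doesn’t recognise.

```python
# Hypothetical: Product gains a new `currency` field (tag 4) and products_pb2
# is regenerated. products_pb2_old stands in for a consumer still built
# against the previous .proto.
import products_pb2       # new definition: id, name, price_pence, currency
import products_pb2_old   # old definition: id, name, price_pence

payload = products_pb2.Product(
    id=1, name="boots", price_pence=12000, currency="GBP"
).SerializeToString()

# The old consumer parses the new payload without errors: the unknown tag is
# skipped, and every field it does know about comes through intact.
old_view = products_pb2_old.Product()
old_view.ParseFromString(payload)
assert old_view.price_pence == 12000
```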
However, it’s also revealed some limitations in our platform. In particular, gRPC currently requires HTTP/2, which is not widely supported by AWS: ELB doesn’t support it at all, and, strangely, ALB can’t talk HTTP/2 to the backend servers. This makes actually deploying things a bit tricky; so far we’ve been getting by with TCP mode, but it’s not ideal. Fortunately we’ve got some ideas, which we’ll be talking more about in the future, that should help us work around this issue.
We’re also starting to explore how we can improve our testing using the tools gRPC gives us. The standardised interface of the code gRPC generates also makes it possible to build simpler tools for mocking external services. For example, basing the mocks on the protobuf definitions can help a lot with detecting potentially backwards-incompatible changes in our API definitions without needing integration testing. It also makes spinning up a dev environment that would otherwise need to talk to 20 different services much more tolerable.
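A minimal sketch of what such a mock can look like, again assuming the hypothetical generated products_pb2 / products_pb2_grpc modules from above: the fake implements the same generated servicer interface as the real service, so tests and dev environments can point a stub at it without any of the real service’s dependencies.

```python
from concurrent import futures

import grpc

import products_pb2
import products_pb2_grpc


class FakeProductsService(products_pb2_grpc.ProductsServicer):
    """In-process stand-in for the real Products service."""

    def __init__(self, canned_products):
        self._products = canned_products  # {id: products_pb2.Product}

    def GetProduct(self, request, context):
        product = self._products.get(request.id)
        if product is None:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details("no such product")
            return products_pb2.Product()
        return product


def start_fake_products_service(port=0):
    """Start the fake on a local port; returns (server, bound_port)."""
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
    fake = FakeProductsService(
        {42: products_pb2.Product(id=42, name="boots", price_pence=12000)}
    )
    products_pb2_grpc.add_ProductsServicer_to_server(fake, server)
    bound_port = server.add_insecure_port("localhost:%d" % port)
    server.start()
    return server, bound_port
```

Because the fake speaks the same protobuf contract, regenerating it against a proposed .proto change is also a cheap way to surface incompatibilities before anything reaches an integration environment.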
Overall, microservices have been a good experience for us. We’ve gained a lot of flexibility and can see clear paths forward to solve our remaining problems. Building a more flexible and error-tolerant platform has opened up a lot of scope for future work that in the past would have just been too fragile and error-prone to attempt.