
Starling: How to Build a Bank in a Year
blog.container-solutions.com/starling-how-to-build-a-bank-in-a-year
Last year we launched The Cloud Native Attitude, a short book describing modern
infrastructure tools like Docker and Kubernetes, which included three case studies on
real-life Cloud Native enterprises (the Financial Times, Skyscanner and ASOS). We're about
to re-release the book with two new chapters and two more case studies: the challenger bank
Starling, and ITV. In the run-up to KubeCon, where we'll unveil the new book, we'll be
releasing the new chapters plus some bonus materials! First, we start with the
story of Starling Bank. How on earth did they build a bank in a year?!
Who Are Starling Bank and What Do They Do?
Starling Bank was founded in 2014. Based in London, it has been licensed and operating
since July 2016. The bank is a successful part of the British Fintech scene, which is a
spin-off from the UK’s strong financial services sector.
Starling are a mobile-only challenger bank who describe themselves as a “tech business
with a banking licence”. They provide a full current account accessed solely from
Android and iOS mobile devices.
They received $70m of investment in early 2016.
Services
Starling’s tech comprises a cloud-hosted backend system, talking to apps on users’
mobile phones, and third party services.
As well as a full current account, the bank provides MasterCard debit cards (customers
spend money on their Starling debit card, and the authorisations and debits arrive at
Starling’s servers through third-party systems). They also support direct debits, standing
orders and Faster Payments, which are again provided by backend integrations with other
third-party systems.
Infrastructure
Starting in 2016, Starling created their core infrastructure on Amazon Web Services
(AWS) inside just 12 months. Their highly articulate CTO Greg Hawkins likes to say, “we
built a bank in a year”.
In common with everyone I’ve interviewed for this series of case studies, Starling use a
microservices architecture of independent services interacting via clearly defined APIs.
As of March 2018 they have ~20 Java microservices. That number will increase.
Many companies architect their services for division by team. This is known as a
“Conway’s Law” approach, where each team looks after one or more dedicated
microservices. However, like the FT, Starling have chosen not to do this for now. Instead,
they have divided services by functional responsibility rather than by team, and every
service can be developed on by multiple teams. They operate this way because they can. As
Hawkins puts it, “we’re taking advantage of the flexibility we get from our small size - we
can reconfigure ourselves very quickly.” As they continue to grow, Greg recognises that
they will lose some of that flexibility, “it won’t last forever”, and will then adopt smaller
microservices and a more Conway-like model.
Deployment and Operations
Whilst services can be deployed individually, for convenience Starling usually use a
simultaneous deployment approach where all services in the backend are deployed at
once. This trade-off has evolved as a way of minimising the small amount of overhead
around releases while keeping release frequency up. They built a rudimentary
orchestrator themselves to drive rolling deploys based on version number changes (scale
up AWS, create new services on the new instances, expose those new services instead
of the old ones, turn off the old ones and scale down their AWS instances).
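The rolling-deploy steps above can be sketched in a few lines of Java. This is an illustrative toy, not Starling's actual orchestrator; all class, method and instance names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a version-driven rolling deploy: scale up, start the
// new version, switch traffic over, then retire the old instances.
public class RollingDeploySketch {

    // Provision one fresh instance per old one, running the new version.
    static List<String> scaleUp(List<String> oldInstances, String version) {
        List<String> fresh = new ArrayList<>();
        for (int i = 0; i < oldInstances.size(); i++) {
            fresh.add("i-" + version + "-" + i);
        }
        return fresh;
    }

    // Full deploy: returns the instances that end up taking traffic.
    static List<String> deploy(List<String> oldInstances, String newVersion) {
        // Scale up and start services at the new version alongside the old ones.
        List<String> fresh = scaleUp(oldInstances, newVersion);
        // Expose the new instances instead of the old ones (e.g. repoint the
        // load balancer), then terminate the old instances to scale back down.
        oldInstances.clear();
        return fresh;
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(List.of("i-v41-0", "i-v41-1"));
        live = deploy(live, "v42");
        System.out.println(live); // [i-v42-0, i-v42-1]
    }
}
```

The key property the real orchestrator relies on is the same one this sketch shows: old and new instances coexist briefly, so the fleet is never empty during a deploy.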
Starling generally re-deploy their whole estate 4-5 times per day to production. So, new
functionality reaches prod rapidly, and it’s business-as-usual to apply security patches
fast when necessary.
As always, API management is a tough challenge for frequent deployments. You could
argue (naively) that simultaneous deployment makes this easier because you are always
re-deploying both sides of your API at once, but this isn’t really true for several reasons:
Starling don't mandate simultaneous deployment and retain the ability to deploy
services individually. Simultaneous deploy is a convenience that will change as the
organisation grows.
During the minutes a deployment takes to roll across all the servers, services are
inevitably at different versions.
Any individual service may fail to deploy, leaving mismatching versions in
production.
The system must handle all of this safely, which means clients and services must
incorporate backwards API compatibility. To ensure this, part of their release process is
validation that there are no breaking API changes (this is straightforward to check using
the Swagger tool combined with the fact that their client-side calls are in isolated
"connector" libraries). As their system size has increased, they've also started introducing
Pact to help.
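To make "no breaking API changes" concrete, here is a toy schema-diff check (not the Swagger- or Pact-based tooling Starling actually use): modelling an API as a map of endpoints to response fields, a change is backwards compatible if it only adds things, and breaking if it removes an endpoint or a field that old clients depend on.

```java
import java.util.Map;
import java.util.Set;

// Toy breaking-change check between two API schemas, modelled as
// endpoint -> set of response fields. Removing an endpoint or a field breaks
// old clients; purely additive changes are backwards compatible.
public class ApiCompatCheck {

    static boolean isBackwardsCompatible(Map<String, Set<String>> oldApi,
                                         Map<String, Set<String>> newApi) {
        for (Map.Entry<String, Set<String>> e : oldApi.entrySet()) {
            Set<String> newFields = newApi.get(e.getKey());
            if (newFields == null) return false;                    // endpoint removed
            if (!newFields.containsAll(e.getValue())) return false; // field removed
        }
        return true; // anything extra in newApi is a safe addition
    }

    public static void main(String[] args) {
        Map<String, Set<String>> v1 = Map.of("/account", Set.of("id", "balance"));
        Map<String, Set<String>> additive =
            Map.of("/account", Set.of("id", "balance", "currency"),
                   "/card", Set.of("id"));
        Map<String, Set<String>> breaking = Map.of("/account", Set.of("id"));

        System.out.println(isBackwardsCompatible(v1, additive)); // true
        System.out.println(isBackwardsCompatible(v1, breaking)); // false
    }
}
```

Running a check like this against the previous release's schema in CI is what makes it safe for old and new service versions to coexist mid-deploy.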
From the start, the bank used Docker containers as a packaging format, and EC2
instances as their “units of isolation” (i.e. to separate one running service from another).
They do not yet use containers as their primary form of isolation, although they do use
them to isolate some specific processes such as components of their monitoring
application Prometheus. They also don’t use an orchestrator. However, they are looking
closely at the popular open source orchestrator Kubernetes (K8s). Specifically, they are
interested in
the abstraction it provides to machines and applications, which helps with portability
(going cross-cloud)
the cost savings and improved performance they would get from using containers
as their units of application isolation and running on larger VM instances
the sophisticated additional deployment options that Kubernetes provides.
Starling have made a strategic decision not to take on the operational overhead of
managing Kubernetes themselves on AWS, but they are closely watching the progress of
AWS’s managed K8s service (EKS) and are likely to use that in the future if it reaches
their required level of functionality and stability. That’s not a crazy decision: there is
significant ops work involved in managing Kubernetes on AWS yourself.
Cloud-wise, Starling’s infrastructure is entirely hosted on Amazon and they are happy
there. However, regulatory requirements and commercial considerations mean they’ll
need to diversify into cross-cloud in the future. They’re therefore beginning to work with
Google Cloud too, but that brings a few interesting challenges. Google Cloud is
more advanced than AWS in some areas but way behind in others.
Google’s managed Kubernetes service, GKE, is currently much better than EKS.
Starling have built a lot of custom advanced security features like temporary
privilege raising on top of AWS’ strong APIs that will need to be re-implemented for
Google.
Like any company choosing to use multiple cloud vendors, Starling will need to balance
the value of consistent operations against the desire to get the best out of both clouds.
Development
Stack-wise, Starling are a Java house.
They deploy their 20 Java services with an embedded web server inside Docker
containers.
They configure their estate using CloudFormation plus homegrown scripting.
They make heavy use of the Nginx load balancer and ELBs.
They use an ELK (Elasticsearch, Logstash, Kibana) stack for logging and Grafana
and Prometheus for monitoring.
On the client-side (mobile applications), they use Java for Android and Swift for iOS.
Architecture
Architecturally, Starling have an interesting set of needs:
security and data integrity are their highest priorities
they have some room for manoeuvre on performance.
Security
Security-wise, they sensibly make extensive use of service isolation at the network level
(aka “microsegmentation”) for which they use separate VPCs as well as subnets. They
encrypt all data in transit and at rest, and their inter-service communication is via
encrypted RESTful interfaces. They also have a strong focus on user-device security.
Specifically:
They guarantee that they are always talking to the correct, original device (achieved
through private keys).
They are careful to ensure the device has not been compromised. This is why, for
example, you cannot run their apps on jailbroken devices.
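The device-binding idea above can be illustrated with a standard challenge-response scheme using `java.security` from the JDK. The details here (EC keys, SHA256withECDSA, the enrolment flow) are illustrative assumptions, not Starling's actual protocol:

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Sketch of device binding: the device holds a private key; the server sends a
// fresh challenge and verifies the device's signature with the enrolled public key.
public class DeviceCheckSketch {

    public static void main(String[] args) throws Exception {
        // At registration: the device generates a key pair and enrols the
        // public key with the bank. The private key never leaves the device.
        KeyPair deviceKeys = KeyPairGenerator.getInstance("EC").generateKeyPair();

        // At login: the server issues a fresh random challenge...
        byte[] challenge = "nonce-1234".getBytes(StandardCharsets.UTF_8);

        // ...the device signs it with its private key...
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(deviceKeys.getPrivate());
        signer.update(challenge);
        byte[] signature = signer.sign();

        // ...and the server verifies with the enrolled public key. A correct
        // signature proves the request comes from the original device.
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(deviceKeys.getPublic());
        verifier.update(challenge);
        System.out.println(verifier.verify(signature)); // true
    }
}
```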
Performance
Starling offer user-facing services that are latency-sensitive. However, their own backend
is seldom the performance bottleneck for those services. Card transactions, for example,
pass through several layers of third party systems before reaching the bank’s servers.
Starling’s systems generally introduce less than 5% of the latency on these
performance-sensitive operations.
So, Starling’s backend performance would have to be very severely impacted before it was
noticeable to end users. They can therefore afford to optimise their architecture for
robustness, simplicity, auditing, and data integrity rather than super-speed.
This high need for resilience and auditing and the slightly lower requirement for
operational performance influenced their decision to use asynchronous APIs between
their services. Each service has its own in-built asynchronous inbound command bus (a
kind of queue) backed by a database. This architecture provides reliable message
passing, rigorous decoupling, resilience, auditability and replayability as well as better
understandability for the system. Given their operational priorities, asynchronous APIs
were a sensible choice.
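A minimal sketch of such a database-backed command bus follows. An in-memory list stands in for the database table, and all names are illustrative rather than Starling's actual design; the point is that callers append commands to a durable log and return immediately, while a worker drains the queue asynchronously and the log remains for audit and replay:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of an inbound asynchronous command bus backed by a durable log.
public class CommandBusSketch {

    record Command(long id, String type, String payload) {}

    private final List<Command> log = new ArrayList<>();       // "table": kept for audit/replay
    private final Queue<Command> pending = new ArrayDeque<>(); // work still to process
    private long nextId = 1;

    // API call: persist the command and return immediately.
    public long submit(String type, String payload) {
        Command cmd = new Command(nextId++, type, payload);
        log.add(cmd);      // durable write first: nothing is lost if the worker is down
        pending.add(cmd);
        return cmd.id();
    }

    // Worker loop: process commands in order; returns what was handled.
    public List<String> drain() {
        List<String> handled = new ArrayList<>();
        Command cmd;
        while ((cmd = pending.poll()) != null) {
            handled.add(cmd.type() + ":" + cmd.payload());
        }
        return handled;
    }

    // Because every command is logged, processing can be audited or replayed.
    public List<Command> auditLog() {
        return List.copyOf(log);
    }

    public static void main(String[] args) {
        CommandBusSketch bus = new CommandBusSketch();
        bus.submit("CreditAccount", "acct-1:500");
        bus.submit("DebitAccount", "acct-1:120");
        System.out.println(bus.drain()); // [CreditAccount:acct-1:500, DebitAccount:acct-1:120]
        System.out.println(bus.auditLog().size()); // 2
    }
}
```

The decoupling comes from the queue: the caller's success depends only on the durable write, not on the downstream service being up at that moment.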
Testing
From a testing perspective, Starling Bank embraced chaos engineering from a very early
stage. Chaos engineering is an approach to testing critical systems that was pioneered by
Netflix. The idea is that testing to destruction happens in production. Sounds crazy?
Actually it makes good sense: it ensures not only that your functionality is tested, but
also that your production system can quickly identify and recover from issues. It’s like
meta-testing. It’s painful at first, so it’s one to start gradually!
Conclusion
Starling are very happy with the architectural choices they have made so far. Again, they
demonstrate that not everyone needs to or should make identical decisions.
They have chosen not to orchestrate in production yet. They judge that costs them money
in hosting but currently makes their operations simpler. Once AWS release a stable EKS,
Starling will use managed Kubernetes on AWS. That is a perfectly sensible approach that
works well for them.
They have chosen async over synchronous inter-service comms because they prioritise
auditability and reliability over hyper-performance. Again, they have weighed the tradeoffs and made a decision based on a good understanding of their own current situation
and needs.
Their original motivation for hosting in the Cloud was that Hawkins anticipated it would
help Starling move faster. He felt that infrastructure as a service, by supporting DevOps
and an iterative approach, would help him create an innovation culture in his tech teams.
That very much appears to have paid off.
Overall, Starling Bank seems to be an excellent example of the need to consider context
when making architectural choices. They seem a sensible bunch; I’m very tempted to
move my current account there!
Read more about Starling and other companies, like the FT and ASOS, in part three of
the Cloud Native Attitude.