Coding is no longer the constraint: Scaling devex to te… — Transcript

Spotify shares how AI tools like Claude and Honk revolutionize developer productivity and automate code maintenance at scale.

Key Takeaways

AI coding tools dramatically boost developer productivity and deployment frequency at Spotify.
Automated fleet management and AI-driven code migrations reduce manual maintenance work significantly.
Large-scale codebase management benefits from combining AI agents with robust verification and orchestration systems.
Iterative improvements in LLM capabilities enable tackling complex code changes previously difficult to automate.
Spotify’s approach demonstrates scalable AI integration in software engineering workflows for large organizations.

Summary

Spotify has nearly 3,000 engineers and performs about 4,500 daily deployments across a mix of large monorepos and thousands of smaller polyrepos.
The adoption of AI coding tools, especially Claude Code, surged after the Opus 4.5 release, with over 99% of engineers using AI tools weekly.
94% of engineers report increased productivity due to AI tools, reflected in a 76% increase in pull request frequency.
Most pull requests are now co-authored by AI agents and developers, accelerating code shipping.
Spotify faced challenges with codebase growth outpacing engineer growth, leading to increased maintenance burden.
To address this, Spotify developed Fleet Shift, an automated fleet management system that merges millions of maintenance PRs without human intervention.
Simple code changes are easily automated, but complex migrations require advanced solutions due to code variability and Hyrum's Law.
Spotify experimented with LLMs early on to automate code modifications, culminating in the tool Honk, which uses Claude under the hood.
Honk runs in Kubernetes pods, integrates with trusted verification tools, and performs CI builds across multiple OSes to ensure safe automated changes.
Honk is integrated with Fleet Shift to orchestrate large-scale code migrations across thousands of repositories efficiently.

Full Transcript

Speaker A

Hey everyone, yes, I'm Niklas. I was very surprised to see my face on screen earlier because I had completely forgotten that Boris was going to mention Spotify as part of the keynote. So I'm here to give you a bit of a rundown on

Speaker A

how we're approaching the AI transition at Spotify. So let me start with a little bit of an introduction to Spotify. Anyone in here who's a Spotify user?

Speaker A

Oh, lots of hands. Good. So we're a fairly sizable engineering org at this point, close to 3,000 engineers. We've spent many years trying to optimize our developer experience and how we build products. We try to make sure that it's as easy as possible to deploy and ship changes

Speaker A

to our users. One way to illustrate that is that we do around 4,500 deployments every day to our production environment. We run on a mix of repositories. I'll come back to this later. Some are very large monorepos. Our

Speaker A

backend is in a 40-million-line monorepo. And then we have lots and lots of smaller polyrepos, thousands of them.

Speaker A

The AI transition for us has been a journey of very rapid adoption curves. We roll out tools internally all the time to make our developers more productive. But we have never seen the rate of adoption that we've seen rolling out AI coding tools.

Speaker A

And you can see in particular how Claude Code, orange in this diagram, completely exploded. It's a little bit hard to see due to the holiday break, but it really happened around the Opus 4.5 release in November last year. And since then,

Speaker A

growth and usage of Claude in particular, but AI tools in general, has gone completely bananas. And today, more than 99% of our engineers use AI coding tools every week. And we do a recurring engineering survey of all our engineers, and in the latest one, which just came in last

Speaker A

week, 94% of our engineers report that using AI tooling has helped them become more productive. And that's with a record high self-assessed productivity. We can also look at productivity in other ways. One way is to look at PR frequency as a proxy for how fast and

Speaker A

how much we're able to ship. We're seeing today an increase of 76% in PR frequency. As I was working on these slides over the last two weeks, I had to change this number because it keeps growing all the time. And

Speaker A

by now, by far, most of the PRs that we ship are authored by an AI agent together with the developer. One thing you can see in this curve, although it's actually hard to see here, is that the number of

Speaker A

PRs has been very slowly growing over a longer period of time. But you can see that jump again happening around the Opus 4.5 release. That was when this took off for us. So this of course then means that we

Speaker A

also have an explosive growth in our code. Luckily, that's something that we came prepared for. We've seen this for a long time, also prior to AI. In fact, a few years ago, we noticed that our code base, our production code

Speaker A

base was growing seven times faster than the number of engineers. So that meant that engineers would spend more and more of their time maintaining our existing codebase compared to being able to build new features and value for our users.

Speaker A

So we realized that we needed to fix this. So we started an effort to automate as much of that maintenance as possible. A lot of that maintenance comes down to pretty dull things that we just need to do, you know, migrate from this

Speaker A

version to this version, deprecate this API, fix this security vulnerability, those types of things.

Speaker A

But that took a lot of time for our developers, and the way we typically did those migrations back then was to send out some migration path to hundreds of teams saying, hey, you need to upgrade from this Java version to this

Speaker A

Java version for your components. The teams would go ahead and do that, and this would typically take us months to complete one of those upgrades across many thousands of components. That was not fun for anyone. In that same engineering

Speaker A

survey back then, migrations was the top thing that users, or sorry, our developers were frustrated about. So, we imagined, instead of doing this like component per component and fairly manually, can we imagine a way where we do this as

Speaker A

a way to mutate our entire fleet of components, figure out a way to do that. And we built this out, built out the infrastructure for this, something we call fleet management, and the underlying system that we use is called Fleet Shift.

Speaker A

And up until today, we've now merged two and a half million of those automated maintenance PRs, work that our developers did not have to do. The vast majority of those, the green part of this graph, have been auto-merged, so there's no

Speaker A

human in the loop. It's automation creating the PR to begin with, automation validating that the PR is safe to merge, and automation then going ahead and merging it without any developer needing to care about that change. This happens every day. We ship thousands

Speaker A

of these every day. So, that was all pre-AI. And one thing that we noted pretty quickly was that this works really well for simple changes. That might be changes to configuration, it might be bumping some dependency in your build file, those types of things. Works great. But once

Speaker A

you get into a little bit more complex changes, like replacing API calls, those types of things, the scripts that we used to run these shifts across our fleet became incredibly complicated. Code, as it turns out, has a very, very wide API surface. There are many, many ways to

Speaker A

achieve the same thing, even if it's just calling a method. And when you write that script and you run that across millions of lines of code and thousands of components, you are going to find every corner case, and you need to deal with that in

Speaker A

your migration script. There's even a term for this, it's called Hyrum's Law, coming from an engineer at Google who discovered this many years before we then ran into it.
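To make the corner-case problem concrete, here is a minimal sketch, not Spotify's actual tooling. The migration itself (`old_client.fetch` to `new_client.get`) and the function `naive_migrate` are hypothetical names chosen for illustration. It shows how a deterministic, regex-based script handles the call shape its author anticipated and silently misses other ways of reaching the same method.

```python
import re

# Hypothetical migration: rewrite old_client.fetch(...) to new_client.get(...).
PATTERN = re.compile(r"\bold_client\.fetch\(")

def naive_migrate(source: str) -> str:
    """Rewrite only the literal call shape the script author anticipated."""
    return PATTERN.sub("new_client.get(", source)

# Three ways real code might reach the same method.
variants = {
    "direct": "result = old_client.fetch(url)",
    "aliased": "f = old_client.fetch\nresult = f(url)",
    "dynamic": 'result = getattr(old_client, "fetch")(url)',
}

migrated = {name: naive_migrate(src) for name, src in variants.items()}

# Any variant that still mentions old_client was not migrated.
missed = sorted(name for name, src in migrated.items() if "old_client" in src)

for name in variants:
    print(name, "->", "MISSED" if name in missed else "migrated")
```

Each missed variant would need its own special case in the script, which is how such scripts grow complicated when run across thousands of components.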

Speaker A

So, pretty early on as LLMs came about, we figured that, hey, instead of writing these deterministic scripts to do these code modifications, can we use an LLM for this?

Speaker A

So very early we started iterating on trying to do this, prior to Claude and similar tools. And we noticed it was challenging initially. The models were just too stupid. The way we were trying to do it was just too stupid. But over

Speaker A

time, over many iterations, we started figuring out the patterns for it, and the models got better. Out of this came a tool that we now call Honk. Boris mentioned this this morning. It has a silly name and a silly icon, but it's

Speaker A

a very useful tool, as it turns out. And Honk is really the result of all of these iterations, of us trying different ways of solving this problem of automating these still relatively simple code changes, but again, applied over

Speaker A

many, many variants of code. It started out very differently, but today Honk has Claude under the hood using the Agent SDK. And it wraps up the Agent SDK inside our own harness, inside a Kubernetes pod so we can schedule many of these running in our cloud

Speaker A

environment. And we give it access to a set of trusted tools. The chart here just says the verification tools, but there's actually more tools that it has available to it. And for verification, it's able to run builds of the code running in our CI environment. So one thing that is important

Speaker A

to us is that we can run our builds across multiple operating systems, for example, because our clients run on many different operating systems. So Honk has available tools that it can use to verify that its changes are correct. And

Speaker A

again, we run many of these every day. And then we integrate this into that fleet management tooling that I mentioned before. We use Fleet Shift, the tool whose graph I showed before, to schedule and orchestrate these changes across our

Speaker A

thousands of repositories, and Honk sits in the middle doing the actual code changes. And it might look something like this. In this case, this is a fairly small migration targeting 39 repositories, but for a team that owns this, they can go in

Speaker A

and see what's the status of this particular shift today: how many PRs have been created, how many have been merged, how many failed in CI so I need to take a look at them, those types of things. And as Boris mentioned this

Speaker A

morning, we're seeing pretty significant time savings from this. What used to be what I described before, hundreds of teams doing migrations for their components, taking weeks and weeks or months, now can be done by a single engineer in

Speaker A

a few days. The latest Java migration that we did, we run our backend mostly on Java on the JVM. The latest Java migration we did took three days using these tools. And we're making this now available. So, we have a commercial

Speaker A

offering for other companies through our Backstage developer portal, and we're making this available as a product in that packaging. So, if this is something that is relevant for your company, you can take a look there. But as it turns out,

Speaker A

developers are very resourceful and innovative, so pretty quickly folks figured out that, hey, this Honk thing that we run for all of these migrations, how about I figure out how I can call that over Slack and have it do things for me

Speaker A

that way? So, similar to how you might invoke Claude or other tools over Slack, you can do that with Honk at Spotify as well. And it's become very common that people will have a Slack conversation about something, then just at-mention

Speaker A

Honk, and Honk goes off and works on that and comes back with a PR. So we're seeing more and more of these patterns evolve around Honk. And in fact, yesterday we released Honk V2. The V2 versioning is a little bit off

Speaker A

because I think it's actually like the eighth version of Honk, so I don't know what we did with the versioning, but it doesn't matter too much. So this week we have Hack Week at Spotify and we released the alpha of Honk V2, which

Speaker A

is a pretty significant addition of features for Honk. And it really now builds towards this world where developers are using it more interactively. So we've integrated it with our agent orchestration tool that we call Chirp. This is similar to

Speaker A

what you can do with Cloud Agents or with Agent Deck or similar tools, but it has a few more features and it's integrated into our infrastructure. This is the way that you can run many, many agent sessions at the same time and coordinate

Speaker A

those types of things. And Honk is built into that, so you can use Chirp to schedule Honk jobs, for example. You can also collaborate with other developers on shared sessions. So instead of it being just you in front of your agent, you

Speaker A

are now sharing that agent session with more people, and you can collaborate on that and give feedback and ideas and whatnot on that. So basically, imagine Google Docs or something similar, but for Claude. And that then also groups up into larger efforts. Imagine you're working on a completely new feature or product,

Speaker A

you're working with a team on that, you can have a project that you're sharing, and in that you can have many sessions with Honk where you collaborate on working towards whatever that goal is. This is also available on any device and whatnot,

Speaker A

so users can use them from wherever they are. And lots and lots more features that we're rolling out going forward. We're very excited about Honk V2, and in particular, I'm gonna say personally, I'm very excited about these like multiplayer features of imagining how

Speaker A

agents actually collaborate with multiple developers and teams. All right. Let me switch gears a little bit. So I want to also talk about how we try to optimize our code base to make agents as effective as possible in our code. So we've had for many, many

Speaker A

years, longer than I've been here, and I've been at Spotify for 15 years. This happened before I arrived, so I don't actually know exactly how old it is. But we've had this belief that the fewer technologies we use, the faster we will be able to

Speaker A

go. And this basically comes down to a few different aspects. One, if we have a set of technologies that we know really, really well, we're really deep experts on them, we will be able to build better things on top of those. We can

Speaker A

also eliminate a lot of small decisions for teams. Instead of having to pick the technology for everything you're building, there's a ready set of technologies available to you that hopefully solves your problem. It also means that it's much easier to collaborate. If you're working with some other team on their

Speaker A

components, and their components look roughly the same as yours, it's going to be easier for you to contribute to those. And similarly, if we need to move components around or developers move to a different team, things look roughly the same over there. So

Speaker A

if you look at our typical backend service at Spotify, they will all look very similar. Same technology stack, roughly the same design patterns and so on.

Speaker A

And we think this makes sense for agents as well. So for many years now, we've been driving towards a more and more standardized stack.

Speaker A

Less unnecessary variance, at least. We want some level of variance. We want to experiment with new technologies, evaluate new things that could be good for us, but we don't want to do that willy-nilly. We want to be intentional about

Speaker A

it. And we see that this leads to more effective teams at Spotify. And we believe that it also leads to more effective agents.

Speaker A

Simply, if Claude has a lot of other code to look at, and that code looks roughly consistent, Claude will do a better job. That's what we're seeing. And we actually have code bases that are more fragmented, and we can actually see Claude

Speaker A

perform worse in those code bases. And the starting point for this is, I mentioned Backstage before, Backstage is our developer portal. It used to be that it provided a single pane of glass for us developers. Prior to Backstage within Spotify, we had, I think, roughly like a

Speaker A

hundred different tools that you as a developer would go to. There was one tool to check your deployments, one to look at CI, one to look at A/B tests and whatnot. And it was very, very confusing. All of those tools were kind

Speaker A

of shit as well, like they weren't particularly good. So we thought there was an opportunity to consolidate this and provide a better experience for our developers. And it really started with this notion of a catalog of all our software. I mentioned before that

Speaker A

we have thousands of components in production, and Backstage came about just as a way to know who owns one of those components. Let's say we have an incident and I need to be able to page someone on the team that owns it. Before

Speaker A

Backstage, I couldn't even figure out who that owner was. So, it started as a way to just have a catalog for that. Over the years, it's then grown into having lots and lots of tools around those components as well. So, today, as a

Speaker A

human developer, everything I do when I need to take an action on some of our software components, I'm going to do that in Backstage. And as it turns out, that's equally useful for agents. So we expose all of these as

Speaker A

MCPs or command line tools for our agents, and Claude can go look up who's an owner for something, and it can go ping that team on Slack if it needs to ask questions about it, for example. This has turned out to be incredibly

Speaker A

useful for us, and in particular, as we've scaled up, it allows us to keep track of everything we have going on. It is also a way for us to drive our standardization. So I mentioned this before, we have strong recommendations for which technologies to use for a particular problem.

Speaker A

And we describe this in a few different ways. We have a technology radar, as many companies do, that just lists all the technologies that are available and what state they're in. Like this one we recommend using, this one we don't recommend using, and

Speaker A

so on. We also have what we call Golden State. So this is essentially for a particular type of component. If you're this type of back-end service or you're this type of iOS view, these are the technologies and practices that we recommend that you use. And

Speaker A

we have a UI in Backstage that we call Soundcheck, where you as a team can go in and self-assess this. This is an example of such a view. You can see here some component and it has a requirement to

Speaker A

define a valid owner. That was what I was talking about before. This allows us to then make our code base much, much more consistent and has been something that we've been driving over several years. It's been very, very powerful and set us

Speaker A

up well for where we are now with AI. And we then also combine that with static analysis and linting. So these things are then implemented in our codebases as checks, so that when Claude works in our codebase, it will get immediate

Speaker A

feedback on whether it's using the right set of technologies and the right set of design patterns. So if Claude comes up with a way to call gRPC that we know is not optimal for our infrastructure,

Speaker A

Claude will get feedback from our lint system to correct that. And we think this is super useful both for our developers and for our agents. And we see this all the time. When I work with Claude in our codebase,

Speaker A

I will see Claude run into these lint checks all the time and correct itself.
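As a rough illustration of that feedback loop, here is a minimal sketch of a lint check that returns an actionable message. The talk does not say which gRPC usage is discouraged, so the rule here is an assumption made up for the example: it flags direct calls to `grpc.insecure_channel` and points to a hypothetical in-house wrapper, `platform_rpc.channel`. The check only parses source text and does not import gRPC.

```python
import ast

# Hypothetical rule: direct use of grpc.insecure_channel is discouraged in
# favor of an (equally hypothetical) in-house wrapper, platform_rpc.channel.
BANNED_CALL = ("grpc", "insecure_channel")
SUGGESTION = "use platform_rpc.channel(...) instead"

def lint(source: str) -> list[str]:
    """Return one actionable message per banned call found in the source."""
    messages = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and (node.func.value.id, node.func.attr) == BANNED_CALL
        ):
            messages.append(
                f"line {node.lineno}: grpc.insecure_channel is not allowed; {SUGGESTION}"
            )
    return messages

# What an agent might write first, and what it writes after reading the message.
agent_attempt = 'import grpc\nchan = grpc.insecure_channel("svc:443")\n'
corrected = 'import platform_rpc\nchan = platform_rpc.channel("svc")\n'

print(lint(agent_attempt))  # one message pointing at line 2
print(lint(corrected))      # no messages
```

The useful property is that the message names both the problem and the preferred alternative, so a developer or an agent can act on it without further context.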

Speaker A

It's an awesome way to drive this type of standardization. All right. I'll try to sum this up. First, hopefully this came through, but the need for strong engineering practices has not gone away with agents. It remains as important as it was before.

Speaker A

Boris mentioned verification this morning. We fully agree with that. The ability to have your code being well tested and having your agents being able to invoke those tests, either Claude running locally, or Honk with the verification tools that I showed before, that is the way to make your agents be

Speaker A

much more autonomous and come up with better solutions in your code. Similarly, what I just talked about in terms of making sure that your codebase is consistent and it's well-defined what developers and agents are supposed to do, turns out to make agents work much, much better, at least in our case. We're

Speaker A

also very careful about trying to measure everything, measure every aspect of our developer experience. So we instrument all our infrastructure, we instrument all our PRs and so on, and we can collect that and measure how we're doing. So some of the numbers that I've been showing here today come

Speaker A

from that instrumentation, and we have tons and tons of metrics that we're tracking. We believe that human judgment matters just as much as it did before, or even more now that we're able to move faster. We need to

Speaker A

figure out where to apply that human judgment, though. So, I mentioned the increase in PR frequency. The flip side of that is that we now have 76% more PRs to review. One of the most frequent pieces of feedback from our developers at the moment is that there are just too many freaking PRs to review.

Speaker A

So we need to figure out where we apply humans to review those PRs where it matters the most. So that won't be all PRs. We're already auto-approving some PRs that we think are safe enough to merge without human review, and then

Speaker A

we try to focus the human review where it really matters. And I think this will be recurring. We'll figure out over time where we need the human judgment to be applied. And that's going to be both, I think, prior to invoking the agent

Speaker A

and post invoking the agent. And lastly, as we're moving faster, we're seeing that coding is much less of a bottleneck now. It used to be that if you looked at the way that we build our products, our product development lifecycle, we were mostly waiting

Speaker A

on developers building out features, implementing them. And that might have been early in the phase where we need to validate something, or it might be building that out for production. Both of those cases, that was one of the main bottlenecks

Speaker A

that we had as a company. And that is now starting to loosen up. I won't say that it's completely eliminated, but it's starting to be reduced. So, for example, for that early validation, Spotify is a company that has

Speaker A

too many ideas, way more ideas about what we could do for our users than we've ever had the capacity to build. And having that many ideas about what we can do means that we need to validate

Speaker A

which of those ideas make sense. And one way we can do that is to prototype. Prototyping used to be a fairly expensive thing for us to do. You had to convince a bunch of developers to build something for you so you can then

Speaker A

show that to other people. One thing that Claude and agents allow us to do is to let anyone prototype in our actual production codebase. So now at Spotify, you can open up Claude in our client monorepo, and through a set of skills and some infrastructure

Speaker A

that we've built, you can prompt Claude to build out any feature that you want to try out and imagine. Claude will build that for you. You will get an app back that you can install and test on your device and share with

Speaker A

other people within Spotify to actually get a sense of what it feels like to use that idea you had. And this has brought prototyping from something that could take days or weeks to literally taking minutes now. So anyone, including as it turns out one of our CEOs, is now building these prototypes for

Speaker A

the ideas they have. So that's for prototyping, and then the same is true for building things out in production. But what we're seeing is that this is moving the constraints around. So where coding used to be the bottleneck, we're now seeing more and more of those constraints and bottlenecks turning

Speaker A

into other aspects of how we build products. And in particular, where we have human decision-making in the loop. So again, things like deciding what we're going to ship to our users, or which ideas we want to explore. Those things used to be...

Speaker A

We didn't have to make that many of those decisions, because again, we were constrained on how fast we could build things. But as that constraint lifts, we need to figure out better and more effective ways of making those decisions. And we're seeing this

Speaker A

now and we're trying to shift around how we plan the work we do and how we make those decisions. It is still very much ongoing learning at the moment, and a set of experiments that

Speaker A

we're running, but I think in six months or so, we'll have a very, very different way of building products compared to what it looked like previously.

Speaker A

That was it. Again, if you want to try out Fleet Shift and Honk, that's where you can take a look at that. And thanks for having me.

Topics: Spotify, AI coding tools, developer productivity, Claude Code, Honk, Fleet Shift, code automation, large-scale software engineering, LLM, software maintenance

Frequently Asked Questions

How has AI adoption impacted developer productivity at Spotify?

AI coding tools are used weekly by over 99% of Spotify engineers, with 94% reporting increased productivity. This is evidenced by a 76% increase in pull request frequency, showing faster and more frequent code shipping.

What is Fleet Shift and how does it help Spotify manage code maintenance?

Fleet Shift is Spotify’s automated fleet management system that schedules and orchestrates code changes across thousands of repositories. It has merged over 2.5 million automated maintenance PRs, most of which are auto-merged without human intervention.
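The auto-merge flow described in the talk (automation creates the PR, automation validates it, automation merges it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Fleet Shift's implementation: the `MaintenancePR` type, the status names, and the rule that a PR merges only when at least one check ran and all checks passed are all hypothetical. The summary function mirrors the kind of per-shift status view mentioned in the transcript.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MaintenancePR:
    repo: str
    checks: dict[str, bool] = field(default_factory=dict)  # check name -> passed?
    status: str = "created"

def process(pr: MaintenancePR) -> MaintenancePR:
    """Auto-merge only when at least one check ran and every check passed."""
    if pr.checks and all(pr.checks.values()):
        pr.status = "merged"
    else:
        # No evidence of safety means a human has to look at it.
        pr.status = "needs_attention"
    return pr

def shift_summary(prs: list[MaintenancePR]) -> dict[str, int]:
    """Count PRs per status, like a per-shift status view."""
    return dict(Counter(pr.status for pr in prs))

prs = [
    process(MaintenancePR("repo-a", {"build": True, "tests": True})),
    process(MaintenancePR("repo-b", {"build": True, "tests": False})),
    process(MaintenancePR("repo-c", {})),  # no checks ran, so not auto-merged
]
print(shift_summary(prs))
```

Treating a PR with no checks as unsafe is a deliberate choice in this sketch: merging without a human only makes sense when automation has positive evidence that the change is safe.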

What role does the tool Honk play in Spotify’s AI transition?

Honk is an AI-driven tool built on Claude and the Agent SDK that automates complex code migrations. It runs in Kubernetes pods, uses verification tools to ensure safe changes, and integrates with Fleet Shift to manage large-scale code updates efficiently.
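The verify-and-retry idea behind this can be sketched as a small loop. This is an assumption-laden illustration, not Honk's code: it does not use the Agent SDK, the agent and the CI build are replaced by plain stub functions (`fake_agent`, `fake_build`), and the target list and retry limit are invented for the example.

```python
from typing import Callable, Optional

TARGETS = ["linux", "macos", "windows"]  # assumed target list, for illustration

def verify(change: str, build: Callable[[str, str], bool]) -> list[str]:
    """Return the targets on which the build failed for this change."""
    return [t for t in TARGETS if not build(change, t)]

def run_with_verification(
    propose: Callable[[list[str]], str],
    build: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Ask for a change, verify on every target, feed failures back, repeat."""
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        change = propose(failures)
        failures = verify(change, build)
        if not failures:
            return change, attempt
    return None, max_attempts  # give up; a human should take a look

# Stubs standing in for the agent and for CI.
def fake_agent(failures: list[str]) -> str:
    return "patch-v2" if "windows" in failures else "patch-v1"

def fake_build(change: str, target: str) -> bool:
    return not (change == "patch-v1" and target == "windows")

print(run_with_verification(fake_agent, fake_build))
```

Here the first proposal fails on one target, the failure list is passed back to the proposer, and the second proposal passes everywhere. Capping the number of attempts keeps a stuck change from looping forever and routes it to a person instead.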
