From one person to 80: Scaling a hypergrowth engineerin… — Transcript

Learn how B44 scaled from 1 to 80 engineers using Claude Code to maintain velocity and simplify onboarding, code review, and QA.

Key Takeaways

Keep scaling processes simple to maintain velocity in hypergrowth engineering teams.
Use AI tools like Claude to automate onboarding, code review, and QA instead of building complex manual processes.
Real-time, prompt-driven documentation and organizational mapping help new engineers ramp up quickly.
Leveraging live user interaction data and AI classification can effectively monitor product quality at scale.
Scaling engineering teams requires balancing rapid growth with maintaining product-market fit and quality.

Summary

B44 scaled from a solo founder engineer to 80 engineers while maintaining product velocity using Claude Code.
The talk is divided into two phases: scaling from 1 to 15 engineers and from 50 to 80 engineers.
B44 is a low-code platform started in late 2024, profitable by April 2025, with a strong AI-focused user base.
Post-acquisition by Wix, B44 expanded rapidly and faced challenges in onboarding, code review, and product testing.
To scale onboarding, B44 used simple Claude prompts to generate real-time organizational knowledge and component diagrams.
Claude was used to analyze past PR comments to automate and amplify code review without complex processes.
A new engineer successfully completed a complex WhatsApp integration in a few days using these methods.
For quality assurance, B44 leveraged live user traffic and Claude to classify user frustration in conversations as a signal of product issues.
This frustration metric was used to evaluate new agent versions by exposing a subset of users to them.
The key philosophy was to keep processes simple and leverage AI tools to scale engineering and maintain velocity.

Full Transcript

Speaker A

Hello everyone. My name is Yav. I lead product at B44. Joining me on stage later is Gabriel, who leads our AI. We're going to talk about how B44 scaled from a solo founder engineer all the way up to 80 engineers and how Claude Code helped us facilitate that growth while maintaining our velocity.

Speaker A

We split this talk into a short intro and then two phases: going from one engineer to 15 engineers and then going from 50 engineers to 80 engineers.

Speaker A

So let's talk a little bit about the first phase, which is mostly an intro to B44 and our solo founder. B44 is a low-code platform, though this was a new term a year ago. It was Mo thinking, "I want to build a platform that will let anyone build software, non-technical users, technical users, let's build up the speed."

Speaker A

He started the platform at the end of 2024, and by 2025 he already had a working product. He started building in public on LinkedIn and Twitter and gained a lot of traction, and by April 2025 the product was already profitable. That's the moment I joined, because money was starting to flow in and the product was getting a lot of traction. And because this was a profitable product with an AI-focused user base and a crazy founder, it started getting the attention of a lot of companies and acquisition opportunities, which leads us to the next phase: post-acquisition.

Speaker A

Wix has a very similar user base to B44, so they saw B44 as a big bet. They wanted to maintain the velocity of B44 but expand it dramatically. So we went from a two-member team to a 15-engineer team. We needed to scale as fast as possible, and we had four major challenges.

Speaker A

One is onboarding doesn't scale. We can't have Mo onboard each engineer to the team. Code review doesn't scale. Mo was really, really cautious about what goes inside the backend of B44, so he wanted to review each PR on his own.

Speaker A

We can't have each engineer sit with our beta tester to understand whether the product is working as expected. So we need to find a way to automate that as well. And an interesting part about the fact that you have very immediate product-market fit is there's a lot of product surface you need to cover. Whether it's integration, whether it's the agentic flow, whether it's the visual editor, there's so many areas, and you need the engineer to ramp up really, really quickly.

Speaker A

So let's jump in. How do we solve each one of the challenges? And the key takeaway I want everyone to get out of here, especially for those with small teams, is the fact that you need to keep everything very, very simple.

Speaker A

Okay. The meetings where we tried to tackle those challenges would start with, "Hey, let's build this process where we review everything, then build an onboarding doc, and we'll do a nightly update." We thought, actually, no, let's keep it very simple. Every new engineer who comes into the company gets a task: use two prompts before starting on their first real task. One: go over all the commits and tell me what everyone cares about. Once we were three or four engineers and people started building knowledge in each area, the fifth and sixth engineers came, ran this prompt, and got a map of the organization. You don't need to think about how to keep onboarding docs updated as new engineers join. A simple prompt gives you the entire map of the organization in real time.

Speaker A

The second thing, before you dive into each area, is to ask Claude, "Hey, can you give me a Mermaid chart of how this component works?" Again, this works in real time, because everything keeps evolving. You don't want to be stuck with, "I need to keep this document up to date, and this one too." No, Claude keeps it up to date for you. Very, very simple. One prompt gives you everything an engineer needs to know in order to start working inside B44.

Speaker A

The second challenge, as I mentioned, is that Mo was very, very cautious about what code goes inside our agent and inside the backend of B44, so we needed a way to amplify Mo's PR review abilities. After about one or two weeks, we already had a big pool of PR comments Mo had added inside our repo. So again, instead of sitting down and brainstorming, "Okay, what instructions do we need?" we had Claude review the past PRs and tell us, "What are the most important and most crucial things we need to keep in mind while engineers are writing new code?" We put the result into instructions, ran it every couple of days, and got more PR reviewers inside B44 without building a sophisticated and complicated process. The cool thing is that this is when we really started to see velocity picking up.
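As a rough illustration of the review-distillation step described above, the Python sketch below gathers one reviewer's past PR comments and assembles the prompt that would be handed to the model. The talk does not show the actual prompt or tooling, so `ReviewComment`, `build_guideline_prompt`, and the prompt wording are all hypothetical, and the model call itself is left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewComment:
    """One past PR review comment (hypothetical shape, e.g. from a repo export)."""
    author: str
    path: str
    body: str


def build_guideline_prompt(comments: List[ReviewComment], reviewer: str,
                           max_comments: int = 200) -> str:
    """Assemble a prompt asking the model to distill a reviewer's recurring concerns."""
    # Keep only the chosen reviewer's comments, newest last, capped to fit a context window.
    mine = [c for c in comments if c.author == reviewer][-max_comments:]
    # Files that attract the most review attention are a useful hint for the model.
    hot_paths = Counter(c.path for c in mine).most_common(5)
    lines = [
        f"Below are {len(mine)} past PR review comments written by {reviewer}.",
        "Summarize the most important things engineers should keep in mind",
        "when writing new code, as a short list of review instructions.",
        "",
        "Most-commented files: " + ", ".join(f"{p} ({n})" for p, n in hot_paths),
        "",
    ]
    lines += [f"- [{c.path}] {c.body.strip()}" for c in mine]
    return "\n".join(lines)
```

Re-running a step like this every couple of days, as the talk describes, keeps the instructions current without anyone maintaining them by hand.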

Speaker A

Okay, so one of the PRs that we remember and keep referring to: we wanted to build a WhatsApp integration inside B44 so users could communicate with the agent over WhatsApp, and we handed it to a new engineer. This kind of feature requires an integration, it requires working on the agentic flow, and it requires a new Meta API, so we assumed it would be a one-to-two-week endeavor for a new engineer. It was really awesome to see that we handed it over Thursday night, and by Sunday morning everything was ready. He onboarded on Thursday using those simple prompts and opened a PR. The model's PR review had two or three small comments, and we were ready to move on to production.

Speaker A

Okay, so we managed to resolve most of the issues. Now we had the issue of how to make sure that what goes into production, especially our agent, works really well for our customers. Previously, when we were a tiny team, we would just sit with customers and hear how they interact with B44. But now we needed a way to scale. And like almost every naive AI company out there, we said, "Hey, let's build an eval suite. Everything that comes out, we'll run through our evals. It will work perfectly, and we'll understand what's going on." I don't know if you've tried to build an eval mechanism before, but usually a 15-person team is not ready for it. It's a much bigger endeavor.

Speaker A

So we sat down and said, "Okay, we already have a tremendous amount of traffic in production. How do we use that traffic to understand whether the model is working for our customers or not?" We have conversion rate, which is nice, but we wanted to understand whether the agent itself, especially for paying customers, is working as expected. So we started looking at the conversations, and a very simple pattern emerged. When everything is working well, the user doesn't say anything. They just go to the next feature, and the next, and the next. But when things start to break, that's when users get really loud inside the chat: "Hey, why is this broken? I can't believe it's not working." It's really easy to see that things are broken.

Speaker A

So we said, "Okay, we have a very strong signal when things aren't working. Why don't we leverage that and ask Claude, using a simple model, a Haiku model, to classify each message by whether the user's frustration level is high or low?" Once we have that, for every version of the agent we want to release, we put a small percentage of customers on that version and track the frustration level. This works whether we're changing the infrastructure, the prompt, or the model, and it tells us whether the agent works as well as expected for our users after the change. And the key takeaway, again, is keeping everything super simple without building a sophisticated process around it.
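The frustration metric described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `classify_frustration` is a keyword stand-in for the model call (the talk uses a small Claude model, not keywords), and the variant names and the `tolerance` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def classify_frustration(message: str) -> bool:
    """Placeholder for the LLM classifier: True if the message reads as frustrated."""
    cues = ("broken", "not working", "doesn't work", "why is", "can't believe")
    text = message.lower()
    return any(cue in text for cue in cues)


def frustration_rate_by_variant(
    messages: Iterable[Tuple[str, str]],
    classify: Callable[[str], bool] = classify_frustration,
) -> Dict[str, float]:
    """messages holds (variant, text) pairs; returns the share of frustrated messages per variant."""
    totals: Dict[str, int] = defaultdict(int)
    frustrated: Dict[str, int] = defaultdict(int)
    for variant, text in messages:
        totals[variant] += 1
        if classify(text):
            frustrated[variant] += 1
    return {variant: frustrated[variant] / totals[variant] for variant in totals}


def is_regression(rates: Dict[str, float], control: str = "control",
                  candidate: str = "candidate", tolerance: float = 0.02) -> bool:
    """Flag the candidate agent version if its frustration rate exceeds control by more than tolerance."""
    return rates[candidate] - rates[control] > tolerance
```

Because the metric only needs chat messages tagged with the agent version that served them, the same comparison applies whether the change is to the infrastructure, the prompt, or the model.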

Speaker A

We hear a lot about "let's build an agent for this" and agent orchestration, but when you're a small team, you have very simple ways of getting almost the same amount of value while keeping processes really, really lean. But when you scale from 15 to 80, it becomes a bit of a different challenge, and that's what Gabriel is going to walk you through. Thank you very much.

Speaker B

Hello everyone. My name is Gabriel, and I lead the app builder agent for B44.

Speaker B

I had a lot of time to watch Yav from behind the scenes, so I got a little bit nervous. Yav has just told you about the first two phases of our growth. In the last couple of months we reached a new point of growth: we started hiring more externally, we had more internal movers coming from Wix to B44, and then we even merged with a different product working on vibe coding, and in one single night we doubled our headcount from 40 people to almost 80. That brought a new set of challenges that we had to solve.

Speaker B

I'd like to focus on the three most interesting ones. The first one is how we do experimentation at scale. Yav just shared how we built the frustration metric and how we A/B tested in production, but you can't expect every new hire to understand exactly which KPIs to test, how long we want to test things, or whether you can just be brave and ship it. Not everything needs an experiment, right? So we knew we wanted to shift left the product management decisions in A/B testing.

Speaker B

We also needed better evals. Again, back to what Yav just said: before, we were at a point where we knew evals were not the best ROI for us, but now they became something we really needed to focus on. And the last thing is how you do QA properly in a company that's very consumer-oriented without growing your testers linearly with the rest of the headcount.

Speaker B

Let's start with experimentation. We started with a general shell of what we wanted. We knew we wanted a process that runs when a pull request is ready. We knew that eventually we wanted a bot commenting on GitHub, telling the developer whether she can just ship it and, if she needs an A/B test, how long it should run and which KPIs the experiment needs to monitor. We also wanted it to open the experiment on PostHog.

Speaker B

The shell was the easy part, but we also needed the guidelines, the actual logic of how we operate. We had never sat down and articulated that. We didn't have a guideline committee; we just had really good product sense and intuitions. One option was to assemble a multi-stakeholder committee and sit through a lot of meetings, but we really hate meetings. So we figured that our past actions could convey our guidelines in the best way possible. We thought, wait, we can just take the last 100 experiments we ran on PostHog and the matching pull requests, and distill our guidelines from that.

Speaker B

So we spawned Claude Code and hooked it up to the PostHog MCP. PostHog is an A/B testing and experimentation product, a pretty great one by the way. We had Claude suggest the first iteration of the guidelines, and it did a great job. It wasn't perfect, it was very rough around the edges, but we had a working document we could iterate on. A couple of hours later we had something working: each pull request that opens gets a clear verdict on whether you can just ship it, do a gradual rollout, or run an A/B test, and for how long. At our scale, some features deserve seven days of testing, and some need a full month, because you might affect conversion rate and premium rates by very small percentages.

Speaker B

To wrap all of that up, we needed a central place where everyone could see what's going on. It was a great opportunity to dogfood our own product: we built it on B44 and connected it to BigQuery, our data warehouse, to PostHog, to GitHub, to everything. Now there's a central place where everyone can see which experiments are running and how they're moving the needle: whether something is causing more AI cost, whether something is reducing the rate of published apps, all the things we care about. For now this solved the problem for us, and it opened up a new paradigm in how we scale our experiments.

Speaker B

Okay, so the next part is evals. This could easily be a full one-hour talk, and maybe next year the team here will even do that. But our challenge was very short term: we needed something that would give us real value. We didn't want it to be a three-month project, and we couldn't afford to take our top AI engineers off their work. We need them working on features and improving the product. So we asked ourselves, what do we really need to build? Do we want to just evaluate the output of the model, or do we want to check the correctness of the apps our users are building?

Speaker B

Eventually, we had to build a user simulator. For B44, when a user types in a request for an app and some small part of it doesn't work, that doesn't mean the eval fails. That was a great epiphany for us. It means our eval suite needs to pipe the rejection back and ask our agent to fix the missing parts. We ended up looking at latency, how many turns things took, how much it cost us, and how many credits it took from our users. We got to a working CI/CD pipeline where any change in our AI code spins up a real B44 app instance, and we use Stagehand to simulate real user actions, as if an automated QA engineer were spun up in a small box. That's how we look at it.
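The "pipe the rejection back" idea from the eval discussion above can be illustrated with a small Python loop. This is a sketch under assumptions, not the team's pipeline: `run_eval`, `EvalResult`, and the feedback wording are hypothetical, the agent and the checker are passed in as plain callables, and the agent is assumed to keep its own conversation state between calls. Browser-driven checks (Stagehand in the talk) are outside the scope of the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class EvalResult:
    passed: bool
    turns: int            # how many agent turns were used
    cost: float           # total cost reported by the agent across turns
    feedback_log: List[str] = field(default_factory=list)


def run_eval(
    request: str,
    agent: Callable[[str], Tuple[Any, float]],   # prompt -> (app, cost of this turn)
    check: Callable[[Any], List[str]],           # app -> missing parts (empty list means pass)
    max_turns: int = 3,
) -> EvalResult:
    """Simulate a user: a partial failure is fed back to the agent instead of failing the eval."""
    prompt = request
    total_cost = 0.0
    log: List[str] = []
    for turn in range(1, max_turns + 1):
        app, cost = agent(prompt)
        total_cost += cost
        missing = check(app)
        if not missing:
            return EvalResult(True, turn, total_cost, log)
        # Like a real user, describe what is still broken and ask for a fix.
        prompt = "These parts are not working yet: " + ", ".join(missing)
        log.append(prompt)
    return EvalResult(False, max_turns, total_cost, log)
```

Recording turns and cost alongside the pass/fail result matches the metrics mentioned in the talk: an eval that passes in one turn and one that needs three rounds of fixes are both passes, but they are not equally good.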

Speaker B

And this is what the internal app we built to support that looks like. Again, a great opportunity for us to dogfood our own platform. You can see here the most canonical eval we have, the hello world app. It doesn't tell us that B44 is performing the way we want. It's like a smoke test.

Speaker B

It's just making sure we didn't break anything. The way we do that: we ask B44 to build us a simple hello world app, then assert that the right text is visible and that it looks good. That part is very subjective, but we trust the AI to fail the check if it doesn't. Then we ask for a very small text change, and then we ask for a small feature. As you can see, most of them just pass. Fun fact: these evals will pass on the smallest model you can think of, which is really cool. And of course, we have many more complex evals. For example, we have scenarios where we start with an existing app and make many changes, and scenarios that check our compaction mechanism, which is very complex and requires a lot of user messages.

Speaker B

So this brought us to a new paradigm in evals. It's not perfect yet, and we're constantly working on it, but it was the right time, the right moment, to build such a system.

Speaker B

The third thing I want to share is how we streamlined QA. We do believe in shifting quality left, of course, like all of us: unit testing, end-to-end tests. It's obvious that everyone working at B44 needs to have complete ownership of what they build. But a lot of the time you're working on really deep features that have a lot of edge cases. For example, imagine testing a feature that only affects users at a specific subscription tier when they reach a specific point of their credit limit. And imagine your feature has a lot of permutations that affect that. It would be very tedious for everyone to test it manually. That's a classic case where we would hand off to a QA engineer, but then we'd have longer feedback loops and you'd have to wait for someone to be available. That wasn't ideal.

Speaker B

We knew that Claude Code could operate a browser, right? Playwright MCP, Browser Use, there are tons of tools out there. But it was missing critical pieces of how to do it well. For example, each time it had to relearn the platform, the selectors, the flows. Each time it had to understand which events to look for in our database and in Mixpanel. So we started wrapping our common flows in skills. For example, we have one skill that taught Claude Code, combined with the browser, how to go over all the major user flows that most features will touch. And of course, for new features, Claude Code can just understand what the feature does and find its way around. So we don't need to cover 100% in the skill. You just need to cover the 80% so it has enough context. It's a thin trade-off between the right abstraction level and what you just trust Claude Code to do.

Speaker B

The second challenge we had is how you do proper test setup. Let's take the example from before, where you want to test a very specific edge case. You could just click through it manually, the way a human QA person would, but that would be very, very slow, right? What a good QA engineer will do is go to the database and override the setup so they can test that one case directly. We needed Claude to be able to do the same thing. So we created CLI tools that abstract our APIs and database specifically for the use case of setting up tests, and we built skills that taught Claude how to use them properly.
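A test-setup CLI of the kind described above might look like the following Python sketch. Everything here is hypothetical: the `qa-setup` name, the flags, the tier names, and the `FAKE_DB` dictionary, which stands in for the real APIs and database so that the example stays self-contained.

```python
import argparse
import json
from typing import Dict, List, Optional

# Stand-in for the real backend; a dict keeps the sketch self-contained.
FAKE_DB: Dict[str, dict] = {
    "user-1": {"tier": "free", "credits_used": 0, "credit_limit": 100},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="qa-setup",
        description="Put a test user into a specific state before a QA run.",
    )
    parser.add_argument("user_id")
    parser.add_argument("--tier", choices=["free", "starter", "pro"])
    parser.add_argument(
        "--credits-used-pct", type=int,
        help="Percent of the credit limit already consumed (0-100).",
    )
    return parser


def apply_setup(db: Dict[str, dict], args: argparse.Namespace) -> dict:
    """Override the user's state directly instead of clicking through the UI."""
    user = db[args.user_id]
    if args.tier:
        user["tier"] = args.tier
    if args.credits_used_pct is not None:
        user["credits_used"] = user["credit_limit"] * args.credits_used_pct // 100
    return user


def main(argv: Optional[List[str]] = None, db: Dict[str, dict] = FAKE_DB) -> None:
    args = build_parser().parse_args(argv)
    # JSON output is easy for an agent (or a human) to read back and verify.
    print(json.dumps(apply_setup(db, args)))
```

Calling `main(["user-1", "--tier", "pro", "--credits-used-pct", "90"])` would put the stand-in user on the hypothetical `pro` tier at 90% of its credit limit, which is the kind of edge-case state the talk describes reaching without manual clicks.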

Speaker B

Eventually we combined all of these efforts and skills into one meta-skill for how to do proper QA. We got to a flow where a pull request opens, the agent triggers, it creates a test plan, sends it to a B44 app (also a great opportunity to dogfood our own product), starts testing, and reports back. This is what it looks like for a single test: you get screenshots, and you can see what it tested and what it didn't test. Sometimes I'll get cases where I know I'm stretching the boundary of what it can do, and then it will just write, "I couldn't test that," and surface the missing capabilities. But it works about 80% of the time, and it allowed us to shift left deep, edge-case quality assurance and move faster.

Speaker B

Okay, so that's all for the challenges. I'd love to share a little bit about the common thread across all of our challenges and solutions. Just as Yav said before, we really value simplicity. We try to think bold and simple, and sometimes we'll work very hard not to build complex things when it's not the right time. Evals are a great example. We held off until it was the right moment to build them, and then we went all in.

Speaker B

The second thing is that taste is a big word, right? Recently everyone's talking about taste, and how it's the last moat of us humans against machines. I believe in that too. But I do think you can encode a big chunk of your team's and your company's taste by looking at your past actions. That ties back to the memory talk from the last session: you can just look at what you actually did in the last week or so and understand what your guidelines are, for code reviews, for A/B testing, you name it.

Speaker B

The third thing is, if you're lucky enough to work on a product that you yourself can use, that's also a huge win. I think the team at Anthropic constantly speaks about how magical it is to be working on Claude Code and the rest of the product suite, and how you get the feedback and insight loop going in a magical way. So if you can do it, do it. Sometimes you have to stretch a little bit if you're working on, I don't know, a finance app, but find ways to do it. It will be of value.

Speaker B

And the last thing is that the bottleneck will keep moving. For example, our current challenge is, first of all, how to continue and scale all of the processes I've just shown, but also how we do post-validation correctly. Once a pull request reaches production, how do you make sure it's moving the right needle? For example, is a bug fix really reducing support tickets? You don't want a human to have to keep that in their head. Is a feature really being used by users? Of course you want it to raise business metrics, but not everything will show up that fast. So we want to automate that. And that's it for today.

Speaker B

We really appreciate you coming, and I really hope you found at least one thing you can take back to your company or organization. Thank you.

Topics: B44, Claude Code, engineering scaling, hypergrowth, onboarding automation, code review automation, AI in software development, low-code platform, quality assurance, user frustration analysis

Frequently Asked Questions

How did B44 handle onboarding new engineers during rapid scaling?

B44 used simple Claude prompts that allowed new engineers to get a real-time map of the organization and component diagrams, eliminating the need for manual onboarding documentation.

What approach did B44 take to scale code reviews?

They leveraged Claude to analyze past PR comments and generate instructions for code reviews, enabling multiple reviewers without complex processes and amplifying the lead engineer's review capacity.

How did B44 ensure product quality as the engineering team grew?

B44 used live user traffic and Claude to classify user messages by frustration level, using this as a signal to monitor agent performance and evaluate new versions with a subset of customers.
