bureaucracy is startup success
Three properties of software systems at scale:
- We need more complex systems to handle the scale
- We need more complex observability solutions
- It is possible for one or a few replicas to slack off and let work be carried by other replicas
Let’s talk about software systems. We have a nice simple monolithic software system with a Postgres database and a Redis cache. This can take us pretty far as far as handling scale goes. Our observability needs for this setup will be equally simple. With the bare minimum of logs and standard metrics (CPU and RAM usage, RPS), it’s likely we’ll be able to operate the system fairly well. When an incident occurs, our intuition and the simple observability setup are able to take us far – there just aren’t that many variables to consider. The complexity of the system is low, so we can just keep it in our heads.
The company grows and our software system has to mature to handle the demand. To help teams move faster and “independent” of each other, we move to microservices running on k8s. That choice comes bundled with increased operational complexity. We’re now concerned with network faults, inconsistent data, idempotency, retry logic, circuit breakers, service discovery, improving security posture, SLOs. We’ve added Kafka to the mix. We’ve got high availability setups for our database and microservices. The volume of logs and metrics grows dramatically. Our system complexity has exploded. When an incident occurs, we need to first figure out where the problem is coming from and only then, what the actual problem is. As we all know, this is non-trivial in a large scale system: there are just too many variables to consider. Our observability solution has to evolve accordingly in order to debug system issues. For example, we’d need distributed tracing to understand the flow of requests throughout our system.
At this scale, we’re exposed to other fun problems. As the number of instances we need to monitor increases, it becomes unwieldy and expensive to monitor them at an individual level. Instead, we care about the health of our system as a whole. We aggregate to get a global understanding of our system. The downside is that if one or a few instances start having issues, they can go unidentified for long stretches of time. If 1 out of 20 replicas starts responding slowly, it’s possible for it to have limited impact on the system. It can slack off almost indefinitely with the bulk of the work being carried by the other 19.
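To make the 1-in-20 case concrete, here’s a rough sketch in Python. The numbers are made up for illustration (a healthy replica answering in 100 ms, a slow one in 500 ms), the function names are hypothetical, and it assumes traffic is spread evenly across replicas so that the fleet-wide mean is just the mean of the per-replica means.

```python
from statistics import mean, median


def replica_latencies(n_replicas=20, healthy_ms=100.0, slow_ms=500.0, slow_replicas=1):
    """Hypothetical mean latency per replica: a few slow ones, the rest healthy."""
    return [slow_ms] * slow_replicas + [healthy_ms] * (n_replicas - slow_replicas)


def fleet_summary(latencies):
    """The aggregate view a global dashboard might show, next to the worst replica."""
    return {
        "mean_ms": mean(latencies),          # assumes evenly balanced traffic
        "median_ms": median(latencies),
        "worst_replica_ms": max(latencies),  # only visible if we look per replica
    }


summary = fleet_summary(replica_latencies())
print(summary)
# {'mean_ms': 120.0, 'median_ms': 100.0, 'worst_replica_ms': 500.0}
```

Under these assumptions, one replica being five times slower moves the fleet mean from 100 ms to 120 ms and leaves the median untouched. Unless someone is looking at the per-replica view, that’s easy to write off as noise.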
Where am I going with this? There are parallels between large scale software systems and large scale organisations. At large scale, the requirements imposed on your system are much more complex and you have to build in capabilities to deal with that complexity.
In startup-land, one of the most admired traits of startups is their lack of bureaucracy. Decisions can be made quickly, there’s no need for tons of meetings, there’s not that much red tape. This isn’t really an inherent trait of startups, it’s a byproduct of being small scale. If you’re a team of 3, there’s less communication overhead, there are fewer or no people to manage, you’re probably operating in one country and don’t have to deal with tax compliance across countries, you don’t have to comply with labour law in different countries and so on. As you scale up your organisation, the complexity scales with it. Managing the desires and goals of 3 people is much simpler than managing those of 3000 people. The communication overhead increases substantially. You start having more meetings to make sure everyone is aligned on what to do. There’s more legal overhead. To manage that scale, you have to sacrifice a high level of detail and start generalising. These generalisations become the cursed word of startups: processes. They’re extremely coarse grained due to scale. But you need them, in the same way that you need a more complex system and better observability to handle a large scale software system.
Scale is also the reason why people don’t need to work as hard. It is impossible for a company the size of Google to attribute work at the individual level. In such a system, you’ll inherently have people who do less work, who just don’t have the desire anymore or who have different goals adjacent to the success of the company. The scale of the organisation makes it possible for those people to exist indefinitely. At an (early-stage) startup, you’re a) small, b) committed to the success of the company (in part because if it fails, everyone loses) and c) easy to “monitor” – if you don’t work, your team will feel it. At a large scale organisation, these are less true. It’s not obvious to anyone, including myself, that if I work 4 hours a day the company will be worse off. How much of my effort is represented in the revenue numbers? Who knows! (Of course, this is not true if you’re working on a (new) product that is directly responsible for new revenue, e.g. a new cloud service.) The company also has no real way of attributing revenue to individuals or, in some cases, even to teams. Again, these things are difficult at scale; it’s not because they don’t care. They will look at things in aggregate: how much money is Gmail bringing in, and has that number gone down/up with an increase/decrease in employees? Scale makes everything more difficult.
I say all of this because startups often look at these big companies and criticise them for moving slowly. They have too many meetings, processes and workers that slack. These are all bad things and indeed it is the role of management to fight these tides. But it is important to keep top of mind that these companies lack neither insight nor determination. Much of their challenge is precisely a function of scale and the complexity it brings along with it.
The irony of it all is that many startups want to become long-lasting, large scale companies. They want to be the Googles and Apples of tomorrow. With that come the processes and the meetings and the difficulty of management. It comes with bureaucracy. You’ve made it to that level when your company’s bureaucracy begins to annoy you and you yearn for the nimbleness of a startup. Bureaucracy is startup success.