Growing a Great Engineering Culture With Kafka

by Hubert Behaghel

Abstract

Through a review of some key moments in Sky Technology history, we analyse the factors that contribute to a strong Software Engineering culture. We identify three behaviours that seem to be the pillars of such a culture: sense of ownership, spirit of collaboration, relentless focus on data. We also establish an approach for such behaviours to permeate the entire organisation. It relies on tooling, more precisely on tooling as a unified ecosystem. We do it through the prism of Sky’s usage of the open source technology Apache Kafka. On the one hand, the focus on Kafka is because this work was initiated to deliver the Kafka Summit London 2019 keynote. On the other hand, as we will discover, Kafka concentrates the essence of those three pillars and of the ecosystem approach and as such, beyond its technical ability, reveals a cultural “power” that can positively influence the organisation that uses it.

The Hardest Problem in Engineering

Imagine you are an ambitious entrepreneur on a quest for global success. By the way, this story also works if you’re an average middle manager in IT. Now, the reason I write is to escape reality with my wildest dreams so please. Imagine you are an ambitious entrepreneur on a quest for global success. Mind you, a decade ago, you’d have targeted the financial industry. But not today. You know what you need and it’s going to be a bright new tech idea. But in Tech, having an idea isn’t where the challenge lies. It’s all about executing. You need to attract a bunch of brilliant engineers who want to work with you on that idea. Maybe they should even find it for you. And you need an environment where they can be productive and which fosters a spirit of excellence. In other words, in today’s world, to succeed you need a strong engineering culture.

So let’s build one, shall we? Easier said than done. I actually believe it to be the hardest problem in Engineering. Let’s dig into why.

Aristotle, one of the first hackers that History bothered to remember, liked to say

We are what we repeatedly do.

Our culture is who we are but also how we do things. Now he said repeatedly, and here comes our first problem. It’s twofold. Firstly, there is this principle of ours: Don’t Repeat Yourself, aka the DRY principle. The nature of our job is to not repeat ourselves. Secondly, software engineers almost by definition work on stuff they have never done before, at least not exactly. The more expert they are, the truer it is: the less likely they are to be employed to work on problems we already know how to solve.

Now there is still something repetitive that needs doing. It’s just we don’t do it. We build tools to do it instead. Let’s postulate Aristotle’s principle still works in that case. By reduction, “we are what we repeatedly do” becomes

We are our tools.

Somehow, if we are to believe in logical reduction — and by virtue of the Curry-Howard isomorphism, that could well be the only thing we believe in — that strong engineering culture we are desperate for has to do with the tools we use, those we build and what they do for us. Repeatedly.

In our example, we are talking of a tech start-up. What about a well established big corporate company? Can it have a great engineering culture in its IT department? Well, as it happens, yes: many have done it. The most successful ones, you know them for being the companies behind “the Cloud”. And this proves our point nicely. The reason why we all take seriously the engineering culture of a big online retailer is that it was so serious about its tools that it realised it could monetise them. Therefore the size of the company is no excuse. But I’ll give you that: most if not all of those companies were tech companies from the get-go. What about a company that predates the prominence of the web and the tech scene? Like Sky?

Melvin Conway, another hacker from more recent past, gave us this law:

Conway’s Law Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.

Let’s give it an aristotelian twist:

Our tools are a reflection of how we collaborate.

Let’s put it side-by-side:

We are our tools and they are how we collaborate.

Do you see the chicken-egg problem here? How cool is that? We had a twofold problem and now it’s recursive. So what’s the deal? Well, it means we are also how we collaborate. And since the way our tools collaborate is the by-product of our collaboration, the question is: can we reverse the process and program our culture, at least partly, by implementing in our tools the collaboration spirit we want in our culture? In other words, can we hack Conway’s Law and influence our culture?

As it happens yes. I have seen it in action many times. And you shouldn’t be surprised if Apache Kafka has played a role here more than once. Why is that?

The last piece of information our ambitious entrepreneur needs to grasp is that data is the new currency for success in this century. There we are: to succeed you need tools that collaborate well between them and understand the value of data.

Now there is another chicken-egg problem: for your culture to attract talent, it needs to be popular! I told you: the hardest problem in Engineering. But at least now you understand why I am talking to you wisely about engineering culture.

I will now open the Sky Engineering Family Photo Album and recall 3 key moments where something happened with a lasting impact on our culture. Interestingly enough, each time Kafka was there somewhere in the background.

3 photos from the family album at Sky Technology

Data Ownership

This picture shows the Sky.com homepage team, on a beautiful day of 2015.

Taking Control of Shape and Flow

In this picture, you find two engineers, the devops and the coder, working together on the homepage of sky.com, our website. It’s very slow. When you zoom in, you can see this is because there are 100+ “tags” that collect the same data over and over again for 3rd-party integrations. Unconsciously, this compensates for the poor state in which we find clickstream data and analytics. It is based on a vendor solution that forces its own cryptic data structure onto our product. The team that owns the relationship with this vendor is the perfect description of the man-in-the-middle: it doesn’t use the tool itself but specifies requirements to the tech teams who implement it, and who therefore never talk to the teams that use it. Data quality is not even a concept in this setup. I’ll let you imagine the result. Add to the mix considerations like AdBlockers that wipe out about 40% of the data and you’ll have an idea of the data mess sky.com was in 4 years ago.

But then here is what happened:

Technical Cause, Cultural Effects

An explosion ensued. From then on, some would be constantly looking at this data with curiosity, in search of patterns and disruptions. Others would build tools to democratise this data further. I recall in particular two nifty innovations. First, there was a Chrome extension to overlay the website with analytical visualisations. Second, the visitor timeline tool surfaced the entire history of events attached to one anonymised visitor ID. I wouldn’t say we have proof, but what we observe here is that data ownership fosters collaboration and data-savvy tools to support it. Great culture is under construction.
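To give a feel for how little code such a visitor timeline needs once the clickstream is in your hands, here is a minimal sketch. It is not the tool we built: the `Event` fields, the visitor IDs and the sample data are all made up for illustration. It simply groups events by anonymised visitor ID and orders each history chronologically.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, assumed for this sketch only.
    visitor_id: str  # anonymised visitor ID
    timestamp: int   # epoch seconds
    action: str      # what the visitor did


def build_timelines(events):
    """Group events by visitor and sort each visitor's history by time."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event.visitor_id].append(event)
    for history in timelines.values():
        history.sort(key=lambda e: e.timestamp)
    return dict(timelines)


# Made-up sample data, deliberately out of order.
events = [
    Event("v-42", 1_700_000_300, "click:play"),
    Event("v-7", 1_700_000_100, "page_view:/home"),
    Event("v-42", 1_700_000_000, "page_view:/home"),
]

timelines = build_timelines(events)
for event in timelines["v-42"]:
    print(event.timestamp, event.action)
```

The point is less the code than the precondition: it only becomes this easy when the events arrive in a shape you own.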

Even though introducing Kafka for clickstream data is quite a cliché, the technical boredom here is truly a good thing. The same way a technology like Retina Display™ aims at making the pixel invisible in favour of the picture on screen, Kafka here aims at making the infrastructure invisible so that your data can really emerge like a glowing treasure chest, unearthed from your technical stack. And it has to start with clickstream data, as this is the poor man’s event-driven domain model for your product. The analogy with display technology is all the more relevant here as one can easily consider those first pipelines to be the equivalent of the optic nerve of the Sky platform. Soon enough, we would grow a full nervous system for the Sky platform.

Now let’s face it: those two topics, as successful as they were in changing mentalities, wouldn’t have been enough to shift our culture. That was a successful exploration. Real cultural success comes from full industrialisation, when it is so embedded in what you do that it becomes part of who you are. I could tell you the story that got us from one Kafka cluster with two topics to dozens of clusters and hundreds of topics, removing quite a few vendors from our landscape in the process. Instead I should like to move on to our second moment. I think you’ll understand.

The Ecosystem Effect

This picture shows the Content Metadata Assembly Pipeline team, some time in 2018.

How To Design For Collaboration?

Let’s have another look at our paradigmatic statement:

To succeed in this technical world, you need tools that collaborate well between them and that understand the value of data.

“Tools that collaborate well between them”: we have a name for that, one of the most beautiful nouns in our industry I think. It’s ecosystem. The word collaboration when applied to tools becomes integration. The difference between a bunch of well integrated tools and an ecosystem is the coherence in the integration patterns. In an ideal ecosystem, there is only one universal integration mechanism.

I don’t know about you but if you tell me “software ecosystem”, the first thing that comes to mind is UNIX. Who hasn’t been struck by the sheer beauty where simplicity yields so much power? It’s predicated upon 3 concepts: data — the file, or data stream —, tools that do one and only one thing and a universal combinator, the pipe.

As it happens, that’s exactly the approach taken by Kafka: it’s an ecosystem. It is centered on data (the topic), with a suite of well integrated tools you can combine for great power. More profoundly, Kafka is your universal combinator, the integration layer for the system under design. Last but not least, most of it is open source, therefore you can easily improve it and expand it. That is, choosing Kafka doesn’t reduce your ownership in any way, quite the contrary. This is a clear sign of a tool you should consider, as it will enable a strong engineering culture.
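If the parallel with the UNIX pipe feels abstract, a toy model may help. The sketch below is emphatically not Kafka nor its client API: `Topic` and `pipe` are hypothetical names for an in-memory append-only log and a tiny combinator. It only illustrates the shape of the idea: data sits in a topic, each tool does one thing, and tools are combined through topics without knowing about each other.

```python
class Topic:
    """Toy append-only log. A stand-in for a topic, not the Kafka API."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        self._records.append(record)

    def poll(self, consumer):
        """Return the records this consumer has not read yet."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


def pipe(source, transform, sink, name):
    """The combinator: read from one topic, transform, write to another."""
    for record in source.poll(name):
        sink.publish(transform(record))


raw, shouted, lengths = Topic(), Topic(), Topic()
for word in ["sky", "kafka"]:
    raw.publish(word)

# Two independent tools read the same data, each at its own offset.
pipe(raw, str.upper, shouted, name="shouter")
pipe(raw, len, lengths, name="counter")

print(shouted.poll("reader"))  # ['SKY', 'KAFKA']
print(lengths.poll("reader"))  # [3, 5]
```

Note that the data is not consumed away: adding a third tool later costs nothing to the first two, which is what makes the topic a good place to meet.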

The MAP Team and the Territory Ecosystem

Therefore, choosing Kafka could set you on track to build your opinionated ecosystem, which in turn could give you the culture you need. That’s what the Discovery Metadata Assembly Pipeline team did at Sky. They chose the Kafka ecosystem as the foundation for the Sky Content Discovery Metadata estate. They didn’t set out to deliver a platform, or containers, or some APIs. No, they were to deliver whatever it takes to fix metadata in Content Discovery. In other words, they didn’t take ownership of a specific slice of the Sky platform, with all the prejudice that comes with it in terms of solution. They took ownership of a whole problem space. These are engineers who didn’t fool themselves by falling in love with the solution; they rightly fell in love with the problem.

To the virtuous principle about ownership, “you build it, you run it”, they added a new principle you could call the ecosystem principle: “you face it, you fix it”. In everything they do, they acknowledge they evolve in a wider ecosystem where things can’t be perfect in every way, and each time they discover something suboptimal or a gap, they fix it at the ecosystem level. That is, they fix it for everyone, once and for all.

As a result, here are a few of the things they contributed to our ecosystem in the past two years:

Naturally, they contributed to the Kafka open-source ecosystem:

I say naturally because that is the unavoidable consequence of this mindset.

And of course they use all of it to build the Content Metadata Assembly pipeline.

Mindset Programming

What you should take away from this picture is that to scale from one instance of a good idea to a full adoption that’s an integral part of your culture, you should count on the ecosystem effect. That’s your best vehicle for good ideas to spread across the organisation.

But how are we to make our ecosystem live up to this ecosystem effect? And that’s the beauty of it: it is intrinsic to the concept of ecosystem. If you decide, say, that your continuous delivery tooling is part of the same ecosystem as the company LDAP, then you’ll easily be able to calculate the average number of deployments per engineer per week. Otherwise probably not, as the initial effort in collaboration and integration will make it too hard and too ad hoc to be worth it. Allegedly, it’s all in the mind. But don’t be fooled into thinking that therefore it doesn’t really work. The reason we are trying to program our culture via our tools is that in turn our culture is programming our minds. And minds have a tremendous influence over the material world. With one mindset, the LDAP team will engage with you and ensure it takes part in the ecosystem. You’ll live in a reality where it’s possible. With another, it will look at you suspiciously and do everything it can to ignore you. You’ll be in a different reality where it’s impossible. The things you can do with the mind really…
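To make the LDAP example concrete, here is a hedged sketch of that calculation once both sources speak to each other. Everything in it is assumed for illustration: the `directory` mapping stands in for an LDAP query, the `deployments` list for an export from the delivery tooling, and the usernames and dates are invented.

```python
from collections import Counter
from datetime import date

# Hypothetical directory export: username -> role (stand-in for LDAP).
directory = {"alice": "engineer", "bob": "engineer", "carol": "manager"}

# Hypothetical deployment log from the delivery tooling: (username, day).
deployments = [
    ("alice", date(2019, 5, 6)),
    ("alice", date(2019, 5, 8)),
    ("bob", date(2019, 5, 7)),
    ("alice", date(2019, 5, 14)),
]


def deployments_per_engineer_per_week(directory, deployments):
    """Average weekly deployments per engineer, joining the two sources."""
    engineers = {user for user, role in directory.items() if role == "engineer"}
    per_week = Counter()
    for user, day in deployments:
        if user in engineers:
            iso = day.isocalendar()
            per_week[(iso[0], iso[1])] += 1  # key: (ISO year, ISO week)
    if not engineers or not per_week:
        return 0.0
    # Simplification: weeks without any deployment are not counted.
    return sum(per_week.values()) / len(per_week) / len(engineers)


print(deployments_per_engineer_per_week(directory, deployments))  # 1.0
```

The join on the username is the whole trick, and it is only trivial when both tools agree on what a username is. That agreement is the ecosystem.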

Now your culture isn’t always about repeating a pattern over and over so that it comes naturally, or introducing new tools. Repetition can be the enemy of progress, tools can get in the way of collaboration. Like any evolving system, your ecosystem needs a fitness mechanism, a self-simplifying process. This leads me nicely to our third and final moment taken from the Sky Technology Family Album.

Ownership and Interdependence

This one is a group picture with the entire Data Platform team and the Video Platform team in 2019.

When OVP meets DAP

Because your strategy relies on your empowering, strong engineering culture, you cannot just coordinate everything in a top-down fashion. It would kill it. This means you’ll have organic pockets of your organisation evolving slightly in parallel, on their branch as it were. It’s all fine. And then every once in a while, branches get merged. They are usually joyful moments with deep resonance in your culture.

That could happen when a tool from your ecosystem penetrates a new area for use. Or it could be when an emerging pattern is identified and someone decides it’s time to capture it into your ecosystem. All of a sudden more of your teams do things the same way. They share more: practice, tooling, vocabulary, mindset, goals, etc.

That’s what happened to Sky Video Platform and Sky Data Platform. Both had adopted Kafka in their core architecture. One Kafka implementation had a strong focus on security, the other on data quality. At the end of last year, we realised that a majority of topics from the Video Platform were mirrored into the Data Platform. That is why we are in the process of taking the best of both into one cluster.

This move hits all the right notes. Data is more accessible and its handling more coherent. The ecosystem gets simplified. Collaboration improves drastically as well, mostly because the two areas now have a lot more in common.

Conway’s Law Pwned!

Ownership may feel less straightforward though. Who gets to build this new cluster? One of those two teams? Or a new team? It can’t be a new team because “you build it, you run it” is a 2-way street: you run it, you build it. We want skin in the game. The Data Platform team would have to operate a Kafka cluster for other use cases anyway. The Video Platform team, on the other hand, would be properly relieved of an entire concern. Hence the shared decision was made to have the Data Platform SRE team as owners of the new Kafka cluster.

And that’s another interesting aspect of a technology like Kafka: it forces a model of interdependence within the organisation. Take a technology like Cassandra: you’re rarely forced to share a Cassandra cluster between several systems. When you do, you trade reliability for operational efficiency. If instead you segregate systems into different Cassandra clusters, reliability goes up while operational efficiency goes down. In Kafka, it’s the opposite: trying to isolate Kafka clusters for different systems implies introducing mirroring activities, which are essentially moving parts and deteriorate your reliability. Therefore Kafka proves to be a catalyst in our ambition to hack Conway’s Law and bake into the ecosystem the collaboration we want to see in the organisation.
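A back-of-the-envelope way to see why moving parts hurt: if data must cross several parts in series, and we assume each part fails independently, the availability of the whole chain is the product of the availabilities of its parts. The figures below are invented for illustration and are not measurements from Sky.

```python
from math import prod


def chain_availability(*parts):
    """Availability of parts in series, assuming independent failures."""
    return prod(parts)


# Illustrative figure only: every part is assumed to be up 99.9% of the time.
one_shared_cluster = chain_availability(0.999)

# Source cluster, mirroring process, target cluster: three parts in series.
two_mirrored_clusters = chain_availability(0.999, 0.999, 0.999)

print(f"shared:   {one_shared_cluster:.4f}")
print(f"mirrored: {two_mirrored_clusters:.4f}")
```

Under these assumptions the mirrored setup is always the less available of the two, whatever figures you plug in, because every extra part multiplies in a number below one.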

Because look, it’s happening to us now: we are designing the organisation around the platform under construction and not the other way around! I suggest we leave it at that, on this resounding success.

Where To Go From There

Dear reader, now that you have surrendered to this masterly proof, you are equipped with a brand new outlook to drive your engineering culture towards awesomeness. The most difficult problem in the field has been broken down into four concerns. All you have to do now is cultivate your spirit of collaboration, continuously build your sense of ownership, relentlessly focus on your data and treat all your tools as parts of a single ecosystem. And as it happens, you may well give yourself a great start simply by using Kafka and expanding on its principles.

In subsequent posts, we will explore ways to structure your actions and manage those concerns with 100% success. Or you get your money back. Stay tuned!

Ciao!