Blog

People in Data (my favorite for Q1-2021) : Taylor Brownlow (Head of data @ Count)

This is my second article on “Why do you find Data so interesting after all these years?”, and my answer is always “it is not about the subject, it is about the people”.

A distinctive and instantly-recognizable style

I was reading the article “Is the Tableau Era Coming to an End?”, which had no author listed, and long before the conclusion I was telling myself “this looks like an article from Taylor Brownlow”.

With so many authors on the Data topic, it is clearly not easy to have a distinctive and instantly recognizable style, but Taylor Brownlow really has different ideas and clearly an outside-the-box mindset. IMHO, she is far ahead of current practices and I am sure that some of her ideas will become standards in the future. So let’s discover her work and her convictions.

Dashboards are dead

In this article (episode 1), she is not the first person to say that dashboards are dead. The interesting part is that her conclusion is based on her experience, where she has observed:

  • The time and resources spent to build these dashboards
  • Flexibility that relies on filters, which is an endless task
  • The “not my dashboard” syndrome, where you have a different version for every truth

But for her, the solution is not to build better dashboards or to get a magic BI tool that will solve your problems, but to go “portrait mode” with data notebooks. Using the definition from the Jupyter website, a “notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more”. From a 10,000-foot view, it looks like a Word document but with a web interface.

  • Switching to “document mode”: we are seeing this trend more and more (with the async way of working), and a document is the best way to collaborate. A document can be annotated, and you can really have a group of people working together. A data science notebook (as it exists today) will not be enough, but when you see the way people are using Google Docs, it could inspire a totally different approach to what I am calling the last-mile delivery in Data. You can really have in one place the what (the results), the how (the process) and the why (the comments).
  • Using code to explain the process: the good part with data science and notebooks is the strong emphasis on reproducibility. It helps, of course, with traceability and explainability too. It is not going to be for everyone and it will depend on the level of maturity of the users, but you can see more and more data analysts able to read and write SQL queries (see the sketch after this list). There is still strong pushback on the business side: “we are not developers”. This is a question of time and, again, it is not for everyone.
  • Extension: another interesting part is the ability to “extend” the document. You can add new chapters because the business context has changed and you need to focus on a new analysis. In a way, it is like writing a book to which you add chapters as you make new findings.
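
To make this concrete, here is a minimal sketch of a notebook cell mixing SQL, code and a chart in a single document (the table and data are purely illustrative; in a real notebook the narrative would sit in markdown cells around it):

```python
# One notebook cell: the SQL (the "how"), the result (the "what"),
# and - in a real notebook - a markdown cell above it for the "why".
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")   # stand-in for your warehouse connection
conn.executescript("""
    CREATE TABLE orders (country TEXT, revenue REAL);
    INSERT INTO orders VALUES ('FR', 120.0), ('DE', 80.0), ('FR', 40.0);
""")

# The SQL is visible and reproducible: anyone reading the document can audit the logic.
df = pd.read_sql_query(
    "SELECT country, SUM(revenue) AS total_revenue FROM orders GROUP BY country",
    conn,
)
df.plot(kind="bar", x="country", y="total_revenue")   # the chart lives in the same document
```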

It is not just a question of capabilities; it is a radical change from the current BI apps, switching:

  • App >>> Document
  • Parameters >>> Code
  • Limited modifications >>> Unlimited extension

and the set of skills needed is very different (structured written communication, coding, an obsession with auditability). The next question is, of course: why should you invest in that shift?

The Analyst’s workflow is broken

In her second article, you are going to discover her sense of humor. She has this diagram of the life of an analyst trapped in an infinite loop 😆

She has no fewer than 8 archetypes showing how difficult the life of an analyst is. Every company is investing a lot in collecting data, but in real life the last mile is often a set of manual steps to deliver the final product. The problem has many root causes:

  • No clear place to collaborate or interact between the different actors (email is still the default today)
  • No single place holding both the query and the final result
  • No single technology (SQL, Python, others)

In this article, she gives the historical perspective (SQL IDEs, Excel, dataviz) and, for her, we are now in the “notebook” era. My feedback on that part is that we have removed none of these tools: we still have all of them today. It is highly probable that the notebook way will become one more option and you will still keep the old stuff. My hope is that it will replace the SQL IDE and the dataviz part (my hope); I will be dead long before the end of Excel. But this “notebook era” can bring much more.

Numbers Are No Longer Enough

I love the subtitle of her 3rd article, “To make better decisions, we need data tools that communicate more than just numbers”. And she has this wonderful graphic to explain it!

She is brilliantly right: context is king. It is about focusing on the structure of the written communication rather than on a collection of graphics. I see a strong link with the Amazon 6-pager, a solid part of their culture. But I also see a link with the async movement mentioned before (https://www.weareasync.com/): look at GitHub, where more and more collaboration happens around a structured document.

At the end of the article you have this motto: “data + words = impact”. After these 3 articles, I hope you are convinced of the benefits of using a data notebook.

Your Organization’s First Notebooks

The next question is: how are you going to implement your first notebook?

Her answer is to focus on “data requests, the most important process no one likes” (so true). Using a notebook helps to have everything in one place to:

  • manage the request, from the data needed to the results or visualizations
  • explain the context and the results
  • manage iterations (an agile way of working)

In fact, there is an ocean of opportunities in this data notebook world if the actors can execute. My only concern is that you have to reconcile the open-source world (today with Jupyter notebooks, for example) and the BI world (licensing and subscriptions). The benefit of the open-source world is not just that it is free; you also get the benefits coming from the community. On the other side, in the BI world, you pay and you get a roadmap, but it is a bottleneck too because the resources of the software company are not unlimited. Beyond the tool, the future of the business model is also a question. Her article on Tableau is asking the right questions (it is the end of something).

Conclusion

I strongly recommend spending time reading her 4 articles (it is like a Netflix mini-series). I have just tried to highlight what I like most, but I am sure you will make many other findings. If you are in a situation where you need to “re-think” the way you are doing data analysis for the business, her work will help you understand what’s wrong (this article is brilliant too, about the relationship with the requester and how to answer any business question).

I also recommend this video where she does a demo (thanks, YouTube): it is not just that she has a lot of skills, it is the combination of her skills that is unique. Among these skills, you have writing interesting articles on Data.

Toward a Data Mesh (part 2): Architecture & Technologies

Just an illustration, not the truth: you can certainly do it with other technologies.

TL;DR

After setting up and organizing the teams, we describe 4 topics to make the data mesh a reality:

  • the self-serve platform, based on a serverless philosophy (life is too short to do provisioning)
  • the building of data products (as code): we are building data workflows, not data pipelines
  • the promotion of data domains, where the metadata on the data life cycle is as important as your data
  • the old data governance, which should become data governance-as-code (DataGovOps) to scale

In the previous episode

It was very focused on the organisation and the data teams. We set the stage with 3 team topologies:

  • We have independent stream-aligned teams, organized by data domain, building data products
  • We have a platform team building “data services as a platform” to ease the life of the stream-aligned teams
  • We have the enabling team to handle / facilitate subjects like Data Governance

The next step is the how, from a technology point of view:

  • How about the self-serve platform?
  • How do we build data products?
  • How can we interoperate between the data domains?
  • How do we govern all these data products and domains?

It will be illustrated with our technical choices and the services we are using on the Google Cloud Platform. There are certainly many other ways to do the same with other technologies. More than the technology chosen, what is interesting is the spirit behind these choices.

1. The self-serve data platform or the “serverless way of life”

In her article “How to move beyond a Monolithic Data Lake to a distributed Data Mesh”, Zhamak Dehghani describes in the introduction the 3 generations of data platforms (the data warehouse, the data lake, and the data lake in the cloud “with a modern twist”). With this 3rd platform generation, you get more real-time data analytics and a cost reduction, because it is easier to manage this infrastructure in the cloud thanks to managed services.

To illustrate that, let’s take Cloud SQL from the Google Cloud Platform, a “Fully managed relational database service for MySQL, PostgreSQL, and SQL Server”. This is what it looks like when you want to create an instance: you can choose your parameters like the region, the version or the number of CPUs.

It looks nice, but it really is not! This is the dark side of data. As soon as you start thinking about the server configuration, you start locking yourself in. And in a Data Mesh organisation, you will have “mini” platforms for each independent team. The next problem will be the diversity of these mini data platforms (because of the configuration), and you go even deeper into trouble when managing different technologies or versions. You are becoming an operations- or technology-centric data team.

To get out of this, you have to move to another stage: the serverless stage.

In this stage, you never think about the configuration. If you look at the BigQuery service (the cloud data warehouse in the Google Cloud Platform), you can start to write, transform and query your data without any provisioning. For us, this is really the definition of a self-serve platform.
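
As an illustration of what “no provisioning” means in practice, here is a minimal sketch using the google-cloud-bigquery Python client (the project, dataset and table names are hypothetical):

```python
from google.cloud import bigquery

# No instance to size, no region or CPU to pick: you authenticate and you query.
client = bigquery.Client()   # uses the project from your default credentials

query = """
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM `my_project.sales.orders`      -- hypothetical dataset and table
    GROUP BY order_date
"""

for row in client.query(query).result():   # the compute is provisioned for you, per query
    print(row.order_date, row.daily_revenue)
```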

It is not only a technical debate about what is more powerful or not, it is a choice about your priorities. The “serverless way of life” is when you do not need to manage, or even think about, the capacity of the infrastructure, even when it is managed for you. We want everything serverless (and we can be extreme about that, eliminating many technologies) because:

  • We want every team to be as independent as possible, versus having to do some provisioning for the compute.
  • We want interoperability for any data stored, versus having to think about how to store the data on a specific node to optimize the processing.
  • We are data teams, versus having to patch the server with the latest version and run the tests.

2. How to build Data Products or never call me Data Pipeline any more

You have this interesting diagram in the second article on Data Mesh by Zhamak Dehghani: “Data mesh introduces the concept of data product as its architectural quantum. Architectural quantum, as defined by Evolutionary Architecture, is the smallest unit of architecture that can be independently deployed with high functional cohesion, and includes all the structural elements required for its function.”

  • Code: all the code necessary to build a data product (data pipelines, APIs, policies). Data-as-code is a very strong choice: we do not want any UI, because UIs are a legacy of the ETL period. We want to have our hands free and be totally devoted to DevOps principles.
  • Data & Metadata: the data of the data product, in as many storage systems as needed, but also the metadata (data about data).
  • Infrastructure: you will need compute & storage, but with the serverless philosophy we want to make it totally transparent and stay focused on the first two dimensions.

As you can see, it is in the code part that you build your data pipelines, a misnomer because it is an oversimplification.

Data Pipeline – linear and without any stops

The proper term should have been “Data Workflow”, because a workflow is not linear: you have conditions and many different steps. When I first saw the graphic presenting Google Cloud Workflows (their latest service to manage workflows), it was obvious: “This is what a data engineer should do!”. He/she manages triggers, checks conditions (event type? is it safe?) and executes different actions (reading, calling a vision API, transforming, creating metadata, storing it, etc.). What you have to code is this workflow! A small sketch of that branching logic follows.
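
To make the difference with a linear pipeline concrete, here is a minimal Python sketch of that branching logic (the real Google Cloud Workflows service describes these steps in its own YAML/JSON syntax; the event fields and helper functions below are hypothetical stand-ins):

```python
# Hypothetical building blocks the workflow calls; in reality these would be
# serverless services (a vision API, a load into the warehouse, a metadata store, ...).
def passes_safety_checks(name: str) -> bool: return not name.endswith(".exe")
def quarantine(name: str) -> None: print(f"quarantined {name}")
def call_vision_api(name: str) -> dict: return {"labels": ["invoice"]}
def read_and_transform(name: str) -> list: return [{"id": 1}, {"id": 2}]
def load_into_warehouse(rows: list) -> None: print(f"loaded {len(rows)} rows")
def store_metadata(name: str, meta: dict) -> None: print(f"metadata for {name}: {meta}")

def handle_file_event(event: dict) -> str:
    """A workflow is not a straight line: it branches on conditions and can stop early."""
    if event["type"] != "OBJECT_FINALIZE":          # condition: is this the event we care about?
        return "ignored"

    if not passes_safety_checks(event["name"]):     # condition: is the file safe to process?
        quarantine(event["name"])
        return "quarantined"

    if event["content_type"].startswith("image/"):  # branch: images go through a vision API
        store_metadata(event["name"], call_vision_api(event["name"]))
    else:                                           # branch: tabular files are transformed and loaded
        rows = read_and_transform(event["name"])
        load_into_warehouse(rows)
        store_metadata(event["name"], {"rows": len(rows)})
    return "processed"

print(handle_file_event({"type": "OBJECT_FINALIZE", "name": "invoice.png", "content_type": "image/png"}))
```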

The beauty of using this serverless service is that you can really “call” any block you need (all serverless too, no infrastructure provisioning), like a Lego game. The other benefit is that you can also use parameters and build generic workflows to be reused. You can make an improvement and it will be applied to every data workflow using this generic / template workflow.

And of course, because everything is code, you have all the DevOps tools. This is the only way to be fast and still deliver quality. It is a huge shift in the skills needed (I will talk about that in part 3), but this is the only way to fully “accelerate” (like the title of this book, where DevOps is a key part). Below is the list of the main serverless services used for building data products.

3. Data Domains: interoperability or development of usage?

On my previous article, I got a comment about the fact that data domains could become data silos, managed by functional departments focusing only on their own needs. More precisely, the concern was about data like “products” or “customers”, where you need a transversal approach. The way you have designed your data domains can help handle the problem, but in the end we should be very focused on the level of usage and on the way these data domains interact together.

The core idea of Data Mesh is to develop data usage and to remove the centralized and monolithic data warehouse, where access is very limited.

In her first article, Zhamak Dehghani has this very powerful graphic:

You have 6 “rules” to comply with if you are building data domains and data as a product. If you look at them closely, all 6 attributes serve one common goal: developing usage.

  • Discoverable: it helps to know what exists and what you can use (aka a data catalog)
  • Addressable: the address and location of every data asset
  • Trustworthy: the data asset is monitored so you know whether you can use it
  • Self-describing: well documented, with examples
  • Interoperable: meaning you can join different datasets together
  • Secure: access management is defined so you can have users

We don’t just need the data but all the metadata to be able to use this data. In this excellent article, Prukalpa explains that “We’re fast approaching a world where metadata itself will be big data”.

To be precise, we don’t need every piece of metadata, but we do need the metadata throughout the data life cycle, exposed in the right way to develop interactions and usage (a sketch of what such a descriptor could look like follows the list below).

  • At the center, you have a portal for the discovery of data domains: this is the place to find everything about your data (a purchased solution that we will customize and feed with the relevant metadata)
  • Data Asset Inventory: a more suitable term than just “data catalog”, this is all the data related to the creation of a data object. It answers: what do we have? where is the data (its address)? how many objects do we have?
  • Data Security & Access Management: who has access to this data? who are our active users? who are our inactive users? who is doing what?
  • Data Observability: how was this data generated (lineage)? when was it generated? why is it broken?
  • Data Privacy: for example, you have to archive or delete your data when required (the right to be forgotten, for instance)
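
To make this concrete, here is a minimal sketch of what a data product descriptor gathering this life-cycle metadata could look like (the field names and values are purely illustrative, not a standard):

```python
# A hypothetical descriptor for one data product, which the discovery portal could index and expose.
data_product = {
    "domain": "sales",
    "name": "daily_orders",
    "owner": "sales-data-team@example.com",
    "address": "bq://my_project.sales.daily_orders",            # addressable
    "description": "One row per order, refreshed every hour",   # self-describing
    "schema": {"order_id": "STRING", "country": "STRING", "amount": "NUMERIC"},
    "freshness_sla_minutes": 60,                                 # trustworthy: monitored against this SLA
    "quality_checks": ["order_id is unique", "amount >= 0"],
    "access": {"readers": ["group:analysts@example.com"]},       # secure
    "retention_days": 395,                                       # privacy: retention policy
    "lineage": ["bq://my_project.raw.orders_events"],            # observability: where it comes from
}
```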

The portal (Data Domain Discovery) is the accelerator for transparency, for the promotion of each data domain and, in the end, for its usage. This is also where you can calculate KPIs about usage and check the level of interoperability and usage of each data domain. It is a good sign if we can really measure the success of these data domains based on their level of usage. It will reinforce the data-as-a-product approach.

The data domain Discovery portal with all the metadata on the data life cycle

4. Federated Computational Governance or Automated Governance-as-Code (DataGovOps)

In her second article, Zhamak Dehghani defines the governance like this (one of my favorite parts): “a data mesh implementation requires a governance model that embraces decentralization and domain self-sovereignty, interoperability through global standardization, a dynamic topology and most importantly automated execution of decisions by the platform”.

Automating your data governance policies, or having them applied by default, is the only solution to cope with 3 factors:

  • the increase in data volume and in kinds of data (like photos, sound, video)
  • more users, if you want to develop usage
  • new regulations every day

Otherwise, the volume of work is too large and you will have to prioritize subjects like regulation. Data security can become chaotic, and visibility and control are drowned in a (data) lake.

The list of subjects to automate is not short.

  1. Creating the project or space for each data domain to collect or transform the data: each data domain always comes with the right services and the same inherited security.
  2. Data cataloging for each new piece of data, using templates to collect all the needed metadata: each dataset has the same template with the same metadata, created by the data product as code.
  3. Classification of this data to be able to apply the right policies (security, protection, retention, etc.): especially detecting sensitive data.
  4. User creation and access management: ideally, the rights are defined based on attributes coming from your user referential.
  5. Data protection (encryption, key management): all data at rest and in transit is encrypted, with key rotation.
  6. Data profiling: how to assess the profile of each column of each table, to be able to detect if something has changed.
  7. Data quality: how to apply data quality checks automatically, based on business rules?
  8. Data lineage: every piece of data produced can be analyzed (by column and row) to understand its origin.
  9. Data retention and deletion: how to be compliant with regulations without any manual tasks?
  10. And the most important one, data monitoring (including alerting and automatic actions): if there is a suspicious export of data, the task is terminated based on rules before it completes.

We have 3 kinds of automation:

  • By default: data encryption is a good example; you have a default mode where data is automatically encrypted.
  • Configuration of a cloud service: Google Cloud DLP can detect 120 infotypes, but you have to configure it and define what the actions will be (see the sketch after this list).
  • Fully custom-made with code: data lineage can be fully custom-made, based on steps included in the code to trace the different steps and transformations.
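
As an illustration of the “configuration” kind, here is a minimal sketch of a Cloud DLP inspection using the google-cloud-dlp Python client (the project ID is a placeholder, and the infotypes and sample text are arbitrary):

```python
from google.cloud import dlp_v2

PROJECT = "projects/my-gcp-project"   # placeholder: replace with your own project

dlp = dlp_v2.DlpServiceClient()

# The "configuration" part: which infotypes to look for and from which likelihood to report them.
inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,
}

item = {"value": "Contact me at jane.doe@example.com or +33 6 12 34 56 78"}

response = dlp.inspect_content(
    request={"parent": PROJECT, "inspect_config": inspect_config, "item": item}
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.quote)   # what was detected, and the matching text
```

The actions to take on a finding (masking, alerting, blocking an export) are the part you still have to define and automate.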

This is where you need the platform team (for automation) and the enabling team (to make the link with data governance) to manage these topics: these teams will work on automating every step of the data life cycle to enforce the data governance policies. For each subject, they will have more and more answers on how to automate it. Below you will find a list of services we are using on the Google Cloud Platform (I am sure you will find more or less the same services at any cloud provider):

  1. Resource Manager: to centrally manage all your projects and inheritance rules.
  2. Data Catalog: more of a central place to manage your metadata and your different templates.
  3. Cloud Data Loss Prevention: not yet activated, but it will be used to detect sensitive infotypes for regulations.
  4. Identity and Access Management: used together with BigQuery and Data Catalog; the hardest part will be how to automate the policy attached to each piece of data.
  5. Encryption at rest and in transit: part of the security in GCP.
  6. Data profiling: Google Cloud Dataprep is today the main option we are studying to automate this task.
  7. Data quality: will be a mix between our data discovery portal (to collect the business rules) and their transformation into data quality rules, executed with Cloud Dataprep for example (a possible choice to study).
  8. Data lineage: will certainly be fully custom-made, based on decoding the code used in Google Cloud Workflows.
  9. Data retention & deletion: should be guided by the data discovery portal and the data domain owner; we will have to define automatic jobs to apply the policy. We could define data workflows to do this task (see the sketch after this list).
  10. Monitoring: we will have our own tools, but we are also looking at Security Command Center to automate remediation.
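
As an illustration of what such an automatic retention job could look like, here is a minimal sketch using the BigQuery Python client (the table name and retention period are hypothetical; in our setup they would come from the data discovery portal, where the data domain owner declares the policy):

```python
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical values, normally read from the data discovery portal.
TABLE_ID = "my_project.sales.daily_orders"
RETENTION_DAYS = 395

table = client.get_table(TABLE_ID)
table.expires = datetime.now(timezone.utc) + timedelta(days=RETENTION_DAYS)

# The policy is applied as code: the table is deleted automatically at expiry,
# with no manual task and no one to remind.
client.update_table(table, ["expires"])
```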

We still have a lot of work to do, but you have an idea of how we are considering automating data governance.

Automate your data governance

Conclusion

At first, when you read the two articles on Data Mesh by Zhamak Dehghani, your attention goes to the idea of distributed data domains versus the monolithic data lake, and to data as a product. But inside this “revolution” there is another one, with the self-serve platform for each data domain, the architectural quantum of a data product and the computational policies embedded in the mesh. These 3 components are the “how” you need to move toward a data mesh. It is a huge shift, and I will talk about the change management part and the skills needed in part 3.

Towards a Data Mesh (part 1): Data Domains and Team Topologies

Just an illustration, not the truth: we will pivot if it does not work.

I discovered Zhamak Dehghani’s first article about Data Mesh in August 2020. Thanks to YouTube, you have a live illustration in this video, with even more context and explanations. And then there is a second video that is an introduction to her second article (December 2020). It is really a journey with a brilliant person who truly has a discovery mindset: she keeps making this concept better and clearer. Most of the time, I see her in a small video window, but even like this you can feel her very high level of assertiveness.

Her ideas are known by many now and you even have a community here.

The next question after reading her two articles: how do we implement it? In any organisation, you never have everything under control, and you have to consider the context and the starting point. It is always a work in progress too.

For the record, I never had in mind that “this is what we should implement”. But I recognise that I fully agree with the analysis and with the principles proposed to fix the problem.

I will try to answer the question “Could you illustrate your journey towards a Data Mesh?” with 3 articles: this one about data domains and team topologies, a second one devoted to the architecture and the technology, and a last one about change management and the skills needed.

1. Solving the right problem

Let’s say that your starting point is having multiple data teams and multiple data platforms. It is a very common pattern for data (I call it the “it’s my data” syndrome): everybody is happy with the autonomy, but your data naturally ends up in silos, possibly with multiple technologies. It could be tempting to have only one team working on one data platform as the solution. In fact, that is a problem too.

This is what Zhamak Dehghani describes as “centralized and monolithic”, illustrated with this “big data” platform diagram.

Creating a central platform and a central team does not match two very important points mentioned in her first article: data is ubiquitous and the need to innovate with data is urgent. With this kind of organisation, we are creating a single point under a very high level of pressure. This team and its platform are not just a bottleneck: they will concentrate all the frustrations. You can trust me, I have been in this situation so many times: it will never be enough if you have multiple business units or many countries. And because you can’t hold back the sea with your arms, you will have silos again. Even without any technology, in the end, people will use Excel or any other solution to fulfill their needs.

The right problem to solve is this one: how to have different teams working on different subjects (data domains) and still be able to share and cooperate.

In this article, Juan Sequeda gives maybe one of the best definitions of Data Mesh: “It is a paradigm shift towards a distributed architecture that attempts to find an ideal balance between centralization and decentralization of metadata and data management.” I would have added data teams to metadata and data management.

2. From Data Domains to stream aligned teams

Everything starts with the data domains: you will have to divide all the data of your organisation into data domains. I will not go into how you can define these domains (to stay focused on people and teams), but I strongly recommend this article by Ramesh Hariharan about Domain-Driven Design (DDD). This methodology is well suited to data because it is about modelling the business problem space, the language and the context.

Now that you have these data domains, you need to personify each one with a Data Domain Owner. You will find many definitions of what a Data Domain Owner should do. Some have a strong Data Governance flavour, very focused on glossaries, data quality and data stewardship. But you can also see the role as a Data Strategist who translates business problems into analytical solutions and starts to think of data as a product. This is really about building the vision of what the data domain should be: scope of data, use cases and value.

Once you have this vision, how do you make it a reality? The only answer is building a team (probably more than one), and more specifically a stream-aligned team as defined by Matthew Skelton and Manuel Pais in this book: “A stream aligned team is a team aligned to a single valuable stream of work… Further, the team is empowered to build and deliver customer or user value as quickly, safely, and independently as possible, without requiring hand-offs to other teams to perform parts of the work”. The hand-off part is very important, and you can illustrate it with the “you build it, you own it, you run it” principle.

Because we are using Scrum as a framework, a stream-aligned team on a data domain will include:

  • a data product owner: a very important role, because he/she converts the vision of a product into a deliverable or increment. Not just a classic PO: he/she has strong data and analytics skills too.
  • a scrum master: because facilitating the team is crucial.
  • the developers (including a tech lead): the people who will “code” the solution.

This is one team focused on value and on data as a product. Everyone in this team has a good understanding of the business challenge and of the data they are manipulating. We want to avoid the situation described by Zhamak Dehghani (the left side of her diagram, where people build data pipelines without understanding what they are doing) and be on the right side (building a data product that includes data pipelines, but not just that).

You can create as many teams as needed (if you have the resources), because all these teams are independent. It can be one team for a data domain (full scope), but it can also be a part of a data domain, for a country or for a business unit. From a data governance point of view, all these teams share the same categorization by data domain and sub-domain.

The most important ritual is going to be the “sprint review” of all these teams, one by one. It can last all day, but this is the place where each Data Domain Team can see the work of the other teams (and demonstrate their own product), ask questions and imagine collaborations and partnerships. It is a ritual to develop transparency and to see the progress of each data domain. With virtual meetings, this ritual is very different from the physical one we used to have; it is not rare to have more than 80 people attending.

The agile methodology is a very important framework to run the teams (explained here in a previous article), and it is very important to develop the mindset if you do not want to do “fake agile”. The change management is huge and underestimated, because you will never find someone against being agile. But in fact, every organisation develops many anti-agile patterns because of its culture or its history.

3. The Platform Team

After creating these autonomous teams, you will need to support them and to ensure consistency in the way they deliver their data products. As defined in Team Topologies, “the purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy… The platform team provides internal services to reduce the cognitive load that would be required from stream-aligned teams to develop these underlying services”.

You have the same organisation, with a product owner, a scrum master and the developers. The “product” is (most of the time) an API service that will not just ease the work of the developers in the stream-aligned teams but also standardise their work.

You can have a VERY long list of services, but the priorities should be (a short sketch of such a service follows the list):

  • Data object creation (from ingestion to transformation and exposure)
  • Data observability (as defined by Barr Moses here)
  • Orchestration (from real time to complex workflows)
  • Security & Access management
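
To give an idea of what such an internal service could look like, here is a minimal sketch using the BigQuery Python client (the naming convention, labels and function are our own illustration, not a standard API):

```python
from google.cloud import bigquery

client = bigquery.Client()

def create_domain_dataset(domain: str, sub_domain: str, location: str = "EU") -> bigquery.Dataset:
    """Hypothetical platform service: create a dataset for a data domain with standard conventions.

    Stream-aligned teams call this instead of creating datasets by hand, so naming,
    location and governance labels stay consistent across the mesh.
    """
    dataset = bigquery.Dataset(f"{client.project}.{domain}_{sub_domain}")
    dataset.location = location
    dataset.labels = {"data_domain": domain, "sub_domain": sub_domain, "managed_by": "data_platform"}
    return client.create_dataset(dataset, exists_ok=True)

# A stream-aligned team creating its workspace:
create_domain_dataset("sales", "orders")
```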

This is where you have the link with the DataOps movement. In this previous article, you have all the basics needed for a “Data Platform as a Service”.

It is totally aligned with this diagram from Zhamak Dehghani in her first article.

This team is exactly what makes the difference between “we are like any other organisation” and the efficient one described by Nicole Forsgren in her book Accelerate. They are also involved in the sprint review ritual described for the stream-aligned teams. The intent is the same: develop transparency and get feedback from the other teams. The success of this team is based on its ability to develop two opposite things:

  • Empathy for the developers in the stream-aligned teams (by building feedback loops)
  • Great services with 99.9996% quality (by staying focused and concentrating on the quality of the development)

There will be friction between the “data infra engineers” and the data engineers in the stream-aligned teams. The best image to use is the chicken-and-pig fable: the developers in the stream-aligned teams are committed, the platform team is involved. The idea is to have only pigs and no chickens; every team should be committed. Working on this relationship will be a key success factor, even if they share the same leader.

The choice of technology will play an important role (it will be covered in the second article), and your decision on which service to implement should be based on how it improves the velocity of the teams, not on a supposed “state of the art”.

4. The Enabling Team

It is not finished yet. There is still one missing part: a team (as defined in Team Topologies) “composed of specialists in a given technical (or product) domain”. In the data context, you have many transversal subjects where a very high level of expertise is needed, like:

  • Your cloud service provider (mastering the cloud services, the connections to your source systems, the costs with FinOps)
  • Data Security & Access management and Data Privacy
  • Data Management, like data modelling, catalogs or data quality

Because data is an asset, you must have governance; and to close the loop, because you have created these data domains, you need to govern them and apply the same rules to all of them. These topics are transversal to the stream-aligned teams and to the platform team as well.

In Zhamak Dehghani’s second article, the closest thing to this Enabling team is what she calls federated computational governance.

The mistake would be to let each stream-aligned team and the platform team handle these topics by themselves, because it takes a lot of effort (and meetings…). That is why you need this Enabling team, made up of individuals with a very high level of expertise and a “strong collaborative nature”. They are the link (or the proxy) between these two team topologies (stream-aligned and platform teams) and the other departments of your very large organization. The perfectly wrong thing to do would be having all your stream-aligned teams attend a meeting on data security given by the security department and then figure out how to apply it in their context.

In real life, the enablers can very rapidly turn into a big blocking point, for these reasons:

  • Lost in the processes of the organisation: they were there to help others navigate, but they are the first to need help.
  • Lost credibility: they do not have enough expertise or empathy, and every team leader will try to avoid this team or these individuals as much as possible.
  • Lost in their expertise: “I am in my ivory tower and I strongly believe that no one else can understand the subject.” They are not seen as doers.

The key to building this team is not selecting the right expertise but the right mindset, one that will fit with the culture of all the other teams. Let’s face reality: Data Governance needs a paradigm shift too, and I strongly recommend the work of Barr Moses and the blue pill / red pill analogy.

You now have the 3 team topologies for a data mesh organisation. It looks like this, and you have all the key elements:

  • Data-domain-oriented stream-aligned teams building data as a product
  • Data services as a platform with the platform team
  • Global Data Governance with the enabling team

Conclusion: the link between Team Topologies & Data Mesh

I just wanted to map the data mesh principles (described by Zhamak Dehghani in her two articles) to the 4 team topologies (yes, there are 4 and not 3; the 4th one could be all the teams managing the IT systems you need as sources). You have so much more in the book, with subjects like:

  • Software sizing and cognitive load (very high when it comes to data)
  • Heuristics for Conway’s Law (the link between the architecture chosen and your teams)
  • Patterns for team interactions (the success of your teams is there)
  • Triggers for change and evolution (because nothing is static)

I like the subtitle of the book too: “organizing business and technology teams for fast flow”. You will not find a better summary of the data challenge in any organisation.

But do not be misled: the heart of the Data Mesh is the architecture (I will come back to that in the second article), but everything starts with data teams and how they are organized. It is very far from the one central team that will save the world because it has a big data platform. And because of Conway’s Law, we all know about the link between teams and the design of your system.

If you want to start your journey to a data mesh, it will surely start with the way you have organized your data teams and how each team can contribute and share to build the data asset of your organisation.

My next article will be about how to do it from an architecture and technology point of view.

Build your data pipelines like the Toyota Way

If there is only one book to read about lean manufacturing, this is the one. It is the kind of book you can read again and again and still learn something about your current context.

It is also a book you can read whatever your industry: you will always find situations covered by it.

Today, we are going to apply these principles to data pipelines.

“The right process will deliver the right results” – The Toyota Way (section II)

Of the 14 Toyota Way principles, 7 are devoted to the process. This is where you have all the main tools to improve manufacturing processes. The idea is to transpose these 7 principles to data pipelines, knowing that:

  • Data pipelines are 100% flexible: if you have the skills, you build the pipeline you want.
  • Data pipelines are virtual: there are no people involved, it is a computer workload.
  • Data pipelines are “unlimited”: with the right system, you can have thousands of data pipelines.

So it looks like a fairly easy activity, where a good design with the right system should be enough. In fact it is not: every day, all over the world, there are tons of broken data pipelines. A tech team can spend 40% of its time just fixing them. It can be horribly expensive too, with software licence costs and a very large infrastructure. And of course, the data is often late or down (the famous data downtime promoted by Barr Moses).

What does a bad data pipeline process look like?

Before going deeper into the 7 principles, I just want to give real-life examples to illustrate what a very bad (or not very optimized) data pipeline looks like.

  1. We only have batch processes: we receive all the files at the same time and we have to process them in a very short window of time. We are mainly focused on these files and less on the end result. We do not have time to check the dashboards or reports produced with this data.
  2. The other applications send us files, but we never know if the delivery is complete. If it is late, we can’t wait: we start our “batch” anyway.
  3. We have upgraded our main servers many times, but capacity often still falls short, especially when we have to do catch-ups.
  4. We have 2 teams: one builds the pipelines and the other maintains them. If something is broken, the team building the pipelines is too busy with its projects and does not want to fix it.
  5. Because we have different squads (see my previous article on why it is not a good idea), we have many ways of building the same data pipelines!
  6. We tried the latest open-source project because it was easier, but we did not check the support.
  7. We do not really know what is happening (see again data observability by Barr Moses).

Let’s do it like the Toyota Way

#1 “Create continuous process flow to bring problems to the surface”

Capturing the data continuously creates a flow, and it is much harder because you have to anticipate all the problems. And if there is an unexpected problem, you have to be quick to fix it! It is much more customer-oriented, because the customer does not need to wait every day for his report: he can have a refresh all the time. In the book, the focus is on reducing waste (muda), and this can be transposed to data pipelines: no need to store files before they are processed, and no need for a powerful server to manage all the files to be ingested.

#2 “Use ‘Pull’ Systems to avoid Overproduction”

I guess that if you ask any BI manager, he will tell you “it’s better if you can push me the data”, because then he has nothing to do and the responsibility is on the source side. He is right, but if you want to control your destiny, pull is a better philosophy. You know when you are ready to do it, you can put checks on the sequence because you know where you are, and in the end you are always the one responsible if you did not deliver.

#3 “Level out the workload”

It is the consequence of #1. If you have a continuous flow, you can also aggregate the data continuously and calculate your KPIs. This is a very common capability of real-time processing. Doing that, you will avoid any overload and be sure that everything goes smoothly.

#4 “Build a culture of stopping to fix problems, to get quality right the first time”

It could have been a DevOps principle too: “you build it, you run it”. Clearly, you should stop any development if there is a bug. Organisations with projects on one side and run on the other side are doomed: you will never get the quality. Stop everything, fix the bug, and I am sure the devs will get better.

#6 “Standardized tasks are the foundation for continuous improvement and employee empowerment”

Again, this could be a DevOps principle too! It is not very far from continuous integration / continuous deployment. How can you standardize things so that each pipeline has the same structure, the same metadata generated, the same observability? (A small sketch of one way to do it follows.) I also agree about the employee empowerment: spend more time on the business logic than on developing these data pipelines.
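
One possible way to standardize pipeline steps (a minimal Python sketch of our own, not a specific framework) is to wrap every step so that structure, timing and metadata are always produced the same way:

```python
import functools
import time

def pipeline_step(name: str):
    """Decorator: every step reports the same metadata (name, status, duration)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = "success"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "failed"
                raise
            finally:
                # In a real setup, this would be pushed to your observability stack.
                print({"step": name, "status": status, "duration_s": round(time.time() - start, 3)})
        return wrapper
    return decorator

@pipeline_step("load_orders")
def load_orders() -> list:
    return [{"order_id": 1, "amount": 42.0}]   # stand-in for the business logic

load_orders()
```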

#7 “Use visual control so no problems are hidden”

It is maybe the most important point about your data pipelines: what kind of visibility do you have on what is happening? When I hear that it is about reporting, I think it is a very narrow way to look at this topic. We can do much better than a red or green light. Process mining is a source of inspiration for visualizing your data pipelines.

Conclusion

I am still amazed that these principles dedicated to manufacturing work so well with data pipelines. You can also see the similarities with DevOps principles. The focus here was on “processes”, but you have 3 other interesting sections too:

  • Having a long-term philosophy,
  • Adding value to the organisation by developing people,
  • And solving root problems to drive organizational learning.

This book is a gem because all the principles in it can fit your data teams. You should not see data pipelines as a purely technical subject; you should see them as something bigger: a system that provides tools for people to continually improve their work!

The rise and fall of the Agile Spotify Model

If you are working in the tech field, I think you have already heard of Squads, Tribes, Chapters or Guilds. It comes from Spotify, a Swedish audio streaming company. If you are organizing #datateams, it could be tempting to copy/paste it. You really should not!

The Spotify Model and Engineering Culture

If you want to go back to the original article, it is here. It has become a standard, and you have many articles explaining again and again how great it is for Agile at Scale (just Google it). It looks beautiful, doesn’t it? And the “aha” moment was to explain every concept as if it were the solution to all of our problems.

But there is even better than that: I was fascinated by these 2 pieces of material about the Engineering Culture.

There are videos (part 1, part 2) explaining how it works. It was like a revolution in a world still dominated by the V-cycle, and it looks so simple (thanks to these funny graphics) and so obvious. I am still using some of these concepts, especially this one: “Trust > Control”. I do love it.

2020… The end of the Utopia

First, you have the feedback from inside with this presentation, “There is no Spotify Model”. Marcin Floryan (Director of Engineering @ Spotify) could not be clearer, and his argument was that you should have read the disclaimer.

A very important disclaimer indeed: the point is that organisations should not be “frozen” in a model. They have to evolve and the model must be adapted.

But you have even more with this article: “Spotify’s Failed #SquadGoals – Spotify doesn’t use ‘the Spotify model’ and neither should you.” And Jeremiah Lee’s (a former Spotify employee) message was quite clear.

  • Matrix management, or the weakness of the chapter lead position: in this organisation, “chapter leads are servant-leaders who help you grow as an individual”. It is important, but not enough if you want to reach technical excellence. Your engineers are just like hamsters on a wheel: you are not building anything solid without strong tech leadership.
  • Team autonomy, or you are not living alone: giving autonomy to a team does not mean it will not have any relationships with the rest of your organisation. Managing alignment, accountability and relationships with others is even more important than autonomy.
  • Collaboration, or putting people together is not enough: just as you need technical excellence, you need collaboration excellence too. You need tech leads; you need “collaboration” leads too.
  • Mythology, or the gap with reality: squads, tribes, chapters and guilds look cool and different. They were not. It was just a renaming.

This book could help

This book focuses on organizing business and technology teams. It is much more complex than just having cross-functional teams: it is about having different team topologies and how they should collaborate.

The best part is about the relationships between teams: you can define the most effective way to collaborate.

You have this video here making the link between the Spotify model and this book.

And most importantly, it is also about triggers for change and evolution.

Conclusion

There is no magic key! You have to find inspiration in good practices (true ones), but you have to adapt them to your context and your subject. And above all, you have to do it every day!