Blog

The Agile Manifesto: 20 years later

Or Robert C. Martin, the uncle you should visit more often.

Where was I 20 years ago, when these 17 brilliant folks were at a ski resort writing the Agile Manifesto? I was part of a small team of great individuals; in fact, we were an alternative to an IT department unable to deliver what we wanted, so we decided to do it ourselves. Without knowing it, we were totally in the agile mindset: valuing interactions, working software, our collaboration with the users, and being able to change, because we were in a very fast-growing company.

20 years later, it is totally different. Agile is everywhere now, we have all these practices coming mainly from Scrum, and I can see both the journey and the level of professionalization.

Back to basics (relative comparisons and hope killers)

I discovered this video and its author quite recently: it is an hour long and I really had fun watching it (highly recommended). Robert C. Martin (Uncle Bob) is one of the 17 pilgrim fathers of Agile, and his main message (also in his book Clean Agile) is that “Agile got muddled”. For him, the message was lost because too many people tried to add more things to Agile. In this video, he wants to come back to the basics.

The video is split into two parts. The first part tells the story of a world dominated by waterfall and how 17 people changed it with this manifesto. It is clearly a good demonstration of what collective intelligence can do and of how something can go viral. The message was not “waterfall is bad”: they simply valued interactions, working software, customer collaboration and responding to change more. This relative comparison is the main message of the manifesto. It does not mean no processes, no documentation, no contracts and no plans. The relative comparison was what they all agreed on, their “aha” moment.

In the second part of the video, Uncle Bob gives an overview of Agile with these two punchlines: “Hope is the project killer; the purpose of Agile is to destroy hope and replace it with data” and “You don’t do Agile so you can go fast, you do Agile to know how fast you are going”. Two ugly graphs illustrate that.

Yes, this is the heart of Agile

Agile is more than that, but his intent is to kill the supposed magic around Agile: a Scrum Master is not Harry Potter; you will not suddenly go faster, and you cannot have a baby in one month. Another way to say it: the level of expectation placed on Agile is often too high. There is no magic, just good project management.

How about the next 20 years?

The funny part is that in this video, Uncle Bob mentions the level of indoctrination caused by waterfall: it was impossible to think of another way of doing software. I think we are living through the same period with Agile and its tons of practices, coaches and trainings.

If you look at the 12 principles, many of them are still not mastered 20 years later:

  • Technical excellence is not associated with Agile; quick and dirty is
  • Self-organizing teams still struggle with architecture because of the thickness of the technical stack
  • Simplicity, maximizing the amount of work not done, is not accepted; we want it all
  • Sustainability: many agile teams look like a hamster in a spinning wheel

And most importantly, the business is still seen as a customer and the technical team as a vendor, even when they work for the same company. Focusing on the true customer and delivering value is the exception, not the rule.

I also see some very strong trends, like asynchronous communication or companies going 100% remote, that will change a lot about the way we work today. And of course, software craftsmanship is clearly a reaction to Agile:

  • Not only working software, but also well-crafted software
  • Not only responding to change, but also steadily adding value
  • Not only individuals and interactions, but also a community of professionals
  • Not only customer collaboration, but also productive partnerships

Maybe the true meaning of being agile in the next 20 years looks like this photo.

Why do you need Agile Software Development for your Data Use Case?

The agile part of DataOps

In this previous article, we defined DataOps as a “combination of tools and methods inspired by Agile, DevOps and Lean Manufacturing” (thanks to DataKitchen for this definition).

Let’s focus on the agile part and why it is so relevant for your data use cases.

  1. “Data is like a box of chocolates, you never know what you’re gonna get.”

It is about the nature of data: you cannot guess the content and the quality of your data sources before seeing them. Exploratory Data Analysis is the starting point, and you will have to adjust based on what you find. Some metrics or dimensions you specified will turn out to be impossible given the data sources. Sometimes you will have to build reference tables because the data is too granular.
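As a minimal sketch of that first exploratory pass (the records and field names below are made up for illustration), a few structural checks already tell you a lot about a new source:

```python
# Hypothetical raw records freshly received from a new source.
records = [
    {"customer_id": 1, "amount": 10.5, "country": "FR"},
    {"customer_id": 2, "amount": None, "country": "FR"},
    {"customer_id": 2, "amount": 3.0, "country": "US"},
]

# First exploratory pass: volume, missing values, duplicate keys, domains.
ids = [r["customer_id"] for r in records]
profile = {
    "rows": len(records),
    "null_amounts": sum(1 for r in records if r["amount"] is None),
    "duplicate_ids": len(ids) - len(set(ids)),
    "countries": sorted({r["country"] for r in records}),
}
print(profile)
# → {'rows': 3, 'null_amounts': 1, 'duplicate_ids': 1, 'countries': ['FR', 'US']}
```

Every surprise this pass uncovers (nulls, duplicate keys, unexpected domains) is a requirement you could not have written down in advance.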

This is where Agile comes in. Because you do not have stable requirements, you must be able to adapt and change at any time. Starting a data project with fixed dates and fixed requirements does not exist in real life. This is also the difference between an A team and a normal team: “A late change in requirements is a competitive advantage.” – Mary Poppendieck.

2. Rescheduling is a competency, not an inevitability

Agile is about mastering your schedule. You are not guessing anymore: you have your backlog, your user stories, your points, your velocity, and you can start to calculate when it will be done and what you will have. To quote Robert C. Martin in this brilliant video: “You don’t do Agile to go fast, you do Agile to know how fast you are going”.

The heart of Agile is here: produce data to know where we are and where we will be in two or three sprints. This is also where, as a team, we can discuss our priorities. It does not mean we have to cut scope; it is about what we should do first.
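That forecast is plain arithmetic on your own data. A minimal sketch, with made-up story points and velocities:

```python
import math

# Remaining backlog in story points, and velocity over the past sprints
# (all numbers are illustrative).
remaining_points = 120
past_velocities = [28, 31, 25, 30]

avg_velocity = sum(past_velocities) / len(past_velocities)  # 28.5 points/sprint
sprints_left = math.ceil(remaining_points / avg_velocity)

print(f"average velocity: {avg_velocity} points, sprints left: {sprints_left}")
# → average velocity: 28.5 points, sprints left: 5
```

No magic here either: the forecast is only as honest as the velocity history behind it, which is exactly the point of “replacing hope with data”.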

3. Technical excellence, or the fastest route is not always a straight line

The forgotten part of agile software development: developing technical excellence. Even in 2020, Agile means “quick and dirty”. In fact it is always slow and dirty, because of the technical debt created. This is where Agile goes hand in hand with DevOps: you must have the technical stack to be able to deliver, and if you do not have it, you have to build it. In the end, you have better chances of going faster, and you will not postpone all the quality problems to your run phase.

DataOps is all about this technical excellence. Data is a journey and you will have tons of pipelines to build. Don’t be Sisyphus with your data as the boulder.

4. The human side of Data

If you think data is a cold and scientific subject, it is not. In the end, you have users who will use the data, and there is nothing more relative than “good data” or a “good dashboard”. Proximity with the business and daily communication are the main key success factors for your data use case.

This is the opportunity to capture all these rules about data quality and convert them into automated data tests.
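As a hedged sketch of what “rules converted into automated tests” can look like (the rule names and fields are hypothetical, not from any particular tool):

```python
# Business rules gathered from users, expressed as automated checks.
# Each rule maps a name to a predicate over one record.
rules = {
    "amount_is_positive": lambda r: r["amount"] > 0,
    "country_is_known": lambda r: r["country"] in {"FR", "US", "DE"},
}

def run_data_tests(records):
    """Apply every rule to every record and collect (row, rule) failures."""
    failures = []
    for i, record in enumerate(records):
        for name, rule in rules.items():
            if not rule(record):
                failures.append((i, name))
    return failures

records = [
    {"amount": 19.9, "country": "FR"},
    {"amount": -5.0, "country": "XX"},  # violates both rules
]
print(run_data_tests(records))
# → [(1, 'amount_is_positive'), (1, 'country_is_known')]
```

The value is less in the code than in the conversation: each rule is a sentence a business user said out loud, now running on every batch.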

The very strange way of doing Data Quality at Airbnb

or why you should have a look at Data Observability!

This article is the second part of how Airbnb manages data quality: “Part 2 — A New Gold Standard”. The first part can be found here; it was mostly good principles about roles and responsibilities.

The second part is really about how they do it and all the steps to obtain a “certification”. They named it Midas, after the famous king who could turn everything into gold (with a not-so-good ending). Welcome to this strange way of managing data quality!

  1. Strange data quality dimensions

The list given in the article is:

  • Accuracy: Is the data correct?
  • Consistency: Is everybody looking at the same data?
  • Usability: Is data easy to access?
  • Timeliness: Is data refreshed on time, and on the right cadence?
  • Cost Efficiency: Are we spending on data efficiently?
  • Availability: Do we have all the data we need?

It looks like a game where you have to find the errors…

  • “Correct?”… That could be very different from one person to another!
  • “Consistency” has nothing to do with everybody looking at the same data; it is more about the consistency of your data compared to its past.
  • “Usability” is interesting, but it won’t help with data quality.
  • “Timeliness”: the only correct one 🙂
  • “Cost efficiency” could mean years of discussions with your finance department about data ROI.
  • “Availability”: it is not about having all the data we need, it is just whether the data is… available.
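To illustrate what consistency against the past can mean in practice, a minimal sketch with made-up daily row counts and an arbitrary 20% tolerance:

```python
# Daily row counts for a table over the last week (illustrative numbers).
history = [10_120, 10_340, 10_210, 10_500, 10_280]
today = 6_900

def is_consistent(history, value, tolerance=0.2):
    """Flag a value deviating more than `tolerance` from the historical mean."""
    mean = sum(history) / len(history)
    return abs(value - mean) / mean <= tolerance

print(is_consistent(history, today))  # → False: today's volume dropped sharply
```

A check like this knows nothing about who is looking at the data; it only compares the data to itself over time, which is the usual meaning of the dimension.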

2. Strange certification: turning all your data into gold

If a dataset has all these data quality dimensions checked and OK, it gets this approval.

Hmm, it could be difficult to reach and maintain this level (with the data quality dimensions defined earlier)… and your final users could look like this, panning for gold for years.

Normally, you should classify your data with gold, silver and bronze tags based on the importance or criticality of the data, and then focus your attention first on “gold” data to reach the right data quality level. Solving a data quality problem is not just about the data itself: it could come from processes or from what is happening in the source. Focusing your effort on what is important will leverage your “data quality management service” team.
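A tiny sketch of that prioritization, with a hypothetical inventory of datasets tagged by criticality:

```python
# Hypothetical inventory: each dataset carries a criticality tier.
datasets = [
    {"name": "web_clickstream", "tier": "bronze"},
    {"name": "revenue_daily", "tier": "gold"},
    {"name": "crm_contacts", "tier": "silver"},
]

# Work on gold first, then silver, then bronze.
priority = {"gold": 0, "silver": 1, "bronze": 2}
queue = sorted(datasets, key=lambda d: priority[d["tier"]])

print([d["name"] for d in queue])
# → ['revenue_daily', 'crm_contacts', 'web_clickstream']
```

The point is not the sort itself but the decision it encodes: the team’s limited attention goes to the data that matters most.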

3. Data quality validation, reviews, lots of manual tasks… or the Everest north face of data quality

If you take each step, it looks logical and solid. Of course you must document everything and have many validations. But why keep it apart from your data development process? Data quality validation should not sit on top of everything; it should be totally integrated into your data pipeline process. The other point is scaling: you will need an army to make sure the process is applied, because everything is manual.

The “other” approach, developed by concepts like #DataOps or #DataObservability, is about these three functions.

Data Observability must be built into your data platform: any data you have will be observed. Every data pipeline has automated data testing by default. Every alert has a clear process, also with automation (self-recovery actions).
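A minimal sketch of “testing by default, alerting with self-recovery”; every function name below is made up to keep the example self-contained:

```python
def load_source():
    """Hypothetical extraction step returning raw rows."""
    return [{"id": 1, "value": 42}, {"id": 2, "value": None}]

def default_tests(rows):
    """Checks that run automatically on every pipeline, whatever the source."""
    issues = []
    if not rows:
        issues.append("empty batch")
    if any(r["value"] is None for r in rows):
        issues.append("null values")
    return issues

def self_recover(rows, issues):
    """One possible automated recovery action: drop the null rows."""
    if "null values" in issues:
        rows = [r for r in rows if r["value"] is not None]
    return rows

rows = load_source()
issues = default_tests(rows)      # in a real platform, this raises the alert
if issues:
    rows = self_recover(rows, issues)

print(issues, rows)
# → ['null values'] [{'id': 1, 'value': 42}]
```

The design choice is that no pipeline opts in to testing: the checks and the recovery path are part of the platform, not of each project.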

Conclusion

The article ends with “requires substantial staffing from data and analytics engineering experts (we’re hiring!)” and “quality takes time”…

It does take time, but how you use this time is the most important question. And hiring is one thing; keeping people is another, especially when they spend their time on boring tasks.

In Google now, you get a list of related questions when you search for a subject. I just picked this one.

The last (but not least) “ops” you need for your data: DataGovOps

To finish the trilogy (DataOps, MLOps), let’s talk about DataGovOps, or how you can support your Data Governance initiative.

  1. The origin of the term: DataKitchen

We must give credit to Chris Bergh and his DataKitchen team. You should visit their website; you will find incredibly good stuff there. This article was published in October 2020 under the title “Data Governance as Code”. The idea behind it is a governance that “actively promotes the safe use of data with automation that improves governance while freeing data analysts and scientists from manual tasks”. The article is well illustrated, with many examples, and it is really the founding article.

Let’s try to give another illustration of this idea.

2. The DataOps heritage

In my previous article, I described the loop around the DevOps part of DataOps.

  • Parametrization (every new data object is a parameter) leads to the Data Catalog (every new data object generates all the needed metadata by itself)
  • Monitor everything (every step is monitored and tested) leads to Data Quality/Observability (every step comes with data quality checks)
  • Orchestrate on events (each workflow is event-driven) leads to Data Lineage (every step of the workflow is recorded every time).

As mentioned in the DataKitchen article, it is deployed automatically with code. It is not extra work done by reading the database schema or based on your ETL. In every step, we do not just read, transform and write data; we do the same with the metadata.
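A minimal sketch of that idea, with a plain dictionary standing in for a real data catalog:

```python
from datetime import datetime, timezone

catalog = {}  # Stand-in for a real data catalog / metadata store.

def write_with_metadata(table, rows, source, step):
    """Write the data AND its catalog + lineage metadata in the same step."""
    catalog[table] = {
        "row_count": len(rows),
        "columns": sorted(rows[0].keys()) if rows else [],
        "lineage": {"source": source, "step": step},
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    # In a real pipeline the rows would be persisted here as well.
    return rows

write_with_metadata(
    "sales_daily",
    [{"day": "2021-03-01", "total": 1250.0}],
    source="erp_extract",
    step="aggregate_daily",
)
print(catalog["sales_daily"]["lineage"])
# → {'source': 'erp_extract', 'step': 'aggregate_daily'}
```

Because every write goes through the same function, the catalog and the lineage can never drift from the data: they are produced by the pipeline itself, not reconstructed afterwards.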

The last part added is data security and privacy. Every data governance policy on this topic must be readable by code, so it can act in your data platform (access management, masking, etc.).
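A minimal sketch of a policy that code can read and act on (the policy format and column names are invented for illustration):

```python
# Hypothetical governance policy stored as data, so code can enforce it.
policy = {
    "table": "customers",
    "masked_columns": ["email", "phone"],
}

def apply_masking(rows, policy):
    """Replace the values of policy-listed columns before exposing the data."""
    masked = []
    for row in rows:
        row = dict(row)  # copy, so the source rows stay untouched
        for col in policy["masked_columns"]:
            if col in row:
                row[col] = "***"
        masked.append(row)
    return masked

rows = [{"name": "Alice", "email": "alice@example.com", "phone": "0601020304"}]
print(apply_masking(rows, policy))
# → [{'name': 'Alice', 'email': '***', 'phone': '***'}]
```

The governance document and the enforcement are now the same artifact: change the policy, and every pipeline that reads it changes behavior on the next run.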

3. How does it help Data Governance?

You will find a good definition here: “True data governance puts the rules in place and aligns the organization so data is not a potential liability”. To do that, Data Governance needs a lot of information about the data itself (inventory, users, data quality measurements, access management, usage, etc.) to be able to align people, processes and policies. Data Governance also needs to be able to implement the policies directly in the data platform, or to enrich the data generated. This is the role of DataGovOps: reducing the cycle time between policies and their deployment.

4. DataGovOps, or the 5W2H approach

5W2H stands for Why, Who, What, Where and When, plus How and How many/much. It is a common way to ensure you are covering every aspect of a subject. It is a useful tool for your data, because you have many questions:

  • Why is the data pipeline broken?
  • Who has access to this data? Who used this table?
  • What data do we have? What data quality checks run on this table?
  • Where is this table? Which source does it come from?
  • When was this table updated?
  • How was this metric or table generated?
  • How many tables do we have? How many users? etc.

Starting DataGovOps means aiming to answer all these questions easily, at any time, without any additional manual task. Knowing what is happening in your data is maybe the first real sign that Data Governance is in place.

AI/ML: PowerPoint vs real life

(and why you need MLOps, and how)

Sources of inspiration: 1 book, 1 video and 1 article

This book by Mark Treveil and the Dataiku team is a good starting point (I really like their risk matrix). The video you should see is from Kaz Sato: I hope you will enjoy his sense of humor, but more than that, it is a very good, well-illustrated coverage of ML best practices. And the article you should read has no author and can be found in the middle of the Google Cloud documentation; it illustrates the different stages of MLOps. I am not advertising for Google, just mentioning that this work is valuable.

You will find in these three sources all the concepts and ways of working from the right column. Rome was not built in a day… and the same goes for MLOps. The challenge now is really to define stages and to avoid the left column as much as possible.

More than reducing cycle time and increasing quality: scale this subject!

MLOps will help reduce cycle time and increase model quality. But more than that, it is the key success factor for scaling. It is easy to maintain 10 ML models; it is a totally different story with hundreds or thousands. The difference will not be the talent war for data scientists; it will be about how to scale, how to become a “scalist”. Tools, organization and people (as a team) are the core essence of MLOps.