Sources of inspiration: 1 book, 1 video and 1 article
This book from Mark Treveil and the Dataiku team is a good place to start (I very much like their risk matrix). The video you should watch is from Kaz Sato: I hope you will enjoy his sense of humor, but more than that, it is a very good, well-illustrated overview of ML best practices. And the article you should read has no named author and can be found in the middle of the Google Cloud documentation. It illustrates the different stages of MLOps. I am not advertising for Google, just pointing out that this work is valuable.
You will find in these 3 sources all the concepts and ways of working listed in the right column. Rome was not built in a day… The same goes for MLOps. The real challenge now is to define the stages and to avoid the left column as much as possible.
More than reducing cycle time and increasing quality: scale this subject!
MLOps will help to reduce cycle time and increase model quality. But more than that, it is the key success factor for scaling this subject. It is easy to maintain 10 ML models; it is a totally different story with hundreds or thousands. The difference will not come from the talent war for data scientists, it will come from knowing how to scale and becoming a “scalist”. Tools, organisation and people (as a team) are the core essence of MLOps.
“Dataops is not just devops for data”, but there is a specific flavor when we are talking about the devops part.
1. “Data as a platform”
Data is not a project or a list of projects. In the end, what you are building is a platform, or at least a framework, because more and more data will come. You can of course manage it as a single project, but you will start to build silos even within the same team. “Data as a platform” means that you are building a destination where it will be easy to ingest, transform, monitor and expose data at scale and at speed. From day one, you have to think about that and find the right balance between project delivery and building this platform. You are going to develop and code a list of features as services delivered by the platform.
2. “Parametrization” is king
“Analytics is code”, but you also have to “make it reproducible”. Nothing looks more like a data pipeline than another data pipeline, so there is no reason to reinvent the wheel every time. The parameters will probably include SQL code (SQL is back) that can range from very simple to very complex, but you can be sure that every data pipeline will be built the same way and that deployment will hold no surprises.
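To make this more concrete, here is a minimal sketch of what one generic pipeline driven by parameters could look like. It is only an illustration under my own assumptions: `PipelineParams`, `run_pipeline`, the `orders` table and the use of an in-memory SQLite database are all hypothetical names and choices, not a description of any real platform.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineParams:
    """Everything that differs between two pipelines (hypothetical fields)."""
    name: str
    target_table: str
    transform_sql: str  # a SELECT statement, from very simple to very complex


def run_pipeline(conn: sqlite3.Connection, params: PipelineParams) -> int:
    """The single generic pipeline: rebuild the target table from the SQL parameter.

    The parameters are trusted configuration here, not user input,
    which is why they are interpolated directly into the statements.
    """
    conn.execute(f"DROP TABLE IF EXISTS {params.target_table}")
    conn.execute(f"CREATE TABLE {params.target_table} AS {params.transform_sql}")
    conn.commit()
    row = conn.execute(f"SELECT COUNT(*) FROM {params.target_table}").fetchone()
    return row[0]  # number of rows written to the target


def demo() -> dict:
    """Run two very different transformations through the same code path."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("FR", 10.0), ("FR", 5.0), ("DE", 7.0)],
    )
    pipelines = [
        PipelineParams("copy", "orders_copy", "SELECT * FROM orders"),
        PipelineParams(
            "by_country",
            "orders_by_country",
            "SELECT country, SUM(amount) AS total FROM orders GROUP BY country",
        ),
    ]
    return {p.name: run_pipeline(conn, p) for p in pipelines}
```

Adding a new pipeline then means adding a new set of parameters, not new code, which is where the "no surprise at deployment" comes from.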
3. Monitor everything
Another benefit of having a platform and working with parametrization is that you can decide how to log every event in a consistent manner throughout the data workflow. It is going to be totally under your control and you can improve it step by step with more metadata or events. You can really customize it according to your needs (debugging, data governance, data monitoring).
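As an illustration of what "consistent" could mean in practice, here is a small sketch where every step of every pipeline emits events with the same shape. The field names (`ts`, `pipeline`, `step`, `status`, `metadata`) and the helpers `EventLog` and `monitored_step` are my own assumptions, chosen for the example only.

```python
import json
import time
from contextlib import contextmanager
from datetime import datetime, timezone


class EventLog:
    """Collects events that all share the same structure (hypothetical schema)."""

    def __init__(self):
        self.events = []

    def emit(self, pipeline: str, step: str, status: str, **metadata) -> dict:
        event = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "pipeline": pipeline,
            "step": step,
            "status": status,
            "metadata": metadata,  # free-form, can grow step by step
        }
        self.events.append(event)
        print(json.dumps(event, sort_keys=True))  # one JSON line per event
        return event


@contextmanager
def monitored_step(log: EventLog, pipeline: str, step: str, **metadata):
    """Wrap any step so that it always logs started, then succeeded or failed."""
    log.emit(pipeline, step, "started", **metadata)
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        log.emit(pipeline, step, "failed", error=repr(exc), **metadata)
        raise  # monitoring must not hide the failure
    else:
        duration = round(time.perf_counter() - start, 3)
        log.emit(pipeline, step, "succeeded", duration_s=duration, **metadata)
```

A step is then written as `with monitored_step(log, "sales", "ingest", source="orders.csv"): ...`, and debugging, governance or monitoring needs can be served by adding keys to the metadata rather than by changing each pipeline.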
4. Orchestration “on event”
If you have never heard “the file was missing and we have to relaunch the ETL jobs”, you are a lucky person, or you are not old enough and you skipped the ETL era (yes, ETL is dead). Because you are monitoring everything, you know when something happens and what you have to do about it (in fact, the platform does). Everything must start and finish with an event.
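Here is a toy sketch of the idea, with an in-memory event bus standing in for whatever messaging service a real platform would use. The class `EventBus`, the event names (`file_arrived`, `data_ingested`, `pipeline_finished`) and the handlers are hypothetical and only there to show the principle: nothing runs on a schedule, each step reacts to an event and publishes the next one.

```python
from collections import defaultdict, deque


class EventBus:
    """Minimal in-memory event bus (an assumption for this example)."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.history = []  # every event type seen, in order

    def on(self, event_type: str):
        """Decorator that subscribes a handler to an event type."""
        def register(handler):
            self._handlers[event_type].append(handler)
            return handler
        return register

    def publish(self, event_type: str, payload: dict) -> None:
        """Dispatch an event, then any follow-up events returned by handlers."""
        queue = deque([(event_type, payload)])
        while queue:
            current_type, data = queue.popleft()
            self.history.append(current_type)
            for handler in self._handlers[current_type]:
                queue.extend(handler(data) or [])


bus = EventBus()


@bus.on("file_arrived")
def ingest(payload: dict):
    # Starts only because the file is there: no "missing file" relaunch.
    return [("data_ingested", {**payload, "rows": 3})]


@bus.on("data_ingested")
def transform(payload: dict):
    # The workflow also finishes with an event that others can react to.
    return [("pipeline_finished", payload)]
```

Calling `bus.publish("file_arrived", {"path": "orders.csv"})` triggers the ingest step, which triggers the transform step, which ends with a `pipeline_finished` event.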
“Reduce heroism” by building a collective data as a platform
“Reflect” by being able to track what happened and why (monitor everything)
“Analytics is code” is even stronger in that context because it means that you have to code these features
“Orchestrate” from end to end based on events
“Make it reproducible” by using parametrization
“Disposable environments” by managing this inside your platform (can’t do without Cloud IMHO)
There is also a link with this book, the “DevOps Handbook”: the flow (the loop between the 3 parts), the feedback (with monitor everything) and the continual learning and experimentation (the platform that you are going to improve with new features or new technologies).
The journey starts with building your data engineering team with the right mindset, and it will take time. But in the end, this is going to be a great asset for your organization.
… or a totally different approach to a very old problem.
Within my own circle, I often get this question: “why do you find Data so interesting after all these years?” and my answer is always “it is not about the subject, it is about the people”. It is the chance of our time (blogs, YouTube) to have access to so many inspiring people without ever meeting them physically. Thanks to them, I have had so many “aha moments”, when everything becomes crystal clear and you tell yourself “I got it”.
Barr Moses is definitely one of them. I guess that when you start your CV with an experience as “Commander in Israel Air Force” with “Commanded and trained soldier platoons in elite Air Force unit” and “Analyzed intelligence data to support time and detail sensitive decisions for operative units”, it makes you someone different! She is easy to recognize: it looks like she never takes off her cap.
She introduced “Data Downtime” as a major problem for data-driven organizations. At first, to me, it was just another name for a data quality problem, and I did not get the point of calling it that way. Then she started talking about “Data Observability”: how to use metrics, logs and traces to tell the what and the why of your data pipelines. Put another way, with the rise of dataops and a totally different way of building data pipelines (cloud, code, platform as a service), you cannot manage data quality the way you did before. Moreover, automation is the only way to cope with the volume of data. This is a key moment, where you have to choose your “data quality architecture”.
After the summer 2020 holidays, she was unstoppable: data catalogs are dead, metadata is useless and data governance is broken. My favorite punchline from her is “Data is ubiquitous; data governance is not”. The move of data to the cloud did not just change the way we store and transform our data, it is going to change drastically the way we manage it.
It is hard to believe, because the whole market (software, consultants) is still talking about how to improve your data quality or how to build your data governance as if nothing had changed. It will take years for it to understand the dataops move and its impact.
She is like Morpheus in The Matrix, and you now have to choose between the blue pill and the red pill.
Jesse Anderson is widely regarded as an expert in the field and is known for his novel teaching practices.
“To do big data right, you need three different teams. Each team does something very specific in creating value from data.”
“If the data engineering team is missing, there isn’t anyone there to give an engineer’s viewpoint about data product creation.”
“If the data science team is missing, the organization’s ability to create analytics is severely limited.”
“If the operations team is missing, no one can depend on the data products that are created. One day the infrastructure and data products are working, and the next day they’re not.”
On how these teams work together: “The solution is to create high-bandwidth connections between the teams. A high-bandwidth connection means that the members of the data teams know each other. They build camaraderie and respect by seeing that each other is competent in their respective fields.”
Once these 3 teams are working together, everything is about how to interact with the business (chapter 8), how to manage projects (chapter 9), how to start the teams (chapter 10) and how to measure success (chapter 11). There will also be some serious change management with “the old guard” (chapter 12), and the question of how you are going to fix, day after day, all the problems you encounter.
You then have some case studies and interviews from different sectors to illustrate all the principles.
Conclusion: if you want to start a data team and you need some definitions and illustrations, this is a book to read. These 3 roles are a must, and I would have added two more major roles: one for data governance/management, and one for the interface between the business and the data teams, called “business translators” or “data strategists”.