The devops part of Dataops

Where to start?

“Dataops is not just devops for data”, but there is a specific flavor when we are talking about the devops part.

1. “Data as a platform”

Data is not a project or a list of projects. In the end, what you are building is a platform, or at least a framework, because more and more data will keep coming. You can of course manage it as a single project, but you will start to build silos even if it is the same team. “Data as a platform” means that you are building a destination where it is easy to ingest, transform, monitor and expose data at scale and at speed. From day one, you have to think about that and find the right balance between project delivery and building the platform. You are going to develop and code a list of features as services delivered by the platform.
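The idea of “features as services” can be sketched as a common contract that every platform capability implements, so a new data project reuses existing services instead of building its own silo. This is only an illustrative sketch; all class and method names are assumptions, not from the article.

```python
# Hedged sketch: every platform feature implements the same contract,
# so projects compose existing services rather than rewriting them.
from abc import ABC, abstractmethod

class PlatformService(ABC):
    """Common contract for every feature the platform delivers."""
    @abstractmethod
    def run(self, dataset: str) -> str: ...

class IngestService(PlatformService):
    def run(self, dataset: str) -> str:
        return f"ingested {dataset}"

class TransformService(PlatformService):
    def run(self, dataset: str) -> str:
        return f"transformed {dataset}"

# A new data project is just a sequence of calls to existing services.
for service in (IngestService(), TransformService()):
    print(service.run("sales"))
```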

2. “Parameterization” is king

“Analytics is code”, but you also have to “make it reproducible”. There is nothing more similar to a data pipeline than another data pipeline, so there is no reason to reinvent the wheel every time. The parameters will probably include SQL code (SQL is back) that can vary from very simple to very complex, but you are sure that every data pipeline will be built the same way, and you will have no surprises at deployment.
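One way to picture this: the pipeline code never changes, only a parameter set does, and the SQL itself is one of the parameters. A minimal sketch, assuming a simple dict-based configuration (the table names and fields are hypothetical):

```python
# Hedged sketch: one generic pipeline, driven entirely by parameters.
# In practice the parameters would live in version-controlled configuration.
from string import Template

pipeline_params = {
    "name": "sales_daily",                       # hypothetical pipeline name
    "source_table": "raw.sales",                 # hypothetical source
    "target_table": "curated.sales_daily",       # hypothetical target
    "sql": Template(
        "SELECT order_date, SUM(amount) AS total "
        "FROM $source_table GROUP BY order_date"
    ),
}

def render_sql(params: dict) -> str:
    """Render the pipeline's SQL transformation from its parameters."""
    return params["sql"].substitute(source_table=params["source_table"])

print(render_sql(pipeline_params))
```

Because every pipeline is an instance of the same template, deploying a new one is a configuration change, not a code change.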

3. Monitor everything

Another benefit of having a platform and working with parameterization is that you can decide how to log every event in a consistent manner throughout the data workflow. It is going to be totally under your control, and you can improve it step by step with more metadata or events. You can really customize it according to your needs (debugging, data governance, data monitoring).

4. Orchestration “on event”

If you have never heard “the file was missing and we had to relaunch the ETL jobs”, you are a lucky person, or you are not old enough and you have skipped the ETL era (yes, ETL is dead). Because you are monitoring everything, you know when something is happening and what you have to do (in fact, the platform does). Everything must start and finish with an event.
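“Start and finish with an event” can be sketched as a handler that reacts to incoming events and always emits one of its own, so a missing file produces a reaction instead of a silent failure. All event types and names below are illustrative assumptions:

```python
# Hedged sketch: every pipeline run is triggered by an event and ends by
# emitting an event, closing the loop between orchestration and monitoring.
def handle_event(event: dict) -> dict:
    if event["type"] == "file_arrived":
        # ... run the parameterized pipeline for this dataset here ...
        return {"type": "pipeline_finished", "pipeline": event["pipeline"]}
    if event["type"] == "file_missing":
        # The platform reacts by itself; nobody relaunches ETL jobs by hand.
        return {"type": "retry_scheduled", "pipeline": event["pipeline"]}
    return {"type": "ignored"}

print(handle_event({"type": "file_arrived", "pipeline": "sales_daily"}))
```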

And then, you can close the loop.

my very personal data as a platform definition

There is of course a link with the DataOps Manifesto by DataKitchen:

  • “Reduce heroism” by building a collective data as a platform
  • “Reflect” by being able to track what happened and why (monitor everything)
  • “Analytics is code” is even stronger in that context because it means that you have to code these features
  • “Orchestrate” from end to end based on events
  • “Make it reproducible” by using parameterization
  • “Disposable environments” by managing this inside your platform (can’t do without Cloud IMHO)

And there is also a link with the book “The DevOps Handbook”: the flow (the loop between the three parts), the feedback (monitor everything) and the continual learning and experimentation (the platform, which you will keep improving with new features or new technologies).

The journey starts with building your data engineering team with the right mindset, and it will take time. But in the end, it will be a great asset for your organization.

3 thoughts on “The devops part of Dataops”

  1. Great article!

    DataOps builds tools and services; like DevOps, they build strong and solid CI/CD pipelines and design test automation for their own frameworks.

    Orchestration on event:
    pipelines are code; when the data is ready, never wait. DAG design is the key.

    Parameterization is king: configuration-as-code is everywhere.
    Generate your SparkSQL transformations from your parameterization.
    Don’t build static DAGs; automate dynamic DAG creation from your parameterization.
    Choose strong naming conventions; they are a huge help when configuring as code.

    Monitor everything:
    Up-to-date data lineage and process metadata are worth gold; define and measure SLAs, and design an alerting framework triggered on anomaly detection.

    Don’t forget data quality checks: ensure data doesn’t get corrupted before entering your pipelines, and use data validation between steps.
    Warn early; never wait for the business to complain about completeness, consistency, timeliness or integrity.

    Automate everything; it’s a mindset: automate data profiling and auto-generate data expectations over time.
    For DataOps, tooling should be a habit or an unconscious reflex.

