Registering pipeline metadata in a database

A quick question: I would like to register pipelines by making an entry in a database with keys like pipeline_id, name, and description whenever a pipeline is executed for the first time in production. This simply requires creating a DB connection and running an insert query. I would like to better understand the following:

  1. Where can I store pipeline-specific metadata in a Kedro project? Let's say we have three pipelines defined in a project: data_extraction, data_processing, and model_training.
  2. How can we read all this metadata, create a DB connection, and finally execute the insert operation?
  3. Finally, what is the best place to implement such tasks in a Kedro project? Is it hooks? For example, we could run this logic after the context is created.

Hi Vishal, yes, a hook could work. There is an after_context_created hook in which you could implement the logic.
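A minimal sketch of what that hook could look like, assuming a local SQLite file as the registry (the file name, table, and column names below are just placeholders, not anything Kedro provides):

```python
# hooks.py -- a minimal sketch, not an official Kedro feature
import sqlite3

from kedro.framework.hooks import hook_impl


class PipelineRegistrationHook:
    @hook_impl
    def after_context_created(self, context):
        # Hypothetical local SQLite registry; swap in your own DB connection.
        # Assumes a pipeline_details(pipeline_id, pipeline_name, description) table exists.
        conn = sqlite3.connect("pipeline_registry.db")
        # INSERT OR IGNORE so the row is only written the first time the pipeline is seen.
        conn.execute(
            "INSERT OR IGNORE INTO pipeline_details "
            "(pipeline_id, pipeline_name, description) VALUES (?, ?, ?)",
            ("data_processing", "data_processing", "Cleans and joins raw extracts"),
        )
        conn.commit()
        conn.close()
```

The hook would then be registered in your project's settings.py via HOOKS = (PipelineRegistrationHook(),).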

The question is, should you? It seems like you'd like tracking across pipeline runs. Have you considered the MLflow plugin? It could provide a good starting point.

https://github.com/Galileo-Galilei/kedro-mlflow

Yeah, do you think tracking pipelines and their respective runs should go into MLflow?

Until now I was thinking of creating some tables in the DB, like:

pipeline_details

- pipeline_id (PK)
- pipeline_name
- description
- created_date
- created_by

run_details

- run_id (PK)
- triggered_at
- time_taken
- triggered_by
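Spelled out as DDL, roughly (a sketch only; the column types and the pipeline_id link between the two tables are my assumptions, they aren't stated above):

```python
import sqlite3

conn = sqlite3.connect("pipeline_registry.db")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS pipeline_details (
        pipeline_id   TEXT PRIMARY KEY,
        pipeline_name TEXT,
        description   TEXT,
        created_date  TEXT DEFAULT CURRENT_TIMESTAMP,
        created_by    TEXT
    );
    CREATE TABLE IF NOT EXISTS run_details (
        run_id       TEXT PRIMARY KEY,
        pipeline_id  TEXT REFERENCES pipeline_details(pipeline_id),  -- assumed link
        triggered_at TEXT,
        time_taken   REAL,
        triggered_by TEXT
    );
    """
)
conn.commit()
conn.close()
```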

The idea is to push logs for each run into the ELK ecosystem as well as track the metrics discussed above.

Context: the reason I'm bringing up MLflow is that we're currently integrating it, and our experience with it so far is positive. I'm afraid your DB approach will hit limits quickly; e.g., imagine you want to start logging parameters as part of the run, or model artifacts, or charts with model convergence, etc. All of this is already provided by MLflow (there may be other alternatives, but we've gone for this one).

I think logging whether the pipeline failed or succeeded is also important, and then the question becomes: how would you track that?

As the pipeline grows, you might want to execute nodes in parallel, and then you can't determine pipeline failure/success from a single kedro run invocation.

Hey, thanks for responding and sharing your experience.

Currently, we are building data pipelines (ETL) which extract data from sources like SQL databases and perform some transformations. The transformed data is stored in flat files like CSV and versioned while being stored on an S3-like service. This versioned data is then loaded into a warehouse as part of the load component.


The reason we want to store pipeline- and run-related information is that this ETL pipeline will be scheduled to run, say, every day or every week. So we would like to know exactly when the pipeline was last triggered, so that the new run knows beforehand what data needs to be pulled since the last run date.
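For that "what to pull since the last run" part, a minimal sketch assuming the run_details table sketched above sits in SQLite (the pipeline_id column is an assumed addition):

```python
import sqlite3

# Sketch: find out when this pipeline was last triggered so the new run can
# pull only the data added since then. Table/column names follow the schema
# discussed above.
conn = sqlite3.connect("pipeline_registry.db")
row = conn.execute(
    "SELECT MAX(triggered_at) FROM run_details WHERE pipeline_id = ?",
    ("data_extraction",),
).fetchone()
conn.close()

last_run_date = row[0]  # None on the very first run -> do a full extract
```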

Can you also share a way of registering the pipeline whenever it is executed for the first time? That is, where exactly can we maintain some basic information about the pipeline, like an identifier, description, etc., in our Kedro project, so that hooks can read this information, check whether the pipeline is already registered with MLflow, register it if not, and then have the pipeline start logging metrics across its various runs?

If you need that, what about considering Airflow? Pipeline scheduling and tracking remains a difficult topic.

If you still want to implement it yourself, you will have to add your own way of identifying pipeline runs. Kedro does not maintain state, so it does not know what it means to "execute for the first time".

I am thinking of maintaining a config file which stores pipeline-related metadata, which the hook implementation can read. The hook can query the MLflow API to find out whether an experiment exists corresponding to a pipeline. If it doesn't exist, it can simply create the experiment.
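A minimal sketch of that check, assuming a hypothetical conf/base/pipeline_metadata.yml and the MlflowClient API:

```python
# hooks.py -- sketch only; the metadata file name and layout are assumptions
import yaml
from kedro.framework.hooks import hook_impl
from mlflow.tracking import MlflowClient


class MlflowRegistrationHook:
    @hook_impl
    def after_context_created(self, context):
        # Hypothetical file: {pipelines: [{name: ..., description: ...}, ...]}
        with open("conf/base/pipeline_metadata.yml") as f:
            metadata = yaml.safe_load(f)

        client = MlflowClient()
        for pipeline in metadata["pipelines"]:
            # Create the experiment only if it does not exist yet.
            if client.get_experiment_by_name(pipeline["name"]) is None:
                client.create_experiment(
                    pipeline["name"],
                    tags={"description": pipeline.get("description", "")},
                )
```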

But we would like to give a unique ID to each run so that we can identify the logs generated by a specific run in the ELK stack, as I mentioned earlier. Is it possible for us to create a run with a custom name?

Haha, you're touching on some interesting territory, and there are some issues here:

  1. MLflow does not allow creating runs with a custom name. Runs have a unique run ID that you should use, but MLflow is the one generating it.
  2. I've therefore implemented functionality to create a run with a specific name, which essentially searches the runs to see if a run with that name exists. If it does not exist, it creates it; otherwise it returns the ID (see the sketch below).
  3. We use Argo Workflows to orchestrate our pipelines. Argo generates a unique run for each pipeline, so we use that run to identify the Kedro run in MLflow. We use Argo Workflows because we run everything on k8s, though Airflow might be a better option if you're interested in tracking runtimes etc.
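Roughly like this (a sketch of the find-or-create idea, not the exact code from that issue):

```python
from mlflow.tracking import MlflowClient


def get_or_create_run(experiment_id: str, run_name: str) -> str:
    """Return the run_id of the run with this name, creating it if needed (sketch)."""
    client = MlflowClient()
    # MLflow stores the display name in the mlflow.runName tag.
    for run in client.search_runs(experiment_ids=[experiment_id]):
        if run.data.tags.get("mlflow.runName") == run_name:
            return run.info.run_id
    new_run = client.create_run(experiment_id, tags={"mlflow.runName": run_name})
    return new_run.info.run_id
```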

In our case, we would like to run our Kedro pipelines on Kubeflow.

Then Kubeflow could generate the identifier, and you could set an env variable and read that with Kedro, I think (I've not used Kubeflow before).

I think I will need more clarity here. We are the ones uploading the Kedro pipeline to Kubeflow using the kedro-kubeflow plugin. They have exposed a couple of commands, mentioned below:

1. kedro-kubeflow init -> creates a config file, as described here: https://kedro-kubeflow.readthedocs.io/en/0.7.4/source/02_installation/02_configuration.html
2. kedro-kubeflow upload_pipeline -> uses the generated config, converts the Kedro DAG into a Kubeflow-compatible DAG, and publishes the pipeline on Kubeflow.


If you look carefully, there are run-related configs present in the generated config file. Any pointers from here, since you already know our use case?

I mean, the use case is simply to track run-level metrics and dump all the logs to the ELK stack; this should not be a problem if we can produce a unique run_id or name. But I see some challenges here:

  1. Once a pipeline is scheduled to run using a command like kedro-kubeflow run, I am not sure how we can get unique run_ids.

I'm not familiar with the kubeflow plugin unfortunately

Based on a quick search:

each pipeline run in Kubeflow gets a KFP_RUN_ID, which is set as an env variable

That should help you, no? Then you can plug into the ID Kubeflow uses.

But how do I fetch this run ID in Kedro?

You mean it must be available in the env variables?

A simple os.environ.get("KFP_RUN_ID") should do the job?

you can use omegaconf

${oc.env:KFP_RUN_ID}

which you can use in the catalog

Will get back to you on this 🫡

Gave you the wrong link.

this way you can literally turn it into a pipeline param or global and have it picked up

(or add it directly to the kubeflow config)

In that case, MLflow still looks like a viable option to track metrics 😊

yep, though I've implemented https://github.com/Galileo-Galilei/kedro-mlflow/issues/579 to ensure I have a stable run name for my runs

because the plugin always generates a new name, so if you re-run a workflow it would be listed as a new run in MLflow

The hook above ensures a stable name is used, so reruns keep working (and write to the same MLflow run).

Need to go through this as well

Any thoughts on such a use case?

Opening this discussion again.

Just to reiterate: you mentioned that a unique ID is generated for each Kubeflow run and that it is luckily stored in an env variable. We can use a resolver like oc.env to access this run_id in the Kedro project, probably set it in globals.yml, and then reuse it anywhere needed, like in parameters.yml or catalog.yml, or we can directly use oc.env wherever it is permitted across the files in conf/.
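For example, in the catalog (a sketch; the dataset name, type, and bucket path are placeholders, the exact dataset class name depends on your kedro-datasets version, and whether oc.env is allowed outside the catalog/credentials depends on your Kedro version and config loader settings):

```yaml
# conf/base/catalog.yml -- pick up the Kubeflow run ID from the environment
transformed_data:
  type: pandas.CSVDataset   # pandas.CSVDataSet in older kedro-datasets versions
  filepath: s3://my-bucket/runs/${oc.env:KFP_RUN_ID,local-dev}/transformed_data.csv
```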
