A quick question: I would like to register pipelines by making an entry in a database, with keys like pipeline_id, name, and description,
whenever a pipeline is executed for the first time in production. This simply requires creating a DB connection and running an insert query. I would like to better understand the following -
Hi Vishal, yes, a hook could work. There is an after_context_created
hook in which you could implement the logic.
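A minimal sketch of such a hook, assuming a SQLite registry and made-up pipeline metadata; the @hook_impl decorator from kedro.framework.hooks is omitted so the snippet runs without Kedro installed:

```python
import sqlite3

# In a real Kedro project you would decorate after_context_created with
# @hook_impl and register the class in settings.py; omitted here so this
# sketch is runnable standalone.
class PipelineRegistrationHook:
    """Registers the pipeline in a DB the first time it runs.

    Table and column names are assumptions for illustration only.
    """

    def __init__(self, db_path: str = "registry.db"):
        self.db_path = db_path

    def after_context_created(self, context=None):
        con = sqlite3.connect(self.db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS pipeline_details ("
            "pipeline_id TEXT PRIMARY KEY, pipeline_name TEXT, description TEXT)"
        )
        # INSERT OR IGNORE makes the registration idempotent: only the
        # first execution actually inserts a row, later runs are no-ops.
        con.execute(
            "INSERT OR IGNORE INTO pipeline_details VALUES (?, ?, ?)",
            ("etl_daily", "Daily ETL", "Extracts from SQL DBs, loads to warehouse"),
        )
        con.commit()
        con.close()
```

The INSERT OR IGNORE against the primary key is what gives you "register only on first execution" without Kedro having to track any state itself.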
The question is, should you? It sounds like you'd like tracking across pipeline runs? Have you considered the MLFlow plugin? It could provide a good starting point.
https://github.com/Galileo-Galilei/kedro-mlflow
Yeah, do you think tracking pipelines and their respective runs should go into MLFlow?
Till now I was thinking of creating some tables in the DB, like:

pipeline_details
- pipeline_id (PK)
- pipeline_name
- description
- created_date
- created_by

run_details
- run_id (PK)
- triggered_at
- time_taken
- triggered_by
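For reference, those two tables could be sketched with stdlib sqlite3 as below; note the pipeline_id column on run_details is my addition, to link runs back to their pipeline (the original schema above does not include it):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the real DB
con.executescript("""
CREATE TABLE pipeline_details (
    pipeline_id   TEXT PRIMARY KEY,
    pipeline_name TEXT,
    description   TEXT,
    created_date  TEXT,
    created_by    TEXT
);
CREATE TABLE run_details (
    run_id       TEXT PRIMARY KEY,
    -- assumed foreign key, not in the original sketch:
    pipeline_id  TEXT REFERENCES pipeline_details(pipeline_id),
    triggered_at TEXT,
    time_taken   REAL,
    triggered_by TEXT
);
""")
```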
The idea is to push logs for each run into the ELK ecosystem as well as track the metrics discussed above.
Context: the reason I'm bringing up MLFlow is that we're currently integrating it, and our experience with it so far is positive. I'm afraid your DB approach will hit limits quickly, e.g., imagine you want to start logging parameters as part of the run, or model artifacts, or charts with model convergence, etc. All of this is already provided by MLFlow (there may be other alternatives, but we've gone for this one).
I think logging whether the pipeline fails or succeeds is also important, and then the question becomes how you would track that.
As the pipeline grows, you might want to execute nodes in parallel, so then you can't determine pipeline failure/success from a single kedro run invocation.
Hey, thanks for responding and sharing your experience.
Currently, we are building data pipelines (ETL) which simply extract data from data sources like SQL DBs and perform some transformations. The transformed data is stored in flat files like CSV and versioned while being stored on an S3-like service. This versioned data is then loaded into some warehouse as part of the Load component.
The reason we want to store pipeline- and run-related information is that this ETL pipeline will be scheduled to run, say, every day or every week. So we would like to know exactly when the pipeline was last triggered, so that the new run knows beforehand what data needs to be pulled since the last run date.
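A toy illustration of that "pull only what's new since the last run" logic, using sqlite3 and invented table/column names:

```python
import sqlite3

def rows_since_last_run(con: sqlite3.Connection):
    """Fetch source rows newer than the most recently recorded run."""
    (last,) = con.execute("SELECT MAX(triggered_at) FROM run_details").fetchone()
    if last is None:  # no runs recorded yet: do a full extract
        return con.execute("SELECT id, updated_at FROM source_table").fetchall()
    return con.execute(
        "SELECT id, updated_at FROM source_table WHERE updated_at > ?", (last,)
    ).fetchall()

# Tiny in-memory demo with one previous run on 2024-06-02:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE run_details (run_id TEXT, triggered_at TEXT)")
con.execute("CREATE TABLE source_table (id INTEGER, updated_at TEXT)")
con.executemany(
    "INSERT INTO source_table VALUES (?, ?)",
    [(1, "2024-06-01"), (2, "2024-06-03")],
)
con.execute("INSERT INTO run_details VALUES ('r1', '2024-06-02')")
```

Only the row updated after the last recorded trigger time (id 2) is returned by `rows_since_last_run(con)`.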
Can you also share some way of registering the pipeline whenever it is executed for the first time? For example, where exactly in our Kedro project can we maintain some basic information about the pipeline (an identifier, a description, etc.), so that hooks can read this information, check whether such a pipeline is already registered with MLFlow, register it if not, and then start logging metrics for the various runs.
If you need that, what about considering Airflow? Pipeline scheduling and tracking remains a difficult topic.
If you still want to implement it yourself, you will have to add your own way of identifying pipeline runs; Kedro does not maintain state, so it does not know what it means to "execute a first time".
I am thinking of maintaining a config file which stores pipeline-related metadata, and the hook implementation can read this config. The hook can query the MLflow API to find out whether an experiment exists corresponding to a pipeline. If it doesn't exist, it can simply create the experiment.
But we would like to give a unique id to each run so that we can identify the logs generated by a specific run in the ELK stack, as I mentioned earlier. Is it possible for us to create a run with a custom name?
Then Kubeflow could generate the identifier, and you could set an env variable and read that with Kedro, I think (I've not used Kubeflow before).
I think I will need more clarity here. We are the ones uploading the Kedro pipeline to Kubeflow using the kedro-kubeflow plugin. It exposes a couple of commands:

1. kedro kubeflow init
-> This creates a config file, as described here: https://kedro-kubeflow.readthedocs.io/en/0.7.4/source/02_installation/02_configuration.html
2. kedro kubeflow upload-pipeline
-> This uses the generated config, converts the Kedro DAG into a Kubeflow-compatible DAG, and publishes the pipeline on Kubeflow.
If you look carefully, there are run-related configs present in the generated config file. Any pointers from here, since you already know our use case?
I mean, the use case is simply to track run-level metrics and dump all the logs to the ELK stack; this should not be a problem if we can produce a unique run_id or name. But I see a challenge here: with kedro-kubeflow run, I am not sure how we can get unique run_ids.

That way you can literally turn it into a pipeline param or global and have it picked up.
Yep, though I've implemented https://github.com/Galileo-Galilei/kedro-mlflow/issues/579 to ensure I have a stable run name for my runs,
because the plugin always generates a *new* name, so if you re-run a workflow it would be listed as a *new* run in MLFlow.
The hook above ensures a stable name is used, so reruns keep working (and write to the same MLFlow run).
Opening this discussion again.
Just to reiterate: you mentioned that there is a unique id being generated for each Kubeflow run, and luckily it is stored in env variables.
And we can use a resolver like oc.env
to access this run_id in the Kedro project, probably set it in globals.yml, and then reuse it anywhere needed, like in parameters.yml or catalog.yml; or we can directly use oc.env wherever it is permitted across the files in conf/.