Linking Training and Inference Pipelines

Hello everyone!

I'm trying to link training and inference pipelines, and I ran into an interesting problem: it looks like you are not allowed to use the "parameters" dictionary for the ML pipeline, as in the node below:

node(
    func=train_model,
    inputs=["df_train", "y_train", "df_val", "y_val", "parameters"],
    outputs="trained_recommender",
    name="train_model_node",
    tags=["training"],
),

Below is a link to the source code, which only accepts dataset names that start with "params:":
https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/mlflow/kedro_pipeline_model.py#L122
Do I understand correctly that I have to define all the parameters I intend to use manually?

It's surprising to see an error about Kedro's default parameters 🙂
KedroPipelineModelError:
    The datasets of the training pipeline must be persisted locally
    to be used by the inference pipeline. You must enforce them as
    non 'MemoryDataset' in the 'catalog.yml'.
    Dataset 'parameters' is not persisted currently.
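(For context: the `parameters` entry itself cannot be persisted this way, which is exactly the problem discussed below. For an ordinary dataset, the error message is asking for a `catalog.yml` entry along these lines — a hypothetical sketch, with an illustrative dataset name and filepath:)

```yaml
# catalog.yml — hypothetical entry: persist an intermediate dataset so the
# inference pipeline can reload it instead of a transient MemoryDataset
df_train:
  type: pandas.CSVDataset       # from kedro-datasets (spelled pandas.CSVDataSet in older releases)
  filepath: data/05_model_input/df_train.csv
```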


TBH, I purposely didn't support the `parameters` entry because I consider it a very dangerous practice, opposed to what kedro-mlflow tries to do (logging unused parameters is extremely confusing). But it is very weird to be incompatible with Kedro here; I should rather try to deprecate `parameters` in the core framework than add inconsistency in the plugin.

I can accept a PR for that, but I honestly strongly discourage it. If you use a lot of parameters in your nodes, you can just pass them as a dictionary:

# parameters.yml
model_config:
    Param1: value1
    Param2: value2
    Subdict1:
        Subparam: value3

And in your pipeline.py:

node(
    func=train_model,
    inputs=["df_train", "y_train", "df_val", "y_val", "params:model_config"],
    outputs="trained_recommender",
    name="train_model_node",
    tags=["training"],
),

This should work and is more readable and reproducible.
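As an illustrative sketch (this is not kedro-mlflow's code, and the body of `train_model` is a placeholder), the node function then receives the whole dict as one argument and reads individual entries from it:

```python
# Hypothetical sketch: the node function receives the whole `model_config`
# dict as a single argument (the training logic is a placeholder).
def train_model(df_train, y_train, df_val, y_val, model_config: dict):
    # Read individual entries from the single parameters dict
    param1 = model_config["Param1"]
    subparam = model_config["Subdict1"]["Subparam"]
    # ... fit and return a real model here; we return the resolved
    # config so the example stays self-contained and runnable
    return {"param1": param1, "subparam": subparam}

# Mimics what Kedro would inject for "params:model_config" from parameters.yml
config = {
    "Param1": "value1",
    "Param2": "value2",
    "Subdict1": {"Subparam": "value3"},
}
trained = train_model(None, None, None, None, config)
print(trained)  # {'param1': 'value1', 'subparam': 'value3'}
```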

To follow up on another discussion, this is one of the things I'd like to clarify and eventually break in 0.20 / 1.0 😅

Dear , thank you very much for the answer!
But registering some of the parameters is essential, right? We want to see in the MLflow registry exactly which params were used for training, and whether they are used for inference (I know MLflow is not very friendly about changing them, but regardless), without any changes.

As for the example, I was also thinking about passing them as a keyword dict. Thank you!

Yes, you need to register the exact params required for inference. (For training, it's just a matter of reproducibility: you want to remember the parameters you used, but it's not strictly speaking mandatory; kedro-mlflow will log any input anyway.)

And it will become even more important after https://github.com/Galileo-Galilei/kedro-mlflow/pull/612 is merged (I'd bet before Christmas): you will need the exact inference parameters to be registered in the MLflow signature in order to modify them at predict time.
