Oleg Litvinov
Joined November 25, 2024

[Question: Repo separation between ETL and ML apps]

3 comments

Hello everyone!

I'm trying to link training and inference pipelines and ran into an interesting problem: it looks like you are not allowed to use the "parameters" dictionary for the ML pipeline, as in the node below:

node(
    func=train_model,
    inputs=["df_train", "y_train", "df_val", "y_val", "parameters"],
    outputs="trained_recommender",
    name="train_model_node",
    tags=["training"],
),

Below is a link to the source code, which accepts only dataset_name.startswith("params:"):
https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow/mlflow/kedro_pipeline_model.py#L122
Do I understand correctly that I have to manually define all the parameters I am supposed to use?

It's surprising to see an error about Kedro's default "parameters" dictionary 🙂
KedroPipelineModelError:
    The datasets of the training pipeline must be persisted locally
    to be used by the inference pipeline. You must enforce them as
    non 'MemoryDataset' in the 'catalog.yml'.
    Dataset 'parameters' is not persisted currently.
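
For reference, the workaround I am trying, based on that check, is to pass a namespaced "params:" entry instead of the whole dictionary. A rough sketch, where "model_options" is a made-up name for a section of parameters.yml:

from kedro.pipeline import node

def train_model(df_train, y_train, df_val, y_val, model_options: dict):
    ...  # same training logic as before, reading options from the sub-dict

node(
    func=train_model,
    # "params:model_options" passes only that section of parameters.yml,
    # which the linked kedro-mlflow check (dataset_name.startswith("params:"))
    # accepts as a parameter rather than a dataset to persist
    inputs=["df_train", "y_train", "df_val", "y_val", "params:model_options"],
    outputs="trained_recommender",
    name="train_model_node",
    tags=["training"],
)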

4 comments

Hi all!

I have a couple of questions about best practices for using Kedro.

Frequently, ML models incorporate preprocessing logic right in the model classes, and there may be a fairly complex class inheritance structure to support abstractions: for example, trying models with a similar interface via a BaseRegressor and many subclasses such as LGBMRegressor and LinearRegressor. All these wrappers not only call sklearn's model.predict or lgbm's model.predict but also incorporate a fairly long list of data preparation steps.
So my first question is: how compatible is this paradigm of "advanced and abstract ML development" with Kedro, which (to the best of my understanding) is built mostly around pipelines? In the basic examples I see any number of preprocessing steps (load, filter, enrich, fillna, etc.) followed by a single train step. That fits the pipeline logic perfectly, but it probably does not work as well if you keep some methods in the model class and rely on internal state. Maybe you know some good practices or have any ideas? A sketch of what I mean follows below.
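
To make the question concrete, here is a minimal sketch (all class and function names are invented) of a wrapper that keeps preprocessing inside the model class while exposing only thin, stateless functions as Kedro nodes:

from abc import ABC, abstractmethod

import pandas as pd

class BaseRegressor(ABC):
    """Shared interface; data preparation lives inside the model class."""

    def _prepare(self, df: pd.DataFrame) -> pd.DataFrame:
        # long list of preparations shared by fit and predict
        return df.fillna(0)

    @abstractmethod
    def fit(self, df: pd.DataFrame, y: pd.Series) -> "BaseRegressor":
        ...

    @abstractmethod
    def predict(self, df: pd.DataFrame) -> pd.Series:
        ...

# Thin node functions: Kedro only sees inputs and outputs, while the fitted
# model object (with its internal state) travels between nodes as a catalog entry.
def fit_model(model: BaseRegressor, df: pd.DataFrame, y: pd.Series) -> BaseRegressor:
    return model.fit(df, y)

def predict_model(model: BaseRegressor, df: pd.DataFrame) -> pd.Series:
    return model.predict(df)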

The second question is similar to the first one but mostly concerns the inference part. Please correct me if I'm wrong, but I mostly see Kedro as a framework for preprocessing and training routines. What is the recommended way to reuse some of my logic (already defined as data_processing nodes) for model inference? A sketch of what I have in mind follows below.
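
What I have in mind is something like Kedro's modular pipeline() wrapper, which (as far as I understand) can remap inputs and outputs so the same nodes serve both training and inference. A rough sketch, with made-up dataset names:

from kedro.pipeline import node, pipeline

def fillna(df):
    return df.fillna(0)

# Preprocessing defined once as a modular pipeline...
preprocessing = pipeline([node(fillna, inputs="raw_data", outputs="clean_data")])

# ...then instantiated twice with remapped dataset names.
training_prep = pipeline(
    preprocessing,
    inputs={"raw_data": "train_raw"},
    outputs={"clean_data": "train_clean"},
    namespace="training",
)
inference_prep = pipeline(
    preprocessing,
    inputs={"raw_data": "inference_raw"},
    outputs={"clean_data": "inference_clean"},
    namespace="inference",
)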

Thank you very much!

15 comments