Join the Kedro community

Updated 4 months ago

Best Practices for Kedro Usage

At a glance

The post discusses the compatibility of Kedro, a framework for building data pipelines, with the common practice of incorporating preprocessing logic directly into ML model classes. The community member has two main questions:

1. How can the "advanced and abstract ML development" paradigm, where models have complex inheritance structures and preprocessing logic, be reconciled with Kedro's pipeline-centric approach?

2. What is the recommended way to reuse preprocessing logic (defined as Kedro nodes) during model inference?

The comments suggest using Kedro's namespaces and modular pipelines as a good practice for complex projects. The community members also discuss the kedro-mlflow library and its tutorial, which addresses these issues by providing a "scikit-learn-like pipeline" for Kedro. The discussion covers the use of tags to separate "fit" and "transform" steps, and the fact that the kedro-mlflow tutorial example is not up-to-date but still considered a good approach. Additionally, the community members mention the kedro-boot library as an option for deep control over pipeline serving.

There is no explicitly marked answer, but the discussion provides several suggestions and resources for addressing the original questions.


Hi all!

I have a couple of questions regarding the best practices of Kedro usage.

Frequently, ML models incorporate some preprocessing logic right in the model classes. And there may be a fairly complex class inheritance structure to provide abstractions, for example to try models with a similar interface: a BaseRegressor with many descendants like LGBMRegressor and LinearRegressor. And all these wrappers don't just call sklearn.model.predict or lgbm.model.predict but also carry out quite a long list of data preparations.
So my first question is: how is this paradigm of "advanced and abstract ML development" compatible with Kedro, which (to the best of my understanding) is mostly about pipelines? In the basic examples I see that there may be any number of preprocessing steps (load, filter, enrich, fillna, etc.) followed by a single train step. This fits the pipeline logic perfectly. But it probably doesn't work as well if you keep some methods in the model class and also rely on internal state and so on. Maybe you know some good practices or have any ideas?
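To make the contrast concrete, here is a plain-Python sketch (hypothetical function names, not actual Kedro API) of the pipeline-centric alternative being asked about: each preprocessing step becomes a pure function instead of a model method, and a runner composes them in order. In a real Kedro project each function would be registered as a node.

```python
def fill_missing(rows):
    """Hypothetical preprocessing step: replace None in "a" with the mean
    of the observed values."""
    observed = [r["a"] for r in rows if r["a"] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, "a": r["a"] if r["a"] is not None else mean} for r in rows]


def add_ratio(rows):
    """Hypothetical feature-engineering step: derive a new column."""
    return [{**r, "ratio": r["a"] / r["b"]} for r in rows]


def run_pipeline(data, steps):
    """Minimal stand-in for a pipeline runner: apply steps in order."""
    for step in steps:
        data = step(data)
    return data


raw = [{"a": 1.0, "b": 2.0}, {"a": None, "b": 2.0}, {"a": 3.0, "b": 4.0}]
prepared = run_pipeline(raw, [fill_missing, add_ratio])
```

Because each step is a free function rather than a method on a model class, the same steps can be recombined into a different pipeline (for example, an inference pipeline) without touching the model.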

The second question is similar to the first one but mostly covers the inference part. Please correct me if I'm wrong, but I mostly see Kedro as a framework for preprocessing and ML training routines. What is the recommended way to reuse some of my logic (already defined as data_processing nodes) for model inference?

Thank you very much!

15 comments

Hi Олег Литвинов, if you haven't seen this already: for complex projects, Kedro recommends using namespaces and modular pipelines as a good practice. Check the modular pipelines and namespaces docs for further information.

I will check this out, thank you! Would be happy to hear various options and experiences 🙂

And here is kedro-mlflow and its tutorial, which is specifically designed to address these issues: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial

Think about it as "a scikit-learn-like pipeline, but for any arbitrary Kedro pipeline"

It requires mlflow though

Thank you very much, colleagues! I appreciate your ideas! Please, let me know if there are any other options to consider

Dear , thank you for sharing the docs. I see how this helps to establish a good separation of preprocessing and modelling, as well as the training of two different models (via namespaces). However, I still don't have a good idea of how to reuse, for example, the preprocessing logic/nodes/parameters during model inference. Do you perhaps have some examples of this?

Dear , thank you for the tutorial. This makes a lot of sense and addresses my original question. Looks like the core idea here is to use tags, right? I see this example hasn't been updated for a while. Is there any particular reason for that? Is this approach still considered the best practice?

Hi, some answers:

  • A namespace is a way to tag all the nodes of a pipeline, so the two suggestions are closely related
  • Yes, the key idea is to use tags, because (in sklearn vocabulary) some steps are "fit" steps (e.g. they create something from the data to reuse at inference time; in mlflow vocabulary this is called an "artifact") and other steps are "transform" steps (e.g. they apply a fitted object to data). You never want to do "fit_transform", because you need to keep the steps separate: you "fit" only at training time, and you "transform" both at training time and at inference time
  • Unfortunately the example has not been updated because the starter changed between 0.18 and 0.19, and I would have to update all the examples and screenshots but never took the time; however, it works perfectly in 0.19. There were no breaking changes to pipelines and nodes between the two major versions.
  • I don't know if it's "best practice", but given the number of related issues in this channel and in the kedro-mlflow repo issues and discussions (search for the "pipeline_ml_factory" keyword to see them), and the numerous projects I've seen using it in production, I am quite confident this is considered a good approach (likely the best available at the time)
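The fit/transform split described above can be sketched in plain Python (hypothetical names, not the kedro-mlflow API): a "fit" step runs only at training time and produces an artifact, while a "transform" step consumes that artifact at both training and inference time.

```python
def fit_scaler(values):
    """Training-only ("fit") step: learn min/max from the training data.
    The returned dict plays the role of an mlflow "artifact"."""
    return {"min": min(values), "max": max(values)}


def transform_scale(values, scaler):
    """Shared ("transform") step: apply the fitted artifact to any data,
    at training time or at inference time."""
    span = scaler["max"] - scaler["min"]
    return [(v - scaler["min"]) / span for v in values]


# Training time: fit, then transform with the freshly created artifact.
train = [0.0, 5.0, 10.0]
scaler = fit_scaler(train)
train_scaled = transform_scale(train, scaler)

# Inference time: reuse the stored artifact, never refit on new data.
inference_scaled = transform_scale([2.5], scaler)
```

Notice there is no combined "fit_transform" function: keeping the two steps separate is exactly what lets the inference pipeline reuse the transform step with a previously fitted artifact.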

Thank you very much for a follow up! This sounds great.

In the meantime, I found a very similar issue mentioned here: https://github.com/kedro-org/kedro/issues/464. It looks like this issue/thread started somewhere around there. It's a very useful discussion that helps frame some understanding. According to it, model serving was outside Kedro's scope of interest, but that was 4 years ago, so it looks like it's now pretty well covered and addresses the main inference goals. Thank you again!

Yes, the original poster contributed directly to the kedro-mlflow code base back then

If you want deep control over pipeline serving, check out kedro-boot and its FastAPI mapping

After a couple of days of investigating, I realised how similar tags and namespaces are. Is there a preferred way of using one or the other?

Hi ,

Both tags and namespaces help you group nodes and structure a complex Kedro project.

Inclusive Grouping: By this I mean nodes can be part of more than one group. For example, a Kedro node can have multiple tags

  • Tagging is inclusive
  • Tagging cannot provide modularity
  • Not suitable for deployment because of non-exclusivity
  • Tagging is good when you want nodes belonging to more than one group

Exclusive Grouping: By this I mean a node is part of only one group. For example, assigning a namespace to a pipeline (a set of nodes)

  • Namespaces provide hierarchy. Hierarchy is exclusive
  • Namespaces provide modularity
  • Most suitable for visualization and deployment
  • Namespaces are good when you want nodes belonging to one group
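The inclusive-vs-exclusive distinction above can be sketched in plain Python (a hypothetical data structure, not the Kedro API): tags form a set per node, so one node can belong to several tag groups, while a namespace is a single label, so each node belongs to exactly one namespace group.

```python
# Hypothetical node records: "tags" is a set (inclusive grouping),
# "namespace" is a single string (exclusive grouping).
nodes = [
    {"name": "clean",   "tags": {"preprocessing", "inference"}, "namespace": "data_processing"},
    {"name": "train",   "tags": {"training"},                   "namespace": "data_science"},
    {"name": "predict", "tags": {"inference"},                  "namespace": "data_science"},
]


def by_tag(nodes, tag):
    """Select nodes carrying a given tag; a node may match many tags."""
    return [n["name"] for n in nodes if tag in n["tags"]]


def by_namespace(nodes, ns):
    """Select nodes in a namespace; each node matches exactly one."""
    return [n["name"] for n in nodes if n["namespace"] == ns]


# "clean" appears in two tag groups (inclusive) ...
inference_nodes = by_tag(nodes, "inference")
# ... but in exactly one namespace group (exclusive).
ds_nodes = by_namespace(nodes, "data_science")
```

This is why tags suit cross-cutting selections like "everything needed at inference time", while namespaces suit partitioning a project into non-overlapping, deployable units.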

Hope this helps. Thank you
