Hi all!
I have a couple of questions regarding best practices for Kedro usage.
Frequently, ML models incorporate some preprocessing logic right in the model classes. And there may be a fairly complex class inheritance structure that provides abstractions, for example a common interface like BaseRegressor with subclasses like LGBMRegressor and LinearRegressor. And all these wrappers don't just call sklearn.model.predict or lgbm.model.predict, but also incorporate quite a long list of data preparations.
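To make the pattern concrete, here is a minimal sketch of what I mean (class and method names are illustrative, not from any real library):

```python
from abc import ABC, abstractmethod

import pandas as pd


class BaseRegressor(ABC):
    """Shared interface: every subclass exposes the same predict() entry point."""

    def predict(self, df: pd.DataFrame) -> pd.Series:
        prepared = self._prepare(df)        # long chain of data preparations
        return self._predict_impl(prepared)

    def _prepare(self, df: pd.DataFrame) -> pd.DataFrame:
        # e.g. imputation, filtering, feature enrichment
        return df.fillna(0)

    @abstractmethod
    def _predict_impl(self, df: pd.DataFrame) -> pd.Series:
        ...


class LGBMRegressorWrapper(BaseRegressor):
    """Wraps a trained lightgbm booster behind the common interface."""

    def __init__(self, booster):
        self._booster = booster

    def _predict_impl(self, df: pd.DataFrame) -> pd.Series:
        return pd.Series(self._booster.predict(df), index=df.index)
```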
So my first question is: how is this paradigm of "advanced and abstract ML development" compatible with Kedro, which is mostly (to the best of my understanding) built around pipelines? In the basic examples I see that there may be any number of preprocessing steps (load, filter, enrich, fillna, etc.) followed by just a train step. That maps onto the pipeline logic perfectly. But it probably doesn't work well if you keep some methods in the model class and also rely on internal state, and so on. Maybe you know some good practices or have any ideas?
The second question is similar to the first one but covers mostly the inference part. Please correct me if I'm wrong, but I mostly see Kedro as a framework for preprocessing and training ML routines. What is the recommended approach if I want to reuse some of my logic (already defined as data_processing nodes) for model inference?
Thank you very much!
Hi Олег Литвинов, if you haven't seen this already: for complex projects, Kedro recommends using namespaces and modular pipelines as a good practice. Check the modular pipelines and namespaces docs for further information.
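For illustration, a minimal sketch of that idea (function bodies and dataset names are placeholders; see the docs for the full picture):

```python
from kedro.pipeline import node, pipeline


def preprocess(raw_data):
    return raw_data.fillna(0)


def train(features):
    ...  # fit and return a model


base = pipeline([
    node(preprocess, inputs="raw_data", outputs="features"),
    node(train, inputs="features", outputs="model"),
])

# The same node definitions reused under two namespaces, e.g. to train two
# models with a similar interface; datasets get prefixed automatically
# ("lgbm.features", "linear.features", ...).
lgbm_pipeline = pipeline(base, namespace="lgbm")
linear_pipeline = pipeline(base, namespace="linear")
```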
I will check this out, thank you! Would be happy to hear various options and experiences 🙂
And here is kedro-mlflow and its tutorial, which is specifically designed to address these issues: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial
Think of it as "a scikit-learn-like pipeline, but for any arbitrary Kedro pipeline".
Also see this section in the documentation: https://kedro-mlflow.readthedocs.io/en/stable/source/05_pipeline_serving/index.html
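In short, the factory binds a training pipeline and the matching inference pipeline (selected here by tag) into one servable mlflow object. A rough sketch (node functions and dataset names are placeholders; check the docs for the exact signature in your version):

```python
from kedro.pipeline import node, pipeline
from kedro_mlflow.pipeline import pipeline_ml_factory


def preprocess(raw_data):
    return raw_data.fillna(0)


def train(features):
    ...  # fit and return a model


def predict(model, features):
    ...  # return model.predict(features)


def create_pipeline(**kwargs):
    full = pipeline([
        # the preprocessing node is tagged for BOTH phases, so it is reused
        node(preprocess, inputs="raw_data", outputs="features",
             tags=["training", "inference"]),
        node(train, inputs="features", outputs="model", tags=["training"]),
        node(predict, inputs=["model", "features"], outputs="predictions",
             tags=["inference"]),
    ])
    # Everything the inference subset needs besides "raw_data" (here, the
    # trained model) is logged as an mlflow artifact during training, so the
    # whole preprocessing + prediction chain can be served as one object.
    return pipeline_ml_factory(
        training=full.only_nodes_with_tags("training"),
        inference=full.only_nodes_with_tags("inference"),
        input_name="raw_data",
    )
```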
Thank you very much, colleagues! I appreciate your ideas! Please let me know if there are any other options to consider.
Dear , thank you for sharing the docs. I see how this helps to establish a good separation of preprocessing and modelling, as well as training two different models (via namespaces). However, I still don't have a good idea of how to reuse, for example, the preprocessing logic/nodes/parameters during model inference. Do you perhaps have some examples of that?
Dear , thank you for the tutorial. This makes a lot of sense and addresses my original question. It looks like the core idea here is to use tags, right? I see this example hasn't been updated for a while. Is there any particular reason for that? Is this approach still considered best practice?
Hi, some answers:
Thank you very much for the follow-up! This sounds great.
In the meantime, I found a very similar issue mentioned here: https://github.com/kedro-org/kedro/issues/464. It looks like this issue/thread started somewhere around there. It's a very useful discussion that helps frame an understanding. From it, I see that model serving used to be outside Kedro's scope of interest. But that was 4 years ago, so it looks like it's now pretty well covered and addresses the main inference goals. Thank you again!
Yes, the original poster contributed directly to the kedro-mlflow code base back then.
If you want deep control over pipeline serving, check out kedro-boot and its FastAPI mapping.
After a couple of days of investigating, I realised how similar tags and namespaces are. Is there a preferred way of using one over the other?
Hi ,
Both tags and namespaces help you group nodes and structure a complex Kedro project.
Inclusive grouping: by this I mean nodes can be part of more than one group. For example, a Kedro node can have multiple tags, as in the sketch below.
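A quick sketch of that inclusive behaviour (node function and dataset names are made up):

```python
from kedro.pipeline import node, pipeline


def fill_missing(df):
    return df.fillna(0)


pipe = pipeline([
    # one node, two groups: it belongs to both the training and the
    # inference workflow at the same time
    node(fill_missing, inputs="raw", outputs="clean",
         tags=["training", "inference"]),
])

inference_subset = pipe.only_nodes_with_tags("inference")
```

Namespaces, by contrast, are exclusive: a node carries at most one (possibly nested) namespace, while it can carry any number of tags.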