Hi team, I am looking for a real-life Kedro repo and its package dependencies. I've heard from different power users that they follow the convention of writing pipelines as packages for testing, versioning, etc. Is that right?
I think it could be a really good way to attract more users, by showing how Kedro is used in "real life".
On a related note, is the convention to have an ML pipeline be composed of other pipelines (like in the docs: preprocessing, ds, etc.), or to have each step of the pipeline be a node?
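For reference, the composition the docs use looks roughly like this: each stage (data processing, data science) is its own pipeline of nodes, and the registry simply sums them into the default pipeline. The project and module names below are placeholders, not anything you are required to have:

```python
# src/my_project/pipeline_registry.py (hypothetical project name)
from kedro.pipeline import Pipeline

from my_project.pipelines import data_processing as dp
from my_project.pipelines import data_science as ds


def register_pipelines() -> dict[str, Pipeline]:
    data_processing = dp.create_pipeline()  # itself a pipeline of nodes
    data_science = ds.create_pipeline()
    return {
        "data_processing": data_processing,
        "data_science": data_science,
        # the default pipeline is just the sum of the stage pipelines
        "__default__": data_processing + data_science,
    }
```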
Maybe this is pushing the current dataset factories too far, but is it possible to parametrise a SQL catalog entry where the SQL is read from a file?
Like:

```yaml
mytable:
  type: pandas.SQLQueryDataset
  credentials: postgres_dwh
  filepath: sql/mytable.sql
```
Would kedro users be opposed to defining nodes with decorators? I have written a simple implementation, but as I've only recently started using kedro I wonder if I'm missing anything.
The syntax would be:

```python
from kedro.pipeline import Pipeline, node, pipeline


@node(inputs=1, outputs="first_sum")
def step1(number):
    return number + 1


@node(inputs="first_sum", outputs="second_sum")
def step2(number):
    return number + 1


@node(inputs="second_sum", outputs="final_result")
def step3(number):
    return number + 2


pipeline = pipeline(
    [
        step1,
        step2,
        step3,
    ]
)
```
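For what it's worth, a minimal sketch of such a decorator, assuming it just builds a kedro.pipeline.node from the decorated function and returns the Node so the name can be passed straight into pipeline([...]). This is not an existing Kedro API; the undecorated function stays reachable via the Node's .func attribute for unit tests:

```python
from kedro.pipeline import node as _make_node


def node(inputs=None, outputs=None, *, name=None, tags=None):
    """Decorator flavour of kedro.pipeline.node (illustrative sketch only)."""

    def decorator(func):
        # Build and return the Node itself, so `step1` etc. can be listed in pipeline([...]).
        return _make_node(
            func,
            inputs=inputs,
            outputs=outputs,
            name=name or func.__name__,
            tags=tags,
        )

    return decorator
```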
Can a kedro pipeline call other pipelines?
I have a situation where I need to run the same ML pipeline for similar kinds of data across groups. I want to keep my ML pipeline modular, have it act on just one cut of data (filter -> preprocess -> feature engineering -> training -> save model), and have it be parametrised. My question is: how do I run the pipeline for all my groups? Do I run a pipeline of pipelines? Do I run it in a python script?
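One way Kedro supports this is namespaced modular pipelines: instantiate the same pipeline once per group under a namespace and register the sum of the instances (depending on your Kedro version, parameters are namespaced too, or can be remapped via the parameters argument of pipeline()). A hedged sketch, where create_ml_pipeline(), the group names and the shared raw_data input are placeholders for your own project:

```python
from functools import reduce
from operator import add

from kedro.pipeline import Pipeline, pipeline

from my_project.pipelines.ml import create_ml_pipeline  # hypothetical factory module

GROUPS = ["group_a", "group_b", "group_c"]  # hypothetical group names


def create_pipeline(**kwargs) -> Pipeline:
    base = create_ml_pipeline()  # filter -> preprocess -> features -> train -> save
    per_group = [
        pipeline(
            base,
            namespace=group,    # dataset names get prefixed, e.g. group_a.model
            inputs="raw_data",  # keep the shared raw input un-namespaced
        )
        for group in GROUPS
    ]
    # Pipeline objects support "+", so this is literally a pipeline of pipelines.
    return reduce(add, per_group)
```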
Hi kedro community!! I have encountered an issue when working with kedro within a marimo notebook (I think the issue would be just the same in a jupyter notebook). Basically, I was initially working on my notebook by launching it from the command line at the kedro project root folder, with something like: marimo edit notebooks/nb.py
where my folder structure is something like:
```
├── README.md
├── conf
│   ├── base
│   ├── local
├── data ...
├── notebooks
│   ├── nb.py
├── pyproject.toml
├── requirements.txt
├── src ...
└── tests ...
```

Within nb.py I have a cell that runs:

```python
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(
    conf_source=Path(__file__).parent / settings.CONF_SOURCE,
    default_run_env="base",
)
catalog = DataCatalog.from_config(
    conf_loader["catalog"], credentials=conf_loader["credentials"]
)
```
and another cell that loads a dataset:

```python
import polars as pl

weekly_sales = pl.from_pandas(catalog.load("mytable"))
```
In the catalog, all the filepaths are relative and assume that wherever the catalog is being used from is the Kedro project root level. The conf_source argument to the OmegaConfigLoader instance is an absolute path, but the filepaths referenced by the catalog (e.g. conf/base/sql/somequery.sql or data/mydataset.csv) are not, so if I run my notebook from the root of my kedro project all is fine. But if I were instead to run cd notebooks; marimo edit nb.py, then catalog.load will attempt to load the query or dataset from notebooks/conf/base/sql/somequery.sql.
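One possible workaround (a sketch; the single parents[1] hop assumes nb.py sits directly under notebooks/, so adjust it to your layout) is to derive the project root from the notebook file itself and chdir to it before building the catalog, so relative filepaths keep resolving from the project root regardless of where marimo was launched:

```python
import os
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings
from kedro.io import DataCatalog

# notebooks/nb.py -> project root (layout assumption, see note above)
project_root = Path(__file__).resolve().parents[1]
os.chdir(project_root)  # relative filepaths like data/mydataset.csv now resolve from here

conf_loader = OmegaConfigLoader(
    conf_source=str(project_root / settings.CONF_SOURCE),
    default_run_env="base",
)
catalog = DataCatalog.from_config(
    conf_loader["catalog"], credentials=conf_loader["credentials"]
)
```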
How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? I find that, given the fairly opinionated syntax and project structure that is proposed, it's easy to DRY up bits of code that would be best left not DRY (e.g. preprocessing code). I wonder if anyone else has had similar thoughts.
Hey, how do people use kedro at scale? I've read a few tutorials on how to use kedro for single projects, but none on how to use it at scale. To me there would be an inherent benefit in creating modules with the pipeline step logic (like a shared nodes.py) and, for common tasks, using those rather than rewriting them in each pipeline-specific nodes.py (something like the sketch below). Does anybody do this?
I am keen to learn how people make the most out of kedro.
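Regarding the shared nodes.py idea, a small sketch of that layout; all names below are illustrative, and in practice drop_null_rows would live in a shared module (say src/my_project/common/nodes.py) that each pipeline's pipeline.py imports instead of re-implementing it in its own nodes.py:

```python
from kedro.pipeline import Pipeline, node, pipeline


def drop_null_rows(df):
    """Generic cleaning step reused across pipelines (would live in the shared module)."""
    return df.dropna()


def create_pipeline(**kwargs) -> Pipeline:
    # Pipeline-specific wiring, shared node implementation.
    return pipeline(
        [
            node(drop_null_rows, inputs="raw_sales", outputs="clean_sales"),
        ]
    )
```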