Hi team, I am looking for a real-life Kedro repo and its package dependencies. I've heard from different power users that they follow the convention of writing pipelines as packages for testing, versioning, etc. Is that right?
I think it could be a really good way to attract more users, by showing how Kedro is used in "real life".
On a related note, is the convention to have an ML pipeline be composed of other pipelines (like in the docs: preprocessing, ds, etc.), or to have each step of the pipeline be a node?
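For reference, the composition the docs use looks roughly like this: each stage (data processing, data science) is its own pipeline of nodes, and the registry simply sums them into the default pipeline. The project and module names below are placeholders, not anything you are required to have:

```python
# src/my_project/pipeline_registry.py (hypothetical project name)
from kedro.pipeline import Pipeline

from my_project.pipelines import data_processing as dp
from my_project.pipelines import data_science as ds


def register_pipelines() -> dict[str, Pipeline]:
    data_processing = dp.create_pipeline()  # itself a pipeline of nodes
    data_science = ds.create_pipeline()
    return {
        "data_processing": data_processing,
        "data_science": data_science,
        # the default pipeline is just the sum of the stage pipelines
        "__default__": data_processing + data_science,
    }
```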
Maybe this is pushing the current dataset factories too far, but is it possible to parametrise a SQL catalog entry where the SQL is read from a file?
Like:

```yaml
mytable:
  type: pandas.SQLQueryDataset
  credentials: postgres_dwh
  filepath: sql/mytable.sql
```
Would kedro users be opposed to defining nodes with decorators? I have written a simple implementation, but as I've only recently started using kedro I wonder if I'm missing anything.
The syntax would be:

```python
from kedro.pipeline import Pipeline, node, pipeline


@node(inputs=1, outputs="first_sum")
def step1(number):
    return number + 1


@node(inputs="first_sum", outputs="second_sum")
def step2(number):
    return number + 1


@node(inputs="second_sum", outputs="final_result")
def step3(number):
    return number + 2


pipeline = pipeline(
    [
        step1,
        step2,
        step3,
    ]
)
```
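For what it's worth, a minimal sketch of such a decorator, assuming it just builds a kedro.pipeline.node from the decorated function and returns the Node so the name can be passed straight into pipeline([...]). This is not an existing Kedro API; the undecorated function stays reachable via the Node's .func attribute for unit tests:

```python
from kedro.pipeline import node as _make_node


def node(inputs=None, outputs=None, *, name=None, tags=None):
    """Decorator flavour of kedro.pipeline.node (illustrative sketch only)."""

    def decorator(func):
        # Build and return the Node itself, so `step1` etc. can be listed in pipeline([...]).
        return _make_node(
            func,
            inputs=inputs,
            outputs=outputs,
            name=name or func.__name__,
            tags=tags,
        )

    return decorator
```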
Can a kedro pipeline call other pipelines?
I have a situation where I need to run the same ML pipeline for similar kinds of data across groups. I want to keep my ML pipeline modular, have it act on just one cut of data (filter -> preprocess -> feature engineering -> training -> save model), and have it be parametrised. My question is: how do I run the pipeline for all my groups? Do I run a pipeline of pipelines? Do I run it in a python script?
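One way Kedro supports this is namespaced modular pipelines: instantiate the same pipeline once per group under a namespace and register the sum of the instances (depending on your Kedro version, parameters are namespaced too, or can be remapped via the parameters argument of pipeline()). A hedged sketch, where create_ml_pipeline(), the group names and the shared raw_data input are placeholders for your own project:

```python
from functools import reduce
from operator import add

from kedro.pipeline import Pipeline, pipeline

from my_project.pipelines.ml import create_ml_pipeline  # hypothetical factory module

GROUPS = ["group_a", "group_b", "group_c"]  # hypothetical group names


def create_pipeline(**kwargs) -> Pipeline:
    base = create_ml_pipeline()  # filter -> preprocess -> features -> train -> save
    per_group = [
        pipeline(
            base,
            namespace=group,    # dataset names get prefixed, e.g. group_a.model
            inputs="raw_data",  # keep the shared raw input un-namespaced
        )
        for group in GROUPS
    ]
    # Pipeline objects support "+", so this is literally a pipeline of pipelines.
    return reduce(add, per_group)
```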
Hi kedro community!! I have encountered an issue when working with kedro within a marimo notebook (I think the issue would be just the same in a jupyter notebook). Basically, I was initially working on my notebook by launching it from the command line at the kedro project root folder, with something like: marimo edit notebooks/nb.py
where my folder structure is something like:
```
├── README.md
├── conf
│   ├── base
│   ├── local
├── data ...
├── notebooks
│   ├── nb.py
├── pyproject.toml
├── requirements.txt
├── src ...
└── tests ...
```

Within nb.py I have a cell that runs:

```python
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(
    conf_source=Path(__file__).parent / settings.CONF_SOURCE,
    default_run_env="base",
)
catalog = DataCatalog.from_config(
    conf_loader["catalog"], credentials=conf_loader["credentials"]
)
```
and another cell that loads a dataset:

```python
import polars as pl

weekly_sales = pl.from_pandas(catalog.load("mytable"))
```
In the catalog, all the filepaths are relative and assume that wherever the catalog is being used from is the Kedro project root level. The conf_source argument to the OmegaConfigLoader instance is an absolute path, but the filepaths referenced by the catalog (e.g. conf/base/sql/somequery.sql or data/mydataset.csv) are not, so if I run my notebook from the root of my kedro project all is fine. But if I were instead to run cd notebooks; marimo edit nb.py, then catalog.load will attempt to load the query or dataset from notebooks/conf/base/sql/somequery.sql.
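One possible workaround (a sketch; the single parents[1] hop assumes nb.py sits directly under notebooks/, so adjust it to your layout) is to derive the project root from the notebook file itself and chdir to it before building the catalog, so relative filepaths keep resolving from the project root regardless of where marimo was launched:

```python
import os
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings
from kedro.io import DataCatalog

# notebooks/nb.py -> project root (layout assumption, see note above)
project_root = Path(__file__).resolve().parents[1]
os.chdir(project_root)  # relative filepaths like data/mydataset.csv now resolve from here

conf_loader = OmegaConfigLoader(
    conf_source=str(project_root / settings.CONF_SOURCE),
    default_run_env="base",
)
catalog = DataCatalog.from_config(
    conf_loader["catalog"], credentials=conf_loader["credentials"]
)
```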
How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? I find that, given the fairly opinionated syntax and project structure that is proposed, it's easy to DRY up bits of code that would be best left not DRY (e.g. preprocessing code). I wonder if anyone else has had similar thoughts.
Hey, how do people use kedro at scale? I've read a few tutorials on how to use kedro for single projects, but none on how to use it at scale. To me there would be an inherent benefit in creating modules with the pipeline step logic (like a shared nodes.py) and, for common tasks, using those rather than rewriting them in each pipeline-specific nodes.py (something like the sketch below). Does anybody do this?
I am keen to learn how people make the most out of kedro.
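Regarding the shared nodes.py idea, a small sketch of that layout; all names below are illustrative, and in practice drop_null_rows would live in a shared module (say src/my_project/common/nodes.py) that each pipeline's pipeline.py imports instead of re-implementing it in its own nodes.py:

```python
from kedro.pipeline import Pipeline, node, pipeline


def drop_null_rows(df):
    """Generic cleaning step reused across pipelines (would live in the shared module)."""
    return df.dropna()


def create_pipeline(**kwargs) -> Pipeline:
    # Pipeline-specific wiring, shared node implementation.
    return pipeline(
        [
            node(drop_null_rows, inputs="raw_sales", outputs="clean_sales"),
        ]
    )
```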