How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? I find that, given the fairly opinionated syntax and project structure it proposes, it's easy to DRY up bits of code that would be best left repeated (e.g. preprocessing code). I wonder if anyone else has had similar thoughts.
So in the Kedro tutorial we keep everything in one project; longer term, I move all business logic into independently tested packages.
This also means your Kedro projects become really lightweight representations of the flow and the data catalog. Using dataset factories in the data catalog also massively improves DRY.
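As a rough sketch of what that split can look like (the package `my_business_logic` and its `clean_orders` function are hypothetical names, not part of Kedro):

```python
# pipeline.py inside the Kedro project: pure wiring, no business logic.
from my_business_logic.preprocessing import clean_orders  # hypothetical, separately tested package

from kedro.pipeline import node, pipeline


def create_pipeline(**kwargs):
    return pipeline(
        [
            # The node only connects catalog entries to the imported,
            # independently tested function.
            node(clean_orders, inputs="raw_orders", outputs="preprocessed_orders"),
        ]
    )
```

The Kedro project then carries no logic of its own, just the graph and the catalog, while `my_business_logic` gets its own unit tests and release cycle.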
I think it’s possible to make things so DRY it’s hard to follow, but in general it’s not a problem
@Deepyaman Datta to me, when functions are just a thin wrapper around some pandas/polars operations, it's much more straightforward to read the plain dataframe operations in their native language
@datajoely I am still fairly new to Kedro. What do you mean by dataset factories? I can't see a mention of them in the docs
Also, do you have an example of Kedro projects built on top of independently tested packages? An advanced tutorial for it would be a fantastic addition to the docs
Ok, apologies for that, I need to change search engines! That was Google's top result
Shouldn't the example be:

```yaml
boats:
  type: pandas.CSVDataset
  filepath: data/01_raw/boats.csv

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/cars.csv

planes:
  type: pandas.CSVDataset
  filepath: data/01_raw/planes.csv
```
Yes, but with the factory approach all three of those can be collapsed into one DRY pattern-matching entry called a dataset factory
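For example, something like this single entry could stand in for all three (a sketch using the dataset factory pattern syntax; the matched names come from the catalog above):

```yaml
# One catalog entry replaces boats, cars, and planes: {name} is
# resolved from whichever dataset name a pipeline asks for at runtime.
"{name}":
  type: pandas.CSVDataset
  filepath: data/01_raw/{name}.csv
```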
Dataset factories are similar to regular expressions, and you can think of them as a reversed f-string. In this case, the name of the input dataset `factory_data` matches the pattern `{name}_data` with the `_data` suffix, so it resolves `name` to `factory`. Similarly, it resolves `name` to `process` for the output dataset `process_data`.
This allows you to use one dataset factory pattern to replace multiple dataset entries. It keeps your catalog concise, and you can generalise datasets with similar names, types, or namespaces.
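Spelled out as a catalog entry, that pattern might look like this (a sketch; the type and filepath are illustrative, only the `{name}_data` pattern comes from the excerpt above):

```yaml
# Matches both factory_data and process_data: {name} resolves to
# "factory" for the input and "process" for the output dataset.
"{name}_data":
  type: pandas.CSVDataset
  filepath: data/02_intermediate/{name}_data.csv
```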
That's very cool on the catalog front! I'd love to see how people avoid over-DRYing in pipelines and nodes, especially how people build packages for this!
> @Deepyaman Datta to me, when functions are just a thin wrapper around some pandas/polars operations, it's much more straightforward to read the plain dataframe operations in their native language

This is possible! For example, say you want to use https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html. Instead of defining a wrapper function, you can do `from operator import methodcaller` and use `methodcaller("drop_nulls")` as your node function.
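To make that concrete, here is a minimal sketch of a pipeline wired that way (the dataset names `raw_data` and `clean_data` are hypothetical):

```python
from operator import methodcaller

from kedro.pipeline import node, pipeline


def create_pipeline(**kwargs):
    return pipeline(
        [
            # methodcaller("drop_nulls") builds a callable equivalent to
            # lambda df: df.drop_nulls(), so the polars method is used
            # directly, with no hand-written wrapper to maintain or test.
            node(
                func=methodcaller("drop_nulls"),
                inputs="raw_data",
                outputs="clean_data",
                name="drop_nulls_node",
            ),
        ]
    )
```

Giving the node an explicit `name` is useful here, since a `methodcaller` instance has no `__name__` for Kedro to derive one from.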