Avoiding Over-drying with Kedro: Balancing Abstraction and Flexibility

How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? Given the fairly opinionated syntax and project structure it proposes, I find it easy to DRY bits of code that would be best left not DRY (e.g. preprocessing code). I wonder if anyone else has had similar thoughts


So in the Kedro tutorial we keep everything in one project; longer term, I move all business logic into independently tested packages.

This also means your Kedro projects are really lightweight representations of flow and data catalog. A data catalog utilizing dataset factories also massively improves DRY.

What's an example of over-DRY?

I think it's possible to make things so DRY it's hard to follow, but in general it's not a problem

@Deepyaman Datta to me, it's writing functions that are thin wrappers around some pandas/polars operations; it's much more straightforward to just read the plain dataframe operations in their native language

@datajoely I am still fairly new to Kedro; what do you mean by dataset factories? I can't see a mention of them in the docs

Also, do you have an example of Kedro projects built on top of independently tested packages? An advanced tutorial for it would be a fantastic addition to the docs

ok, apologies for that, I need to change search engine! that was Google's top result

shouldn't the example be:

boats:
  type: pandas.CSVDataset
  filepath: data/01_raw/boats.csv

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/cars.csv

planes:
  type: pandas.CSVDataset
  filepath: data/01_raw/planes.csv
??

yes, but with the factory approach all three of those can be collapsed into one DRY pattern-matching entry called a dataset factory

Dataset factories are similar to regular expressions, and you can think of them as reversed f-strings. In this case, the name of the input dataset factory_data matches the pattern {name}_data with the _data suffix, so it resolves {name} to factory. Similarly, it resolves {name} to process for the output dataset process_data.
This allows you to use one dataset factory pattern to replace multiple dataset entries. It keeps your catalog concise, and you can generalise datasets that share similar names, types or namespaces.
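For the boats/cars/planes catalog above, a minimal sketch of the collapsed factory entry (assuming the CSV file names match the dataset names, as in that example) would be:

```yaml
# One dataset factory entry matching boats, cars and planes alike;
# {name} is resolved from the dataset name the pipeline requests.
"{name}":
  type: pandas.CSVDataset
  filepath: data/01_raw/{name}.csv
```

Requesting the dataset boats then resolves {name} to boats and loads data/01_raw/boats.csv, and likewise for the other two.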

That's very cool on the catalog front! I'd love to see how people avoid DRY-ing in pipelines and nodes, especially how people build packages for this!

@Deepyaman Datta to me, writing functions that are a thin wrapper around some pandas/polars operations, much more straightforward to just read the plain dataframe operations in their native language
This is possible! For example, say you want to use https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html. Instead of defining a thin wrapper function, you can do from operator import methodcaller and use methodcaller("drop_nulls") as your node function.
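A minimal sketch of the methodcaller trick, illustrated here with plain strings so it runs without pandas/polars installed; the mechanism is identical for DataFrame methods:

```python
from operator import methodcaller

# methodcaller("m") builds a callable f such that f(obj) == obj.m(),
# so methodcaller("drop_nulls") behaves like lambda df: df.drop_nulls()
# and can be passed directly as a Kedro node function.
to_upper = methodcaller("upper")
print(to_upper("kedro"))  # KEDRO

# Extra positional/keyword arguments are bound up front, e.g.
# methodcaller("fill_null", 0)(df) == df.fill_null(0).
strip_prefix = methodcaller("removeprefix", "raw_")
print(strip_prefix("raw_boats"))  # boats
```

This avoids cluttering the codebase with one-line wrapper functions while keeping the dataframe operation readable in its native API.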

I see, that's nice, thanks! I'll have a look at methodcaller!
