How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? I find that, given the fairly opinionated syntax and project structure it proposes, it's easy to DRY up bits of code that would be best left repeated (e.g. preprocessing code). I wonder if anyone else has had similar thoughts.
So in the Kedro tutorial we keep everything in one project; longer term, I move all business logic into independently tested packages.
This also means your Kedro projects become really lightweight representations of the flow and the data catalog. Using dataset factories in the data catalog also massively improves DRY.
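As a rough sketch of what that split can look like (the package `my_business_logic` and its `clean_orders` function are hypothetical names, not part of Kedro):

```python
# pipeline.py inside the Kedro project: pure wiring, no business logic.
from my_business_logic.preprocessing import clean_orders  # hypothetical, separately tested package

from kedro.pipeline import node, pipeline


def create_pipeline(**kwargs):
    return pipeline(
        [
            # The node only connects catalog entries to the imported,
            # independently tested function.
            node(clean_orders, inputs="raw_orders", outputs="preprocessed_orders"),
        ]
    )
```

The Kedro project then carries no logic of its own, just the graph and the catalog, while `my_business_logic` gets its own unit tests and release cycle.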
I think it’s possible to make things so DRY it’s hard to follow, but in general it’s not a problem
@Deepyaman Datta to me, when functions are just a thin wrapper around some pandas/polars operations, it's much more straightforward to read the plain dataframe operations in their native language
@datajoely I am still fairly new to Kedro. What do you mean by dataset factories? I can't see a mention of them in the docs
Also, do you have an example of Kedro projects built on top of independently tested packages? An advanced tutorial for it would be a fantastic addition to the docs
Ok, apologies for that, I need to change search engines! That was Google's top result
Shouldn't the example be:

```yaml
boats:
  type: pandas.CSVDataset
  filepath: data/01_raw/boats.csv

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/cars.csv

planes:
  type: pandas.CSVDataset
  filepath: data/01_raw/planes.csv
```
Yes, but with the factory approach all three of those can be collapsed into one DRY pattern-matching entry called a dataset factory
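For example, something like this single entry could stand in for all three (a sketch using the dataset factory pattern syntax; the matched names come from the catalog above):

```yaml
# One catalog entry replaces boats, cars, and planes: {name} is
# resolved from whichever dataset name a pipeline asks for at runtime.
"{name}":
  type: pandas.CSVDataset
  filepath: data/01_raw/{name}.csv
```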
Dataset factories are similar to regular expressions, and you can think of them as a reversed f-string. In this case, the name of the input dataset `factory_data` matches the pattern `{name}_data` with the `_data` suffix, so it resolves `name` to `factory`. Similarly, it resolves `name` to `process` for the output dataset `process_data`.
This allows you to use one dataset factory pattern to replace multiple dataset entries. It keeps your catalog concise, and you can generalise datasets with similar names, types, or namespaces.
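Spelled out as a catalog entry, that pattern might look like this (a sketch; the type and filepath are illustrative, only the `{name}_data` pattern comes from the excerpt above):

```yaml
# Matches both factory_data and process_data: {name} resolves to
# "factory" for the input and "process" for the output dataset.
"{name}_data":
  type: pandas.CSVDataset
  filepath: data/02_intermediate/{name}_data.csv
```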
That's very cool on the catalog front! I'd love to see how people avoid over-DRYing in pipelines and nodes, especially how people build packages for this!
> @Deepyaman Datta to me, when functions are just a thin wrapper around some pandas/polars operations, it's much more straightforward to read the plain dataframe operations in their native language

This is possible! For example, say you want to use https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html. Instead of defining a wrapper function, you can do `from operator import methodcaller` and use `methodcaller("drop_nulls")` as your node function.
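To make that concrete, here is a minimal sketch of a pipeline wired that way (the dataset names `raw_data` and `clean_data` are hypothetical):

```python
from operator import methodcaller

from kedro.pipeline import node, pipeline


def create_pipeline(**kwargs):
    return pipeline(
        [
            # methodcaller("drop_nulls") builds a callable equivalent to
            # lambda df: df.drop_nulls(), so the polars method is used
            # directly, with no hand-written wrapper to maintain or test.
            node(
                func=methodcaller("drop_nulls"),
                inputs="raw_data",
                outputs="clean_data",
                name="drop_nulls_node",
            ),
        ]
    )
```

Giving the node an explicit `name` is useful here, since a `methodcaller` instance has no `__name__` for Kedro to derive one from.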