Defining Nodes with Decorators in Kedro

Would Kedro users be opposed to defining nodes with decorators? I have written a simple implementation, but as I've only recently started using Kedro, I wonder if I'm missing anything.

The syntax would be:

from kedro.pipeline import Pipeline, node, pipeline

@node(inputs=1, outputs="first_sum")
def step1(number):
    return number + 1

@node(inputs="first_sum", outputs="second_sum")
def step2(number):
    return number + 1

@node(inputs="second_sum", outputs="final_result")
def step3(number):
    return number + 2

pipeline = pipeline(
    [
        step1,
        step2,
        step3,
    ]
)

the node name could be inferred from the function name
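A minimal standalone sketch of what such a decorator could look like (this is not Kedro's API; the `_node_spec` attribute and the deferral of actual `Node` construction to a `pipeline()` helper are invented here for illustration):

```python
import functools

def node(inputs=None, outputs=None, name=None):
    """Hypothetical @node decorator: leaves the function callable as-is
    and attaches pipeline metadata for a pipeline() helper to read later."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # node name defaults to the function name, as suggested above
        wrapper._node_spec = {
            "inputs": inputs,
            "outputs": outputs,
            "name": name or func.__name__,
        }
        return wrapper
    return decorator

@node(inputs="first_sum", outputs="second_sum")
def step2(number):
    return number + 1

# the decorated function is still a plain function...
assert step2(1) == 2
# ...but carries the metadata a pipeline() helper could turn into real Nodes
assert step2._node_spec["name"] == "step2"
```

A `pipeline()` helper could then walk the list of decorated functions and build ordinary Kedro `node(...)` objects from each `_node_spec`.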

31 comments

the functions could even be decorated in the project's nodes.py, and then the pipeline definition would just become:

from kedro.pipeline import Pipeline, node, pipeline
from .nodes import step1, step2, step3

pipeline = pipeline(
    [
        step1,
        step2,
        step3,
    ]
)

Personally I'm not strongly against it, but to me it's mostly syntactic sugar. It will make simple things simpler but the complex cases more difficult. That's the main tradeoff

It's pure syntactic sugar, yes

What complex cases would this not cover though?

What complex cases would this not cover though?
You get into funny situations where different decorators would conflict; for example, combining this with Pandera might be painful
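To illustrate the stacking hazard with a toy example (the `validate` decorator below is made up, standing in for a third-party decorator like Pandera's; nothing here is Pandera's or Kedro's real API): if an outer decorator wraps the function without copying attributes, metadata attached by an inner `@node` decorator silently disappears.

```python
def node(inputs=None, outputs=None):
    """Hypothetical @node decorator (sketch): attaches metadata in place."""
    def decorator(func):
        func._node_spec = {"inputs": inputs, "outputs": outputs}
        return func
    return decorator

def validate(func):
    """Stand-in for a third-party validation decorator that wraps the
    function WITHOUT functools.wraps, so attributes are not copied over."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@validate
@node(inputs="first_sum", outputs="second_sum")
def step2(number):
    return number + 1

# the function still runs, but the node metadata is now invisible
# to any pipeline() helper that looks for it
assert step2(1) == 2
assert not hasattr(step2, "_node_spec")
```

Whether this bites in practice depends on decorator order and on how well-behaved the other decorators are, which is exactly the kind of thing that's hard to guarantee across the ecosystem.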

We also try to have only one way of doing things; whilst this rule is broken in some places, breaking it can cause headaches

say you want to reuse the function -> now you can't, because it's a node with pipeline-specific details

similar case when you want to unit test: you will most likely want to test the node's function rather than the node itself

It's easy to extract the node function from the node for tests, but it arguably adds more complexity
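For example, if the hypothetical decorator used functools.wraps, the undecorated function would survive as `__wrapped__` and unit tests could target it directly (a sketch under that assumption, not Kedro's API):

```python
import functools

def node(func):
    """Hypothetical @node decorator (sketch); functools.wraps records the
    original callable on the wrapper as __wrapped__."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@node
def step2(number):
    return number + 1

# a unit test can bypass the node wrapper entirely and test the plain function
assert step2.__wrapped__(41) == 42
```

So extraction is mechanically easy, but it does mean every test file needs to know about (and reach through) the wrapping convention.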

I don't think there's an issue with reusability of functions; I replied to this concern in the GitHub thread: https://github.com/kedro-org/kedro/issues/2471#issuecomment-2598338855

But I hear the concern around having multiple ways of doing one thing and that confusing users

And I also hear the concern about clashes when stacking decorators, @datajoely. I'm not sure how easy it would be to circumvent that

yeah, which is why I think it's not worth the trouble, in my opinion

I agree with this:

It will make simple things simpler but the complex cases more difficult. That's the main tradeoff

I personally like to move all of my functions outside of my Kedro project into their own well-tested Python package, which is available on a private PyPI-like repo. This also means my flow logic isn't coupled to my business logic, and say I needed to swap Kedro for something else, it won't be a pain

Fair enough! Just wanted to see what the community thought of it, thanks for the insights @Nok @datajoely

so in summary I think our current pattern enables high cohesion but low coupling between Kedro's framework and your business logic

The approach you have taken is slightly different: you keep the function but have a separate thin node wrapper for what you call a "step"

@node(inputs=["a", "b"], outputs="sum")
def pipeline_step(a, b):
    return reusable_fn(a, b)

Is this simpler than node(reusable_fn, ["a", "b"], outputs="sum")? It's a few more keystrokes, though maybe slightly clearer since the arguments are highlighted at the top

Curious, where do you store your private PyPI repos? At my old work we had Artifactory, but I'm not sure what folks use out there

So at my new place we use the one GCP provides

I wasn't able to get the GitHub one working

but technically a PyPI index can be a flat file on S3 or something

^ don't be discouraged if you find this works for you; fundamentally there is nothing wrong with it IMO. We aim to serve a broad audience, so we try to keep things simple

JFrog private PyPI index

Is this simpler than node(reusable_fn, ["a", "b"], outputs="sum")? It's a few more keystrokes, though maybe slightly clearer since the arguments are highlighted at the top
@Nok yeah, I'm not really sure which one is simpler at that point, hence agreeing with your point that it complicates the "advanced" use cases

oh yes, we used JFrog too!

Bit late to the party, but I would suggest not binding input and output names to the node. Instead, I think you can get something halfway like:

@node
def step1(number):
    return number + 1

@node
def step2(number):
    return number + 1

@node
def step3(number):
    return number + 2

@pipeline
def my_pipe(my_input):
    first_sum = step1(my_input)
    second_sum = step2(first_sum)
    final_result = step3(second_sum)
    return final_result

my_pipe(1)

This is definitely not 100% what it should look like, but I think the benefit is that you are still constructing the DAG at pipeline definition time, even though it's via function calls.
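The tracing idea above can be sketched standalone (all names here are invented for illustration; real Kedro would build `Node` objects rather than this edge list): inside a `@pipeline`-decorated function, calling a `@node`-decorated function computes nothing and instead records an edge, returning a symbolic reference that later calls can consume.

```python
_edges = []       # (node_name, input_names) pairs collected while tracing
_tracing = False

class Ref:
    """Symbolic handle for a dataset produced by a node during tracing."""
    def __init__(self, producer):
        self.producer = producer

def node(func):
    def wrapper(*args):
        if _tracing:
            # record an edge instead of computing; inputs are the producers
            # of the Refs passed in (or the raw value for pipeline inputs)
            _edges.append((func.__name__,
                           [a.producer if isinstance(a, Ref) else a for a in args]))
            return Ref(func.__name__)
        return func(*args)  # outside a pipeline, behave like a plain function
    wrapper.__name__ = func.__name__
    return wrapper

def pipeline(func):
    def build(*args):
        global _tracing
        _edges.clear()
        _tracing = True
        try:
            # call the body with symbolic inputs so the DAG is recorded
            func(*[Ref("input") for _ in args])
        finally:
            _tracing = False
        return list(_edges)
    return build

@node
def step1(number): return number + 1

@node
def step2(number): return number + 1

@pipeline
def my_pipe(my_input):
    first_sum = step1(my_input)
    return step2(first_sum)

dag = my_pipe(1)
# dag describes the structure: step1 reads the input, step2 reads step1's output
assert dag == [("step1", ["input"]), ("step2", ["step1"])]
```

The appeal is that the dataflow reads as ordinary Python, while the DAG is still fully known before anything runs; the cost is the extra machinery (and global tracing state) hidden behind the decorators.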

Also, on a separate note, I think the desire for this alternative syntax does come up, and it would be interesting to see some community-driven package that enables this syntax, + realistic examples + understanding the caveats. πŸ™‚ It's hard to evaluate how well something like this could work without actually doing it, but it's a pretty big risk to take in core Kedro. πŸ™‚

I implemented something like this at work a while ago but never really used it. I think a better solution to the issue I was trying to solve would be a Kedro VS Code plugin that can better show all the pipelines, nodes, and catalog entries.
