Join the Kedro community

Updated 6 days ago

How to think about unit testing complex nodes in Kedro

Hey team. I'm looking for advice or insights on how to think about unit testing complex nodes in Kedro (or rather, nodes that take in complex data with a lot of edge cases). In these cases I usually integrate a lot of functionality into a single node, composed of several smaller private functions.
My question: how best to test the node's actual output (standard stuff like column a shouldn't have any nulls, column b should never be lower than 10)?

  • I feel like it would be impossible to create dummy data to account for all edge cases in the test function itself
  • Reading from the production input table, on the other hand, defeats the purpose of unit testing.
  • Does it make sense to generate synthetic or sample data from the input tables to the node and store it somewhere to be read at testing time?
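One way to picture the last option is a small sample that deliberately keeps the rows hitting known edge cases and adds a few random rows, saved as a file the tests can read. This is only a minimal sketch: `build_fixture`, the column names `a` and `b`, and the threshold of 10 are illustrative assumptions taken from the example in the question, and the in-memory buffer stands in for a real file committed next to the tests.

```python
import io

import pandas as pd


def build_fixture(df: pd.DataFrame, n_random: int = 5, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: keep every known edge-case row plus a few random ones."""
    # Rows that exercise the edge cases we already know about.
    edge_rows = df[df["a"].isna() | (df["b"] < 10)]
    # A small, reproducible random sample of the remaining rows.
    rest = df.drop(edge_rows.index)
    random_rows = rest.sample(n=min(n_random, len(rest)), random_state=seed)
    return pd.concat([edge_rows, random_rows]).sort_index()


# Stand-in for the (much larger) input table.
full = pd.DataFrame(
    {"a": [1.0, None, 3.0, 4.0, 5.0, 6.0], "b": [12, 30, 7, 40, 50, 60]}
)
fixture = build_fixture(full, n_random=2)

# Store the sample; a real project would write to a file instead of a buffer.
buffer = io.StringIO()
fixture.to_csv(buffer, index=False)

# At test time, read the stored sample back.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

The sample stays small because edge cases are selected by rule rather than by volume, so new edge cases found in production can be added as new selection rules.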

5 comments

Hi Pedro, I think it would be beneficial to separate them into two different cases. Unit testing and data validations are quite different in terms of tooling.

Synthetic data is an option; libraries like hypothesis take this to the next level, though I am not sure if that's what you are looking for.

How does the complexity of the node affect this test? I guess what you want to do is test the input/output, but not necessarily every intermediate output.
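For reference, a property-based test with hypothesis might look roughly like the sketch below. Instead of hand-writing inputs, it lets hypothesis generate many small DataFrames and checks that the output properties always hold. `clean_node` is a hypothetical stand-in for a real node, and the column names and the threshold of 10 come from the example in the question.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames


def clean_node(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical node: drop rows with null `a`, floor `b` at 10."""
    out = df.dropna(subset=["a"]).copy()
    out["b"] = out["b"].clip(lower=10)
    return out


@settings(max_examples=50, deadline=None)
@given(
    data_frames(
        columns=[
            # NaN is allowed on purpose so nulls show up in the input.
            column("a", dtype=float, elements=st.floats(allow_nan=True, allow_infinity=False)),
            column("b", dtype=int, elements=st.integers(min_value=-1000, max_value=1000)),
        ]
    )
)
def test_clean_node_properties(df: pd.DataFrame) -> None:
    out = clean_node(df)
    # Properties that must hold for every generated input.
    assert out["a"].notna().all()
    assert (out["b"] >= 10).all()
```

Run under pytest, hypothesis shrinks any failing input to a minimal counterexample, which tends to be easier to read than a hand-built edge-case table.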

Hi! Check out pandera. That library lets you define the conditions you want to enforce on a dataset and validate any dataset against them (in your example, I guess, the output of that function). Pandera can also generate synthetic data automatically from a schema definition.

This is literally what it’s designed for:

column a shouldn't have any nulls, column b should never be lower than 10

Thanks guys!

  • I would say it's the complexity of the inputs (several columns, billions of rows hiding several edge cases, several input tables in some cases) rather than the complexity of the node that affects how the tests can be thought of here. The problem also applies to the smaller private functions in the node, because their inputs and outputs are complex too. I'm struggling to declare sample inputs and outputs in the test that are not ridiculously extensive and complex.
  • Good point. I have been using pandera for a while too, but I believe it's intended for runtime checks - i.e., if a given contract is not met, you can either stop execution or handle it gracefully. Here, though, I am interested in testing this at CI/CD time.

Though I could probably include pandera in an end-to-end test that runs the entire pipeline with actual prod data - but then again, that becomes quite expensive.
