Hey team. Looking for some advice or insights on how to think about unit testing complex nodes in Kedro (or rather, nodes taking in complex data with a lot of edge cases). In these cases I usually follow the approach of integrating a lot of functionality into a single node, composed of several smaller private functions.
My question: how do I best test the node's actual output (standard stuff like "column a shouldn't have any nulls, column b should never be lower than 10")?
Hi Pedro, I think it would be beneficial to separate these into two different cases: unit testing and data validation are quite different in terms of tooling.
Synthetic data is an option; there are libraries like hypothesis that take this to the next level, though I am not sure if that's what you are looking for.
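To make this concrete, here is a minimal sketch of what a property-based test with hypothesis could look like for your two example constraints. The node function `clean_scores` and the column names are hypothetical stand-ins, not your actual code:

```python
import pandas as pd
from hypothesis import given, strategies as st

def clean_scores(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical node: drop null "a" values, floor "b" at 10.
    out = df.dropna(subset=["a"]).copy()
    out["b"] = out["b"].clip(lower=10.0)
    return out

# Generate random (a, b) rows, including None in "a" to exercise the null path.
rows = st.lists(
    st.tuples(
        st.one_of(st.none(), st.floats(allow_nan=False, allow_infinity=False)),
        st.floats(allow_nan=False, allow_infinity=False),
    ),
    min_size=1,
    max_size=20,
)

@given(rows=rows)
def test_output_contract(rows):
    df = pd.DataFrame(rows, columns=["a", "b"])
    out = clean_scores(df)
    # The output contract from the question:
    assert out["a"].notna().all()
    assert (out["b"] >= 10).all()

test_output_contract()  # hypothesis runs this against many generated inputs
```

The nice part is you assert the contract once and hypothesis hunts for edge cases (empty-ish frames, extreme floats) instead of you hand-writing each fixture.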
How does the complexity of the node affect this test? I guess what you want to do is test the input/output, but not necessarily every intermediate result.
Hi! Check out pandera. That library lets you define all the conditions you want to enforce on your dataset and then validate any dataframe (in your example, the output of that function) against them. Pandera can also generate synthetic data automatically based on the schema definition.
This is literally what it's designed for: "column a shouldn't have any nulls, column b should never be lower than 10".
Thanks guys!
Though I could probably include pandera in an end-to-end test that runs the entire pipeline with actual prod data, but then again that becomes quite expensive.