I am writing my first Kedro pipeline tests and I am a little confused.
I am testing a pipeline with two nodes; the first node outputs a Spark object, which needs to be stored in a MemoryDataset with copy_mode "assign". How can I specify that in Python rather than YAML?
import logging
from kedro.io import DataCatalog
from kedro.runner import SequentialRunner

catalog = DataCatalog()
caplog.set_level(logging.DEBUG, logger="kedro")
successful_run_msg = "Pipeline execution completed successfully."
SequentialRunner().run(pipeline, catalog)
assert successful_run_msg in caplog.text
do I do that using add_feed_dict? how?
So you can use Kedro this way, but it's not actually the way we recommend unless you have a specific reason to do so.
I would really recommend that you follow the Spaceflights tutorial, since it covers the key concepts and abstracts away some of this complexity.
we also have a full training course on YouTube:
https://www.youtube.com/playlist?list=PL-JJgymPjK5LddZXbIzp9LWurkLGgB-nY
This is the recommended way in the Kedro documentation to write pipeline tests: https://docs.kedro.org/en/stable/tutorial/test_a_project.html
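For context, the test in that guide boils down to roughly this shape (a self-contained sketch rather than the exact tutorial code; identity, test_pipeline, and the dataset names here are placeholders I've made up):

import logging

from kedro.io import DataCatalog
from kedro.pipeline import node, pipeline
from kedro.runner import SequentialRunner


def identity(x):
    return x


def test_pipeline_runs(caplog):
    # Trivial stand-in pipeline; in a real project you would import and
    # call your pipeline factory (e.g. create_pipeline()) instead.
    test_pipeline = pipeline([node(identity, inputs="input", outputs="output")])

    # Free inputs can be fed in as plain Python objects
    catalog = DataCatalog()
    catalog.add_feed_dict({"input": 42})

    caplog.set_level(logging.DEBUG, logger="kedro")
    SequentialRunner().run(test_pipeline, catalog)

    # Kedro's runner logs this message when the run finishes without errors
    assert "Pipeline execution completed successfully." in caplog.text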
I've been desperate to get a kedro-test micro-framework off the ground but it's been hard to prioritise
If we end up using Kedro we might be interested in doing some OSS contributions, so we could maybe help
Maybe this one helps? https://github.com/kedro-org/kedro/blob/main/tests/pipeline/test_pipeline_integration.py
hmmmm mine is similar, but I'm having the issue that I don't know how to specify that the output of the first node should use copy_mode "assign"
I guess something like:
dataset = MemoryDataset({"data": 42}, copy_mode="assign")
catalog = DataCatalog()
catalog.add_feed_dict({"dataset": dataset})
Yes indeed, it's:
from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog(
    datasets={
        "data_utility": MemoryDataset(copy_mode="assign"),
        "extract_model_features": MemoryDataset(copy_mode="assign"),
    },
)
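Putting the whole thread together, a full test for the two-node case might look something like this (a sketch under assumptions: the node functions are dummies standing in for the real Spark logic, and only the dataset names are taken from the snippet above):

import logging

from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline import node, pipeline
from kedro.runner import SequentialRunner


def build_utility(raw):
    # Imagine this returns a Spark object (e.g. a DataFrame)
    return {"rows": raw}


def extract_features(utility):
    # Imagine this consumes the Spark object from the first node
    return len(utility["rows"])


def test_two_node_pipeline(caplog):
    test_pipeline = pipeline(
        [
            node(build_utility, inputs="raw_data", outputs="data_utility"),
            node(extract_features, inputs="data_utility", outputs="extract_model_features"),
        ]
    )

    # copy_mode="assign" hands the object through without copying it, which
    # is what non-serialisable Spark objects need
    catalog = DataCatalog(
        datasets={
            "data_utility": MemoryDataset(copy_mode="assign"),
            "extract_model_features": MemoryDataset(copy_mode="assign"),
        },
    )
    catalog.add_feed_dict({"raw_data": [1, 2, 3]})

    caplog.set_level(logging.DEBUG, logger="kedro")
    SequentialRunner().run(test_pipeline, catalog)

    assert "Pipeline execution completed successfully." in caplog.text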