Second runner run fails to save output after pipeline upgrade

Could you share an example? This sounds suspicious, which runner are you using? I only recall a minor change with ThreadRunner

trying to reproduce with spaceflights

I'll make a github ticket

https://github.com/kedro-org/kedro/issues/4235

<@U04MH7Q419V>

KKacper Leśniara

This unfortunately results in making the packaged model servers with kedro-mlflow work only once, then they need a reboot. FYI

when will we start making things that actually can last? one use cutlery, one use batteries, now we get a one use servers >...<

MMerel Theisen

Is this caused by the catalog work?

it seems like

I was able to reproduce it, looking into it

Here is the explanation and fix: https://github.com/kedro-org/kedro/pull/4236

I leave a comment there. It's unclear to me why it breaks (?) I haven't been able to reproduce the error yet. I got a and b both {} when I run this on GitPod on 0.19.8 and 0.19.9

Is this how your test look like?

def test_data_science_pipeline(caplog, dummy_data, dummy_parameters):

    pipeline = (
        create_ds_pipeline()
        .from_nodes("split_data_node")
        .to_nodes("evaluate_model_node")
    )
    catalog = DataCatalog()
    catalog.add_feed_dict(
        {
            "model_input_table" : dummy_data,
            "params:model_options": dummy_parameters["model_options"],
        }
    )
    
    a = SequentialRunner().run(pipeline, catalog)
    b = SequentialRunner().run(pipeline, catalog)
    assert a == b

change the test and you’ll reproduce

    pipeline = (
        create_ds_pipeline()
        .from_nodes("split_data_node")
        .to_nodes("train_model_node")
    )

evaluate_model_node does not return anything

and there are no free_outputs

ya ok, as the issue describe using the test we have in the starter and I cannot reproduce it.

let me try

I updated the comment there with the new test, I still think there is an issue with the memory dataset definition

pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['params:model_options', 'model_input_table']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs={'y_test', 'X_test', 'regressor'}


pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['X_test', 'params:model_options', 'model_input_table', 'X_train', 'regressor', 'y_test', 'y_train']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs=set()

I can see now the 2nd run we return nothing for free_outputs, but I expect y_test', 'X_test', 'regressor' in the memory_dataset, but it's not. That is why the free_output is missing them at the end.

I think the issue is with the shallow copy instead. Those free_outputs are initialised before the copy was made, and thus making incorrect reference.

I don't understand the need of the shallow copy - but by shifting all those free_outputs declaration after the shallow copy, I get the expected output correctly.