Paul Mora
Joined January 6, 2025

Another question from my side.

I have a node that outputs a dictionary called train_test_dts, which I save as a pickle with the joblib backend.
When I then try to run my pipeline with the ParallelRunner, like this:

kedro run --pipeline feature_engineering --params env=dev,inference_dt=2025-01-05 --runner ParallelRunner
I get the following error:
AttributeError: The following datasets cannot be used with multiprocessing: ['train_test_dts']
In order to utilize multiprocessing you need to make sure all datasets are serialisable, i.e. datasets should not make use of lambda functions, nested functions, closures etc.
If you are using custom decorators ensure they are correctly decorated using functools.wraps().

Any idea why that happens and what I could do to fix that?
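
For reference, a minimal sketch of what such a catalog entry might look like, assuming the pickle.PickleDataset from kedro-datasets (the filepath is a hypothetical placeholder):

train_test_dts:
  type: pickle.PickleDataset
  filepath: data/05_model_input/train_test_dts.pkl  # hypothetical path
  backend: joblib  # serialization backend named in the post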

1 comment

Stupid question: the Kedro VS Code extension does not work for me. After installing it and the dependencies, I still cannot click on the catalog items. Is there a standard solution for this?

15 comments

Hey team, hope you are doing well. I have the following question (I already checked whether any previous question answers it, with no luck).

I have primary data paths such as:

"prm_customer_base":
  table_name: primary_${_environment}.prm_customer_base
  <<: *_conn

Now I would like to make the environment specific to the --env that I am running with. So I created a file under conf/dev/catalog_dev.yml which contains:

_environment: dev

However, the interpolation does not work, and I get an error saying that the interpolation key is not found.

One workaround I found was creating conf/dev/globals.yml (plus removing the underscore then, of course). That seems to work, though I am not sure how I feel about having a globals.yml file for each environment; I was thinking that having multiple globals files defeats the point.

Any comment on that?
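
For context, a minimal sketch of the globals-based workaround described above, assuming Kedro's OmegaConfigLoader with its built-in globals resolver (the _conn block is hypothetical, since the real connection details are not shown in the post):

conf/base/globals.yml:

environment: base

conf/dev/globals.yml (picked up when running with --env dev):

environment: dev

conf/base/catalog.yml:

_conn: &_conn  # hypothetical connection block; keys starting with _ are not treated as datasets
  type: pandas.SQLTableDataset
  credentials: db_credentials

prm_customer_base:
  table_name: primary_${globals:environment}.prm_customer_base
  <<: *_conn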

10 comments