Kedro Mlflow NetCDF Dataset Path Issue

Hey Kedro community,
I'm currently working on a project trying to use kedro-mlflow to store a kedro_datasets_experimental.netcdf.NetCDFDataset as an artifact. Unfortunately, I can't make it work.

The problem seems to be path related:

kedro.io.core.DatasetError: 
Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/
/data/07_model_output/D2-24-25/idata.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': w}).
'str' object has no attribute 'as_posix'
I investigated to the best of my abilities, and it seems to be related to the initialization of NetCDFDataset. Most datasets inherit from AbstractVersionedDataset and set their _filepath as a PurePosixPath in __init__.
NetCDFDataset skips this step, so the PurePosixPath is never created. Whether this is ultimately the problem I don't know, but it is the point where other datasets set their path. In the meantime I wondered whether mlflow simply can't track datasets that don't inherit from AbstractVersionedDataset, but the kedro-mlflow documentation says MlflowArtifactDataset is a wrapper for all AbstractDatasets.

I tried setting self._filepath = PurePosixPath(filepath) myself in the site-packages copy, but then I get a PermissionError on saving, and that's where my journey has to end. It would have been too good if that one-liner had fixed it ^^
Thank you all for your help.

Here is some reduced code for what I'm trying to achieve.

catalog.yml
"{dataset}.idata":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: kedro_datasets_experimental.netcdf.NetCDFDataset
    filepath: data/07_model_output/{dataset}/idata.nc
    save_args:
      mode: a
    load_args:
      decode_times: False
node.py
import arviz as az


def predict(model, x_data):
    # Run the model and convert the resulting InferenceData to an xarray Dataset
    idata = model.predict(x_data)
    return az.convert_to_dataset(idata)
pipeline.py
from kedro.pipeline import node, pipeline

# `dataset` holds the namespace used in the catalog entry above
pipeline_inference = pipeline(
    [
        node(
            func=predict,
            inputs={
                "model": f"{dataset}.model",
                "x_data": f"{dataset}.x_data",
            },
            outputs=f"{dataset}.idata",
            name=f"{dataset}.predict_node",
            tags=["training"],
        ),
    ]
)

4 comments

Hi @Philipp Dahlke, sorry you had a bumpy experience!

I have a couple of questions:

  1. is MlflowNetCDFDataset a custom dataset you created? (from the first error you reported)
  2. when you used NetCDFDataset inside MlflowArtifactDataset (second code snippet), what error did you get? Could you share the full traceback?

Hi @Philipp Dahlke, sorry for the bad experience.

Unfortunately, Kedro's abstract (unversioned) datasets don't necessarily have a _filepath attribute, and neither its format nor its access is standardized. See this discussion (and maybe report your bug there to help prioritization — this is tangentially related to our DataCatalog refactoring, but for datasets): https://github.com/kedro-org/kedro/discussions/3753. kedro-mlflow has focused a lot on AbstractVersionedDataset, and it may have some flaws for such unusual datasets.

I think your fix attempt is the right one. In its __init__, NetCDFDataset should convert the filepath to a Path. Can you:

  1. try using pathlib.Path instead of pathlib.PurePosixPath and see if it works?
  2. In case it does not, can you share a minimal reproducible sample of data, in a format that you can load and save with NetCDFDataset, so that I can try it on my own?

@juanlu MlflowNetCDFDataset is created under the hood by MlflowArtifactDataset.

I am also tagging @Riley Brady, the author of NetCDFDataset. Riley, is there a reason we don't set self._filepath = PurePosixPath(filepath)? If not, can we make that change to the dataset to handle it?
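For concreteness, here is a minimal sketch of the proposed one-line fix. The class below is a stripped-down stand-in based on the snippets in this thread, not the actual kedro_datasets_experimental source:

```python
from pathlib import PurePosixPath


class NetCDFDataset:
    """Stripped-down stand-in, only to illustrate the proposed fix."""

    def __init__(self, filepath: str, load_args=None, save_args=None):
        # Proposed fix: store the path as a PurePosixPath instead of a plain
        # str, so callers such as kedro-mlflow can call .as_posix() on it.
        self._filepath = PurePosixPath(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}


ds = NetCDFDataset("data/07_model_output/idata.nc")
print(ds._filepath.as_posix())  # → data/07_model_output/idata.nc
```

This mirrors what versioned datasets already do in their __init__ before calling super().__init__.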

Thanks for your help.

I discovered that missing folders in my workspace, which are declared for the dataset in catalog.yml, raise the permission error (see first post).
After creating those folders manually, I can save NetCDFDataset with the change made to self._filepath in NetCDFDataset.__init__.
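In case it helps others hitting the same PermissionError: a dataset could guard against missing parent folders before saving with something like this (stdlib-only sketch; ensure_parent_dirs is a hypothetical helper, not part of any Kedro API):

```python
import tempfile
from pathlib import Path


def ensure_parent_dirs(filepath: str) -> Path:
    """Create any missing parent folders for a local file before saving to it."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dirs(f"{tmp}/data/07_model_output/D2-24-25/idata.nc")
    print(target.parent.is_dir())  # → True
```

Note this only covers local filesystems; remote protocols would go through fsspec instead.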


@Yolan Honoré-Rougé
Both versions seem to work: pathlib.Path and pathlib.PurePosixPath. I declared them either like the other classes do (after self.metadata) or at the end (after self._ismultifile). I didn't want to disturb the is_multifile logic by creating it beforehand, but it seems PurePosixPath can handle being passed its own type.
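A quick check confirming that the pathlib constructors accept their own type without changing the value, which is why the placement relative to the is_multifile logic doesn't matter:

```python
from pathlib import Path, PurePosixPath

p = PurePosixPath("data/07_model_output/idata.nc")
# Re-wrapping an existing PurePosixPath is a no-op:
assert PurePosixPath(p) == p
# Path() likewise accepts a str or another path object interchangeably:
assert Path(p).as_posix() == "data/07_model_output/idata.nc"
print("ok")  # → ok
```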

A minimal sample:

import numpy as np
import arviz as az

def test_netCDF():
    size = 100
    dataset = az.convert_to_inference_data(np.random.randn(size))

    return az.convert_to_dataset(dataset)

@juanlu
  1. As mentioned by Yolan, this class is created by kedro-mlflow and is not implemented by me.
  2. See below for both tracebacks.

Traceback for the missing _filepath as an instance of Path:
Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 271, in save
    save_func(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_mlflow\io\artifacts\mlflow_artifact_dataset.py", line 63, in _save
    local_path = local_path.as_posix()
                 ^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'as_posix'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 263, in main
    cli_collection()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 163, in main
    super().main(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\project.py", line 228, in run
    return session.run(
           ^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\session\session.py", line 399, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\sequential_runner.py", line 85, in _run
    ).execute()
      ^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 88, in execute
    node = self._run_node_sequential(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\data_catalog.py", line 438, in save
    dataset.save(data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 276, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/___Studium/Bachelor_Arbeit/ba_env/bundesliga/data/07_model_output/D1-24-25/pymc/idata_fit.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': a}).
'str' object has no attribute 'as_posix'

Traceback for _filepath set to Path but missing folders:
Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 211, in _acquire_with_cache_info
    file = self._cache[self._key]
           ~~~~~~~~~~~^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
            ~~~~~~~~~~~^^^^^
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), '8aa8dfaa-e6a7-47e2-8b44-b700e528ffb8']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 271, in save
    save_func(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_mlflow\io\artifacts\mlflow_artifact_dataset.py", line 66, in _save
    super().save.__wrapped__(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_datasets_experimental\netcdf\netcdf_dataset.py", line 172, in save
    data.to_netcdf(path=self._filepath, **self._save_args)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\core\dataset.py", line 2372, in to_netcdf
    return to_netcdf(  # type: ignore[return-value]  # mypy cannot resolve the overloads:(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\api.py", line 1856, in to_netcdf
    store = store_open(target, mode, format, group, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 452, in open
    return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 393, in __init__
    self.format = self.ds.data_model
                  ^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 461, in ds
    return self._acquire()
           ^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 455, in _acquire
    with self._manager.acquire_context(needs_lock) as root:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 199, in acquire_context
    file, cached = self._acquire_with_cache_info(needs_lock)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 217, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src\\netCDF4\\_netCDF4.pyx", line 2521, in netCDF4._netCDF4.Dataset.__init__
  File "src\\netCDF4\\_netCDF4.pyx", line 2158, in netCDF4._netCDF4._ensure_nc_success
PermissionError: [Errno 13] Permission denied: 'S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc'  

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 263, in main
    cli_collection()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 163, in main
    super().main(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\project.py", line 228, in run
    return session.run(
           ^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\session\session.py", line 399, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\sequential_runner.py", line 85, in _run
    ).execute()
      ^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 88, in execute
    node = self._run_node_sequential(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\data_catalog.py", line 438, in save
    dataset.save(data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 276, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/___Studium/Bachelor_Arbeit/ba_env/bundesliga/data/07_model_output/D1-24-25/pymc/idata_fit.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': a}).
[Errno 13] Permission denied: 'S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc' 
