Hey Kedro community,
I'm currently working on a project, trying to use kedro-mlflow
to store kedro_datasets_experimental.netcdf datasets
as artifacts. Unfortunately I can't make it work.
The problem seems to be path related:
```
kedro.io.core.DatasetError: Failed while saving data to dataset
MlflowNetCDFDataset(filepath=S:/…/data/07_model_output/D2-24-25/idata.nc,
load_args={'decode_times': False}, protocol=file, save_args={'mode': w}).
'str' object has no attribute 'as_posix'
```

I tried to investigate it to the best of my abilities, and it seems to have to do with the initialization of NetCDFDataset. Most datasets inherit from AbstractVersionedDataset and call its __init__ with their _filepath as a str. NetCDFDataset is missing this, so the PurePosixPath is never created. Whether this is the actual problem in the end I don't know, but it is the point where other datasets have their path set. In the meantime I thought it might be because mlflow isn't capable of tracking datasets which don't inherit from AbstractVersionedDataset, but the kedro-mlflow documentation says MlflowArtifactDataset is a wrapper for all AbstractDatasets.

I tried setting self._filepath = PurePosixPath(filepath) myself in the site-packages, but I'm getting a PermissionError on saving, and that's where my journey has to end. Would have been too good if this one-liner had done it ^^

catalog.yml:

```yaml
"{dataset}.idata":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: kedro_datasets_experimental.netcdf.NetCDFDataset
    filepath: data/07_model_output/{dataset}/idata.nc
    save_args:
      mode: a
    load_args:
      decode_times: False
```

node.py:
```python
def predict(model, x_data):
    idata = model.predict(x_data)
    return az.convert_to_dataset(idata)
```

pipeline.py:
```python
pipeline_inference = pipeline(
    [
        node(
            func=predict,
            inputs={
                "model": f"{dataset}.model",
                "x_data": f"{dataset}.x_data",
            },
            outputs=f"{dataset}.idata",
            name=f"{dataset}.predict_node",
            tags=["training"],
        ),
    ]
)
```
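For reference, the one-line fix I attempted amounts to converting the raw string filepath into a PurePosixPath inside the dataset's __init__, which is what gives wrappers something with an .as_posix() method to call. A minimal sketch of the pattern (the class below is a stand-in for illustration, not the real NetCDFDataset):

```python
from pathlib import PurePosixPath


class SketchNetCDFDataset:
    """Stand-in class illustrating the attempted fix: store _filepath
    as a PurePosixPath instead of a plain str."""

    def __init__(self, filepath, save_args=None):
        # The attempted one-liner: convert the raw string so that a wrapper
        # calling self._filepath.as_posix() no longer hits an AttributeError.
        self._filepath = PurePosixPath(filepath)
        self._save_args = save_args or {}


ds = SketchNetCDFDataset("data/07_model_output/D2-24-25/idata.nc")
print(ds._filepath.as_posix())  # data/07_model_output/D2-24-25/idata.nc
```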
hi @Philipp Dahlke, sorry you had a bumpy experience!
I have a couple of questions:
1. Is MlflowNetCDFDataset a custom dataset you created? (from the first error you reported)
2. When wrapping NetCDFDataset inside MlflowArtifactDataset (second code snippet), what error did you get? Could you share the full traceback?

Hi @Philipp Dahlke, sorry for the bad experience.
Unfortunately, Kedro's abstract (unversioned) datasets don't necessarily have a _filepath attribute, and neither its format nor its access is standardized. See this issue (and maybe report your bug there to help with prioritizing; this is tangentially related to our DataCatalog refactoring, but for datasets): https://github.com/kedro-org/kedro/discussions/3753. kedro-mlflow has focused a lot on AbstractVersionedDatasets, and it may have some flaws for such unusual datasets.
I think your fix attempt is the right one. In its __init__, the NetCDFDataset should convert the filepath to a Path. Can you:
1. try pathlib.Path instead of pathlib.PurePosixPath and see if it works?
2. share your configuration of the NetCDFDataset so that I can try it on my own? The MlflowNetCDFDataset is created under the hood by the MlflowArtifactDataset.
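For what it's worth, both classes expose .as_posix(), which is the call that fails in the traceback; the practical difference is that pathlib.Path is a concrete path (it instantiates as WindowsPath or PosixPath and can do filesystem I/O), while PurePosixPath is purely computational. A quick check, independent of Kedro:

```python
from pathlib import Path, PurePosixPath

p_concrete = Path("data/07_model_output/idata.nc")
p_pure = PurePosixPath("data/07_model_output/idata.nc")

# Both expose .as_posix(), the method missing on a plain str _filepath
print(p_concrete.as_posix())  # data/07_model_output/idata.nc
print(p_pure.as_posix())      # data/07_model_output/idata.nc

# Only the concrete Path can touch the filesystem; pure paths have no
# I/O methods such as .exists() or .mkdir()
print(hasattr(p_pure, "exists"))  # False
```

Either choice would fix the AttributeError; PurePosixPath matches what other Kedro datasets store.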
I am also tagging @Riley Brady, the author of the NetCDFDataset.
Riley, is there a reason we don't set self._filepath = PurePosixPath(filepath)? If not, can we make the change to the dataset to handle it?
Thanks for your help.
I discovered that missing folders in my workspace, which are declared for the dataset in the catalog.yml, raise the permission error (see first post). After I created those folders manually, I can save the NetCDFDataset with the change made to self._filepath in NetCDFDataset.__init__.
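The missing-folder PermissionError can also be avoided programmatically: netCDF4 on Windows reports Errno 13 rather than a "no such directory" error when the target folder is absent, so creating the parent directories before saving (a sketch of what a dataset's save step could do, not what NetCDFDataset currently does) sidesteps the manual step:

```python
import os
import tempfile
from pathlib import Path


def save_with_parents(filepath, writer):
    """Sketch: ensure parent folders exist before handing the path to the
    writer; the netCDF backend cannot create directories itself."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)  # no-op if already there
    writer(path)


# Demo with a plain text writer standing in for xarray's to_netcdf
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "07_model_output", "D2-24-25", "idata.nc")
    save_with_parents(target, lambda p: p.write_text("stub"))
    print(Path(target).exists())  # True
```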
@Yolan Honoré-Rougé
Both versions seem to work, pathlib.Path and pathlib.PurePosixPath. I declared them either like in other classes, right after self.metadata, or at the end after self._ismultifile. I didn't want to disturb the is_multifile logic by creating the path beforehand, but it seems like PurePosixPath can handle getting its own type passed.
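That observation is easy to verify: PurePosixPath accepts another path object as its argument, so the conversion is safe regardless of where in __init__ it runs (a quick check, independent of Kedro):

```python
from pathlib import PurePosixPath

once = PurePosixPath("data/07_model_output/idata.nc")
twice = PurePosixPath(once)  # passing its own type is fine

print(once == twice)     # True: the conversion is idempotent
print(twice.as_posix())  # data/07_model_output/idata.nc
```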
A minimal sample:

```python
import numpy as np
import arviz as az


def test_netCDF():
    size = 100
    dataset = az.convert_to_inference_data(np.random.randn(size))
    return az.convert_to_dataset(dataset)
```
```
Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 271, in save
    save_func(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_mlflow\io\artifacts\mlflow_artifact_dataset.py", line 63, in _save
    local_path = local_path.as_posix()
AttributeError: 'str' object has no attribute 'as_posix'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 263, in main
    cli_collection()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 163, in main
    super().main(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\project.py", line 228, in run
    return session.run(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\session\session.py", line 399, in run
    run_result = runner.run(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\sequential_runner.py", line 85, in _run
    ).execute()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 88, in execute
    node = self._run_node_sequential(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\data_catalog.py", line 438, in save
    dataset.save(data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 276, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/___Studium/Bachelor_Arbeit/ba_env/bundesliga/data/07_model_output/D1-24-25/pymc/idata_fit.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': a}). 'str' object has no attribute 'as_posix'
```
```
Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 211, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), '8aa8dfaa-e6a7-47e2-8b44-b700e528ffb8']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 271, in save
    save_func(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_mlflow\io\artifacts\mlflow_artifact_dataset.py", line 66, in _save
    super().save.__wrapped__(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_datasets_experimental\netcdf\netcdf_dataset.py", line 172, in save
    data.to_netcdf(path=self._filepath, **self._save_args)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\core\dataset.py", line 2372, in to_netcdf
    return to_netcdf(  # type: ignore[return-value]  # mypy cannot resolve the overloads:(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\api.py", line 1856, in to_netcdf
    store = store_open(target, mode, format, group, **kwargs)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 452, in open
    return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 393, in __init__
    self.format = self.ds.data_model
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 461, in ds
    return self._acquire()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 455, in _acquire
    with self._manager.acquire_context(needs_lock) as root:
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\contextlib.py", line 137, in __enter__
    return next(self.gen)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 199, in acquire_context
    file, cached = self._acquire_with_cache_info(needs_lock)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 217, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
  File "src\netCDF4\_netCDF4.pyx", line 2521, in netCDF4._netCDF4.Dataset.__init__
  File "src\netCDF4\_netCDF4.pyx", line 2158, in netCDF4._netCDF4._ensure_nc_success
PermissionError: [Errno 13] Permission denied: 'S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 263, in main
    cli_collection()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 163, in main
    super().main(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\project.py", line 228, in run
    return session.run(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\session\session.py", line 399, in run
    run_result = runner.run(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\sequential_runner.py", line 85, in _run
    ).execute()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 88, in execute
    node = self._run_node_sequential(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\data_catalog.py", line 438, in save
    dataset.save(data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 276, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/___Studium/Bachelor_Arbeit/ba_env/bundesliga/data/07_model_output/D1-24-25/pymc/idata_fit.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': a}). [Errno 13] Permission denied: 'S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc'
```