Hello, is there a way to get Kedro to create folders in paths if they do not exist? For instance, if my data structure is

```
data:
  outputs:
```

and I have the catalog entry

```yaml
data@pandas:
  type: pandas.CSVDataset
  filepath: data/outputs/csv_files/data.csv
```

it would be nice for Kedro to automatically create `csv_files` inside `data/outputs` and store `data.csv` there afterwards.

Similarly, returning `{'non_existent_subfolder/file_name': data}`
from a node to be saved in the catalog entry

```yaml
data@PartitionedDataset:
  type: partitions.PartitionedDataset
  path: data/outputs
  dataset:
    type: pandas.ExcelDataset
    save_args:
      sheet_name: Sheet1
    load_args:
      sheet_name: Sheet1
  filename_suffix: ".xlsx"
```

it would be nice for Kedro to create `non_existent_subfolder` automatically inside `data/outputs`.
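For concreteness, the node producing that dictionary would look roughly like this (a sketch; the function name and the DataFrame input are placeholders):

```python
import pandas as pd


def make_partitions(data: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # The key encodes a subfolder ('non_existent_subfolder') that does not
    # yet exist under data/outputs; ideally Kedro would create it on save.
    return {"non_existent_subfolder/file_name": data}
```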
I already tried it and Kedro does not create folders when they don't exist. Is there a way of changing this default behaviour?

Hello, I'm facing a memory issue in my Kedro project and I would like to know if there is a Kedro-oriented solution.
I am developing a pipeline for processing large datasets of audio recordings. This involves processing several audio files (large numpy arrays) in a single node and storing them again. I was relying on PartitionedDatasets to do so, but I'm having memory issues: building the dictionary of numpy arrays is quite heavy and always ends up consuming all of my tiny memory.
Is there a way of storing each processed file as soon as it is done, instead of keeping them all in RAM until the last one is done? Of course this is possible in many ways, but my question is about Kedro: is it possible to save in the body of the function using Kedro and partitioned datasets? Have any of you experienced something like this before?
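Roughly, my current node looks like this (a simplified sketch; `transform` stands in for the real processing):

```python
import numpy as np


def transform(audio: np.ndarray) -> np.ndarray:
    # Placeholder for the real processing step
    return audio * 0.5


def process_recordings(partitions: dict) -> dict:
    # Loading a PartitionedDataset gives {partition_id: load_function}
    processed = {}
    for partition_id, load_partition in partitions.items():
        audio: np.ndarray = load_partition()  # load one recording
        processed[partition_id] = transform(audio)
    # Every processed array stays in this dict (and in RAM) until the node
    # returns and the PartitionedDataset saves them all at the end.
    return processed
```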
Best,
Nicolas
Hi everyone!
I'm trying to run the following node in Kedro:

```python
def test(a):
    print(a)
    return 2 + 2
```

```python
node(
    func=test,
    inputs=['params:parameter'],
    outputs="not_in_catalog",
    name="test_node",
),
```
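For completeness, the node sits inside the usual `create_pipeline` wrapper, roughly like this (a sketch; the import path of `test` is assumed):

```python
from kedro.pipeline import Pipeline, node

from .nodes import test


def create_pipeline(**kwargs) -> Pipeline:
    # Wraps the node so `kedro run --nodes test_node` can find it by name
    return Pipeline(
        [
            node(
                func=test,
                inputs=["params:parameter"],
                outputs="not_in_catalog",
                name="test_node",
            ),
        ]
    )
```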
`test()` is in `nodes.py` and the node in `pipeline.py`. When I run `kedro run --nodes test_node`, I get the following log:
```
(pamflow_kedro_env) s0nabio@hub:~/kedroPamflow$ kedro run --nodes test_node
[10/10/24 14:49:06] INFO  Using '/home/s0nabio/miniconda3/envs/pamflow_kedro_env/lib/python3.10/site-packages/kedro/framework/project/rich_logging.yml' as logging configuration. __init__.py:249
[10/10/24 14:49:07] INFO  Kedro project kedroPamflow session.py:327
Illegal instruction (core dumped)
```

I already ran Kedro in the active environment (Python 3.10.14) on a Windows machine and it worked. Now I'm trying to run it on a Linux VM, and that is when I get the error. The only libraries I have installed are:
```
birdnetlib==0.17.2
contextily==1.6.2
fsspec==2024.9.0
geopandas==1.0.1
kedro==0.19.8
kedro_datasets==4.1.0
librosa==0.10.2
matplotlib==3.6.2
numpy==1.23.5
pandas==2.2.3
pytest==8.3.3
PyYAML==6.0.2
scikit-maad==1.4.1
seaborn==0.13.2
statsmodels==0.14.4
tensorflow==2.17.0
```

If I run `test()` using Python directly in the terminal instead of through Kedro, I don't get the error. That's why I'm here: without any warnings, and just when I try to run the simplest Kedro node, I get the error.

Hi all! I have worked with Kedro many times on different operating systems and I have never had issues with catalog path entries. It has always been fine to define catalog entries such as
```yaml
catalog_entry:
  type: AnyDataset
  filepath: data/01_raw/file.extension
```

whether on Windows or Mac. Now I'm having an issue with it for the first time. It turns out that the following catalog entry

```yaml
problematic_catalog_entry:
  type: MyCustomDataSet
  mainfolderpath: data/01_raw/file.extension
```

raises a

```
WinError 3: The system cannot find the path specified
```

when loaded from a Kedro Jupyter Notebook, but

```yaml
problematic_catalog_entry_2:
  type: MyCustomDataSet
  mainfolderpath: C:\same\path\but\absolute\data\01_raw\file.extension
```

doesn't.
`MyCustomDataSet` is a custom `AbstractDataset`, but I don't have this problem with my other custom `AbstractDataset` implementations. I will attach my `_load` method because the problem might be there:

```python
def _load(self):
    # Collect the names of the immediate subfolders of the main folder
    subfolder_names = [
        subfolder_name
        for subfolder_name in os.listdir(self._mainfolderpath)
        if os.path.isdir(os.path.join(self._mainfolderpath, subfolder_name))
    ]

    # Map each subfolder to a dict of {wav file name: wav file path}
    wav_paths_dict = {}
    for subfolder_name in subfolder_names:
        subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)
        wav_files = []
        for root, dirs, files in os.walk(subfolder_path):
            for file in files:
                if file.lower().endswith('.wav'):
                    wav_file_path = os.path.join(root, file)
                    wav_file_name = (
                        os.path.split(wav_file_path)[-1]
                        .replace('.wav', '')
                        .replace('.WAV', '')
                    )
                    wav_files.append((wav_file_name, wav_file_path))
        wav_paths_dict[subfolder_name] = dict(wav_files)

    # Load every wav file through SoundDataset, keeping the folder structure
    partitioned_dataset_dict = {}
    for subfolder_name, sub_dict in wav_paths_dict.items():
        partitioned_dataset = [
            (wav_file_name, SoundDataset(wav_file_path).load())
            for wav_file_name, wav_file_path in sub_dict.items()
        ]
        partitioned_dataset_dict[subfolder_name] = dict(partitioned_dataset)
    return partitioned_dataset_dict
```
In `__init__` I'm initializing `self._mainfolderpath` this way:

```python
self._mainfolderpath = PurePosixPath(mainfolderpath)
```

Thank you very much for your help again.

Hi everyone,
I'm working on a Kedro project where I want to automatically label thousands of audio files, apply transformations to them, and then store them in a folder of folders, each subfolder corresponding to one label. I want that folder of folders to be a catalog entry in my YAML file.
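To make the goal concrete, the on-disk layout I'm after looks roughly like this (label and file names are invented):

```
data/output/labelled_audios/
├── label_a/
│   ├── recording_001.WAV
│   └── recording_002.WAV
└── label_b/
    └── recording_003.WAV
```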
I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files in the Kedro catalog. I am also able to create `PartitionedDataset` catalog entries in `catalog.yml` such as
```yaml
audio_folder:
  type: partitions.PartitionedDataset
  dataset: my_kedro_project.datasets.audio_dataset.SoundDataset
  path: data/output/audios/
  filename_suffix: ".WAV"
```

The next level of abstraction I would require is to be able to create a catalog entry corresponding to a folder containing folders, such as the `audio_folder` above. Here is my attempt to do so, but I'm having an issue with the `_save` method:
```python
import os
from pathlib import PurePosixPath

import fsspec
from kedro.io.core import get_protocol_and_path
from kedro_datasets.partitions import PartitionedDataset

from my_kedro_project.datasets.audio_dataset import SoundDataset


class AudioFolderDataset(PartitionedDataset):
    def __init__(self, main_folder_path: str):
        """Creates a new instance of AudioFolderDataset to load / save audio data
        for the given folder path.

        Args:
            main_folder_path: The location of the folder of folders to load / save data.
        """
        protocol, mainfolderpath = get_protocol_and_path(main_folder_path)
        self._protocol = protocol
        self._mainfolderpath = PurePosixPath(mainfolderpath)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self, subfolders_dictionary):
        # loading code
        ...

    def _save(self, subfolders_dictionary):
        # Note: normpath returns a new string; this call discards its result
        os.path.normpath(self._mainfolderpath)
        for subfolder_name in subfolders_dictionary.keys():
            subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)
            # Delegate saving of each subfolder to a nested PartitionedDataset
            partitioned_dataset = PartitionedDataset(
                path=subfolder_path,
                dataset=SoundDataset,
                filename_suffix=".WAV",
            )
            partitioned_dataset.save(subfolders_dictionary[subfolder_name])

    def _describe(self):
        # describe code
        ...
```

The problem is that I'm working on Windows, but it seems that `PartitionedDataset` assumes my system separator is `/` instead of `\`. When I print the path in the `_save` method of the `SoundDataset` class, I get

```
folder\\subfolder/file.WAV
```

which of course is leading to an error.
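For what it's worth, here is a minimal sketch of what I think is happening (assuming Windows, where `os.path.join` uses `\` even when given a `PurePosixPath`):

```python
import os
from pathlib import PurePosixPath

main = PurePosixPath("folder")
subfolder_path = os.path.join(main, "subfolder")
print(subfolder_path)
# On Windows this prints 'folder\subfolder': os.path.join uses the local
# separator even though the input was a PurePosixPath. PartitionedDataset
# then appends the partition id with '/', which is how I end up with
# 'folder\subfolder/file.WAV'.
```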