Hi everyone,
I'm working on a Kedro project where I want to automatically label thousands of audio files, apply transformations to them, and then store them in a folder of folders, each subfolder corresponding to one label. I want that folder of folders to be a catalog entry in my yml file.
I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files in the Kedro catalog. I am also able to create PartitionedDataset catalog entries in catalog.yml, such as

    audio_folder:
      type: partitions.PartitionedDataset
      dataset: my_kedro_project.datasets.audio_dataset.SoundDataset
      path: data/output/audios/
      filename_suffix: ".WAV"

The next level of abstraction I need is a catalog entry corresponding to a folder containing folders such as the audio_folder above. Here is my attempt, but I'm having an issue with the _save
method:

    class AudioFolderDataset(PartitionedDataset):
        def __init__(self, main_folder_path: str):
            """Creates a new instance of SoundDataset to load / save audio data for given filepath.

            Args:
                filepath: The location of the audio file to load / save data.
            """
            protocol, mainfolderpath = get_protocol_and_path(main_folder_path)
            self._protocol = protocol
            self._mainfolderpath = PurePosixPath(mainfolderpath)
            self._fs = fsspec.filesystem(self._protocol)

        def _load(self, subfolders_dictionary):
            # loading code .

        def _save(self, subfolders_dictionary):
            os.path.normpath(self._mainfolderpath)
            for subfolder_name in subfolders_dictionary.keys():
                subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)
                partitioned_dataset = PartitionedDataset(
                    path=subfolder_path,
                    dataset=SoundDataset,
                    filename_suffix=".WAV",
                )
                partitioned_dataset.save(subfolders_dictionary[subfolder_name])
                partitioned_dataset.save(subfolders_dictionary[subfolder_name])

        def _describe(self):
            # describe code

The problem is I'm working on Windows, but it seems that
PartitionedDataset assumes that my system separator is / instead of \. When I print the path in the _save method of the SoundDataset class I get folder\\subfolder/file.WAV, which of course leads to an error.

I think it's mostly due to your implementation using os.path.join etc. If you use pathlib.Path, you should be able to handle these paths properly regardless of your OS.
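
As a rough sketch of the difference between the two ways of joining: ntpath is the module that os.path resolves to on Windows, so it is used here to mimic os.path.join on any machine. The folder and label names are made up for illustration. PurePosixPath is an assumption on my side; it is picked because the question's __init__ already stores the main folder as a PurePosixPath and because it keeps forward slashes on every OS.

```python
import ntpath
from pathlib import PurePosixPath

main_folder = "data/output/audios"  # hypothetical main folder
label = "label_a"                   # hypothetical subfolder name

# Mimics os.path.join on Windows: the new component is attached with a backslash,
# which yields a path with mixed separators.
windows_style = ntpath.join(main_folder, label)

# PurePosixPath always joins with "/", whatever the host OS is.
posix_style = str(PurePosixPath(main_folder) / label)

print(windows_style)  # data/output/audios\label_a
print(posix_style)    # data/output/audios/label_a
```

The second form is the kind of string that could be passed as the path of each inner PartitionedDataset without mixing separators.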
On the other hand, I see you have a PartitionedDataset inside your implementation. This feels a bit weird, since you are inheriting from PartitionedDataset at the same time.
I would approach this differently, since you mentioned that a folder of files is considered a single "Dataset".
Hey, yes I'm the author. Thank you for your answer. After trying different approaches, extending AbstractDataset to load and save folders of folders using dictionaries of dictionaries did what I needed. Thank you for your help.
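
The actual solution was not posted in this thread, so the following is only a minimal, stdlib-only sketch of the dictionaries-of-dictionaries idea. The class name FolderOfFoldersStore and the use of raw bytes as the file payload are hypothetical. In a Kedro project the same logic would presumably live in a class extending AbstractDataset, with the audio reading and writing delegated to something like SoundDataset.

```python
import tempfile
from pathlib import Path
from typing import Dict


class FolderOfFoldersStore:
    """Hypothetical stand-in for a custom dataset; not a Kedro class."""

    def __init__(self, main_folder_path: str, suffix: str = ".WAV"):
        self._root = Path(main_folder_path)
        self._suffix = suffix

    def save(self, data: Dict[str, Dict[str, bytes]]) -> None:
        # Outer keys become subfolders (labels), inner keys become file names.
        for label, files in data.items():
            subfolder = self._root / label
            subfolder.mkdir(parents=True, exist_ok=True)
            for name, payload in files.items():
                (subfolder / f"{name}{self._suffix}").write_bytes(payload)

    def load(self) -> Dict[str, Dict[str, bytes]]:
        # Rebuild the same nested mapping from the directory tree.
        result = {}
        for subfolder in sorted(p for p in self._root.iterdir() if p.is_dir()):
            result[subfolder.name] = {
                f.stem: f.read_bytes()
                for f in sorted(subfolder.glob(f"*{self._suffix}"))
            }
        return result


# Round trip through a temporary directory with made-up labels and payloads.
with tempfile.TemporaryDirectory() as tmp:
    store = FolderOfFoldersStore(tmp)
    store.save({"dog": {"clip1": b"\x00\x01"}, "cat": {"clip2": b"\x02"}})
    round_trip = store.load()

print(sorted(round_trip))
```

Building every path with pathlib's / operator keeps the joining in one place, so separators are never mixed by hand.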
Perfect! If you don't mind sharing your solution, it would be great if you could self-answer in that thread.