Updated 4 months ago

Automatically labeling and transforming audio files in a Kedro project

Hi everyone,

I'm working on a Kedro project where I want to automatically label thousands of audio files, apply transformations to them, and then store them in a folder of folders, with each subfolder corresponding to one label. I want that folder of folders to be a catalog entry in my yml file.

I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files in the Kedro catalog. I am also able to create PartitionedDataset catalog entries in catalog.yml, such as:

audio_folder:
  type: partitions.PartitionedDataset
  dataset: my_kedro_project.datasets.audio_dataset.SoundDataset
  path: data/output/audios/
  filename_suffix: ".WAV"
The next level of abstraction I need is a catalog entry corresponding to a folder containing folders such as the audio_folder above. Here is my attempt, but I'm having an issue with the _save method:

import os
from pathlib import PurePosixPath

import fsspec
from kedro.io.core import get_protocol_and_path
# Assumed import path, matching "partitions.PartitionedDataset" in catalog.yml
from kedro_datasets.partitions import PartitionedDataset

from my_kedro_project.datasets.audio_dataset import SoundDataset


class AudioFolderDataset(PartitionedDataset):

    def __init__(self, main_folder_path: str):
        """Creates a new instance of AudioFolderDataset to load / save a folder of audio subfolders.

        Args:
            main_folder_path: The location of the main folder that holds one subfolder per label.
        """
        protocol, mainfolderpath = get_protocol_and_path(main_folder_path)
        self._protocol = protocol
        self._mainfolderpath = PurePosixPath(mainfolderpath)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self, subfolders_dictionary):
        # loading code
        ...

    def _save(self, subfolders_dictionary):
        # Note: the result of normpath is discarded, so this line has no effect.
        os.path.normpath(self._mainfolderpath)
        for subfolder_name in subfolders_dictionary.keys():
            # os.path.join uses the OS separator ("\" on Windows).
            subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)

            partitioned_dataset = PartitionedDataset(
                path=subfolder_path,
                dataset=SoundDataset,
                filename_suffix=".WAV",
            )

            partitioned_dataset.save(subfolders_dictionary[subfolder_name])

    def _describe(self):
        # describe code
        ...
The problem is that I'm working on Windows, but PartitionedDataset seems to assume that my system separator is / instead of \. When I print the path in the _save method of the SoundDataset class, I get folder\\subfolder/file.WAV, which of course leads to an error.
Is there a way to change this default behaviour?

6 comments

Hey, I guess you are the author of the same Stack Overflow thread?

I think it's mostly due to your implementation using os.path.join etc. If you use pathlib.Path, you should be able to handle these paths properly regardless of your OS.
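To illustrate the separator issue, here is a minimal sketch using only the standard library. It assumes the partition name is appended with "/" (as the printed path in the question suggests), and label_a is a hypothetical subfolder name. ntpath is used so that the Windows behaviour of os.path.join can be reproduced on any OS. Note that str() of a plain pathlib.Path on Windows still renders backslashes, so the sketch uses PurePosixPath and as_posix() to get forward slashes.

```python
import ntpath
from pathlib import PurePosixPath, PureWindowsPath

main_folder = "data/output/audios"
label = "label_a"  # hypothetical subfolder name

# What os.path.join produces on Windows (ntpath is the Windows flavour of os.path).
windows_joined = ntpath.join(ntpath.normpath(main_folder), label)

# Assumption: the file name is later appended with "/", giving mixed separators.
mixed = windows_joined + "/" + "file.WAV"

# Joining with PurePosixPath keeps "/" as the separator on every OS.
posix_joined = (PurePosixPath(main_folder) / label).as_posix()

# A path that already contains backslashes can be normalised the same way.
normalised = PureWindowsPath(windows_joined).as_posix()

print(mixed)
print(posix_joined)
print(normalised)
```

Passing a forward-slash string such as posix_joined as the path may avoid the mixed-separator result, though this has not been checked against a specific Kedro version.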

On the other hand, I see you have a PartitionedDataset inside your implementation. This feels a bit weird, since you are inheriting from PartitionedDataset at the same time.

I would approach this differently, since you mentioned that a folder of files is considered a single "Dataset".

  1. Keep PartitionedDataset if it's flexible enough for you; otherwise, extend it to iterate over folders however you need.
  2. Implement your own AudioDataset that loads a single folder as one piece of data.

Hey, yes, I'm the author. Thank you for your answer. After trying different approaches, extending AbstractDataset to load and save folders of folders using dictionaries of dictionaries did what I needed. Thank you for your help.
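The actual solution is not shown in this thread, so the following is only a rough sketch of the "dictionary of dictionaries" idea, written without Kedro so it runs on its own. The function names save_nested and load_nested are hypothetical, and raw bytes stand in for audio data. In a real project this logic would presumably live in the _save and _load methods of an AbstractDataset subclass, with SoundDataset handling each file.

```python
import tempfile
from pathlib import Path
from typing import Dict

# {label: {file name without suffix: payload}}
Nested = Dict[str, Dict[str, bytes]]


def save_nested(root: Path, data: Nested, suffix: str = ".WAV") -> None:
    """Write {label: {name: payload}} as root/label/name<suffix>."""
    for label, files in data.items():
        folder = root / label  # pathlib picks a consistent separator
        folder.mkdir(parents=True, exist_ok=True)
        for name, payload in files.items():
            (folder / f"{name}{suffix}").write_bytes(payload)


def load_nested(root: Path, suffix: str = ".WAV") -> Nested:
    """Read root/label/name<suffix> back into {label: {name: payload}}."""
    result: Nested = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        result[folder.name] = {
            f.stem: f.read_bytes()
            for f in sorted(folder.iterdir())
            if f.is_file() and f.suffix == suffix
        }
    return result


# Round trip in a temporary directory with placeholder payloads.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "audios"
    original = {"dog": {"bark_1": b"\x00\x01"}, "cat": {"meow_1": b"\x02"}}
    save_nested(root, original)
    restored = load_nested(root)
```

Building every path with pathlib in one place avoids mixing os.path.join output with "/"-separated partition names.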

Perfect! If you don't mind sharing your solution, it would be great if you could self-answer in that thread.

I will do so right away, since it finally worked. Thanks!
