Hello, is there a way to get Kedro to create folders in paths if they do not exist? For instance, if my data structure is

```
data:
  outputs:
```

and I have the catalog entry

```yaml
data@pandas:
  type: pandas.CSVDataset
  filepath: data/outputs/csv_files/data.csv
```

it would be nice for Kedro to automatically create `csv_files` inside `data/outputs` and store `data.csv` there afterwards.

Similarly, returning `{'non_existent_subfolder/file_name': data}`
from a node to be saved in the catalog entry

```yaml
data@PartitionedDataset:
  type: partitions.PartitionedDataset
  path: data/outputs
  dataset:
    type: pandas.ExcelDataset
    save_args:
      sheet_name: Sheet1
    load_args:
      sheet_name: Sheet1
  filename_suffix: ".xlsx"
```

it would be nice for Kedro to create `non_existent_subfolder` automatically inside `data/outputs`.
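For concreteness, the node producing that dictionary would look roughly like this (a sketch; the function name and the DataFrame input are placeholders):

```python
import pandas as pd


def make_partitions(data: pd.DataFrame) -> dict[str, pd.DataFrame]:
    # The key encodes a subfolder ('non_existent_subfolder') that does not
    # yet exist under data/outputs; ideally Kedro would create it on save.
    return {"non_existent_subfolder/file_name": data}
```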
I already tried it and Kedro does not create folders when they don't exist. Is there a way of changing this default behaviour?

Hello, I'm facing a memory issue in my Kedro project and I would like to know if there is a Kedro-oriented solution.
I am developing a pipeline for processing large datasets of audio recordings. This involves processing several audio files (large numpy arrays) in a single node and storing them again. I was relying on PartitionedDatasets to do so, but I'm having memory issues: building the dictionary of numpy arrays is quite heavy and always ends up consuming all of my tiny memory.
Is there a way of storing each processed file as soon as it is done, instead of keeping them all in RAM until the last one is done? Of course this is possible in many ways, but my question is about Kedro: is it possible to save in the body of the function using Kedro and partitioned datasets? Have any of you experienced something like this before?
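Roughly, my current node looks like this (a simplified sketch; `transform` stands in for the real processing):

```python
import numpy as np


def transform(audio: np.ndarray) -> np.ndarray:
    # Placeholder for the real processing step
    return audio * 0.5


def process_recordings(partitions: dict) -> dict:
    # Loading a PartitionedDataset gives {partition_id: load_function}
    processed = {}
    for partition_id, load_partition in partitions.items():
        audio: np.ndarray = load_partition()  # load one recording
        processed[partition_id] = transform(audio)
    # Every processed array stays in this dict (and in RAM) until the node
    # returns and the PartitionedDataset saves them all at the end.
    return processed
```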
Best,
Nicolas
Hi everyone!
I'm trying to run the following node in Kedro:

```python
def test(a):
    print(a)
    return 2 + 2
```

```python
node(
    func=test,
    inputs=['params:parameter'],
    outputs="not_in_catalog",
    name="test_node",
),
```
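For completeness, the node sits inside the usual `create_pipeline` wrapper, roughly like this (a sketch; the import path of `test` is assumed):

```python
from kedro.pipeline import Pipeline, node

from .nodes import test


def create_pipeline(**kwargs) -> Pipeline:
    # Wraps the node so `kedro run --nodes test_node` can find it by name
    return Pipeline(
        [
            node(
                func=test,
                inputs=["params:parameter"],
                outputs="not_in_catalog",
                name="test_node",
            ),
        ]
    )
```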
`test()` is in `nodes.py` and the node in `pipeline.py`. When I run `kedro run --nodes test_node`, I get the following log:
```
(pamflow_kedro_env) s0nabio@hub:~/kedroPamflow$ kedro run --nodes test_node
[10/10/24 14:49:06] INFO  Using '/home/s0nabio/miniconda3/envs/pamflow_kedro_env/lib/python3.10/site-packages/kedro/framework/project/rich_logging.yml' as logging configuration. __init__.py:249
[10/10/24 14:49:07] INFO  Kedro project kedroPamflow session.py:327
Illegal instruction (core dumped)
```

I already ran Kedro in the active environment (Python 3.10.14) on a Windows machine and it worked. Now I'm trying to run it on a Linux VM, and that is when I get the error. The only libraries I have installed are:
```
birdnetlib==0.17.2
contextily==1.6.2
fsspec==2024.9.0
geopandas==1.0.1
kedro==0.19.8
kedro_datasets==4.1.0
librosa==0.10.2
matplotlib==3.6.2
numpy==1.23.5
pandas==2.2.3
pytest==8.3.3
PyYAML==6.0.2
scikit-maad==1.4.1
seaborn==0.13.2
statsmodels==0.14.4
tensorflow==2.17.0
```

If I run `test()` using Python directly in the terminal instead of through Kedro, I don't get the error. That's why I'm here: without any warnings, and just when I try to run the simplest Kedro node, I get the error.

Hi all! I have worked with Kedro many times on different operating systems and I have never had issues with catalog path entries. It has always been fine to define catalog entries such as
```yaml
catalog_entry:
  type: AnyDataset
  filepath: data/01_raw/file.extension
```

whether on Windows or Mac. Now I'm having an issue with it for the first time. It turns out that the following catalog entry

```yaml
problematic_catalog_entry:
  type: MyCustomDataSet
  mainfolderpath: data/01_raw/file.extension
```

raises a

```
WinError 3: The system cannot find the path specified
```

when loaded from a Kedro Jupyter Notebook, but

```yaml
problematic_catalog_entry_2:
  type: MyCustomDataSet
  mainfolderpath: C:\same\path\but\absolute\data\01_raw\file.extension
```

doesn't.
`MyCustomDataSet` is a custom `AbstractDataset`, but I don't have this problem with my other custom `AbstractDataset` implementations. I will attach my `_load` method because the problem might be there:

```python
def _load(self):
    # Collect the names of the immediate subfolders of the main folder
    subfolder_names = [
        subfolder_name
        for subfolder_name in os.listdir(self._mainfolderpath)
        if os.path.isdir(os.path.join(self._mainfolderpath, subfolder_name))
    ]

    # Map each subfolder to a dict of {wav file name: wav file path}
    wav_paths_dict = {}
    for subfolder_name in subfolder_names:
        subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)
        wav_files = []
        for root, dirs, files in os.walk(subfolder_path):
            for file in files:
                if file.lower().endswith('.wav'):
                    wav_file_path = os.path.join(root, file)
                    wav_file_name = (
                        os.path.split(wav_file_path)[-1]
                        .replace('.wav', '')
                        .replace('.WAV', '')
                    )
                    wav_files.append((wav_file_name, wav_file_path))
        wav_paths_dict[subfolder_name] = dict(wav_files)

    # Load every wav file through SoundDataset, keeping the folder structure
    partitioned_dataset_dict = {}
    for subfolder_name, sub_dict in wav_paths_dict.items():
        partitioned_dataset = [
            (wav_file_name, SoundDataset(wav_file_path).load())
            for wav_file_name, wav_file_path in sub_dict.items()
        ]
        partitioned_dataset_dict[subfolder_name] = dict(partitioned_dataset)
    return partitioned_dataset_dict
```
In `__init__` I'm initializing `self._mainfolderpath` this way:

```python
self._mainfolderpath = PurePosixPath(mainfolderpath)
```

Thank you very much for your help again.

Hi everyone,
I'm working on a Kedro project where I want to automatically label thousands of audio files, apply transformations to them, and then store them in a folder of folders, each subfolder corresponding to one label. I want that folder of folders to be a catalog entry in my YAML file.
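To make the goal concrete, the on-disk layout I'm after looks roughly like this (label and file names are invented):

```
data/output/labelled_audios/
├── label_a/
│   ├── recording_001.WAV
│   └── recording_002.WAV
└── label_b/
    └── recording_003.WAV
```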
I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files in the Kedro catalog. I am also able to create `PartitionedDataset` catalog entries in `catalog.yml` such as
```yaml
audio_folder:
  type: partitions.PartitionedDataset
  dataset: my_kedro_project.datasets.audio_dataset.SoundDataset
  path: data/output/audios/
  filename_suffix: ".WAV"
```

The next level of abstraction I would require is to be able to create a catalog entry corresponding to a folder containing folders, such as the `audio_folder` above. Here is my attempt to do so, but I'm having an issue with the `_save` method:
```python
import os
from pathlib import PurePosixPath

import fsspec
from kedro.io.core import get_protocol_and_path
from kedro_datasets.partitions import PartitionedDataset

from my_kedro_project.datasets.audio_dataset import SoundDataset


class AudioFolderDataset(PartitionedDataset):
    def __init__(self, main_folder_path: str):
        """Creates a new instance of AudioFolderDataset to load / save audio data
        for the given folder path.

        Args:
            main_folder_path: The location of the folder of folders to load / save data.
        """
        protocol, mainfolderpath = get_protocol_and_path(main_folder_path)
        self._protocol = protocol
        self._mainfolderpath = PurePosixPath(mainfolderpath)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self, subfolders_dictionary):
        # loading code
        ...

    def _save(self, subfolders_dictionary):
        # Note: normpath returns a new string; this call discards its result
        os.path.normpath(self._mainfolderpath)
        for subfolder_name in subfolders_dictionary.keys():
            subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)
            # Delegate saving of each subfolder to a nested PartitionedDataset
            partitioned_dataset = PartitionedDataset(
                path=subfolder_path,
                dataset=SoundDataset,
                filename_suffix=".WAV",
            )
            partitioned_dataset.save(subfolders_dictionary[subfolder_name])

    def _describe(self):
        # describe code
        ...
```

The problem is that I'm working on Windows, but it seems that `PartitionedDataset` assumes my system separator is `/` instead of `\`. When I print the path in the `_save` method of the `SoundDataset` class, I get

```
folder\\subfolder/file.WAV
```

which of course is leading to an error.
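For what it's worth, here is a minimal sketch of what I think is happening (assuming Windows, where `os.path.join` uses `\` even when given a `PurePosixPath`):

```python
import os
from pathlib import PurePosixPath

main = PurePosixPath("folder")
subfolder_path = os.path.join(main, "subfolder")
print(subfolder_path)
# On Windows this prints 'folder\subfolder': os.path.join uses the local
# separator even though the input was a PurePosixPath. PartitionedDataset
# then appends the partition id with '/', which is how I end up with
# 'folder\subfolder/file.WAV'.
```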