Join the Kedro community

Nicolas Betancourt Cardona
Joined October 2, 2024

Hi everyone!
I'm trying to run the following node in Kedro:

def test(a):
    print(a)
    return 2 + 2

node(
    func=test,
    inputs=["params:parameter"],
    outputs="not_in_catalog",
    name="test_node",
),

test() is in nodes.py and the node in pipeline.py. When I run kedro run --nodes test_node I get the following log:

(pamflow_kedro_env) s0nabio@hub:~/kedroPamflow$ kedro run --nodes test_node
[10/10/24 14:49:06] INFO     Using '/home/s0nabio/miniconda3/envs/pamflow_kedro_env/lib/python3.10/site-packages/kedro/framework/project/rich_logging.yml' as logging configuration.                                                                                                          __init__.py:249
[10/10/24 14:49:07] INFO     Kedro project kedroPamflow                                                                                                                                                                                                                                        session.py:327
Illegal instruction (core dumped)
I already ran Kedro in this environment (Python 3.10.14) on a Windows machine and it worked. Now I'm trying to run it in a Linux VM, and that's when I get the error. The only libraries I have installed are

birdnetlib==0.17.2
contextily==1.6.2
fsspec==2024.9.0
geopandas==1.0.1
kedro==0.19.8
kedro_datasets==4.1.0
librosa==0.10.2
matplotlib==3.6.2
numpy==1.23.5
pandas==2.2.3
pytest==8.3.3
PyYAML==6.0.2
scikit-maad==1.4.1
seaborn==0.13.2
statsmodels==0.14.4
tensorflow==2.17.0
If I run test() directly from the terminal using Python instead of through Kedro, I don't get the error. That's why I'm here: without any warnings, and just when I try to run the simplest Kedro node, I get the error.
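One way to narrow this down (an assumption on my part: "Illegal instruction" at startup is often caused by a heavy dependency, such as TensorFlow, built for CPU instructions the VM doesn't expose, rather than by Kedro itself) is to import each installed library one at a time outside Kedro and see which import crashes the interpreter:

```python
import importlib

# Hypothetical diagnostic: import each heavy dependency one at a time.
# If the interpreter dies with "Illegal instruction" on one of these
# imports, that package is the likely culprit, not Kedro.
for name in ["numpy", "pandas", "librosa", "tensorflow"]:
    try:
        mod = importlib.import_module(name)
    except ImportError:
        print(name, "not installed")
    else:
        print(name, getattr(mod, "__version__", "unknown"))
```

If the crash happens on one specific import, reinstalling that package in a build matching the VM's CPU (or enabling the missing CPU flags in the hypervisor) would be the next thing to try.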

2 comments
Hi all! I have worked with Kedro many times on different operating systems and I have never had issues with catalog path entries. It has always been fine to define catalog entries like

catalog_entry:
  type: AnyDataset
  filepath: data/01_raw/file.extension
whether on Windows or Mac. Now I'm having an issue with it for the first time. It turns out that the following catalog entry

problematic_catalog_entry:
  type: MyCustomDataSet
  mainfolderpath: data/01_raw/file.extension

raises a WinError 3 (the system cannot find the path specified) when loaded from a Kedro Jupyter notebook, but

problematic_catalog_entry_2:
  type: MyCustomDataSet
  mainfolderpath: C:\same\path\but\absolute\data\01_raw\file.extension

doesn't.

This is very likely my fault, because the dataset type I'm using is a custom AbstractDataset, but I don't have this problem with other custom AbstractDatasets. I'll attach my _load method because the problem might be there:

def _load(self):
    subfolder_names = [
        subfolder_name
        for subfolder_name in os.listdir(self._mainfolderpath)
        if os.path.isdir(os.path.join(self._mainfolderpath, subfolder_name))
    ]

    wav_paths_dict = {}
    for subfolder_name in subfolder_names:
        subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)
        wav_files = []
        for root, dirs, files in os.walk(subfolder_path):
            for file in files:
                if file.lower().endswith('.wav'):
                    wav_file_path = os.path.join(root, file)
                    wav_file_name = os.path.split(wav_file_path)[-1].replace('.wav', '').replace('.WAV', '')
                    wav_files.append((wav_file_name, wav_file_path))
            wav_paths_dict[subfolder_name] = dict(wav_files)

    partitioned_dataset_dict = {}
    for subfolder_name, sub_dict in wav_paths_dict.items():
        partitioned_dataset = [
            (wav_file_name, SoundDataset(wav_file_path).load())
            for wav_file_name, wav_file_path in sub_dict.items()
        ]
        partitioned_dataset_dict[subfolder_name] = dict(partitioned_dataset)

    return partitioned_dataset_dict
In __init__ I'm initializing self._mainfolderpath this way: self._mainfolderpath = PurePosixPath(mainfolderpath). Thank you very much for your help again.
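A plausible cause (an assumption, since the full __init__ isn't shown): wrapping a relative path in PurePosixPath and passing it to os.listdir makes it resolve against whatever the current working directory happens to be, which differs between kedro run and a Kedro Jupyter notebook. A minimal sketch of resolving to an absolute path once, at construction time, with a hypothetical helper name:

```python
from pathlib import Path, PurePosixPath

def normalise_main_folder(mainfolderpath: str) -> Path:
    # Resolve the configured path against the current working directory
    # once, so later os.listdir calls do not depend on where the
    # process was launched from.
    return Path(mainfolderpath).resolve()

# PurePosixPath keeps a relative path relative; resolve() makes it absolute.
relative = PurePosixPath("data/01_raw/file.extension")
absolute = normalise_main_folder("data/01_raw/file.extension")
print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True
```

This would explain why the absolute C:\ entry works while the relative one fails only in the notebook.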

12 comments
Hi everyone,

I'm working on a Kedro project where I want to automatically label thousands of audio files, apply transformations to them, and then store them in a folder of folders, each subfolder corresponding to one label. I want that folder of folders to be a catalog entry in my YAML file.

I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files via the Kedro catalog. I am also able to create PartitionedDataset catalog entries in catalog.yml, such as

audio_folder:
  type: partitions.PartitionedDataset
  dataset: my_kedro_project.datasets.audio_dataset.SoundDataset
  path: data/output/audios/
  filename_suffix: ".WAV"
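For context, loading a catalog entry like audio_folder yields a dictionary mapping partition ids to load callables, which is the behaviour a nested version would need to reproduce. A sketch of a consuming node, with hypothetical names (no Kedro import needed to illustrate the shape):

```python
def process_audio_folder(audio_folder: dict) -> dict:
    # PartitionedDataset.load() returns {partition_id: load_callable};
    # calling each callable materialises that partition lazily.
    processed = {}
    for partition_id, load_func in audio_folder.items():
        data = load_func()
        processed[partition_id] = data
    return processed
```

Lazy loading is the point of this design: partitions are only read from disk when their callable is invoked, which matters with thousands of audio files.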
The next level of abstraction I need is a catalog entry corresponding to a folder containing folders, such as the audio_folder above. Here is my attempt, but I'm having an issue with the _save method:

class AudioFolderDataset(PartitionedDataset):

    def __init__(self, main_folder_path: str):
        """Creates a new instance of AudioFolderDataset to load / save audio data for the given folder path.

        Args:
            main_folder_path: The location of the folder of folders to load / save data.
        """
        protocol, mainfolderpath = get_protocol_and_path(main_folder_path)
        self._protocol = protocol
        self._mainfolderpath = PurePosixPath(mainfolderpath)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self):
        # loading code
        ...

    def _save(self, subfolders_dictionary):
        os.path.normpath(self._mainfolderpath)
        for subfolder_name in subfolders_dictionary.keys():
            subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)

            partitioned_dataset = PartitionedDataset(
                path=subfolder_path,
                dataset=SoundDataset,
                filename_suffix=".WAV",
            )

            partitioned_dataset.save(subfolders_dictionary[subfolder_name])

    def _describe(self):
        # describe code
        ...
The problem is I'm working on Windows, but it seems that PartitionedDataset assumes my system separator is / instead of \. When I print the path in the _save method of the SoundDataset class I get folder\\subfolder/file.WAV, which of course is leading to an error.
Is there a way in which I can change this default behaviour?
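Since fsspec-based datasets like PartitionedDataset work with "/"-separated paths internally, one way to avoid the mixed folder\\subfolder/file.WAV form (a sketch under that assumption, not an official Kedro recommendation; helper names are mine) is to keep the whole path in POSIX form and only ever join with "/":

```python
from pathlib import PurePosixPath, PureWindowsPath

def to_posix(path: str) -> str:
    # Normalise a possibly-Windows path ("folder\\subfolder") to the
    # forward-slash form that fsspec-based datasets expect.
    return PureWindowsPath(path).as_posix()

def join_posix(base: str, *parts: str) -> str:
    # Join using "/" regardless of the host OS separator, so no mix of
    # "\\" (from os.path.join on Windows) and "/" (from fsspec) appears.
    return str(PurePosixPath(to_posix(base), *parts))

print(join_posix("folder\\subfolder", "file.WAV"))  # folder/subfolder/file.WAV
```

Replacing os.path.join with a "/"-only join like this in _save would keep the separators consistent on Windows.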

6 comments