Partitioning issues with PartitionedDataset


Hey guys, I'm having some issues with partitions.PartitionedDataset. I manage to create multiple files, but my problem is accessing them in a .ipynb to check each partition. I'd like to make sure they are OK so that I can open them one by one, iterating over them in the next pipeline. Can someone help me with that?

my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate  # path to the location of partitions
  dataset: pandas.CSVDataset


Ravi Kumar Pilla

Hi @U04D5SM9LSW, could you share the actual issue you're seeing? Thank you.

Thiago José Moser Poletto

I mean, when I load it using catalog.load(), I tried to access it like any dict, but it doesn't work. So what would be the correct way to access each partition?

Ravi Kumar Pilla

Are you facing this issue only in the notebook? Did you try loading the partitions in a local dev environment in an IDE?

Ravi Kumar Pilla

I hope you've already gone through the docs. If not, can you have a look at the Python API example in the docs?

Thiago José Moser Poletto

I did, it's just a bit confusing. I'm trying to iterate over the catalog entry the same way after loading it, but that isn't working.

Thiago José Moser Poletto

No, I'm using Vertex AI Workbench to code, and I load the dataset to try it out in a Jupyter notebook (.ipynb):

%load_ext kedro.ipython
%reload_kedro ../

catalog.list()
[
    'companies',
    'historical_product_demand',
    'my_partitioned_dataset',
    'reviews',
    'shuttles_excel',
    'shuttles@csv',
    'shuttles@spark',
    'preprocessed_companies',
    'preprocessed_shuttles',
    'preprocessed_reviews',
    'model_input_table@spark',
    'model_input_table@pandas',
    'regressor',
    'metrics',
    'companies_columns',
    'shuttle_passenger_capacity_plot_exp',
    'shuttle_passenger_capacity_plot_go',
    'dummy_confusion_matrix',
    'parameters',
    'params:model_options',
    'params:model_options.test_size',
    'params:model_options.random_state',
    'params:model_options.features'
]

my_partitioned_dataset = catalog.load('my_partitioned_dataset')

Ravi Kumar Pilla

Thanks for the information. The problem might be due to some missing partitions or access permission issues. I will check with my team for some more help. Thanks for your patience

Ankita Katiyar

Once you've loaded the partitioned dataset with catalog.load(), it'll be a dict mapping each partition name to its corresponding load function. You can iterate over it to load the individual partitions:

my_partitioned_dataset = catalog.load('my_partitioned_dataset')

for file, func in my_partitioned_dataset.items():
  data = func()
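
To make the shape of that loaded object concrete, here is a minimal sketch that runs without Kedro. It assumes only what is described above: a dict whose values are zero-argument load functions. The helper name load_all, the partition names and the data are all hypothetical stand-ins, not part of the Kedro API.

```python
from typing import Any, Callable, Dict


def load_all(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Call each partition's load function and key the result by partition name."""
    loaded = {}
    # Sort by name so partitions are always processed in the same order.
    for name, load_func in sorted(partitions.items()):
        loaded[name] = load_func()  # nothing is read until this call
    return loaded


# Hypothetical stand-in for catalog.load('my_partitioned_dataset').
fake_partitions = {
    "2024-02.csv": lambda: [{"demand": 12}],
    "2024-01.csv": lambda: [{"demand": 10}],
}

loaded = load_all(fake_partitions)
```

In a real project you would pass the result of catalog.load('my_partitioned_dataset') instead of fake_partitions, and each value would be whatever the underlying dataset returns (a pandas DataFrame for pandas.CSVDataset).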

Thiago José Moser Poletto

I did that and it didn't work, but that was because of a partition that had been created and I don't know why: a .gitkeep partition.

'.gitkeep': <bound method CSVDataset._load of kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=PurePosixPath('/home/jupyter/demand-forecast-gcp-kedro/pdi-demand-forecast/data/02_intermediate/.gitkeep'), protocol='file', load_args={}, save_args={'index': False})>,

Thiago José Moser Poletto

If I skip that it works

Ankita Katiyar

Oh, it's reading the .gitkeep file as one of the data partitions as well. You can just delete that file.

Thiago José Moser Poletto

Yeah, I just didn't understand how that happened. Is there any way to avoid it? Every time that node runs it will do the same. I know I can avoid it with a simple "if", but I would like to understand how that file was created.
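
For reference, the simple "if" could look like the sketch below. It is self-contained and uses a stand-in dict rather than a real Kedro catalog. The helper name keep_partitions_with_suffix and the partition names are hypothetical, and it assumes partition keys keep their file extension, as the '.gitkeep' key above suggests.

```python
from typing import Any, Callable, Dict


def keep_partitions_with_suffix(
    partitions: Dict[str, Callable[[], Any]], suffix: str = ".csv"
) -> Dict[str, Callable[[], Any]]:
    """Return only the partitions whose name ends with the given suffix."""
    return {
        name: load_func
        for name, load_func in partitions.items()
        if name.endswith(suffix)
    }


def _not_a_csv() -> Any:
    # Stand-in for the load function of a file that is not real data.
    raise ValueError("not a CSV file")


# Hypothetical stand-in for catalog.load('my_partitioned_dataset').
partitions = {
    ".gitkeep": _not_a_csv,
    "store_a.csv": lambda: "a",
    "store_b.csv": lambda: "b",
}

csv_only = keep_partitions_with_suffix(partitions)
# Only the filtered partitions are loaded, so _not_a_csv is never called.
data = {name: load_func() for name, load_func in csv_only.items()}
```

Filtering before calling the load functions matters here, because the failure only happens when the .gitkeep entry is actually loaded.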

Ankita Katiyar

The Kedro template comes with a .gitkeep file in each of the data folders so the folders can be pushed to GitHub, since Git doesn't track empty folders. You can delete these files once the folders actually contain something. I'd also recommend creating a folder within 02_intermediate for the actual data.
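
As a sketch of that suggestion, the catalog entry could point at a dedicated subfolder and, optionally, restrict partitions by file suffix. The subfolder name my_partitions is hypothetical; filename_suffix is an argument of PartitionedDataset, but check the documentation for your installed kedro-datasets version before relying on it.

```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate/my_partitions  # hypothetical subfolder holding only partition files
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"  # only files ending in .csv are treated as partitions
```

With a suffix filter in place, files such as .gitkeep should no longer show up as partitions. Note that, as far as I know, the suffix is then also stripped from the partition names in the loaded dict.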

Thiago José Moser Poletto

Ohhh I see, I thought it was something generated when the node was executed to create the partitions. I got it now, thanks.
