
Updated 2 months ago

Partitioning issues with PartitionedDataset


Hey guys, I'm having some issues with partitions.PartitionedDataset. I manage to create multiple files, but my problem is accessing them in a .ipynb notebook to check each partition. I'd like to make sure they are OK so that I can open them one by one by iterating over them in the next pipeline. Can someone help me with that?

my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate  # path to the location of partitions
  dataset: pandas.CSVDataset

14 comments

Hi @U04D5SM9LSW, could you share the actual issue? Thank you.

I mean, when I load it using catalog.load(), I tried to access it like any dict, but it doesn't work. So what would be the correct way to access each partition?

Are you facing this issue only in the notebook? Did you try loading the partitions in a local dev environment in an IDE?

I hope you've already gone through the docs. If not, can you have a look at the Python API example mentioned here?

I did, it's just a bit confusing. I'm trying to iterate over the catalog entry the same way after loading it, but that is not working.

No, I'm using Vertex AI Workbench to code, and I load it in a Jupyter notebook (.ipynb) to try it out:

%load_ext kedro.ipython
%reload_kedro ../

catalog.list()
[
    'companies',
    'historical_product_demand',
    'my_partitioned_dataset',
    'reviews',
    'shuttles_excel',
    'shuttles@csv',
    'shuttles@spark',
    'preprocessed_companies',
    'preprocessed_shuttles',
    'preprocessed_reviews',
    'model_input_table@spark',
    'model_input_table@pandas',
    'regressor',
    'metrics',
    'companies_columns',
    'shuttle_passenger_capacity_plot_exp',
    'shuttle_passenger_capacity_plot_go',
    'dummy_confusion_matrix',
    'parameters',
    'params:model_options',
    'params:model_options.test_size',
    'params:model_options.random_state',
    'params:model_options.features'
]

my_partitioned_dataset = catalog.load('my_partitioned_dataset')

Thanks for the information. The problem might be due to some missing partitions or access permission issues. I will check with my team for some more help. Thanks for your patience

Once you've loaded the partitioned dataset with catalog.load(), it'll be a dict that maps each partition name to its corresponding load function. You can iterate over it to load the individual partitions:

my_partitioned_dataset = catalog.load('my_partitioned_dataset')

for file, func in my_partitioned_dataset.items():
  data = func()

I did that and it didn't work, but that was because of a partition that got created and I don't know why: a .gitkeep partition.

'.gitkeep': <bound method CSVDataset._load of kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=PurePosixPath('/home/jupyter/demand-forecast-gcp-kedro/pdi-demand-forecast/data/02_intermediate/.gitkeep'), protocol='file', load_args={}, save_args={'index': False})>,

Oh, it's reading the .gitkeep file as one of the data partitions as well. You can just delete that file.

Yeah, I just didn't understand how that happened. Is there any way to avoid it? Because every time that node runs it will do the same. I know I can avoid it with a simple "if", but I would like to understand how that file was created.
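For anyone who wants the "simple if" spelled out, here is a rough sketch. It assumes the loaded object is a plain dict of partition name to load function, as described earlier in the thread. The helper name iter_partitions and the fake_partitions stand-in are made up for illustration and are not part of Kedro; in a real session you would pass the result of catalog.load('my_partitioned_dataset') instead.

```python
from typing import Any, Callable, Dict, Iterator, Tuple


def iter_partitions(
    partitions: Dict[str, Callable[[], Any]],
    suffix: str = ".csv",
) -> Iterator[Tuple[str, Any]]:
    """Yield (partition_id, data) for partitions whose id ends with `suffix`."""
    for partition_id in sorted(partitions):
        if not partition_id.endswith(suffix):
            # Skip placeholder files such as '.gitkeep' without calling their loader.
            continue
        load_func = partitions[partition_id]
        yield partition_id, load_func()


def _not_a_csv() -> Any:
    # Mimics what happens when a CSV loader is pointed at a non-CSV file.
    raise ValueError("not a CSV file")


# Hypothetical stand-in for catalog.load('my_partitioned_dataset').
fake_partitions = {
    "part_b.csv": lambda: [{"x": 2}],
    ".gitkeep": _not_a_csv,
    "part_a.csv": lambda: [{"x": 1}],
}

loaded = dict(iter_partitions(fake_partitions))
print(list(loaded))  # ['part_a.csv', 'part_b.csv']
```

The filter runs before the load function is called, so the .gitkeep entry is never read at all.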

The Kedro template comes with a .gitkeep file in each of the data folders so that the folders can be pushed to GitHub, since Git doesn't track empty folders. You can delete these files once the folders actually contain something. I'd also recommend creating a folder within 02_intermediate for the actual data.
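A catalog entry along those lines might look like the sketch below. The subfolder name partitions is only an example, not something from this project. The filename_suffix argument of PartitionedDataset limits the partitions to files ending with that suffix, so a stray .gitkeep would be ignored even if it stayed in the folder. As far as I know the suffix is also dropped from the partition names, so keys would look like part_a rather than part_a.csv; worth checking against your Kedro version.

```yaml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: data/02_intermediate/partitions  # example subfolder holding only the partition files
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"  # only files ending in .csv are treated as partitions
```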

Ohhh I see, I thought it was something generated when the node was executed to create the partitions. I got it now, thanks.
