Loading a flat file from s3 based on conditions

Hello everyone

Is there a way to load a flat file from S3 based on some conditions, like pulling the latest file from a given bucket?


Split it into two problems:

  • List the files in the bucket; the object metadata can help you filter the one you'd like to load
  • Load the correct file from that list
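The two steps above can be sketched roughly like this. This assumes boto3 is available and AWS credentials are configured; the bucket and prefix names are placeholders, and `pick_latest` is an illustrative helper, not a Kedro API:

```python
# Sketch of the two steps: list objects, then filter on metadata.
from datetime import datetime, timezone


def pick_latest(objects):
    """Return the key of the most recently modified object.

    `objects` is a list of dicts shaped like boto3's list_objects_v2
    "Contents" entries: {"Key": ..., "LastModified": ...}.
    """
    if not objects:
        raise ValueError("no objects to choose from")
    return max(objects, key=lambda o: o["LastModified"])["Key"]


def latest_key_in_bucket(bucket, prefix=""):
    # Step 1: list files in the bucket (needs boto3 + credentials).
    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # Step 2: filter on metadata to pick the one to load.
    return pick_latest(resp.get("Contents", []))
```

Once you have the key, loading it is the ordinary dataset/read step.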

But how is that done? The example below expects the exact name of the file:

motorbikes:
  type: pandas.CSVDataset
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3
  load_args:
    sep: ','
    skiprows: 5
    skipfooter: 1
    na_values: ['#NA', NA]

I just know the name of the bucket; we need to fetch the files based on some conditions, right?

in that case you'll have to implement a custom dataset

you can expand the behaviour of the PandasDataset

and override the load method

this is an example that adds additional functionality for sheets

how challenging is it going to be?

but in the same manner you can add a dataset: have filtering args in the constructor and use those args in the load method
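The pattern described here (filtering args in the constructor, used inside the load method) can be sketched with the standard library alone. In a real Kedro project you would subclass `kedro.io.AbstractDataset` or extend an existing pandas dataset instead; the class name and the mtime-based "latest" rule below are illustrative assumptions:

```python
# Stdlib-only sketch of a "load the latest matching file" dataset.
# In Kedro proper, this logic would live in a custom dataset class
# registered in the catalog; here it is a plain class for clarity.
import csv
from pathlib import Path


class LatestCSVDataset:
    def __init__(self, directory, pattern="*.csv"):
        # filtering args are stored on the instance by the constructor
        self._directory = Path(directory)
        self._pattern = pattern

    def load(self):
        # the constructor args drive the selection, then we load the pick
        candidates = list(self._directory.glob(self._pattern))
        if not candidates:
            raise FileNotFoundError(f"no files matching {self._pattern}")
        latest = max(candidates, key=lambda p: p.stat().st_mtime)
        with latest.open(newline="") as f:
            return list(csv.DictReader(f))
```

For S3 you would swap the `pathlib` listing for an `fsspec`/boto3 listing, but the shape of the class stays the same.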

cannot open the above link

my bad, updated link

not hard, datasets are just classes with load and save methods really

I just found that PartitionedDataset provides a way of iterating over each file present in a bucket/folder

https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-load

ah yes, that one exists, though then you'll be implementing the conditions in the node
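Implementing the condition in the node looks roughly like this. A PartitionedDataset hands the node a dict mapping each partition id (e.g. the file name) to a callable that loads that partition lazily; the node can then pick one and call only that loader. The lexicographic `max` is an assumption that works for date-stamped names like `2024-06-01.csv`:

```python
# A node function receiving partitions from a PartitionedDataset.
# `partitions` maps partition id -> zero-argument load callable, so
# only the chosen partition is actually read from storage.
def load_latest_partition(partitions: dict):
    latest_id = max(partitions)       # lexicographic pick by partition id
    return partitions[latest_id]()    # call just that one loader
```

The downside the comment above points at: the selection logic now lives in pipeline code rather than in the catalog entry.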

I usually prefer to have the dataset internals handling the selection

you might even be able to extend the PartitionedDataset and overload the load method there to call super and thereafter do the filtering
