Hello everyone
Is there a way to load a flat file from S3 based on some conditions, e.g. pulling the latest file from a given bucket?
Split it into two problems: first work out which file you want (list the bucket and apply your conditions), then load that file as usual.
But how is that done? The example below expects the exact name of the file:

motorbikes:
  type: pandas.CSVDataset
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3
  load_args:
    sep: ','
    skiprows: 5
    skipfooter: 1
    na_values: ['#NA', NA]
I just know the name of the bucket; we need to fetch the files based on some conditions, right?
But in the same manner you can add a custom dataset: have the filtering args in the constructor and use those args in the load method, something like the sketch below.
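A rough, untested sketch of that idea. The class name LatestCSVDataset, the path/suffix args, and the module path my_project.datasets are all made up for illustration; it assumes s3fs/fsspec are installed (Kedro already uses them for s3:// paths) and that s3fs's info() dict exposes LastModified:

import fsspec
import pandas as pd
from kedro.io import AbstractDataset

class LatestCSVDataset(AbstractDataset):
    """Load the most recently modified CSV under an S3 prefix."""

    def __init__(self, path: str, suffix: str = ".csv", credentials: dict = None):
        self._path = path        # e.g. s3://your_bucket/data/02_intermediate/company
        self._suffix = suffix    # filtering arg held by the constructor
        self._fs = fsspec.filesystem("s3", **(credentials or {}))

    def _load(self) -> pd.DataFrame:
        files = [f for f in self._fs.ls(self._path) if f.endswith(self._suffix)]
        if not files:
            raise FileNotFoundError(f"no {self._suffix} files under {self._path}")
        # the condition: pick the newest file by last-modified timestamp
        latest = max(files, key=lambda f: self._fs.info(f)["LastModified"])
        with self._fs.open(latest) as f:
            return pd.read_csv(f)

    def _save(self, data) -> None:
        raise NotImplementedError("read-only dataset")

    def _describe(self) -> dict:
        return {"path": self._path, "suffix": self._suffix}

Then the catalog entry points type at the class's import path, same shape as the motorbikes entry above:

latest_motorbikes:
  type: my_project.datasets.LatestCSVDataset
  path: s3://your_bucket/data/02_intermediate/company
  credentials: dev_s3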
I just found that PartitionedDataset provides a way of iterating over each file present in a bucket/folder:
https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-load
Ah yes, that one exists, though then you'll be implementing the conditions in the node, like the sketch below.
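Since PartitionedDataset loads as a dict of {partition id: loader callable}, the node can apply the condition and call only the loader it needs. This sketch assumes the partition ids (filenames) sort chronologically, e.g. date-stamped names:

def load_latest_partition(partitions: dict):
    # partitions: {partition_id: lazy loader callable}, as returned by PartitionedDataset
    latest_id = max(partitions)       # assumes ids like "2024-01-31.csv" sort by date
    return partitions[latest_id]()    # only the newest file is actually read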
You might even be able to extend PartitionedDataset and override its load method to call super() and then do the filtering.
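An untested sketch of that; the import path is the kedro-datasets one, and depending on your Kedro version the method to override may be _load rather than load:

from kedro_datasets.partitions import PartitionedDataset

class LatestOnlyPartitionedDataset(PartitionedDataset):
    # filter after calling super, so only the newest partition reaches the node
    def load(self) -> dict:
        partitions = super().load()     # {partition_id: loader callable}
        latest_id = max(partitions)     # assumes date-sortable partition ids
        return {latest_id: partitions[latest_id]}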