Risk of loading full dataset instead of incremental updates

Question: if we have a dataset in the catalog that updates incrementally with an upsert/append and is then used as an input to another node, is there a risk that the full dataset will get loaded from the catalog and passed to the downstream node, rather than just the increment of data?

This depends on what dataset you use. The important thing here is consistent load and save behaviour: it should be either incremental or full data on both sides. Taking Kedro out of the equation, you can think of it as two function calls:

<some_output> = function_a(...)
<some_other_output> = function_b(<some_output>)
If you only want to process the incremental chunk, then the question is how you load or save only the incremental part. You will most likely need some checkpoint, or to use IncrementalDataset if it fits your use case.
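
A minimal sketch of the checkpoint idea with IncrementalDataset (the path and dataset type are hypothetical, and exact import paths can vary between kedro-datasets versions):

from kedro_datasets.partitions import IncrementalDataset

# Tracks a checkpoint so that load() only returns partitions added since
# the last confirmed run, rather than the full folder.
increments = IncrementalDataset(
    path="data/01_raw/increments",   # hypothetical folder of partition files
    dataset="pandas.CSVDataset",     # how each partition is read
)

new_partitions = increments.load()   # only partitions after the checkpoint
# ... process new_partitions ...
increments.confirm()                 # advance the checkpoint on success

In a pipeline you would normally not call confirm() by hand; a node can be given confirms="<dataset name>" so the checkpoint is advanced after a successful run.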

Okay, loading just the increment would be hard (it's a Delta table upsert), so I might resolve this by passing the upsert DataFrame through an identity function before upserting.
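
One way to sketch that pass-through (all dataset and function names here are hypothetical): have the node return the increment twice, mapping one output to the Delta dataset (which performs the upsert on save) and the other to an in-memory output for the downstream node.

import pandas as pd
from kedro.pipeline import node, pipeline

def passthrough_upsert(increment: pd.DataFrame):
    # One copy is saved by the catalog to the Delta table (upsert on save);
    # the other becomes an in-memory output consumed by the next node.
    return increment, increment

def process_increment(increment: pd.DataFrame) -> pd.DataFrame:
    ...  # downstream logic sees only the increment, not the full table

pipe = pipeline([
    node(passthrough_upsert, "raw_increment",
         ["delta_table", "increment_in_memory"]),
    node(process_increment, "increment_in_memory", "features"),
])

Because "increment_in_memory" is not registered in the catalog, Kedro keeps it as a MemoryDataset, so the downstream node never re-loads the full Delta table.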

I was struggling some time ago with Delta upserts, PartitionedDataset, and checkpoints (https://github.com/kedro-org/kedro-plugins/issues/471); I would love to know how you are solving this.

Thanks, I will read through this. The way we handle increments/batches is pretty painful: we obtain the last processed timestamp from our output table and generate query time ranges based on it. I would be interested to explore this more, as I'll be dealing with a similar problem in my next project.

What's the painful part here after you obtain the timestamp?

We have multiple grouping variables/data sources for this timeseries data, with different arrival times... so basically we need to get the last processed timestamp for each group, then do a range filter. We also have to keep read/write versions of the dataset for the DAG to resolve.
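
Roughly, per-group checkpointing like that might look as follows (column and group names are hypothetical):

import pandas as pd

def last_processed_per_group(output_df: pd.DataFrame) -> pd.Series:
    # Last processed timestamp per group, read from the output table.
    return output_df.groupby("group_id")["event_ts"].max()

def select_increment(source_df: pd.DataFrame,
                     checkpoints: pd.Series) -> pd.DataFrame:
    # Range-filter each group against its own checkpoint; groups with no
    # checkpoint yet (NaT after the map) are kept in full.
    cp = source_df["group_id"].map(checkpoints)
    return source_df[cp.isna() | (source_df["event_ts"] > cp)]

The read/write split mentioned above is typically two catalog entries pointing at the same table, so the DAG sees reading the output table and writing to it as distinct datasets.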

Hmm, interesting. I'll give it some thought.

How do you get the new data, and does the update go to the same table? How do you keep track of what has or hasn't been updated? Do you have a status table, or is it possible to keep a column for this status?

We have one that looks at a write timestamp; another just assumes there are no data gaps and goes off the last processed time in the output table.

Is it possible to keep a column marking updated/not updated, or to have some staging table that needs to be consumed downstream?

Not sure what you mean by the first one, but a staging table would be possible.

Say your goal is to upsert some rows and then process those rows.

You can set a flag on those rows as part of the upsert, e.g. is_processed=0; the next node then simply fetches all rows where is_processed=0.
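
As a concrete sketch of that flag pattern (table and column names are hypothetical; shown with SQLite to keep it self-contained):

import sqlite3

conn = sqlite3.connect("example.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(id INTEGER PRIMARY KEY, payload TEXT, is_processed INTEGER)"
)

rows = [(1, "a"), (2, "b")]  # hypothetical increment

# Upsert the increment and reset its flag in the same statement.
conn.executemany(
    "INSERT INTO events (id, payload, is_processed) VALUES (?, ?, 0) "
    "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload, "
    "is_processed = 0",
    rows,
)

# The downstream node fetches only the unprocessed rows...
increment = conn.execute(
    "SELECT id, payload FROM events WHERE is_processed = 0"
).fetchall()

# ...and marks them as done once its own processing succeeds.
conn.execute("UPDATE events SET is_processed = 1 WHERE is_processed = 0")
conn.commit()

If new rows can arrive between the SELECT and the UPDATE, the final UPDATE should be restricted to the ids that were actually fetched.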
