Risk of loading full dataset instead of incremental updates

Question: if we have a dataset in the catalog that updates incrementally with an upsert/append and is then used as an input to another node, is there a risk that the full dataset will get loaded from the catalog and passed to the downstream node, rather than just the increment of data?

This depends on what dataset you use. The important thing here is consistent load and save behaviour: it should be either incremental or full data on both sides. Taking Kedro out of the equation, you can think of it as two function calls:

<some_output> = function_a(...)
<some_other_output> = function_b(<some_output>)
If you only want to process the incremental chunk, then the question is how you load or save only the incremental part. You will most likely need some checkpoint, or to use IncrementalDataset if it fits your use case.
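
A minimal sketch of the checkpoint idea with IncrementalDataset (the path and dataset type are hypothetical, and exact import paths can vary between kedro-datasets versions):

from kedro_datasets.partitions import IncrementalDataset

# Tracks a checkpoint so that load() only returns partitions added since
# the last confirmed run, rather than the full folder.
increments = IncrementalDataset(
    path="data/01_raw/increments",   # hypothetical folder of partition files
    dataset="pandas.CSVDataset",     # how each partition is read
)

new_partitions = increments.load()   # only partitions after the checkpoint
# ... process new_partitions ...
increments.confirm()                 # advance the checkpoint on success

In a pipeline you would normally not call confirm() by hand; a node can be given confirms="<dataset name>" so the checkpoint is advanced after a successful run.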

Okay, loading just the increment would be hard (it's a Delta table upsert), so I might resolve this by passing the upsert DataFrame through an identity function before upserting.
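
One way to sketch that pass-through (all dataset and function names here are hypothetical): have the node return the increment twice, mapping one output to the Delta dataset (which performs the upsert on save) and the other to an in-memory output for the downstream node.

import pandas as pd
from kedro.pipeline import node, pipeline

def passthrough_upsert(increment: pd.DataFrame):
    # One copy is saved by the catalog to the Delta table (upsert on save);
    # the other becomes an in-memory output consumed by the next node.
    return increment, increment

def process_increment(increment: pd.DataFrame) -> pd.DataFrame:
    ...  # downstream logic sees only the increment, not the full table

pipe = pipeline([
    node(passthrough_upsert, "raw_increment",
         ["delta_table", "increment_in_memory"]),
    node(process_increment, "increment_in_memory", "features"),
])

Because "increment_in_memory" is not registered in the catalog, Kedro keeps it as a MemoryDataset, so the downstream node never re-loads the full Delta table.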

I was struggling some time ago with Delta upserts, PartitionedDataset, and checkpoints (https://github.com/kedro-org/kedro-plugins/issues/471); I would love to know how you are solving this.

Thanks, I will read through this. The way we handle increments/batches is pretty painful: we obtain the last processed timestamp from our output table and generate query time ranges based on it. I would be interested to explore this more, as I'll be dealing with a similar problem in my next project.

What's the painful part here after you obtain the timestamp?

We have multiple grouping variables/data sources for this timeseries data, with different arrival times... so basically we need to get the last processed timestamp for each group, then do a range filter. We also have to keep read/write versions of the dataset for the DAG to resolve.
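
Roughly, per-group checkpointing like that might look as follows (column and group names are hypothetical):

import pandas as pd

def last_processed_per_group(output_df: pd.DataFrame) -> pd.Series:
    # Last processed timestamp per group, read from the output table.
    return output_df.groupby("group_id")["event_ts"].max()

def select_increment(source_df: pd.DataFrame,
                     checkpoints: pd.Series) -> pd.DataFrame:
    # Range-filter each group against its own checkpoint; groups with no
    # checkpoint yet (NaT after the map) are kept in full.
    cp = source_df["group_id"].map(checkpoints)
    return source_df[cp.isna() | (source_df["event_ts"] > cp)]

The read/write split mentioned above is typically two catalog entries pointing at the same table, so the DAG sees reading the output table and writing to it as distinct datasets.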

Hmm, interesting. I'll give it some thought.

How do you get the new data, and does the update go to the same table? How do you keep track of what has or hasn't been updated? Do you have a status table, or is it possible to keep a column for this status?

We have one that looks at a write timestamp; another just assumes there are no data gaps and goes off the last processed time in the output table.

Is it possible to keep a column marking updated/not updated, or to have some staging table that needs to be consumed downstream?

Not sure what you mean by the first one, but a staging table would be possible.

Say your goal is to upsert some rows and then process those rows.

You can set a flag on those rows as part of the upsert, e.g. is_processed=0; the next node then simply fetches all rows where is_processed=0.
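
As a concrete sketch of that flag pattern (table and column names are hypothetical; shown with SQLite to keep it self-contained):

import sqlite3

conn = sqlite3.connect("example.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(id INTEGER PRIMARY KEY, payload TEXT, is_processed INTEGER)"
)

rows = [(1, "a"), (2, "b")]  # hypothetical increment

# Upsert the increment and reset its flag in the same statement.
conn.executemany(
    "INSERT INTO events (id, payload, is_processed) VALUES (?, ?, 0) "
    "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload, "
    "is_processed = 0",
    rows,
)

# The downstream node fetches only the unprocessed rows...
increment = conn.execute(
    "SELECT id, payload FROM events WHERE is_processed = 0"
).fetchall()

# ...and marks them as done once its own processing succeeds.
conn.execute("UPDATE events SET is_processed = 1 WHERE is_processed = 0")
conn.commit()

If new rows can arrive between the SELECT and the UPDATE, the final UPDATE should be restricted to the ids that were actually fetched.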
