Guys, is there any built-in solution to handle large databases, so that the nodes run them partially? Like, let's say, 100k rows running in batches of 10k each, instead of doing it by hand with a for loop or something like that...
Nice to know, I'd love to read it...
Yeah I mean, I saw a little bit about PartitionedDataset, it's just that it wasn't clear to me whether it's usable in all scenarios, like to avoid running out of VM resources, to let me run even with a lower CPU count, and so on...
I do want to learn more about the "Kedro way" of doing things, to understand its full potential, you know.
It's not really a question of the "Kedro way", but if you want to process large volumes of data from a database, the best way is to do the compute on the database.
For example, Ibis is a Python dataframe library that lets you lazily execute code on the backend. Ibis can be fairly easily integrated with Kedro (there are built-in datasets and examples).
Would this help, or am I misunderstanding your question?
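To make it concrete, a node can just build and return a lazy Ibis expression, roughly like this. This is only a sketch: the table and column names ("orders", "amount", "customer_id") are made up, and I'm assuming the database connection is handled by one of the Ibis datasets in kedro-datasets via the catalog.
```python
# Rough sketch of a lazy Kedro node using Ibis.
# Column names ("amount", "customer_id") are placeholders for illustration.
import ibis
from ibis import _


def summarise_orders(orders: ibis.Table) -> ibis.Table:
    """Build a lazy aggregation; nothing is pulled into Python memory here.

    Ibis compiles the expression to SQL, and the backend (DuckDB, Postgres,
    BigQuery, ...) does the actual compute when the result is materialised.
    """
    return (
        orders.filter(_.amount > 0)
        .group_by(_.customer_id)
        .aggregate(
            total_amount=_.amount.sum(),
            n_orders=_.amount.count(),
        )
    )
```
The point is that the node only describes the computation; the heavy lifting stays on the database side instead of in your VM's memory.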
https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-use-generator-functions-in-a-node Kedro supports generator functions: you only need a dataset that loads a part of the data and yields it, and then in the node you iterate through the generated data chunks, process them, and yield them back. Kedro will call save on a dataset that supports append. You can check the docs for an example of how that would work.
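Roughly, the node itself ends up looking something like this. Just a sketch: the chunked loading and the append-on-save behaviour come from how the datasets are configured/implemented (the docs example uses a pandas chunksize and a small custom CSV dataset for the append part), and the "amount" column is made up.
```python
# Rough sketch of a generator node, assuming the input dataset is configured to
# load in chunks (e.g. pandas.CSVDataset with load_args: {chunksize: 10000})
# and the output dataset appends on each save. The "amount" column is a placeholder.
from typing import Iterator

import pandas as pd


def process_in_chunks(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for chunk in chunks:
        # stand-in for your real per-chunk processing
        chunk["amount_doubled"] = chunk["amount"] * 2
        yield chunk  # Kedro calls save() on the output dataset for each yielded chunk
```
That way only one chunk is in memory at a time, which is exactly the 100k-rows-in-10k-batches scenario you described.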