Hi Kedro Community!
I've come up with a solution to my problem that I believe is not optimal, and since I'm quite lost among the available options, I'd like to check with you for a better approach.
Context:
I have 3 tables stored in a DB that I want to query daily, storing the raw data and then processing it in my pipelines.
The tables are quite large, so I have to query them in chunks.
Current Solution:
I'm using PartitionedDataset with an underlying pandas.CSVDataset (is there a better data format?).
catalog.yml
```yaml
_SQLQueryDataset: &SQLquery
  type: pandas.SQLQueryDataset
  credentials: db_espelho
  load_args:
    chunksize: 5  # testing only

"DB.table1":
  <<: *SQLquery
  sql: ...

"01_raw.{dataset}":
  type: partitions.PartitionedDataset
  path: data/01_raw/{dataset}
  dataset:
    type: pandas.CSVDataset
    save_args:
      index: False
  filename_suffix: ".csv"
```
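A note on the format question: Parquet is typically a better fit than CSV for intermediate pandas data (typed columns, compression, faster reads), and kedro-datasets provides pandas.ParquetDataset as a drop-in replacement. A minimal sketch, assuming the rest of the entry stays as above:

```yaml
"01_raw.{dataset}":
  type: partitions.PartitionedDataset
  path: data/01_raw/{dataset}
  dataset:
    type: pandas.ParquetDataset
  filename_suffix: ".parquet"
```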
```python
from typing import Any, Iterator

from pandas import DataFrame


def create_partitions(data_chunks: Iterator[DataFrame]) -> dict[str, Any]:
    # The dict comprehension drains the iterator, so every chunk
    # ends up in memory at once before the partitions are saved.
    return {f"part-{i:02d}": data for i, data in enumerate(data_chunks)}
```
In my create_partitions function, all chunks are loaded into memory. Is there a way to avoid that, so I can save one chunk at a time?

In general, loading data from a database into pandas and processing it there is an anti-pattern; you should seek to do as much work as possible in the database, at the very least until the data is of a manageable size! Hacking around loading using partitions is unnecessary, especially when databases allow you to efficiently query large volumes of data (e.g. transactions data).
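That said, on the literal question of saving one chunk at a time: PartitionedDataset supports lazy saving, where the dictionary values are callables that Kedro invokes one by one at save time. A sketch of that idea, re-querying each slice on demand; the helper name, table name, ordering column, and connection handling are illustrative assumptions:

```python
from collections.abc import Callable

import pandas as pd
from sqlalchemy import create_engine

CHUNKSIZE = 50_000  # illustrative


def create_lazy_partitions(
    row_count: int, con_url: str
) -> dict[str, Callable[[], pd.DataFrame]]:
    """Hypothetical helper: each value is a callable, so PartitionedDataset
    materializes only one chunk at a time while saving."""
    engine = create_engine(con_url)

    def make_loader(offset: int) -> Callable[[], pd.DataFrame]:
        def load() -> pd.DataFrame:
            # Deterministic ordering is required for stable OFFSET paging;
            # "my_table" and "id" are placeholder names.
            query = (
                f"SELECT * FROM my_table ORDER BY id "
                f"LIMIT {CHUNKSIZE} OFFSET {offset}"
            )
            return pd.read_sql(query, engine)

        return load

    n_parts = -(-row_count // CHUNKSIZE)  # ceiling division
    return {f"part-{i:02d}": make_loader(i * CHUNKSIZE) for i in range(n_parts)}
```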
You can use the ibis.TableDataset to work with most databases. There's some information in https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis (this is outdated; you can use the dataset that is available via Kedro-Datasets now) and in other posts.
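A minimal catalog sketch, assuming a DuckDB backend; swap the backend and connection parameters for your database:

```yaml
table1:
  type: ibis.TableDataset
  table_name: table1
  connection:
    backend: duckdb
    database: data/db.duckdb
  save_args:
    materialized: table
```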
One of the tables is a transactions table, so I just need to query the previous day's entries. Is it possible to do this with Kedro only?
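For what it's worth, once the table is loaded through ibis, that filter can live in an ordinary node and still execute in the database; the created_at column name below is an assumption:

```python
import datetime as dt

import ibis


def previous_day_entries(transactions: ibis.Table) -> ibis.Table:
    # ibis compiles this filter to SQL, so only yesterday's rows
    # ever leave the database.
    yesterday = dt.date.today() - dt.timedelta(days=1)
    return transactions.filter(transactions.created_at.date() == yesterday)
```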