Hey team, is there a way to use the pandas BigQuery datasets with ParallelRunner, or is the answer to use Ibis again?
Just to be clear, do you mean pandas.GBQTableDataset, BigFrames, or something else?
I get an error that the specified tables are not suitable for ParallelRunner. What they have in common is that they are all BigQuery tables.
Sorry, I'm not following; can you share the exact error?
I don't see anything at a glance that should prohibit using pandas.GBQTableDataset.
Of course,
I get: AttributeError: The following tables cannot be used with multiprocessing: [TABLE_NAMES]
Took me a bit, but I'm pretty sure the problem is with the pandas.GBQTableDataset implementation. The bigquery.Client is constructed in the __init__() method of the dataset, and that is probably not serializable (something along the lines of https://cloud.google.com/python/docs/reference/dataproc/latest/multiprocessing).
It should be possible to solve this by delaying connection until first use, e.g. in load() or save(). The pandas SQL and Ibis datasets all do this.
You can always do something like this yourself by defining a custom dataset. may be able to confirm if this can be squeezed into the imminent 6.0.0 release; I could probably do this tonight or tomorrow.
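For anyone wanting to see the idea, here is a rough sketch of "delay the connection until first use". It does not use Kedro or google-cloud-bigquery at all: FakeClient, EagerDataset, LazyDataset and is_picklable are hypothetical names made up for illustration, and FakeClient just holds a lock so that it behaves like an object that can't be pickled. This is not the actual GBQTableDataset code.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for a client object that cannot be pickled."""

    def __init__(self, project):
        self.project = project
        self._lock = threading.Lock()  # locks cannot be pickled


class EagerDataset:
    """Builds the client in __init__, so the dataset itself is unpicklable."""

    def __init__(self, table, project):
        self.table = table
        self.client = FakeClient(project)


class LazyDataset:
    """Stores only plain config in __init__; builds the client on first use."""

    def __init__(self, table, project):
        self.table = table
        self.project = project
        self._client = None

    @property
    def client(self):
        # Create the client the first time it is needed, then reuse it.
        if self._client is None:
            self._client = FakeClient(self.project)
        return self._client

    def load(self):
        # A real dataset would query the table here; this only touches the client.
        return f"rows from {self.client.project}.{self.table}"


def is_picklable(obj):
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```

With this, is_picklable(EagerDataset("t", "p")) is False while is_picklable(LazyDataset("t", "p")) is True, as long as load() has not been called yet. Once the client has been created, the lazy dataset holds the unpicklable object too, so the dataset has to be serialized before first use.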
Added a draft PR (https://github.com/kedro-org/kedro-plugins/pull/961); I need to look at the remaining unit tests tomorrow.
OK, tests are fixed. I think this should work, but you probably want to validate it yourself.
This just got merged and will be available in Kedro-Datasets 6.0.0 soon! Hopefully it fixes the issue for you