Updated 3 weeks ago

Using pandas with BigQuery and Parallel Runners

Hey team, is there a way to use pandas BigQuery datasets with ParallelRunner, or is the answer to use Ibis again?

Just to be clear, do you mean pandas.GBQTableDataset or BigFrames or something else?

Yes, GBQTableDataset exactly

Also, what is the issue you're running into using ParallelRunner?

I get an error that the specified tables are not suitable for ParallelRunner. What they have in common is that they are all BigQuery tables.

Sorry, I'm not following; can you share the exact error?

I don't see anything at a glance that should prohibit using pandas.GBQTableDataset.

Of course,

I get: AttributeError: The following tables cannot be used with multiprocessing: [TABLE_NAMES]

Took me a bit, but I'm pretty sure the problem is with the pandas.GBQTableDataset implementation. The bigquery.Client is constructed in the __init__() method of the dataset, and that is probably not serializable (something along the lines of https://cloud.google.com/python/docs/reference/dataproc/latest/multiprocessing).
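To illustrate the suspected failure mode without needing BigQuery credentials, here is a minimal sketch that uses only the standard library. FakeClient is a hypothetical stand-in for bigquery.Client (it holds a thread lock, which cannot be pickled), and EagerDataset is a hypothetical stand-in for a dataset that builds its client in __init__(). This is an assumption about the mechanism, not the actual kedro-datasets code.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for bigquery.Client: holds an unpicklable resource."""

    def __init__(self):
        self._lock = threading.Lock()  # thread locks cannot be pickled


class EagerDataset:
    """Hypothetical dataset that connects in __init__(), as described above."""

    def __init__(self, table_name):
        self.table_name = table_name
        # Created eagerly, so the client becomes part of the pickled state.
        self._client = FakeClient()


def is_picklable(obj):
    """Return True if obj can be pickled, which multiprocessing needs
    in order to hand the object to worker processes."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


if __name__ == "__main__":
    # The eager dataset drags the lock along, so pickling it fails.
    print(is_picklable(EagerDataset("my_table")))
```

If the real client behaves like this stand-in, any dataset holding it from construction time would fail a serializability check before the run even starts.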

It should be possible to solve this by delaying connection until first use, e.g. in load() or save(). The pandas SQL and Ibis datasets all do this.

You can always do something like this yourself by defining a custom dataset. Someone may be able to confirm whether this can be squeezed into the imminent 6.0.0 release; I could probably do this tonight or tomorrow.
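As a rough sketch of the "delay connection until first use" idea, the following uses only the standard library so it runs without BigQuery. FakeClient, LazyDataset, and the project argument are hypothetical names for illustration; a real custom dataset would subclass Kedro's dataset base class and create a bigquery.Client instead. Treat it as one possible shape under those assumptions, not the fix that went into kedro-datasets.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for bigquery.Client; the lock makes it unpicklable."""

    def __init__(self, project):
        self.project = project
        self._lock = threading.Lock()


class LazyDataset:
    """Hypothetical dataset that stores only plain config in __init__()."""

    def __init__(self, table_name, project):
        self.table_name = table_name
        self.project = project
        self._client = None  # no connection yet, so nothing unpicklable is held

    @property
    def client(self):
        # Build the client the first time it is needed, then reuse it.
        if self._client is None:
            self._client = FakeClient(self.project)
        return self._client

    def load(self):
        # A real dataset would query the table here; this returns a placeholder.
        return f"rows from {self.client.project}.{self.table_name}"

    def __getstate__(self):
        # Drop any live client so the dataset stays picklable even after use.
        state = self.__dict__.copy()
        state["_client"] = None
        return state


if __name__ == "__main__":
    dataset = LazyDataset("my_table", "my_project")
    restored = pickle.loads(pickle.dumps(dataset))  # survives a round trip
    print(restored.load())
```

The __getstate__ override is optional for the basic case, but it keeps the object serializable if it has already been loaded once in the parent process.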

I added a draft PR: https://github.com/kedro-org/kedro-plugins/pull/961. I still need to look at the remaining unit tests tomorrow.

OK, tests are fixed. I think this should work, but you probably want to validate it yourself. 🙂

This just got merged and will be available in Kedro-Datasets 6.0.0 soon! Hopefully it fixes the issue for you
