Using Pandas with BigQuery and Parallel Runners

Question

Hey team, is there a way to use pandas BigQuery with parallel runners, or is the answer to use Ibis again?

Deepyaman Datta · Accepted Answer

Took me a bit, but I'm pretty sure the problem is with the pandas.GBQTableDataset implementation. The bigquery.Client is constructed in the __init__() method of the dataset, and that client is probably not serializable (something along the lines of https://cloud.google.com/python/docs/reference/dataproc/latest/multiprocessing).

It should be possible to solve this by delaying connection until first use, e.g. in load() or save(). The pandas SQL and Ibis datasets all do this.
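To illustrate the idea of delaying the connection, here is a minimal sketch using only the standard library. It is not the actual kedro-datasets code: FakeClient is a hypothetical stand-in for bigquery.Client (it holds a lock, which cannot be pickled), and EagerDataset, LazyDataset, and is_picklable are made-up names for illustration.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for a client object that cannot be pickled."""

    def __init__(self, project):
        self.project = project
        self._lock = threading.Lock()  # lock objects are not picklable


class EagerDataset:
    """Builds the client in __init__, so the dataset itself is unpicklable."""

    def __init__(self, project):
        self._client = FakeClient(project)


class LazyDataset:
    """Stores only plain configuration; builds the client on first use."""

    def __init__(self, project):
        self._project = project
        self._client = None

    @property
    def client(self):
        # Create the client the first time load()/save() would need it.
        if self._client is None:
            self._client = FakeClient(self._project)
        return self._client


def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
    return True


print(is_picklable(EagerDataset("my-project")))
print(is_picklable(LazyDataset("my-project")))
```

Note that in this sketch the lazy dataset is only picklable until its client has been created, so the approach relies on the dataset being serialized before load() or save() is first called.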

You can always do something like this yourself by defining a custom dataset. may be able to confirm if this can be squeezed into the imminent 6.0.0 release; I could probably do this tonight or tomorrow.

Deepyaman Datta · Answer

Just to be clear, do you mean pandas.GBQTableDataset, BigFrames, or something else?

Jannik Wiedenhaupt · Answer

Yes, GBQTableDataset exactly

Deepyaman Datta · Answer

Also, what is the issue you're running into using ParallelRunner?

Jannik Wiedenhaupt · Answer

I get an error that the specified tables are not suitable for ParallelRunner. What they have in common is that they are all BigQuery tables.

Deepyaman Datta · Answer

Sorry, I'm not following; can you share the exact error? I don't see anything at a glance that should prohibit using pandas.GBQTableDataset.

Jannik Wiedenhaupt · Answer

Of course, I get: AttributeError: The following tables cannot be used with multiprocessing: [TABLE_NAMES]

Deepyaman Datta · Answer

I added a draft PR: https://github.com/kedro-org/kedro-plugins/pull/961. I need to look at the remaining unit tests tomorrow.

Deepyaman Datta · Answer

OK, tests are fixed. I think this should work, but you probably want to validate it yourself. 🙂

Deepyaman Datta · Answer

This just got merged and will be available in Kedro-Datasets 6.0.0 soon! Hopefully it fixes the issue for you
