
Validating External Dataset Accessibility in BigQuery

At a glance
A community member is running into permission issues in BigQuery and wants to validate that all external datasets in the catalog are accessible before running their nodes, possibly via a hook plus a metadata tag identifying the external datasets. Their two main concerns are how to handle different dataset types and how to check that tables exist without fully loading them, for speed. Commenters suggest using load_args to limit rows, a custom dataset that queries just one row, or Ibis, whose lazy/deferred execution model can confirm a table exists without loading any data. Since the member is already using Ibis, the discussion converges on making sure they are alerted whenever the service account running the code lacks access to a dataset.

Hey team, what is a good way of checking whether all the input tables for the nodes I want to run are accessible? I am having issues with permissions in BigQuery, and testing is cumbersome. Is there a way to run a validation of all external datasets in the catalog?

I was thinking of adding a hook and a metadata tag that identifies the datasets as external.

My main concerns are:

  1. How do I handle different dataset types?
  2. How do I only ping each table (or load just the first row) instead of loading it in full, for speed reasons?
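A minimal sketch of that hook idea follows. The class name and the external: true metadata flag are made up for illustration, catalog._get_dataset is private Kedro API that may change between versions, and exists() is only implemented by some datasets, so treat this as a starting point rather than a drop-in solution:

```python
from kedro.framework.hooks import hook_impl


class ExternalDatasetCheckHook:
    """Fail the run early if any catalog entry tagged as external is unreachable."""

    @hook_impl
    def after_catalog_created(self, catalog):
        failures = []
        for name in catalog.list():
            dataset = catalog._get_dataset(name)  # private API; adjust per Kedro version
            metadata = getattr(dataset, "metadata", None) or {}
            if not metadata.get("external"):  # hypothetical catalog tag
                continue
            try:
                # exists() is a lightweight existence check where implemented;
                # it does not load the data itself.
                if not catalog.exists(name):
                    failures.append(name)
            except Exception as exc:  # e.g. a permissions error from BigQuery
                failures.append(f"{name}: {exc}")
        if failures:
            raise RuntimeError(f"External datasets not accessible: {failures}")
```

Registering the hook in settings.py (HOOKS = (ExternalDatasetCheckHook(),)) and adding external: true under each relevant entry's metadata key in catalog.yml would then fail the run before any node executes.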

Marked as solution

For different dataset types, you would need to adjust the check accordingly: maybe use load_args to limit rows, or use a custom dataset that queries just one row?
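For example (illustrative entries; the project, table, and file names are placeholders), a query dataset can probe a single row, while a file-backed dataset can cap rows via load_args:

```yaml
ping_orders:
  type: pandas.GBQQueryDataset        # from kedro-datasets
  sql: SELECT 1 FROM `my-project.sales.orders` LIMIT 1
  project: my-project

sample_customers:
  type: pandas.CSVDataset
  filepath: data/01_raw/customers.csv
  load_args:
    nrows: 1                          # read only the first row
```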



Just wondering—what are you using to work with BigQuery?

Ibis has good BigQuery support (the BigFrames team at Google helped build and maintain the Ibis backend, and they use it under the hood of BigFrames). Especially on your second point: Ibis has a lazy/deferred execution model, so it can verify that tables exist and let you examine their schemas without loading any data.

Kedro's Ibis integration is on the newer side, but improving quickly.
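For instance (placeholder project, dataset, and table names, assuming the BigQuery backend is installed via ibis-framework[bigquery]):

```python
import ibis

con = ibis.bigquery.connect(project_id="my-project", dataset_id="sales")

# Both calls below hit only BigQuery's metadata service; no table data
# is scanned until an expression is actually executed.
print(con.list_tables())
print(con.table("orders").schema())
```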

I am using Ibis, yes. My issue is that I pull external tables into different nodes and want to make sure I know whenever the service account (SA) running the code does not have access to any of the datasets.

I think an after_catalog_created or before_pipeline_run hook could make sense.

  1. Not sure I really understand this question.
  2. Shouldn't be an issue if you're using Ibis; see the sketch below.
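To make that concrete, here is a rough sketch of such a check with Ibis. The names are placeholders, in practice you might derive the table list from the catalog or the pipeline's inputs and run the loop inside one of the hooks mentioned above, and whether the raw google-api-core exceptions propagate may depend on the Ibis version:

```python
import ibis
from google.api_core.exceptions import Forbidden, NotFound

con = ibis.bigquery.connect(project_id="my-project", dataset_id="sales")

# Fetching a schema touches only the metadata API, so a missing grant
# for the service account surfaces here without scanning any rows.
for name in ["orders", "customers"]:  # placeholder table names
    try:
        con.table(name).schema()
        print(f"OK: {name}")
    except (Forbidden, NotFound) as exc:
        print(f"NOT ACCESSIBLE: {name}: {exc}")
```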
