Hey Guys, quick question:
Is there a way to enforce the schema (column data types) in the catalog?
Like:
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
  schema:
When reading, you can use the dtype arg of pandas’ read_csv method in the load_args field. Is that what you are looking for? So:
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
    dtype: {}
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
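Whatever goes into that dtype mapping is handed straight to read_csv, so filling in the braces is the same as calling pandas directly. A minimal sketch of the equivalent call, with made-up column names:

```python
import pandas as pd

# Equivalent of load_args -> dtype in the catalog entry above.
# "car_id", "price" and "registration_date" are hypothetical column names.
df = pd.read_csv(
    "data/01_raw/company/cars.csv",
    sep=",",
    dtype={"car_id": "int64", "price": "float64"},
    parse_dates=["registration_date"],  # datetimes go through parse_dates, not dtype
)
```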
Unfortunately no… pandas does not have such an option, as it infers the types from df.dtypes when writing to CSV. You can always cast to the correct dtype from within the node.
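If you go the casting route, the node could look something like this (a sketch only; the column names and dtypes are made up):

```python
import pandas as pd

def enforce_schema(cars: pd.DataFrame) -> pd.DataFrame:
    # Cast to the dtypes we expect before the data leaves the node.
    # Column names here are hypothetical placeholders.
    cars = cars.astype({"car_id": "int64", "price": "float64"})
    cars["registration_date"] = pd.to_datetime(cars["registration_date"])
    return cars
```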
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
table_schema : list of dicts, optional
List of BigQuery table fields to which according DataFrame columns conform to, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
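Going by those docs, you can pin the BigQuery schema along these lines (a sketch; the project and table names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b"], "col2": [1, 2]})

# table_schema overrides the dtype-based inference described above.
# "my-project" and "analytics.cars" are placeholder identifiers.
df.to_gbq(
    destination_table="analytics.cars",
    project_id="my-project",
    if_exists="replace",
    table_schema=[
        {"name": "col1", "type": "STRING"},
        {"name": "col2", "type": "INTEGER"},
    ],
)
```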
to be clear, it's not a pandas issue. CSV has no notion of types because it is plain text, which is why it is generally considered a bad format for any data processing pipeline.
The reason you can do that with BigQuery is that it is a typed system that stores your dataframe as a table.
If you want to preserve types, use something like Parquet.
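For example, a quick round-trip shows Parquet keeping the dtypes that CSV would drop (this assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

df = pd.DataFrame({"car_id": [1, 2], "price": [9999.5, 12000.0]})
df["registration_date"] = pd.to_datetime(["2021-01-01", "2021-06-15"])

df.to_parquet("cars.parquet")
print(pd.read_parquet("cars.parquet").dtypes)  # int64, float64, datetime64[ns] preserved
```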
so there's no way to feed a BQ table with specific data types for the columns using type: pandas.GBQTableDataSet?
You can definitely do that with BigQuery; I said you cannot save a CSV with types, as that was the original example you provided.
oh yeah, my bad, I'm using type: pandas.GBQTableDataSet but for some reason I got:
kedro.io.core.DatasetError: Failed while saving data to data set GBQTableDataset Could not convert DataFrame to Parquet.