Things to know about BigQuery

Updated: 26/07/2023

·

7 min read

BigQuery is Google "managed" data warehouse for analytics that allows you to store, query, and extract data. It's a separate tool from Google Analytics and Google Tag Manager, and it is part of the Google Cloud Platform.

Image source: cloud.google.com

BigQuery is a "managed data warehouse". It allows you make a cheese sandwich (i.e. data warehouse) and you don't have to milk the cow, operate the mill or harvest the wheat or drive the Tractor. You have access to the computing and storage resources of Google Cloud Platform and are not held back by your hardware limitation.

Benefits of Google BigQuery

As BigQuery is one of many Google Cloud Platform products, here are some great benefits to organisations:
1. Serverless. No need to download data to your machine to work with it - the dataset will remain on the cloud.
2. Ease of use. Run ad-hoc SQL queries on your dataset without having to prepare the data, like indexes, beforehand. This is invaluable for data exploration.
3. Scale. Carry out data exploration on extremely large datasets interactively. You don't need to sample the data to work with it promptly.
4. Shareability. You will be able to run queries on data from different datasets without any issues. BigQuery is a convenient way to share datasets. Of course, you can also keep your data private, or share them only with specific persons -- not all data need to be public.

Getting started in BigQuery can be overwhelming. You need to have experience with sql query, there are platform-specific nuances that might trip you up. You also need to be comfortable working with web analytics data outside the Google Analytics interface.

Data Ingestion (Loading Data)

To ingest data into BigQuery, you can use an

  • EL pipeline (used for periodic loads of log files),

  • ETL pipeline (used when data needs to be enriched or quality controlled),

  • ELT pipeline (used for exploratory work)

What I learn from ingesting new external data sources into BigQuery from CSV, Google Cloud Storage, and Google Drive.

Loading Data into Google Cloud Storage and Google Drive

Linking external tables to BigQuery (e.g. Google Spreadsheets or directly from Google Cloud Storage) has several limitations. 2 of the most significant are:

  • Data consistency is not guaranteed if the data values in the source are changed while querying.

  • Data sources stored outside of BigQuery lose the performance benefits of having BigQuery manage your data storage (including but not limited to auto-optimization of your query execution path, certain wildcard functions are disabled, etc.).

Data from external tables do not get the advantage of metadata stored and cached in BigQuery. BigQuery needs to query the external source each time you need to read data even if you are running the exact same query -- since caching is not used for external sources. link

Data Processing Tasks

Piping Cloud Pub/Sub messages to BigQuery or bulk-compressing files in Cloud Storage are common data processing tasks perform in GCP.

As pipeline templates are simply stored in Cloud Storage buckets, they can be made publicly available for use across organizations and teams.

Google provides ready-made templates for several of these common processing tasks, your teams can accomplish these common tasks without ever developing a pipeline.

Costs of BigQuery

Bigquery charges for data storage, streaming insert and querying data. For loading and exporting data, it is free of charge.

BigQuery has a free tier plan which allows you to query 1 TB (Terabyte) per month and store 10 gigabytes. With Google Analytics 360, BigQuery is free, up to $500/month of usage.

For a website with over one-billion hits per month, that in the region of $150 per month for storage and $100 per month for querying. Google Analytics 360 customers benefit from native integration with Google BigQuery.

However, the use of BigQuery is not restricted to Google Analytics 360 customers alone. It is just a little more involved to integrate the free Google Analytics with BigQuery. If you need help on this, DM me.

Why exports Google Analytics to BigQuery?

  • No more data sampling! This is a step up for those sites having heavy traffic daily

  • An opportunity to build ANY reports with ANY segments using ANY configuration you like

  • To retain data beyond the default retention period

Sql with BigQuery

This is what I learned so far after completing the course From Data to Insights with Google Cloud Platform

  • Partitioned tables in BigQuery allows a single table to be split up. BigQuery offers date-partitioned tables. For example, January will be in one partition whiles February will be in another partition.

  • Do not use SELECT * FROM report.all.* on BigQuery. Such a query would quickly blow out the 1TB free quota! Drill down to the small section of the data table offered by Partitioned tables. This is much cheaper to query in terms of quota usage and execution time.

  • The most common error using BigQuery Cannot access field type on a value with type ARRAY<STRUCT<type STRING, amount INT64> The correct way is add in the appropriate UNNEST( ) and WHERE Clause filter

  • GA360 exports to BigQuery is specific to the view. This means that anything you filter out in the view you’re exporting does not appear in the BigQuery data.

Use Case

Connect your Firebase project to BigQuery and export your predictions, join your Crashlitics data and target users who had a difficult time using your app and keep them engaged.

BigQuery sits on the edge between data storage and data processing. The usual reason to store data in BigQuery is to use its big data analysis and interactive query capabilities. You would not want to use BigQuery for example as the backings store for an online application.

Interacting with BigQuery

You access BigQuery through the

  • Google Cloud Platform Console (via a browser),

  • Command-line tool, or

  • Making calls to the BigQuery REST API using client libraries such as Java, .NET, or Python.

There are also a variety of third-party tools that you can use to interact with BigQuery, such as visualizing the data or loading the data.

Using the BigQuery web UI in the GCP Console as a visual interface to complete tasks like running queries, loading data, and exporting data is the most common way to use BigQuery

External Table in BigQuery

External Tables are used to query data that you don’t want to import into BigQuery and remain in an external source such as Google Cloud Storage, Google Drive, BigTable.

Since External tables in BigQuery reference data from an external source, an explicit BigQuery schema cannot be specified for an external table. The format of the external source defines the schema for the table.

https://cloud.google.com/bigquery/external-data-sources

DataLab

Google Datalab which is based on the open-source Jupyter Notebook is one way to access BigQuery. With Pandas, you can have documentation, code, results, and visualizations all in one place making collaborating with project teams so much easier.

Colab Notebook (BigQuery in GDrive)

Explore and visualize BigQuery in a browser. Unlike DataLab which resides on GCP, Colab Notebook is Jupyter Notebook in Google Drive.

  • Initial set-up with Data Table Display extension, Google BigQuery API wrapper and Application Default Credentials

  • Reference SQL syntax from the original job using jobs.query method to return the SQL syntax from the job.

  • Result set loaded from BigQuery job as a DataFrame using to_dataframe method using the BigQuery Storage API.

  • Show descriptive statistics using describe() method to generate descriptive statistics. Descriptive statistics include : the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values

Message me if you want free access to the Colab notebook

Query dataset with BigQuery Python client library

You’ve got your data in a BigQuery dataset and now you need to programmatically access it, lets say from a Python application.

You can use the google-cloud-bigquery Python library to access a BigQuery dataset. Here are some of the BigQuery's Python Cloud Client Libraries to access Google Cloud programmatically

Source: https://cloud.google.com/python/docs/reference