BigQuery Reference
About BigQuery Reference
The BigQuery Reference is a focused, searchable SQL cheat sheet for Google BigQuery — Google Cloud's fully managed, serverless data warehouse. Unlike standard SQL, BigQuery has unique features that data analysts, data engineers, and ML practitioners need to master: nested and repeated data types (STRUCT for structured records and ARRAY for repeated values), the UNNEST function for flattening arrays into rows, and BigQuery-specific functions that handle edge cases elegantly (SAFE_DIVIDE returns NULL instead of throwing a division-by-zero error). This reference covers these BigQuery-specific constructs across five categories: Types, Functions, ML, DDL, and Optimization.
The Functions category covers the datetime utilities that power most analytical queries: FORMAT_TIMESTAMP for formatting TIMESTAMP columns into human-readable strings, DATE_TRUNC for grouping time-series data by day/week/month/quarter/year, and GENERATE_DATE_ARRAY for creating calendar sequences used in date spine joins. The QUALIFY clause, one of BigQuery's most powerful but least-known features, filters on window function results directly in the query, without a wrapping subquery, enabling patterns like "get the latest row per user" in a single statement. The ML category covers the complete BigQuery ML workflow: CREATE MODEL for training (linear regression, logistic regression, boosted trees, etc.), ML.EVALUATE for model assessment, and ML.PREDICT for scoring new data.
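To make the date spine pattern concrete, here is a minimal sketch that joins a generated calendar to a hypothetical `my_dataset.orders` table (table and column names are illustrative, not from the reference itself):

```sql
-- Date spine: one row per day in January 2024, left-joined to order
-- totals so days with no orders still appear (with NULL revenue).
SELECT
  day,
  SUM(o.amount) AS revenue
FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-01-31')) AS day
LEFT JOIN my_dataset.orders AS o
  ON o.order_date = day
GROUP BY day
ORDER BY day;
```

The LEFT JOIN direction matters: driving from the generated array guarantees a row for every calendar day, which a plain GROUP BY on the fact table would not.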
The DDL and Optimization sections are critical for production-grade BigQuery usage. EXPORT DATA allows exporting query results directly to Cloud Storage in CSV, JSON, Avro, or Parquet format — useful for data pipelines. The MERGE statement enables full UPSERT operations (insert-or-update) on BigQuery tables, which is essential for incremental data loading patterns. For performance and cost optimization, PARTITION BY creates date-partitioned or integer-range-partitioned tables that enable partition pruning (only scanning relevant partitions), CLUSTER BY physically organizes data within partitions for faster filter queries, and CREATE FUNCTION (UDF) allows defining reusable SQL or JavaScript functions for complex business logic that cannot be expressed in standard SQL.
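As a sketch of the EXPORT DATA statement described above (bucket path and table names are placeholders; the URI must contain a `*` wildcard so BigQuery can shard the output):

```sql
-- Export one day of orders to Cloud Storage as Parquet.
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/orders-*.parquet',  -- placeholder bucket
  format = 'PARQUET',
  overwrite = true
) AS
SELECT order_id, order_date, amount
FROM my_dataset.orders
WHERE order_date = '2024-01-01';
```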
Key Features
- Types: STRUCT (named field records), ARRAY (repeated values), UNNEST (flatten arrays to rows with cross join)
- Functions: SAFE_DIVIDE (NULL on div/0), FORMAT_TIMESTAMP, DATE_TRUNC (day/week/month/year), GENERATE_DATE_ARRAY
- QUALIFY clause: filter window function results inline (e.g., ROW_NUMBER() OVER (PARTITION BY ...) = 1 for deduplication)
- BigQuery ML: CREATE MODEL (linear/logistic regression), ML.EVALUATE (model metrics), ML.PREDICT (scoring)
- DDL: EXPORT DATA to GCS (CSV/JSON/Avro/Parquet), MERGE for UPSERT (matched/not matched conditions)
- Optimization: PARTITION BY (date/integer range partitioning), CLUSTER BY (sorted block organization)
- CREATE FUNCTION (UDF): SQL and JavaScript user-defined functions with typed parameters and return types
- All examples use realistic table and column names with actual BigQuery SQL syntax
Frequently Asked Questions
What is the difference between STRUCT and ARRAY in BigQuery?
STRUCT is a container for named, typed fields — like a row-within-a-row. For example, STRUCT("John" AS name, 30 AS age) creates a single nested record. ARRAY is an ordered list of values of the same type. These two types are often combined: an ARRAY of STRUCTs represents repeated nested records (like a one-to-many relationship stored in a single row). BigQuery is designed for this denormalized, nested structure for performance — avoiding expensive JOINs on large datasets.
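A minimal sketch of the combined pattern, using literal values (the customer/order names are illustrative):

```sql
-- One row holding repeated nested records: a customer with an
-- ARRAY of STRUCTs for their orders.
SELECT
  'cust_001' AS customer_id,
  [
    STRUCT('ord_1' AS order_id, 49.99 AS amount),
    STRUCT('ord_2' AS order_id, 15.00 AS amount)
  ] AS orders;
```

The result is a single row whose `orders` column contains two nested records, exactly the one-to-many-in-one-row shape described above.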
How does UNNEST work in BigQuery and when should I use it?
UNNEST takes an ARRAY and converts each element into a separate row. Use it in the FROM clause with a cross join (implicit or explicit comma): SELECT item FROM my_table, UNNEST(tags) AS item returns one row per tag per original row. UNNEST is essential for querying nested/repeated fields in BigQuery schemas (common with Firestore exports, GA4 event data, and any schema designed for storage efficiency). Combine UNNEST with STRUCT fields using SELECT tag.name, tag.value FROM my_table, UNNEST(tag_array) AS tag.
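Building on the nested-orders shape from the previous answer, a sketch of flattening it (assuming a hypothetical `my_dataset.customers` table with an `orders` ARRAY of STRUCTs):

```sql
-- One output row per (customer, order) pair; the comma is an
-- implicit CROSS JOIN between each row and its unnested array.
SELECT
  c.customer_id,
  o.order_id,
  o.amount
FROM my_dataset.customers AS c,
  UNNEST(c.orders) AS o;
```

Note that a plain CROSS JOIN drops rows whose array is empty; use LEFT JOIN UNNEST(...) if those rows must be kept.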
What is SAFE_DIVIDE and why is it better than the division operator?
The standard division operator (/) throws a "division by zero" error when the denominator is 0. SAFE_DIVIDE(numerator, denominator) returns NULL instead of throwing an error. This is critical in analytical queries where you compute rates or averages over user cohorts — some cohorts may have zero users or zero events, and a single zero value would fail the entire query. Use SAFE_DIVIDE(revenue, sessions) instead of revenue / sessions when sessions could be 0.
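A sketch of the cohort-rate case described above (assuming a hypothetical `my_dataset.cohort_daily` table):

```sql
-- Cohorts with zero sessions yield NULL instead of failing the query.
SELECT
  cohort,
  SAFE_DIVIDE(SUM(revenue), SUM(sessions)) AS revenue_per_session
FROM my_dataset.cohort_daily
GROUP BY cohort;
```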
How does the QUALIFY clause work in BigQuery?
QUALIFY filters rows based on window function results — it is like a WHERE clause but evaluated after window functions. Without QUALIFY, you need a subquery: SELECT * FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY ts DESC) AS rn FROM events) WHERE rn = 1. With QUALIFY: SELECT * FROM events QUALIFY ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY ts DESC) = 1. This is the standard pattern for deduplication — keeping only the latest event per user.
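The deduplication pattern, written out against a hypothetical `my_dataset.events` table (note the harmless WHERE TRUE: BigQuery requires a QUALIFY clause to be accompanied by a WHERE, GROUP BY, or HAVING clause):

```sql
-- Keep only the latest event per user, in a single pass.
SELECT *
FROM my_dataset.events
WHERE TRUE  -- QUALIFY must appear alongside WHERE/GROUP BY/HAVING
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) = 1;
```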
How do PARTITION BY and CLUSTER BY improve BigQuery performance?
PARTITION BY divides a table into segments (partitions) based on a date column or integer range. When a query filters on the partition column (e.g., WHERE event_date = "2024-01-01"), BigQuery only scans that partition instead of the full table — dramatically reducing bytes processed and cost. CLUSTER BY physically sorts data within each partition by the cluster columns. Queries filtering on cluster columns (e.g., WHERE user_id = 12345) scan fewer storage blocks. Always partition on your most common date filter and cluster on the next most selective filter column.
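A sketch of a table DDL combining both techniques (schema and names are illustrative):

```sql
-- Partition on the most common date filter, then cluster on the
-- next most selective columns within each partition.
CREATE TABLE my_dataset.events (
  event_date DATE,
  user_id    INT64,
  event_name STRING
)
PARTITION BY event_date
CLUSTER BY user_id, event_name;
```

With this layout, WHERE event_date = '2024-01-01' prunes to one partition, and an additional user_id filter scans only the matching blocks inside it.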
What is BigQuery ML and how do I use it?
BigQuery ML allows you to train, evaluate, and use ML models directly in SQL without exporting data to Python or R. CREATE MODEL defines a model with model_type (linear_reg, logistic_reg, boosted_tree_classifier, kmeans, etc.) and trains it on a SELECT query. ML.EVALUATE returns performance metrics (RMSE, accuracy, AUC, etc.) for the trained model. ML.PREDICT applies the model to new data and returns predictions. This is ideal for analysts who want ML capabilities without managing Python environments or moving data.
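A sketch of the three-step workflow on a hypothetical churn dataset (model, table, and column names are placeholders):

```sql
-- 1. Train: logistic regression predicting the 'churned' label.
CREATE OR REPLACE MODEL my_dataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_days, sessions_30d, churned
FROM my_dataset.customer_features;

-- 2. Evaluate: accuracy, AUC, precision/recall for the trained model.
SELECT * FROM ML.EVALUATE(MODEL my_dataset.churn_model);

-- 3. Predict: score new customers; the output includes a
--    predicted_churned column derived from the label name.
SELECT customer_id, predicted_churned
FROM ML.PREDICT(
  MODEL my_dataset.churn_model,
  (SELECT customer_id, tenure_days, sessions_30d
   FROM my_dataset.new_customers));
```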
How does the MERGE statement work in BigQuery for UPSERT operations?
MERGE performs conditional INSERT, UPDATE, or DELETE based on whether a row matches between a target table and a source. The pattern is: MERGE target T USING source S ON T.id = S.id WHEN MATCHED THEN UPDATE SET T.value = S.value WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value). This is the standard pattern for incremental data loading — updating existing records and inserting new ones in a single atomic operation. MERGE is more efficient than DELETE + INSERT for large tables.
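The incremental-loading pattern written out against hypothetical target and staging tables:

```sql
-- Upsert: update rows that already exist, insert the rest,
-- in one atomic statement.
MERGE my_dataset.dim_customers AS T
USING my_dataset.stg_customers AS S
  ON T.customer_id = S.customer_id
WHEN MATCHED THEN
  UPDATE SET T.email = S.email, T.updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (S.customer_id, S.email, S.updated_at);
```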
How do I create and use a UDF (User-Defined Function) in BigQuery?
CREATE FUNCTION defines a persistent UDF: CREATE FUNCTION my_dataset.my_fn(x INT64) RETURNS INT64 AS (x * 2). CREATE TEMP FUNCTION defines a temporary UDF that is valid only for the duration of the current query or script. UDFs can be written in SQL (for simple expressions) or JavaScript (for complex string operations, parsing, or logic). Call them like built-in functions: SELECT my_dataset.my_fn(column) FROM my_table. UDFs are useful for encapsulating business logic (e.g., calculating fiscal quarters, masking PII fields) that would otherwise be repeated across many queries.
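As a sketch, one SQL UDF and one JavaScript UDF (dataset and function names are placeholders, and the masking logic is purely illustrative):

```sql
-- SQL UDF: percentage with built-in div/0 safety.
CREATE OR REPLACE FUNCTION my_dataset.pct(part FLOAT64, whole FLOAT64)
RETURNS FLOAT64 AS (
  SAFE_DIVIDE(part, whole) * 100
);

-- JavaScript UDF: crude PII masking (keep first character of the
-- local part). SQL NULL arrives as JavaScript null.
CREATE OR REPLACE FUNCTION my_dataset.mask_email(email STRING)
RETURNS STRING
LANGUAGE js AS """
  if (email === null) return null;
  const [user, domain] = email.split('@');
  return user.charAt(0) + '***@' + domain;
""";
```

Usage is identical to built-ins: SELECT my_dataset.mask_email(email), my_dataset.pct(clicks, impressions) FROM my_dataset.traffic.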