I'm writing down BigQuery specifics which I find different and interesting from my MySql knowledge. Hopefully, this digest will help someone else trying to learn and use BigQuery in the future.

Web view

Something similart to PhpMyAdmin for BigQuery where we can view the schema, data and test queries:

https://bigquery.cloud.google.com/dataset

Partitioned tables

Partitioned tables should make the queries faster and cheaper.

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date. To create a date-partitioned table.

source

If you create a date-partitioned table, BigQuery automatically loads data into a date-based partition of the table. You can then reduce the number of bytes processed by restricting your queries to specific partitions in the table.

source

Wildcard tables

Queries can be done through several tables with the wildcard:

FROM bigquery-public-data.noaa_gsod.gsod*

Wildcard tables are useful when a dataset contains multiple, similarly named tables that have compatible schemas. Typically, such datasets contain tables that each represent data from a single day, month, or year.

source

Query modes

Synchronous - wait for the result Asynchronous - returns without any result, then check periodically if the result is available Interactive - query is executed ASAP Batch - starts the query as soon as idle resources are available. Batch queries don't count towards your concurrent rate limit, which can make it easier to start many queries at once. More info

Query caching

BigQuery caches query results on a best-effort basis for increased performance. Results are cached for approximately 24 hours and cache lifetimes are extended when a query returns a cached result.

Query limits

  • Maximum tables per query: 1,000
  • Maximum query length: 256 KB

source

Table Decorators

Normally, BigQuery performs a full column scan when running a query. You can use table decorators to perform a more cost-effective query of a subset of your data.

To get a snapshot of the table at one hour ago

SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000]

To get table data added between one hour and half an hour ago

SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000--1800000]

source

Special SQL query keywords

There are some special BigQuery SQL functions.

FLATTEN

When you query nested data, BigQuery automatically flattens the table data for you.

For example citiesLived.place in the SQL query will be returned in the flat table as citiesLived_place.

Although BigQuery can automatically flatten nested fields, you may need to explicitly call FLATTEN when dealing with more than one repeated field.

WITHIN

The WITHIN keyword specifically works with aggregate functions to aggregate across children and repeated fields within records and nested fields. When you specify the WITHIN keyword, you need to specify the scope over which you want to aggregate

source

Modifying or deleting rows or data

BigQuery tables are append-only. The query language does not currently support either updating or deleting data. In order to update or delete data, you must delete the table, then recreate the table with new data. Alternatively, you could write a query that modifies the data and specify a new results table.

source

Jobs

Jobs are used to start all potentially long-running actions, for instance: queries, table import, and export requests. Shorter actions, such as list or get requests, are not managed by a job resource.

source

Datasets

A dataset is a grouping mechanism that holds zero or more tables. Datasets are the lowest level unit of access control; you cannot control access at the table level.

source

Projects

A project holds a group of datasets. Projects are created and managed in the APIs console. Jobs are billed to the project to which they are assigned.

source

RECORD field type

All I could find in the Google docs about this field is:

A collection of one or more other fields.

source

If the Record field should contain multiple rows, it must be REPEATED source

Denormalization

In BigQuery, you typically should denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.

source

Next Post Previous Post