I'm writing down BigQuery specifics that I find different and interesting compared to my MySQL background. Hopefully this digest will help someone else trying to learn and use BigQuery in the future.
Web view
Something similar to phpMyAdmin for BigQuery, where you can view schemas and data and test queries:
https://bigquery.cloud.google.com/dataset
Partitioned tables
Partitioned tables should make the queries faster and cheaper.
A partitioned table is a special table that is divided into segments, called partitions, which make it easier to manage and query your data. BigQuery offers date-partitioned tables, meaning the table is divided into a separate partition for each date. A table has to be declared as date-partitioned when it is created.
If you create a date-partitioned table, BigQuery automatically loads data into a date-based partition of the table. You can then reduce the number of bytes processed by restricting your queries to specific partitions in the table.
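For example (a sketch with a hypothetical date-partitioned table mydataset.events), a query can be restricted to specific partitions through the _PARTITIONTIME pseudo column, so only those partitions are scanned and billed:
SELECT COUNT(*) FROM [mydataset.events]
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-05-01') AND TIMESTAMP('2016-05-02')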
Wildcard tables
Queries can be run across several tables at once with a wildcard:
FROM `bigquery-public-data.noaa_gsod.gsod*`
Wildcard tables are useful when a dataset contains multiple, similarly named tables that have compatible schemas. Typically, such datasets contain tables that each represent data from a single day, month, or year.
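A sketch against the public NOAA dataset (standard SQL; the max column is assumed to hold the daily maximum temperature, as in Google's sample data). The _TABLE_SUFFIX pseudo column limits the query to the matched tables whose name suffix falls in a given range:
SELECT max, ROUND((max - 32) * 5/9, 1) AS celsius, year
FROM `bigquery-public-data.noaa_gsod.gsod19*`
WHERE _TABLE_SUFFIX BETWEEN '29' AND '35'
ORDER BY max DESC
LIMIT 10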
Query modes
- Synchronous - waits for the result.
- Asynchronous - returns immediately without a result; you then check periodically whether the result is available.
- Interactive - the query is executed as soon as possible.
- Batch - the query starts as soon as idle resources are available. Batch queries don't count towards your concurrent rate limit, which can make it easier to start many queries at once.
Query caching
BigQuery caches query results on a best-effort basis for increased performance. Results are cached for approximately 24 hours and cache lifetimes are extended when a query returns a cached result.
Query limits
- Maximum tables per query: 1,000
- Maximum query length: 256 KB
Table Decorators
Normally, BigQuery performs a full column scan when running a query. You can use table decorators to perform a more cost-effective query of a subset of your data.
To get a snapshot of the table as it was one hour ago:
SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000]
To get table data added between one hour ago and half an hour ago:
SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000--1800000]
Special SQL query keywords
There are some BigQuery-specific SQL keywords and functions.
FLATTEN
When you query nested data, BigQuery automatically flattens the table data for you.
For example, citiesLived.place in the SQL query will be returned in the flat table as citiesLived_place.
Although BigQuery can automatically flatten nested fields, you may need to explicitly call FLATTEN when dealing with more than one repeated field.
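A sketch, assuming a hypothetical [mydataset.persons] table where citiesLived is a repeated RECORD with place and yearsLived child fields; FLATTEN turns each repeated value into its own output row:
SELECT fullName, citiesLived.place, citiesLived.yearsLived
FROM (FLATTEN([mydataset.persons], citiesLived))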
WITHIN
The WITHIN keyword works with aggregate functions to aggregate across children of repeated fields within records and nested fields. When you use the WITHIN keyword, you need to specify the scope over which you want to aggregate.
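Continuing with the same hypothetical [mydataset.persons] table, WITHIN RECORD aggregates over the repeated values inside each record instead of over the whole table:
SELECT fullName, COUNT(citiesLived.place) WITHIN RECORD AS numberOfPlacesLived
FROM [mydataset.persons]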
Modifying or deleting rows or data
BigQuery tables are append-only. The query language does not currently support either updating or deleting data. In order to update or delete data, you must delete the table, then recreate the table with new data. Alternatively, you could write a query that modifies the data and specify a new results table.
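As a sketch of the second option, "deleting" rows can be emulated by selecting everything except the unwanted rows and writing the result to a new destination table (the destination table is configured on the query job, not in the SQL itself). Here [mydataset.users] and the status column are hypothetical:
SELECT * FROM [mydataset.users]
WHERE status != 'deleted'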
Jobs
Jobs are used to start all potentially long-running actions, for instance: queries, table import, and export requests. Shorter actions, such as list or get requests, are not managed by a job resource.
Datasets
A dataset is a grouping mechanism that holds zero or more tables. Datasets are the lowest level unit of access control; you cannot control access at the table level.
Projects
A project holds a group of datasets. Projects are created and managed in the Google APIs Console. Jobs are billed to the project to which they are assigned.
RECORD field type
All I could find in the Google docs about this field type is:
A collection of one or more other fields.
If the RECORD field should contain multiple nested rows, its mode must be REPEATED.
Denormalization
In BigQuery, you typically should denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.
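A sketch of the idea with hypothetical tables. The normalized form needs a join between orders and a separate order_items table:
SELECT o.order_id, COUNT(i.sku) AS item_count
FROM [mydataset.orders] o
JOIN [mydataset.order_items] i ON o.order_id = i.order_id
GROUP BY o.order_id
The denormalized form stores the items as a repeated RECORD inside the orders table itself, so no join is needed:
SELECT order_id, COUNT(items.sku) WITHIN RECORD AS item_count
FROM [mydataset.orders_denormalized]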