A library of key Big Query concepts, commands and miscelanea.

References

General Notes

Big Query vs On-Premise Solutions

  • On GCP, storage and compute are decoupled
  • Whereas on-premise, they both reside on the same server
  • Separate pricing for storage and compute, which should save costs
  • Fully managed and automatic scaling on BigQuery

BigQuery Fundamentals

  • MUST have a project, which organises all of your activities
  • Use the comment of #standardSQL at the top of queries (if being shared) to ensure the other user is using the standard SQL.
  • Can set queries to be private, project visible or public
  • Queries are cached, so if they are run again we aren’t charged again
  • Use the preview option instead of running a limited query to save costs

9 Key Features of BigQuery

  1. Fully-managed data warehouse - no-ops, petabyte-scale
  2. Reliability
  3. Economical - pay only for what you use
  4. Security - encrypted in transit and at rest
  5. Auditable - every transaltion is logged
  6. Scalable - highly paralell processing model
  7. Flexible - Mashup data across multiple datasets

3 Ways to Interface with BigQuery

  1. Web UI
  2. CLI
  3. REST API

Pricing

You only incur query job processing costs. Job types include:

  1. Query - charged by bytes processed
  2. Load data - free
  3. Extract - free
  4. Copy - free

Storage of data in Big Query is a separate cost.

3 Categories of BigQuery Pricing

  1. Storage - automatic discounts for old data (tables with data untouched for a month have the storage cost reduced by 50%)
  2. Processing - on demand or flat rate plans. On demand is based on amount of data processed. 1TB/month free. Have to opt in to run high-compute queries
  3. Free - loading, exporting, queries on metadata, cached queries, queries with errors

Reserved Slots

  • Slots are amount of total query throughput
  • Reserving slots is a contractual engagement with Google whereby you have certain resources dedicated to you at predefined times (e.g every Friday at 7pm)
  • Useful if you have very high work loads at known intervals
  • Helps minimise variability in query performance
  • By default you are on the shared resource pool

Quotas

  • 50 concurrent queries
  • Query timeout: 6 hours
  • 1,000 updates to a Table per day
  • 1,000 tables referenced by a single query
  • Max result size 128MB Compressed

Optimise Queries for Cost

  • Only include the columns and rows needed
    • Do not use SELECT *
    • Filter rows early
  • Use cached results when possible
    • Use permanent tables instead of views
    • Views are saved queries
  • Limit the use of User-Defined Functions

Misc Notes

  • Standard SQL is ANSI 2011 compliant and has performance advantages
  • Can run sections of code by highlighting the code and running
  • Hold down command and click on the table name to insert it into your query
  • Use the preview option to take a peek at the data in your table without running a query. Running a query will cost you money, previewing won’t 0 “size_bytes” is a reserved key word in BigQuery

Data Types

  • INT64 - 64 bit, can be nagative
  • FLOAT64 - 64 bit
  • STRING
  • DATE - yyyy-mm-dd
  • BOOL
  • ARRAY - [‘apple’, ‘pear’]
  • STRUCT -

Data Integrity

5 Principles of Dataset Integrity

  1. Validity
    • Challenges: out of range, empty fields, data mismatch
  2. Accuracy
    • Challenges: lookup datasets, do not exist
  3. Completeness
    • Challenges: missing data
  4. Consistency
    • Challenges: duplicate records, concurrency issues
  5. Uniformity
    • Challenges: Same unit of measurement

Common Commands & Syntax Notes

Must Escape the full name of the table as `project.dataset.table`

#standardQL
SELECT column1
FROM `project.dataset.table`

Format Function

For example, add commas to numeric results to make it easier to read.

Caution, format converts the data to a string. So you can unintentionally change the data type!

SELECT FORMAT("%,d", revenue_column) AS rev
FROM `my_project.dataset_name.revenue_table`

Aggregation Functions

  • SUM()
  • AVG()
  • COUNT()
  • MIN()
  • MAX()

Dates

-- Extract Dates
EXTRACT(YEAR FROM date_column)

-- Current Date
CURRENT_DATE([time_zone])

-- Add interval of time
DATE_ADD(DATE "2019-01-01", INTERVAL 10 DAY)

-- Subtract interval of time
DATE_SUB(DATE "2019-01-01", INTERVAL 10 DAY)

-- Difference between two dates
DATE_DIFF(DATE "2019-01-01", DATE "2018-01-01", DAY)

-- Truncate the date to the earliest start period
DATE_TRUNC(DATE "2019-01-31", MONTH) -- Returns "2019-01-01"

-- Parse a string to proper date format
PARSE_DATETIME("12/07/2019")

Strings

Cast a string

CAST('1234' AS INT64)

If not sure what the data should be coming in as, instead of erroring out, can use safe cast instead which will convert the erronous value to NULL instead.

-- Will return NULL instead of erroring out
SELECT
    SAFE_CAST("apple" AS INT64)

Concatenate strings

CONCAT("abc", "efg") -- Returns abcdefg

Check if a string ends in a value

ENDS_WITH("Apple", e) -- Returns True

Lower / upper case

LOWER("Apple") -- Returns "apple"
UPPER("apple") -- Returns "Apple"

Regex

REGEXP_CONTAINS("Lunchbox", r"^*box$") -- Returns True

Wildcard Searches

To search for an argument with only a single character before or after, use the underscore

"This" LIKE "_his" -- returns True
"TTTTThis" LIKE "_his" -- returns False

Unlimited number of characters before or after the search

"This" LIKE "%his" -- returns True
"TTTTThis" LIKE "%his" -- returns True

Data Cleaning with SQL

Cleaning for Validity

CASE WHEN
IF()
-- Also specifying fields as nullable or required

Cleaning for Accuracy

IN() -- with a subquery or JOIN

Cleaning for Completeness

NULLIF()
IFNULL()
COALESCE()

Cleaning for Consistency

PARSE_DATE()
SUBSTR()
REPLACE()

Cleaning for Uniformity

FORMAT()
CAST()