Google BigQuery – Dos and Don’ts
Introduction
Moving from a standard on-premises, desktop-based DB IDE to a cloud-based DB is fun!
And why not! You have a virtually limitless, scalable supercomputer at your disposal, where queries crunching millions of records, which used to take an hour to execute, now complete in mere minutes.
The cloud is the closest thing yet to the holy grail of a supercomputer at your fingertips, and it provides a leap in productivity unmatched by anything we have had at our disposal so far. Yet even this seemingly infinite pool of resources has to be used judiciously to get the best results cost-effectively.
Cost?
Well, surely nothing comes free, and neither does cloud computing. In this article we will take a look at Google Cloud's BigQuery and how to use it cost-effectively.
What is BigQuery?
Databases come in many forms depending on the implementation; the whole gamut of their existence is a separate topic. BigQuery, specifically, is a petabyte-scale, structured data warehouse that provides high computational throughput with low latency. As with everything cloud-based, BigQuery is highly scalable.
It is a fully managed, serverless data warehouse (a DaaS, Data-as-a-Service), which you can query using standard SQL.
BigQuery cost centers on two things: the amount of data stored on its servers, and the processing cost of queries. Storage cost is minimal, as BigQuery is designed as a data warehouse for hosting large datasets.
Query cost is based on the amount of data a query scans, so it depends on how well the query is optimized. The less data the query processes, the lower the cost. Before you run a query, the BigQuery console shows an estimate of how many bytes it will process, which is a handy sanity check.
Best practices
Read on for some key points on utilizing BigQuery effectively and keeping costs under control.
- Avoid SELECT * : SELECT * reads every column of the table, while a typical workload needs only a handful of them. Limit the data scanned, and hence the cost, by naming just the columns you need.
Eg. SELECT <column_name1>, <column_name2> FROM <table>
- Filter data before joining : When writing JOIN statements, ensure each dataset used in the join is already filtered down to the most relevant rows, which reduces the volume of data joined and processed (see the first example after this list).
- View table contents using the Preview tab. It gives a sampling of the data in the table without costing a dime.
- Use ORDER BY judiciously. Sorting is expensive, so apply it only in the final, outermost query, never inside joins and subqueries (see the ORDER BY example after this list).
- Create partitioned tables where possible. A table partitioned on, say, a date column reduces cost massively when the WHERE clause filters on the partition column, because only the matching partitions are scanned. Design your data structures accordingly (a sketch follows this list).
- For table joins, place the biggest table at the left-most end and work rightwards through progressively smaller tables, which helps BigQuery plan the join efficiently instead of making repeated passes over the big table (see the join-order example after this list).
- BigQuery cost does not decrease if LIMIT is used in a query: the full table is still scanned, and LIMIT only trims the rows returned (see the example after this list).
- Use materialized views for queries that run regularly and aggregate large amounts of data. Because the results are precomputed, they reduce processing cost, but be mindful of the additional storage cost they incur (see the final example after this list).
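Examples
The sketches below illustrate the points above. All dataset, table and column names (mydataset.orders, mydataset.customers and so on) are hypothetical placeholders, not real objects; substitute your own.

Filtering before a join, using a CTE to reduce one side of the join first:

-- Filter each input down before joining; only the reduced sets are joined.
WITH recent_orders AS (
  SELECT order_id, customer_id, order_total
  FROM mydataset.orders
  WHERE order_date >= DATE '2023-01-01'
)
SELECT c.customer_name, o.order_total
FROM recent_orders AS o
JOIN mydataset.customers AS c
  ON c.customer_id = o.customer_id;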
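Sorting only in the outermost query:

-- No ORDER BY inside subqueries or joins; sort once, at the very end.
SELECT customer_id, SUM(order_total) AS total_spend
FROM mydataset.orders
GROUP BY customer_id
ORDER BY total_spend DESC;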
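A minimal partitioning sketch: create a date-partitioned table, then query it so that only one partition is scanned:

-- Partition the table on the order_date column.
CREATE TABLE mydataset.orders_partitioned
PARTITION BY order_date
AS SELECT * FROM mydataset.orders;

-- Filtering on the partition column prunes every other partition,
-- so only one day's worth of data is scanned and billed.
SELECT order_id, order_total
FROM mydataset.orders_partitioned
WHERE order_date = DATE '2023-06-01';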
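Ordering joins from largest to smallest table (the three tables are assumed to decrease in size):

SELECT e.event_id, c.customer_name, n.country_name
FROM mydataset.big_events AS e                                      -- largest table first
JOIN mydataset.customers AS c ON c.customer_id = e.customer_id      -- smaller
JOIN mydataset.countries AS n ON n.country_code = c.country_code;   -- smallest last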
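Why LIMIT does not save money: both queries below are billed for a full scan of the selected columns; the second merely returns fewer rows.

SELECT order_id, order_total FROM mydataset.orders;
SELECT order_id, order_total FROM mydataset.orders LIMIT 10;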
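And a sketch of a materialized view that precomputes a frequently used aggregate:

-- BigQuery stores the aggregated result and refreshes it incrementally,
-- so repeated queries read the precomputed data instead of rescanning
-- the base table.
CREATE MATERIALIZED VIEW mydataset.daily_sales AS
SELECT order_date, SUM(order_total) AS total_sales
FROM mydataset.orders
GROUP BY order_date;

SELECT total_sales
FROM mydataset.daily_sales
WHERE order_date = DATE '2023-06-01';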