Dataform-Versioning for BigQuery
GCP Dataform is a powerful tool for managing your BigQuery data infrastructure on Google Cloud Platform (GCP). With Dataform, you can easily create, test, and deploy data models and pipelines using a simple, SQL-like syntax.
When working with BigQuery, we usually write queries and save them as views or saved queries. Transformation queries change over time, but BigQuery currently has no native way to version those changes or to record who updated a query or procedure. Dataform fills that gap by keeping a version history of those definitions. That is just one of its distinguishing features.
GCP Dataform is a data modeling and pipeline management tool that allows you to create, test, and deploy data models and pipelines on GCP. Dataform uses a simple, SQL-like syntax to define your data models and pipelines, which makes it easy to learn and use.
- SQL-like syntax: Dataform uses a simple, SQL-like syntax to define your data models and pipelines. This makes it easy to learn and use, especially if you’re already familiar with SQL.
- Version control: Dataform integrates with Git, which allows you to version control your data models and pipelines. You can easily track changes to your data infrastructure over time and collaborate with other team members. This increases accountability and traceability.
- Testing: Dataform provides a built-in testing framework that allows you to test your data models and pipelines. You can easily write tests to validate your data and ensure that your pipelines are working correctly.
- Deployment: Dataform integrates with GCP services, such as BigQuery and Cloud Storage, which allows you to deploy your data models and pipelines directly to GCP. You can easily deploy your data infrastructure with a single command.
- Monitoring: Dataform provides real-time monitoring of your data models and pipelines. You can easily track the progress of your pipelines and identify any issues.
- Cost effective: Dataform itself adds no execution cost; you pay only for the associated services it uses, such as BigQuery.
- Reusability: JavaScript constants, variables, and functions can be used inside your definitions, which lets you reuse code across models.
- Incremental and full refresh: tables can be updated incrementally or fully reloaded, which reduces cost and speeds up execution.
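To make the reusability point concrete, here is a minimal sketch of a JavaScript include file. The file name `includes/constants.js`, the constant `START_DATE`, and the helper `sinceStartDate` are hypothetical names chosen for illustration; the only Dataform behavior assumed is that whatever a file under `includes/` exports becomes available to SQLX files under that file's name.

```javascript
// includes/constants.js (hypothetical file name)
// Shared values and helpers that SQLX files can reference.

// Earliest date any model should read from (illustrative value).
const START_DATE = "2023-01-01";

// Build a reusable date filter for the given column.
function sinceStartDate(column) {
  return `${column} >= DATE '${START_DATE}'`;
}

module.exports = { START_DATE, sinceStartDate };
```

In a SQLX file this could then be written as `WHERE ${constants.sinceStartDate("order_date")}`, so a change to the start date is made in one place rather than in every query.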
To use GCP Dataform, you need to create a project and define your data models and pipelines.
Create a repository for file versioning and storage, and connect it to Git. After the repository is created, create a workspace in which to set up the project's BigQuery structure. This is as simple as creating directories that correspond to datasets and files that correspond to tables. Table names correspond to Dataform file names.
A data model definition for a view or table consists of a config block followed by the query, in the following format –
config {
  type: "view"
}

SELECT *
FROM my_table
Model definition can be of multiple formats –
View definition –
config {
  type: "view",
  schema: "Views",
  description: "Sample description of table",
  tags: ["<tag_1>", "<tag_2>"],
  bigquery: {
    labels: {
      label_definition_1: "label_definition_1",
      label_definition_2: "label_definition_2",
      label_definition_3: "label_definition_3"
    }
  }
}
Table definition –
config {
  type: "<table or incremental>",
  schema: "<dataset>",
  description: "Sample description of table",
  assertions: {
    nonNull: ["<non_null_column_name>"],
    uniqueKey: ["<unique_key_column_name>"]
  },
  tags: ["<tag_1>", "<tag_2>"],
  bigquery: {
    labels: {
      label_definition_1: "label_definition_1",
      label_definition_2: "label_definition_2",
      label_definition_3: "label_definition_3"
    },
    partitionBy: "DATE_TRUNC(<column_name>, <DAY or MONTH>)"
  },
  uniqueKey: ["<unique_key_column_name, for incremental table refresh>"]
}
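An incremental table only saves cost if its query reads just the new rows on an incremental run. One way to keep that logic reusable is a small JavaScript helper, sketched below. The file name `includes/filters.js`, the function `incrementalFilter`, and its parameters are hypothetical and not part of Dataform; the sketch assumes the caller passes in the results of Dataform's built-in `incremental()` and `self()` functions from the SQLX file.

```javascript
// includes/filters.js (hypothetical file name)

// Return a WHERE clause that keeps only rows newer than what the target
// table already holds. On a full refresh it returns an empty string, so
// the query reads the whole source.
//   isIncremental: boolean, e.g. the result of Dataform's incremental()
//   column:        date or timestamp column used as the watermark
//   targetTable:   name of the table being built, e.g. the result of self()
function incrementalFilter(isIncremental, column, targetTable) {
  if (!isIncremental) {
    return "";
  }
  return `WHERE ${column} > (SELECT MAX(${column}) FROM ${targetTable})`;
}

module.exports = { incrementalFilter };
```

Inside an incremental SQLX file, the query could end with `${filters.incrementalFilter(incremental(), "updated_at", self())}`, where `updated_at` is a placeholder column name.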
Once you’ve defined your data models and pipelines, you can use Dataform to deploy them to GCP. With the open-source Dataform CLI, you configure your GCP credentials and then run the project, which creates or updates the corresponding tables and views in BigQuery:
# Configure your GCP credentials (older 2.x CLI releases: dataform init-creds bigquery)
dataform init-creds
# Deploy your data infrastructure by executing the project
dataform run
GCP Dataform is a powerful tool for managing your data infrastructure on GCP. With Dataform, you can easily create, test, and deploy data models and pipelines using a simple, SQL-like syntax. Dataform provides a range of features, including version control, testing, deployment, and monitoring, which makes it a versatile tool for various use cases. If you’re looking for a simple and powerful way to manage your data infrastructure on GCP, GCP Dataform is definitely worth considering.