Add snapshots to your DAG
Related documentation
What are snapshots?
Analysts often need to "look back in time" at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, snapshots, which records changes to a mutable table over time.
Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an orders table where the status field can be overwritten as the order is processed.
| Loading table... |
Now, imagine that the order goes from "pending" to "shipped". That same record will now look like:
| Loading table... |
This order is now in the "shipped" state, but we've lost the information about when the order was last in the "pending" state. This makes it difficult (or impossible) to analyze how long it took for an order to ship. dbt can "snapshot" these changes to help you understand how values in a row change over time. Here's an example of a snapshot table for the previous example:
| Loading table... |
Configuring snapshots
Configure your snapshots in YAML files to tell dbt how to detect record changes. Define snapshots configurations in YAML files, alongside your models, for a cleaner, faster, and more consistent set up. Place snapshot YAML files in the models directory or in a snapshots directory.
snapshots:
- name: string
relation: relation # source('my_source', 'my_table') or ref('my_model')
description: markdown_string
config:
database: string
schema: string
alias: string
strategy: timestamp | check
unique_key: column_name_or_expression
check_cols: [column_name] | all
updated_at: column_name
snapshot_meta_column_names: dictionary
dbt_valid_to_current: string
hard_deletes: ignore | invalidate | new_record
The following table outlines the configurations available for snapshots:
| Loading table... |
- In v1.9,
target_schemabecame optional, allowing snapshots to be environment-aware. By default, withouttarget_schemaortarget_databasedefined, snapshots now use thegenerate_schema_nameorgenerate_database_namemacros to determine where to build. - Developers can still set a custom location with
schemaanddatabaseconfigs, consistent with other resource types. - A number of other configurations are also supported (for example,
tagsandpost-hook). For the complete list, refer to Snapshot configurations. - You can configure snapshots from both the
dbt_project.ymlfile and aconfigblock. For more information, refer to the configuration docs.
Add a snapshot to your project
To add a snapshot to your project follow these steps. For users on versions 1.8 and earlier, refer to Legacy snapshot configurations.
-
Create a properties YAML file in your
snapshotsdirectory:snapshots/orders_snapshot.ymland add your configuration details. You can also configure your snapshot from your project YAML file (dbt_project.yml) (docs).snapshots/orders_snapshot.ymlsnapshots:
- name: orders_snapshot
relation: source('jaffle_shop', 'orders')
config:
schema: snapshots
database: analytics
unique_key: id
strategy: timestamp
updated_at: updated_at
dbt_valid_to_current: "to_date('9999-12-31')" # Specifies that current records should have `dbt_valid_to` set to `'9999-12-31'` instead of `NULL`. -
Since snapshots focus on configuration, the transformation logic is minimal. Typically, you'd select all data from the source. If you need to apply transformations (like filters, deduplication), it's best practice to define an ephemeral model and reference it in your snapshot configuration.
models/ephemeral_orders.sql{{ config(materialized='ephemeral') }}
select * from {{ source('jaffle_shop', 'orders') }} -
Check whether the result set of your query includes a reliable timestamp column that indicates when a record was last updated. For our example, the
updated_atcolumn reliably indicates record changes, so we can use thetimestampstrategy. If your query result set does not have a reliable timestamp, you'll need to instead use thecheckstrategy — more details on this below. -
Run the
dbt snapshotcommand — for our example, a new table will be created atanalytics.snapshots.orders_snapshot. Theschemaconfig will utilize thegenerate_schema_namemacro.$ dbt snapshot
Running with dbt=1.9.0
15:07:36 | Concurrency: 8 threads (target='dev')
15:07:36 |
15:07:36 | 1 of 1 START snapshot snapshots.orders_snapshot...... [RUN]
15:07:36 | 1 of 1 OK snapshot snapshots.orders_snapshot..........[SELECT 3 in 1.82s]
15:07:36 |
15:07:36 | Finished running 1 snapshots in 0.68s.
Completed successfully
Done. PASS=2 ERROR=0 SKIP=0 TOTAL=1 -
Inspect the results by selecting from the table dbt created (
analytics.snapshots.orders_snapshot). After the first run, you should see the results of your query, plus the snapshot meta fields as described later on. -
Run the
dbt snapshotcommand again and inspect the results. If any records have been updated, the snapshot should reflect this. -
Select from the
snapshotin downstream models using thereffunction.models/changed_orders.sqlselect * from {{ ref('orders_snapshot') }} -
Snapshots are only useful if you run them frequently — schedule the
dbt snapshotcommand to run regularly.
Configuration best practices
How snapshots work
When you run the dbt snapshot command:
- On the first run: dbt will create the initial snapshot table — this will be the result set of your
selectstatement, with additional columns includingdbt_valid_fromanddbt_valid_to. All records will have adbt_valid_to = nullor the value specified indbt_valid_to_currentif configured. - On subsequent runs: dbt will check which records have changed or if any new records have been created:
- The
dbt_valid_tocolumn will be updated for any existing records that have changed. - The updated record and any new records will be inserted into the snapshot table. These records will now have
dbt_valid_to = nullor the value configured indbt_valid_to_current.
- The
Note
- These column names can be customized to your team or organizational conventions using the snapshot_meta_column_names config.
- Use the
dbt_valid_to_currentconfig to set a custom indicator for the value ofdbt_valid_toin current snapshot records (like a future date such as9999-12-31). By default, this value isNULL. When set, dbt will use this specified value instead ofNULLfordbt_valid_tofor current records in the snapshot table. - Use the
hard_deletesconfig to track hard deletes by adding a new record when a row becomes "deleted" in source. Supported options areignore,invalidate, andnew_record.
Snapshots can be referenced in downstream models the same way as referencing models — by using the ref function.
Detecting row changes
Snapshot "strategies" define how dbt knows if a row has changed. There are two strategies built-in to dbt:
- Timestamp — Uses an
updated_atcolumn to determine if a row has changed. - Check — Compares a list of columns between their current and historical values to determine if a row has changed.
Timestamp strategy (recommended)
The timestamp strategy uses an updated_at field to determine if a row has changed. If the configured updated_at column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.
Why timestamp is recommended?
- Requires tracking only one column (
updated_at) - Automatically handles new or removed columns in the source table
- Less prone to errors when the table schema evolves over time (for example, if using the
checkstrategy, you might need to update thecheck_colsconfiguration)
The timestamp strategy requires the following configurations:
| Loading table... |
Example usage:
snapshots:
- name: orders_snapshot_timestamp
relation: source('jaffle_shop', 'orders')
config:
schema: snapshots
unique_key: id
strategy: timestamp
updated_at: updated_at
Check strategy
The check strategy is useful for tables which do not have a reliable updated_at column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.
The check strategy requires the following configurations:
| Loading table... |
The check snapshot strategy can be configured to track changes to all columns by supplying check_cols = 'all'. It is better to explicitly enumerate the columns that you want to check. Consider using a surrogate key to condense many columns into a single column.
Example usage
snapshots:
- name: orders_snapshot_check
relation: source('jaffle_shop', 'orders')
config:
schema: snapshots
unique_key: id
strategy: check
check_cols:
- status
- is_cancelled
Example usage with updated_at
When using the check strategy, dbt tracks changes by comparing values in check_cols. By default, dbt uses the timestamp to update dbt_updated_at, dbt_valid_from and dbt_valid_to fields. Optionally you can set an updated_at column:
- If
updated_atis configured, thecheckstrategy uses this column instead, as with the timestamp strategy. - If
updated_atvalue is null, dbt defaults to using the current timestamp.
Check out the following example, which shows how to use the check strategy with updated_at:
snapshots:
- name: orders_snapshot
relation: ref('stg_orders')
config:
schema: snapshots
unique_key: order_id
strategy: check
check_cols:
- status
- is_cancelled
updated_at: updated_at
In this example:
- If at least one of the specified
check_colschanges, the snapshot creates a new row. If theupdated_atcolumn has a value (is not null), the snapshot uses it; otherwise, it defaults to the timestamp. - If
updated_atisn’t set, then dbt automatically falls back to using the current timestamp to track changes. - Use this approach when your
updated_atcolumn isn't reliable for tracking record updates, but you still want to use it — rather than the snapshot's execution time — whenever row changes are detected.
Hard deletes (opt-in)
In dbt v1.9 and higher, the hard_deletes config replaces the invalidate_hard_deletes config to give you more control on how to handle deleted rows from the source. The hard_deletes config is not a separate strategy but an additional opt-in feature that can be used with any snapshot strategy.
The hard_deletes config has three options/fields:
| Loading table... |
Example usage
snapshots:
- name: orders_snapshot_hard_delete
relation: source('jaffle_shop', 'orders')
config:
schema: snapshots
unique_key: id
strategy: timestamp
updated_at: updated_at
hard_deletes: new_record # options are: 'ignore', 'invalidate', or 'new_record'
In this example, the hard_deletes: new_record config will add a new row for deleted records with the dbt_is_deleted column set to True.
Any restored records are added as new rows with the dbt_is_deleted field set to False.
The resulting table will look like this:
| Loading table... |
Snapshot meta-fields
Snapshot tables will be created as a clone of your source dataset, plus some additional meta-fields*.
- These column names can be customized to your team or organizational conventions using the
snapshot_meta_column_namesconfig. - Use the
dbt_valid_to_currentconfig to set a custom indicator for the value ofdbt_valid_toin current snapshot records (like a future date such as9999-12-31). By default, this value isNULL. When set, dbt will use this specified value instead ofNULLfordbt_valid_tofor current records in the snapshot table. - Use the
hard_deletesconfig to track deleted records as new rows with thedbt_is_deletedmeta field when using thehard_deletes='new_record'field.
| Loading table... |
All of these column names can be customized using the snapshot_meta_column_names config. Refer to this example for more details.
*The timestamps used for each column are subtly different depending on the strategy you use:
-
For the
timestampstrategy, the configuredupdated_atcolumn is used to populate thedbt_valid_from,dbt_valid_toanddbt_updated_atcolumns. -
For the
checkstrategy, the current timestamp is used to populate each column. If configured, thecheckstrategy uses theupdated_atcolumn instead, as with the timestamp strategy.
FAQs
Was this page helpful?
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.