
dbt Tutorial for Beginners: Your First Data Model in 15 Minutes

dbt (data build tool) is the standard way to transform data inside your warehouse using SQL. This tutorial gets you to your first working model in 15 minutes.

By PipeForge · 10 min read

If you've been working with data warehouses for more than a month, you've probably heard of dbt. It's become the default tool for transforming raw data inside BigQuery, Snowflake, or Redshift into clean, analysis-ready tables. This dbt tutorial for beginners cuts through the jargon and gets you to a working data model in 15 minutes — explaining why each piece exists, not just how to type it.

What Is dbt? (Plain-English Explanation)

dbt (data build tool) is an open-source framework that lets you write data transformations as SQL SELECT statements. You write the SQL, dbt wraps it in a CREATE TABLE AS or CREATE VIEW AS statement, runs it against your warehouse, and manages the dependencies between tables.
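
For instance, a model file containing nothing but a SELECT (table and column names here are illustrative) compiles to roughly this DDL when materialized as a view:

```sql
-- models/active_users.sql — what you write
SELECT user_id, last_seen_at
FROM raw.users
WHERE is_deleted = FALSE

-- What dbt roughly runs against the warehouse for a view materialization:
-- CREATE OR REPLACE VIEW analytics.active_users AS
-- SELECT user_id, last_seen_at
-- FROM raw.users
-- WHERE is_deleted = FALSE
```

You never write the DDL yourself; dbt generates it based on the model's materialization.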

Before dbt, data transformations happened in a scattered mess of stored procedures, ad-hoc scripts, and BI tool calculated fields. dbt brought software engineering practices to SQL: version control, testing, documentation, and modular reuse.

Key mental model: dbt doesn't move data. It transforms data that's already in your warehouse. You still need an ETL/ELT tool (like PipeForge) to get raw data in first, then dbt cleans and structures it.

Core dbt Concepts Every Beginner Needs to Know

Models

A model is a single SQL file (ending in .sql) that contains a SELECT statement. dbt runs each model and materializes the result as a table or view in your warehouse. A typical dbt project has dozens or hundreds of models organized into layers.

Sources

Sources are the raw tables that your ETL pipeline has loaded into the warehouse. You declare them in a sources.yml file so dbt knows where the raw data lives and can track freshness (alerting you if a table hasn't updated in the expected timeframe).

Refs

The ref() function is how dbt models reference each other. Instead of writing FROM my_schema.orders, you write FROM {{ ref('stg_orders') }}. This lets dbt build a dependency graph and run models in the right order.

Tests

dbt has built-in data tests: not_null, unique, accepted_values, and relationships. You declare them in a YAML file and dbt runs them after each build, failing the run if data quality violations are found.
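
The accepted_values and relationships tests are declared the same way as not_null and unique — a sketch, with illustrative model, column, and value names:

```yaml
models:
  - name: stg_orders
    columns:
      - name: financial_status
        tests:
          # Fail the run if any row has a status outside this list
          - accepted_values:
              values: ['paid', 'pending', 'refunded', 'voided']
      - name: customer_id
        tests:
          # Every order's customer_id must exist in stg_customers
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```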

Materializations

A materialization controls how dbt persists a model: as a view (recomputed on each query), a table (physically built and stored on each run), or incremental (only new or changed rows are processed each run instead of rebuilding the whole table). Most models start as views; compute-heavy ones become tables.
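
You set the materialization per model with a config() block at the top of the file (or project-wide in dbt_project.yml) — a sketch using names from the example later in this post:

```sql
-- models/marts/customer_summary.sql
-- Override the default view materialization for this one model
{{ config(materialized='table') }}

SELECT
  customer_id,
  COUNT(*) AS total_orders
FROM {{ ref('stg_orders') }}
GROUP BY 1
```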

The Typical dbt Project Structure

text
my_dbt_project/
├── models/
│   ├── staging/          # Clean raw source data, one model per source table
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   ├── intermediate/     # Business logic joins and aggregations
│   │   └── int_customer_orders.sql
│   └── marts/            # Final analysis-ready tables for BI tools
│       └── customer_summary.sql
├── sources.yml           # Declares raw source tables
├── schema.yml            # Column descriptions and tests
└── dbt_project.yml       # Project config
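
A minimal dbt_project.yml for this layout might look like the following sketch (the project and profile names are assumptions):

```yaml
# dbt_project.yml
name: my_dbt_project
version: '1.0.0'
profile: my_warehouse        # must match an entry in profiles.yml

model-paths: ["models"]

models:
  my_dbt_project:
    staging:
      +materialized: view    # cheap to rebuild, always fresh
    marts:
      +materialized: table   # queried often by BI tools
```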

Your First dbt Model: Orders to Customer Summary

Let's walk through a concrete example. You have raw Shopify orders loaded into BigQuery by your ETL pipeline. You want a customer_summary table that shows each customer's total orders, revenue, and first/last order date.

Step 1: Declare your source

yaml
# sources.yml
version: 2

sources:
  - name: shopify_raw
    database: your_project
    schema: shopify
    tables:
      - name: orders
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 25, period: hour}
          error_after: {count: 49, period: hour}

Step 2: Write a staging model

sql
-- models/staging/stg_orders.sql
-- Clean and type-cast the raw orders table

SELECT
  id                                        AS order_id,
  customer_id,
  CAST(created_at AS TIMESTAMP)             AS created_at,
  CAST(total_price AS NUMERIC)              AS total_price,
  financial_status,
  fulfillment_status,
  CAST(total_refunded AS NUMERIC)           AS total_refunded
FROM {{ source('shopify_raw', 'orders') }}
WHERE financial_status != 'voided'

Step 3: Write the mart model

sql
-- models/marts/customer_summary.sql
-- One row per customer with key lifetime metrics

SELECT
  customer_id,
  COUNT(order_id)                           AS total_orders,
  SUM(total_price - total_refunded)         AS net_revenue,
  MIN(created_at)                           AS first_order_at,
  MAX(created_at)                           AS last_order_at,
  DATE_DIFF(
    CAST(MAX(created_at) AS DATE),
    CAST(MIN(created_at) AS DATE),
    DAY
  )                                         AS customer_lifespan_days
FROM {{ ref('stg_orders') }}
GROUP BY 1

Step 4: Add tests

yaml
# schema.yml
version: 2

models:
  - name: customer_summary
    description: "One row per customer with lifetime order metrics"
    columns:
      - name: customer_id
        description: "Shopify customer ID"
        tests:
          - not_null
          - unique
      - name: total_orders
        tests:
          - not_null
      - name: net_revenue
        tests:
          - not_null

Step 5: Run it

bash
# Run all models
dbt run

# Run and test
dbt build

# Run only the customer_summary model and its dependencies
dbt run --select +customer_summary

Traditional dbt Setup vs. AI-Generated dbt with PipeForge

The traditional dbt setup requires installing the dbt CLI, configuring a profiles.yml with your warehouse credentials, initializing a project, writing all the models, and setting up a scheduler (dbt Cloud or Airflow). That's a multi-hour setup even for an experienced engineer.
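
For a sense of scale, the profiles.yml step alone looks roughly like this for BigQuery — project, dataset, and keyfile path are placeholders:

```yaml
# ~/.dbt/profiles.yml
my_warehouse:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your_project        # GCP project ID
      dataset: analytics           # schema dbt writes models into
      keyfile: /path/to/key.json   # service account credentials
      threads: 4
```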

PipeForge can generate dbt SQL models as part of a pipeline description. If you tell PipeForge "transform my raw Shopify orders into a customer_summary mart with total orders, net revenue, and first/last order date", the AI generates the staging model, the mart model, the sources YAML, and the schema tests — ready to deploy or download.

| Approach | Time to First Model | Requires | Best For |
| --- | --- | --- | --- |
| dbt CLI (manual) | 2–4 hours | CLI setup, Python, warehouse credentials, SQL knowledge | Data engineers who want full control |
| dbt Cloud | 45 minutes | Account, warehouse credentials, SQL knowledge | Teams that want managed orchestration |
| PipeForge (AI-generated) | 10 minutes | Plain-English description, warehouse credentials | Analysts and ops teams without dbt experience |

What to Learn Next After This dbt Tutorial

  • Incremental models: learn how to process only new rows each run instead of rebuilding full tables
  • Macros: reusable Jinja-templated SQL snippets for common patterns across your project
  • Packages: dbt-utils and dbt-expectations add dozens of helper functions and test types
  • Exposures: document which dashboards depend on which models so you know the blast radius of a change
  • dbt Semantic Layer: define metrics centrally so every BI tool calculates them the same way
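
As a preview of the first item, an incremental version of the staging model might look like this sketch — the is_incremental() block limits each run to rows newer than what is already in the table:

```sql
-- models/staging/stg_orders.sql (incremental sketch, column names as above)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
  id                            AS order_id,
  customer_id,
  CAST(created_at AS TIMESTAMP) AS created_at
FROM {{ source('shopify_raw', 'orders') }}

{% if is_incremental() %}
  -- Only process rows newer than the latest row already loaded
  WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on later runs the WHERE clause kicks in and only new rows are inserted or merged on order_id.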

If you haven't yet moved raw data into your warehouse, start with our guide on no-code ETL pipelines — you need the data in BigQuery or Snowflake before dbt can transform it.

Generate dbt models with AI — no CLI setup needed

PipeForge can generate your dbt staging and mart models from a plain-English description. Connect your warehouse, describe what you want, and deploy in minutes.

Try PipeForge free
