Advanced SQL Queries for Data Analysis: Window Functions, CTEs, and Performance Tuning
This tutorial is a hands-on, command-heavy guide to writing advanced SQL for analytics. You’ll learn how to use window functions for ranking and time-series analysis, how to structure complex logic with CTEs (including recursive CTEs), and how to tune performance using indexes, query rewrites, and execution-plan thinking.
Examples focus on PostgreSQL syntax (because it’s widely used and standards-friendly), but most concepts translate to SQL Server, Oracle, and modern MySQL (8.0+ supports window functions and CTEs). Where syntax differs, it’s called out.
Table of Contents
- Sample Dataset (Schema + Seed Data)
- Window Functions: The Core Analytics Toolkit
- CTEs: Readable, Modular SQL
- Performance Tuning for Analytical Queries
- Putting It Together: A Full Analytical Query
Sample Dataset (Schema + Seed Data)
To make the examples concrete, we’ll use a small e-commerce schema. You can run these commands in PostgreSQL.
-- Drop tables if they exist (careful in real environments)
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
customer_id BIGSERIAL PRIMARY KEY,
customer_name TEXT NOT NULL,
region TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE products (
product_id BIGSERIAL PRIMARY KEY,
product_name TEXT NOT NULL,
category TEXT NOT NULL,
unit_price NUMERIC(12,2) NOT NULL
);
CREATE TABLE orders (
order_id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL REFERENCES customers(customer_id),
order_date DATE NOT NULL,
status TEXT NOT NULL CHECK (status IN ('placed','paid','shipped','cancelled'))
);
CREATE TABLE order_items (
order_item_id BIGSERIAL PRIMARY KEY,
order_id BIGINT NOT NULL REFERENCES orders(order_id),
product_id BIGINT NOT NULL REFERENCES products(product_id),
quantity INT NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(12,2) NOT NULL
);
-- Helpful indexes for analytics
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
-- Seed data
INSERT INTO customers (customer_name, region, created_at) VALUES
('Aster Corp','North','2024-01-10'),
('Beryl LLC','South','2024-02-05'),
('Cobalt Inc','North','2024-02-20'),
('Dune Co','West','2024-03-01');
INSERT INTO products (product_name, category, unit_price) VALUES
('Notebook','Office',4.50),
('Pen','Office',1.20),
('Mug','Home',8.00),
('Lamp','Home',22.00),
('Headphones','Electronics',55.00);
INSERT INTO orders (customer_id, order_date, status) VALUES
(1,'2024-03-01','paid'),
(1,'2024-03-15','shipped'),
(2,'2024-03-16','paid'),
(3,'2024-03-20','cancelled'),
(2,'2024-04-02','shipped'),
(4,'2024-04-10','paid');
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
(1,1,10,4.50),
(1,2,20,1.20),
(2,3,2,8.00),
(2,5,1,55.00),
(3,1,5,4.50),
(4,4,1,22.00),
(5,5,2,55.00),
(6,2,50,1.20);
A common pattern in analytics is to create a “fact” query that computes order revenue:
SELECT
o.order_id,
o.customer_id,
o.order_date,
o.status,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id, o.customer_id, o.order_date, o.status;
We’ll reuse this idea throughout.
Window Functions: The Core Analytics Toolkit
Window functions let you compute values across a set of rows related to the current row without collapsing results like GROUP BY does.
PARTITION BY vs GROUP BY
- GROUP BY reduces rows: one output row per group.
- PARTITION BY keeps rows: each row gets a computed value based on its partition.
Example: show each order and the customer’s total revenue (across all orders) on every row.
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped') -- exclude cancelled
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
order_id,
customer_id,
order_date,
order_revenue,
SUM(order_revenue) OVER (PARTITION BY customer_id) AS customer_lifetime_revenue
FROM order_revenue
ORDER BY customer_id, order_date;
This is a classic analytical layout: detailed rows plus “context” metrics.
Ranking: ROW_NUMBER, RANK, DENSE_RANK
Ranking functions answer questions like “What is the customer’s largest order?” or “Which products are top sellers?”
Differences:
- ROW_NUMBER() assigns unique sequential numbers (ties get different numbers).
- RANK() leaves gaps after ties (1, 1, 3).
- DENSE_RANK() leaves no gaps after ties (1, 1, 2).
Example: rank orders by revenue within each customer.
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
customer_id,
order_id,
order_date,
order_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC) AS rn,
RANK() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC) AS drnk
FROM order_revenue
ORDER BY customer_id, rn;
Use ROW_NUMBER() when you need “pick exactly one row.” Use RANK()/DENSE_RANK() when ties matter.
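For the "pick exactly one row" case, PostgreSQL also offers DISTINCT ON as a concise alternative to the ROW_NUMBER() pattern (the DISTINCT ON column list must match the leading ORDER BY expressions). A sketch, reusing the same order_revenue fact layer:

```sql
WITH order_revenue AS (
  SELECT o.order_id, o.customer_id, o.order_date,
         SUM(oi.quantity * oi.unit_price) AS order_revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.order_id
  WHERE o.status IN ('paid','shipped')
  GROUP BY o.order_id, o.customer_id, o.order_date
)
-- One row per customer: the largest order (latest date breaks ties)
SELECT DISTINCT ON (customer_id)
  customer_id, order_id, order_date, order_revenue
FROM order_revenue
ORDER BY customer_id, order_revenue DESC, order_date DESC;
```

DISTINCT ON is PostgreSQL-specific; the ROW_NUMBER() pattern is the portable choice.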
Top-N per Group
A frequent analytics request: “Top 2 orders per customer” or “Top 3 products per category.”
Top 2 orders per customer:
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
),
ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC, order_date DESC) AS rn
FROM order_revenue
)
SELECT
customer_id, order_id, order_date, order_revenue
FROM ranked
WHERE rn <= 2
ORDER BY customer_id, rn;
This pattern (CTE + window rank + filter) is the most portable and readable approach.
Running Totals and Moving Averages
Window frames let you define “which rows count” around the current row.
Running total (cumulative revenue per customer over time)
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
customer_id,
order_date,
order_id,
order_revenue,
SUM(order_revenue) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_revenue
FROM order_revenue
ORDER BY customer_id, order_date, order_id;
Key ideas:
- ORDER BY inside OVER (...) defines the sequence.
- ROWS BETWEEN ... defines the frame. For cumulative sums, use UNBOUNDED PRECEDING to CURRENT ROW.
3-order moving average (per customer)
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
customer_id,
order_date,
order_id,
order_revenue,
AVG(order_revenue) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_last_3_orders
FROM order_revenue
ORDER BY customer_id, order_date, order_id;
Use ROWS for “last N rows.” Use RANGE for “last N units of time” (but be careful: RANGE semantics depend on the data type and can include ties in ways that surprise people).
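For illustration, a time-based trailing window might look like this (PostgreSQL 11+ supports RANGE frames with an offset when the ORDER BY column is a date, timestamp, or numeric type). A sketch, assuming the order_revenue CTE from above:

```sql
-- Trailing three calendar months by time, not by row count
SELECT
  customer_id,
  order_date,
  order_revenue,
  SUM(order_revenue) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '3 months' PRECEDING AND CURRENT ROW
  ) AS revenue_last_3_months
FROM order_revenue
ORDER BY customer_id, order_date;
```

Note that with RANGE, rows sharing the same order_date are peers and are always included together in the frame.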
LAG/LEAD for Period-over-Period Analysis
LAG() and LEAD() let you reference previous/next rows in a window ordering. This is essential for growth rates, churn signals, and anomaly detection.
Example: month-over-month revenue per region.
WITH monthly_region_revenue AS (
SELECT
c.region,
date_trunc('month', o.order_date)::date AS month_start,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY c.region, date_trunc('month', o.order_date)::date
)
SELECT
region,
month_start,
revenue,
LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) AS prev_month_revenue,
revenue - LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) AS delta,
CASE
WHEN LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) IS NULL THEN NULL
WHEN LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) = 0 THEN NULL
ELSE (revenue / LAG(revenue) OVER (PARTITION BY region ORDER BY month_start)) - 1
END AS growth_rate
FROM monthly_region_revenue
ORDER BY region, month_start;
Notes:
- The CTE aggregates first (monthly totals), then the window function compares months.
- Growth rate needs divide-by-zero protection.
FIRST_VALUE/LAST_VALUE and Window Frames
FIRST_VALUE() and LAST_VALUE() are often misunderstood because the default frame matters.
Suppose you want, for each customer order, the first order date and the most recent order date.
WITH customer_orders AS (
SELECT DISTINCT
o.customer_id,
o.order_id,
o.order_date
FROM orders o
WHERE o.status IN ('paid','shipped')
)
SELECT
customer_id,
order_id,
order_date,
FIRST_VALUE(order_date) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_order_date,
LAST_VALUE(order_date) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_order_date
FROM customer_orders
ORDER BY customer_id, order_date, order_id;
Why specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING?
- Without it, LAST_VALUE() returns the last value of the default frame, which (when ORDER BY is present) ends at the current row and its peers, not at the partition end.
Alternative: in PostgreSQL you can also use MAX(order_date) OVER (PARTITION BY customer_id) for the last date when you only need the maximum value, not the last value under a specific ordering with tie-breakers.
Percentiles and Distribution
Percentiles help you understand distributions (e.g., “median order value” or “95th percentile”).
PostgreSQL supports ordered-set aggregates like percentile_cont.
Median order revenue overall:
WITH order_revenue AS (
SELECT
o.order_id,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY order_revenue) AS median_order_revenue,
percentile_cont(0.95) WITHIN GROUP (ORDER BY order_revenue) AS p95_order_revenue
FROM order_revenue;
Percentiles per region:
WITH order_revenue AS (
SELECT
o.order_id,
c.region,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, c.region
)
SELECT
region,
percentile_cont(0.5) WITHIN GROUP (ORDER BY order_revenue) AS median_order_revenue
FROM order_revenue
GROUP BY region
ORDER BY region;
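Beyond exact percentiles, window functions can bucket a distribution directly. For example, NTILE(4) assigns customers to revenue quartiles (a sketch built on the same fact-layer idea):

```sql
WITH customer_revenue AS (
  SELECT o.customer_id,
         SUM(oi.quantity * oi.unit_price) AS total_revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.order_id
  WHERE o.status IN ('paid','shipped')
  GROUP BY o.customer_id
)
SELECT
  customer_id,
  total_revenue,
  NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile  -- 1 = top quartile
FROM customer_revenue
ORDER BY revenue_quartile, total_revenue DESC;
```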
CTEs: Readable, Modular SQL
CTEs (WITH ...) let you build queries in stages. This improves readability and makes debugging easier.
Basic CTEs and Multi-Stage Pipelines
A common analytics task: compute customer KPIs (orders, revenue, average order value, first/last order date).
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
),
customer_kpis AS (
SELECT
customer_id,
COUNT(*) AS orders_count,
SUM(order_revenue) AS total_revenue,
AVG(order_revenue) AS avg_order_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date
FROM order_revenue
GROUP BY customer_id
)
SELECT
c.customer_id,
c.customer_name,
c.region,
k.orders_count,
k.total_revenue,
k.avg_order_value,
k.first_order_date,
k.last_order_date
FROM customers c
LEFT JOIN customer_kpis k ON k.customer_id = c.customer_id
ORDER BY k.total_revenue DESC NULLS LAST;
Why this structure works well:
- order_revenue is a reusable “fact layer.”
- customer_kpis is the aggregated “metric layer.”
- The final SELECT joins dimensions (customers) to metrics.
CTEs vs Subqueries vs Views
CTEs are best when:
- You want a clear multi-step pipeline.
- You need to reuse a derived dataset multiple times in one query.
- You want to isolate logic for easier testing.
Subqueries can be best when:
- The derived dataset is used once and the query is simple.
- You want the optimizer to inline aggressively (this depends on the database).
Views are best when:
- You want a stable interface for BI tools.
- You reuse the same logic across many queries.
- You want permission boundaries (grant access to a view, not raw tables).
Important PostgreSQL note: Historically, CTEs were optimization fences (materialized by default). In modern PostgreSQL versions, many CTEs can be inlined, but you can still control behavior:
WITH order_revenue AS MATERIALIZED (
SELECT ...
)
SELECT ...
or
WITH order_revenue AS NOT MATERIALIZED (
SELECT ...
)
SELECT ...
Use MATERIALIZED when computing the CTE is expensive but reused multiple times and you want to avoid recomputation. Use NOT MATERIALIZED when you want the optimizer to push filters down into the CTE.
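A concrete sketch of when NOT MATERIALIZED helps: with an inlined CTE, the outer filter on customer_id can be pushed down so only that customer's orders are scanned and aggregated, instead of computing revenue for every order first:

```sql
WITH order_revenue AS NOT MATERIALIZED (
  SELECT o.order_id, o.customer_id,
         SUM(oi.quantity * oi.unit_price) AS order_revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.order_id
  WHERE o.status IN ('paid','shipped')
  GROUP BY o.order_id, o.customer_id
)
SELECT *
FROM order_revenue
WHERE customer_id = 1;  -- with inlining, this predicate can reach the scan of orders
```

Verify with EXPLAIN that the filter actually appears on the scan node rather than above the aggregate.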
Recursive CTEs: Hierarchies and Sequences
Recursive CTEs solve problems like:
- Organizational charts (manager → employee)
- Category trees
- Generating date series (though PostgreSQL also has generate_series)
Example: Product category hierarchy (toy example)
Create a category table:
DROP TABLE IF EXISTS categories;
CREATE TABLE categories (
category_id BIGSERIAL PRIMARY KEY,
category_name TEXT NOT NULL,
parent_category_id BIGINT REFERENCES categories(category_id)
);
INSERT INTO categories (category_name, parent_category_id) VALUES
('All', NULL),
('Office', 1),
('Home', 1),
('Electronics', 1),
('Writing', 2),
('Lighting', 3);
Now fetch the full tree with depth:
WITH RECURSIVE cat_tree AS (
SELECT
category_id,
category_name,
parent_category_id,
0 AS depth,
category_name::text AS path
FROM categories
WHERE parent_category_id IS NULL
UNION ALL
SELECT
c.category_id,
c.category_name,
c.parent_category_id,
t.depth + 1 AS depth,
(t.path || ' > ' || c.category_name) AS path
FROM categories c
JOIN cat_tree t ON t.category_id = c.parent_category_id
)
SELECT
category_id,
category_name,
parent_category_id,
depth,
path
FROM cat_tree
ORDER BY path;
Key ideas:
- The “anchor” query selects roots.
- The recursive member joins children to parents.
- Building a path string is a helpful technique for ordering and debugging.
Example: Generate a monthly calendar and left join revenue
This is useful when you need months with zero revenue to still appear.
WITH RECURSIVE months AS (
SELECT date '2024-01-01' AS month_start
UNION ALL
SELECT (month_start + interval '1 month')::date
FROM months
WHERE month_start < date '2024-06-01'
),
monthly_revenue AS (
SELECT
date_trunc('month', o.order_date)::date AS month_start,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY date_trunc('month', o.order_date)::date
)
SELECT
m.month_start,
COALESCE(r.revenue, 0) AS revenue
FROM months m
LEFT JOIN monthly_revenue r USING (month_start)
ORDER BY m.month_start;
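The recursive months CTE is portable across databases, but in PostgreSQL generate_series produces the same month spine with less ceremony:

```sql
-- Equivalent month spine using generate_series
SELECT gs::date AS month_start
FROM generate_series(date '2024-01-01', date '2024-06-01', interval '1 month') AS gs;
```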
Performance Tuning for Analytical Queries
Analytics queries often scan lots of rows, join large tables, and compute heavy aggregates. Performance tuning is about:
- Reading fewer rows
- Joining efficiently
- Aggregating efficiently
- Avoiding unnecessary sorts
- Using the optimizer effectively
Execution Plans: EXPLAIN and EXPLAIN ANALYZE
Start with the plan. In PostgreSQL:
EXPLAIN
SELECT
o.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
AND o.order_date >= date '2024-03-01'
AND o.order_date < date '2024-05-01'
GROUP BY o.customer_id;
To measure real timing (in a safe environment):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
o.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
AND o.order_date >= date '2024-03-01'
AND o.order_date < date '2024-05-01'
GROUP BY o.customer_id;
What to look for:
- Seq Scan vs Index Scan (sequential scans can be fine for large portions of a table, but not always)
- Join type: Nested Loop, Hash Join, Merge Join
- Sort nodes (expensive sorts can dominate runtime)
- Buffers read/hit (I/O vs cache)
Indexing Strategy for Analytics
Indexes help most when:
- You filter on a column (WHERE)
- You join on a column (JOIN)
- You order by a column (ORDER BY) and can use an index-only scan or avoid sorting
Composite indexes
If you frequently filter by date range and status:
CREATE INDEX idx_orders_status_date ON orders(status, order_date);
If you frequently filter by customer and date:
CREATE INDEX idx_orders_customer_status_date ON orders(customer_id, status, order_date);
Be mindful: too many indexes slow down writes. Pick indexes based on actual workload.
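One way to ground that decision in evidence is PostgreSQL's statistics views. This sketch lists indexes that have never been scanned since statistics were last reset (a never-used index may be a write-amplification cost with no read benefit):

```sql
-- Candidate indexes to review for removal
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;
```

Interpret with care: an index may still back a constraint or serve rare but critical queries.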
Covering indexes (PostgreSQL INCLUDE)
If you filter by order_date but also need customer_id without hitting the table:
CREATE INDEX idx_orders_date_include_customer
ON orders(order_date)
INCLUDE (customer_id, status);
This can enable index-only scans if visibility map conditions are met.
Partial indexes
If most queries only care about paid/shipped orders, a partial index can be extremely effective:
CREATE INDEX idx_orders_active_date
ON orders(order_date, customer_id)
WHERE status IN ('paid','shipped');
This index is smaller and more selective than indexing all statuses.
Sargability and Predicate Pushdown
A predicate is sargable if the database can use an index efficiently.
Bad (non-sargable) pattern:
SELECT *
FROM orders
WHERE date_trunc('month', order_date) = date '2024-03-01';
This applies a function to the column, often preventing index use.
Better:
SELECT *
FROM orders
WHERE order_date >= date '2024-03-01'
AND order_date < date '2024-04-01';
Similarly, avoid wrapping indexed columns in CAST, COALESCE, or arithmetic in the WHERE clause when you can rewrite it.
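A before/after sketch of the same principle with a cast:

```sql
-- Non-sargable: casting the column to text prevents index use on order_date
SELECT order_id FROM orders
WHERE order_date::text LIKE '2024-03%';

-- Sargable rewrite: compare the column directly against a range
SELECT order_id FROM orders
WHERE order_date >= date '2024-03-01'
  AND order_date < date '2024-04-01';
```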
Avoiding Common Anti-Patterns
1) SELECT * in analytical joins
SELECT * can force reading wide rows and unnecessary columns. Prefer selecting only what you need.
SELECT o.order_id, o.order_date, oi.product_id, oi.quantity
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id;
2) Joining before filtering
Filter early to reduce join input sizes.
Less optimal:
SELECT ...
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.order_date >= date '2024-03-01';
Often better (especially when it helps the optimizer and reduces rows early):
WITH filtered_orders AS (
SELECT order_id, customer_id, order_date
FROM orders
WHERE order_date >= date '2024-03-01'
AND order_date < date '2024-04-01'
AND status IN ('paid','shipped')
)
SELECT
fo.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM filtered_orders fo
JOIN order_items oi ON oi.order_id = fo.order_id
GROUP BY fo.customer_id;
This also improves readability: “first pick relevant orders, then compute metrics.”
3) DISTINCT as a band-aid
If you need DISTINCT to fix duplicates after joins, it often indicates a modeling or join-key issue. Diagnose why duplicates appear rather than masking them.
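A quick diagnostic before reaching for DISTINCT: verify that the join key you assume is unique on the "one" side really is. A sketch:

```sql
-- Expect zero rows; any output explains the fan-out after joining on customer_id
SELECT customer_id, COUNT(*) AS copies
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```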
Partitioning and Incremental Aggregation
For large datasets (millions to billions of rows), consider:
Table partitioning by date
PostgreSQL declarative partitioning example:
-- Example structure; do not run blindly in production
CREATE TABLE orders_big (
order_id BIGSERIAL,
customer_id BIGINT NOT NULL,
order_date DATE NOT NULL,
status TEXT NOT NULL,
PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_big_2024_03 PARTITION OF orders_big
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
CREATE TABLE orders_big_2024_04 PARTITION OF orders_big
FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');
Benefits:
- Queries with date filters can prune partitions (scan fewer chunks).
- Maintenance (vacuum, archiving) can be easier.
Trade-offs:
- More complex DDL and index management.
- Poor partition key choice can hurt performance.
Incremental aggregates
Instead of recomputing monthly revenue from raw fact tables every time, store aggregated results:
CREATE TABLE monthly_customer_revenue (
month_start DATE NOT NULL,
customer_id BIGINT NOT NULL,
revenue NUMERIC(12,2) NOT NULL,
PRIMARY KEY (month_start, customer_id)
);
Then update incrementally (daily job) rather than scanning all history.
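A daily job might recompute only the current month and upsert the result with PostgreSQL's INSERT ... ON CONFLICT (a sketch; historical months are assumed immutable):

```sql
-- Recompute the current month-to-date totals and overwrite existing rows
INSERT INTO monthly_customer_revenue (month_start, customer_id, revenue)
SELECT
  date_trunc('month', o.order_date)::date AS month_start,
  o.customer_id,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
  AND o.order_date >= date_trunc('month', current_date)::date
GROUP BY 1, 2
ON CONFLICT (month_start, customer_id)
DO UPDATE SET revenue = EXCLUDED.revenue;
```

Because the job recomputes the whole month each run, it is idempotent: rerunning it after a failure is safe.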
Materialized Views and Precomputation
Materialized views store query results physically.
CREATE MATERIALIZED VIEW mv_order_revenue AS
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date;
-- Refresh when data changes (schedule it)
REFRESH MATERIALIZED VIEW mv_order_revenue;
You can index a materialized view:
CREATE INDEX idx_mv_order_revenue_customer_date
ON mv_order_revenue(customer_id, order_date);
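If dashboards query the view while it refreshes, REFRESH MATERIALIZED VIEW CONCURRENTLY avoids locking out readers; it requires a unique index on the view:

```sql
CREATE UNIQUE INDEX idx_mv_order_revenue_order
  ON mv_order_revenue(order_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue;
```

The trade-off: concurrent refresh is slower than a plain refresh because it diffs the new result against the old contents.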
When to use:
- Heavy transformations used by many dashboards
- Data changes in batches (nightly ETL), not constantly
- You need predictable query latency
Putting It Together: A Full Analytical Query
Let’s build a query that answers:
For each region and month, show revenue, previous month revenue, growth rate, and the top product by revenue in that region-month.
This combines:
- CTE pipeline
- Window functions (LAG, ranking)
- Careful grouping and joining
WITH filtered_orders AS (
SELECT
o.order_id,
o.customer_id,
o.order_date
FROM orders o
WHERE o.status IN ('paid','shipped')
AND o.order_date >= date '2024-03-01'
AND o.order_date < date '2024-06-01'
),
line_facts AS (
SELECT
fo.order_id,
fo.customer_id,
fo.order_date,
c.region,
oi.product_id,
(oi.quantity * oi.unit_price) AS line_revenue
FROM filtered_orders fo
JOIN customers c ON c.customer_id = fo.customer_id
JOIN order_items oi ON oi.order_id = fo.order_id
),
region_month_revenue AS (
SELECT
region,
date_trunc('month', order_date)::date AS month_start,
SUM(line_revenue) AS revenue
FROM line_facts
GROUP BY region, date_trunc('month', order_date)::date
),
region_month_product AS (
SELECT
region,
date_trunc('month', order_date)::date AS month_start,
product_id,
SUM(line_revenue) AS product_revenue
FROM line_facts
GROUP BY region, date_trunc('month', order_date)::date, product_id
),
ranked_products AS (
SELECT
rmp.*,
ROW_NUMBER() OVER (
PARTITION BY region, month_start
ORDER BY product_revenue DESC, product_id
) AS rn
FROM region_month_product rmp
)
SELECT
rmr.region,
rmr.month_start,
rmr.revenue,
LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start) AS prev_month_revenue,
CASE
WHEN LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start) IS NULL THEN NULL
WHEN LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start) = 0 THEN NULL
ELSE (rmr.revenue / LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start)) - 1
END AS growth_rate,
p.product_name AS top_product,
rp.product_revenue AS top_product_revenue
FROM region_month_revenue rmr
LEFT JOIN ranked_products rp
ON rp.region = rmr.region
AND rp.month_start = rmr.month_start
AND rp.rn = 1
LEFT JOIN products p ON p.product_id = rp.product_id
ORDER BY rmr.region, rmr.month_start;
Why this performs reasonably well (and how to improve it at scale):
- We filter orders early (filtered_orders) to reduce join volume.
- We compute line-level facts once (line_facts) and reuse them.
- We aggregate at the needed grains: region-month and region-month-product.
- At scale, consider:
  - Indexes on orders(status, order_date) or partial indexes for active statuses
  - Partitioning orders by date
  - Materializing line_facts or order_revenue as a materialized view if reused widely
Practical Checklist
When writing advanced analytical SQL:
- Define the grain: What is one row in your result (order, customer-month, region-day)?
- Filter early: Reduce rows before expensive joins/aggregations.
- Use window functions for “context metrics” (running totals, ranks, comparisons).
- Use CTEs to build a pipeline of transformations you can reason about.
- Inspect plans with EXPLAIN (ANALYZE, BUFFERS) and tune based on evidence.
- Index for your predicates and joins, not for every column.
- Avoid non-sargable filters and unnecessary sorts.
- Precompute (materialized views / aggregate tables) when latency matters and data changes in batches.
If you want, share:
- your database engine (PostgreSQL/MySQL/SQL Server/etc.),
- approximate table sizes,
- and one slow query,
and I can rewrite it using the patterns above and explain the plan-level improvements.