Advanced SQL Queries for Data Analysis: Window Functions, CTEs, and Performance Tuning
This tutorial is a hands-on, command-heavy guide to writing advanced SQL for analytics. You’ll learn how to use window functions for ranking and time-series analysis, how to structure complex logic with CTEs (including recursive CTEs), and how to tune performance using indexes, query rewrites, and execution-plan thinking.
Examples focus on PostgreSQL syntax (because it’s widely used and standards-friendly), but most concepts translate to SQL Server, Oracle, and modern MySQL (8.0+ supports window functions and CTEs). Where syntax differs, it’s called out.
Table of Contents
- Sample Dataset (Schema + Seed Data)
- Window Functions: The Core Analytics Toolkit
- CTEs: Readable, Modular SQL
- Performance Tuning for Analytical Queries
- Putting It Together: A Full Analytical Query
Sample Dataset (Schema + Seed Data)
To make the examples concrete, we’ll use a small e-commerce schema. You can run these commands in PostgreSQL.
-- Drop tables if they exist (careful in real environments)
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
customer_id BIGSERIAL PRIMARY KEY,
customer_name TEXT NOT NULL,
region TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE products (
product_id BIGSERIAL PRIMARY KEY,
product_name TEXT NOT NULL,
category TEXT NOT NULL,
unit_price NUMERIC(12,2) NOT NULL
);
CREATE TABLE orders (
order_id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL REFERENCES customers(customer_id),
order_date DATE NOT NULL,
status TEXT NOT NULL CHECK (status IN ('placed','paid','shipped','cancelled'))
);
CREATE TABLE order_items (
order_item_id BIGSERIAL PRIMARY KEY,
order_id BIGINT NOT NULL REFERENCES orders(order_id),
product_id BIGINT NOT NULL REFERENCES products(product_id),
quantity INT NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(12,2) NOT NULL
);
-- Helpful indexes for analytics
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
-- Seed data
INSERT INTO customers (customer_name, region, created_at) VALUES
('Aster Corp','North','2024-01-10'),
('Beryl LLC','South','2024-02-05'),
('Cobalt Inc','North','2024-02-20'),
('Dune Co','West','2024-03-01');
INSERT INTO products (product_name, category, unit_price) VALUES
('Notebook','Office',4.50),
('Pen','Office',1.20),
('Mug','Home',8.00),
('Lamp','Home',22.00),
('Headphones','Electronics',55.00);
INSERT INTO orders (customer_id, order_date, status) VALUES
(1,'2024-03-01','paid'),
(1,'2024-03-15','shipped'),
(2,'2024-03-16','paid'),
(3,'2024-03-20','cancelled'),
(2,'2024-04-02','shipped'),
(4,'2024-04-10','paid');
INSERT INTO order_items (order_id, product_id, quantity, unit_price) VALUES
(1,1,10,4.50),
(1,2,20,1.20),
(2,3,2,8.00),
(2,5,1,55.00),
(3,1,5,4.50),
(4,4,1,22.00),
(5,5,2,55.00),
(6,2,50,1.20);
A common pattern in analytics is to create a “fact” query that computes order revenue:
SELECT
o.order_id,
o.customer_id,
o.order_date,
o.status,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id, o.customer_id, o.order_date, o.status;
We’ll reuse this idea throughout.
Window Functions: The Core Analytics Toolkit
Window functions let you compute values across a set of rows related to the current row without collapsing results like GROUP BY does.
PARTITION BY vs GROUP BY
- GROUP BY reduces rows: one output row per group.
- PARTITION BY keeps rows: each row gets a computed value based on its partition.
Example: show each order and the customer’s total revenue (across all orders) on every row.
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped') -- exclude cancelled
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
order_id,
customer_id,
order_date,
order_revenue,
SUM(order_revenue) OVER (PARTITION BY customer_id) AS customer_lifetime_revenue
FROM order_revenue
ORDER BY customer_id, order_date;
This is a classic analytical layout: detailed rows plus “context” metrics.
Ranking: ROW_NUMBER, RANK, DENSE_RANK
Ranking functions answer questions like “What is the customer’s largest order?” or “Which products are top sellers?”
Differences:
- ROW_NUMBER() assigns unique sequential numbers (ties get different numbers).
- RANK() leaves gaps after ties (1, 1, 3).
- DENSE_RANK() leaves no gaps after ties (1, 1, 2).
Example: rank orders by revenue within each customer.
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
customer_id,
order_id,
order_date,
order_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC) AS rn,
RANK() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC) AS drnk
FROM order_revenue
ORDER BY customer_id, rn;
Use ROW_NUMBER() when you need “pick exactly one row.” Use RANK()/DENSE_RANK() when ties matter.
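For the "pick exactly one row" case, PostgreSQL also offers DISTINCT ON as a concise alternative to the ROW_NUMBER() pattern (the DISTINCT ON column list must match the leading ORDER BY expressions). A sketch, reusing the same order_revenue fact layer:

```sql
WITH order_revenue AS (
  SELECT o.order_id, o.customer_id, o.order_date,
         SUM(oi.quantity * oi.unit_price) AS order_revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.order_id
  WHERE o.status IN ('paid','shipped')
  GROUP BY o.order_id, o.customer_id, o.order_date
)
-- One row per customer: the largest order (latest date breaks ties)
SELECT DISTINCT ON (customer_id)
  customer_id, order_id, order_date, order_revenue
FROM order_revenue
ORDER BY customer_id, order_revenue DESC, order_date DESC;
```

DISTINCT ON is PostgreSQL-specific; the ROW_NUMBER() pattern is the portable choice.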
Top-N per Group
A frequent analytics request: “Top 2 orders per customer” or “Top 3 products per category.”
Top 2 orders per customer:
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
),
ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_revenue DESC, order_date DESC) AS rn
FROM order_revenue
)
SELECT
customer_id, order_id, order_date, order_revenue
FROM ranked
WHERE rn <= 2
ORDER BY customer_id, rn;
This pattern (CTE + window rank + filter) is the most portable and readable approach.
Running Totals and Moving Averages
Window frames let you define “which rows count” around the current row.
Running total (cumulative revenue per customer over time)
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
customer_id,
order_date,
order_id,
order_revenue,
SUM(order_revenue) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_revenue
FROM order_revenue
ORDER BY customer_id, order_date, order_id;
Key ideas:
- ORDER BY inside OVER (...) defines the sequence.
- ROWS BETWEEN ... defines the frame. For cumulative sums, use UNBOUNDED PRECEDING to CURRENT ROW.
3-order moving average (per customer)
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
)
SELECT
customer_id,
order_date,
order_id,
order_revenue,
AVG(order_revenue) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_last_3_orders
FROM order_revenue
ORDER BY customer_id, order_date, order_id;
Use ROWS for “last N rows.” Use RANGE for “last N units of time” (but be careful: RANGE semantics depend on the data type and can include ties in ways that surprise people).
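For illustration, a time-based trailing window might look like this (PostgreSQL 11+ supports RANGE frames with an offset when the ORDER BY column is a date, timestamp, or numeric type). A sketch, assuming the order_revenue CTE from above:

```sql
-- Trailing three calendar months by time, not by row count
SELECT
  customer_id,
  order_date,
  order_revenue,
  SUM(order_revenue) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '3 months' PRECEDING AND CURRENT ROW
  ) AS revenue_last_3_months
FROM order_revenue
ORDER BY customer_id, order_date;
```

Note that with RANGE, rows sharing the same order_date are peers and are always included together in the frame.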
LAG/LEAD for Period-over-Period Analysis
LAG() and LEAD() let you reference previous/next rows in a window ordering. This is essential for growth rates, churn signals, and anomaly detection.
Example: month-over-month revenue per region.
WITH monthly_region_revenue AS (
SELECT
c.region,
date_trunc('month', o.order_date)::date AS month_start,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY c.region, date_trunc('month', o.order_date)::date
)
SELECT
region,
month_start,
revenue,
LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) AS prev_month_revenue,
revenue - LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) AS delta,
CASE
WHEN LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) IS NULL THEN NULL
WHEN LAG(revenue) OVER (PARTITION BY region ORDER BY month_start) = 0 THEN NULL
ELSE (revenue / LAG(revenue) OVER (PARTITION BY region ORDER BY month_start)) - 1
END AS growth_rate
FROM monthly_region_revenue
ORDER BY region, month_start;
Notes:
- The CTE aggregates first (monthly totals), then the window function compares months.
- Growth rate needs divide-by-zero protection.
FIRST_VALUE/LAST_VALUE and Window Frames
FIRST_VALUE() and LAST_VALUE() are often misunderstood because the default frame matters.
Suppose you want, for each customer order, the first order date and the most recent order date.
WITH customer_orders AS (
SELECT DISTINCT
o.customer_id,
o.order_id,
o.order_date
FROM orders o
WHERE o.status IN ('paid','shipped')
)
SELECT
customer_id,
order_id,
order_date,
FIRST_VALUE(order_date) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_order_date,
LAST_VALUE(order_date) OVER (
PARTITION BY customer_id
ORDER BY order_date, order_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_order_date
FROM customer_orders
ORDER BY customer_id, order_date, order_id;
Why specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING?
- Without it, LAST_VALUE() returns the last value of the default frame, which (when ORDER BY is present) ends at the current row and its peers, not at the partition end.
Alternative: in PostgreSQL you can also use MAX(order_date) OVER (PARTITION BY customer_id) for the last date when you only need the maximum value, not the last value under a specific ordering with tie-breakers.
Percentiles and Distribution
Percentiles help you understand distributions (e.g., “median order value” or “95th percentile”).
PostgreSQL supports ordered-set aggregates like percentile_cont.
Median order revenue overall:
WITH order_revenue AS (
SELECT
o.order_id,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id
)
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY order_revenue) AS median_order_revenue,
percentile_cont(0.95) WITHIN GROUP (ORDER BY order_revenue) AS p95_order_revenue
FROM order_revenue;
Percentiles per region:
WITH order_revenue AS (
SELECT
o.order_id,
c.region,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, c.region
)
SELECT
region,
percentile_cont(0.5) WITHIN GROUP (ORDER BY order_revenue) AS median_order_revenue
FROM order_revenue
GROUP BY region
ORDER BY region;
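Beyond exact percentiles, window functions can bucket a distribution directly. For example, NTILE(4) assigns customers to revenue quartiles (a sketch built on the same fact-layer idea):

```sql
WITH customer_revenue AS (
  SELECT o.customer_id,
         SUM(oi.quantity * oi.unit_price) AS total_revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.order_id
  WHERE o.status IN ('paid','shipped')
  GROUP BY o.customer_id
)
SELECT
  customer_id,
  total_revenue,
  NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile  -- 1 = top quartile
FROM customer_revenue
ORDER BY revenue_quartile, total_revenue DESC;
```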
CTEs: Readable, Modular SQL
CTEs (WITH ...) let you build queries in stages. This improves readability and makes debugging easier.
Basic CTEs and Multi-Stage Pipelines
A common analytics task: compute customer KPIs (orders, revenue, average order value, first/last order date).
WITH order_revenue AS (
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date
),
customer_kpis AS (
SELECT
customer_id,
COUNT(*) AS orders_count,
SUM(order_revenue) AS total_revenue,
AVG(order_revenue) AS avg_order_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date
FROM order_revenue
GROUP BY customer_id
)
SELECT
c.customer_id,
c.customer_name,
c.region,
k.orders_count,
k.total_revenue,
k.avg_order_value,
k.first_order_date,
k.last_order_date
FROM customers c
LEFT JOIN customer_kpis k ON k.customer_id = c.customer_id
ORDER BY k.total_revenue DESC NULLS LAST;
Why this structure works well:
- order_revenue is a reusable “fact layer.”
- customer_kpis is the aggregated “metric layer.”
- The final SELECT joins dimensions (customers) to metrics.
CTEs vs Subqueries vs Views
CTEs are best when:
- You want a clear multi-step pipeline.
- You need to reuse a derived dataset multiple times in one query.
- You want to isolate logic for easier testing.
Subqueries can be best when:
- The derived dataset is used once and the query is simple.
- You want the optimizer to inline aggressively (this depends on the database).
Views are best when:
- You want a stable interface for BI tools.
- You reuse the same logic across many queries.
- You want permission boundaries (grant access to a view, not raw tables).
Important PostgreSQL note: Historically, CTEs were optimization fences (materialized by default). In modern PostgreSQL versions, many CTEs can be inlined, but you can still control behavior:
WITH order_revenue AS MATERIALIZED (
SELECT ...
)
SELECT ...
or
WITH order_revenue AS NOT MATERIALIZED (
SELECT ...
)
SELECT ...
Use MATERIALIZED when computing the CTE is expensive but reused multiple times and you want to avoid recomputation. Use NOT MATERIALIZED when you want the optimizer to push filters down into the CTE.
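A concrete sketch of when NOT MATERIALIZED helps: with an inlined CTE, the outer filter on customer_id can be pushed down so only that customer's orders are scanned and aggregated, instead of computing revenue for every order first:

```sql
WITH order_revenue AS NOT MATERIALIZED (
  SELECT o.order_id, o.customer_id,
         SUM(oi.quantity * oi.unit_price) AS order_revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.order_id
  WHERE o.status IN ('paid','shipped')
  GROUP BY o.order_id, o.customer_id
)
SELECT *
FROM order_revenue
WHERE customer_id = 1;  -- with inlining, this predicate can reach the scan of orders
```

Verify with EXPLAIN that the filter actually appears on the scan node rather than above the aggregate.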
Recursive CTEs: Hierarchies and Sequences
Recursive CTEs solve problems like:
- Organizational charts (manager → employee)
- Category trees
- Generating date series (though PostgreSQL also has generate_series)
Example: Product category hierarchy (toy example)
Create a category table:
DROP TABLE IF EXISTS categories;
CREATE TABLE categories (
category_id BIGSERIAL PRIMARY KEY,
category_name TEXT NOT NULL,
parent_category_id BIGINT REFERENCES categories(category_id)
);
INSERT INTO categories (category_name, parent_category_id) VALUES
('All', NULL),
('Office', 1),
('Home', 1),
('Electronics', 1),
('Writing', 2),
('Lighting', 3);
Now fetch the full tree with depth:
WITH RECURSIVE cat_tree AS (
SELECT
category_id,
category_name,
parent_category_id,
0 AS depth,
category_name::text AS path
FROM categories
WHERE parent_category_id IS NULL
UNION ALL
SELECT
c.category_id,
c.category_name,
c.parent_category_id,
t.depth + 1 AS depth,
(t.path || ' > ' || c.category_name) AS path
FROM categories c
JOIN cat_tree t ON t.category_id = c.parent_category_id
)
SELECT
category_id,
category_name,
parent_category_id,
depth,
path
FROM cat_tree
ORDER BY path;
Key ideas:
- The “anchor” query selects roots.
- The recursive member joins children to parents.
- Building a path string is a helpful technique for ordering and debugging.
Example: Generate a monthly calendar and left join revenue
This is useful when you need months with zero revenue to still appear.
WITH RECURSIVE months AS (
SELECT date '2024-01-01' AS month_start
UNION ALL
SELECT (month_start + interval '1 month')::date
FROM months
WHERE month_start < date '2024-06-01'
),
monthly_revenue AS (
SELECT
date_trunc('month', o.order_date)::date AS month_start,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY date_trunc('month', o.order_date)::date
)
SELECT
m.month_start,
COALESCE(r.revenue, 0) AS revenue
FROM months m
LEFT JOIN monthly_revenue r USING (month_start)
ORDER BY m.month_start;
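The recursive months CTE is portable across databases, but in PostgreSQL generate_series produces the same month spine with less ceremony:

```sql
-- Equivalent month spine using generate_series
SELECT gs::date AS month_start
FROM generate_series(date '2024-01-01', date '2024-06-01', interval '1 month') AS gs;
```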
Performance Tuning for Analytical Queries
Analytics queries often scan lots of rows, join large tables, and compute heavy aggregates. Performance tuning is about:
- Reading fewer rows
- Joining efficiently
- Aggregating efficiently
- Avoiding unnecessary sorts
- Using the optimizer effectively
Execution Plans: EXPLAIN and EXPLAIN ANALYZE
Start with the plan. In PostgreSQL:
EXPLAIN
SELECT
o.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
AND o.order_date >= date '2024-03-01'
AND o.order_date < date '2024-05-01'
GROUP BY o.customer_id;
To measure real timing (in a safe environment):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
o.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
AND o.order_date >= date '2024-03-01'
AND o.order_date < date '2024-05-01'
GROUP BY o.customer_id;
What to look for:
- Seq Scan vs Index Scan (sequential scans can be fine for large portions of a table, but not always)
- Join type: Nested Loop, Hash Join, Merge Join
- Sort nodes (expensive sorts can dominate runtime)
- Buffers read/hit (I/O vs cache)
Indexing Strategy for Analytics
Indexes help most when:
- You filter on a column (WHERE)
- You join on a column (JOIN)
- You order by a column (ORDER BY) and can use an index-only scan or avoid sorting
Composite indexes
If you frequently filter by date range and status:
CREATE INDEX idx_orders_status_date ON orders(status, order_date);
If you frequently filter by customer and date:
CREATE INDEX idx_orders_customer_status_date ON orders(customer_id, status, order_date);
Be mindful: too many indexes slow down writes. Pick indexes based on actual workload.
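One way to ground that decision in evidence is PostgreSQL's statistics views. This sketch lists indexes that have never been scanned since statistics were last reset (a never-used index may be a write-amplification cost with no read benefit):

```sql
-- Candidate indexes to review for removal
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;
```

Interpret with care: an index may still back a constraint or serve rare but critical queries.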
Covering indexes (PostgreSQL INCLUDE)
If you filter by order_date but also need customer_id without hitting the table:
CREATE INDEX idx_orders_date_include_customer
ON orders(order_date)
INCLUDE (customer_id, status);
This can enable index-only scans if visibility map conditions are met.
Partial indexes
If most queries only care about paid/shipped orders, a partial index can be extremely effective:
CREATE INDEX idx_orders_active_date
ON orders(order_date, customer_id)
WHERE status IN ('paid','shipped');
This index is smaller and more selective than indexing all statuses.
Sargability and Predicate Pushdown
A predicate is sargable if the database can use an index efficiently.
Bad (non-sargable) pattern:
SELECT *
FROM orders
WHERE date_trunc('month', order_date) = date '2024-03-01';
This applies a function to the column, often preventing index use.
Better:
SELECT *
FROM orders
WHERE order_date >= date '2024-03-01'
AND order_date < date '2024-04-01';
Similarly, avoid wrapping indexed columns in CAST, COALESCE, or arithmetic in the WHERE clause when you can rewrite it.
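A before/after sketch of the same principle with a cast:

```sql
-- Non-sargable: casting the column to text prevents index use on order_date
SELECT order_id FROM orders
WHERE order_date::text LIKE '2024-03%';

-- Sargable rewrite: compare the column directly against a range
SELECT order_id FROM orders
WHERE order_date >= date '2024-03-01'
  AND order_date < date '2024-04-01';
```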
Avoiding Common Anti-Patterns
1) SELECT * in analytical joins
SELECT * can force reading wide rows and unnecessary columns. Prefer selecting only what you need.
SELECT o.order_id, o.order_date, oi.product_id, oi.quantity
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id;
2) Joining before filtering
Filter early to reduce join input sizes.
Less optimal:
SELECT ...
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.order_date >= date '2024-03-01';
Often better (especially when it helps the optimizer and reduces rows early):
WITH filtered_orders AS (
SELECT order_id, customer_id, order_date
FROM orders
WHERE order_date >= date '2024-03-01'
AND order_date < date '2024-04-01'
AND status IN ('paid','shipped')
)
SELECT
fo.customer_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM filtered_orders fo
JOIN order_items oi ON oi.order_id = fo.order_id
GROUP BY fo.customer_id;
This also improves readability: “first pick relevant orders, then compute metrics.”
3) DISTINCT as a band-aid
If you need DISTINCT to fix duplicates after joins, it often indicates a modeling or join-key issue. Diagnose why duplicates appear rather than masking them.
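A quick diagnostic before reaching for DISTINCT: verify that the join key you assume is unique on the "one" side really is. A sketch:

```sql
-- Expect zero rows; any output explains the fan-out after joining on customer_id
SELECT customer_id, COUNT(*) AS copies
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```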
Partitioning and Incremental Aggregation
For large datasets (millions to billions of rows), consider:
Table partitioning by date
PostgreSQL declarative partitioning example:
-- Example structure; do not run blindly in production
CREATE TABLE orders_big (
order_id BIGSERIAL,
customer_id BIGINT NOT NULL,
order_date DATE NOT NULL,
status TEXT NOT NULL,
PRIMARY KEY (order_id, order_date)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_big_2024_03 PARTITION OF orders_big
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
CREATE TABLE orders_big_2024_04 PARTITION OF orders_big
FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');
Benefits:
- Queries with date filters can prune partitions (scan fewer chunks).
- Maintenance (vacuum, archiving) can be easier.
Trade-offs:
- More complex DDL and index management.
- Poor partition key choice can hurt performance.
Incremental aggregates
Instead of recomputing monthly revenue from raw fact tables every time, store aggregated results:
CREATE TABLE monthly_customer_revenue (
month_start DATE NOT NULL,
customer_id BIGINT NOT NULL,
revenue NUMERIC(12,2) NOT NULL,
PRIMARY KEY (month_start, customer_id)
);
Then update incrementally (daily job) rather than scanning all history.
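A daily job might recompute only the current month and upsert the result with PostgreSQL's INSERT ... ON CONFLICT (a sketch; historical months are assumed immutable):

```sql
-- Recompute the current month-to-date totals and overwrite existing rows
INSERT INTO monthly_customer_revenue (month_start, customer_id, revenue)
SELECT
  date_trunc('month', o.order_date)::date AS month_start,
  o.customer_id,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
  AND o.order_date >= date_trunc('month', current_date)::date
GROUP BY 1, 2
ON CONFLICT (month_start, customer_id)
DO UPDATE SET revenue = EXCLUDED.revenue;
```

Because the job recomputes the whole month each run, it is idempotent: rerunning it after a failure is safe.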
Materialized Views and Precomputation
Materialized views store query results physically.
CREATE MATERIALIZED VIEW mv_order_revenue AS
SELECT
o.order_id,
o.customer_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) AS order_revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.status IN ('paid','shipped')
GROUP BY o.order_id, o.customer_id, o.order_date;
-- Refresh when data changes (schedule it)
REFRESH MATERIALIZED VIEW mv_order_revenue;
You can index a materialized view:
CREATE INDEX idx_mv_order_revenue_customer_date
ON mv_order_revenue(customer_id, order_date);
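If dashboards query the view while it refreshes, REFRESH MATERIALIZED VIEW CONCURRENTLY avoids locking out readers; it requires a unique index on the view:

```sql
CREATE UNIQUE INDEX idx_mv_order_revenue_order
  ON mv_order_revenue(order_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_order_revenue;
```

The trade-off: concurrent refresh is slower than a plain refresh because it diffs the new result against the old contents.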
When to use:
- Heavy transformations used by many dashboards
- Data changes in batches (nightly ETL), not constantly
- You need predictable query latency
Putting It Together: A Full Analytical Query
Let’s build a query that answers:
For each region and month, show revenue, previous month revenue, growth rate, and the top product by revenue in that region-month.
This combines:
- CTE pipeline
- Window functions (LAG, ranking)
- Careful grouping and joining
WITH filtered_orders AS (
SELECT
o.order_id,
o.customer_id,
o.order_date
FROM orders o
WHERE o.status IN ('paid','shipped')
AND o.order_date >= date '2024-03-01'
AND o.order_date < date '2024-06-01'
),
line_facts AS (
SELECT
fo.order_id,
fo.customer_id,
fo.order_date,
c.region,
oi.product_id,
(oi.quantity * oi.unit_price) AS line_revenue
FROM filtered_orders fo
JOIN customers c ON c.customer_id = fo.customer_id
JOIN order_items oi ON oi.order_id = fo.order_id
),
region_month_revenue AS (
SELECT
region,
date_trunc('month', order_date)::date AS month_start,
SUM(line_revenue) AS revenue
FROM line_facts
GROUP BY region, date_trunc('month', order_date)::date
),
region_month_product AS (
SELECT
region,
date_trunc('month', order_date)::date AS month_start,
product_id,
SUM(line_revenue) AS product_revenue
FROM line_facts
GROUP BY region, date_trunc('month', order_date)::date, product_id
),
ranked_products AS (
SELECT
rmp.*,
ROW_NUMBER() OVER (
PARTITION BY region, month_start
ORDER BY product_revenue DESC, product_id
) AS rn
FROM region_month_product rmp
)
SELECT
rmr.region,
rmr.month_start,
rmr.revenue,
LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start) AS prev_month_revenue,
CASE
WHEN LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start) IS NULL THEN NULL
WHEN LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start) = 0 THEN NULL
ELSE (rmr.revenue / LAG(rmr.revenue) OVER (PARTITION BY rmr.region ORDER BY rmr.month_start)) - 1
END AS growth_rate,
p.product_name AS top_product,
rp.product_revenue AS top_product_revenue
FROM region_month_revenue rmr
LEFT JOIN ranked_products rp
ON rp.region = rmr.region
AND rp.month_start = rmr.month_start
AND rp.rn = 1
LEFT JOIN products p ON p.product_id = rp.product_id
ORDER BY rmr.region, rmr.month_start;
Why this performs reasonably well (and how to improve it at scale):
- We filter orders early (filtered_orders) to reduce join volume.
- We compute line-level facts once (line_facts) and reuse them.
- We aggregate at the needed grains: region-month and region-month-product.
- At scale, consider:
  - Indexes on orders(status, order_date) or partial indexes for active statuses
  - Partitioning orders by date
  - Materializing line_facts or order_revenue as a materialized view if reused widely
Practical Checklist
When writing advanced analytical SQL:
- Define the grain: What is one row in your result (order, customer-month, region-day)?
- Filter early: Reduce rows before expensive joins/aggregations.
- Use window functions for “context metrics” (running totals, ranks, comparisons).
- Use CTEs to build a pipeline of transformations you can reason about.
- Inspect plans with EXPLAIN (ANALYZE, BUFFERS) and tune based on evidence.
- Index for your predicates and joins, not for every column.
- Avoid non-sargable filters and unnecessary sorts.
- Precompute (materialized views / aggregate tables) when latency matters and data changes in batches.
If you want, share:
- your database engine (PostgreSQL/MySQL/SQL Server/etc.),
- approximate table sizes,
- and one slow query,
and I can rewrite it using the patterns above and explain the plan-level improvements.