While basic SQL can handle simple queries, data analytics demands advanced techniques to extract meaningful insights from complex datasets. This comprehensive guide covers the sophisticated SQL skills that separate data professionals from casual users.
Advanced Skills You'll Master
- Window functions for sophisticated analytics
- Common Table Expressions (CTEs) for complex logic
- Advanced JOIN techniques and performance optimization
- Date/time analytics and time series operations
- Statistical functions and data modeling
- Query optimization and performance tuning
🪟 Window Functions: The Game Changer
Why Window Functions Matter
Window functions perform calculations across a set of related rows without collapsing the result set. They enable complex analytics like running totals, rankings, and comparisons that would be difficult or impossible with traditional GROUP BY operations.
Essential Window Function Patterns
🏆 Ranking and Row Number Functions
-- ROW_NUMBER: Assigns unique sequential numbers
SELECT
employee_name,
salary,
department,
ROW_NUMBER() OVER (ORDER BY salary DESC) as overall_rank,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;
-- RANK: Handles ties by giving same rank, skips next numbers
SELECT
product_name,
revenue,
RANK() OVER (ORDER BY revenue DESC) as revenue_rank
FROM products;
-- DENSE_RANK: Handles ties but doesn't skip numbers
SELECT
student_name,
score,
DENSE_RANK() OVER (ORDER BY score DESC) as rank
FROM exam_results;
-- NTILE: Divides rows into N groups
SELECT
customer_name,
total_purchases,
NTILE(4) OVER (ORDER BY total_purchases DESC) as quartile
FROM customers;
- ROW_NUMBER(): Unique sequential numbering, useful for pagination
- RANK(): Handles ties but skips subsequent ranks
- DENSE_RANK(): Handles ties without skipping ranks
- NTILE(n): Divides data into n equal groups or buckets
📊 Aggregate Window Functions
-- Running totals and cumulative calculations
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date) as running_total,
AVG(daily_revenue) OVER (ORDER BY order_date ROWS 6 PRECEDING) as week_avg
FROM daily_sales;
-- Moving averages with different window frames
SELECT
date,
stock_price,
AVG(stock_price) OVER (
ORDER BY date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as moving_avg_30_day,
AVG(stock_price) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING
) as centered_avg_14_day
FROM stock_prices;
-- Percentage of total calculations
SELECT
product_category,
sales_amount,
sales_amount / SUM(sales_amount) OVER () * 100 as pct_of_total,
sales_amount / SUM(sales_amount) OVER (PARTITION BY region) * 100 as pct_of_region
FROM sales_data;
- Running totals: Cumulative sums for progressive analysis
- Moving averages: Smooth out trends and reduce noise
- Window frames: Control which rows to include in calculations
- Percentage calculations: Compare parts to whole efficiently
🔄 Lead and Lag Functions
-- LAG: Access previous row values
SELECT
date,
revenue,
LAG(revenue, 1) OVER (ORDER BY date) as prev_day_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY date) as daily_change,
(revenue - LAG(revenue, 1) OVER (ORDER BY date)) /
LAG(revenue, 1) OVER (ORDER BY date) * 100 as pct_change
FROM daily_revenue;
-- LEAD: Access next row values
SELECT
customer_id,
purchase_date,
LEAD(purchase_date, 1) OVER (
PARTITION BY customer_id
ORDER BY purchase_date
) as next_purchase,
DATEDIFF(
LEAD(purchase_date, 1) OVER (
PARTITION BY customer_id
ORDER BY purchase_date
),
purchase_date
) as days_to_next_purchase
FROM customer_purchases;
-- FIRST_VALUE and LAST_VALUE
SELECT
employee_id,
salary,
hire_date,
FIRST_VALUE(salary) OVER (
PARTITION BY department
ORDER BY hire_date
) as first_hire_salary,
LAST_VALUE(salary) OVER (
PARTITION BY department
ORDER BY hire_date
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as latest_hire_salary
FROM employees;
- LAG(): Compare with previous periods for trend analysis
- LEAD(): Look ahead to future values for forecasting
- FIRST_VALUE/LAST_VALUE: Compare with first/last in window
- Time series analysis: Perfect for sequential data analysis
🔗 Common Table Expressions (CTEs)
Organizing Complex Logic
CTEs (Common Table Expressions) allow you to create temporary named result sets that exist only during query execution. They make complex queries more readable, maintainable, and enable recursive operations.
CTE Patterns and Applications
📝 Basic CTEs for Query Organization
-- Single CTE for data preparation
WITH clean_sales AS (
SELECT
product_id,
customer_id,
order_date,
quantity,
unit_price,
quantity * unit_price as total_amount
FROM raw_sales
WHERE order_date >= '2024-01-01'
AND quantity > 0
AND unit_price > 0
)
SELECT
product_id,
COUNT(*) as order_count,
SUM(quantity) as total_quantity,
SUM(total_amount) as total_revenue,
AVG(total_amount) as avg_order_value
FROM clean_sales
GROUP BY product_id
ORDER BY total_revenue DESC;
-- Multiple CTEs for step-by-step analysis
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(amount) as monthly_revenue
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY DATE_TRUNC('month', order_date)
),
sales_with_growth AS (
SELECT
month,
monthly_revenue,
LAG(monthly_revenue) OVER (ORDER BY month) as prev_month,
(monthly_revenue - LAG(monthly_revenue) OVER (ORDER BY month)) /
LAG(monthly_revenue) OVER (ORDER BY month) * 100 as growth_rate
FROM monthly_sales
)
SELECT
month,
monthly_revenue,
growth_rate,
CASE
WHEN growth_rate > 10 THEN 'High Growth'
WHEN growth_rate > 0 THEN 'Positive Growth'
WHEN growth_rate < -10 THEN 'Declining'
ELSE 'Stable'
END as growth_category
FROM sales_with_growth
ORDER BY month;
- Data cleaning: Prepare and filter data in logical steps
- Multiple CTEs: Chain operations for complex analysis
- Readability: Break complex logic into understandable parts
- Reusability: Reference the same CTE multiple times
🔄 Recursive CTEs
-- Organizational hierarchy traversal
WITH RECURSIVE employee_hierarchy AS (
-- Anchor: Start with top-level managers
SELECT
employee_id,
employee_name,
manager_id,
1 as level,
employee_name as path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: Find subordinates
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
eh.level + 1,
eh.path || ' -> ' || e.employee_name
FROM employees e
INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
WHERE eh.level < 10 -- Prevent infinite recursion
)
SELECT
level,
employee_name,
path,
COUNT(*) OVER (PARTITION BY level) as employees_at_level
FROM employee_hierarchy
ORDER BY level, employee_name;
-- Date series generation
WITH RECURSIVE date_series AS (
SELECT DATE('2024-01-01') as date_value
UNION ALL
SELECT date_value + INTERVAL 1 DAY
FROM date_series
WHERE date_value < DATE('2024-12-31')
),
daily_sales_complete AS (
SELECT
ds.date_value,
COALESCE(s.daily_total, 0) as daily_sales
FROM date_series ds
LEFT JOIN (
SELECT
DATE(order_date) as sale_date,
SUM(amount) as daily_total
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY DATE(order_date)
) s ON ds.date_value = s.sale_date
)
SELECT
date_value,
daily_sales,
AVG(daily_sales) OVER (
ORDER BY date_value
ROWS 6 PRECEDING
) as seven_day_avg
FROM daily_sales_complete
ORDER BY date_value;
- Hierarchical data: Traverse organizational or category trees
- Date series: Generate continuous date ranges for complete datasets
- Path finding: Navigate relationships and dependencies
- Recursive logic: Handle self-referential data structures
🔗 Advanced JOIN Techniques
Beyond Basic JOINs
While INNER and LEFT JOINs handle most scenarios, advanced analytics often requires more sophisticated joining techniques. These patterns help solve complex data integration challenges efficiently.
🪞 Self JOINs for Comparative Analysis
-- Finding employee pairs in same department
SELECT
e1.employee_name as employee1,
e2.employee_name as employee2,
e1.department,
ABS(e1.salary - e2.salary) as salary_difference
FROM employees e1
INNER JOIN employees e2
ON e1.department = e2.department
AND e1.employee_id < e2.employee_id -- Avoid duplicates and self-matches
WHERE ABS(e1.salary - e2.salary) < 5000;
-- Comparing current vs previous values
SELECT
s1.date as current_date,
s1.revenue as current_revenue,
s2.date as previous_date,
s2.revenue as previous_revenue,
s1.revenue - s2.revenue as revenue_change,
CASE
WHEN s1.revenue > s2.revenue THEN 'Increase'
WHEN s1.revenue < s2.revenue THEN 'Decrease'
ELSE 'No Change'
END as trend
FROM sales_summary s1
LEFT JOIN sales_summary s2
ON s2.date = s1.date - INTERVAL 1 DAY
ORDER BY s1.date;
⚙️ Complex JOIN Conditions
-- Range-based joins for time periods
SELECT
e.employee_name,
e.start_date,
s.salary_amount,
s.effective_date
FROM employees e
INNER JOIN salary_history s
ON e.employee_id = s.employee_id
AND s.effective_date <= e.start_date + INTERVAL 90 DAY -- Salary within 90 days
AND (s.end_date IS NULL OR s.end_date >= e.start_date);
-- Conditional joins with multiple criteria
SELECT
c.customer_name,
c.registration_date,
p.promotion_name,
p.discount_percentage
FROM customers c
LEFT JOIN promotions p
ON (c.customer_type = 'PREMIUM' AND p.customer_tier = 'PREMIUM')
OR (c.customer_type = 'REGULAR' AND p.customer_tier IN ('REGULAR', 'ALL'))
AND p.start_date <= c.registration_date
AND p.end_date >= c.registration_date;
-- Fuzzy matching with similarity functions
SELECT
c1.company_name as source_company,
c2.company_name as matched_company,
SIMILARITY(c1.company_name, c2.company_name) as similarity_score
FROM companies c1
CROSS JOIN companies c2
WHERE c1.company_id != c2.company_id
AND SIMILARITY(c1.company_name, c2.company_name) > 0.8
ORDER BY similarity_score DESC;
📅 Advanced Date/Time Analytics
Time-Based Analysis Mastery
Time series analysis is crucial for business intelligence. Advanced date/time functions enable sophisticated temporal analysis, cohort studies, and seasonal pattern detection.
📊 Advanced Date Functions
-- Cohort analysis: Customer retention by signup month
WITH customer_cohorts AS (
SELECT
customer_id,
DATE_TRUNC('month', registration_date) as cohort_month,
registration_date
FROM customers
),
customer_orders AS (
SELECT
c.customer_id,
c.cohort_month,
DATE_TRUNC('month', o.order_date) as order_month,
EXTRACT(EPOCH FROM (DATE_TRUNC('month', o.order_date) - c.cohort_month)) / 2628000 as month_number
FROM customer_cohorts c
INNER JOIN orders o ON c.customer_id = o.customer_id
)
SELECT
cohort_month,
month_number,
COUNT(DISTINCT customer_id) as active_customers,
COUNT(DISTINCT customer_id) * 100.0 /
FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
PARTITION BY cohort_month
ORDER BY month_number
) as retention_rate
FROM customer_orders
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number;
-- Business days calculation
WITH business_days AS (
SELECT
order_id,
order_date,
delivery_date,
CASE
WHEN EXTRACT(DOW FROM generate_series) NOT IN (0, 6)
THEN 1 ELSE 0
END as is_business_day
FROM orders,
generate_series(order_date, delivery_date, '1 day'::interval) as generate_series
)
SELECT
order_id,
order_date,
delivery_date,
SUM(is_business_day) as business_days_to_delivery
FROM business_days
GROUP BY order_id, order_date, delivery_date;
🌟 Seasonal and Cyclical Analysis
-- Seasonal patterns by day of week and month
SELECT
EXTRACT(MONTH FROM order_date) as month,
TO_CHAR(order_date, 'Day') as day_of_week,
COUNT(*) as order_count,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value,
-- Compare to overall averages
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as pct_of_total_orders,
SUM(amount) / SUM(SUM(amount)) OVER () * 100 as pct_of_total_revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '2 years'
GROUP BY
EXTRACT(MONTH FROM order_date),
EXTRACT(DOW FROM order_date),
TO_CHAR(order_date, 'Day')
ORDER BY month, EXTRACT(DOW FROM order_date);
-- Year-over-year comparison with seasonal adjustment
WITH monthly_sales AS (
SELECT
EXTRACT(YEAR FROM order_date) as year,
EXTRACT(MONTH FROM order_date) as month,
SUM(amount) as monthly_revenue
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
)
SELECT
year,
month,
monthly_revenue,
LAG(monthly_revenue, 12) OVER (ORDER BY year, month) as same_month_prev_year,
(monthly_revenue - LAG(monthly_revenue, 12) OVER (ORDER BY year, month)) /
LAG(monthly_revenue, 12) OVER (ORDER BY year, month) * 100 as yoy_growth_rate,
-- 12-month rolling average
AVG(monthly_revenue) OVER (
ORDER BY year, month
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
) as rolling_12_month_avg
FROM monthly_sales
ORDER BY year, month;
⚡ Query Performance Optimization
Performance Optimization Strategies
🚀 Indexing Strategies
-- Composite indexes for multi-column filters
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date);
-- Partial indexes for filtered queries
CREATE INDEX idx_active_customers
ON customers (last_purchase_date)
WHERE status = 'ACTIVE';
-- Expression indexes for computed columns
CREATE INDEX idx_customer_full_name
ON customers (LOWER(first_name || ' ' || last_name));
- Create composite indexes for multi-column queries
- Use partial indexes for frequently filtered data
- Consider covering indexes to avoid table lookups
⚙️ Query Structure
-- Use EXISTS instead of IN for subqueries
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
-- LIMIT with ORDER BY requires index
SELECT * FROM large_table
WHERE category = 'Electronics'
ORDER BY created_date DESC
LIMIT 100;
- Use EXISTS instead of IN for better performance
- Add ORDER BY columns to indexes
- Filter early, aggregate late
Common Performance Pitfalls
- SELECT *: Always specify only needed columns
- Non-sargable conditions: Avoid functions in WHERE clauses
- Cartesian products: Always include proper JOIN conditions
- Subquery optimization: Consider CTEs or window functions instead
- Data type mismatches: Ensure proper type casting
🏢 Complete Real-World Analysis
E-commerce Performance Dashboard Query
WITH base_metrics AS (
-- Calculate core metrics per customer per month
SELECT
c.customer_id,
c.customer_name,
c.registration_date,
DATE_TRUNC('month', o.order_date) as order_month,
COUNT(DISTINCT o.order_id) as monthly_orders,
SUM(oi.quantity * oi.unit_price) as monthly_revenue,
COUNT(DISTINCT oi.product_id) as unique_products,
AVG(oi.quantity * oi.unit_price) as avg_order_value
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY c.customer_id, c.customer_name, c.registration_date,
DATE_TRUNC('month', o.order_date)
),
customer_lifecycle AS (
-- Add customer lifecycle metrics
SELECT
*,
-- Customer age in months
EXTRACT(EPOCH FROM (order_month - DATE_TRUNC('month', registration_date))) / 2628000 as customer_age_months,
-- Running totals per customer
SUM(monthly_revenue) OVER (
PARTITION BY customer_id
ORDER BY order_month
) as cumulative_revenue,
-- Customer velocity (orders per month since registration)
SUM(monthly_orders) OVER (
PARTITION BY customer_id
ORDER BY order_month
) / NULLIF(
EXTRACT(EPOCH FROM (order_month - DATE_TRUNC('month', registration_date))) / 2628000 + 1,
0
) as orders_per_month_avg,
-- Time since last order
LAG(order_month) OVER (
PARTITION BY customer_id
ORDER BY order_month
) as prev_order_month
FROM base_metrics
),
customer_segments AS (
-- Segment customers based on behavior
SELECT
*,
CASE
WHEN cumulative_revenue >= 1000 AND orders_per_month_avg >= 2 THEN 'VIP'
WHEN cumulative_revenue >= 500 OR orders_per_month_avg >= 1 THEN 'High Value'
WHEN customer_age_months <= 3 THEN 'New Customer'
WHEN EXTRACT(EPOCH FROM (CURRENT_DATE - order_month)) / 2628000 > 3 THEN 'At Risk'
ELSE 'Regular'
END as customer_segment,
-- RFM-like scoring
NTILE(5) OVER (ORDER BY monthly_revenue DESC) as revenue_quintile,
NTILE(5) OVER (ORDER BY monthly_orders DESC) as frequency_quintile,
NTILE(5) OVER (ORDER BY order_month DESC) as recency_quintile
FROM customer_lifecycle
),
final_dashboard AS (
SELECT
order_month,
customer_segment,
-- Count metrics
COUNT(DISTINCT customer_id) as active_customers,
SUM(monthly_orders) as total_orders,
-- Revenue metrics
SUM(monthly_revenue) as total_revenue,
AVG(monthly_revenue) as avg_revenue_per_customer,
SUM(monthly_revenue) / SUM(monthly_orders) as avg_order_value,
-- Growth metrics
SUM(monthly_revenue) - LAG(SUM(monthly_revenue)) OVER (
PARTITION BY customer_segment
ORDER BY order_month
) as revenue_growth,
(SUM(monthly_revenue) - LAG(SUM(monthly_revenue)) OVER (
PARTITION BY customer_segment
ORDER BY order_month
)) / NULLIF(LAG(SUM(monthly_revenue)) OVER (
PARTITION BY customer_segment
ORDER BY order_month
), 0) * 100 as revenue_growth_pct,
-- Customer health metrics
AVG(orders_per_month_avg) as avg_purchase_frequency,
COUNT(CASE WHEN customer_age_months <= 1 THEN 1 END) as new_customers,
COUNT(CASE WHEN EXTRACT(EPOCH FROM (CURRENT_DATE - order_month)) / 2628000 > 2 THEN 1 END) as at_risk_customers
FROM customer_segments
GROUP BY order_month, customer_segment
)
SELECT
order_month,
customer_segment,
active_customers,
total_orders,
total_revenue,
avg_revenue_per_customer,
avg_order_value,
revenue_growth_pct,
avg_purchase_frequency,
new_customers,
at_risk_customers,
-- Additional KPIs
total_revenue / SUM(total_revenue) OVER (PARTITION BY order_month) * 100 as segment_revenue_share,
active_customers / SUM(active_customers) OVER (PARTITION BY order_month) * 100 as segment_customer_share
FROM final_dashboard
ORDER BY order_month DESC, total_revenue DESC;
What This Query Does
Creates a comprehensive customer dashboard with segmentation, lifecycle metrics, and growth analysis
Advanced Techniques Used
CTEs, window functions, NTILE for segmentation, LAG for growth calculations
Business Value
Identifies customer segments, tracks health metrics, and enables data-driven retention strategies
🎯 Master Advanced SQL Techniques
These advanced SQL techniques will transform your data analysis capabilities. Practice with real datasets to build expertise and confidence in complex analytical scenarios.
Practice Regularly
Work with different datasets and scenarios
Build Dashboards
Create comprehensive analytical queries
Optimize Performance
Focus on query efficiency and scalability