SQL for Data Analytics: Advanced Techniques

While basic SQL can handle simple queries, data analytics demands advanced techniques to extract meaningful insights from complex datasets. This comprehensive guide covers the sophisticated SQL skills that separate data professionals from casual users.

Advanced Skills You'll Master

Window functions for sophisticated analytics
Common Table Expressions (CTEs) for complex logic
Advanced JOIN techniques and performance optimization
Date/time analytics and time series operations
Statistical functions and data modeling
Query optimization and performance tuning

🪟 Window Functions: The Game Changer

Why Window Functions Matter

Window functions perform calculations across a set of related rows without collapsing the result set. They enable complex analytics like running totals, rankings, and comparisons that would be difficult or impossible with traditional GROUP BY operations.

Key Advantage: Preserve row-level detail while performing aggregate calculations

Essential Window Function Patterns

🏆 Ranking and Row Number Functions

-- ROW_NUMBER: Assigns unique sequential numbers
SELECT 
    employee_name,
    salary,
    department,
    ROW_NUMBER() OVER (ORDER BY salary DESC) as overall_rank,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;

-- RANK: Handles ties by giving same rank, skips next numbers
SELECT 
    product_name,
    revenue,
    RANK() OVER (ORDER BY revenue DESC) as revenue_rank
FROM products;

-- DENSE_RANK: Handles ties but doesn't skip numbers
SELECT 
    student_name,
    score,
    DENSE_RANK() OVER (ORDER BY score DESC) as rank
FROM exam_results;

-- NTILE: Divides rows into N groups
SELECT 
    customer_name,
    total_purchases,
    NTILE(4) OVER (ORDER BY total_purchases DESC) as quartile
FROM customers;

ROW_NUMBER(): Unique sequential numbering, useful for pagination
RANK(): Handles ties but skips subsequent ranks
DENSE_RANK(): Handles ties without skipping ranks
NTILE(n): Divides data into n equal groups or buckets

📊 Aggregate Window Functions

-- Running totals and cumulative calculations
SELECT 
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date) as running_total,
    AVG(daily_revenue) OVER (ORDER BY order_date ROWS 6 PRECEDING) as week_avg
FROM daily_sales;

-- Moving averages with different window frames
SELECT 
    date,
    stock_price,
    AVG(stock_price) OVER (
        ORDER BY date 
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) as moving_avg_30_day,
    AVG(stock_price) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING
    ) as centered_avg_14_day
FROM stock_prices;

-- Percentage of total calculations
SELECT 
    product_category,
    sales_amount,
    sales_amount / SUM(sales_amount) OVER () * 100 as pct_of_total,
    sales_amount / SUM(sales_amount) OVER (PARTITION BY region) * 100 as pct_of_region
FROM sales_data;

Running totals: Cumulative sums for progressive analysis
Moving averages: Smooth out trends and reduce noise
Window frames: Control which rows to include in calculations
Percentage calculations: Compare parts to whole efficiently

🔄 Lead and Lag Functions

-- LAG: Access previous row values
SELECT 
    date,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY date) as prev_day_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY date) as daily_change,
    (revenue - LAG(revenue, 1) OVER (ORDER BY date)) / 
    LAG(revenue, 1) OVER (ORDER BY date) * 100 as pct_change
FROM daily_revenue;

-- LEAD: Access next row values
SELECT 
    customer_id,
    purchase_date,
    LEAD(purchase_date, 1) OVER (
        PARTITION BY customer_id 
        ORDER BY purchase_date
    ) as next_purchase,
    DATEDIFF(
        LEAD(purchase_date, 1) OVER (
            PARTITION BY customer_id 
            ORDER BY purchase_date
        ), 
        purchase_date
    ) as days_to_next_purchase
FROM customer_purchases;

-- FIRST_VALUE and LAST_VALUE
SELECT 
    employee_id,
    salary,
    hire_date,
    FIRST_VALUE(salary) OVER (
        PARTITION BY department 
        ORDER BY hire_date
    ) as first_hire_salary,
    LAST_VALUE(salary) OVER (
        PARTITION BY department 
        ORDER BY hire_date 
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as latest_hire_salary
FROM employees;

LAG(): Compare with previous periods for trend analysis
LEAD(): Look ahead to future values for forecasting
FIRST_VALUE/LAST_VALUE: Compare with first/last in window
Time series analysis: Perfect for sequential data analysis

🔗 Common Table Expressions (CTEs)

Organizing Complex Logic

CTEs (Common Table Expressions) allow you to create temporary named result sets that exist only during query execution. They make complex queries more readable, maintainable, and enable recursive operations.

Perfect For: Breaking down complex queries into logical, readable steps

CTE Patterns and Applications

📝 Basic CTEs for Query Organization

-- Single CTE for data preparation
WITH clean_sales AS (
    SELECT 
        product_id,
        customer_id,
        order_date,
        quantity,
        unit_price,
        quantity * unit_price as total_amount
    FROM raw_sales
    WHERE order_date >= '2024-01-01'
        AND quantity > 0
        AND unit_price > 0
)
SELECT 
    product_id,
    COUNT(*) as order_count,
    SUM(quantity) as total_quantity,
    SUM(total_amount) as total_revenue,
    AVG(total_amount) as avg_order_value
FROM clean_sales
GROUP BY product_id
ORDER BY total_revenue DESC;

-- Multiple CTEs for step-by-step analysis
WITH monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        SUM(amount) as monthly_revenue
    FROM orders
    WHERE order_date >= '2023-01-01'
    GROUP BY DATE_TRUNC('month', order_date)
),
sales_with_growth AS (
    SELECT 
        month,
        monthly_revenue,
        LAG(monthly_revenue) OVER (ORDER BY month) as prev_month,
        (monthly_revenue - LAG(monthly_revenue) OVER (ORDER BY month)) / 
        LAG(monthly_revenue) OVER (ORDER BY month) * 100 as growth_rate
    FROM monthly_sales
)
SELECT 
    month,
    monthly_revenue,
    growth_rate,
    CASE 
        WHEN growth_rate > 10 THEN 'High Growth'
        WHEN growth_rate > 0 THEN 'Positive Growth'
        WHEN growth_rate < -10 THEN 'Declining'
        ELSE 'Stable'
    END as growth_category
FROM sales_with_growth
ORDER BY month;

Data cleaning: Prepare and filter data in logical steps
Multiple CTEs: Chain operations for complex analysis
Readability: Break complex logic into understandable parts
Reusability: Reference the same CTE multiple times

🔄 Recursive CTEs

-- Organizational hierarchy traversal
WITH RECURSIVE employee_hierarchy AS (
    -- Anchor: Start with top-level managers
    SELECT 
        employee_id,
        employee_name,
        manager_id,
        1 as level,
        employee_name as path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive: Find subordinates
    SELECT 
        e.employee_id,
        e.employee_name,
        e.manager_id,
        eh.level + 1,
        eh.path || ' -> ' || e.employee_name
    FROM employees e
    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
    WHERE eh.level < 10  -- Prevent infinite recursion
)
SELECT 
    level,
    employee_name,
    path,
    COUNT(*) OVER (PARTITION BY level) as employees_at_level
FROM employee_hierarchy
ORDER BY level, employee_name;

-- Date series generation
WITH RECURSIVE date_series AS (
    SELECT DATE('2024-01-01') as date_value
    
    UNION ALL
    
    SELECT date_value + INTERVAL 1 DAY
    FROM date_series
    WHERE date_value < DATE('2024-12-31')
),
daily_sales_complete AS (
    SELECT 
        ds.date_value,
        COALESCE(s.daily_total, 0) as daily_sales
    FROM date_series ds
    LEFT JOIN (
        SELECT 
            DATE(order_date) as sale_date,
            SUM(amount) as daily_total
        FROM orders
        WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
        GROUP BY DATE(order_date)
    ) s ON ds.date_value = s.sale_date
)
SELECT 
    date_value,
    daily_sales,
    AVG(daily_sales) OVER (
        ORDER BY date_value 
        ROWS 6 PRECEDING
    ) as seven_day_avg
FROM daily_sales_complete
ORDER BY date_value;

Hierarchical data: Traverse organizational or category trees
Date series: Generate continuous date ranges for complete datasets
Path finding: Navigate relationships and dependencies
Recursive logic: Handle self-referential data structures

🔗 Advanced JOIN Techniques

Beyond Basic JOINs

While INNER and LEFT JOINs handle most scenarios, advanced analytics often requires more sophisticated joining techniques. These patterns help solve complex data integration challenges efficiently.

Advanced Techniques: Self-joins, LATERAL joins, and multiple join conditions

🪞 Self JOINs for Comparative Analysis

-- Finding employee pairs in same department
SELECT 
    e1.employee_name as employee1,
    e2.employee_name as employee2,
    e1.department,
    ABS(e1.salary - e2.salary) as salary_difference
FROM employees e1
INNER JOIN employees e2 
    ON e1.department = e2.department
    AND e1.employee_id < e2.employee_id  -- Avoid duplicates and self-matches
WHERE ABS(e1.salary - e2.salary) < 5000;

-- Comparing current vs previous values
SELECT 
    s1.date as current_date,
    s1.revenue as current_revenue,
    s2.date as previous_date,
    s2.revenue as previous_revenue,
    s1.revenue - s2.revenue as revenue_change,
    CASE 
        WHEN s1.revenue > s2.revenue THEN 'Increase'
        WHEN s1.revenue < s2.revenue THEN 'Decrease'
        ELSE 'No Change'
    END as trend
FROM sales_summary s1
LEFT JOIN sales_summary s2 
    ON s2.date = s1.date - INTERVAL 1 DAY
ORDER BY s1.date;

⚙️ Complex JOIN Conditions

-- Range-based joins for time periods
SELECT 
    e.employee_name,
    e.start_date,
    s.salary_amount,
    s.effective_date
FROM employees e
INNER JOIN salary_history s 
    ON e.employee_id = s.employee_id
    AND s.effective_date <= e.start_date + INTERVAL 90 DAY  -- Salary within 90 days
    AND (s.end_date IS NULL OR s.end_date >= e.start_date);

-- Conditional joins with multiple criteria
SELECT 
    c.customer_name,
    c.registration_date,
    p.promotion_name,
    p.discount_percentage
FROM customers c
LEFT JOIN promotions p 
    ON (c.customer_type = 'PREMIUM' AND p.customer_tier = 'PREMIUM')
    OR (c.customer_type = 'REGULAR' AND p.customer_tier IN ('REGULAR', 'ALL'))
    AND p.start_date <= c.registration_date
    AND p.end_date >= c.registration_date;

-- Fuzzy matching with similarity functions
SELECT 
    c1.company_name as source_company,
    c2.company_name as matched_company,
    SIMILARITY(c1.company_name, c2.company_name) as similarity_score
FROM companies c1
CROSS JOIN companies c2
WHERE c1.company_id != c2.company_id
    AND SIMILARITY(c1.company_name, c2.company_name) > 0.8
ORDER BY similarity_score DESC;

📅 Advanced Date/Time Analytics

Time-Based Analysis Mastery

Time series analysis is crucial for business intelligence. Advanced date/time functions enable sophisticated temporal analysis, cohort studies, and seasonal pattern detection.

📊 Advanced Date Functions

-- Cohort analysis: Customer retention by signup month
WITH customer_cohorts AS (
    SELECT 
        customer_id,
        DATE_TRUNC('month', registration_date) as cohort_month,
        registration_date
    FROM customers
),
customer_orders AS (
    SELECT 
        c.customer_id,
        c.cohort_month,
        DATE_TRUNC('month', o.order_date) as order_month,
        EXTRACT(EPOCH FROM (DATE_TRUNC('month', o.order_date) - c.cohort_month)) / 2628000 as month_number
    FROM customer_cohorts c
    INNER JOIN orders o ON c.customer_id = o.customer_id
)
SELECT 
    cohort_month,
    month_number,
    COUNT(DISTINCT customer_id) as active_customers,
    COUNT(DISTINCT customer_id) * 100.0 / 
    FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
        PARTITION BY cohort_month 
        ORDER BY month_number
    ) as retention_rate
FROM customer_orders
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number;

-- Business days calculation
WITH business_days AS (
    SELECT 
        order_id,
        order_date,
        delivery_date,
        CASE 
            WHEN EXTRACT(DOW FROM generate_series) NOT IN (0, 6) 
            THEN 1 ELSE 0 
        END as is_business_day
    FROM orders,
         generate_series(order_date, delivery_date, '1 day'::interval) as generate_series
)
SELECT 
    order_id,
    order_date,
    delivery_date,
    SUM(is_business_day) as business_days_to_delivery
FROM business_days
GROUP BY order_id, order_date, delivery_date;

🌟 Seasonal and Cyclical Analysis

-- Seasonal patterns by day of week and month
SELECT 
    EXTRACT(MONTH FROM order_date) as month,
    TO_CHAR(order_date, 'Day') as day_of_week,
    COUNT(*) as order_count,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value,
    -- Compare to overall averages
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as pct_of_total_orders,
    SUM(amount) / SUM(SUM(amount)) OVER () * 100 as pct_of_total_revenue
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '2 years'
GROUP BY 
    EXTRACT(MONTH FROM order_date),
    EXTRACT(DOW FROM order_date),
    TO_CHAR(order_date, 'Day')
ORDER BY month, EXTRACT(DOW FROM order_date);

-- Year-over-year comparison with seasonal adjustment
WITH monthly_sales AS (
    SELECT 
        EXTRACT(YEAR FROM order_date) as year,
        EXTRACT(MONTH FROM order_date) as month,
        SUM(amount) as monthly_revenue
    FROM orders
    GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
)
SELECT 
    year,
    month,
    monthly_revenue,
    LAG(monthly_revenue, 12) OVER (ORDER BY year, month) as same_month_prev_year,
    (monthly_revenue - LAG(monthly_revenue, 12) OVER (ORDER BY year, month)) /
    LAG(monthly_revenue, 12) OVER (ORDER BY year, month) * 100 as yoy_growth_rate,
    -- 12-month rolling average
    AVG(monthly_revenue) OVER (
        ORDER BY year, month 
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) as rolling_12_month_avg
FROM monthly_sales
ORDER BY year, month;

⚡ Query Performance Optimization

Performance Optimization Strategies

🚀 Indexing Strategies

-- Composite indexes for multi-column filters
CREATE INDEX idx_orders_customer_date 
ON orders (customer_id, order_date);

-- Partial indexes for filtered queries
CREATE INDEX idx_active_customers 
ON customers (last_purchase_date) 
WHERE status = 'ACTIVE';

-- Expression indexes for computed columns
CREATE INDEX idx_customer_full_name 
ON customers (LOWER(first_name || ' ' || last_name));

Create composite indexes for multi-column queries
Use partial indexes for frequently filtered data
Consider covering indexes to avoid table lookups

⚙️ Query Structure

-- Use EXISTS instead of IN for subqueries
SELECT customer_name 
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.customer_id = c.customer_id
);

-- LIMIT with ORDER BY requires index
SELECT * FROM large_table 
WHERE category = 'Electronics'
ORDER BY created_date DESC 
LIMIT 100;

Use EXISTS instead of IN for better performance
Add ORDER BY columns to indexes
Filter early, aggregate late

Common Performance Pitfalls

SELECT *: Always specify only needed columns
Non-sargable conditions: Avoid functions in WHERE clauses
Cartesian products: Always include proper JOIN conditions
Subquery optimization: Consider CTEs or window functions instead
Data type mismatches: Ensure proper type casting

🏢 Complete Real-World Analysis

E-commerce Performance Dashboard Query

WITH base_metrics AS (
    -- Calculate core metrics per customer per month
    SELECT 
        c.customer_id,
        c.customer_name,
        c.registration_date,
        DATE_TRUNC('month', o.order_date) as order_month,
        COUNT(DISTINCT o.order_id) as monthly_orders,
        SUM(oi.quantity * oi.unit_price) as monthly_revenue,
        COUNT(DISTINCT oi.product_id) as unique_products,
        AVG(oi.quantity * oi.unit_price) as avg_order_value
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    INNER JOIN order_items oi ON o.order_id = oi.order_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY c.customer_id, c.customer_name, c.registration_date, 
             DATE_TRUNC('month', o.order_date)
),
customer_lifecycle AS (
    -- Add customer lifecycle metrics
    SELECT 
        *,
        -- Customer age in months
        EXTRACT(EPOCH FROM (order_month - DATE_TRUNC('month', registration_date))) / 2628000 as customer_age_months,
        
        -- Running totals per customer
        SUM(monthly_revenue) OVER (
            PARTITION BY customer_id 
            ORDER BY order_month
        ) as cumulative_revenue,
        
        -- Customer velocity (orders per month since registration)
        SUM(monthly_orders) OVER (
            PARTITION BY customer_id 
            ORDER BY order_month
        ) / NULLIF(
            EXTRACT(EPOCH FROM (order_month - DATE_TRUNC('month', registration_date))) / 2628000 + 1,
            0
        ) as orders_per_month_avg,
        
        -- Time since last order
        LAG(order_month) OVER (
            PARTITION BY customer_id 
            ORDER BY order_month
        ) as prev_order_month
        
    FROM base_metrics
),
customer_segments AS (
    -- Segment customers based on behavior
    SELECT 
        *,
        CASE 
            WHEN cumulative_revenue >= 1000 AND orders_per_month_avg >= 2 THEN 'VIP'
            WHEN cumulative_revenue >= 500 OR orders_per_month_avg >= 1 THEN 'High Value'
            WHEN customer_age_months <= 3 THEN 'New Customer'
            WHEN EXTRACT(EPOCH FROM (CURRENT_DATE - order_month)) / 2628000 > 3 THEN 'At Risk'
            ELSE 'Regular'
        END as customer_segment,
        
        -- RFM-like scoring
        NTILE(5) OVER (ORDER BY monthly_revenue DESC) as revenue_quintile,
        NTILE(5) OVER (ORDER BY monthly_orders DESC) as frequency_quintile,
        NTILE(5) OVER (ORDER BY order_month DESC) as recency_quintile
        
    FROM customer_lifecycle
),
final_dashboard AS (
    SELECT 
        order_month,
        customer_segment,
        
        -- Count metrics
        COUNT(DISTINCT customer_id) as active_customers,
        SUM(monthly_orders) as total_orders,
        
        -- Revenue metrics
        SUM(monthly_revenue) as total_revenue,
        AVG(monthly_revenue) as avg_revenue_per_customer,
        SUM(monthly_revenue) / SUM(monthly_orders) as avg_order_value,
        
        -- Growth metrics
        SUM(monthly_revenue) - LAG(SUM(monthly_revenue)) OVER (
            PARTITION BY customer_segment 
            ORDER BY order_month
        ) as revenue_growth,
        
        (SUM(monthly_revenue) - LAG(SUM(monthly_revenue)) OVER (
            PARTITION BY customer_segment 
            ORDER BY order_month
        )) / NULLIF(LAG(SUM(monthly_revenue)) OVER (
            PARTITION BY customer_segment 
            ORDER BY order_month
        ), 0) * 100 as revenue_growth_pct,
        
        -- Customer health metrics
        AVG(orders_per_month_avg) as avg_purchase_frequency,
        COUNT(CASE WHEN customer_age_months <= 1 THEN 1 END) as new_customers,
        COUNT(CASE WHEN EXTRACT(EPOCH FROM (CURRENT_DATE - order_month)) / 2628000 > 2 THEN 1 END) as at_risk_customers
        
    FROM customer_segments
    GROUP BY order_month, customer_segment
)
SELECT 
    order_month,
    customer_segment,
    active_customers,
    total_orders,
    total_revenue,
    avg_revenue_per_customer,
    avg_order_value,
    revenue_growth_pct,
    avg_purchase_frequency,
    new_customers,
    at_risk_customers,
    
    -- Additional KPIs
    total_revenue / SUM(total_revenue) OVER (PARTITION BY order_month) * 100 as segment_revenue_share,
    active_customers / SUM(active_customers) OVER (PARTITION BY order_month) * 100 as segment_customer_share
    
FROM final_dashboard
ORDER BY order_month DESC, total_revenue DESC;

What This Query Does

Creates a comprehensive customer dashboard with segmentation, lifecycle metrics, and growth analysis

Advanced Techniques Used

CTEs, window functions, NTILE for segmentation, LAG for growth calculations

Business Value

Identifies customer segments, tracks health metrics, and enables data-driven retention strategies

🎯 Master Advanced SQL Techniques

These advanced SQL techniques will transform your data analysis capabilities. Practice with real datasets to build expertise and confidence in complex analytical scenarios.

🔄

Practice Regularly

Work with different datasets and scenarios

📊

Build Dashboards

Create comprehensive analytical queries

⚡

Optimize Performance

Focus on query efficiency and scalability

Data Analyst Course Data Science Program

← Back to Blog