This lesson covers fundamental principles and techniques of data visualization for effective communication of insights, with focus on selecting right chart type for your data.
1. Know Your Purpose
2. Know Your Data
3. Know Your Audience
Step 1: Identify Your Message
What story are you trying to tell?
- "X is bigger than Y" → Comparison
- "X affects Y" → Relationship
- "X changed over time" → Trend
- "X consists of A, B, C" → Composition
- "X is distributed this way" → Distribution
Step 2: Analyze Your Data Structure
-- Analyze data structure for chart selection
WITH data_analysis AS (
SELECT
-- Identify data types
CASE
WHEN COUNT(DISTINCT category) < 20 THEN 'Categorical'
ELSE 'Numerical'
END as primary_data_type,
-- Check for time component
CASE
WHEN date_column IS NOT NULL THEN 'Time Series'
ELSE 'Cross-sectional'
END as temporal_dimension,
-- Check for hierarchical structure
CASE
WHEN parent_category IS NOT NULL THEN 'Hierarchical'
ELSE 'Flat'
END as structure_type,
-- Check for geographical component
CASE
WHEN latitude IS NOT NULL OR country IS NOT NULL THEN 'Geographical'
ELSE 'Non-geographical'
END as spatial_dimension
FROM your_data_table
)
SELECT
primary_data_type,
temporal_dimension,
structure_type,
spatial_dimension,
CASE
WHEN primary_data_type = 'Categorical' AND temporal_dimension = 'Time Series'
THEN 'Consider: Line chart, Bar chart over time'
WHEN primary_data_type = 'Categorical' AND temporal_dimension = 'Cross-sectional'
THEN 'Consider: Bar chart, Pie chart, Treemap'
WHEN primary_data_type = 'Numerical' AND temporal_dimension = 'Time Series'
THEN 'Consider: Line chart, Area chart'
WHEN primary_data_type = 'Numerical' AND temporal_dimension = 'Cross-sectional'
THEN 'Consider: Scatter plot, Histogram, Box plot'
WHEN spatial_dimension = 'Geographical'
THEN 'Consider: Choropleth map, Bubble map'
ELSE 'Consider: Mixed chart types'
END as recommended_charts
FROM data_analysis;
Step 3: Match Chart to Purpose
| Purpose | Best Chart Types | When to Use |
|---|---|---|
| Comparison | Bar, Column, Line | Compare values across categories |
| Relationship | Scatter, Bubble, Heatmap | Show correlations between variables |
| Distribution | Histogram, Box, Violin | Show data spread and patterns |
| Composition | Pie, Donut, Treemap, Sunburst | Show parts of whole |
| Trend | Line, Area, Candlestick | Show changes over time |
| Geospatial | Choropleth, Bubble map | Show data by location |
1. Wrong Chart Type
2. Poor Scaling
3. Visual Clutter
4. Color Misuse
Purpose: Compare values across different categories
Characteristics:
When to Use Bar Charts:
- Comparing sales by product category
- Showing population by country
- Displaying survey responses by option
- Comparing performance metrics across departments
- Showing budget allocation by category
Bar Chart Types:
SQL Example for Bar Chart Data:
-- Prepare data for bar chart: Sales by product category
WITH category_sales AS (
SELECT
product_category,
SUM(sales_amount) as total_sales,
COUNT(*) as transaction_count,
AVG(sales_amount) as avg_transaction_value
FROM sales_data
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
GROUP BY product_category
ORDER BY total_sales DESC
),
chart_data AS (
SELECT
product_category,
total_sales,
-- Calculate percentage for stacked bar charts
ROUND(total_sales * 100.0 / SUM(total_sales) OVER (), 2) as percentage_of_total,
-- Create ranking for sorting
RANK() OVER (ORDER BY total_sales DESC) as sales_rank,
-- Format for display
'$' || ROUND(total_sales/1000000, 2) || 'M' as formatted_sales
FROM category_sales
)
SELECT
product_category as category_label,
total_sales as bar_value,
percentage_of_total as stack_percentage,
sales_rank as sort_order,
formatted_sales as display_label,
CASE
WHEN sales_rank <= 3 THEN 'Top Performer'
WHEN sales_rank <= 6 THEN 'Mid Performer'
ELSE 'Low Performer'
END as performance_category
FROM chart_data
ORDER BY sort_order;
Purpose: Show distribution of continuous numerical data
Characteristics:
When to Use Histograms:
- Analyzing age distribution of customers
- Showing distribution of test scores
- Displaying income distribution
- Analyzing response times
- Showing product price distribution
Histogram Components:
SQL Example for Histogram Data:
-- Prepare data for histogram: Customer age distribution
WITH age_data AS (
SELECT
AGE(CURRENT_DATE, birth_date) as age_years
FROM customers
WHERE birth_date IS NOT NULL
),
age_bins AS (
SELECT
age_years,
-- Create age bins (10-year intervals)
CASE
WHEN age_years < 20 THEN '0-19'
WHEN age_years < 30 THEN '20-29'
WHEN age_years < 40 THEN '30-39'
WHEN age_years < 50 THEN '40-49'
WHEN age_years < 60 THEN '50-59'
WHEN age_years < 70 THEN '60-69'
ELSE '70+'
END as age_group,
-- For continuous histogram (individual bins)
FLOOR(age_years / 10) * 10 as bin_start,
FLOOR(age_years / 10) * 10 + 10 as bin_end
FROM age_data
),
histogram_data AS (
SELECT
age_group as category_label,
bin_start,
bin_end,
COUNT(*) as frequency,
-- Calculate percentage
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM age_data), 2) as percentage,
-- Calculate cumulative percentage
ROUND(SUM(COUNT(*)) OVER (ORDER BY bin_start) * 100.0 / (SELECT COUNT(*) FROM age_data), 2) as cumulative_percentage
FROM age_bins
GROUP BY age_group, bin_start, bin_end
ORDER BY bin_start
)
SELECT
category_label,
bin_start as histogram_bin_start,
bin_end as histogram_bin_end,
frequency as bar_height,
percentage as relative_frequency,
cumulative_percentage,
CASE
WHEN cumulative_percentage <= 25 THEN 'First Quartile'
WHEN cumulative_percentage <= 50 THEN 'Second Quartile'
WHEN cumulative_percentage <= 75 THEN 'Third Quartile'
ELSE 'Fourth Quartile'
END as distribution_quartile
FROM histogram_data;
| Aspect | Bar Chart | Histogram |
|---|---|---|
| Data Type | Categorical | Continuous (binned) |
| Purpose | Comparison | Distribution |
| Bar Gaps | Yes (separate categories) | No (continuous data) |
| X-axis | Category labels | Numerical ranges |
| Y-axis | Values/Measurements | Frequency/Count |
| Order | Can be rearranged | Fixed numerical order |
Use Bar Chart When:
Use Histogram When:
Purpose: Show relationship between two numerical variables
Characteristics:
When to Use Scatter Plots:
- Analyzing correlation between advertising spend and sales
- Exploring relationship between height and weight
- Identifying outliers in performance data
- Showing relationship between experience and salary
- Analyzing correlation between study time and test scores
Scatter Plot Variations:
SQL Example for Scatter Plot Data:
-- Prepare data for scatter plot: Advertising spend vs Sales revenue
WITH scatter_data AS (
SELECT
advertising_spend as x_variable,
sales_revenue as y_variable,
product_category as color_group,
market_size as bubble_size,
-- Calculate correlation coefficient
CORR(advertising_spend, sales_revenue) OVER () as overall_correlation,
-- Create quadrants for analysis
CASE
WHEN advertising_spend > AVG(advertising_spend) OVER ()
AND sales_revenue > AVG(sales_revenue) OVER () THEN 'High-High'
WHEN advertising_spend > AVG(advertising_spend) OVER ()
AND sales_revenue <= AVG(sales_revenue) OVER () THEN 'High-Low'
WHEN advertising_spend <= AVG(advertising_spend) OVER ()
AND sales_revenue > AVG(sales_revenue) OVER () THEN 'Low-High'
ELSE 'Low-Low'
END as quadrant
FROM marketing_performance
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
),
regression_line AS (
-- Calculate linear regression line
SELECT
AVG(x_variable) as mean_x,
AVG(y_variable) as mean_y,
-- Calculate slope and intercept
(SUM((x_variable - AVG(x_variable) OVER ()) * (y_variable - AVG(y_variable) OVER ())) /
SUM(POWER(x_variable - AVG(x_variable) OVER (), 2))) as slope,
AVG(y_variable) - (SUM((x_variable - AVG(x_variable) OVER ()) * (y_variable - AVG(y_variable) OVER ())) /
SUM(POWER(x_variable - AVG(x_variable) OVER (), 2))) * AVG(x_variable) as intercept
FROM scatter_data
)
SELECT
s.x_variable,
s.y_variable,
s.color_group,
s.bubble_size,
s.quadrant,
-- Regression line values
r.intercept + r.slope * s.x_variable as regression_y,
-- Distance from regression line
ABS(s.y_variable - (r.intercept + r.slope * s.x_variable)) as residual_distance,
CASE
WHEN s.residual_distance > 2 * STDDEV(s.y_variable - (r.intercept + r.slope * s.x_variable)) OVER () THEN 'Outlier'
ELSE 'Normal'
END as outlier_status
FROM scatter_data s, regression_line r
ORDER BY s.x_variable;
Purpose: Show trends and changes over time
Characteristics:
When to Use Line Charts:
- Showing stock price changes over time
- Displaying website traffic trends
- Tracking sales performance monthly
- Showing temperature changes throughout day
- Comparing multiple metrics over time
Line Chart Types:
SQL Example for Line Chart Data:
-- Prepare data for line chart: Monthly sales trends
WITH time_series_data AS (
SELECT
DATE_TRUNC('month', date) as month_period,
product_category,
SUM(sales_amount) as monthly_sales,
COUNT(DISTINCT customer_id) as unique_customers,
AVG(sales_amount) as avg_transaction_value
FROM sales_data
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
GROUP BY DATE_TRUNC('month', date), product_category
),
trend_analysis AS (
SELECT
month_period,
product_category as series_name,
monthly_sales as line_value,
unique_customers as secondary_metric,
-- Calculate month-over-month growth
monthly_sales - LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) as mom_change,
(monthly_sales - LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period)) /
LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) * 100 as mom_growth_rate,
-- Calculate moving average for trend line
AVG(monthly_sales) OVER (
PARTITION BY product_category
ORDER BY month_period
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) as moving_avg_3month,
-- Identify trend direction
CASE
WHEN monthly_sales > LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period)
AND LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) >
LAG(monthly_sales, 2) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Strong Upward'
WHEN monthly_sales > LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Upward'
WHEN monthly_sales < LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period)
AND LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) <
LAG(monthly_sales, 2) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Strong Downward'
WHEN monthly_sales < LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Downward'
ELSE 'Stable'
END as trend_direction
FROM time_series_data
)
SELECT
month_period as time_axis,
series_name,
line_value,
moving_avg_3month as trend_line,
mom_growth_rate as growth_rate,
trend_direction,
CASE
WHEN trend_direction LIKE '%Upward%' THEN '#2E7D32' -- Green
WHEN trend_direction LIKE '%Downward%' THEN '#C62828' -- Red
ELSE '#1565C0' -- Blue
END as trend_color
FROM trend_analysis
ORDER BY time_axis, series_name;
| Aspect | Scatter Plot | Line Chart |
|---|---|---|
| Purpose | Relationship/Correlation | Trend/Change over time |
| X-axis | Numerical variable | Time/Sequential |
| Connection | No lines between points | Lines connect points |
| Emphasis | Individual data points | Continuity and trend |
| Time | Not required | Essential component |
Use Scatter Plot When:
Use Line Chart When:
Purpose: Show distribution summary with quartiles and outliers
Characteristics:
When to Use Box Plots:
- Comparing distributions across categories
- Identifying outliers in datasets
- Showing statistical summaries
- Comparing performance across groups
- Analyzing test scores by class
Box Plot Components:
SQL Example for Box Plot Data:
-- Prepare data for box plot: Salary distribution by department
WITH distribution_stats AS (
SELECT
department,
salary,
-- Calculate quartiles using window functions
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) as q1,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) as median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) as q3,
AVG(salary) OVER (PARTITION BY department) as mean_salary,
STDDEV(salary) OVER (PARTITION BY department) as std_salary
FROM employee_data
),
box_plot_data AS (
SELECT
department,
salary,
q1,
median,
q3,
-- Calculate IQR and whisker bounds
q3 - q1 as iqr,
q1 - 1.5 * (q3 - q1) as lower_whisker,
q3 + 1.5 * (q3 - q1) as upper_whisker,
-- Classify data points
CASE
WHEN salary < q1 - 1.5 * (q3 - q1) THEN 'Lower Outlier'
WHEN salary > q3 + 1.5 * (q3 - q1) THEN 'Upper Outlier'
WHEN salary BETWEEN q1 AND q3 THEN 'Within IQR'
WHEN salary < q1 THEN 'Lower Whisker'
ELSE 'Upper Whisker'
END as point_classification,
-- Calculate z-score for additional context
(salary - mean_salary) / std_salary as z_score
FROM distribution_stats
),
summary_stats AS (
SELECT
department,
MIN(salary) as min_salary,
MAX(salary) as max_salary,
AVG(salary) as avg_salary,
COUNT(*) as employee_count,
COUNT(CASE WHEN point_classification LIKE '%Outlier%' THEN 1 END) as outlier_count
FROM box_plot_data
GROUP BY department
)
SELECT
b.department,
b.q1 as box_lower_bound,
b.median as box_median,
b.q3 as box_upper_bound,
b.lower_whisker,
b.upper_whisker,
b.salary as individual_value,
b.point_classification,
b.z_score,
s.employee_count,
s.outlier_count,
ROUND(s.outlier_count * 100.0 / s.employee_count, 2) as outlier_percentage
FROM box_plot_data b
JOIN summary_stats s ON b.department = s.department
ORDER BY b.department, b.salary;
Purpose: Show distribution shape with density information
Characteristics:
When to Use Violin Plots:
- Showing detailed distribution shapes
- Identifying multimodal patterns
- Comparing complex distributions
- Analyzing response time patterns
- Exploring customer behavior distributions
Violin Plot Components:
SQL Example for Violin Plot Data:
-- Prepare data for violin plot: Customer satisfaction scores by region
WITH density_data AS (
SELECT
region,
satisfaction_score,
-- Create bins for density calculation
FLOOR(satisfaction_score * 10) / 10 as score_bin,
-- Calculate percentiles for violin shape
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p5,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p75,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p95
FROM customer_satisfaction
),
violin_shape AS (
SELECT
region,
score_bin,
-- Calculate density (frequency) for each bin
COUNT(*) as frequency,
-- Calculate relative density for violin width
COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY region) as relative_density,
-- Calculate violin width (scaled for visualization)
COUNT(*) * 1.0 / MAX(COUNT(*)) OVER (PARTITION BY region) as violin_width,
-- Get quartile values for box plot overlay
MIN(p25) as q25,
MIN(p50) as median,
MIN(p75) as q75,
MIN(p5) as min_5th,
MIN(p95) as max_95th,
-- Identify distribution characteristics
CASE
WHEN COUNT(*) = MAX(COUNT(*)) OVER (PARTITION BY region) THEN 'Peak'
ELSE 'Normal'
END as bin_type
FROM density_data
GROUP BY region, score_bin, p5, p25, p50, p75, p95
),
distribution_analysis AS (
SELECT
region,
-- Check for multimodality
COUNT(CASE WHEN bin_type = 'Peak' THEN 1 END) as peak_count,
CASE
WHEN COUNT(CASE WHEN bin_type = 'Peak' THEN 1 END) > 1 THEN 'Multimodal'
ELSE 'Unimodal'
END as distribution_type,
-- Calculate skewness (simplified)
AVG(CASE WHEN score_bin > median THEN violin_width * (score_bin - median) END) -
AVG(CASE WHEN score_bin < median THEN violin_width * (median - score_bin) END) as skewness_indicator,
COUNT(*) as total_observations
FROM violin_shape
GROUP BY region, median
)
SELECT
v.region,
v.score_bin as value_axis,
v.violin_width as violin_width_positive,
-v.violin_width as violin_width_negative, -- Mirror for violin shape
v.relative_density as density_value,
v.q25 as box_lower,
v.median as box_middle,
v.q75 as box_upper,
v.min_5th as whisker_lower,
v.max_95th as whisker_upper,
d.distribution_type,
d.peak_count,
CASE
WHEN d.distribution_type = 'Multimodal' THEN 'Complex distribution'
WHEN v.skewness_indicator > 0 THEN 'Right-skewed'
WHEN v.skewness_indicator < 0 THEN 'Left-skewed'
ELSE 'Symmetric'
END as distribution_shape
FROM violin_shape v
JOIN distribution_analysis d ON v.region = d.region
ORDER BY v.region, v.score_bin;
| Aspect | Box Plot | Violin Plot |
|---|---|---|
| Detail Level | Summary statistics | Full distribution |
| Shape | Rectangular box | Violin-shaped density |
| Information | Quartiles, outliers | Density, multimodality |
| Complexity | Simple, standardized | Detailed, informative |
| Best For | Quick comparisons | Detailed analysis |
Use Box Plot When:
Use Violin Plot When:
Purpose: Show data intensity using color gradients in a grid
Characteristics:
When to Use Heat Maps:
- Sales performance by product and region
- Website traffic by hour and day
- Correlation matrices
- Risk assessment by category and severity
- Activity patterns by time and location
SQL Example for Heat Map Data:
-- Prepare data for heat map: Sales by product category and month
WITH heat_map_data AS (
SELECT
product_category as row_category,
EXTRACT(MONTH FROM date) as column_category,
SUM(sales_amount) as heat_value,
COUNT(*) as transaction_count,
-- Calculate percentile for color scaling
PERCENT_RANK() OVER (ORDER BY SUM(sales_amount)) as percentile_rank,
-- Create color intensity levels
NTILE(5) OVER (ORDER BY SUM(sales_amount)) as color_bucket
FROM sales_data
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
GROUP BY product_category, EXTRACT(MONTH FROM date)
),
color_mapping AS (
SELECT
row_category,
column_category,
heat_value,
percentile_rank,
color_bucket,
-- Generate color values (simplified)
CASE
WHEN color_bucket = 1 THEN '#FFEBEE' -- Light red
WHEN color_bucket = 2 THEN '#FFCDD2' -- Light-medium red
WHEN color_bucket = 3 THEN '#EF9A9A' -- Medium red
WHEN color_bucket = 4 THEN '#E57373' -- Medium-dark red
ELSE '#C62828' -- Dark red
END as fill_color,
-- Create labels
CASE
WHEN column_category = 1 THEN 'Jan'
WHEN column_category = 2 THEN 'Feb'
WHEN column_category = 3 THEN 'Mar'
WHEN column_category = 4 THEN 'Apr'
WHEN column_category = 5 THEN 'May'
WHEN column_category = 6 THEN 'Jun'
WHEN column_category = 7 THEN 'Jul'
WHEN column_category = 8 THEN 'Aug'
WHEN column_category = 9 THEN 'Sep'
WHEN column_category = 10 THEN 'Oct'
WHEN column_category = 11 THEN 'Nov'
WHEN column_category = 12 THEN 'Dec'
END as month_label
FROM heat_map_data
)
SELECT
row_category,
column_category,
month_label,
heat_value,
fill_color,
percentile_rank,
-- Format for display
'$' || ROUND(heat_value/1000000, 1) || 'M' as formatted_value,
CASE
WHEN percentile_rank >= 0.8 THEN 'Very High'
WHEN percentile_rank >= 0.6 THEN 'High'
WHEN percentile_rank >= 0.4 THEN 'Medium'
WHEN percentile_rank >= 0.2 THEN 'Low'
ELSE 'Very Low'
END as performance_level
FROM color_mapping
ORDER BY row_category, column_category;
Purpose: Show data values by geographical regions using color
Characteristics:
When to Use Choropleth Maps:
- Population density by country
- Sales performance by state
- Election results by district
- Disease prevalence by region
- Income levels by county
SQL Example for Choropleth Map Data:
-- Prepare data for choropleth map: Sales by US state
WITH state_sales AS (
SELECT
state_code,
state_name,
SUM(sales_amount) as total_sales,
COUNT(DISTINCT customer_id) as customer_count,
AVG(sales_amount) as avg_sale_value,
-- Calculate per capita sales if population data available
SUM(sales_amount) / NULLIF(population, 0) as sales_per_capita
FROM sales_data s
JOIN geographic_data g ON s.state = g.state_code
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
GROUP BY state_code, state_name, population
),
color_classification AS (
SELECT
state_code,
state_name,
total_sales,
customer_count,
sales_per_capita,
-- Create quantiles for color classification
NTILE(5) OVER (ORDER BY total_sales) as sales_quintile,
-- Calculate percentiles
PERCENTILE_CONT(0.20) WITHIN GROUP (ORDER BY total_sales) OVER () as p20,
PERCENTILE_CONT(0.40) WITHIN GROUP (ORDER BY total_sales) OVER () as p40,
PERCENTILE_CONT(0.60) WITHIN GROUP (ORDER BY total_sales) OVER () as p60,
PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY total_sales) OVER () as p80
FROM state_sales
),
map_data AS (
SELECT
state_code,
state_name,
total_sales,
customer_count,
sales_per_capita,
sales_quintile,
-- Generate color based on quintile
CASE
WHEN sales_quintile = 1 THEN '#E3F2FD' -- Very light blue
WHEN sales_quintile = 2 THEN '#90CAF9' -- Light blue
WHEN sales_quintile = 3 THEN '#42A5F5' -- Medium blue
WHEN sales_quintile = 4 THEN '#1E88E5' -- Dark blue
ELSE '#0D47A1' -- Very dark blue
END as fill_color,
-- Create hover information
state_name || ': $' || ROUND(total_sales/1000000, 1) || 'M' as hover_text,
-- Performance classification
CASE
WHEN sales_quintile >= 4 THEN 'Top Performer'
WHEN sales_quintile >= 3 THEN 'Above Average'
WHEN sales_quintile >= 2 THEN 'Average'
ELSE 'Below Average'
END as performance_category
FROM color_classification
)
SELECT
state_code as geo_id,
state_name as geo_label,
total_sales as metric_value,
fill_color,
hover_text,
performance_category,
sales_per_capita as normalized_metric,
customer_count as supporting_metric
FROM map_data
ORDER BY total_sales DESC;
Purpose: Show hierarchical data as nested rectangles
Characteristics:
When to Use Treemaps:
- Sales by product category and subcategory
- Budget allocation by department and project
- File system usage analysis
- Market share by company and product
- Website traffic by section and subsection
SQL Example for Treemap Data:
-- Prepare data for treemap: Sales by category hierarchy
WITH category_hierarchy AS (
SELECT
main_category,
subcategory,
product_name,
SUM(sales_amount) as sales_value,
COUNT(*) as transaction_count,
-- Calculate hierarchy levels
1 as level_main,
2 as level_sub,
3 as level_product
FROM sales_data
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
GROUP BY main_category, subcategory, product_name
),
hierarchy_aggregates AS (
-- Calculate totals for each level
SELECT
main_category,
NULL as subcategory,
NULL as product_name,
SUM(sales_value) as sales_value,
COUNT(*) as item_count,
1 as hierarchy_level,
'Main Category' as node_type
FROM category_hierarchy
GROUP BY main_category
UNION ALL
SELECT
main_category,
subcategory,
NULL as product_name,
SUM(sales_value) as sales_value,
COUNT(*) as item_count,
2 as hierarchy_level,
'Subcategory' as node_type
FROM category_hierarchy
GROUP BY main_category, subcategory
UNION ALL
SELECT
main_category,
subcategory,
product_name,
sales_value,
1 as item_count,
3 as hierarchy_level,
'Product' as node_type
FROM category_hierarchy
),
treemap_data AS (
SELECT
COALESCE(main_category, 'Total') as parent_category,
COALESCE(subcategory, main_category) as category_name,
COALESCE(product_name, subcategory) as item_name,
sales_value as rectangle_size,
item_count as supporting_metric,
hierarchy_level,
node_type,
-- Calculate percentage of parent
sales_value * 100.0 / SUM(sales_value) OVER (PARTITION BY
CASE
WHEN hierarchy_level = 1 THEN 'Total'
WHEN hierarchy_level = 2 THEN main_category
ELSE subcategory
END
) as parent_percentage,
-- Generate colors based on performance
NTILE(10) OVER (ORDER BY sales_value) as color_bucket,
-- Create path for hierarchy
CASE
WHEN hierarchy_level = 1 THEN main_category
WHEN hierarchy_level = 2 THEN main_category || ' > ' || subcategory
ELSE main_category || ' > ' || subcategory || ' > ' || product_name
END as hierarchy_path
FROM hierarchy_aggregates
)
SELECT
hierarchy_path as treemap_path,
category_name as node_label,
rectangle_size as node_size,
parent_percentage as relative_size,
hierarchy_level as nesting_level,
node_type,
color_bucket,
CASE
WHEN color_bucket >= 8 THEN '#1B5E20' -- Dark green
WHEN color_bucket >= 6 THEN '#4CAF50' -- Green
WHEN color_bucket >= 4 THEN '#8BC34A' -- Light green
WHEN color_bucket >= 2 THEN '#CDDC39' -- Yellow-green
ELSE '#F57C00' -- Orange
END as fill_color,
'$' || ROUND(rectangle_size/1000000, 2) || 'M' as formatted_size
FROM treemap_data
ORDER BY hierarchy_level, rectangle_size DESC;
Purpose: Show hierarchical data as concentric circles
Characteristics:
When to Use Sunburst Charts:
- Organizational structure analysis
- File system usage visualization
- Product category breakdown
- Budget allocation visualization
- Website navigation path analysis
SQL Example for Sunburst Data:
-- Prepare data for sunburst: Organizational hierarchy
WITH org_hierarchy AS (
SELECT
department,
team,
employee_name,
salary as value_metric,
COUNT(*) as employee_count,
-- Calculate levels
1 as level_dept,
2 as level_team,
3 as level_employee
FROM employee_data
WHERE status = 'active'
),
sunburst_nodes AS (
-- Create nodes for each level
SELECT
department as node_id,
NULL as parent_id,
department as node_label,
SUM(salary) as node_value,
COUNT(*) as node_size,
1 as level,
'Department' as node_type
FROM org_hierarchy
GROUP BY department
UNION ALL
SELECT
department || '_' || team as node_id,
department as parent_id,
team as node_label,
SUM(salary) as node_value,
COUNT(*) as node_size,
2 as level,
'Team' as node_type
FROM org_hierarchy
GROUP BY department, team
UNION ALL
SELECT
department || '_' || team || '_' || employee_name as node_id,
department || '_' || team as parent_id,
employee_name as node_label,
salary as node_value,
1 as node_size,
3 as level,
'Employee' as node_type
FROM org_hierarchy
),
sunburst_calculations AS (
SELECT
node_id,
parent_id,
node_label,
node_value,
node_size,
level,
node_type,
-- Calculate angle for each node (360 degrees total)
node_value * 360.0 / SUM(node_value) OVER (PARTITION BY
CASE
WHEN level = 1 THEN 'root'
WHEN level = 2 THEN parent_id
ELSE parent_id
END
) as arc_angle,
-- Calculate cumulative angle for positioning
SUM(node_value) OVER (
PARTITION BY
CASE
WHEN level = 1 THEN 'root'
WHEN level = 2 THEN parent_id
ELSE parent_id
END
ORDER BY node_value
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) * 360.0 / SUM(node_value) OVER (
PARTITION BY
CASE
WHEN level = 1 THEN 'root'
WHEN level = 2 THEN parent_id
ELSE parent_id
END
) as start_angle,
-- Calculate radius based on level
level * 50 as inner_radius,
(level + 1) * 50 as outer_radius,
-- Generate colors
NTILE(12) OVER (ORDER BY node_value) as color_bucket
FROM sunburst_nodes
)
SELECT
node_id,
parent_id,
node_label,
node_value,
arc_angle,
start_angle,
inner_radius,
outer_radius,
level,
node_type,
CASE
WHEN level = 1 THEN '#1976D2' -- Blue for departments
WHEN level = 2 THEN '#388E3C' -- Green for teams
ELSE '#F57C00' -- Orange for employees
END as fill_color,
'$' || ROUND(node_value/1000, 0) || 'K' as formatted_value,
CASE
WHEN node_type = 'Employee' THEN node_label
ELSE node_label || ' (' || node_size || ')'
END as display_label
FROM sunburst_calculations
ORDER BY level, start_angle;
| Chart Type | Best For | Data Structure | Key Advantage |
|---|---|---|---|
| Heat Map | Intensity patterns | Two categorical + one numerical | Pattern identification |
| Choropleth | Geographic data | Geographic regions + values | Spatial relationships |
| Treemap | Hierarchical proportions | Nested categories | Space efficiency |
| Sunburst | Hierarchical breakdown | Multi-level hierarchy | Proportional hierarchy |
Purpose: Show cumulative effect of sequential positive and negative values
Characteristics:
When to Use Waterfall Charts:
- Financial statement analysis (revenue to profit)
- Population changes over time
- Project budget tracking
- Inventory level changes
- Customer acquisition and churn analysis
SQL Example for Waterfall Chart Data:
-- Prepare data for waterfall chart: Monthly profit analysis
WITH profit_components AS (
SELECT
'Starting Cash' as component_name,
starting_balance as value,
0 as sort_order,
'Total' as component_type
FROM financial_summary
WHERE month = '2023-01-01'
UNION ALL
SELECT
'Revenue' as component_name,
SUM(revenue) as value,
1 as sort_order,
'Positive' as component_type
FROM financial_data
WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
UNION ALL
SELECT
'Cost of Goods Sold' as component_name,
-SUM(cogs) as value,
2 as sort_order,
'Negative' as component_type
FROM financial_data
WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
UNION ALL
SELECT
'Operating Expenses' as component_name,
-SUM(operating_expenses) as value,
3 as sort_order,
'Negative' as component_type
FROM financial_data
WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
UNION ALL
SELECT
'Interest Expense' as component_name,
-SUM(interest_expense) as value,
4 as sort_order,
'Negative' as component_type
FROM financial_data
WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
UNION ALL
SELECT
'Taxes' as component_name,
-SUM(taxes) as value,
5 as sort_order,
'Negative' as component_type
FROM financial_data
WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
UNION ALL
SELECT
'Net Profit' as component_name,
(SELECT SUM(revenue) - SUM(cogs) - SUM(operating_expenses) - SUM(interest_expense) - SUM(taxes)
FROM financial_data WHERE month BETWEEN '2023-01-01' AND '2023-12-31') as value,
6 as sort_order,
'Total' as component_type
),
waterfall_calculations AS (
SELECT
component_name,
value,
sort_order,
component_type,
-- Calculate cumulative values
SUM(value) OVER (ORDER BY sort_order ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_value,
-- Calculate previous cumulative for floating effect
LAG(SUM(value) OVER (ORDER BY sort_order ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 1)
OVER (ORDER BY sort_order) as previous_cumulative,
-- Generate colors
CASE
WHEN component_type = 'Total' THEN '#2E7D32' -- Green for totals
WHEN component_type = 'Positive' THEN '#388E3C' -- Green for positive
WHEN component_type = 'Negative' THEN '#C62828' -- Red for negative
ELSE '#1565C0' -- Blue for starting
END as bar_color,
-- Format for display
CASE
WHEN component_type = 'Total' THEN 'Total'
WHEN component_type = 'Positive' THEN '+$' || ROUND(value/1000000, 1) || 'M'
WHEN component_type = 'Negative' THEN '-$' || ROUND(ABS(value)/1000000, 1) || 'M'
ELSE '$' || ROUND(value/1000000, 1) || 'M'
END as formatted_value
FROM profit_components
)
SELECT
component_name as step_label,
value as step_value,
previous_cumulative as floating_base,
cumulative_value as running_total,
bar_color,
formatted_value,
component_type,
sort_order as display_order
FROM waterfall_calculations
ORDER BY sort_order;
Purpose: Show breakdown of total into components with hierarchical relationships
Characteristics:
When to Use Cascade Charts:
- Sales breakdown by region and product
- Budget allocation by department and project
- Customer segmentation analysis
- Website traffic source breakdown
- Product portfolio analysis
SQL Example for Cascade Chart Data:
-- Prepare data for cascade chart: Sales breakdown by hierarchy
WITH sales_hierarchy AS (
SELECT
'Total Sales' as level_1,
region as level_2,
product_category as level_3,
SUM(sales_amount) as sales_value,
COUNT(*) as transaction_count
FROM sales_data
WHERE date >= '2023-01-01' AND date <= '2023-12-31'
GROUP BY region, product_category
),
cascade_levels AS (
-- Level 1: Total
SELECT
1 as cascade_level,
'Total Sales' as category_name,
NULL as parent_category,
SUM(sales_value) as category_value,
100.0 as percentage_of_parent,
'#1976D2' as level_color
FROM sales_hierarchy
UNION ALL
-- Level 2: Regions
SELECT
2 as cascade_level,
region as category_name,
'Total Sales' as parent_category,
SUM(sales_value) as category_value,
SUM(sales_value) * 100.0 / (SELECT SUM(sales_value) FROM sales_hierarchy) as percentage_of_parent,
'#388E3C' as level_color
FROM sales_hierarchy
GROUP BY region
UNION ALL
-- Level 3: Product Categories within Regions
SELECT
3 as cascade_level,
region || ' - ' || product_category as category_name,
region as parent_category,
SUM(sales_value) as category_value,
SUM(sales_value) * 100.0 / (
SELECT SUM(sales_value)
FROM sales_hierarchy sh2
WHERE sh2.region = sales_hierarchy.region
) as percentage_of_parent,
'#F57C00' as level_color
FROM sales_hierarchy
GROUP BY region, product_category
),
cascade_positioning AS (
SELECT
cascade_level,
category_name,
parent_category,
category_value,
percentage_of_parent,
level_color,
-- Calculate horizontal position
(cascade_level - 1) * 200 as x_position,
-- Calculate vertical position (stacked within level)
ROW_NUMBER() OVER (PARTITION BY cascade_level ORDER BY category_value DESC) as vertical_rank,
-- Calculate width proportional to value
CASE
WHEN cascade_level = 1 THEN 150
WHEN cascade_level = 2 THEN category_value * 100.0 / (SELECT MAX(category_value) FROM cascade_levels WHERE cascade_level = 2) * 120
ELSE category_value * 100.0 / (SELECT MAX(category_value) FROM cascade_levels WHERE cascade_level = 3) * 100
END as bar_width,
-- Format for display
'$' || ROUND(category_value/1000000, 1) || 'M' as formatted_value,
ROUND(percentage_of_parent, 1) || '%' as percentage_label
FROM cascade_levels
)
SELECT
category_name as node_label,
parent_category as connection_from,
x_position as horizontal_position,
vertical_rank * 40 as vertical_position,
bar_width as node_width,
category_value as node_value,
percentage_of_parent as relative_size,
level_color as fill_color,
formatted_value,
percentage_label,
cascade_level as hierarchy_level
FROM cascade_positioning
ORDER BY cascade_level, vertical_rank;
| Aspect | Waterfall Chart | Cascade Chart |
|---|---|---|
| Purpose | Cumulative effect | Hierarchical breakdown |
| Flow | Sequential accumulation | Left-to-right breakdown |
| Structure | Linear progression | Tree-like expansion |
| Emphasis | Net change | Component relationships |
| Best For | Financial analysis | Portfolio breakdown |
Use Waterfall Chart When:
Use Cascade Chart When:
In the next lesson, we'll explore dashboard design and data storytelling to create compelling visual narratives.