Loading…

Data Visualization Fundamentals | Data Analyst Fundamentals | Skivvy

YouTube video ID required

Data Visualization Fundamentals

Name: Data Analyst Fundamentals
Availability: InStock

Lesson Overview

This lesson covers fundamental principles and techniques of data visualization for effective communication of insights, with focus on selecting right chart type for your data.

What You'll Learn:

Principles of effective visualization
Chart types and their appropriate uses
Chart selection principles
Color theory and design principles
Common visualization mistakes to avoid
Tools for creating visualizations

Key Concepts:

Data Visualization: Graphical representation of data and information
Chart Types: Different ways to visually represent data
Visual Encoding: Using visual properties to represent data
Data-Ink Ratio: Proportion of ink used to display data
Perception: How humans interpret visual information

Chart Selection Principles

Fundamental Principles

1. Know Your Purpose

Comparison: Show differences between categories or over time
Relationship: Display correlations or connections between variables
Distribution: Show how data is spread across values
Composition: Show parts of a whole
Trend: Display changes over time
Geospatial: Show data by location

2. Know Your Data

Categorical: Distinct groups or labels
Numerical: Continuous or discrete numbers
Time-based: Data with temporal dimension
Hierarchical: Data with nested relationships
Geographical: Data with location components

3. Know Your Audience

Technical vs Non-technical: Complexity level
Decision-makers vs Analysts: Action vs exploration
Time Constraints: Quick insights vs detailed analysis
Domain Knowledge: Familiarity with subject matter

Chart Selection Framework

Step 1: Identify Your Message

What story are you trying to tell?
- "X is bigger than Y" → Comparison
- "X affects Y" → Relationship
- "X changed over time" → Trend
- "X consists of A, B, C" → Composition
- "X is distributed this way" → Distribution

Step 2: Analyze Your Data Structure

-- Analyze data structure for chart selection
WITH data_analysis AS (
    SELECT 
        -- Identify data types
        CASE 
            WHEN COUNT(DISTINCT category) < 20 THEN 'Categorical'
            ELSE 'Numerical'
        END as primary_data_type,
        -- Check for time component
        CASE 
            WHEN date_column IS NOT NULL THEN 'Time Series'
            ELSE 'Cross-sectional'
        END as temporal_dimension,
        -- Check for hierarchical structure
        CASE 
            WHEN parent_category IS NOT NULL THEN 'Hierarchical'
            ELSE 'Flat'
        END as structure_type,
        -- Check for geographical component
        CASE 
            WHEN latitude IS NOT NULL OR country IS NOT NULL THEN 'Geographical'
            ELSE 'Non-geographical'
        END as spatial_dimension
    FROM your_data_table
)
SELECT 
    primary_data_type,
    temporal_dimension,
    structure_type,
    spatial_dimension,
    CASE 
        WHEN primary_data_type = 'Categorical' AND temporal_dimension = 'Time Series' 
        THEN 'Consider: Line chart, Bar chart over time'
        WHEN primary_data_type = 'Categorical' AND temporal_dimension = 'Cross-sectional'
        THEN 'Consider: Bar chart, Pie chart, Treemap'
        WHEN primary_data_type = 'Numerical' AND temporal_dimension = 'Time Series'
        THEN 'Consider: Line chart, Area chart'
        WHEN primary_data_type = 'Numerical' AND temporal_dimension = 'Cross-sectional'
        THEN 'Consider: Scatter plot, Histogram, Box plot'
        WHEN spatial_dimension = 'Geographical'
        THEN 'Consider: Choropleth map, Bubble map'
        ELSE 'Consider: Mixed chart types'
    END as recommended_charts
FROM data_analysis;

Step 3: Match Chart to Purpose

Purpose	Best Chart Types	When to Use
Comparison	Bar, Column, Line	Compare values across categories
Relationship	Scatter, Bubble, Heatmap	Show correlations between variables
Distribution	Histogram, Box, Violin	Show data spread and patterns
Composition	Pie, Donut, Treemap, Sunburst	Show parts of whole
Trend	Line, Area, Candlestick	Show changes over time
Geospatial	Choropleth, Bubble map	Show data by location

Common Visualization Mistakes

1. Wrong Chart Type

Using pie charts for comparison instead of composition
Using line charts for categorical data
Using 3D charts when 2D suffices

2. Poor Scaling

Truncated axes that mislead
Inconsistent scales across charts
Log scales without clear indication

3. Visual Clutter

Too many colors or patterns
Unnecessary gridlines and borders
Over-annotation

4. Color Misuse

Using too many colors
Poor color contrast
Ignoring color blindness

Bar Chart vs Histogram

Bar Chart

Purpose: Compare values across different categories

Characteristics:

Categorical data on x-axis
Numerical values on y-axis
Gaps between bars (discrete categories)
Can be horizontal or vertical

When to Use Bar Charts:

- Comparing sales by product category
- Showing population by country
- Displaying survey responses by option
- Comparing performance metrics across departments
- Showing budget allocation by category

Bar Chart Types:

Vertical Bar Chart: Standard comparison
Horizontal Bar Chart: Better for long category names
Grouped Bar Chart: Compare multiple series
Stacked Bar Chart: Show composition within categories
100% Stacked Bar Chart: Show relative proportions

SQL Example for Bar Chart Data:

-- Prepare data for bar chart: Sales by product category
WITH category_sales AS (
    SELECT 
        product_category,
        SUM(sales_amount) as total_sales,
        COUNT(*) as transaction_count,
        AVG(sales_amount) as avg_transaction_value
    FROM sales_data
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
    GROUP BY product_category
    ORDER BY total_sales DESC
),
chart_data AS (
    SELECT 
        product_category,
        total_sales,
        -- Calculate percentage for stacked bar charts
        ROUND(total_sales * 100.0 / SUM(total_sales) OVER (), 2) as percentage_of_total,
        -- Create ranking for sorting
        RANK() OVER (ORDER BY total_sales DESC) as sales_rank,
        -- Format for display
        '$' || ROUND(total_sales/1000000, 2) || 'M' as formatted_sales
    FROM category_sales
)
SELECT 
    product_category as category_label,
    total_sales as bar_value,
    percentage_of_total as stack_percentage,
    sales_rank as sort_order,
    formatted_sales as display_label,
    CASE 
        WHEN sales_rank <= 3 THEN 'Top Performer'
        WHEN sales_rank <= 6 THEN 'Mid Performer'
        ELSE 'Low Performer'
    END as performance_category
FROM chart_data
ORDER BY sort_order;

Histogram

Purpose: Show distribution of continuous numerical data

Characteristics:

Continuous numerical data on x-axis (binned)
Frequency or count on y-axis
No gaps between bars (continuous data)
Shows data distribution shape

When to Use Histograms:

- Analyzing age distribution of customers
- Showing distribution of test scores
- Displaying income distribution
- Analyzing response times
- Showing product price distribution

Histogram Components:

Bins: Intervals that group data points
Frequency: Count of data points in each bin
Shape: Distribution pattern (normal, skewed, bimodal)
Outliers: Data points far from main distribution

SQL Example for Histogram Data:

-- Prepare data for histogram: Customer age distribution
WITH age_data AS (
    SELECT 
        AGE(CURRENT_DATE, birth_date) as age_years
    FROM customers
    WHERE birth_date IS NOT NULL
),
age_bins AS (
    SELECT 
        age_years,
        -- Create age bins (10-year intervals)
        CASE 
            WHEN age_years < 20 THEN '0-19'
            WHEN age_years < 30 THEN '20-29'
            WHEN age_years < 40 THEN '30-39'
            WHEN age_years < 50 THEN '40-49'
            WHEN age_years < 60 THEN '50-59'
            WHEN age_years < 70 THEN '60-69'
            ELSE '70+'
        END as age_group,
        -- For continuous histogram (individual bins)
        FLOOR(age_years / 10) * 10 as bin_start,
        FLOOR(age_years / 10) * 10 + 10 as bin_end
    FROM age_data
),
histogram_data AS (
    SELECT 
        age_group as category_label,
        bin_start,
        bin_end,
        COUNT(*) as frequency,
        -- Calculate percentage
        ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM age_data), 2) as percentage,
        -- Calculate cumulative percentage
        ROUND(SUM(COUNT(*)) OVER (ORDER BY bin_start) * 100.0 / (SELECT COUNT(*) FROM age_data), 2) as cumulative_percentage
    FROM age_bins
    GROUP BY age_group, bin_start, bin_end
    ORDER BY bin_start
)
SELECT 
    category_label,
        bin_start as histogram_bin_start,
        bin_end as histogram_bin_end,
        frequency as bar_height,
        percentage as relative_frequency,
        cumulative_percentage,
        CASE 
            WHEN cumulative_percentage <= 25 THEN 'First Quartile'
            WHEN cumulative_percentage <= 50 THEN 'Second Quartile'
            WHEN cumulative_percentage <= 75 THEN 'Third Quartile'
            ELSE 'Fourth Quartile'
        END as distribution_quartile
FROM histogram_data;

Key Differences

Aspect	Bar Chart	Histogram
Data Type	Categorical	Continuous (binned)
Purpose	Comparison	Distribution
Bar Gaps	Yes (separate categories)	No (continuous data)
X-axis	Category labels	Numerical ranges
Y-axis	Values/Measurements	Frequency/Count
Order	Can be rearranged	Fixed numerical order

Decision Guidelines

Use Bar Chart When:

You have distinct categories
You want to compare values
Order of categories is meaningful or can be optimized
You need to show exact values

Use Histogram When:

You have continuous numerical data
You want to understand distribution
You need to identify patterns like skewness or outliers
You're analyzing statistical properties

Scatter Plot vs Line Chart

Scatter Plot

Purpose: Show relationship between two numerical variables

Characteristics:

Two numerical variables (X and Y axes)
Individual data points as dots
Reveals correlations, clusters, and outliers
No connecting lines between points

When to Use Scatter Plots:

- Analyzing correlation between advertising spend and sales
- Exploring relationship between height and weight
- Identifying outliers in performance data
- Showing relationship between experience and salary
- Analyzing correlation between study time and test scores

Scatter Plot Variations:

Basic Scatter: Two variables
Bubble Chart: Third variable as bubble size
Color-coded Scatter: Third variable as color
3D Scatter: Three variables (use sparingly)

SQL Example for Scatter Plot Data:

-- Prepare data for scatter plot: Advertising spend vs Sales revenue
WITH scatter_data AS (
    SELECT 
        advertising_spend as x_variable,
        sales_revenue as y_variable,
        product_category as color_group,
        market_size as bubble_size,
        -- Calculate correlation coefficient
        CORR(advertising_spend, sales_revenue) OVER () as overall_correlation,
        -- Create quadrants for analysis
        CASE 
            WHEN advertising_spend > AVG(advertising_spend) OVER () 
                 AND sales_revenue > AVG(sales_revenue) OVER () THEN 'High-High'
            WHEN advertising_spend > AVG(advertising_spend) OVER () 
                 AND sales_revenue <= AVG(sales_revenue) OVER () THEN 'High-Low'
            WHEN advertising_spend <= AVG(advertising_spend) OVER () 
                 AND sales_revenue > AVG(sales_revenue) OVER () THEN 'Low-High'
            ELSE 'Low-Low'
        END as quadrant
    FROM marketing_performance
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
),
regression_line AS (
    -- Calculate linear regression line
    SELECT 
        AVG(x_variable) as mean_x,
        AVG(y_variable) as mean_y,
        -- Calculate slope and intercept
        (SUM((x_variable - AVG(x_variable) OVER ()) * (y_variable - AVG(y_variable) OVER ())) / 
         SUM(POWER(x_variable - AVG(x_variable) OVER (), 2))) as slope,
        AVG(y_variable) - (SUM((x_variable - AVG(x_variable) OVER ()) * (y_variable - AVG(y_variable) OVER ())) / 
         SUM(POWER(x_variable - AVG(x_variable) OVER (), 2))) * AVG(x_variable) as intercept
    FROM scatter_data
)
SELECT 
    s.x_variable,
    s.y_variable,
    s.color_group,
    s.bubble_size,
    s.quadrant,
    -- Regression line values
    r.intercept + r.slope * s.x_variable as regression_y,
    -- Distance from regression line
    ABS(s.y_variable - (r.intercept + r.slope * s.x_variable)) as residual_distance,
    CASE 
        WHEN s.residual_distance > 2 * STDDEV(s.y_variable - (r.intercept + r.slope * s.x_variable)) OVER () THEN 'Outlier'
        ELSE 'Normal'
    END as outlier_status
FROM scatter_data s, regression_line r
ORDER BY s.x_variable;

Line Chart

Purpose: Show trends and changes over time

Characteristics:

Time on x-axis, numerical value on y-axis
Connected data points showing progression
Emphasizes continuity and trends
Can show multiple series for comparison

When to Use Line Charts:

- Showing stock price changes over time
- Displaying website traffic trends
- Tracking sales performance monthly
- Showing temperature changes throughout day
- Comparing multiple metrics over time

Line Chart Types:

Single Line: One metric over time
Multi-line: Multiple metrics comparison
Area Chart: Filled area under line
Stacked Area: Composition over time

SQL Example for Line Chart Data:

-- Prepare data for line chart: Monthly sales trends
WITH time_series_data AS (
    SELECT 
        DATE_TRUNC('month', date) as month_period,
        product_category,
        SUM(sales_amount) as monthly_sales,
        COUNT(DISTINCT customer_id) as unique_customers,
        AVG(sales_amount) as avg_transaction_value
    FROM sales_data
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
    GROUP BY DATE_TRUNC('month', date), product_category
),
trend_analysis AS (
    SELECT 
        month_period,
        product_category as series_name,
        monthly_sales as line_value,
        unique_customers as secondary_metric,
        -- Calculate month-over-month growth
        monthly_sales - LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) as mom_change,
        (monthly_sales - LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period)) / 
        LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) * 100 as mom_growth_rate,
        -- Calculate moving average for trend line
        AVG(monthly_sales) OVER (
            PARTITION BY product_category 
            ORDER BY month_period 
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) as moving_avg_3month,
        -- Identify trend direction
        CASE 
            WHEN monthly_sales > LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) 
                 AND LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) > 
                      LAG(monthly_sales, 2) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Strong Upward'
            WHEN monthly_sales > LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Upward'
            WHEN monthly_sales < LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) 
                 AND LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) < 
                      LAG(monthly_sales, 2) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Strong Downward'
            WHEN monthly_sales < LAG(monthly_sales) OVER (PARTITION BY product_category ORDER BY month_period) THEN 'Downward'
            ELSE 'Stable'
        END as trend_direction
    FROM time_series_data
)
SELECT 
    month_period as time_axis,
    series_name,
    line_value,
    moving_avg_3month as trend_line,
    mom_growth_rate as growth_rate,
    trend_direction,
    CASE 
        WHEN trend_direction LIKE '%Upward%' THEN '#2E7D32'  -- Green
        WHEN trend_direction LIKE '%Downward%' THEN '#C62828'  -- Red
        ELSE '#1565C0'  -- Blue
    END as trend_color
FROM trend_analysis
ORDER BY time_axis, series_name;

Key Differences

Aspect	Scatter Plot	Line Chart
Purpose	Relationship/Correlation	Trend/Change over time
X-axis	Numerical variable	Time/Sequential
Connection	No lines between points	Lines connect points
Emphasis	Individual data points	Continuity and trend
Time	Not required	Essential component

Decision Guidelines

Use Scatter Plot When:

You want to explore relationships between variables
You need to identify correlations or patterns
You're looking for clusters or outliers
Time is not primary dimension

Use Line Chart When:

You're showing changes over time
Continuity and progression are important
You need to identify trends
You're comparing multiple time series

Box Plot vs Violin Plot

Box Plot

Purpose: Show distribution summary with quartiles and outliers

Characteristics:

Five-number summary (min, Q1, median, Q3, max)
Box represents interquartile range (IQR)
Whiskers extend to 1.5 × IQR
Points beyond whiskers are outliers
Compact and standardized

When to Use Box Plots:

- Comparing distributions across categories
- Identifying outliers in datasets
- Showing statistical summaries
- Comparing performance across groups
- Analyzing test scores by class

Box Plot Components:

Median: Middle value (50th percentile)
Q1/Q3: 25th and 75th percentiles
IQR: Interquartile range (Q3 - Q1)
Whiskers: Extend to min/max within 1.5 × IQR
Outliers: Points beyond whiskers

SQL Example for Box Plot Data:

-- Prepare data for box plot: Salary distribution by department
WITH distribution_stats AS (
    SELECT 
        department,
        salary,
        -- Calculate quartiles using window functions
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) as q1,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) as median,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) as q3,
        AVG(salary) OVER (PARTITION BY department) as mean_salary,
        STDDEV(salary) OVER (PARTITION BY department) as std_salary
    FROM employee_data
),
box_plot_data AS (
    SELECT 
        department,
        salary,
        q1,
        median,
        q3,
        -- Calculate IQR and whisker bounds
        q3 - q1 as iqr,
        q1 - 1.5 * (q3 - q1) as lower_whisker,
        q3 + 1.5 * (q3 - q1) as upper_whisker,
        -- Classify data points
        CASE 
            WHEN salary < q1 - 1.5 * (q3 - q1) THEN 'Lower Outlier'
            WHEN salary > q3 + 1.5 * (q3 - q1) THEN 'Upper Outlier'
            WHEN salary BETWEEN q1 AND q3 THEN 'Within IQR'
            WHEN salary < q1 THEN 'Lower Whisker'
            ELSE 'Upper Whisker'
        END as point_classification,
        -- Calculate z-score for additional context
        (salary - mean_salary) / std_salary as z_score
    FROM distribution_stats
),
summary_stats AS (
    SELECT 
        department,
        MIN(salary) as min_salary,
        MAX(salary) as max_salary,
        AVG(salary) as avg_salary,
        COUNT(*) as employee_count,
        COUNT(CASE WHEN point_classification LIKE '%Outlier%' THEN 1 END) as outlier_count
    FROM box_plot_data
    GROUP BY department
)
SELECT 
    b.department,
    b.q1 as box_lower_bound,
    b.median as box_median,
    b.q3 as box_upper_bound,
    b.lower_whisker,
    b.upper_whisker,
    b.salary as individual_value,
    b.point_classification,
    b.z_score,
    s.employee_count,
    s.outlier_count,
    ROUND(s.outlier_count * 100.0 / s.employee_count, 2) as outlier_percentage
FROM box_plot_data b
JOIN summary_stats s ON b.department = s.department
ORDER BY b.department, b.salary;

Violin Plot

Purpose: Show distribution shape with density information

Characteristics:

Combines box plot with density curve
Shows probability density at different values
Wider sections indicate higher probability
Reveals multimodal distributions
More detailed than box plot

When to Use Violin Plots:

- Showing detailed distribution shapes
- Identifying multimodal patterns
- Comparing complex distributions
- Analyzing response time patterns
- Exploring customer behavior distributions

Violin Plot Components:

Central Box: Contains median and quartiles
Density Curves: Show probability distribution
Width: Indicates frequency/probability
Multiple Peaks: Show multimodal distributions

SQL Example for Violin Plot Data:

-- Prepare data for violin plot: Customer satisfaction scores by region
WITH density_data AS (
    SELECT 
        region,
        satisfaction_score,
        -- Create bins for density calculation
        FLOOR(satisfaction_score * 10) / 10 as score_bin,
        -- Calculate percentiles for violin shape
        PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p5,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p25,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p50,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p75,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY satisfaction_score) OVER (PARTITION BY region) as p95
    FROM customer_satisfaction
),
violin_shape AS (
    SELECT 
        region,
        score_bin,
        -- Calculate density (frequency) for each bin
        COUNT(*) as frequency,
        -- Calculate relative density for violin width
        COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY region) as relative_density,
        -- Calculate violin width (scaled for visualization)
        COUNT(*) * 1.0 / MAX(COUNT(*)) OVER (PARTITION BY region) as violin_width,
        -- Get quartile values for box plot overlay
        MIN(p25) as q25,
        MIN(p50) as median,
        MIN(p75) as q75,
        MIN(p5) as min_5th,
        MIN(p95) as max_95th,
        -- Identify distribution characteristics
        CASE 
            WHEN COUNT(*) = MAX(COUNT(*)) OVER (PARTITION BY region) THEN 'Peak'
            ELSE 'Normal'
        END as bin_type
    FROM density_data
    GROUP BY region, score_bin, p5, p25, p50, p75, p95
),
distribution_analysis AS (
    SELECT 
        region,
        -- Check for multimodality
        COUNT(CASE WHEN bin_type = 'Peak' THEN 1 END) as peak_count,
        CASE 
            WHEN COUNT(CASE WHEN bin_type = 'Peak' THEN 1 END) > 1 THEN 'Multimodal'
            ELSE 'Unimodal'
        END as distribution_type,
        -- Calculate skewness (simplified)
        AVG(CASE WHEN score_bin > median THEN violin_width * (score_bin - median) END) - 
        AVG(CASE WHEN score_bin < median THEN violin_width * (median - score_bin) END) as skewness_indicator,
        COUNT(*) as total_observations
    FROM violin_shape
    GROUP BY region, median
)
SELECT 
    v.region,
    v.score_bin as value_axis,
    v.violin_width as violin_width_positive,
    -v.violin_width as violin_width_negative,  -- Mirror for violin shape
    v.relative_density as density_value,
    v.q25 as box_lower,
    v.median as box_middle,
    v.q75 as box_upper,
    v.min_5th as whisker_lower,
    v.max_95th as whisker_upper,
    d.distribution_type,
    d.peak_count,
    CASE 
        WHEN d.distribution_type = 'Multimodal' THEN 'Complex distribution'
        WHEN v.skewness_indicator > 0 THEN 'Right-skewed'
        WHEN v.skewness_indicator < 0 THEN 'Left-skewed'
        ELSE 'Symmetric'
    END as distribution_shape
FROM violin_shape v
JOIN distribution_analysis d ON v.region = d.region
ORDER BY v.region, v.score_bin;

Key Differences

Aspect	Box Plot	Violin Plot
Detail Level	Summary statistics	Full distribution
Shape	Rectangular box	Violin-shaped density
Information	Quartiles, outliers	Density, multimodality
Complexity	Simple, standardized	Detailed, informative
Best For	Quick comparisons	Detailed analysis

Decision Guidelines

Use Box Plot When:

You need a quick, standardized summary
Comparing many categories
Audience needs simple interpretation
Outlier identification is primary goal
Space is limited

Use Violin Plot When:

You need detailed distribution information
Identifying multimodal patterns
Exploratory data analysis
Audience understands statistical concepts
Showing complex distributions

Heat Map vs Choropleth Map vs Treemap vs Sunburst

Heat Map

Purpose: Show data intensity using color gradients in a grid

Characteristics:

Two categorical dimensions
Color intensity represents numerical value
Grid layout with rows and columns
Excellent for pattern identification

When to Use Heat Maps:

- Sales performance by product and region
- Website traffic by hour and day
- Correlation matrices
- Risk assessment by category and severity
- Activity patterns by time and location

SQL Example for Heat Map Data:

-- Prepare data for heat map: Sales by product category and month
WITH heat_map_data AS (
    SELECT 
        product_category as row_category,
        EXTRACT(MONTH FROM date) as column_category,
        SUM(sales_amount) as heat_value,
        COUNT(*) as transaction_count,
        -- Calculate percentile for color scaling
        PERCENT_RANK() OVER (ORDER BY SUM(sales_amount)) as percentile_rank,
        -- Create color intensity levels
        NTILE(5) OVER (ORDER BY SUM(sales_amount)) as color_bucket
    FROM sales_data
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
    GROUP BY product_category, EXTRACT(MONTH FROM date)
),
color_mapping AS (
    SELECT 
        row_category,
        column_category,
        heat_value,
        percentile_rank,
        color_bucket,
        -- Generate color values (simplified)
        CASE 
            WHEN color_bucket = 1 THEN '#FFEBEE'  -- Light red
            WHEN color_bucket = 2 THEN '#FFCDD2'  -- Light-medium red
            WHEN color_bucket = 3 THEN '#EF9A9A'  -- Medium red
            WHEN color_bucket = 4 THEN '#E57373'  -- Medium-dark red
            ELSE '#C62828'  -- Dark red
        END as fill_color,
        -- Create labels
        CASE 
            WHEN column_category = 1 THEN 'Jan'
            WHEN column_category = 2 THEN 'Feb'
            WHEN column_category = 3 THEN 'Mar'
            WHEN column_category = 4 THEN 'Apr'
            WHEN column_category = 5 THEN 'May'
            WHEN column_category = 6 THEN 'Jun'
            WHEN column_category = 7 THEN 'Jul'
            WHEN column_category = 8 THEN 'Aug'
            WHEN column_category = 9 THEN 'Sep'
            WHEN column_category = 10 THEN 'Oct'
            WHEN column_category = 11 THEN 'Nov'
            WHEN column_category = 12 THEN 'Dec'
        END as month_label
    FROM heat_map_data
)
SELECT 
    row_category,
    column_category,
    month_label,
    heat_value,
    fill_color,
    percentile_rank,
    -- Format for display
        '$' || ROUND(heat_value/1000000, 1) || 'M' as formatted_value,
    CASE 
        WHEN percentile_rank >= 0.8 THEN 'Very High'
        WHEN percentile_rank >= 0.6 THEN 'High'
        WHEN percentile_rank >= 0.4 THEN 'Medium'
        WHEN percentile_rank >= 0.2 THEN 'Low'
        ELSE 'Very Low'
    END as performance_level
FROM color_mapping
ORDER BY row_category, column_category;

Choropleth Map

Purpose: Show data values by geographical regions using color

Characteristics:

Geographic boundaries (countries, states, etc.)
Color intensity represents data values
Spatial patterns and relationships
Requires geographic data

When to Use Choropleth Maps:

- Population density by country
- Sales performance by state
- Election results by district
- Disease prevalence by region
- Income levels by county

SQL Example for Choropleth Map Data:

-- Prepare data for choropleth map: Sales by US state
WITH state_sales AS (
    SELECT 
        state_code,
        state_name,
        SUM(sales_amount) as total_sales,
        COUNT(DISTINCT customer_id) as customer_count,
        AVG(sales_amount) as avg_sale_value,
        -- Calculate per capita sales if population data available
        SUM(sales_amount) / NULLIF(population, 0) as sales_per_capita
    FROM sales_data s
    JOIN geographic_data g ON s.state = g.state_code
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
    GROUP BY state_code, state_name, population
),
color_classification AS (
    SELECT 
        state_code,
        state_name,
        total_sales,
        customer_count,
        sales_per_capita,
        -- Create quantiles for color classification
        NTILE(5) OVER (ORDER BY total_sales) as sales_quintile,
        -- Calculate percentiles
        PERCENTILE_CONT(0.20) WITHIN GROUP (ORDER BY total_sales) OVER () as p20,
        PERCENTILE_CONT(0.40) WITHIN GROUP (ORDER BY total_sales) OVER () as p40,
        PERCENTILE_CONT(0.60) WITHIN GROUP (ORDER BY total_sales) OVER () as p60,
        PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY total_sales) OVER () as p80
    FROM state_sales
),
map_data AS (
    SELECT 
        state_code,
        state_name,
        total_sales,
        customer_count,
        sales_per_capita,
        sales_quintile,
        -- Generate color based on quintile
        CASE 
            WHEN sales_quintile = 1 THEN '#E3F2FD'  -- Very light blue
            WHEN sales_quintile = 2 THEN '#90CAF9'  -- Light blue
            WHEN sales_quintile = 3 THEN '#42A5F5'  -- Medium blue
            WHEN sales_quintile = 4 THEN '#1E88E5'  -- Dark blue
            ELSE '#0D47A1'  -- Very dark blue
        END as fill_color,
        -- Create hover information
        state_name || ': $' || ROUND(total_sales/1000000, 1) || 'M' as hover_text,
        -- Performance classification
        CASE 
            WHEN sales_quintile >= 4 THEN 'Top Performer'
            WHEN sales_quintile >= 3 THEN 'Above Average'
            WHEN sales_quintile >= 2 THEN 'Average'
            ELSE 'Below Average'
        END as performance_category
    FROM color_classification
)
SELECT 
    state_code as geo_id,
    state_name as geo_label,
    total_sales as metric_value,
    fill_color,
    hover_text,
    performance_category,
    sales_per_capita as normalized_metric,
    customer_count as supporting_metric
FROM map_data
ORDER BY total_sales DESC;

Treemap

Purpose: Show hierarchical data as nested rectangles

Characteristics:

Hierarchical structure with nested rectangles
Rectangle size represents numerical value
Color can represent additional dimension
Space-efficient for large datasets

When to Use Treemaps:

- Sales by product category and subcategory
- Budget allocation by department and project
- File system usage analysis
- Market share by company and product
- Website traffic by section and subsection

SQL Example for Treemap Data:

-- Prepare data for treemap: Sales by category hierarchy
WITH category_hierarchy AS (
    SELECT 
        main_category,
        subcategory,
        product_name,
        SUM(sales_amount) as sales_value,
        COUNT(*) as transaction_count,
        -- Calculate hierarchy levels
        1 as level_main,
        2 as level_sub,
        3 as level_product
    FROM sales_data
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
    GROUP BY main_category, subcategory, product_name
),
hierarchy_aggregates AS (
    -- Calculate totals for each level
    SELECT 
        main_category,
        NULL as subcategory,
        NULL as product_name,
        SUM(sales_value) as sales_value,
        COUNT(*) as item_count,
        1 as hierarchy_level,
        'Main Category' as node_type
    FROM category_hierarchy
    GROUP BY main_category
    
    UNION ALL
    
    SELECT 
        main_category,
        subcategory,
        NULL as product_name,
        SUM(sales_value) as sales_value,
        COUNT(*) as item_count,
        2 as hierarchy_level,
        'Subcategory' as node_type
    FROM category_hierarchy
    GROUP BY main_category, subcategory
    
    UNION ALL
    
    SELECT 
        main_category,
        subcategory,
        product_name,
        sales_value,
        1 as item_count,
        3 as hierarchy_level,
        'Product' as node_type
    FROM category_hierarchy
),
treemap_data AS (
    SELECT 
        COALESCE(main_category, 'Total') as parent_category,
        COALESCE(subcategory, main_category) as category_name,
        COALESCE(product_name, subcategory) as item_name,
        sales_value as rectangle_size,
        item_count as supporting_metric,
        hierarchy_level,
        node_type,
        -- Calculate percentage of parent
        sales_value * 100.0 / SUM(sales_value) OVER (PARTITION BY 
            CASE 
                WHEN hierarchy_level = 1 THEN 'Total'
                WHEN hierarchy_level = 2 THEN main_category
                ELSE subcategory
            END
        ) as parent_percentage,
        -- Generate colors based on performance
        NTILE(10) OVER (ORDER BY sales_value) as color_bucket,
        -- Create path for hierarchy
        CASE 
            WHEN hierarchy_level = 1 THEN main_category
            WHEN hierarchy_level = 2 THEN main_category || ' > ' || subcategory
            ELSE main_category || ' > ' || subcategory || ' > ' || product_name
        END as hierarchy_path
    FROM hierarchy_aggregates
)
SELECT 
    hierarchy_path as treemap_path,
    category_name as node_label,
    rectangle_size as node_size,
    parent_percentage as relative_size,
    hierarchy_level as nesting_level,
    node_type,
    color_bucket,
    CASE 
        WHEN color_bucket >= 8 THEN '#1B5E20'  -- Dark green
        WHEN color_bucket >= 6 THEN '#4CAF50'  -- Green
        WHEN color_bucket >= 4 THEN '#8BC34A'  -- Light green
        WHEN color_bucket >= 2 THEN '#CDDC39'  -- Yellow-green
        ELSE '#F57C00'  -- Orange
    END as fill_color,
    '$' || ROUND(rectangle_size/1000000, 2) || 'M' as formatted_size
FROM treemap_data
ORDER BY hierarchy_level, rectangle_size DESC;

Sunburst

Purpose: Show hierarchical data as concentric circles

Characteristics:

Hierarchical structure with radial layout
Center represents root, outer rings represent deeper levels
Arc size represents value
Good for showing proportions in hierarchy

When to Use Sunburst Charts:

- Organizational structure analysis
- File system usage visualization
- Product category breakdown
- Budget allocation visualization
- Website navigation path analysis

SQL Example for Sunburst Data:

-- Prepare data for sunburst: Organizational hierarchy
WITH org_hierarchy AS (
    SELECT 
        department,
        team,
        employee_name,
        salary as value_metric,
        COUNT(*) as employee_count,
        -- Calculate levels
        1 as level_dept,
        2 as level_team,
        3 as level_employee
    FROM employee_data
    WHERE status = 'active'
),
sunburst_nodes AS (
    -- Create nodes for each level
    SELECT 
        department as node_id,
        NULL as parent_id,
        department as node_label,
        SUM(salary) as node_value,
        COUNT(*) as node_size,
        1 as level,
        'Department' as node_type
    FROM org_hierarchy
    GROUP BY department
    
    UNION ALL
    
    SELECT 
        department || '_' || team as node_id,
        department as parent_id,
        team as node_label,
        SUM(salary) as node_value,
        COUNT(*) as node_size,
        2 as level,
        'Team' as node_type
    FROM org_hierarchy
    GROUP BY department, team
    
    UNION ALL
    
    SELECT 
        department || '_' || team || '_' || employee_name as node_id,
        department || '_' || team as parent_id,
        employee_name as node_label,
        salary as node_value,
        1 as node_size,
        3 as level,
        'Employee' as node_type
    FROM org_hierarchy
),
sunburst_calculations AS (
    SELECT 
        node_id,
        parent_id,
        node_label,
        node_value,
        node_size,
        level,
        node_type,
        -- Calculate angle for each node (360 degrees total)
        node_value * 360.0 / SUM(node_value) OVER (PARTITION BY 
            CASE 
                WHEN level = 1 THEN 'root'
                WHEN level = 2 THEN parent_id
                ELSE parent_id
            END
        ) as arc_angle,
        -- Calculate cumulative angle for positioning
        SUM(node_value) OVER (
            PARTITION BY 
                CASE 
                    WHEN level = 1 THEN 'root'
                    WHEN level = 2 THEN parent_id
                    ELSE parent_id
                END
            ORDER BY node_value
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ) * 360.0 / SUM(node_value) OVER (
            PARTITION BY 
                CASE 
                    WHEN level = 1 THEN 'root'
                    WHEN level = 2 THEN parent_id
                    ELSE parent_id
                END
        ) as start_angle,
        -- Calculate radius based on level
        level * 50 as inner_radius,
        (level + 1) * 50 as outer_radius,
        -- Generate colors
        NTILE(12) OVER (ORDER BY node_value) as color_bucket
    FROM sunburst_nodes
)
SELECT 
    node_id,
    parent_id,
    node_label,
    node_value,
    arc_angle,
    start_angle,
    inner_radius,
    outer_radius,
    level,
    node_type,
    CASE 
        WHEN level = 1 THEN '#1976D2'  -- Blue for departments
        WHEN level = 2 THEN '#388E3C'  -- Green for teams
        ELSE '#F57C00'  -- Orange for employees
    END as fill_color,
    '$' || ROUND(node_value/1000, 0) || 'K' as formatted_value,
    CASE 
        WHEN node_type = 'Employee' THEN node_label
        ELSE node_label || ' (' || node_size || ')'
    END as display_label
FROM sunburst_calculations
ORDER BY level, start_angle;

Decision Guidelines

Chart Type	Best For	Data Structure	Key Advantage
Heat Map	Intensity patterns	Two categorical + one numerical	Pattern identification
Choropleth	Geographic data	Geographic regions + values	Spatial relationships
Treemap	Hierarchical proportions	Nested categories	Space efficiency
Sunburst	Hierarchical breakdown	Multi-level hierarchy	Proportional hierarchy

Waterfall and Cascade Charts

Waterfall Chart

Purpose: Show cumulative effect of sequential positive and negative values

Characteristics:

Starting and ending values
Intermediate positive and negative changes
Floating columns showing cumulative effect
Color coding for positive/negative values

When to Use Waterfall Charts:

- Financial statement analysis (revenue to profit)
- Population changes over time
- Project budget tracking
- Inventory level changes
- Customer acquisition and churn analysis

SQL Example for Waterfall Chart Data:

-- Prepare data for waterfall chart: Monthly profit analysis
WITH profit_components AS (
    SELECT 
        'Starting Cash' as component_name,
        starting_balance as value,
        0 as sort_order,
        'Total' as component_type
    FROM financial_summary
    WHERE month = '2023-01-01'
    
    UNION ALL
    
    SELECT 
        'Revenue' as component_name,
        SUM(revenue) as value,
        1 as sort_order,
        'Positive' as component_type
    FROM financial_data
    WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
    
    UNION ALL
    
    SELECT 
        'Cost of Goods Sold' as component_name,
        -SUM(cogs) as value,
        2 as sort_order,
        'Negative' as component_type
    FROM financial_data
    WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
    
    UNION ALL
    
    SELECT 
        'Operating Expenses' as component_name,
        -SUM(operating_expenses) as value,
        3 as sort_order,
        'Negative' as component_type
    FROM financial_data
    WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
    
    UNION ALL
    
    SELECT 
        'Interest Expense' as component_name,
        -SUM(interest_expense) as value,
        4 as sort_order,
        'Negative' as component_type
    FROM financial_data
    WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
    
    UNION ALL
    
    SELECT 
        'Taxes' as component_name,
        -SUM(taxes) as value,
        5 as sort_order,
        'Negative' as component_type
    FROM financial_data
    WHERE month BETWEEN '2023-01-01' AND '2023-12-31'
    
    UNION ALL
    
    SELECT 
        'Net Profit' as component_name,
        (SELECT SUM(revenue) - SUM(cogs) - SUM(operating_expenses) - SUM(interest_expense) - SUM(taxes) 
         FROM financial_data WHERE month BETWEEN '2023-01-01' AND '2023-12-31') as value,
        6 as sort_order,
        'Total' as component_type
),
waterfall_calculations AS (
    SELECT 
        component_name,
        value,
        sort_order,
        component_type,
        -- Calculate cumulative values
        SUM(value) OVER (ORDER BY sort_order ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_value,
        -- Calculate previous cumulative for floating effect
        LAG(SUM(value) OVER (ORDER BY sort_order ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 1) 
        OVER (ORDER BY sort_order) as previous_cumulative,
        -- Generate colors
        CASE 
            WHEN component_type = 'Total' THEN '#2E7D32'  -- Green for totals
            WHEN component_type = 'Positive' THEN '#388E3C'  -- Green for positive
            WHEN component_type = 'Negative' THEN '#C62828'  -- Red for negative
            ELSE '#1565C0'  -- Blue for starting
        END as bar_color,
        -- Format for display
        CASE 
            WHEN component_type = 'Total' THEN 'Total'
            WHEN component_type = 'Positive' THEN '+$' || ROUND(value/1000000, 1) || 'M'
            WHEN component_type = 'Negative' THEN '-$' || ROUND(ABS(value)/1000000, 1) || 'M'
            ELSE '$' || ROUND(value/1000000, 1) || 'M'
        END as formatted_value
    FROM profit_components
)
SELECT 
    component_name as step_label,
    value as step_value,
    previous_cumulative as floating_base,
    cumulative_value as running_total,
    bar_color,
    formatted_value,
    component_type,
    sort_order as display_order
FROM waterfall_calculations
ORDER BY sort_order;

Cascade Chart

Purpose: Show breakdown of total into components with hierarchical relationships

Characteristics:

Total value on left
Component breakdown flowing to right
Shows proportional relationships
Can show multiple levels of breakdown

When to Use Cascade Charts:

- Sales breakdown by region and product
- Budget allocation by department and project
- Customer segmentation analysis
- Website traffic source breakdown
- Product portfolio analysis

SQL Example for Cascade Chart Data:

-- Prepare data for cascade chart: Sales breakdown by hierarchy
WITH sales_hierarchy AS (
    SELECT 
        'Total Sales' as level_1,
        region as level_2,
        product_category as level_3,
        SUM(sales_amount) as sales_value,
        COUNT(*) as transaction_count
    FROM sales_data
    WHERE date >= '2023-01-01' AND date <= '2023-12-31'
    GROUP BY region, product_category
),
cascade_levels AS (
    -- Level 1: Total
    SELECT 
        1 as cascade_level,
        'Total Sales' as category_name,
        NULL as parent_category,
        SUM(sales_value) as category_value,
        100.0 as percentage_of_parent,
        '#1976D2' as level_color
    FROM sales_hierarchy
    
    UNION ALL
    
    -- Level 2: Regions
    SELECT 
        2 as cascade_level,
        region as category_name,
        'Total Sales' as parent_category,
        SUM(sales_value) as category_value,
        SUM(sales_value) * 100.0 / (SELECT SUM(sales_value) FROM sales_hierarchy) as percentage_of_parent,
        '#388E3C' as level_color
    FROM sales_hierarchy
    GROUP BY region
    
    UNION ALL
    
    -- Level 3: Product Categories within Regions
    SELECT 
        3 as cascade_level,
        region || ' - ' || product_category as category_name,
        region as parent_category,
        SUM(sales_value) as category_value,
        SUM(sales_value) * 100.0 / (
            SELECT SUM(sales_value) 
            FROM sales_hierarchy sh2 
            WHERE sh2.region = sales_hierarchy.region
        ) as percentage_of_parent,
        '#F57C00' as level_color
    FROM sales_hierarchy
    GROUP BY region, product_category
),
cascade_positioning AS (
    SELECT 
        cascade_level,
        category_name,
        parent_category,
        category_value,
        percentage_of_parent,
        level_color,
        -- Calculate horizontal position
        (cascade_level - 1) * 200 as x_position,
        -- Calculate vertical position (stacked within level)
        ROW_NUMBER() OVER (PARTITION BY cascade_level ORDER BY category_value DESC) as vertical_rank,
        -- Calculate width proportional to value
        CASE 
            WHEN cascade_level = 1 THEN 150
            WHEN cascade_level = 2 THEN category_value * 100.0 / (SELECT MAX(category_value) FROM cascade_levels WHERE cascade_level = 2) * 120
            ELSE category_value * 100.0 / (SELECT MAX(category_value) FROM cascade_levels WHERE cascade_level = 3) * 100
        END as bar_width,
        -- Format for display
        '$' || ROUND(category_value/1000000, 1) || 'M' as formatted_value,
        ROUND(percentage_of_parent, 1) || '%' as percentage_label
    FROM cascade_levels
)
SELECT 
    category_name as node_label,
    parent_category as connection_from,
    x_position as horizontal_position,
    vertical_rank * 40 as vertical_position,
    bar_width as node_width,
    category_value as node_value,
    percentage_of_parent as relative_size,
    level_color as fill_color,
    formatted_value,
    percentage_label,
    cascade_level as hierarchy_level
FROM cascade_positioning
ORDER BY cascade_level, vertical_rank;

Key Differences

Aspect	Waterfall Chart	Cascade Chart
Purpose	Cumulative effect	Hierarchical breakdown
Flow	Sequential accumulation	Left-to-right breakdown
Structure	Linear progression	Tree-like expansion
Emphasis	Net change	Component relationships
Best For	Financial analysis	Portfolio breakdown

Decision Guidelines

Use Waterfall Chart When:

Showing how starting value becomes ending value
Analyzing financial statements
Tracking cumulative changes
Showing net effect of multiple factors

Use Cascade Chart When:

Breaking down totals into components
Showing hierarchical relationships
Analyzing portfolio composition
Displaying proportional breakdowns

Key Takeaways

Chart Selection: Choose charts based on your data type and story you want to tell
Bar vs Histogram: Bar charts for categorical comparison, histograms for distribution analysis
Scatter vs Line: Scatter plots for relationships, line charts for trends over time
Box vs Violin: Box plots for quick summaries, violin plots for detailed distributions
Advanced Charts: Heat maps for intensity, choropleth for geography, treemaps/sunbursts for hierarchy
Waterfall vs Cascade: Waterfall for cumulative effects, cascade for hierarchical breakdowns
Design Principles: Keep it simple, use appropriate colors, and ensure clarity

Next Steps

In the next lesson, we'll explore dashboard design and data storytelling to create compelling visual narratives.

Data Analyst Fundamentals

01Data Fundamentals and Preparation3 hours

02Statistics, Visualization and Analysis2 hours

03Business Intelligence and Governance2 hours

Data Visualization Fundamentals

Lesson Overview

What You'll Learn:

Key Concepts:

Chart Selection Principles

Fundamental Principles

Chart Selection Framework

Common Visualization Mistakes

Bar Chart vs Histogram

Bar Chart

Histogram

Key Differences

Decision Guidelines

Scatter Plot vs Line Chart

Scatter Plot

Line Chart

Key Differences

Decision Guidelines

Box Plot vs Violin Plot

Box Plot

Violin Plot

Key Differences

Decision Guidelines

Heat Map vs Choropleth Map vs Treemap vs Sunburst

Heat Map

Choropleth Map

Treemap

Sunburst

Decision Guidelines

Waterfall and Cascade Charts

Waterfall Chart

Cascade Chart

Key Differences

Decision Guidelines

Key Takeaways

Next Steps