Descriptive Statistics | Data Analyst Fundamentals | Skivvy

Descriptive Statistics

Lesson Overview

This lesson covers descriptive statistics, which summarize the main features of a dataset using measures of central tendency and dispersion.

What You'll Learn:

Measures of central tendency (mean, median, mode)
Measures of dispersion (variance, standard deviation, range)
Data distribution and skewness
Percentiles and quartiles
Visual summaries of data

Key Concepts:

Mean: Average value of a dataset
Median: Middle value when data is ordered
Standard Deviation: Measure of data spread
Distribution: Pattern of data values
Percentiles: Values below which a given percentage of the data falls

Measures of Central Tendency

Measures of central tendency describe the center or typical value of a dataset. They help us understand where the "middle" of our data lies.

Mean (Average)

Definition: The sum of all values divided by the number of values.

Formula: $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where:

$\bar{x}$ = mean
$x_i$ = individual values
$n$ = number of values

Calculation Example:

Dataset: [23, 45, 67, 89, 12, 34, 56]
Mean = (23 + 45 + 67 + 89 + 12 + 34 + 56) / 7
Mean = 326 / 7 = 46.57
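
The same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the variable names are chosen here for the example and are not part of the lesson's SQL.

```python
from statistics import fmean

data = [23, 45, 67, 89, 12, 34, 56]

# Mean by the formula: sum of the values divided by the number of values.
manual_mean = sum(data) / len(data)

# The standard library computes the same quantity.
library_mean = fmean(data)

print(round(manual_mean, 2))  # 46.57
```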

Properties:

Uses all data points
Sensitive to outliers
Works best for symmetric distributions
Most commonly used measure

SQL Implementation:

SELECT AVG(salary) as mean_salary
FROM employees;

Median

Definition: The middle value when data is arranged in order. If there's an even number of values, it's the average of the two middle values.

Calculation Steps:

Sort the data in ascending order
Find the middle position
For odd n: position = (n + 1) / 2
For even n: average the values at positions n/2 and (n/2) + 1

Calculation Example:

Odd number of values:
Dataset: [12, 23, 34, 45, 56, 67, 89]
Sorted: [12, 23, 34, 45, 56, 67, 89]
Position = (7 + 1) / 2 = 4th position
Median = 45

Even number of values:
Dataset: [12, 23, 34, 45, 56, 67]
Sorted: [12, 23, 34, 45, 56, 67]
Median = (3rd + 4th) / 2 = (34 + 45) / 2 = 39.5
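
The position rules above can be sketched as a small Python function. The name `median_by_position` is hypothetical, introduced only for this illustration; in practice `statistics.median` does the same job.

```python
from statistics import median


def median_by_position(values):
    """Median via the sort-then-pick-the-middle steps described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are 0-based mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([12, 23, 34, 45, 56, 67, 89]))  # 45
print(median_by_position([12, 23, 34, 45, 56, 67]))      # 39.5
```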

Properties:

Not affected by extreme values
Good for skewed distributions
Represents the 50th percentile
May not use all information in the data

SQL Implementation:

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) as median_salary
FROM employees;

Mode

Definition: The value that appears most frequently in a dataset.

Types:

Unimodal: One mode
Bimodal: Two modes
Multimodal: More than two modes
No mode: Every value appears equally often

Calculation Example:

Dataset: [23, 45, 67, 45, 12, 34, 45, 56]
Frequency count:
23: 1 time, 45: 3 times, 67: 1 time
12: 1 time, 34: 1 time, 56: 1 time
Mode = 45 (appears most frequently)
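
As a rough Python sketch of the frequency count, `collections.Counter` tallies each value and `statistics.multimode` returns every value tied for the highest count, which also covers the bimodal case. The second dataset is made up for illustration.

```python
from collections import Counter
from statistics import multimode

data = [23, 45, 67, 45, 12, 34, 45, 56]

# Frequency count: how many times each value appears.
counts = Counter(data)
top_value, top_count = counts.most_common(1)[0]
print(top_value, top_count)  # 45 3

# multimode returns all values sharing the highest frequency.
modes = multimode(data)
bimodal_modes = multimode([1, 1, 2, 2, 3])  # illustrative bimodal data
print(modes, bimodal_modes)  # [45] [1, 2]
```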

Properties:

Can be used for categorical data
May not exist or may not be unique
Not affected by outliers
Can be misleading for continuous data

SQL Implementation:

SELECT salary as mode_salary, COUNT(*) as frequency
FROM employees
GROUP BY salary
ORDER BY COUNT(*) DESC
LIMIT 1;

Choosing the Right Measure

Situation	Best Measure	Reason
Symmetric data	Mean	Uses all data efficiently
Skewed data	Median	Not affected by outliers
Categorical data	Mode	Only measure defined for unordered (nominal) categories
Data with outliers	Median	Resistant to extreme values
Normal distribution	Mean	Most efficient estimator

Measures of Dispersion

Measures of dispersion describe how spread out the data values are from the center.

Range

Definition: The difference between the maximum and minimum values.

Formula: Range = Maximum - Minimum

Calculation Example:

Dataset: [12, 23, 34, 45, 56, 67, 89]
Range = 89 - 12 = 77

Properties:

Easy to calculate and understand
Highly sensitive to outliers
Only uses two data points
Doesn't consider distribution shape

SQL Implementation:

SELECT MAX(salary) - MIN(salary) as salary_range
FROM employees;

Variance

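Definition: The average of the squared deviations from the mean. The sample formula divides by n - 1 instead of n to correct for estimating the mean from the same data.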

Population Variance Formula: $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample Variance Formula: $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Calculation Example:

Dataset: [23, 45, 67, 89, 12, 34, 56]
Mean = 46.57

Deviations from mean (squares use the unrounded mean, 326/7):
23-46.57 = -23.57, squared = 555.61
45-46.57 = -1.57, squared = 2.47
67-46.57 = 20.43, squared = 417.33
89-46.57 = 42.43, squared = 1800.18
12-46.57 = -34.57, squared = 1195.18
34-46.57 = -12.57, squared = 158.04
56-46.57 = 9.43, squared = 88.90

Sum of squared deviations = 4217.71
Variance = 4217.71 / (7-1) = 702.95
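
The following Python sketch repeats the calculation without intermediate rounding and shows the sample and population formulas side by side. Because it keeps full precision, its result can differ in the last digits from a hand calculation that rounds each deviation first. Variable names are illustrative.

```python
from statistics import fmean, pvariance, variance

data = [23, 45, 67, 89, 12, 34, 56]
mean = fmean(data)

# Squared deviation of every value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]
sum_sq = sum(squared_deviations)

# Sample variance divides by n - 1; population variance divides by n.
sample_var = sum_sq / (len(data) - 1)
population_var = sum_sq / len(data)

# The standard library exposes both versions directly.
print(round(sample_var, 2), round(variance(data), 2))
print(round(population_var, 2), round(pvariance(data), 2))
```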

Properties:

Uses all data points
Measures average squared deviation
Units are squared (not intuitive)
Foundation for standard deviation

SQL Implementation:

SELECT VAR_SAMP(salary) as sample_variance
FROM employees;

Standard Deviation

Definition: The square root of the variance, representing the typical distance from the mean.

Formula: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Calculation Example:

Using previous variance calculation:
Variance = 702.95
Standard Deviation = √702.95 = 26.51
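
A short Python check of the relationship between the two measures, assuming the same seven-value dataset used in the variance example: the sample standard deviation is the square root of the sample variance.

```python
import math
from statistics import stdev, variance

data = [23, 45, 67, 89, 12, 34, 56]

# Sample standard deviation computed two equivalent ways.
sd_from_variance = math.sqrt(variance(data))
sd_direct = stdev(data)

print(round(sd_direct, 2))
```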

Interpretation:

About 68% of values fall within 1 standard deviation of the mean
About 95% fall within 2 standard deviations
About 99.7% fall within 3 standard deviations
These percentages apply to approximately normal distributions; skewed or heavy-tailed data can differ substantially.
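
The percentages quoted for the normal distribution can be reproduced from its cumulative distribution function. This sketch uses `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is made up for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, f"{share_within(k):.1%}")
```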

Properties:

Same units as original data
Most commonly used measure of spread
Sensitive to outliers
Foundation for many statistical tests

SQL Implementation:

SELECT STDDEV_SAMP(salary) as sample_stddev
FROM employees;

Interquartile Range (IQR)

Definition: The range between the 25th and 75th percentiles.

Formula: IQR = Q3 - Q1

Calculation Steps:

Find the 25th percentile (Q1)
Find the 75th percentile (Q3)
Calculate the difference

Calculation Example:

Dataset: [12, 23, 34, 45, 56, 67, 89]
Q1 (25th percentile) = 28.5
Q3 (75th percentile) = 61.5
IQR = 61.5 - 28.5 = 33
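
Quartile values depend on the interpolation convention. As a sketch, Python's `statistics.quantiles` with `method="inclusive"` interpolates linearly between neighboring sorted values and reproduces the numbers in this example; its default `"exclusive"` method follows a different convention and can return different quartiles for small datasets.

```python
from statistics import quantiles

data = [12, 23, 34, 45, 56, 67, 89]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 28.5 45.0 61.5 33.0
```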

Properties:

Resistant to outliers
Measures spread of middle 50% of data
Used in box plots
Good for skewed distributions

SQL Implementation:

SELECT 
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) -
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) as iqr
FROM employees;

Interpretation and Sensitivity to Outliers

Understanding Outliers

Outlier: An observation that lies an abnormal distance from other values in a random sample from a population.

Common Detection Methods:

Z-score: Values with |z| > 3 are commonly flagged as outliers (most reliable for large, roughly normal samples)
IQR Method: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
Visual inspection: Box plots, scatter plots
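
The two numeric rules above might be sketched in Python as follows. The function names, the default thresholds as parameters, and the choice of the "inclusive" quartile method are assumptions made for this illustration. The small sample also shows a known limitation: with only a few values, a single extreme point inflates the standard deviation so much that its own z-score never reaches 3.

```python
from statistics import fmean, quantiles, stdev


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean, sd = fmean(values), stdev(values)
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in values if abs(x - mean) / sd > threshold]


sample = [10, 20, 30, 40, 500]
print(iqr_outliers(sample))     # [500]
print(zscore_outliers(sample))  # []
```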

Impact on Different Measures

Mean Sensitivity:

Dataset without outlier: [10, 20, 30, 40, 50]
Mean = 30

Dataset with outlier: [10, 20, 30, 40, 500]
Mean = 120 (dramatically changed!)

Median Resistance:

Dataset without outlier: [10, 20, 30, 40, 50]
Median = 30

Dataset with outlier: [10, 20, 30, 40, 500]
Median = 30 (unchanged!)

Standard Deviation Impact:

Dataset without outlier: [10, 20, 30, 40, 50]
Std Dev = 15.81

Dataset with outlier: [10, 20, 30, 40, 500]
Std Dev = 212.72 (massively increased!)
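
A small Python comparison of the two datasets, as a sketch: it computes the mean, median, and sample standard deviation for each so the effect of the single extreme value can be seen side by side. The `summarize` helper is hypothetical.

```python
from statistics import fmean, median, stdev

clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]


def summarize(values):
    """Mean, median, and sample standard deviation of a dataset."""
    return {
        "mean": fmean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


for label, values in (("clean", clean), ("with outlier", with_outlier)):
    s = summarize(values)
    print(f"{label}: mean={s['mean']:.2f} median={s['median']} stdev={s['stdev']:.2f}")
```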

Practical Examples

Example 1: Salary Data

Company salaries: [45k, 48k, 52k, 55k, 58k, 62k, 500k]

Mean = 117k (misleading due to CEO salary)
Median = 55k (better representation)
Mode = No mode

Standard deviation ≈ 169k (inflated by outlier)
IQR = 60k - 50k = 10k (more realistic spread; Q1 and Q3 interpolated as in the IQR example above)

Example 2: Test Scores

Test scores: [65, 72, 78, 82, 85, 88, 92, 95, 98]

Mean = 83.9
Median = 85
Mode = No mode
Range = 33
Std Dev = 10.9

Interpretation:
- Average score is 83.9
- About half of the students scored 85 or higher
- Scores typically vary by about ±10.9 from the average
- No outliers detected
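
Summaries like the two examples above can be produced with one helper. This is a sketch under stated assumptions: the `describe` function is hypothetical, it uses the sample standard deviation, and it computes quartiles with the "inclusive" (linear interpolation) method, so results may differ slightly from figures rounded by hand or computed under another quartile convention.

```python
from collections import Counter
from statistics import fmean, median, quantiles, stdev


def describe(values):
    """Central tendency and dispersion measures for a numeric dataset."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    counts = Counter(values)
    top = max(counts.values())
    # Report modes only when at least one value actually repeats.
    modes = [v for v, c in counts.items() if c == top] if top > 1 else []
    return {
        "mean": fmean(values),
        "median": median(values),
        "modes": modes,
        "range": max(values) - min(values),
        "stdev": stdev(values),
        "iqr": q3 - q1,
    }


scores = [65, 72, 78, 82, 85, 88, 92, 95, 98]
salaries_k = [45, 48, 52, 55, 58, 62, 500]  # salaries in thousands

for name, values in (("scores", scores), ("salaries", salaries_k)):
    print(name, {k: round(v, 1) if isinstance(v, float) else v
                 for k, v in describe(values).items()})
```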

Choosing Appropriate Measures

Guidelines for Selection:

Check for outliers first

-- Detect outliers using IQR method
WITH stats AS (
    SELECT 
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as q3
    FROM your_table
)
SELECT t.*
FROM your_table AS t
CROSS JOIN stats  -- stats has exactly one row, so every row sees q1 and q3
WHERE t.value < stats.q1 - 1.5 * (stats.q3 - stats.q1)
   OR t.value > stats.q3 + 1.5 * (stats.q3 - stats.q1);

Consider data distribution

- Symmetric: Use mean and standard deviation
- Skewed: Use median and IQR
- Bimodal: Report both modes

Think about your audience

- Technical audience: Standard deviation
- General audience: Range and median
- Business decisions: Multiple measures

Real-World Applications

Financial Analysis:

-- Analyzing stock returns
SELECT 
    AVG(daily_return) as mean_return,
    STDDEV(daily_return) as volatility,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY daily_return) as median_return,
    MAX(daily_return) - MIN(daily_return) as return_range  -- avoids RANGE, a keyword in several SQL dialects
FROM stock_prices
WHERE date >= '2023-01-01';

Quality Control:

-- Manufacturing quality metrics
SELECT 
    AVG(measurement) as process_mean,
    STDDEV(measurement) as process_stddev,
    AVG(measurement) - 3*STDDEV(measurement) as lower_control_limit,
    AVG(measurement) + 3*STDDEV(measurement) as upper_control_limit
FROM quality_measurements
WHERE production_date = CURRENT_DATE;

Customer Analytics:

-- Customer purchase behavior
SELECT 
    AVG(purchase_amount) as avg_purchase,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY purchase_amount) as median_purchase,
    MODE() WITHIN GROUP (ORDER BY purchase_amount) as most_common,
    STDDEV(purchase_amount) as purchase_variability
FROM customer_orders
WHERE order_date >= '2023-01-01';

Key Takeaways

Central Tendency: Mean, median, and mode each tell different stories about data center
Dispersion Measures: Range, variance, and standard deviation quantify data spread
Outlier Impact: Mean and standard deviation are sensitive; median and IQR are resistant
Context Matters: Choose measures based on data distribution and analysis goals
Multiple Perspectives: Use several measures together for complete understanding
Practical Application: Consider your audience and decision-making needs

Next Steps

In the next lesson, we'll explore inferential statistics and hypothesis testing, which draw conclusions about populations from sample data.
