This lesson covers descriptive statistics, which help summarize and describe main features of a dataset using measures of central tendency and dispersion.
Measures of central tendency describe the center or typical value of a dataset. They help us understand where the "middle" of our data lies.
Definition: The sum of all values divided by the number of values.
Formula: $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Where:
Calculation Example:
Dataset: [23, 45, 67, 89, 12, 34, 56]
Mean = (23 + 45 + 67 + 89 + 12 + 34 + 56) / 7
Mean = 326 / 7 = 46.57
Properties:
SQL Implementation:
SELECT AVG(salary) as mean_salary
FROM employees;
Definition: The middle value when data is arranged in order. If there's an even number of values, it's the average of the two middle values.
Calculation Steps:
Calculation Example:
Odd number of values:
Dataset: [12, 23, 34, 45, 56, 67, 89]
Sorted: [12, 23, 34, 45, 56, 67, 89]
Position = (7 + 1) / 2 = 4th position
Median = 45
Even number of values:
Dataset: [12, 23, 34, 45, 56, 67]
Sorted: [12, 23, 34, 45, 56, 67]
Median = (3rd + 4th) / 2 = (34 + 45) / 2 = 39.5
Properties:
SQL Implementation:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) as median_salary
FROM employees;
Definition: The value that appears most frequently in a dataset.
Types:
Calculation Example:
Dataset: [23, 45, 67, 45, 12, 34, 45, 56]
Frequency count:
23: 1 time, 45: 3 times, 67: 1 time
12: 1 time, 34: 1 time, 56: 1 time
Mode = 45 (appears most frequently)
Properties:
SQL Implementation:
SELECT salary as mode_salary, COUNT(*) as frequency
FROM employees
GROUP BY salary
ORDER BY COUNT(*) DESC
LIMIT 1;
| Situation | Best Measure | Reason |
|---|---|---|
| Symmetric data | Mean | Uses all data efficiently |
| Skewed data | Median | Not affected by outliers |
| Categorical data | Mode | Only measure for non-numeric data |
| Data with outliers | Median | Resistant to extreme values |
| Normal distribution | Mean | Most efficient estimator |
Measures of dispersion describe how spread out the data values are from the center.
Definition: The difference between the maximum and minimum values.
Formula: Range = Maximum - Minimum
Calculation Example:
Dataset: [12, 23, 34, 45, 56, 67, 89]
Range = 89 - 12 = 77
Properties:
SQL Implementation:
SELECT MAX(salary) - MIN(salary) as salary_range
FROM employees;
Definition: The average of the squared deviations from the mean.
Population Variance Formula: $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Sample Variance Formula: $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Calculation Example:
Dataset: [23, 45, 67, 89, 12, 34, 56]
Mean = 46.57
Deviations from mean:
23-46.57 = -23.57, (-23.57)² = 555.44
45-46.57 = -1.57, (-1.57)² = 2.46
67-46.57 = 20.43, (20.43)² = 417.39
89-46.57 = 42.43, (42.43)² = 1800.30
12-46.57 = -34.57, (-34.57)² = 1195.04
34-46.57 = -12.57, (-12.57)² = 158.00
56-46.57 = 9.43, (9.43)² = 88.92
Sum of squared deviations = 4217.55
Variance = 4217.55 / (7-1) = 702.93
Properties:
SQL Implementation:
SELECT VAR_SAMP(salary) as sample_variance
FROM employees;
Definition: The square root of the variance, representing the typical distance from the mean.
Formula: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Calculation Example:
Using previous variance calculation:
Variance = 702.93
Standard Deviation = √702.93 = 26.52
Interpretation:
Properties:
SQL Implementation:
SELECT STDDEV_SAMP(salary) as sample_stddev
FROM employees;
Definition: The range between the 25th and 75th percentiles.
Formula: IQR = Q3 - Q1
Calculation Steps:
Calculation Example:
Dataset: [12, 23, 34, 45, 56, 67, 89]
Q1 (25th percentile) = 28.5
Q3 (75th percentile) = 61.5
IQR = 61.5 - 28.5 = 33
Properties:
SQL Implementation:
SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) -
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) as iqr
FROM employees;
Outlier: An observation that lies an abnormal distance from other values in a random sample from a population.
Common Detection Methods:
Mean Sensitivity:
Dataset without outlier: [10, 20, 30, 40, 50]
Mean = 30
Dataset with outlier: [10, 20, 30, 40, 500]
Mean = 120 (dramatically changed!)
Median Resistance:
Dataset without outlier: [10, 20, 30, 40, 50]
Median = 30
Dataset with outlier: [10, 20, 30, 40, 500]
Median = 30 (unchanged!)
Standard Deviation Impact:
Dataset without outlier: [10, 20, 30, 40, 50]
Std Dev = 15.81
Dataset with outlier: [10, 20, 30, 40, 500]
Std Dev = 194.35 (massively increased!)
Example 1: Salary Data
Company salaries: [45k, 48k, 52k, 55k, 58k, 62k, 500k]
Mean = 117k (misleading due to CEO salary)
Median = 55k (better representation)
Mode = No mode
Standard deviation = 178k (inflated by outlier)
IQR = 62k - 48k = 14k (more realistic spread)
Example 2: Test Scores
Test scores: [65, 72, 78, 82, 85, 88, 92, 95, 98]
Mean = 83.9
Median = 85
Mode = No mode
Range = 33
Std Dev = 11.2
Interpretation:
- Average score is 83.9
- Half students scored above 85
- Scores typically vary by ±11.2 from average
- No outliers detected
Guidelines for Selection:
Check for outliers first
-- Detect outliers using IQR method
WITH stats AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as q3
FROM your_table
)
SELECT *
FROM your_table, stats
WHERE value < stats.q1 - 1.5 * (stats.q3 - stats.q1)
OR value > stats.q3 + 1.5 * (stats.q3 - stats.q1);
Consider data distribution
Think about your audience
Financial Analysis:
-- Analyzing stock returns
SELECT
AVG(daily_return) as mean_return,
STDDEV(daily_return) as volatility,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY daily_return) as median_return,
MAX(daily_return) - MIN(daily_return) as range
FROM stock_prices
WHERE date >= '2023-01-01';
Quality Control:
-- Manufacturing quality metrics
SELECT
AVG(measurement) as process_mean,
STDDEV(measurement) as process_stddev,
AVG(measurement) - 3*STDDEV(measurement) as lower_control_limit,
AVG(measurement) + 3*STDDEV(measurement) as upper_control_limit
FROM quality_measurements
WHERE production_date = CURRENT_DATE;
Customer Analytics:
-- Customer purchase behavior
SELECT
AVG(purchase_amount) as avg_purchase,
MEDIAN(purchase_amount) as median_purchase,
MODE() WITHIN GROUP (ORDER BY purchase_amount) as most_common,
STDDEV(purchase_amount) as purchase_variability
FROM customer_orders
WHERE order_date >= '2023-01-01';
In the next lesson, we'll explore inferential statistics and hypothesis testing to make predictions about populations based on sample data.