This lesson introduces data warehousing concepts and modern data architecture patterns for large-scale data analysis.
Understanding the fundamental differences between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) systems is crucial for data architecture design.
Purpose: Handle day-to-day transactional operations
Characteristics:
- High volume of short, simple reads and writes
- Highly normalized schemas to avoid redundancy and update anomalies
- Current, operational data accessed by many concurrent users
- Strong emphasis on ACID guarantees and millisecond response times
Examples:
- E-commerce order processing
- Banking and payment transactions
- Airline and hotel reservation systems
Sample OLTP Schema:
-- Highly normalized structure
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE,
phone VARCHAR(20),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
total_amount DECIMAL(10,2),
status VARCHAR(20)
);
CREATE TABLE order_items (
order_item_id INT PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
product_id INT,
quantity INT,
unit_price DECIMAL(10,2)
);
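The workload against this schema is dominated by short transactions that read or write a handful of rows at a time. A minimal sketch of such a transaction, with illustrative key values:
-- Record a new order and its line items as one short transaction
BEGIN;
INSERT INTO orders (order_id, customer_id, total_amount, status)
VALUES (1001, 42, 59.98, 'pending');
INSERT INTO order_items (order_item_id, order_id, product_id, quantity, unit_price)
VALUES (5001, 1001, 7, 2, 29.99);
COMMIT;
-- Typical point lookup: one customer's most recent orders
SELECT order_id, order_date, total_amount, status
FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC
LIMIT 10;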
Purpose: Support complex analytical queries and business intelligence
Characteristics:
- Complex queries that aggregate large volumes of historical data
- Denormalized schemas (star or snowflake) optimized for reads
- Relatively few concurrent users, typically analysts and BI tools
- Data loaded in periodic batches rather than constant small writes
Examples:
- Sales trend analysis and forecasting
- Customer segmentation and cohort analysis
- Financial and executive reporting
Sample OLAP Schema:
-- Denormalized star schema
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
city VARCHAR(50),
state VARCHAR(50),
country VARCHAR(50),
age_group VARCHAR(20),
income_level VARCHAR(20)
);
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id INT,
product_name VARCHAR(100),
category VARCHAR(50),
subcategory VARCHAR(50),
brand VARCHAR(50),
price_range VARCHAR(20)
);
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE,
year INT,
quarter INT,
month INT,
day_of_week VARCHAR(10),
is_holiday BOOLEAN
);
CREATE TABLE fact_sales (
sales_key BIGSERIAL PRIMARY KEY,  -- auto-generated surrogate key
customer_key INT REFERENCES dim_customer(customer_key),
product_key INT REFERENCES dim_product(product_key),
date_key INT REFERENCES dim_date(date_key),
sales_amount DECIMAL(12,2),
quantity_sold INT,
discount_amount DECIMAL(10,2)
);
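Queries against this star schema look very different from OLTP lookups: they join the fact table to one or more dimensions and aggregate many rows at once. For example, total sales by quarter and product category:
-- Typical OLAP query: aggregate the fact table across dimensions
SELECT
dd.year,
dd.quarter,
dp.category,
SUM(fs.sales_amount) AS total_sales,
SUM(fs.quantity_sold) AS total_units
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dd.year, dd.quarter, dp.category
ORDER BY dd.year, dd.quarter, total_sales DESC;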
| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Transaction processing | Data analysis |
| Data Structure | Normalized | Denormalized |
| Query Type | Simple CRUD | Complex aggregations |
| Response Time | Milliseconds | Seconds/minutes |
| Data Currency | Current | Historical |
| Users | Many end-users | Few analysts |
| Backup Strategy | Frequent, incremental | Periodic, full |
| Indexing | Primary/foreign keys | Bitmap, composite |
A data warehouse is a centralized repository designed specifically for query and analysis rather than transaction processing. It contains historical data from multiple sources, integrated and organized to support decision-making.
Key Characteristics:
- Subject-oriented: organized around business subjects such as customers, products, and sales
- Integrated: data from multiple source systems is cleaned and conformed to consistent definitions
- Time-variant: history is preserved so trends can be analyzed over long periods
- Non-volatile: data is loaded and read, not updated in place by day-to-day operations
Three-Tier Architecture:
┌─────────────────────────────────────────────────────────┐
│ Presentation Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Reports │ │ Dashboards │ │ BI Tools │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────┐
│ Analytics Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ OLAP │ │ Data Marts│ │ Cubes │ │
│ │ Server │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────┐
│ Data Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Staging │ │ Data │ │ Metadata │ │
│ │ Area │ │ Warehouse │ │ Repository │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
1. Source Systems: operational databases, applications, files, and external feeds that produce the raw data
2. ETL/ELT Processes: pipelines that extract data from the sources, transform it, and load it downstream
3. Staging Area: temporary workspace where raw extracts are cleaned, deduplicated, and conformed
4. Data Warehouse Database: the central, integrated store of historical data
5. Data Marts: subject-oriented subsets of the warehouse tailored to a department or team (see the sketch after this list)
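As a rough sketch, a data mart can be as little as a subject-focused view (or summary table) carved out of the star schema defined earlier; the view name here is illustrative:
-- Sales data mart: a narrowed, business-friendly slice of the warehouse
CREATE VIEW sales_mart_monthly AS
SELECT
dd.year,
dd.month,
dp.category,
dc.state,
SUM(fs.sales_amount) AS total_sales,
SUM(fs.quantity_sold) AS total_units
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
JOIN dim_product dp ON fs.product_key = dp.product_key
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
GROUP BY dd.year, dd.month, dp.category, dc.state;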
Star Schema:
┌─────────────┐
│ dim_date │
└─────────────┘
│
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ dim_customer │──────│ fact_sales │──────│ dim_product │
└─────────────┘ └─────────────┘ └─────────────┘
│
┌─────────────┐
│ dim_store │
└─────────────┘
Snowflake Schema:
┌─────────────┐
│ dim_date │
└─────────────┘
│
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ dim_customer │──────│ fact_sales │──────│ dim_product │
└─────────────┘ └─────────────┘ └─────────────┘
│ │
┌─────────────┐ ┌─────────────┐
│ dim_region │ │ dim_category │
└─────────────┘ └─────────────┘
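The structural difference is that snowflake dimensions are further normalized into sub-dimensions. An illustrative sketch (dim_category and dim_product_snowflake are hypothetical tables, not part of the star design above):
-- Snowflaked product dimension: category attributes split into their own table
CREATE TABLE dim_category (
category_key INT PRIMARY KEY,
category VARCHAR(50),
subcategory VARCHAR(50)
);
CREATE TABLE dim_product_snowflake (
product_key INT PRIMARY KEY,
product_id INT,
product_name VARCHAR(100),
category_key INT REFERENCES dim_category(category_key),
brand VARCHAR(50),
price_range VARCHAR(20)
);
-- Queries now need an extra join: fact_sales → dim_product_snowflake → dim_category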
Traditional ETL:
Source Systems → Extract → Transform → Load → Data Warehouse
Modern ELT:
Source Systems → Extract → Load (raw data into the warehouse) → Transform (inside the warehouse)
ETL Pipeline Example:
-- Step 1: Extract from OLTP
SELECT
o.order_id,
o.customer_id,
o.order_date,
oi.product_id,
oi.quantity,
oi.unit_price,
c.first_name,
c.last_name,
c.city,
c.state
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2023-01-01';
-- Step 2: Transform in staging
CREATE TABLE staging_sales AS
SELECT
o.order_id,
o.customer_id,
o.order_date,
oi.product_id,
oi.quantity,
oi.unit_price,
c.first_name,
c.last_name,
c.city,
c.state,
oi.quantity * oi.unit_price as total_amount,
EXTRACT(YEAR FROM o.order_date) as order_year,
EXTRACT(MONTH FROM o.order_date) as order_month
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2023-01-01';
-- Step 3: Load into warehouse
INSERT INTO fact_sales (
customer_key, product_key, date_key,
sales_amount, quantity_sold
)
SELECT
dc.customer_key,
dp.product_key,
dd.date_key,
ss.total_amount,
ss.quantity
FROM staging_sales ss
JOIN dim_customer dc ON ss.customer_id = dc.customer_id
JOIN dim_product dp ON ss.product_id = dp.product_id
JOIN dim_date dd ON ss.order_date::date = dd.full_date;  -- cast timestamp to date to match the date dimension
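The load step assumes the dimension tables are already populated. dim_date in particular is usually pre-generated; a sketch using PostgreSQL's generate_series (date_key encoded as YYYYMMDD, holidays left for a separate update):
-- Pre-populate dim_date for 2023
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day_of_week, is_holiday)
SELECT
CAST(to_char(d, 'YYYYMMDD') AS INT),
d::date,
EXTRACT(YEAR FROM d)::INT,
EXTRACT(QUARTER FROM d)::INT,
EXTRACT(MONTH FROM d)::INT,
trim(to_char(d, 'Day')),
FALSE
FROM generate_series('2023-01-01'::date, '2023-12-31'::date, interval '1 day') AS d;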
Definition: A centralized repository that allows you to store all your structured and unstructured data at any scale.
Characteristics:
- Stores raw data in its native format (structured, semi-structured, and unstructured)
- Schema-on-read: structure is applied when the data is queried, not when it is ingested
- Built on low-cost object or distributed storage that scales horizontally
Use Cases:
- Machine learning and data science experimentation
- Ingesting logs, clickstreams, and IoT data
- Long-term archival of raw source data
Data Lake Architecture:
┌─────────────────────────────────────────────────────────┐
│ Data Lake │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Raw Zone │ │ Processed │ │ Curated │ │
│ │ │ │ Zone │ │ Zone │ │
│ │ • JSON │ │ • Parquet │ │ • Tables │ │
│ │ • CSV │ │ • Avro │ │ • Views │ │
│ │ • Logs │ │ • ORC │ │ • Marts │ │
│ │ • Images │ │ • JSON │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
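Data in the lake is usually queried in place by engines such as Spark, Trino, or Hive rather than loaded into a database first. A rough Spark SQL sketch, assuming JSON event files in the raw zone (table name, path, and columns are illustrative):
-- Expose raw JSON files in the lake as a queryable external table (Spark SQL)
CREATE TABLE raw_events
USING json
LOCATION 's3://my-data-lake/raw/events/';
-- Schema-on-read: structure is applied at query time
SELECT event_type, COUNT(*) AS events
FROM raw_events
GROUP BY event_type;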
Definition: A system designed for business intelligence and analytics, containing structured, processed data.
Characteristics:
- Structured, cleansed data modeled with schema-on-write
- Optimized for fast, repeatable analytical queries
- Strong governance, security, and data quality controls
Use Cases:
- Business intelligence dashboards and standard reporting
- Historical trend analysis
- Regulatory and financial reporting
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data Structure | Raw, unstructured | Processed, structured |
| Schema | Schema-on-Read | Schema-on-Write |
| Data Types | All formats | Structured data |
| Users | Data scientists | Business users |
| Cost | Lower storage cost | Higher processing cost |
| Flexibility | High flexibility | Lower flexibility |
| Performance | Variable | Optimized for queries |
| Governance | Less mature | Strong governance |
The lakehouse architecture combines the low-cost, flexible storage of a data lake with the transactional guarantees and management features of a data warehouse:
┌─────────────────────────────────────────────────────────┐
│ Lakehouse │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Delta │ │ Apache │ │ Iceberg │ │
│ │ Lake │ │ Hudi │ │ Tables │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ ACID │ │ Time │ │ Schema │ │
│ │Transactions │ │ Travel │ │ Evolution │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
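The table formats named above (Delta Lake, Apache Hudi, Apache Iceberg) are what provide these capabilities on top of data lake storage. For instance, Delta Lake exposes time travel directly in SQL; a sketch assuming a Delta table named sales_delta:
-- Query earlier snapshots of a Delta Lake table (time travel)
SELECT * FROM sales_delta VERSION AS OF 3;
SELECT * FROM sales_delta TIMESTAMP AS OF '2023-06-01';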
Definition: A sequence of data points recorded at successive, typically regular, time intervals.
Properties:
- Data arrives in time order and is rarely updated after it is written (append-heavy)
- Very high write volume, often from many sources at once
- Queries are usually range scans and aggregations over time windows
- Older data loses detail value and is often downsampled or expired
Examples:
- Server and application metrics (CPU, memory, latency)
- IoT sensor readings (temperature, humidity, pressure)
- Financial tick data and stock prices
1. Wide Table Format
CREATE TABLE metrics_wide (
timestamp TIMESTAMP PRIMARY KEY,
cpu_usage DECIMAL(5,2),
memory_usage DECIMAL(5,2),
disk_io DECIMAL(10,2),
network_in DECIMAL(12,2),
network_out DECIMAL(12,2)
);
2. Narrow Table Format
CREATE TABLE metrics_narrow (
timestamp TIMESTAMP,
metric_name VARCHAR(50),
metric_value DECIMAL(15,6),
tags JSONB,
PRIMARY KEY (timestamp, metric_name)
);
3. Partitioned Storage
CREATE TABLE metrics_partitioned (
timestamp TIMESTAMP,
metric_name VARCHAR(50),
metric_value DECIMAL(15,6),
tags JSONB
) PARTITION BY RANGE (timestamp);
-- Monthly partitions
CREATE TABLE metrics_2023_01 PARTITION OF metrics_partitioned
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
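To make the trade-off concrete, the same observation is stored as one multi-column row in the wide layout and as several rows keyed by metric name in the narrow layout (values are illustrative):
-- Wide format: one row per timestamp, one column per metric
INSERT INTO metrics_wide (timestamp, cpu_usage, memory_usage, disk_io, network_in, network_out)
VALUES ('2023-01-01 00:00:00', 42.50, 63.10, 120.40, 1024.00, 512.00);
-- Narrow format: one row per (timestamp, metric); new metrics need no DDL changes
INSERT INTO metrics_narrow (timestamp, metric_name, metric_value, tags)
VALUES
('2023-01-01 00:00:00', 'cpu_usage', 42.50, '{"host": "web-01"}'),
('2023-01-01 00:00:00', 'memory_usage', 63.10, '{"host": "web-01"}');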
Specialized TSDBs:
- InfluxDB
- Prometheus
- OpenTSDB / Graphite
Traditional Databases with TS Extensions:
- PostgreSQL with TimescaleDB
- PostgreSQL with native range partitioning (as shown below)
1. Range Queries
-- Get metrics for specific time range
SELECT timestamp, metric_value
FROM metrics_narrow
WHERE metric_name = 'cpu_usage'
AND timestamp BETWEEN '2023-01-01 00:00:00'
AND '2023-01-01 23:59:59'
ORDER BY timestamp;
2. Aggregation Queries
-- Hourly averages
SELECT
date_trunc('hour', timestamp) as hour,
AVG(metric_value) as avg_value,
MIN(metric_value) as min_value,
MAX(metric_value) as max_value,
COUNT(*) as sample_count
FROM metrics_narrow
WHERE metric_name = 'cpu_usage'
AND timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY date_trunc('hour', timestamp)
ORDER BY hour;
3. Downsampling
-- Create hourly aggregates from minute data
CREATE MATERIALIZED VIEW hourly_metrics AS
SELECT
date_trunc('hour', timestamp) as hour,
metric_name,
AVG(metric_value) as avg_value,
MIN(metric_value) as min_value,
MAX(metric_value) as max_value,
COUNT(*) as sample_count
FROM metrics_narrow
GROUP BY date_trunc('hour', timestamp), metric_name;
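PostgreSQL does not keep materialized views up to date automatically, so the rollup has to be refreshed on a schedule (for example from cron or pg_cron):
-- Recompute the hourly rollup periodically
REFRESH MATERIALIZED VIEW hourly_metrics;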
4. Time Window Functions
-- Moving averages
SELECT
timestamp,
metric_value,
AVG(metric_value) OVER (
ORDER BY timestamp
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
) as moving_avg_6_points
FROM metrics_narrow
WHERE metric_name = 'cpu_usage'
AND timestamp >= NOW() - INTERVAL '1 hour'
ORDER BY timestamp;
1. Single Metric Table
CREATE TABLE time_series (
id BIGSERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL,
metric_name VARCHAR(100) NOT NULL,
value DOUBLE PRECISION NOT NULL,
tags JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_time_series_timestamp ON time_series(timestamp);
CREATE INDEX idx_time_series_metric ON time_series(metric_name);
CREATE INDEX idx_time_series_tags ON time_series USING GIN(tags);
2. Multiple Metric Tables
CREATE TABLE cpu_metrics (
timestamp TIMESTAMP PRIMARY KEY,
host_id INTEGER,
cpu_usage DECIMAL(5,2),
load_average DECIMAL(5,2)
);
CREATE TABLE memory_metrics (
timestamp TIMESTAMP PRIMARY KEY,
host_id INTEGER,
total_memory BIGINT,
used_memory BIGINT,
free_memory BIGINT
);
3. Hybrid Approach
-- Metadata table
CREATE TABLE metric_definitions (
metric_id SERIAL PRIMARY KEY,
metric_name VARCHAR(100) UNIQUE,
data_type VARCHAR(20),
unit VARCHAR(20),
description TEXT
);
-- Data table with foreign key
CREATE TABLE metric_values (
timestamp TIMESTAMP NOT NULL,
metric_id INTEGER REFERENCES metric_definitions(metric_id),
value DOUBLE PRECISION NOT NULL,
tags JSONB,
PRIMARY KEY (timestamp, metric_id)
);
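Reads against the hybrid layout join the values back to their definitions, which keeps metric names, units, and descriptions in a single place:
-- Query the hybrid layout, resolving names and units via the definitions table
SELECT
mv.timestamp,
md.metric_name,
mv.value,
md.unit
FROM metric_values mv
JOIN metric_definitions md ON mv.metric_id = md.metric_id
WHERE md.metric_name = 'cpu_usage'
AND mv.timestamp >= NOW() - INTERVAL '1 hour'
ORDER BY mv.timestamp;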
1. Partitioning Strategies
-- Time-based partitioning
CREATE TABLE metrics (
timestamp TIMESTAMP,
metric_name VARCHAR(50),
metric_value DECIMAL(15,6)
) PARTITION BY RANGE (timestamp);
-- Automatic partition creation
CREATE OR REPLACE FUNCTION create_monthly_partition(table_name text, start_date date)
RETURNS void AS $$
DECLARE
partition_name text;
end_date date;
BEGIN
partition_name := table_name || '_' || to_char(start_date, 'YYYY_MM');
end_date := start_date + interval '1 month';
EXECUTE format('CREATE TABLE %I PARTITION OF %I
FOR VALUES FROM (%L) TO (%L)',
partition_name, table_name, start_date, end_date);
END;
$$ LANGUAGE plpgsql;
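Calling the function creates a single month's partition; in practice it is invoked ahead of time for upcoming months, often from a scheduled job:
-- Create the January 2023 partition of the metrics table defined above
SELECT create_monthly_partition('metrics', DATE '2023-01-01');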
2. Compression
-- Columnar storage for compression
-- (assumes a columnar table access method such as the Citus/Hydra columnar extension;
--  plain PostgreSQL heap tables do not support USING COLUMNAR)
CREATE TABLE compressed_metrics (
timestamp TIMESTAMP,
metric_name VARCHAR(50),
metric_value DECIMAL(15,6)
) USING COLUMNAR;
-- Compression settings (exact option names depend on the extension and version)
ALTER TABLE compressed_metrics SET (columnar.compression_level = 5);
3. Retention Policies
-- Automatic data cleanup
CREATE OR REPLACE FUNCTION cleanup_old_metrics()
RETURNS void AS $$
BEGIN
DELETE FROM metrics_narrow
WHERE timestamp < NOW() - INTERVAL '1 year';
-- assumes a separate hourly rollup table (metrics_hourly) holds downsampled data
DELETE FROM metrics_hourly
WHERE timestamp < NOW() - INTERVAL '5 years';
END;
$$ LANGUAGE plpgsql;
-- Schedule the cleanup nightly at 02:00 (requires the pg_cron extension)
SELECT cron.schedule('cleanup-metrics', '0 2 * * *', 'SELECT cleanup_old_metrics();');
IoT Sensor Data Pipeline:
-- Raw sensor data table
CREATE TABLE sensor_readings (
reading_id BIGSERIAL PRIMARY KEY,
sensor_id INTEGER NOT NULL,
timestamp TIMESTAMP NOT NULL,
temperature DECIMAL(5,2),
humidity DECIMAL(5,2),
pressure DECIMAL(7,2),
battery_level DECIMAL(3,2),
location JSONB,
created_at TIMESTAMP DEFAULT NOW()
) PARTITION BY RANGE (timestamp);
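Because sensor_readings is range-partitioned, at least one partition must exist before rows can be inserted; partitions are typically created per month (or via a helper like create_monthly_partition shown earlier):
-- Monthly partition for January 2023 sensor data
CREATE TABLE sensor_readings_2023_01 PARTITION OF sensor_readings
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');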
-- Hourly aggregates
CREATE MATERIALIZED VIEW hourly_sensor_stats AS
SELECT
date_trunc('hour', timestamp) as hour,
sensor_id,
AVG(temperature) as avg_temp,
MIN(temperature) as min_temp,
MAX(temperature) as max_temp,
AVG(humidity) as avg_humidity,
COUNT(*) as reading_count
FROM sensor_readings
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY date_trunc('hour', timestamp), sensor_id;
-- Daily aggregates for long-term storage
CREATE MATERIALIZED VIEW daily_sensor_stats AS
SELECT
date_trunc('day', timestamp) as day,
sensor_id,
AVG(temperature) as avg_temp,
MIN(temperature) as min_temp,
MAX(temperature) as max_temp,
AVG(humidity) as avg_humidity,
COUNT(*) as reading_count
FROM sensor_readings
GROUP BY date_trunc('day', timestamp), sensor_id;
In the next module, we'll explore statistics, visualization, and analysis techniques to derive insights from well-structured data.