This lesson covers data governance frameworks and importance of tracking data lineage for compliance and quality.
Data Governance is the systematic management of data availability, usability, integrity, and security in an organization. It encompasses the people, processes, and technology required to manage data as a strategic asset.
Core Objectives
Key Components
Strategic Level
Data Governance Council
- Executive oversight
- Policy approval
- Resource allocation
- Risk management
Tactical Level
Data Stewards
- Domain expertise
- Quality standards
- Issue resolution
- Change management
Operational Level
Data Custodians
- Technical implementation
- Access controls
- Backup procedures
- Monitoring
-- Simple data quality check for governance
SELECT
'Data Quality Assessment' as assessment_type,
table_name,
column_name,
data_type,
-- Quality checks
COUNT(*) as total_records,
COUNT(CASE WHEN column_name IS NULL THEN 1 END) as null_count,
ROUND(COUNT(CASE WHEN column_name IS NULL THEN 1 END) * 100.0 / COUNT(*), 2) as null_percentage,
-- Data freshness
MAX(last_updated) as last_updated_date,
CURRENT_DATE - MAX(last_updated) as days_since_update,
-- Quality status
CASE
WHEN COUNT(CASE WHEN column_name IS NULL THEN 1 END) * 100.0 / COUNT(*) > 10 THEN 'Poor Quality'
WHEN COUNT(CASE WHEN column_name IS NULL THEN 1 END) * 100.0 / COUNT(*) > 5 THEN 'Fair Quality'
ELSE 'Good Quality'
END as quality_status
FROM information_schema.columns
JOIN table_metadata ON information_schema.columns.table_name = table_metadata.table_name
WHERE table_schema = 'production'
GROUP BY table_name, column_name, data_type
ORDER BY null_percentage DESC;
Data Classification Policy
Data Retention Policy
Access Control Policy
Naming Conventions
Tables: snake_case (e.g., customer_orders)
Columns: snake_case (e.g., customer_id)
Schemas: descriptive (e.g., analytics, production)
Indexes: idx_table_column (e.g., idx_customers_email)
Data Type Standards
IDs: BIGINT or UUID
Names: VARCHAR(255)
Emails: VARCHAR(320)
Dates: TIMESTAMP
Amounts: DECIMAL(10,2)
Flags: BOOLEAN
Quality Standards
GDPR (General Data Protection Regulation)
CCPA (California Consumer Privacy Act)
HIPAA (Health Insurance Portability and Accountability Act)
-- Simple compliance check for data retention
SELECT
'Compliance Assessment' as assessment_type,
table_name,
data_classification,
retention_period_days,
creation_date,
CURRENT_DATE - creation_date as age_days,
-- Compliance status
CASE
WHEN CURRENT_DATE - creation_date > retention_period_days THEN 'Non-Compliant'
ELSE 'Compliant'
END as compliance_status,
-- Action required
CASE
WHEN CURRENT_DATE - creation_date > retention_period_days THEN 'Archive or Delete'
ELSE 'No Action'
END as action_required
FROM data_inventory
WHERE data_classification IN 'Confidential', 'Restricted'
ORDER BY age_days DESC;
Data Lineage is the process of tracking data from its origin through all transformations to its final destination. It provides visibility into how data moves, changes, and is used across systems.
Lineage Components
Technical Lineage
Business Lineage
Operational Lineage
Manual Documentation
Automated Tools
Hybrid Approach
-- Basic data lineage tracking
WITH data_dependencies AS (
SELECT
source_table,
target_table,
transformation_type,
last_updated,
-- Lineage depth
CASE
WHEN source_table LIKE 'raw_%' THEN 1
WHEN source_table LIKE 'staging_%' THEN 2
WHEN source_table LIKE 'processed_%' THEN 3
ELSE 4
END as lineage_level
FROM data_lineage
),
lineage_path AS (
SELECT
source_table,
target_table,
transformation_type,
lineage_level,
-- Build lineage path
source_table || ' -> ' || target_table as data_flow
FROM data_dependencies
)
SELECT
'Data Lineage' as analysis_type,
source_table,
target_table,
transformation_type,
lineage_level,
data_flow,
-- Data freshness
DATEDIFF(day, last_updated, CURRENT_DATE) as days_since_update,
-- Lineage health
CASE
WHEN DATEDIFF(day, last_updated, CURRENT_DATE) > 30 THEN 'Stale'
WHEN DATEDIFF(day, last_updated, CURRENT_DATE) > 7 THEN 'Warning'
ELSE 'Current'
END as lineage_status
FROM lineage_path
ORDER BY lineage_level, source_table;
Data Trust
Impact Analysis
Problem Resolution
Regulatory Compliance
Transparency
Accountability
Compliance Verification
Quality Assurance
Trust Components
Trust Building Activities
-- Simple audit trail for data changes
SELECT
'Audit Trail' as audit_type,
table_name,
record_id,
operation_type,
old_value,
new_value,
changed_by,
change_timestamp,
-- Audit analysis
CASE
WHEN operation_type = 'DELETE' THEN 'Data Removal'
WHEN operation_type = 'UPDATE' THEN 'Data Modification'
WHEN operation_type = 'INSERT' THEN 'Data Creation'
ELSE 'Unknown'
END as operation_category,
-- Risk assessment
CASE
WHEN table_name IN 'customers', 'orders', 'payments' THEN 'High Risk'
WHEN table_name IN 'products', 'inventory' THEN 'Medium Risk'
ELSE 'Low Risk'
END as risk_level
FROM audit_log
WHERE change_timestamp >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY change_timestamp DESC;
1. Establish Clear Policies
2. Define Roles and Responsibilities
3. Implement Technical Controls
4. Monitor and Enforce
5. Foster Data Culture
In the next lesson, we'll explore data quality and profiling to understand how to measure and improve data quality in governance programs.