This lesson covers ethical considerations, bias detection, and fairness in data analysis and decision-making.
Definition: Selection bias occurs when the sample selected for analysis is not representative of the population intended to be analyzed.
Types of Selection Bias
Examples
Definition: Confirmation bias is the tendency to search for, interpret, and recall information that confirms one's preexisting beliefs or hypotheses.
Manifestations in Data Analysis
Examples
Definition: Sampling bias occurs when some members of the intended population are systematically more likely to be selected in a sample than others.
Common Sampling Biases
Examples
Anchoring Bias
Availability Bias
Observer Bias
-- Detect potential sampling bias in survey data
WITH bias_analysis AS (
SELECT
'Sampling Bias Analysis' as analysis_type,
survey_channel,
COUNT(*) as response_count,
-- Demographic distribution
COUNT(CASE WHEN age_group = '18-24' THEN 1 END) as age_18_24,
COUNT(CASE WHEN age_group = '25-34' THEN 1 END) as age_25_34,
COUNT(CASE WHEN age_group = '35-44' THEN 1 END) as age_35_44,
COUNT(CASE WHEN age_group = '45-54' THEN 1 END) as age_45_54,
COUNT(CASE WHEN age_group = '55+' THEN 1 END) as age_55_plus,
-- Geographic distribution
COUNT(DISTINCT region) as regions_covered,
-- Response rate analysis
AVG(CASE WHEN response_time_minutes <= 5 THEN 1 ELSE 0 END) as quick_response_rate,
-- Bias indicators
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as channel_percentage
FROM survey_responses
GROUP BY survey_channel
UNION ALL
-- Compare with known population distribution
SELECT
'Population Comparison' as analysis_type,
'Target Population' as survey_channel,
10000 as response_count,
2000 as age_18_24,
3000 as age_25_34,
2500 as age_35_44,
1500 as age_45_54,
1000 as age_55_plus,
10 as regions_covered,
0.5 as quick_response_rate,
100.0 as channel_percentage
FROM dual
)
SELECT
analysis_type,
survey_channel,
response_count,
-- Age distribution comparison
ROUND(age_18_24 * 100.0 / response_count, 2) as pct_18_24,
ROUND(age_25_34 * 100.0 / response_count, 2) as pct_25_34,
ROUND(age_35_44 * 100.0 / response_count, 2) as pct_35_44,
ROUND(age_45_54 * 100.0 / response_count, 2) as pct_45_54,
ROUND(age_55_plus * 100.0 / response_count, 2) as pct_55_plus,
-- Bias assessment
CASE
WHEN analysis_type = 'Population Comparison' THEN 'Target Distribution'
WHEN channel_percentage < 10 THEN 'Underrepresented Channel'
WHEN channel_percentage > 50 THEN 'Overrepresented Channel'
ELSE 'Balanced Channel'
END as bias_indicator,
-- Recommendations
CASE
WHEN analysis_type = 'Population Comparison' THEN 'Benchmark'
WHEN regions_covered < 5 THEN 'Expand Geographic Coverage'
WHEN quick_response_rate > 0.8 THEN 'Check for Convenience Bias'
ELSE 'Monitor Distribution'
END as recommendation
FROM bias_analysis
ORDER BY analysis_type, response_count DESC;
Distorted Means and Medians
Incorrect Correlations
Invalid Generalizations
Poor Decision Making
Missed Opportunities
Reputational Damage
-- Analyze how bias affects business metrics
WITH bias_impact AS (
-- Compare biased vs unbiased sample performance
SELECT
'Bias Impact Analysis' as analysis_type,
sample_type,
COUNT(*) as sample_size,
-- Key metrics
AVG(customer_satisfaction) as avg_satisfaction,
AVG(purchase_amount) as avg_purchase,
AVG(customer_lifetime_value) as avg_clv,
-- Conversion rates
COUNT(CASE WHEN converted = 1 THEN 1 END) * 100.0 / COUNT(*) as conversion_rate,
-- Segment distribution
COUNT(CASE WHEN customer_segment = 'Premium' THEN 1 END) * 100.0 / COUNT(*) as premium_pct,
COUNT(CASE WHEN customer_segment = 'Standard' THEN 1 END) * 100.0 / COUNT(*) as standard_pct,
COUNT(CASE WHEN customer_segment = 'Budget' THEN 1 END) * 100.0 / COUNT(*) as budget_pct
FROM customer_analysis
WHERE sample_type IN ('Biased Sample', 'Random Sample')
GROUP BY sample_type
UNION ALL
-- True population metrics
SELECT
'True Population' as analysis_type,
'Full Population' as sample_type,
100000 as sample_size,
4.2 as avg_satisfaction,
150.00 as avg_purchase,
1200.00 as avg_clv,
12.5 as conversion_rate,
15.0 as premium_pct,
45.0 as standard_pct,
40.0 as budget_pct
FROM dual
)
SELECT
analysis_type,
sample_type,
sample_size,
avg_satisfaction,
avg_purchase,
avg_clv,
conversion_rate,
-- Bias impact calculation
ROUND((avg_satisfaction - 4.2) / 4.2 * 100, 2) as satisfaction_bias_pct,
ROUND((avg_purchase - 150.00) / 150.00 * 100, 2) as purchase_bias_pct,
ROUND((conversion_rate - 12.5) / 12.5 * 100, 2) as conversion_bias_pct,
-- Business impact assessment
CASE
WHEN analysis_type = 'True Population' THEN 'Benchmark'
WHEN ABS((avg_satisfaction - 4.2) / 4.2) > 0.1 THEN 'High Impact'
WHEN ABS((avg_satisfaction - 4.2) / 4.2) > 0.05 THEN 'Medium Impact'
ELSE 'Low Impact'
END as business_impact,
-- Recommended actions
CASE
WHEN analysis_type = 'True Population' THEN 'Reference Point'
WHEN premium_pct > 25 THEN 'Over-sampling Premium Segment'
WHEN budget_pct < 30 THEN 'Under-sampling Budget Segment'
ELSE 'Sample Appears Balanced'
END as recommended_action
FROM bias_impact
ORDER BY analysis_type;
Hiring Algorithms
Credit Scoring
Medical Diagnosis
Statistical Tests
Data Visualization
-- Calculate fairness metrics for algorithmic decisions
WITH fairness_metrics AS (
-- Calculate approval rates by demographic group
SELECT
demographic_group,
COUNT(*) as total_applications,
COUNT(CASE WHEN approved = 1 THEN 1 END) as approved_applications,
COUNT(CASE WHEN approved = 0 THEN 1 END) as denied_applications,
-- Approval rate
ROUND(COUNT(CASE WHEN approved = 1 THEN 1 END) * 100.0 / COUNT(*), 2) as approval_rate,
-- False positive/negative rates
COUNT(CASE WHEN approved = 1 AND actual_risk = 'High' THEN 1 END) as false_positives,
COUNT(CASE WHEN approved = 0 AND actual_risk = 'Low' THEN 1 END) as false_negatives
FROM loan_applications
GROUP BY demographic_group
),
disparate_impact AS (
-- Calculate disparate impact ratio
SELECT
demographic_group,
approval_rate,
-- Disparate impact (ratio to highest approval rate)
ROUND(approval_rate / MAX(approval_rate) OVER (), 3) as disparate_impact_ratio,
-- Four-fifths rule compliance
CASE
WHEN approval_rate >= MAX(approval_rate) OVER () * 0.8 THEN 'Compliant'
ELSE 'Non-Compliant'
END as four_fifths_compliance
FROM fairness_metrics
)
SELECT
'Fairness Analysis' as analysis_type,
f.demographic_group,
f.total_applications,
f.approval_rate,
di.disparate_impact_ratio,
di.four_fifths_compliance,
-- Additional fairness metrics
ROUND(f.false_positives * 100.0 / f.total_applications, 2) as false_positive_rate,
ROUND(f.false_negatives * 100.0 / f.total_applications, 2) as false_negative_rate,
-- Fairness assessment
CASE
WHEN di.disparate_impact_ratio < 0.8 THEN 'High Bias Risk'
WHEN di.disparate_impact_ratio < 0.9 THEN 'Medium Bias Risk'
ELSE 'Low Bias Risk'
END as bias_risk_level,
-- Mitigation recommendations
CASE
WHEN di.disparate_impact_ratio < 0.8 THEN 'Immediate Mitigation Required'
WHEN di.disparate_impact_ratio < 0.9 THEN 'Monitor and Investigate'
ELSE 'Maintain Fairness Monitoring'
END as recommended_action
FROM fairness_metrics f
JOIN disparate_impact di ON f.demographic_group = di.demographic_group
ORDER BY di.disparate_impact_ratio ASC;
Data Collection Improvements
Pre-processing Techniques
In-processing Methods
Post-processing Approaches
1. Assessment Phase
2. Mitigation Phase
3. Monitoring Phase
Beneficence
Non-maleficence
Autonomy
Justice
Data Collection Ethics
Data Processing Ethics
Data Sharing Ethics
-- Monitor ethical compliance in data usage
WITH ethical_compliance AS (
-- Data usage compliance check
SELECT
'Data Usage Ethics' as compliance_area,
data_purpose,
COUNT(*) as usage_count,
-- Consent tracking
COUNT(CASE WHEN consent_obtained = 1 THEN 1 END) as with_consent,
COUNT(CASE WHEN consent_obtained = 0 THEN 1 END) as without_consent,
-- Purpose alignment
COUNT(CASE WHEN purpose_aligned = 1 THEN 1 END) as purpose_aligned_count,
-- Data minimization
AVG(data_fields_used) as avg_fields_used,
MAX(total_fields_available) as max_fields_available,
-- Retention compliance
COUNT(CASE WHEN days_retained <= retention_limit THEN 1 END) as within_retention_limit
FROM data_usage_log
GROUP BY data_purpose
UNION ALL
-- Fairness and bias monitoring
SELECT
'Algorithmic Fairness' as compliance_area,
algorithm_name as data_purpose,
COUNT(*) as usage_count,
COUNT(CASE WHEN fairness_score >= 0.8 THEN 1 END) as with_consent,
COUNT(CASE WHEN fairness_score < 0.8 THEN 1 END) as without_consent,
COUNT(CASE WHEN bias_detected = 0 THEN 1 END) as purpose_aligned_count,
AVG(fairness_score) as avg_fields_used,
1.0 as max_fields_available,
COUNT(CASE WHEN last_audit_date >= CURRENT_DATE - INTERVAL '90 days' THEN 1 END) as within_retention_limit
FROM algorithm_audit_log
GROUP BY algorithm_name
)
SELECT
compliance_area,
data_purpose,
usage_count,
-- Compliance metrics
ROUND(with_consent * 100.0 / usage_count, 2) as consent_compliance_pct,
ROUND(purpose_aligned_count * 100.0 / usage_count, 2) as purpose_alignment_pct,
ROUND(within_retention_limit * 100.0 / usage_count, 2) as retention_compliance_pct,
-- Data minimization score
ROUND((1 - (avg_fields_used / max_fields_available)) * 100, 2) as minimization_score,
-- Overall ethical score
ROUND((
(with_consent * 100.0 / usage_count) * 0.3 +
(purpose_aligned_count * 100.0 / usage_count) * 0.3 +
(within_retention_limit * 100.0 / usage_count) * 0.2 +
((1 - (avg_fields_used / max_fields_available)) * 100) * 0.2
), 2) as ethical_compliance_score,
-- Risk assessment
CASE
WHEN (with_consent * 100.0 / usage_count) < 80 THEN 'High Risk'
WHEN (with_consent * 100.0 / usage_count) < 95 THEN 'Medium Risk'
ELSE 'Low Risk'
END as ethical_risk_level,
-- Recommended actions
CASE
WHEN (with_consent * 100.0 / usage_count) < 80 THEN 'Immediate Review Required'
WHEN (with_consent * 100.0 / usage_count) < 95 THEN 'Improve Consent Processes'
WHEN (purpose_aligned_count * 100.0 / usage_count) < 90 THEN 'Review Purpose Alignment'
ELSE 'Maintain Ethical Standards'
END as recommended_action
FROM ethical_compliance
ORDER BY ethical_compliance_score ASC;
1. Identify Ethical Issues
2. Evaluate Options
3. Make Decision
4. Review and Learn
Leadership Commitment
Culture and Training
Governance Structures
In the next lesson, we'll explore reporting and performance analysis to understand how to effectively communicate insights and measure business impact.