YouTube video ID required

Joins and Data Relationships

Name: Data Analyst Fundamentals
Availability: InStock

Lesson Overview

This lesson covers SQL joins and understanding relationships between different data tables for comprehensive analysis.

What You'll Learn:

Types of SQL joins (INNER, LEFT, RIGHT, FULL OUTER)
Understanding primary and foreign keys
Working with related data tables
Complex join operations
Performance considerations for joins

Key Concepts:

SQL Joins: Methods to combine rows from multiple tables
Primary Key: Unique identifier for records in a table
Foreign Key: Key used to link to another table
Data Relationships: Connections between different data entities

Understanding Data Relationships

Before diving into joins, it's essential to understand the different types of relationships that can exist between tables in a relational database.

Types of Table Relationships

One-to-One Relationship:

Each record in Table A relates to exactly one record in Table B
Example: Employees ↔ Employee Details (one employee has one detailed profile)
Implementation: Primary key in one table is foreign key in the other

One-to-Many Relationship:

Each record in Table A relates to multiple records in Table B
Example: Departments ↔ Employees (one department has many employees)
Implementation: Primary key of "one" table is foreign key in "many" table

Many-to-Many Relationship:

Records in Table A relate to multiple records in Table B and vice versa
Example: Students ↔ Courses (students take multiple courses, courses have multiple students)
Implementation: Requires a junction/bridge table

Sample Database Schema

Let's work with a comprehensive example database:

-- Departments table
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(50) NOT NULL,
    location VARCHAR(100)
);

-- Employees table
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE,
    department_id INT,
    manager_id INT,
    salary DECIMAL(10,2),
    hire_date DATE,
    FOREIGN KEY (department_id) REFERENCES departments(department_id),
    FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
);

-- Projects table
CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    project_name VARCHAR(100) NOT NULL,
    start_date DATE,
    end_date DATE,
    budget DECIMAL(12,2)
);

-- Employee_Projects junction table (Many-to-Many)
CREATE TABLE employee_projects (
    employee_id INT,
    project_id INT,
    role VARCHAR(50),
    hours_worked INT,
    PRIMARY KEY (employee_id, project_id),
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id),
    FOREIGN KEY (project_id) REFERENCES projects(project_id)
);

Sample Data:

Departments:
+---------------+------------------+----------+
| department_id | department_name  | location |
+---------------+------------------+----------+
| 1             | IT               | Building A|
| 2             | HR               | Building B|
| 3             | Sales            | Building C|
| 4             | Marketing        | Building D|
+---------------+------------------+----------+

Employees:
+-------------+------------+-----------+------------------+---------------+------------+--------+------------+
| employee_id | first_name | last_name | email            | department_id | manager_id | salary | hire_date  |
+-------------+------------+-----------+------------------+---------------+------------+--------+------------+
| 101         | John       | Smith     | [email protected] | 1             | NULL       | 90000  | 2020-01-15 |
| 102         | Mary       | Johnson   | [email protected] | 1             | 101        | 75000  | 2020-03-20 |
| 103         | Bob        | Brown     | [email protected]  | 2             | NULL       | 70000  | 2019-06-10 |
| 104         | Alice      | Davis     | [email protected]| 3             | NULL       | 80000  | 2021-02-01 |
| 105         | Charlie    | Wilson    | [email protected]| 1           | 101        | 65000  | 2022-07-15 |
| 106         | Diana      | Miller    | [email protected]| 3             | 104        | 60000  | 2022-09-01 |
+-------------+------------+-----------+------------------+---------------+------------+--------+------------+

Projects:
+------------+------------------+------------+------------+---------+
| project_id | project_name     | start_date | end_date   | budget  |
+------------+------------------+------------+------------+---------+
| 1          | Website Redesign | 2023-01-01 | 2023-06-30 | 50000   |
| 2          | Mobile App       | 2023-03-15 | 2023-12-31 | 120000  |
| 3          | Database Upgrade | 2023-02-01 | 2023-04-30 | 30000   |
+------------+------------------+------------+------------+---------+

Employee_Projects:
+-------------+------------+----------+--------------+
| employee_id | project_id | role     | hours_worked |
+-------------+------------+----------+--------------+
| 101         | 1          | Manager  | 200          |
| 102         | 1          | Developer| 300          |
| 105         | 1          | Developer| 250          |
| 101         | 2          | Manager  | 150          |
| 104         | 2          | Lead     | 400          |
| 106         | 2          | Designer | 350          |
| 102         | 3          | Developer| 180          |
| 105         | 3          | Developer| 220          |
+-------------+------------+----------+--------------+

JOIN Types: INNER, LEFT, RIGHT, FULL

SQL joins are used to combine rows from two or more tables based on related columns between them.

INNER JOIN

Definition: Returns only the rows that have matching values in both tables.

Syntax:

SELECT columns
FROM table1
INNER JOIN table2 ON table1.column = table2.column;

Visual Representation:

Table A       Table B        INNER JOIN
───────       ───────        ──────────
  ●●●           ●●●              ●●
  ●●●           ●●●              ●●
  ●●●           ●●●              ●●

Example:

-- Get employees with their departments
SELECT 
    e.first_name,
    e.last_name,
    d.department_name,
    d.location
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

Result:

+------------+-----------+------------------+----------+
| first_name | last_name | department_name  | location |
+------------+-----------+------------------+----------+
| John       | Smith     | IT               | Building A|
| Mary       | Johnson   | IT               | Building A|
| Bob        | Brown     | HR               | Building B|
| Alice      | Davis     | Sales            | Building C|
| Charlie    | Wilson    | IT               | Building A|
| Diana      | Miller    | Sales            | Building C|
+------------+-----------+------------------+----------+

Use Cases:

When you only need records that exist in both tables
Finding matching records between datasets
Ensuring data completeness

LEFT JOIN (LEFT OUTER JOIN)

Definition: Returns all rows from the left table and matched rows from the right table. If no match exists, NULL values are returned for right table columns.

Syntax:

SELECT columns
FROM table1
LEFT JOIN table2 ON table1.column = table2.column;

Visual Representation:

Table A       Table B        LEFT JOIN
───────       ───────        ─────────
  ●●●           ●●●              ●●●
  ●●●           ●●●              ●●●
  ●●●           ●●●              ●●●

Example:

-- Get all employees and their departments (including employees without departments)
SELECT 
    e.first_name,
    e.last_name,
    d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;

Result:

+------------+-----------+------------------+
| first_name | last_name | department_name  |
+------------+-----------+------------------+
| John       | Smith     | IT               |
| Mary       | Johnson   | IT               |
| Bob        | Brown     | HR               |
| Alice      | Davis     | Sales            |
| Charlie    | Wilson    | IT               |
| Diana      | Miller    | Sales            |
+------------+-----------+------------------+

Use Cases:

When you want all records from the primary table regardless of matches
Finding records in one table that don't have matches in another
Reporting where completeness of primary table is important

RIGHT JOIN (RIGHT OUTER JOIN)

Definition: Returns all rows from the right table and matched rows from the left table. If no match exists, NULL values are returned for left table columns.

Syntax:

SELECT columns
FROM table1
RIGHT JOIN table2 ON table1.column = table2.column;

Visual Representation:

Table A       Table B        RIGHT JOIN
───────       ───────        ──────────
  ●●●           ●●●              ●●●
  ●●●           ●●●              ●●●
  ●●●           ●●●              ●●●

Example:

-- Get all departments and their employees (including departments without employees)
SELECT 
    e.first_name,
    e.last_name,
    d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;

Result:

+------------+-----------+------------------+
| first_name | last_name | department_name  |
+------------+-----------+------------------+
| John       | Smith     | IT               |
| Mary       | Johnson   | IT               |
| Charlie    | Wilson    | IT               |
| Bob        | Brown     | HR               |
| Alice      | Davis     | Sales            |
| Diana      | Miller    | Sales            |
| NULL       | NULL      | Marketing        |
+------------+-----------+------------------+

Use Cases:

When you want all records from the secondary table regardless of matches
Finding orphaned records in the right table
Less commonly used than LEFT JOIN (can be rewritten as LEFT JOIN)

FULL OUTER JOIN

Definition: Returns all rows from both tables. When there's no match, NULL values are returned for the missing side.

Syntax:

SELECT columns
FROM table1
FULL OUTER JOIN table2 ON table1.column = table2.column;

Visual Representation:

Table A       Table B        FULL OUTER JOIN
───────       ───────        ───────────────
  ●●●           ●●●              ●●●
  ●●●           ●●●              ●●●
  ●●●           ●●●              ●●●

Example:

-- Get all employees and all departments, matching where possible
SELECT 
    e.first_name,
    e.last_name,
    d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;

Result:

+------------+-----------+------------------+
| first_name | last_name | department_name  |
+------------+-----------+------------------+
| John       | Smith     | IT               |
| Mary       | Johnson   | IT               |
| Charlie    | Wilson    | IT               |
| Bob        | Brown     | HR               |
| Alice      | Davis     | Sales            |
| Diana      | Miller    | Sales            |
| NULL       | NULL      | Marketing        |
+------------+-----------+------------------+

Use Cases:

When you need complete data from both tables
Finding all records regardless of matches
Data reconciliation and auditing

Examples of Table Relationships

One-to-Many Relationship Example

Scenario: Employees and Departments

-- Get all departments with employee counts
SELECT 
    d.department_name,
    d.location,
    COUNT(e.employee_id) as employee_count
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_id, d.department_name, d.location
ORDER BY employee_count DESC;

Result:

+------------------+----------+----------------+
| department_name  | location | employee_count |
+------------------+----------+----------------+
| IT               | Building A| 3              |
| Sales            | Building C| 2              |
| HR               | Building B| 1              |
| Marketing        | Building D| 0              |
+------------------+----------+----------------+

Many-to-Many Relationship Example

Scenario: Employees and Projects

-- Get all employees with their assigned projects
SELECT 
    e.first_name,
    e.last_name,
    p.project_name,
    ep.role,
    ep.hours_worked
FROM employees e
INNER JOIN employee_projects ep ON e.employee_id = ep.employee_id
INNER JOIN projects p ON ep.project_id = p.project_id
ORDER BY e.last_name, p.project_name;

Result:

+------------+-----------+------------------+----------+--------------+
| first_name | last_name | project_name     | role     | hours_worked |
+------------+-----------+------------------+----------+--------------+
| Alice      | Davis     | Mobile App       | Lead     | 400          |
| Bob        | Brown     | NULL             | NULL     | NULL         |
| Charlie    | Wilson    | Website Redesign | Developer| 250          |
| Charlie    | Wilson    | Database Upgrade | Developer| 220          |
| Diana      | Miller    | Mobile App       | Designer | 350          |
| John       | Smith     | Website Redesign | Manager  | 200          |
| John       | Smith     | Mobile App       | Manager  | 150          |
| Mary       | Johnson   | Website Redesign | Developer| 300          |
| Mary       | Johnson   | Database Upgrade | Developer| 180          |
+------------+-----------+------------------+----------+--------------+

Self-Join Example

Scenario: Employee-Manager Relationships

-- Get employees with their managers
SELECT 
    e.first_name || ' ' || e.last_name AS employee_name,
    e.salary AS employee_salary,
    m.first_name || ' ' || m.last_name AS manager_name,
    m.salary AS manager_salary
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id
ORDER BY m.employee_id, e.salary DESC;

Result:

+------------------+----------------+----------------+----------------+
| employee_name    | employee_salary| manager_name   | manager_salary |
+------------------+----------------+----------------+----------------+
| John Smith       | 90000          | NULL           | NULL           |
| Mary Johnson     | 75000          | John Smith     | 90000          |
| Charlie Wilson   | 65000          | John Smith     | 90000          |
| Alice Davis      | 80000          | NULL           | NULL           |
| Diana Miller     | 60000          | Alice Davis    | 80000          |
| Bob Brown        | 70000          | NULL           | NULL           |
+------------------+----------------+----------------+----------------+

GROUP BY and HAVING Clauses

GROUP BY Clause

Definition: Groups rows that have the same values in specified columns into summary rows.

Syntax:

SELECT column1, aggregate_function(column2)
FROM table
WHERE condition
GROUP BY column1, column2, ...
ORDER BY column1;

Basic Examples:

Group by Department:

-- Count employees by department
SELECT 
    d.department_name,
    COUNT(e.employee_id) as employee_count,
    AVG(e.salary) as avg_salary,
    MAX(e.salary) as max_salary,
    MIN(e.salary) as min_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
ORDER BY employee_count DESC;

Result:

+------------------+----------------+------------+------------+------------+
| department_name  | employee_count | avg_salary | max_salary | min_salary |
+------------------+----------------+------------+------------+------------+
| IT               | 3              | 76666.67   | 90000      | 65000      |
| Sales            | 2              | 70000.00   | 80000      | 60000      |
| HR               | 1              | 70000.00   | 70000      | 70000      |
+------------------+----------------+------------+------------+------------+

Multiple Column Grouping:

-- Group by department and salary ranges
SELECT 
    d.department_name,
    CASE 
        WHEN e.salary < 70000 THEN 'Low'
        WHEN e.salary < 85000 THEN 'Medium'
        ELSE 'High'
    END as salary_range,
    COUNT(*) as count
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name,
    CASE 
        WHEN e.salary < 70000 THEN 'Low'
        WHEN e.salary < 85000 THEN 'Medium'
        ELSE 'High'
    END
ORDER BY d.department_name, salary_range;

Result:

+------------------+--------------+-------+
| department_name  | salary_range | count |
+------------------+--------------+-------+
| HR               | Medium       | 1     |
| IT               | Low          | 1     |
| IT               | Medium       | 1     |
| IT               | High         | 1     |
| Sales            | Low          | 1     |
| Sales            | High         | 1     |
+------------------+--------------+-------+

HAVING Clause

Definition: Filters groups based on aggregate function results, similar to WHERE but for groups.

Syntax:

SELECT column1, aggregate_function(column2)
FROM table
WHERE condition
GROUP BY column1
HAVING aggregate_condition
ORDER BY column1;

HAVING vs WHERE:

WHERE: Filters individual rows before grouping
HAVING: Filters groups after aggregation

HAVING Examples:

Filter Groups by Count:

-- Departments with more than 1 employee
SELECT 
    d.department_name,
    COUNT(e.employee_id) as employee_count,
    AVG(e.salary) as avg_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING COUNT(e.employee_id) > 1
ORDER BY employee_count DESC;

Result:

+------------------+----------------+------------+
| department_name  | employee_count | avg_salary |
+------------------+----------------+------------+
| IT               | 3              | 76666.67   |
| Sales            | 2              | 70000.00   |
+------------------+----------------+------------+

Filter Groups by Average:

-- Departments with average salary above 70000
SELECT 
    d.department_name,
    COUNT(e.employee_id) as employee_count,
    AVG(e.salary) as avg_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING AVG(e.salary) > 70000
ORDER BY avg_salary DESC;

Result:

+------------------+----------------+------------+
| department_name  | employee_count | avg_salary |
+------------------+----------------+------------+
| IT               | 3              | 76666.67   |
| Sales            | 2              | 70000.00   |
+------------------+----------------+------------+

Complex HAVING Conditions:

-- Departments with specific criteria
SELECT 
    d.department_name,
    COUNT(e.employee_id) as employee_count,
    AVG(e.salary) as avg_salary,
    SUM(e.salary) as total_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING COUNT(e.employee_id) >= 2 
   AND AVG(e.salary) > 65000
   AND SUM(e.salary) > 100000
ORDER BY avg_salary DESC;

Aggregation Functions

Aggregation functions perform calculations on a set of values and return a single value.

Common Aggregation Functions

COUNT()

COUNT(*): Counts all rows including NULLs
COUNT(column): Counts non-NULL values in column
COUNT(DISTINCT column): Counts unique non-NULL values

-- Various COUNT examples
SELECT 
    COUNT(*) as total_employees,
    COUNT(department_id) as employees_with_dept,
    COUNT(DISTINCT department_id) as unique_departments,
    COUNT(DISTINCT manager_id) as unique_managers
FROM employees;

Result:

+-----------------+---------------------+---------------------+-------------------+
| total_employees | employees_with_dept  | unique_departments  | unique_managers    |
+-----------------+---------------------+---------------------+-------------------+
| 6               | 6                   | 3                   | 2                 |
+-----------------+---------------------+---------------------+-------------------+

SUM()

Calculates the sum of numeric values
Ignores NULL values

-- Total salary by department
SELECT 
    d.department_name,
    SUM(e.salary) as total_salary,
    SUM(e.salary * 0.1) as total_bonus_10_percent
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name;

AVG()

Calculates the average of numeric values
Ignores NULL values

-- Average salary statistics
SELECT 
    d.department_name,
    AVG(e.salary) as avg_salary,
    ROUND(AVG(e.salary), 2) as avg_salary_rounded,
    AVG(e.salary) / 1000 as avg_salary_thousands
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name;

MIN() and MAX()

Find minimum and maximum values
Work with numbers, dates, and strings

-- Min/Max statistics by department
SELECT 
    d.department_name,
    MIN(e.salary) as min_salary,
    MAX(e.salary) as max_salary,
    MIN(e.hire_date) as earliest_hire,
    MAX(e.hire_date) as latest_hire,
    MIN(e.last_name) as first_alphabetically
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name;

Advanced Aggregation Examples

Multiple Aggregations with Joins:

-- Comprehensive department statistics
SELECT 
    d.department_name,
    d.location,
    COUNT(e.employee_id) as employee_count,
    COUNT(DISTINCT e.manager_id) as manager_count,
    SUM(e.salary) as total_salary_budget,
    AVG(e.salary) as avg_salary,
    MIN(e.salary) as min_salary,
    MAX(e.salary) as max_salary,
    MAX(e.salary) - MIN(e.salary) as salary_range,
    ROUND(AVG(e.salary), 2) as avg_salary_rounded,
    SUM(CASE WHEN e.salary > 75000 THEN 1 ELSE 0 END) as high_earners_count
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_id, d.department_name, d.location
ORDER BY total_salary_budget DESC;

Result:

+------------------+----------+----------------+---------------+------------------+------------+------------+------------+--------------+---------------------+-------------------+
| department_name  | location | employee_count | manager_count | total_salary_budget| avg_salary | min_salary | max_salary | salary_range | avg_salary_rounded  | high_earners_count |
+------------------+----------+----------------+---------------+------------------+------------+------------+------------+--------------+---------------------+-------------------+
| IT               | Building A| 3              | 1             | 230000.00        | 76666.67   | 65000      | 90000      | 25000        | 76666.67           | 1                 |
| Sales            | Building C| 2              | 1             | 140000.00        | 70000.00   | 60000      | 80000      | 20000        | 70000.00           | 1                 |
| HR               | Building B| 1              | 0             | 70000.00         | 70000.00   | 70000      | 70000      | 0            | 70000.00           | 0                 |
| Marketing        | Building D| 0              | 0             | NULL             | NULL       | NULL       | NULL       | NULL         | NULL                | 0                 |
+------------------+----------+----------------+---------------+------------------+------------+------------+------------+--------------+---------------------+-------------------+

Aggregation with CASE Statements:

-- Salary distribution analysis
SELECT 
    d.department_name,
    COUNT(*) as total_employees,
    SUM(CASE WHEN e.salary < 65000 THEN 1 ELSE 0 END) as low_salary_count,
    SUM(CASE WHEN e.salary >= 65000 AND e.salary < 80000 THEN 1 ELSE 0 END) as medium_salary_count,
    SUM(CASE WHEN e.salary >= 80000 THEN 1 ELSE 0 END) as high_salary_count,
    ROUND(AVG(CASE WHEN e.salary < 65000 THEN e.salary END), 2) as avg_low_salary,
    ROUND(AVG(CASE WHEN e.salary >= 65000 AND e.salary < 80000 THEN e.salary END), 2) as avg_medium_salary,
    ROUND(AVG(CASE WHEN e.salary >= 80000 THEN e.salary END), 2) as avg_high_salary
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_id, d.department_name
ORDER BY total_employees DESC;

Key Takeaways

Join Types: Master INNER, LEFT, RIGHT, and FULL OUTER joins for different data combination needs
Relationships: Understand one-to-one, one-to-many, and many-to-many relationships
GROUP BY: Use to aggregate data and create summary statistics
HAVING vs WHERE: Filter groups with HAVING, filter rows with WHERE
Aggregation Functions: Master COUNT, SUM, AVG, MIN, MAX for data analysis
Performance: Optimize joins with proper indexing and filtering strategies

Next Steps

In the next lesson, we'll explore data warehousing and architecture to understand how to design scalable data systems.

Data Analyst Fundamentals

01Data Fundamentals and Preparation3 hours

02Statistics, Visualization and Analysis2 hours

03Business Intelligence and Governance2 hours

Joins and Data Relationships

Lesson Overview

What You'll Learn:

Key Concepts:

Understanding Data Relationships

Types of Table Relationships

Sample Database Schema

JOIN Types: INNER, LEFT, RIGHT, FULL

INNER JOIN

LEFT JOIN (LEFT OUTER JOIN)

RIGHT JOIN (RIGHT OUTER JOIN)

FULL OUTER JOIN

Examples of Table Relationships

One-to-Many Relationship Example

Many-to-Many Relationship Example

Self-Join Example

GROUP BY and HAVING Clauses

GROUP BY Clause

HAVING Clause

Aggregation Functions

Common Aggregation Functions

Advanced Aggregation Examples

Key Takeaways

Next Steps