This lesson covers SQL joins and understanding relationships between different data tables for comprehensive analysis.
Before diving into joins, it's essential to understand the different types of relationships that can exist between tables in a relational database.
One-to-One Relationship:
One-to-Many Relationship:
Many-to-Many Relationship:
Let's work with a comprehensive example database:
-- Departments table
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(50) NOT NULL,
location VARCHAR(100)
);
-- Employees table
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
email VARCHAR(100) UNIQUE,
department_id INT,
manager_id INT,
salary DECIMAL(10,2),
hire_date DATE,
FOREIGN KEY (department_id) REFERENCES departments(department_id),
FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
);
-- Projects table
CREATE TABLE projects (
project_id INT PRIMARY KEY,
project_name VARCHAR(100) NOT NULL,
start_date DATE,
end_date DATE,
budget DECIMAL(12,2)
);
-- Employee_Projects junction table (Many-to-Many)
CREATE TABLE employee_projects (
employee_id INT,
project_id INT,
role VARCHAR(50),
hours_worked INT,
PRIMARY KEY (employee_id, project_id),
FOREIGN KEY (employee_id) REFERENCES employees(employee_id),
FOREIGN KEY (project_id) REFERENCES projects(project_id)
);
Sample Data:
Departments:
+---------------+------------------+----------+
| department_id | department_name | location |
+---------------+------------------+----------+
| 1 | IT | Building A|
| 2 | HR | Building B|
| 3 | Sales | Building C|
| 4 | Marketing | Building D|
+---------------+------------------+----------+
Employees:
+-------------+------------+-----------+------------------+---------------+------------+--------+------------+
| employee_id | first_name | last_name | email | department_id | manager_id | salary | hire_date |
+-------------+------------+-----------+------------------+---------------+------------+--------+------------+
| 101 | John | Smith | [email protected] | 1 | NULL | 90000 | 2020-01-15 |
| 102 | Mary | Johnson | [email protected] | 1 | 101 | 75000 | 2020-03-20 |
| 103 | Bob | Brown | [email protected] | 2 | NULL | 70000 | 2019-06-10 |
| 104 | Alice | Davis | [email protected]| 3 | NULL | 80000 | 2021-02-01 |
| 105 | Charlie | Wilson | [email protected]| 1 | 101 | 65000 | 2022-07-15 |
| 106 | Diana | Miller | [email protected]| 3 | 104 | 60000 | 2022-09-01 |
+-------------+------------+-----------+------------------+---------------+------------+--------+------------+
Projects:
+------------+------------------+------------+------------+---------+
| project_id | project_name | start_date | end_date | budget |
+------------+------------------+------------+------------+---------+
| 1 | Website Redesign | 2023-01-01 | 2023-06-30 | 50000 |
| 2 | Mobile App | 2023-03-15 | 2023-12-31 | 120000 |
| 3 | Database Upgrade | 2023-02-01 | 2023-04-30 | 30000 |
+------------+------------------+------------+------------+---------+
Employee_Projects:
+-------------+------------+----------+--------------+
| employee_id | project_id | role | hours_worked |
+-------------+------------+----------+--------------+
| 101 | 1 | Manager | 200 |
| 102 | 1 | Developer| 300 |
| 105 | 1 | Developer| 250 |
| 101 | 2 | Manager | 150 |
| 104 | 2 | Lead | 400 |
| 106 | 2 | Designer | 350 |
| 102 | 3 | Developer| 180 |
| 105 | 3 | Developer| 220 |
+-------------+------------+----------+--------------+
SQL joins are used to combine rows from two or more tables based on related columns between them.
Definition: Returns only the rows that have matching values in both tables.
Syntax:
SELECT columns
FROM table1
INNER JOIN table2 ON table1.column = table2.column;
Visual Representation:
Table A Table B INNER JOIN
─────── ─────── ──────────
●●● ●●● ●●
●●● ●●● ●●
●●● ●●● ●●
Example:
-- Get employees with their departments
SELECT
e.first_name,
e.last_name,
d.department_name,
d.location
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
Result:
+------------+-----------+------------------+----------+
| first_name | last_name | department_name | location |
+------------+-----------+------------------+----------+
| John | Smith | IT | Building A|
| Mary | Johnson | IT | Building A|
| Bob | Brown | HR | Building B|
| Alice | Davis | Sales | Building C|
| Charlie | Wilson | IT | Building A|
| Diana | Miller | Sales | Building C|
+------------+-----------+------------------+----------+
Use Cases:
Definition: Returns all rows from the left table and matched rows from the right table. If no match exists, NULL values are returned for right table columns.
Syntax:
SELECT columns
FROM table1
LEFT JOIN table2 ON table1.column = table2.column;
Visual Representation:
Table A Table B LEFT JOIN
─────── ─────── ─────────
●●● ●●● ●●●
●●● ●●● ●●●
●●● ●●● ●●●
Example:
-- Get all employees and their departments (including employees without departments)
SELECT
e.first_name,
e.last_name,
d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
Result:
+------------+-----------+------------------+
| first_name | last_name | department_name |
+------------+-----------+------------------+
| John | Smith | IT |
| Mary | Johnson | IT |
| Bob | Brown | HR |
| Alice | Davis | Sales |
| Charlie | Wilson | IT |
| Diana | Miller | Sales |
+------------+-----------+------------------+
Use Cases:
Definition: Returns all rows from the right table and matched rows from the left table. If no match exists, NULL values are returned for left table columns.
Syntax:
SELECT columns
FROM table1
RIGHT JOIN table2 ON table1.column = table2.column;
Visual Representation:
Table A Table B RIGHT JOIN
─────── ─────── ──────────
●●● ●●● ●●●
●●● ●●● ●●●
●●● ●●● ●●●
Example:
-- Get all departments and their employees (including departments without employees)
SELECT
e.first_name,
e.last_name,
d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
Result:
+------------+-----------+------------------+
| first_name | last_name | department_name |
+------------+-----------+------------------+
| John | Smith | IT |
| Mary | Johnson | IT |
| Charlie | Wilson | IT |
| Bob | Brown | HR |
| Alice | Davis | Sales |
| Diana | Miller | Sales |
| NULL | NULL | Marketing |
+------------+-----------+------------------+
Use Cases:
Definition: Returns all rows from both tables. When there's no match, NULL values are returned for the missing side.
Syntax:
SELECT columns
FROM table1
FULL OUTER JOIN table2 ON table1.column = table2.column;
Visual Representation:
Table A Table B FULL OUTER JOIN
─────── ─────── ───────────────
●●● ●●● ●●●
●●● ●●● ●●●
●●● ●●● ●●●
Example:
-- Get all employees and all departments, matching where possible
SELECT
e.first_name,
e.last_name,
d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
Result:
+------------+-----------+------------------+
| first_name | last_name | department_name |
+------------+-----------+------------------+
| John | Smith | IT |
| Mary | Johnson | IT |
| Charlie | Wilson | IT |
| Bob | Brown | HR |
| Alice | Davis | Sales |
| Diana | Miller | Sales |
| NULL | NULL | Marketing |
+------------+-----------+------------------+
Use Cases:
Scenario: Employees and Departments
-- Get all departments with employee counts
SELECT
d.department_name,
d.location,
COUNT(e.employee_id) as employee_count
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_id, d.department_name, d.location
ORDER BY employee_count DESC;
Result:
+------------------+----------+----------------+
| department_name | location | employee_count |
+------------------+----------+----------------+
| IT | Building A| 3 |
| Sales | Building C| 2 |
| HR | Building B| 1 |
| Marketing | Building D| 0 |
+------------------+----------+----------------+
Scenario: Employees and Projects
-- Get all employees with their assigned projects
SELECT
e.first_name,
e.last_name,
p.project_name,
ep.role,
ep.hours_worked
FROM employees e
INNER JOIN employee_projects ep ON e.employee_id = ep.employee_id
INNER JOIN projects p ON ep.project_id = p.project_id
ORDER BY e.last_name, p.project_name;
Result:
+------------+-----------+------------------+----------+--------------+
| first_name | last_name | project_name | role | hours_worked |
+------------+-----------+------------------+----------+--------------+
| Alice | Davis | Mobile App | Lead | 400 |
| Bob | Brown | NULL | NULL | NULL |
| Charlie | Wilson | Website Redesign | Developer| 250 |
| Charlie | Wilson | Database Upgrade | Developer| 220 |
| Diana | Miller | Mobile App | Designer | 350 |
| John | Smith | Website Redesign | Manager | 200 |
| John | Smith | Mobile App | Manager | 150 |
| Mary | Johnson | Website Redesign | Developer| 300 |
| Mary | Johnson | Database Upgrade | Developer| 180 |
+------------+-----------+------------------+----------+--------------+
Scenario: Employee-Manager Relationships
-- Get employees with their managers
SELECT
e.first_name || ' ' || e.last_name AS employee_name,
e.salary AS employee_salary,
m.first_name || ' ' || m.last_name AS manager_name,
m.salary AS manager_salary
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id
ORDER BY m.employee_id, e.salary DESC;
Result:
+------------------+----------------+----------------+----------------+
| employee_name | employee_salary| manager_name | manager_salary |
+------------------+----------------+----------------+----------------+
| John Smith | 90000 | NULL | NULL |
| Mary Johnson | 75000 | John Smith | 90000 |
| Charlie Wilson | 65000 | John Smith | 90000 |
| Alice Davis | 80000 | NULL | NULL |
| Diana Miller | 60000 | Alice Davis | 80000 |
| Bob Brown | 70000 | NULL | NULL |
+------------------+----------------+----------------+----------------+
Definition: Groups rows that have the same values in specified columns into summary rows.
Syntax:
SELECT column1, aggregate_function(column2)
FROM table
WHERE condition
GROUP BY column1, column2, ...
ORDER BY column1;
Basic Examples:
Group by Department:
-- Count employees by department
SELECT
d.department_name,
COUNT(e.employee_id) as employee_count,
AVG(e.salary) as avg_salary,
MAX(e.salary) as max_salary,
MIN(e.salary) as min_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
ORDER BY employee_count DESC;
Result:
+------------------+----------------+------------+------------+------------+
| department_name | employee_count | avg_salary | max_salary | min_salary |
+------------------+----------------+------------+------------+------------+
| IT | 3 | 76666.67 | 90000 | 65000 |
| Sales | 2 | 70000.00 | 80000 | 60000 |
| HR | 1 | 70000.00 | 70000 | 70000 |
+------------------+----------------+------------+------------+------------+
Multiple Column Grouping:
-- Group by department and salary ranges
SELECT
d.department_name,
CASE
WHEN e.salary < 70000 THEN 'Low'
WHEN e.salary < 85000 THEN 'Medium'
ELSE 'High'
END as salary_range,
COUNT(*) as count
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name,
CASE
WHEN e.salary < 70000 THEN 'Low'
WHEN e.salary < 85000 THEN 'Medium'
ELSE 'High'
END
ORDER BY d.department_name, salary_range;
Result:
+------------------+--------------+-------+
| department_name | salary_range | count |
+------------------+--------------+-------+
| HR | Medium | 1 |
| IT | Low | 1 |
| IT | Medium | 1 |
| IT | High | 1 |
| Sales | Low | 1 |
| Sales | High | 1 |
+------------------+--------------+-------+
Definition: Filters groups based on aggregate function results, similar to WHERE but for groups.
Syntax:
SELECT column1, aggregate_function(column2)
FROM table
WHERE condition
GROUP BY column1
HAVING aggregate_condition
ORDER BY column1;
HAVING vs WHERE:
HAVING Examples:
Filter Groups by Count:
-- Departments with more than 1 employee
SELECT
d.department_name,
COUNT(e.employee_id) as employee_count,
AVG(e.salary) as avg_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING COUNT(e.employee_id) > 1
ORDER BY employee_count DESC;
Result:
+------------------+----------------+------------+
| department_name | employee_count | avg_salary |
+------------------+----------------+------------+
| IT | 3 | 76666.67 |
| Sales | 2 | 70000.00 |
+------------------+----------------+------------+
Filter Groups by Average:
-- Departments with average salary above 70000
SELECT
d.department_name,
COUNT(e.employee_id) as employee_count,
AVG(e.salary) as avg_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING AVG(e.salary) > 70000
ORDER BY avg_salary DESC;
Result:
+------------------+----------------+------------+
| department_name | employee_count | avg_salary |
+------------------+----------------+------------+
| IT | 3 | 76666.67 |
| Sales | 2 | 70000.00 |
+------------------+----------------+------------+
Complex HAVING Conditions:
-- Departments with specific criteria
SELECT
d.department_name,
COUNT(e.employee_id) as employee_count,
AVG(e.salary) as avg_salary,
SUM(e.salary) as total_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING COUNT(e.employee_id) >= 2
AND AVG(e.salary) > 65000
AND SUM(e.salary) > 100000
ORDER BY avg_salary DESC;
Aggregation functions perform calculations on a set of values and return a single value.
COUNT()
-- Various COUNT examples
SELECT
COUNT(*) as total_employees,
COUNT(department_id) as employees_with_dept,
COUNT(DISTINCT department_id) as unique_departments,
COUNT(DISTINCT manager_id) as unique_managers
FROM employees;
Result:
+-----------------+---------------------+---------------------+-------------------+
| total_employees | employees_with_dept | unique_departments | unique_managers |
+-----------------+---------------------+---------------------+-------------------+
| 6 | 6 | 3 | 2 |
+-----------------+---------------------+---------------------+-------------------+
SUM()
-- Total salary by department
SELECT
d.department_name,
SUM(e.salary) as total_salary,
SUM(e.salary * 0.1) as total_bonus_10_percent
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name;
AVG()
-- Average salary statistics
SELECT
d.department_name,
AVG(e.salary) as avg_salary,
ROUND(AVG(e.salary), 2) as avg_salary_rounded,
AVG(e.salary) / 1000 as avg_salary_thousands
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name;
MIN() and MAX()
-- Min/Max statistics by department
SELECT
d.department_name,
MIN(e.salary) as min_salary,
MAX(e.salary) as max_salary,
MIN(e.hire_date) as earliest_hire,
MAX(e.hire_date) as latest_hire,
MIN(e.last_name) as first_alphabetically
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name;
Multiple Aggregations with Joins:
-- Comprehensive department statistics
SELECT
d.department_name,
d.location,
COUNT(e.employee_id) as employee_count,
COUNT(DISTINCT e.manager_id) as manager_count,
SUM(e.salary) as total_salary_budget,
AVG(e.salary) as avg_salary,
MIN(e.salary) as min_salary,
MAX(e.salary) as max_salary,
MAX(e.salary) - MIN(e.salary) as salary_range,
ROUND(AVG(e.salary), 2) as avg_salary_rounded,
SUM(CASE WHEN e.salary > 75000 THEN 1 ELSE 0 END) as high_earners_count
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_id, d.department_name, d.location
ORDER BY total_salary_budget DESC;
Result:
+------------------+----------+----------------+---------------+------------------+------------+------------+------------+--------------+---------------------+-------------------+
| department_name | location | employee_count | manager_count | total_salary_budget| avg_salary | min_salary | max_salary | salary_range | avg_salary_rounded | high_earners_count |
+------------------+----------+----------------+---------------+------------------+------------+------------+------------+--------------+---------------------+-------------------+
| IT | Building A| 3 | 1 | 230000.00 | 76666.67 | 65000 | 90000 | 25000 | 76666.67 | 1 |
| Sales | Building C| 2 | 1 | 140000.00 | 70000.00 | 60000 | 80000 | 20000 | 70000.00 | 1 |
| HR | Building B| 1 | 0 | 70000.00 | 70000.00 | 70000 | 70000 | 0 | 70000.00 | 0 |
| Marketing | Building D| 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 0 |
+------------------+----------+----------------+---------------+------------------+------------+------------+------------+--------------+---------------------+-------------------+
Aggregation with CASE Statements:
-- Salary distribution analysis
SELECT
d.department_name,
COUNT(*) as total_employees,
SUM(CASE WHEN e.salary < 65000 THEN 1 ELSE 0 END) as low_salary_count,
SUM(CASE WHEN e.salary >= 65000 AND e.salary < 80000 THEN 1 ELSE 0 END) as medium_salary_count,
SUM(CASE WHEN e.salary >= 80000 THEN 1 ELSE 0 END) as high_salary_count,
ROUND(AVG(CASE WHEN e.salary < 65000 THEN e.salary END), 2) as avg_low_salary,
ROUND(AVG(CASE WHEN e.salary >= 65000 AND e.salary < 80000 THEN e.salary END), 2) as avg_medium_salary,
ROUND(AVG(CASE WHEN e.salary >= 80000 THEN e.salary END), 2) as avg_high_salary
FROM departments d
LEFT JOIN employees e ON d.department_id = e.department_id
GROUP BY d.department_id, d.department_name
ORDER BY total_employees DESC;
In the next lesson, we'll explore data warehousing and architecture to understand how to design scalable data systems.