## Basic functions

- Inner join vs Outer join

```
INNER JOIN        ( table1 ) ∩ ( table2 )   [ overlap only ]
    (  O  )(  O  )
       *****               <-- shaded overlap

LEFT JOIN         ( table1 ) ⊃ ( table2 )   [ all from table1 + overlap ]
    ( OOOOO )(  O  )
    ( OOOOO )  *****       <-- overlap shaded too

RIGHT JOIN        ( table1 ) ⊂ ( table2 )   [ all from table2 + overlap ]
    (  O  )( OOOOO )
      *****  ( OOOOO )     <-- overlap shaded too

FULL OUTER JOIN   ( table1 ) ∪ ( table2 )   [ everything from both ]
    ( OOOOO )( OOOOO )
    ( OOOOO )( OOOOO )
```

- How window functions work

Sample data:

| EmployeeID | Department | Salary |
|------------|------------|--------|
| 1          | HR         | 5000   |
| 2          | HR         | 4000   |
| 3          | IT         | 7000   |
| 4          | IT         | 6000   |
| 5          | IT         | 5000   |

```sql
-- Calculates the average salary per department without grouping the rows (each row still appears).
SELECT
    EmployeeID,
    Department,
    Salary,
    AVG(Salary)  OVER (PARTITION BY Department) AS AvgDeptSalary,
    ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS RankInDept
FROM Employees;
```

Query result:

| EmployeeID | Department | Salary | AvgDeptSalary | RankInDept |
|------------|------------|--------|---------------|------------|
| 1          | HR         | 5000   | 4500          | 1          |
| 2          | HR         | 4000   | 4500          | 2          |
| 3          | IT         | 7000   | 6000          | 1          |
| 4          | IT         | 6000   | 6000          | 2          |
| 5          | IT         | 5000   | 6000          | 3          |

## Partitioning

[Data partitioning recommendations for reliability - Microsoft Azure Well-Architected Framework | Microsoft Learn](https://learn.microsoft.com/en-us/azure/well-architected/design-guides/partition-data#partition-in-azure-sql-database) covers scaling out with elastic pools.

[Partitioned Tables and Indexes - SQL Server, Azure SQL Database, Azure SQL Managed Instance | Microsoft Learn](https://learn.microsoft.com/en-us/sql/relational-databases/partitions/partitioned-tables-and-indexes?view=sql-server-ver17#partitioning-column) covers partitioning table data.
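One detail from the linked partitioning docs worth calling out: `RANGE RIGHT` and `RANGE LEFT` decide which side of a boundary value the boundary itself belongs to. A minimal sketch (the function names `PF_Right` and `PF_Left` are hypothetical):

```sql
-- RANGE RIGHT: the boundary value itself belongs to the partition on its RIGHT.
CREATE PARTITION FUNCTION PF_Right (DATETIME)
AS RANGE RIGHT FOR VALUES ('2025-01-01');

-- RANGE LEFT: the boundary value itself belongs to the partition on its LEFT.
CREATE PARTITION FUNCTION PF_Left (DATETIME)
AS RANGE LEFT FOR VALUES ('2025-01-01');

-- $PARTITION shows which partition number a given value maps to:
SELECT $PARTITION.PF_Right('2025-01-01');  -- 2 (boundary goes right)
SELECT $PARTITION.PF_Left('2025-01-01');   -- 1 (boundary goes left)
```

`RANGE RIGHT` is the usual choice for datetime boundaries, so that midnight of Jan 1 lands in the new year's partition rather than the old one.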
- Logical view (what users see): table or index
- Physical view (what SQL Server stores on disk): heap or B-tree index (clustered / nonclustered)

Heap (table without a clustered index):

```
Table (logical)
└─ Heap (physical)
   ├─ Data pages
   ├─ IAM pages
   └─ Allocation units
```

Table with a clustered index (the most common case):

```
Table (logical)
└─ Clustered index (physical)
   ├─ Root
   ├─ Intermediate
   └─ Leaf level (actual table rows)
```

Partitioning case: the original table has a clustered PK on `Id`, and the strategy is to partition by a non-unique datetime column (created datetime). The first attempt was to add the partitioning column to the PK, but this turned out to require changes in applications as well as in dependent tables. A different approach was used instead: change `Id` to a nonclustered PK and create a separate clustered index on the partitioning column and `Id`. Changing the PK to nonclustered requires no modifications to dependent tables or application mappings, while the separate clustered index that includes the partitioning column allows the table to be physically partitioned and enables reliable partition elimination.
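The logical/physical distinction can be inspected directly: `sys.indexes` reports whether the physical structure is a heap or a B-tree, and `sys.partitions` shows how its rows are spread across partitions. A sketch (here querying a table named `dbo.Log`; any table name works):

```sql
SELECT  i.index_id,
        i.name              AS index_name,
        i.type_desc,        -- HEAP, CLUSTERED, or NONCLUSTERED
        p.partition_number,
        p.rows
FROM sys.indexes AS i
JOIN sys.partitions AS p
  ON  p.object_id = i.object_id
  AND p.index_id  = i.index_id
WHERE i.object_id = OBJECT_ID('dbo.Log')
ORDER BY i.index_id, p.partition_number;
```

A non-partitioned structure shows exactly one row per index with `partition_number = 1`; a partitioned clustered index shows one row per partition.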
[Partitioned Tables and Indexes - SQL Server, Azure SQL Database, Azure SQL Managed Instance | Microsoft Learn](https://learn.microsoft.com/en-us/sql/relational-databases/partitions/partitioned-tables-and-indexes?view=sql-server-ver17#queries)

```sql
-- Parent table: PK is NONCLUSTERED
CREATE TABLE dbo.[Log]
(
    ID          INT      NOT NULL IDENTITY(1,1),
    insert_time DATETIME NOT NULL DEFAULT GETDATE(),
    CONSTRAINT PK_Log_ID PRIMARY KEY NONCLUSTERED (ID)  -- logical key only
);

-- Dependent table with FK
CREATE TABLE dbo.[Log_data1]
(
    ID          INT      NOT NULL IDENTITY(1,1),
    LogID       INT      NOT NULL,
    insert_time DATETIME NOT NULL DEFAULT GETDATE(),
    CONSTRAINT PK_Log_data1_ID PRIMARY KEY CLUSTERED (ID)
);

ALTER TABLE dbo.[Log_data1]
ADD CONSTRAINT FK_Log_data1_Log
FOREIGN KEY (LogID) REFERENCES dbo.[Log](ID);

-- Partition the table
CREATE PARTITION FUNCTION PF_Log_insert_time (DATETIME)
AS RANGE RIGHT FOR VALUES ('2025-01-01', '2026-01-01');

CREATE PARTITION SCHEME PS_Log_insert_time
AS PARTITION PF_Log_insert_time ALL TO ([PRIMARY]);

CREATE CLUSTERED INDEX CIX_Log_insert_time
ON dbo.[Log] (insert_time, ID)
ON PS_Log_insert_time(insert_time);
```

If this is a large table, we can create a new table and backfill the data in batches instead of making the changes directly on the original one.
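The batched backfill could be sketched as follows (the table name `dbo.Log_new`, the batch size, and the cutover steps are illustrative, not a tested migration script):

```sql
-- 1. Create dbo.Log_new with the partitioned layout shown above.
-- 2. Copy historical rows in batches to keep transactions and log growth small.
SET IDENTITY_INSERT dbo.Log_new ON;  -- allow copying existing ID values

DECLARE @BatchSize INT = 50000,
        @Rows      INT = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.Log_new (ID, insert_time)
    SELECT TOP (@BatchSize) s.ID, s.insert_time
    FROM dbo.[Log] AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Log_new AS t WHERE t.ID = s.ID)
    ORDER BY s.ID;

    SET @Rows = @@ROWCOUNT;  -- loop ends when no rows are left to copy
END

SET IDENTITY_INSERT dbo.Log_new OFF;
-- 3. In a short maintenance window: stop writers, copy the remaining tail,
--    swap names with sp_rename, then recreate the FK from Log_data1.
```

The `NOT EXISTS` guard makes each batch idempotent, so the loop can be stopped and resumed safely while the original table stays online.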