SQL & RDBMs

The following is based on PostgreSQL.

PostgreSQL Links

Basic psql Commands

Operation                                                         psql Command
------------------------------------------------------------------------------
Show databases                                                    \l
Switch to, connect to, a database                                 \c <database>
Show tables in a database                                         \dt
Describe a table in a database                                    \d+ <table>
Toggle the printing of column names and result row count footers  \t

The tables in a database can also be listed with a query:

SELECT * FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
AND schemaname != 'information_schema';

Distinct

The DISTINCT clause returns the unique values for a given column. It can be applied to multiple columns to return the unique combinations of the specified columns.
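For example, using the sales table shown later in these notes:

```sql
-- Unique customers that appear in the sales table
SELECT DISTINCT customer_id FROM sales;

-- Unique (customer, product) combinations
SELECT DISTINCT customer_id, products_id FROM sales;
```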

UNIONS

Enables the merging of results from multiple SELECT statements into a single result set. Each SELECT must return the same number of columns with compatible data types; the column names in the result come from the first SELECT statement.

UNION differs from UNION ALL in that the former will deduplicate the results whereas the latter will return duplicate results.
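A minimal sketch; the archived_sales table is hypothetical and assumed to have the same layout as sales:

```sql
-- UNION deduplicates the combined result set
SELECT customer_id FROM sales
UNION
SELECT customer_id FROM archived_sales;

-- UNION ALL keeps duplicates (and skips the deduplication pass)
SELECT customer_id FROM sales
UNION ALL
SELECT customer_id FROM archived_sales;
```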

JOINS

At their core, joins enable the combination of data from multiple tables via a common key, or keys, shared between the data sets. The common key columns are called the join predicate, or join condition; it is the ON a.id = b.id part of the query, and it provides a logical way to relate the records in the two tables.

There are multiple types of joins; the two main categories are OUTER and INNER joins.

The OUTER JOIN will return every record from the table on the side specified. For example, a LEFT OUTER JOIN will return all records that match the query for the table on the left (the table in the FROM clause); for rows that have no match in the right table (the table in the JOIN clause), the right-hand columns in the result will be NULL.

The reverse is true for a RIGHT OUTER JOIN.

The INNER JOIN will ONLY return records where there is a match in both tables based on the join predicate.

OUTER JOINs are used to answer the “what is missing from this data set?” type of question. For example, given a table of products and a table of sales that includes product ids, you can use an OUTER join to find the set of product ids that have not yet been sold.
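A sketch of the “unsold products” example, assuming the products table described later in these notes (id, name, description):

```sql
-- Products with no matching sales row; the right-hand columns are NULL
-- where no sale exists, so filtering on NULL finds the missing products
SELECT p.id, p.name
FROM products p
LEFT OUTER JOIN sales s
    ON p.id = s.products_id
WHERE s.id IS NULL;
```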

Difference between a full join and an inner join

A full join will return records from both tables, with NULL columns on the left or right side where matches do not exist. An inner join will only return records where there is a match in both tables. The full join enables us to find where records are missing in either table. A full join is also known as a FULL OUTER JOIN. MySQL does not support it, but it can be emulated by performing a UNION on a LEFT JOIN and a RIGHT JOIN.
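A sketch of the MySQL emulation, using two placeholder tables a and b:

```sql
-- FULL OUTER JOIN emulated as the UNION of both one-sided outer joins;
-- UNION also deduplicates the rows that matched on both sides
SELECT a.id, b.id
FROM a LEFT JOIN b ON a.id = b.id
UNION
SELECT a.id, b.id
FROM a RIGHT JOIN b ON a.id = b.id;
```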

Self JOINS

There are times when we need to treat a given table as if it is two separate tables. The classic example is the “write a query that lists the managers for each employee in the employees table” question.

To get this answer you need to write a JOIN query where the left table is the employees table under some alias, and the right table is the same employees table under a different alias. Using an alias for each of the “two” tables is a requirement because we are referring to the same table twice.
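Using the employees table from the examples below, the self join might look like:

```sql
-- The same employees table appears twice under different aliases:
-- e is the employee, m is the row that e.reports_to points at
SELECT
    concat(e.first_name, ' ', e.last_name) AS employee,
    concat(m.first_name, ' ', m.last_name) AS manager
FROM employees e
JOIN employees m
    ON e.reports_to = m.id;
```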

Just as an aside, I have seen many people who come from traditional RDBMS implementations (Oracle, DB2, etc.) attempt to write self JOINs in Hive that do not work. In a big data system like Hive, a self join only works if you write the query with a sub-select that generates a second table, which you can then use to iterate over all of the values in your “left” table.

Keys

A key is a set of columns in a table that enables a row to be uniquely identified. Some implementations allow foreign and unique keys to be NULL. A primary key can never be NULL.

Primary, unique, and foreign keys can all consist of the combination (composite) of more than one column.

  • Primary: a column (or columns) whose value uniquely identifies each record in the table. It is a special key because it is directly tied to the table in which the record resides. It cannot be NULL.
  • Unique: ensures that a value is unique across all other records in the same table
  • Foreign: used to reference unique columns in other tables; provides a logical relationship between the values/records of two or more tables.

What is a Natural Key?
A key made up of columns that already have a logical, real-world relationship to the record, and that does not require an additional column to be added to make rows unique (an “artificial”, or surrogate, key).

Referential Integrity

A concept directly related to foreign keys. Referential integrity means that all key relationships must be maintained such that the removal or update of rows in tables that share the key leaves only valid references. As an example, consider two tables: a products table that contains an id (its primary key), a name, and a description; and a prices table that has an id (its primary key), a products_id field which references products.id as a foreign key, and a price field. There are three rules:

  1. You cannot add a new record to the prices table unless the foreign key points to a valid record in the products table.
  2. If you delete a record in the products table, the referencing record (or records) in the prices table must also be deleted, or the delete must be rejected.
  3. If the primary key for a record in the products table changes, all references to it in the prices table must also change.
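In PostgreSQL these rules can be enforced declaratively. A sketch of the products/prices example; the column types are assumptions:

```sql
CREATE TABLE products (
    id          serial PRIMARY KEY,
    name        text NOT NULL,
    description text
);

CREATE TABLE prices (
    id          serial PRIMARY KEY,
    -- Rule 1: an INSERT fails unless products_id references a valid row.
    -- Rule 2: ON DELETE CASCADE deletes dependent price rows automatically.
    -- Rule 3: ON UPDATE CASCADE propagates primary key changes.
    products_id integer NOT NULL REFERENCES products (id)
                    ON DELETE CASCADE
                    ON UPDATE CASCADE,
    price       integer NOT NULL
);
```

Without the CASCADE clauses PostgreSQL defaults to NO ACTION, which rejects the delete or update instead of propagating it.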

Constraints

A constraint is used to specify a rule for the data in a table. Specifically, it limits the type, and/or values, that can be stored in a given column or columns. It can be added when the table is created, or the table can be altered later to add the constraint(s).

Common types of constraints:

  • Primary Key: uniquely identifies a row in a table
  • NOT NULL: prevents a column value from being NULL
  • FOREIGN KEY: designates a relationship between a row and a row in another table
  • DEFAULT: automatically assigns a default value for a column if one is not provided in the SQL insert statement when creating a new row
  • CHECK: ensures that a column value meets certain criteria before being inserted or updated
  • UNIQUE: ensures that a unique value is inserted in the given column and that no other row in the table can have a duplicate value; email address, SSN, etc.
  • INDEX: enables faster retrieval of records
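Most of these can be combined in a single CREATE TABLE statement. The table below is a hypothetical sketch, not one of the example tables used elsewhere in these notes:

```sql
CREATE TABLE members (
    id         serial PRIMARY KEY,              -- primary key
    email      text NOT NULL UNIQUE,            -- not null + unique
    first_name text NOT NULL,                   -- not null
    age        integer CHECK (age >= 18),       -- check
    status     text NOT NULL DEFAULT 'active'   -- default
);
```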

Cursors

A cursor enables you to encapsulate a query so that you can process its rows one at a time. It is typically used when you want to process a large dataset that might otherwise exhaust memory. In my experience I have never used them, as I am always iterating over a result set in some middleware programming language rather than writing code within the database itself. Following is a link that describes SQL problems requiring cursors. I would argue that this kind of SQL should be avoided so that you can decouple your SQL implementation from your business logic.
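A sketch of explicit cursor use in psql; in PostgreSQL a cursor must be declared inside a transaction block:

```sql
BEGIN;

DECLARE sales_cur CURSOR FOR
    SELECT id, price FROM sales ORDER BY id;

FETCH 2 FROM sales_cur;   -- first two rows
FETCH 2 FROM sales_cur;   -- next two rows

CLOSE sales_cur;
COMMIT;
```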

HAVING, GROUP BY and WHERE Clauses

The GROUP BY clause enables us to group the results of an aggregate clause by the distinct values in another field.

The WHERE clause enables us to specify criteria that specific field values must match to be included in the query result.

The HAVING clause enables us to do with the results of aggregate functions (AVG, COUNT, MIN, MAX, SUM) what we would otherwise expect to do with a WHERE clause. One other difference with the HAVING clause is that we cannot refer to the aggregate’s output column by the alias defined in the query; we must repeat the aggregate call for the field in question and then apply the filtering condition to it.

For example, given the following table and data

company=# \d+ sales
                                                          Table "public.sales"
    Column    |  Type   | Collation | Nullable |              Default              | Storage | Compression | Stats target | Description 
--------------+---------+-----------+----------+-----------------------------------+---------+-------------+--------------+-------------
 id           | integer |           | not null | nextval('sales_id_seq'::regclass) | plain   |             |              | 
 customer_id  | integer |           | not null |                                   | plain   |             |              | 
 employees_id | integer |           | not null |                                   | plain   |             |              | 
 sale_date    | date    |           | not null |                                   | plain   |             |              | 
 products_id  | integer |           | not null |                                   | plain   |             |              | 
 price        | integer |           | not null |                                   | plain   |             |              | 
Indexes:
    "sales_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sales_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customers(id)
    "sales_employees_id_fkey" FOREIGN KEY (employees_id) REFERENCES employees(id)
    "sales_products_id_fkey" FOREIGN KEY (products_id) REFERENCES products(id)
Access method: heap

company=# select * from sales;
 id | customer_id | employees_id | sale_date  | products_id |  price  
----+-------------+--------------+------------+-------------+---------
  1 |           1 |            1 | 2014-04-01 |           1 |    2500
  2 |           1 |            6 | 2014-03-01 |           1 |    2550
  3 |           2 |            6 | 2014-03-03 |           2 |  256700
  4 |           2 |            6 | 2014-04-03 |           3 | 1256700
  5 |           4 |            1 | 2014-04-10 |           1 |    2506
  6 |           2 |            7 | 2014-04-11 |           4 |   12383
  7 |           4 |            8 | 2014-05-10 |           1 |    8230
(7 rows)

In order to see the total sales grouped by the sales person (employees_id) for those employees whose total sales exceed 10000, you would write the following query

SELECT
    SUM(s.price) as total_sale,
    s.employees_id
FROM
    sales s
GROUP BY
    s.employees_id
HAVING
    SUM(s.price) > 10000
;

 total_sale | employees_id 
------------+--------------
    1515950 |            6
      12383 |            7
(2 rows)

You would not be able to write the HAVING clause as HAVING total_sale > 10000; you cannot refer to the “total_sale” output column by its alias in the HAVING clause, but must repeat the aggregate function call for which you are defining the filtering criteria.

Indexes

In any database-like system, the more records you can exclude from a query, the faster and more efficient the query will be. Indexes allow the database to locate the rows matching a given set of keys without scanning the entire table, so the query executes against a subset of the data.

The index data is typically stored in a B-tree. The nodes in the tree store the values for the specific columns against which the index is built and a pointer to the corresponding row in the table. Hash tables can also be used to store index data, but a disadvantage of that implementation is that the data in a hash table is not sorted. Bitmap indexes work well with columns that hold boolean values. The index data needs to be updated on inserts, updates, and deletes.

Selectivity and Cardinality

Cardinality, in its relation to an RDBMS, is the number of distinct values in a given column.

Selectivity is a measurement of how well a given column used for an index will narrow down the amount of input data in a given query. It is simply the ratio of the cardinality to the total number of records in the table; the higher the number, the better.

selectivity of index = cardinality/number of records

The classic example is a table of people and an index on the gender column. Since the possible genders (for simplicity’s sake) are only male and female, the cardinality is 2. Assuming that we have 100,000 records in the table:

selectivity = 2/100000 = 0.00002

If we have an additional column in the table for the U.S. state in which the person lives, that would be:

selectivity = 50/100000 = 0.0005

As we can see, the selectivity of the state column is much higher than that of gender, so it is a better index for filtering out data from an input query, as long as we use state as part of the query.

Hash Indexes

Advantages:

  • Fast reads, O(1)
  • Sequential writes to the append-only log are fast
  • Immutability of persisted data enables a simpler concurrency and crash recovery model

Disadvantages:

  • Deleting entries is expensive
  • Rolling/merging logs is expensive
  • The entire hash table needs to fit in RAM

SSTables

Typically used in NoSQL database implementations. Ideal for write-heavy workloads.

Advantages

  • Fast writes as they are append only
  • Smaller disk footprint

Disadvantages

  • Slow reads because we need to read from disk
  • Merging needs to happen in the background. Because the segments are sorted, merging can use the final pass of mergesort, O(n).

B-Tree and B+ Tree Index Implementations

In a database index, the value associated with each leaf node references the actual data record in the database. The key and value together are called a payload.

Window Functions

A window function performs a calculation over a set of rows that are related to the current row. Unlike an aggregate with GROUP BY, the related rows are not collapsed into a single output row.

https://www.postgresql.org/docs/9.2/tutorial-window.html
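Beyond RANK, the same OVER clause works with ordinary aggregates. For example, a running total of sale prices per employee against the sales table shown earlier:

```sql
-- Running total of sale prices per employee, in sale date order
SELECT
    employees_id,
    sale_date,
    price,
    SUM(price) OVER (
        PARTITION BY employees_id
        ORDER BY sale_date
    ) AS running_total
FROM sales;
```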

RANK

The RANK() function provides the ability to generate query results ranked by the values in the specified columns in the result set. The primary use case is to generate top-N and bottom-N results. Best explained by an example. If we have a table with the following structure and data

company=# select * from employees;
 id | first_name |  last_name  | departments_id | salary | reports_to 
----+------------+-------------+----------------+--------+------------
  1 | Jack       | Ofalltrades |              1 |   1400 |           
  2 | John       | Doe         |              2 |   1450 |           
  3 | Alan       | Ginsberg    |              3 |   1150 |           
  4 | Ram        | Somebody    |                |    600 |           
  5 | Sales      | Dude        |              1 |   1450 |          1
  6 | Erkle      | Jones       |              1 |   1960 |          1
  7 | Neeraj     | Bahri       |              1 |   2460 |          1
  8 | Jing       | Chu         |              1 |   1665 |          1
  9 | Ernie      | Borgnine    |              2 |   1000 |          2
 10 | Frank      | LaPoint     |              3 |    840 |          3
 11 | Debbie     | Grey        |              3 |    935 |          3
(11 rows)

If we want to get all of the employees ordered by their salary, we can use the RANK function.

SELECT
    concat(e.first_name, ' ', e.last_name),
    e.salary,
    RANK () OVER (
        ORDER BY e.salary
        DESC
    ) as rank
FROM employees e
;

      concat      | salary | rank 
------------------+--------+------
 Neeraj Bahri     |   2460 |    1
 Erkle Jones      |   1960 |    2
 Jing Chu         |   1665 |    3
 John Doe         |   1450 |    4
 Sales Dude       |   1450 |    4
 Jack Ofalltrades |   1400 |    6
 Alan Ginsberg    |   1150 |    7
 Ernie Borgnine   |   1000 |    8
 Debbie Grey      |    935 |    9
 Frank LaPoint    |    840 |   10
 Ram Somebody     |    600 |   11
(11 rows)

Notice that there are two number 4 ranks. This occurs when there is a tie between the values on which the rank order is being generated. Because there are two number 4 ranks, rank 5 is skipped and the numbering continues with 6. If there were three number 4 ranks, ranks 5 and 6 would be skipped and the numbering would continue with 7.

https://www.postgresqltutorial.com/postgresql-percent_rank-function/

PARTITION

We can specify additional criteria in the OVER clause to group the results by the value of a field in the result set, creating a separate set of ranks per group. To see the top salaries partitioned by department, the query would be.

SELECT
    es.employee_name,
    es.department,
    es.salary,
    RANK() OVER(PARTITION BY es.department ORDER BY es.salary DESC)
FROM
    (
    SELECT
        concat(e.first_name, ' ', e.last_name) AS employee_name,
        e.salary,
        d.name as department
    FROM employees e
    JOIN departments d
        ON e.departments_id = d.id
    ) as es
;

  employee_name   | department | salary | rank 
------------------+------------+--------+------
 Alan Ginsberg    | Accounting |   1150 |    1
 Debbie Grey      | Accounting |    935 |    2
 Frank LaPoint    | Accounting |    840 |    3
 John Doe         | Financing  |   1450 |    1
 Ernie Borgnine   | Financing  |   1000 |    2
 Neeraj Bahri     | Sales      |   2460 |    1
 Erkle Jones      | Sales      |   1960 |    2
 Jing Chu         | Sales      |   1665 |    3
 Sales Dude       | Sales      |   1450 |    4
 Jack Ofalltrades | Sales      |   1400 |    5
(10 rows)

Practice Queries

Given the structure of the tables as defined in the db-init file in the git repo, write the following queries. Try reading the heading for each practice query first, and if you get stuck, read the walk-through text. There are implementations of all of the queries below in the aforementioned repo.

List the first name and last name of each employee and the name of the departments to which they belong

employees-and-their-departements.sql

This type of query requires a JOIN between the employee and department table.

Bonus points for concatenating the first and last name into a new column

To accomplish this, use the CONCAT function with the employee.first_name and employee.last_name columns when querying from the employee table.

List the total number of employees per department, output results with the department name in one column and the number of employees in another

employees-per-department.sql

This is another query that requires a JOIN. It will also require using the COUNT aggregate function and a GROUP BY to tell the query engine by which “key” to group the results.
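One possible shape for the query, assuming the employee/department table and column names used in the other walk-throughs here (the version in the repo may differ):

```sql
SELECT
    d.name AS department,
    COUNT(e.id) AS employee_count
FROM employee e
JOIN department d
    ON e.department_id = d.id
GROUP BY
    d.name;
```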

List the departments with three or more people

departments-with-three-or-more-employees.sql

This query requires using the HAVING clause because we must filter values that are the result of an aggregate function (the count of the employees per department). In order to resolve the department name we must also join the employee and department tables. Bonus points for keeping in mind that there might be employees that are not assigned to any department.

List the departments without any employees

departments-without-employees.sql

This query requires a LEFT OUTER JOIN between the department table, on the left, and the employee table on the right, selecting the department.name field and the employee.department_id field. Then a WHERE clause that filters for all rows where the employee.department_id field is NULL. This works because the join clause will return all records from the table on the left and will include a NULL value for the column in the right-hand table where there is no matching row.

Write a query that lists the managers, first name and last name AS “manager” (single column output) for all of the employees who are managers (do NOT have a reports_to value in their record).

managers.sql

This query requires analyzing the existing data and then writing a query where you are filtering for rows that have a NULL value.

List the employees who are managers and the count of the number of people each manage

managers-number-of-reports.sql

Here we bump up the complexity a bit. This type of query requires a JOIN against a derived table/sub-select query in order to return the correct results.

Let’s start with the inner-most sub-select: the data source for the outer query, where we will get the count of the number of subordinates for each manager. This query returns two columns, the id of the manager and the number of subordinates.

SELECT
    e.reports_to,
    count(e.reports_to) AS report_count
FROM
    employee e
WHERE
    e.reports_to IS NOT NULL
GROUP BY
    e.reports_to;

We now have a derived table with the id for each of the managers and the number of subordinates.

We then use the derived table as the right-table in a JOIN against the employee table

SELECT
    CONCAT(e.first_name, ' ', e.last_name),
    t.report_count
FROM
    employee e
    JOIN (
        SELECT
            e.reports_to,
            COUNT(e.reports_to) as report_count
        FROM
            employee e
        WHERE
            e.reports_to IS NOT NULL
        GROUP BY
            e.reports_to
    ) t ON e.id = t.reports_to; 

What are the amounts for the sales people who made total sales of 20,000 or greater?

total-sales-above-20k.sql

This is a query where we will need to use the HAVING clause to filter for the records that we want to retain because we are filtering against the result of an aggregate function.

We will need to also use a JOIN to resolve the names of the sales people who have made total sales above the given amount.

Write a query that lists the employees, first name and last name AS “employee”, and the first and last name of their manager AS “manager” only for employees that have a manager

employee-and-manager.sql

This is the classic, self-join query example. In order to execute this query we must refer to the same table twice with a different alias for each and then execute a LEFT OUTER JOIN ensuring that we filter out any records where the first employee table reports_to field is NOT NULL.

Adding e1.reports_to to the SELECT clause AND excluding the WHERE clause while writing the query will make it clearer how this query works.

Write a query that lists the managers, first name and last name AS “manager” (single column output) for all of the employees who are managers that do not currently have any direct reports.

managers-without-direct-reports.sql

This is another permutation of the LEFT OUTER JOIN where we are looking for records with NULL values in a resulting column on the right. This requires an employee derived table that lists only the managers, which we have already written. This derived table needs to also return the id of the manager that we can then use to join against the employee table in the outer query and we have to filter for records where the reports_to column for the RHS table is NULL.

Write a query that creates a new table called top_salaries where an employee must have a salary higher than 1500. Include in the result set their first name, last name, salary, and department name.

top-salaries.sql and top-salaries-no-join.sql

This query requires that we persist the results of the query in a new table, and for that we use the CREATE TABLE <table-name> AS clause.

The key to this query is joining multiple tables against all of the appropriate foreign keys. An alternate implementation selects from multiple tables and includes multiple equality checks in the WHERE clause.

Bonus points for including the query that conditionally drops the table if it already exists.

Who are the sales people with three or more sales?

salespeople-with-three-or-more-orders.sql

This is another query that requires the HAVING clause and a JOIN between the employee and sales table.

Who is the 2nd most successful salesperson based on the total of sales dollars, excluding their salary?

2nd-most-successful-salesperson.sql

This is a query that requires a derived query that includes windowing and the RANK function over the sum of the price of the sales records. Then, we need to select from that derived query the record that is the 2nd in the ranked records.
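A sketch of the ranked derived query, using the sales table shown earlier; resolving the sales person’s name would add a join against the employee table:

```sql
SELECT
    employees_id,
    total_sales
FROM (
    SELECT
        employees_id,
        SUM(price) AS total_sales,
        RANK() OVER (ORDER BY SUM(price) DESC) AS sales_rank
    FROM sales
    GROUP BY employees_id
) ranked
WHERE sales_rank = 2;
```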

ORM Frameworks

RDBMS Concepts and Internal Implementation Details

Indexing

B-Tree Query

Cost Factor and Plan Explanation

ACID and Transactions

A transaction is a change, or group of changes, to a database that must all complete successfully to ensure that the data in the database is left in a consistent state. A transaction is denoted by a BEGIN and an END; only once the processing has successfully traversed the internal operations of the transaction and exited the END stage is the data available for consistent reads and further writes.

Transactions must be Atomic, Consistent, Isolated, and Durable.

Atomicity guarantees that the multiple statements within a transaction are treated as a single “atomic” unit and either all complete successfully, or are all rolled back so as not to leave the underlying data in an inconsistent state.
Consistency ensures that a transaction can only mutate data from one valid state to another.
Isolation controls concurrency such that multiple concurrent actions on the data leave it in a valid state, the same as if each transaction had occurred serially.
Durability guarantees that once committed, a transaction will remain persisted even in the event of a crash.
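The classic money-transfer example, using a hypothetical accounts table:

```sql
-- Either both updates are applied, or neither is
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- ROLLBACK; here would undo both updates instead
```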

Transaction States

The two basic operations are read or write.

  • Active
  • Partially Committed
  • Committed
  • Failed
  • Aborted: A rollback action leads to this state
  • Terminated

The DAG can follow any number of paths through the aforementioned nodes/states.

                                 Active
                               /        \
                    (r/w operations)  (failure)
                             /             \
           Partially Committed --(fail)-- Failed
                          /                  \
            (permanent store)             (roll back)
                         |                    |
                     Committed             Aborted
                           \                /
                            \ _Terminated_ /

Links

2-Phase Commit (2PC)

The high-level problem with a distributed RDBMS is that, since transactions need to be committed atomically to the data store, no node in the cluster can make the data available until all of the nodes can confirm a successful write (or a failure) of the transaction.

With a distributed database system, a 2PC splits the commit into two distinct phases to ensure the consistency of the data across the cluster. This protocol requires a Transaction Coordinator (TC) that manages the commit to the data.

Preparation Phase

The TC will confirm that all nodes in the cluster are able to promise to complete the update to the data.

  1. TC communicates to each node to prepare for the transaction.
  2. Each node ensures that it is able to execute the transaction; checking data integrity, and/or getting locks on resources.
  3. Once a node ensures that it can execute the transaction, it indicates this “promise” to the TC.
  4. If any node does not respond “ready”, the TC sends a message to each to roll back, at which time each rolls back any changes and/or releases any resources that it acquired in preparation for the transaction.

Commit Phase

The TC issues the update command to each of the nodes and confirms that all of them have successfully completed the transaction, or if one fails, commands each to roll back the change.

  1. Once the coordinator has received a “ready” message from all of the nodes it issues a command to all of the nodes to execute the transaction.
  2. The coordinator waits to receive a “complete” or a “fail” message from each of the nodes, or waits for one or more of them to time out.
  3. If each is successful it returns a “complete” message.
  4. If one or more sends a “fail” message or times out, the TC sends a message to each node to roll back the change.

SQL Injection Attacks

https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection_Prevention_Cheat_Sheet.html

Links