SQL

The following is based on of PostgreSQL.

Basic psql Commands

Operationpsql Command
Show databases\l
Switch to, connect to, a database\c <database>
Show tables in a database\dt
Describe a table in a database\d+ <table>
Toggle the printing of column names and result row count footers\t

Links

Distinct

The DISINCT clause will return the unique values for a given column. It can be applied to multiple columns to return the unique combination of the specified columns.

UNIONS

Enables the merging of results from multiple selects into a single result set. Requires that the columns have the same names and data types from the first select statement.

UNION differs from UNION ALL in that the former will deduplicate the results whereas the latter will return redundant results.

JOINS

At their core, joins enable the combination of data from multiple tables via a common key or keys between different data sets. They common key, or columns of data, is called a predicate or the join condition. It is the ‘ON a.id = b.id’ part of the query and the key to the join itself providing a logical way to relate the two records. The common column between the two tables that contains data with which you can join two records.

There are multiple types of joins, OUTER AND INNER joins.

The OUTER JOIN will match every record in the table on the side specified. For example a LEFT OUTER JOIN will return all records that match the query for the table on the left (the table in the FROM clause), and for those columns that do not match in the right table (the table in the ON clause), the result will be null.

The reverse is true for a RIGHT OUTER JOIN.

The INNER JOIN will ONLY return records where there is a match in both tables based on the join predicate.

OUTER JOINs are used to answer the “what is missing from this data set?” type of questions. For example, given a table of products and a table of sales that includes product ids, you can use an OUTER join to find the set of product ids that have not yet been sold.

Difference between a full join and an inner join

A full join will return records from both tables and null columns on both the left and right side where matches do not exist. An inner join will only return records where there is a match in both tables. This enables us to find out where records are missing in both tables. A full join is also known as a full outer join. MySQL does not support this, but it can be emulated by performing a UNION on a LEFT JOIN and an RIGHT JOIN.

Self JOINS

There are times when we need to treat a given table as if it is two separate tables. The classic example is the “write a query that lists the managers for each employee in the employees table” question.

To get this answer you need to write a JOIN query where the left table is the employees table as some alias, and the right table is the same employees table as a different alias. Using an alias for the “two” tables is a requirement because we are referring to “two” separate tables.

Just as an aside, I have seen many people who come from traditional RDBMS implementations (Oracle, DB2, etc.) attempt to write self JOINS in Hive which do not work. In a big data system like Hive a self join only works if you write the query with a sub select that will actually generate a 2nd table with which you can use to iterate over all of the values in your “left” table.

Keys

A set of columns in a table that enable a row to be uniquely identified. Some implementations allow foreign and unique keys to be NULL. A primary key can never be null.

Primary, unique, and foreign keys can all consist of the combination (composite) of more than one column.

  • Primary: a field in a record in a table that must be unique in that table. It is a special key because this key is directly tied to the table in which the record resides. It cannot be null.
  • Unique: Provides uniqueness across all other records in the same table
  • Foreign: Used to reference unique columns in other tables and provides a logical relationship between the values/records between two or more tables.

. What is a Natural Key?
. A key that is made up of columns that have a logical relationship to other columns in a table and does not require additional columns to be added to be unique and create an ‘artificial’ key.

Referential Integrity

A concept directly related to foreign keys, meaning that all key relationship that must be maintained such that the removal or update of rows in tables that share the key maintain valid references. Given an example where we have two tables. A products table that contains an id (its primary key in the products table), a name and description, and a prices table where we have an id (its primary key in the prices table), a products_id field which points to the products.id field as a foreign key in the products table, and a price field There are three rules:

  1. You cannot add a new record to the prices table unless the foreign key points to a valid record in the products table.
  2. If you delete a record in the products table the record (or records) in the prices table must also be deleted.
  3. If the primary key for a record in the products table changes all references to it in the prices table must also change.

Cursors

A cursor enables you to encapsulate a query such that you can process each row, one at a time. It is typically used when you want to process a large dataset with which you may otherwise run out of memory when attempting to process. In my experience I have never used them as I am always iterating over a result set in some sort of middleware, programming language and not trying to write code within the database itself. Following is a link that describes sql problems requiring cursors. I would argue that this kind of SQL should be avoided so that you can decouple your SQL implementation from your business logic.

HAVING, GROUP BY and WHERE Clauses

The GROUP BY clause enables us to group the results of an aggregate clause by the distinct values in another field.

The WHERE clause enables us to to specify criteria that specific fields values must match to be included in the query result.

The HAVING clause enables us to do what we would expect could be done with a WHERE clause with the result of aggregate functions (AVG, COUNT, MIN, MAX, SUM). One other difference with the HAVING clause is that we cannot refer the virtual field by name that is defined in the query, but must “re-write” the aggregate call for the field in question and then apply the filtering condition for it.

For example, given the following table and data

company=# \d+ sales
                                                          Table "public.sales"
    Column    |  Type   | Collation | Nullable |              Default              | Storage | Compression | Stats target | Description 
--------------+---------+-----------+----------+-----------------------------------+---------+-------------+--------------+-------------
 id           | integer |           | not null | nextval('sales_id_seq'::regclass) | plain   |             |              | 
 customer_id  | integer |           | not null |                                   | plain   |             |              | 
 employees_id | integer |           | not null |                                   | plain   |             |              | 
 sale_date    | date    |           | not null |                                   | plain   |             |              | 
 products_id  | integer |           | not null |                                   | plain   |             |              | 
 price        | integer |           | not null |                                   | plain   |             |              | 
Indexes:
    "sales_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sales_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customers(id)
    "sales_employees_id_fkey" FOREIGN KEY (employees_id) REFERENCES employees(id)
    "sales_products_id_fkey" FOREIGN KEY (products_id) REFERENCES products(id)
Access method: heap

company=# select * from sales;
 id | customer_id | employees_id | sale_date  | products_id |  price  
----+-------------+--------------+------------+-------------+---------
  1 |           1 |            1 | 2014-04-01 |           1 |    2500
  2 |           1 |            6 | 2014-03-01 |           1 |    2550
  3 |           2 |            6 | 2014-03-03 |           2 |  256700
  4 |           2 |            6 | 2014-04-03 |           3 | 1256700
  5 |           4 |            1 | 2014-04-10 |           1 |    2506
  6 |           2 |            7 | 2014-04-11 |           4 |   12383
  7 |           4 |            8 | 2014-05-10 |           1 |    8230
(7 rows)

In order to see the total number of sales grouped by the sales person (employee_id) for those employees who sold 10000 or more you would write the following query

SELECT
    SUM(s.price) as total_sale,
    s.employees_id
FROM
    sales s
GROUP BY
    s.employees_id
HAVING
    SUM(s.price) > 10000
;

 total_sale | employees_id 
------------+--------------
    1515950 |            6
      12383 |            7
(2 rows)

You would not be able to write the having clause as HAVING total_sale < 10000.

Additionally, notice that you cannot refer to the “total_sale” column in the HAVING clause, but need to include the aggregate function call for which you are defining a filtering criteria.

Indexes

In any database like system the more you can reduce the number of records that you have to process for ANY query, the faster and more efficient the query will be. Indexes allow us to partition the data against a given set of keys such that we do not have to execute the query against the entire dataset in a table.

The index data is typically stored in a B-tree. The nodes in the tree store the values for the specific columns against which the index is built and a pointer to the corresponding row in the table. Hashtables can also be used to store index data but a disadvantage of that implementation if that data in a hashtable is not sorted. Bitmap indexes work well with columns that are boolean values. The index data need to be updated on inserts,updates,deletes.

Selectivity and Cardinality

Cardinality, in it’s relation to an RDBMS, is the number of distinct values in a given column relative to the total number of rows in the table.

Selectivity is a measurement of how well a given column used for an index will neck down the amount of input data in a given query. Selectivity can be calculated with the following formula, simply the ration of cardinality to the number of total records in the table. The higher the number the better.

selectivity of index = cardinality/number of records)

The classic example is a table with people in it and calculating the selectivity based on the gender column. Since the possible genders (for simplicity’s sake) are only male and female the cardinality is 2. Assuming that we have 10,000 records in the table:

0.00002 = 2/100000

If we have an additional column in the table that is the U.S. State in which the person lives that would be:

0.0005 = 50/100000

As we can see the selectivity of the State column is much higher than that of gender and is a better index to help us filter out more data from an input query as long as we use State as part of the query.

Window Functions

A winder function operates over a sub-set of rows that are related to the current row. This enables the SQL programmer to perform calculations on a sub-set of rows that are related to each other.

https://www.postgresql.org/docs/9.2/tutorial-window.html

RANK

The RANK() function provides the ability to generate query results ranked by the values in the specified columns in the result set. The primary use case is to generate top-N and bottom-N results. Best explained by an example. If we have a table with the following structure and data

company=# select * from employees;
 id | first_name |  last_name  | departments_id | salary | reports_to 
----+------------+-------------+----------------+--------+------------
  1 | Jack       | Ofalltrades |              1 |   1400 |           
  2 | John       | Doe         |              2 |   1450 |           
  3 | Alan       | Ginsberg    |              3 |   1150 |           
  4 | Ram        | Somebody    |                |    600 |           
  5 | Sales      | Dude        |              1 |   1450 |          1
  6 | Erkle      | Jones       |              1 |   1960 |          1
  7 | Neeraj     | Bahri       |              1 |   2460 |          1
  8 | Jing       | Chu         |              1 |   1665 |          1
  9 | Ernie      | Borgnine    |              2 |   1000 |          2
 10 | Frank      | LaPoint     |              3 |    840 |          3
 11 | Debbie     | Grey        |              3 |    935 |          3
(11 rows)

And we want to get all of the employees ordered by their salary we can use the RANK function.

SELECT
    concat(e.first_name, ' ', e.last_name),
    e.salary,
    RANK () OVER (
        ORDER BY e.salary
        DESC
    ) as rank
FROM employees e
;

      concat      | salary | rank 
------------------+--------+------
 Neeraj Bahri     |   2460 |    1
 Erkle Jones      |   1960 |    2
 Jing Chu         |   1665 |    3
 John Doe         |   1450 |    4
 Sales Dude       |   1450 |    4
 Jack Ofalltrades |   1400 |    6
 Alan Ginsberg    |   1150 |    7
 Ernie Borgnine   |   1000 |    8
 Debbie Grey      |    935 |    9
 Frank LaPoint    |    840 |   10
 Ram Somebody     |    600 |   11
(11 rows)

Notice that there are two 4 ranks. This occurs when there is a tie between the values on which the rank order is being generated. Because there are two number 4 ranks, we skip number 5 and continue with 6. If there where three number 4 ranks we would skip 5 and 6 and continue with rank 7.

https://www.postgresqltutorial.com/postgresql-percent_rank-function/

PARTITION

We can specify additional criteria to the OVER clause to group the results by the value of a field in the result set to create multiple sets of ranks. To see the top salaries partitioned by department the query would be.

SELECT
    es.employee_name,
    es.department,
    es.salary,
    RANK() OVER(PARTITION BY es.department ORDER BY es.salary DESC)
FROM
    (
    SELECT
        concat(e.first_name, ' ', e.last_name) AS employee_name,
        e.salary,
        d.name as department
    FROM employees e
    JOIN departments d
        ON e.departments_id = d.id
    ) as es
;

  employee_name   | department | salary | rank 
------------------+------------+--------+------
 Alan Ginsberg    | Accounting |   1150 |    1
 Debbie Grey      | Accounting |    935 |    2
 Frank LaPoint    | Accounting |    840 |    3
 John Doe         | Financing  |   1450 |    1
 Ernie Borgnine   | Financing  |   1000 |    2
 Neeraj Bahri     | Sales      |   2460 |    1
 Erkle Jones      | Sales      |   1960 |    2
 Jing Chu         | Sales      |   1665 |    3
 Sales Dude       | Sales      |   1450 |    4
 Jack Ofalltrades | Sales      |   1400 |    5
(10 rows)

Practice Queries

Given the structure of the tables as defined in sql/sample-data/company-schema.sql in the git repo write the following queries.

  1. List the first name and last name of each employee and the name of the departments to which they belong.
    1. Bonus points for concatenating the first and last name into a new column
  2. List the total number of employees per department.
    1. The same query as above, but only list those departments with three or more people.
  3. List the departments without any employees.
  4. List the employees who are managers and the count of the number of people each manage.
  5. What are the total sales numbers for the sales people who made total sales of 10,000 or greater?
  6. Write a query that lists the employees, first name and last name AS “employee”, and the first and last name of their manager AS “manager”.
  7. Write a query that lists the managers, first name and last name AS “manager” (single column output) for all of the employees who are managers (do NOT have a reports_to value in their record).
  8. Write a query that lists the managers, first name and last name AS “manager” (single column output) for all of the employees who are managers that do not currently have any direct reports.
  9. Write a query that creates a new table called top_salaries where an employee must have a salary higher than 1500. Include in the result set their first name, last name, salary, and department name.
  10. Who are the sales people with three or more orders?
  11. Who is the 2nd most successful salesperson based on the total number of sales, excluding their salary?