The following is based on PostgreSQL.
Basic psql Commands
```
SELECT * FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
AND schemaname != 'information_schema';
```
|Description|Command|
|---|---|
|Switch to (connect to) a database|`\c <database>`|
|Show tables in a database|`\dt`|
|Describe a table in a database|`\d+ <table>`|
|Toggle the printing of column names and result row count footers|`\t`|
The DISTINCT clause will return the unique values for a given column. It can be applied to multiple columns to return the unique combinations of the specified columns.
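As a runnable sketch (using SQLite in place of PostgreSQL, with a made-up table and data), DISTINCT on one and on multiple columns behaves like this:

```python
import sqlite3

# In-memory database with a hypothetical sales table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("east", "widget"), ("east", "widget"),
    ("east", "gadget"), ("west", "widget"),
])

# DISTINCT on a single column: the unique regions.
regions = conn.execute(
    "SELECT DISTINCT region FROM sales ORDER BY region"
).fetchall()
print(regions)  # [('east',), ('west',)]

# DISTINCT across multiple columns: unique (region, product) combinations.
combos = conn.execute(
    "SELECT DISTINCT region, product FROM sales ORDER BY region, product"
).fetchall()
print(len(combos))  # 3 -- the duplicate ('east', 'widget') collapses to one row
```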
UNION enables the merging of results from multiple SELECTs into a single result set. It requires that each SELECT return the same number of columns with compatible data types; the column names in the result are taken from the first SELECT statement.
UNION differs from UNION ALL in that the former will deduplicate the results whereas the latter will return redundant results.
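A minimal sketch of the difference, using SQLite and two hypothetical tables that share a name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE managers (name TEXT)")
conn.execute("CREATE TABLE engineers (name TEXT)")
conn.executemany("INSERT INTO managers VALUES (?)", [("alice",), ("bob",)])
conn.executemany("INSERT INTO engineers VALUES (?)", [("bob",), ("carol",)])

# UNION deduplicates: 'bob' appears once even though he is in both tables.
union = conn.execute(
    "SELECT name FROM managers UNION SELECT name FROM engineers ORDER BY name"
).fetchall()
print(union)  # [('alice',), ('bob',), ('carol',)]

# UNION ALL keeps the redundant rows: 'bob' appears twice.
union_all = conn.execute(
    "SELECT name FROM managers UNION ALL SELECT name FROM engineers"
).fetchall()
print(len(union_all))  # 4
```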
At their core, joins enable the combination of data from multiple tables via a common key, or keys, shared between the data sets. The common key, or columns of data, is called the join predicate or the join condition. It is the
ON a.id = b.id part of the query, and it provides a logical way to relate the two records: the join predicate names the common column between the two tables that contains the data with which you can join two records.
There are two broad types of joins: OUTER and INNER joins.
The OUTER JOIN will match every record in the table on the side specified. For example, a LEFT OUTER JOIN will return all records that match the query for the table on the left (the table in the FROM clause), and for rows that have no match in the right table (the table in the JOIN clause), the right-side columns will be null.
The reverse is true for a RIGHT OUTER JOIN.
The INNER JOIN will ONLY return records where there is a match in both tables based on the join predicate.
OUTER JOINs are used to answer the “what is missing from this data set?” type of question. For example, given a table of products and a table of sales that includes product ids, you can use an OUTER join to find the set of product ids that have not yet been sold.
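A sketch of that pattern (SQLite, illustrative products/sales data): a LEFT OUTER JOIN plus an IS NULL filter finds the products with no sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (id INTEGER PRIMARY KEY, products_id INTEGER);
INSERT INTO products VALUES (1, 'widget'), (2, 'gadget'), (3, 'doodad');
INSERT INTO sales VALUES (1, 1), (2, 1), (3, 3);
""")

# Products with no matching sale come back with NULL on the right side,
# so filtering on IS NULL answers "which products have never been sold?"
unsold = conn.execute("""
    SELECT p.id, p.name
    FROM products p
    LEFT OUTER JOIN sales s ON s.products_id = p.id
    WHERE s.id IS NULL
""").fetchall()
print(unsold)  # [(2, 'gadget')]
```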
Difference between a full join and an inner join
A full join will return records from both tables, with null columns on the left or right side where matches do not exist. An inner join will only return records where there is a match in both tables. This enables us to find out where records are missing in both tables. A full join is also known as a full outer join. MySQL does not support it, but it can be emulated by performing a UNION of a LEFT JOIN and a RIGHT JOIN.
There are times when we need to treat a given table as if it is two separate tables. The classic example is the “write a query that lists the managers for each employee in the employees table” question.
To get this answer you need to write a JOIN query where the left table is the employees table as some alias, and the right table is the same employees table as a different alias. Using an alias for the “two” tables is a requirement because we are referring to “two” separate tables.
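A minimal self-join sketch, assuming a hypothetical employees table with a reports_to column (SQLite used for a runnable example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, reports_to INTEGER);
INSERT INTO employees VALUES (1, 'Jack', NULL), (2, 'Erkle', 1), (3, 'Jing', 1);
""")

# The same table appears twice under two aliases:
# e is the employee side, m is the manager side.
rows = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    JOIN employees m ON e.reports_to = m.id
""").fetchall()
print(sorted(rows))  # [('Erkle', 'Jack'), ('Jing', 'Jack')]
```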
Just as an aside, I have seen many people who come from traditional RDBMS implementations (Oracle, DB2, etc.) attempt to write self JOINs in Hive that do not work. In a big data system like Hive, a self join only works if you write the query with a subselect that actually generates a second table you can use to iterate over all of the values in your “left” table.
A set of columns in a table that enable a row to be uniquely identified. Some implementations allow foreign and unique keys to be NULL. A primary key can never be null.
Primary, unique, and foreign keys can all consist of the combination (composite) of more than one column.
- Primary: a field in a record in a table that must be unique in that table. It is a special key because this key is directly tied to the table in which the record resides. It cannot be null.
- Unique: Provides uniqueness across all other records in the same table
- Foreign: Used to reference unique columns in other tables and provides a logical relationship between the values/records between two or more tables.
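A sketch of composite primary and unique keys in DDL (SQLite; the order_lines table and its columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (
    order_id INTEGER,
    line_no  INTEGER,
    sku      TEXT NOT NULL,
    UNIQUE (order_id, sku),          -- composite unique key
    PRIMARY KEY (order_id, line_no)  -- composite primary key
);
""")
conn.execute("INSERT INTO order_lines VALUES (1, 1, 'A')")

# A second row with the same (order_id, line_no) pair violates the
# composite primary key and is rejected.
duplicate_rejected = False
try:
    conn.execute("INSERT INTO order_lines VALUES (1, 1, 'B')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```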
What is a Natural Key?
A key made up of columns that have a logical relationship to the other columns in a table and that uniquely identifies a row without requiring an additional, ‘artificial’ key column to be added.
A concept directly related to foreign keys. Referential integrity means that all key relationships must be maintained, such that removing or updating rows in tables that share a key leaves only valid references. As an example, take two tables: a products table that contains an id (its primary key), a name, and a description; and a prices table with an id (its primary key), a products_id field which points to the
products.id field as a foreign key, and a price field. There are three rules:
- You cannot add a new record to the prices table unless the foreign key points to a valid record in the products table.
- If you delete a record in the products table the record (or records) in the prices table must also be deleted.
- If the primary key for a record in the products table changes all references to it in the prices table must also change.
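The first two rules can be demonstrated with SQLite (note: SQLite only enforces foreign keys when the pragma is enabled, and the cascade shown is one way to satisfy rule 2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE prices (
    id INTEGER PRIMARY KEY,
    products_id INTEGER NOT NULL
        REFERENCES products(id) ON DELETE CASCADE ON UPDATE CASCADE,
    price INTEGER NOT NULL
);
INSERT INTO products VALUES (1, 'widget');
INSERT INTO prices VALUES (1, 1, 2500);
""")

# Rule 1: a prices row must point at a valid products row.
orphan_rejected = False
try:
    conn.execute("INSERT INTO prices VALUES (2, 99, 100)")  # no product 99
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True

# Rule 2: deleting the product cascades to its prices.
conn.execute("DELETE FROM products WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM prices").fetchone())  # (0,)
```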
A cursor enables you to encapsulate a query so that you can process its rows one at a time. It is typically used when you want to process a dataset so large that you might otherwise run out of memory. In my experience I have never used them, as I am always iterating over a result set in some sort of middleware programming language and not trying to write code within the database itself. Following is a link that describes SQL problems requiring cursors. I would argue that this kind of SQL should be avoided so that you can decouple your SQL implementation from your business logic.
HAVING, GROUP BY and WHERE Clauses
The GROUP BY clause enables us to group the results of an aggregate clause by the distinct values in another field.
The WHERE clause enables us to specify criteria that specific field values must match to be included in the query result.
The HAVING clause enables us to do what we would expect a WHERE clause to do with the results of aggregate functions (AVG, COUNT, MIN, MAX, SUM). One other difference with the HAVING clause is that we cannot refer by name to a virtual field defined in the query, but must “re-write” the aggregate call for the field in question and then apply the filtering condition to it.
For example, given the following table and data
```
company=# \d+ sales
                                                        Table "public.sales"
    Column    |  Type   | Collation | Nullable |              Default              | Storage | Compression | Stats target | Description
--------------+---------+-----------+----------+-----------------------------------+---------+-------------+--------------+-------------
 id           | integer |           | not null | nextval('sales_id_seq'::regclass) | plain   |             |              |
 customer_id  | integer |           | not null |                                   | plain   |             |              |
 employees_id | integer |           | not null |                                   | plain   |             |              |
 sale_date    | date    |           | not null |                                   | plain   |             |              |
 products_id  | integer |           | not null |                                   | plain   |             |              |
 price        | integer |           | not null |                                   | plain   |             |              |
Indexes:
    "sales_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "sales_customer_id_fkey" FOREIGN KEY (customer_id) REFERENCES customers(id)
    "sales_employees_id_fkey" FOREIGN KEY (employees_id) REFERENCES employees(id)
    "sales_products_id_fkey" FOREIGN KEY (products_id) REFERENCES products(id)
Access method: heap

company=# select * from sales;
 id | customer_id | employees_id | sale_date  | products_id |  price
----+-------------+--------------+------------+-------------+---------
  1 |           1 |            1 | 2014-04-01 |           1 |    2500
  2 |           1 |            6 | 2014-03-01 |           1 |    2550
  3 |           2 |            6 | 2014-03-03 |           2 |  256700
  4 |           2 |            6 | 2014-04-03 |           3 | 1256700
  5 |           4 |            1 | 2014-04-10 |           1 |    2506
  6 |           2 |            7 | 2014-04-11 |           4 |   12383
  7 |           4 |            8 | 2014-05-10 |           1 |    8230
(7 rows)
```
To see the total sales grouped by salesperson (employees_id) for those employees who sold more than 10000, you would write the following query:
```
SELECT SUM(s.price) as total_sale, s.employees_id
FROM sales s
GROUP BY s.employees_id
HAVING SUM(s.price) > 10000;

 total_sale | employees_id
------------+--------------
    1515950 |            6
      12383 |            7
(2 rows)
```
You would not be able to write the HAVING clause as
HAVING total_sale > 10000.
Notice that you cannot refer to the “total_sale” alias in the HAVING clause, but need to repeat the aggregate function call for which you are defining the filtering criteria.
In any database-like system, the more records you can exclude from a query, the faster and more efficient the query will be. Indexes allow us to partition the data against a given set of keys so that we can execute the query against a subset of the data.
The index data is typically stored in a B-tree. The nodes in the tree store the values for the specific columns against which the index is built, along with a pointer to the corresponding row in the table. Hash tables can also be used to store index data, but a disadvantage of that implementation is that the data in a hash table is not sorted. Bitmap indexes work well with low-cardinality columns, such as booleans. The index data needs to be updated on inserts, updates, and deletes.
Selectivity and Cardinality
Cardinality, in relation to an RDBMS, is the number of distinct values in a given column.
Selectivity is a measurement of how well a given column used for an index narrows down the amount of input data in a given query. It is simply the ratio of cardinality to the total number of records in the table; the higher the number, the better.
selectivity of index = cardinality / number of records
The classic example is a table with people in it and calculating the selectivity based on the gender column. Since the possible genders (for simplicity’s sake) are only male and female the cardinality is 2. Assuming that we have 100,000 records in the table:
0.00002 = 2/100000
If we have an additional column in the table for the U.S. state in which the person lives (cardinality 50), the selectivity would be:
0.0005 = 50/100000
As we can see, the selectivity of the state column is much higher than that of gender, so it is the better index for filtering out more data from an input query, as long as state is part of the query.
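The arithmetic above as a tiny helper (the function name is my own):

```python
# Selectivity = cardinality / total rows, mirroring the figures above.
def selectivity(cardinality: int, total_rows: int) -> float:
    return cardinality / total_rows

print(selectivity(2, 100_000))   # 2e-05  (gender column)
print(selectivity(50, 100_000))  # 0.0005 (state column)
```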
- Fast reads, O(1)
- Sequential writes to the append-only log are fast
- Immutability of persisted data enables a simpler concurrency and crash recovery model
- Deleting entries is expensive
- Rolling/merging logs is expensive
- The entire hash table needs to fit in RAM
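The trade-offs above describe a log-structured hash index (Bitcask-style): an append-only log on disk plus an in-memory hash table of offsets. A minimal sketch, assuming single-process access and string keys and values:

```python
import os
import tempfile

class LogStore:
    """Append-only log file + in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}  # must fit entirely in RAM

    def put(self, key: str, value: str):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key}\t{value}\n".encode())  # sequential append: fast
        self.f.flush()
        self.index[key] = offset                    # O(1) index update

    def get(self, key: str) -> str:
        self.f.seek(self.index[key])                # O(1) lookup, one disk seek
        k, v = self.f.readline().decode().rstrip("\n").split("\t", 1)
        return v

path = os.path.join(tempfile.mkdtemp(), "data.log")
store = LogStore(path)
store.put("a", "1")
store.put("a", "2")    # the stale record stays in the log until a merge pass
print(store.get("a"))  # 2
```

Deletes would be handled the same way, by appending a tombstone record; reclaiming the space for stale records and tombstones is what makes the rolling/merging of logs expensive.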
Typically used in NoSQL database implementations. Ideal for write-heavy workloads.
- Fast writes as they are append only
- Smaller disk footprint
- Slow reads because we need to read from disk
- Merging needs to happen in the background. Because the segments are sorted, can use the final pass in mergesort, O(n).
B-Tree and B+ Tree Index Implementations
In a database index, the value associated with each leaf node references the actual data record in the database. The key and value together are called a payload.
A window function operates over a subset of rows that are related to the current row. This enables the SQL programmer to perform calculations on that related subset of rows.
The RANK() function provides the ability to generate query results ranked by the values in the specified columns in the result set. The primary use case is to generate top-N and bottom-N results. Best explained by an example. If we have a table with the following structure and data
```
company=# select * from employees;
 id | first_name |  last_name  | departments_id | salary | reports_to
----+------------+-------------+----------------+--------+------------
  1 | Jack       | Ofalltrades |              1 |   1400 |
  2 | John       | Doe         |              2 |   1450 |
  3 | Alan       | Ginsberg    |              3 |   1150 |
  4 | Ram        | Somebody    |                |    600 |
  5 | Sales      | Dude        |              1 |   1450 |          1
  6 | Erkle      | Jones       |              1 |   1960 |          1
  7 | Neeraj     | Bahri       |              1 |   2460 |          1
  8 | Jing       | Chu         |              1 |   1665 |          1
  9 | Ernie      | Borgnine    |              2 |   1000 |          2
 10 | Frank      | LaPoint     |              3 |    840 |          3
 11 | Debbie     | Grey        |              3 |    935 |          3
(11 rows)
```
If we want to get all of the employees ordered by their salary, we can use the RANK() function.
```
SELECT concat(e.first_name, ' ', e.last_name), e.salary,
       RANK () OVER (ORDER BY e.salary DESC) as rank
FROM employees e;

      concat      | salary | rank
------------------+--------+------
 Neeraj Bahri     |   2460 |    1
 Erkle Jones      |   1960 |    2
 Jing Chu         |   1665 |    3
 John Doe         |   1450 |    4
 Sales Dude       |   1450 |    4
 Jack Ofalltrades |   1400 |    6
 Alan Ginsberg    |   1150 |    7
 Ernie Borgnine   |   1000 |    8
 Debbie Grey      |    935 |    9
 Frank LaPoint    |    840 |   10
 Ram Somebody     |    600 |   11
(11 rows)
```
Notice that there are two number 4 ranks. This occurs when there is a tie between the values on which the rank order is being generated. Because there are two number 4 ranks, rank 5 is skipped and the ranking continues with 6. If there were three number 4 ranks, we would skip 5 and 6 and continue with rank 7.
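The tie-and-skip behavior can be reproduced in miniature (SQLite 3.25+ also supports window functions; table and data are a cut-down version of the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
conn.executescript("""
CREATE TABLE emps (name TEXT, salary INTEGER);
INSERT INTO emps VALUES ('John', 1450), ('Sales Dude', 1450), ('Jack', 1400);
""")

rows = conn.execute(
    "SELECT name, RANK() OVER (ORDER BY salary DESC) FROM emps"
).fetchall()
ranks = dict(rows)

# Both tied rows get rank 1; rank 2 is skipped, so Jack gets rank 3.
print(ranks["Jack"])  # 3
```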
We can specify additional criteria in the OVER clause to partition the results by the value of a field in the result set, creating multiple sets of ranks. To see the top salaries partitioned by department, the query would be:
```
SELECT es.employee_name, es.department, es.salary,
       RANK() OVER (PARTITION BY es.department ORDER BY es.salary DESC)
FROM (
    SELECT concat(e.first_name, ' ', e.last_name) AS employee_name,
           e.salary, d.name as department
    FROM employees e
    JOIN departments d ON e.departments_id = d.id
) as es;

  employee_name   | department | salary | rank
------------------+------------+--------+------
 Alan Ginsberg    | Accounting |   1150 |    1
 Debbie Grey      | Accounting |    935 |    2
 Frank LaPoint    | Accounting |    840 |    3
 John Doe         | Financing  |   1450 |    1
 Ernie Borgnine   | Financing  |   1000 |    2
 Neeraj Bahri     | Sales      |   2460 |    1
 Erkle Jones      | Sales      |   1960 |    2
 Jing Chu         | Sales      |   1665 |    3
 Sales Dude       | Sales      |   1450 |    4
 Jack Ofalltrades | Sales      |   1400 |    5
(10 rows)
```
Given the structure of the tables as defined in
sql/sample-data/company-schema.sql in the git repo write the following queries.
- List the first name and last name of each employee and the name of the departments to which they belong.
- Bonus points for concatenating the first and last name into a new column
- List the total number of employees per department.
- The same query as above, but only list those departments with three or more people.
- List the departments without any employees.
- List the employees who are managers and the count of the number of people each manage.
- What are the total sales numbers for the sales people who made total sales of 10,000 or greater?
- Write a query that lists the employees, first name and last name AS “employee”, and the first and last name of their manager AS “manager”.
- Write a query that lists the managers, first name and last name AS “manager” (single column output) for all of the employees who are managers (do NOT have a reports_to value in their record).
- Write a query that lists the managers, first name and last name AS “manager” (single column output) for all of the employees who are managers that do not currently have any direct reports.
- Write a query that creates a new table called top_salaries where an employee must have a salary higher than 1500. Include in the result set their first name, last name, salary, and department name.
- Who are the sales people with three or more orders?
- Who is the 2nd most successful salesperson based on the total number of sales, excluding their salary?
RDBMS Concepts and Internal Implementation Details
Cost Factor and Plan Explanation
ACID and Transactions
A transaction is a change, or group of changes, to a database that must all complete successfully to ensure that the data in the database is left in a consistent state. A transaction is denoted by a BEGIN and an END; only once processing has successfully traversed the internal operations of the transaction and exited the END stage is the data available for consistent reads and further writes.
Transactions must be Atomic, Consistent, Isolated, and Durable.
Atomicity guarantees that multiple statements within a transaction are treated as a single “atomic” unit and either all complete successfully, or are rolled back so as not to leave the underlying data in an inconsistent state.
Consistency ensures that a transaction can only mutate data from one valid state to another.
Isolation controls concurrency such that multiple concurrent actions on the data leave it in a valid state, the same as if each transaction had occurred serially.
Durability guarantees that once committed a transaction will remain and be persisted even in the event of a crash.
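Atomicity and rollback can be demonstrated with SQLite's transaction handling (hypothetical accounts table; the raised exception simulates a failure between the two writes of a transfer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

# Atomicity: both legs of the transfer commit together, or not at all.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        raise RuntimeError("simulated crash between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
except RuntimeError:
    pass

# The partial write was rolled back; the data is back in a consistent state.
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(100,), (0,)]
```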
The two basic operations are read or write.
- Active: the initial state; the transaction stays here while read/write operations execute
- Partially Committed: the final operation has executed but the changes are not yet permanent
- Failed: a check or operation has failed
- Committed: the changes have been made permanent
- Aborted: a rollback action leads to this state
- Terminated: the final state, reached from either Committed or Aborted
The DAG can follow any number of paths through the aforementioned nodes/states.
```
                     Active
                    /      \
     (r/w operations)      (failure)
                  /          \
 Partially Committed --(fail)-- Failed
         |                        |
 (permanent store)           (roll back)
         |                        |
     Committed                 Aborted
          \                      /
           \____ Terminated ____/
```
2-Phase Commit (2PC)
The high-level problem with a distributed RDBMS is that, since transactions need to be committed atomically to the data store, no node in the cluster can make the data available until all of the nodes can confirm a successful write, or the failure, of the transaction.
With a distributed database system a 2PC splits up the commit into two distinct phases to ensure the consistency of the data across the cluster. This protocol requires a Transaction Coordinator (TC) that manages the commit to the data.
The TC will confirm that all nodes in the cluster are able to promise to complete the update to the data.
- TC communicates to each node to prepare for the transaction.
- Each node ensures that it is able to execute the transaction; checking data integrity, and/or getting locks on resources.
- Once all nodes have ensured that they can execute the transaction, each indicates this “promise” to the TC.
- If all of the nodes do not respond “ready” the TC sends a message to each to rollback, at which time each rolls back any changes and/or releases any resources that it acquired in preparation of the transaction.
The TC issues the update command to each of the nodes and confirms that all of them have successfully completed the transaction, or if one fails, commands each to roll back the change.
- Once the coordinator has received a “ready” message from all of the nodes it issues a command to all of the nodes to execute the transaction.
- The coordinator waits to receive a “complete” or a “fail” message from each of the nodes, or waits for one or more of them to time out.
- If each is successful, the TC returns a “complete” message.
- If one or more sends a “fail” message or times out, the TC sends a message to each node to roll back the change.
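The two phases above can be sketched as a toy, in-process coordinator (the Node interface is hypothetical; real systems add timeouts, persistence of the decision, and recovery):

```python
class Node:
    """A participant that can vote on and then apply a transaction."""

    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "idle"

    def prepare(self):  # phase 1: check integrity / take locks, then vote
        self.state = "ready" if self.can_commit else "failed"
        return self.can_commit

    def commit(self):   # phase 2: make the change permanent
        self.state = "committed"

    def rollback(self):  # undo any changes, release acquired resources
        self.state = "aborted"

def two_phase_commit(nodes):
    # Phase 1: ask every node to promise the transaction.
    if all(node.prepare() for node in nodes):
        # Phase 2: every node promised, so tell each to commit.
        for node in nodes:
            node.commit()
        return "committed"
    # Any "no" vote (or, in a real system, a timeout) aborts everywhere.
    for node in nodes:
        node.rollback()
    return "aborted"

print(two_phase_commit([Node("a"), Node("b")]))                    # committed
print(two_phase_commit([Node("a"), Node("b", can_commit=False)]))  # aborted
```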