Data Structures

Algorithms and data structures tend to go together.

Links

The basic ontology of data structures with just a very few examples of each type. A more complete list can be found here.

Linear Data Structures

Arrays

One of the most basic of data structures that exploits the addressing logic of linear, one dimensional, memory systems. A structure whereby the elements stored it in are adjacent to each other in memory and allow for O(1) random access.

Circular buffer

Matrix

Bloom Filter

A probablistic data structure to determine if a value may be in a set, or certainly is not. There are sometimes false positives, but never false negatives. Elements can only be added, not removed.

Lists

Linked lists: Each node in the list linked to the next in the list. O(1) insertion, but traversal is O(n). Singly linked lists only links in one direction from head to tail. Doubly linked has larger resource overhead for storage of forward and backward pointers but makes removal and insertion O(1). Benefits over array are better insertion and removal but slower random access. Also no need to reallocate the entire linked list when new items are added because each data does not need to be adjacent to the rest in memory.

Skip lists

Hash Based

  • Hash Table: Keys associated with values. The keys are hashed to provide for efficient lookup of the value to which the key is associated. Typically implemented with an array and a hash function (along with a collision algorithm to sort out instances where duplicate keys map to the same array position).
  • Bloom Filters

Trees

  • M-way Trees: M, being the degree of the tree. A binary search tree has a degree of 2. M-way trees can various degrees or number subtrees
    • Binary trees: elements have at most 2 children
      • Binary search tree
      • Red-black tree
    • B-trees: elements can have more than 2 children, or m children
      • B-tree
      • B+ tree: Similar to B-tree but all leaf nodes are linked together in a doubly-linked list.
    • Types:
      • Ternary tree
      • K-ary tree
      • Tries
    • Links:
      • https://webdocs.cs.ualberta.ca/~holte/T26/m-way-trees.html
      • https://www.studytonight.com/advanced-data-structures/b-trees-mway-trees-data-structure
  • Heaps
    • Heap
    • Binary heap
    • Weak heap
  • Trees
    • Tree
    • Radix tree
    • Suffix tree
  • Space-partitioning trees
    • Segment tree
    • Range tree
  • Graphs: Collections of vertexes and edges.
    • Graph
    • Adjacency matrix
    • DAG (Directed Acyclic Graph)
    • Directed ternary: each child has at most three children
  • Abstract data types:
    • Set: A (typically) unordered set of elements that only retains one of each value.
    • Stack: Collection of items where only the top most (most recently added item) is available. Typically last in, first out, depending on the insertion algorithm.
    • Queue: Collection of items where only the oldest item is available, new items added to the back, oldest from the front (FIFO or LILO).
    • String: In many high-level programming languages a String is a data structure that abstracts the creation, access, and mutation of an array of characters.

Common Data Structures

Graphs

A graph is a collection of vertices and edges. Edges connect vertices in the graph. The vertices are where the data is stored.

The three types of graphs are

  1. Undirected: A graph in which the edges do not connote any directional relationship.
  2. Connected: A graph in which all nodes can reach all other nodes in the graph by following edges in any direction.
  3. Directed: A graph in which each of the edges has a defined direction between the vertices. This defines how the vertices are related to each other. For instance, we can use a directed graph to model a map where a vertex is location A on the map and a directed edge is a one-way street that connects location A to location B.
Adjacency Matrix vs. Adjacency List

An adjacency matrix can be used to represent both a cyclic and acyclic graph. It is a two dimensional binary array of n*n size where n is the number of nodes in the graph. Each row’s id is a pointer to the node. The columns in each row indicate to which other node id there is a relationship. In a directed graph it includes a True/1 in the column if there is an edge from the current row’s node to the node represented by that column id. If it is undirected there is a True/1 in each column where an edge exists connecting the two nodes.

(add a graphic example)

An adjacency list is an array of separate linked lists. In a graph with n nodes the array is of length n. Each element of the array referring to a node in the array. For a directed graph, each array’s value is a list of node id’s to which there is a directional edge. For an undirected graph each array’s value is a list of nodes to which there is any edge.

(add a graphic example)

B-Tree

Usually a shallow but very wide structure. Similar to binary search tree except that a node can have more than two children. Better suited for r/w large blocks of data, think a filesystem or database. The leaves of the tree are where the data is stored. Intermediary nodes only store keys that are used as separators to partition the data. In a 2-3 b-tree, the parent node stores two values which separate the three child leaves. Each leaf stores multiple values which speeds up access because of the locality of the data on disk or in RAM.

          |   key=10   |    key=20    |
         /             |               \
        /              |                \
val=1, val=2     val=12, val=14     val=22, val=30

B-trees are of some “order N” meaning that nodes contain upto N children.

  • Searching begins at the root node and follows the pointers from node to node until the key is found, or the search fails by ending at a leaf node without the ability to traverse any further.
  • Root node initially begins with 1 key. It is a special exception and the only node allowed to have less than n keys in an order n B-tree.
  • To insert the key, we first search for it. If we find it, our work is done. If we don’t find it, we end up at the place where the key should be stored.
  • If keys are full, then the node is split and the middle value is promoted up the tree. If the promotiion goes all the way to the root and causes a split, the middle value is promoted up to a new root node.
  • To delete we must move the value to a leaf node first
Properties that must exist for a B-Tree
  • All leaf nodes must be at the same level/distance from the root node
  • Above the leaf nodes there should be no empty sub-trees
  • The height should be as “short” as possible.

B+ Tree

Unlike a B-Tree the B+ Tree stores data (pointers to rows, disk sectors, etc.) only in the leaf nodes and the leaves are stored as a doubly-linked list to enable access to adjacent leaves.

Parent nodes will store key values (not the pointers or values themselves) to enable the partitioning of the data and searching.

https://www.geeksforgeeks.org/difference-between-b-tree-and-b-tree

Trie

Keys are typically Strings. All descendants of a node have a common prefix. Think an autocomplete for a form field. Root is associated with an empty string. Values tend to be associated with leaves in the tree.

The edges/links between the nodes are associated with characters that are shared between the nodes.

Nodes in the trie do not store their key. It is the position of the node that determines its associated key. The value of the key is distributed across the trie and as a result all nodes may not contain a value.

The key thing to remember is that all of the children of a node share a common prefix of the string associated with the parent node. The root node is associated with the empty string and contain a set of edges that have the first character of each of the strings in the trie.

Heaps

It is a binary tree based structure that satisfies the heap property. For a max heap, if B is a child of parent A, then A’s key must be >= B’s key. The element with the greatest value is always the root node. This is for a max-heap. The converse is a min-heap. This is the primary underlying data structure for a priority queue.

The following are the rules that must be met in order for a tree to be a heap.

  • There are at most two children per parent node.
  • There is at most 1 node with just 1 child
  • That child is the LEFT child of its parent and
    • It is the right-most leaf at depth d.

Arrays can be used as the underlying data structure with a fairly simple algorithm. Given the following min heap:

          2
        /   \
       /     \
      5       6
     / \     /
   10  11   9

We can store each of the nodes in an array as follows

Array Index012345
Value25610119

The root value is always in array[0] such that the algorithm to find the children and parents works properly.

For any child in index k the following is true:

  • Left child = 2*k+1
  • Right child = 2*k+2
  • Parent = (k-1)/2

For the 5 node: k = 1, it’s left child = k*2+1 or 3 which is 10, right child = k*2+2 or 4 which is 11

For the 11 node: k = 4 its parent = (k-1)/2 or 1, which is 5

To insert values into the heap, always insert at the end of the array. Then check to see if the order property of the heap is still maintained. Check to see if the inserted item’s parent is of lesser value, if not swap the two values and continue checking up to the root of the tree.

To remove the least value from a heap and then re-sort it, you first remove the root value. After removing the root (in this case the min value), replace the value in the root with the value at the end of the array, removing that leaf from the tree. Then work the value down the tree by checking for the smaller of the two child values and swap the position with that value. This will ensure that the SMALLEST value becomes the root. Continue until neither of the children of the value being sifted down is smaller than it.

The reverse is true for max heaps.

  • O (1) for finding the max/min element is O(1) since we know that the first element of a heap is always the max/min
  • O (log n) for inserts and removals
  • O (n) to build initially

XOR Linked List

An XOR Linked List is a more memory efficient doubly-linked list which uses an XOR operation on data to derive the pointer location of the next and prev nodes.

  • http://www.geeksforgeeks.org/xor-linked-list-a-memory-efficient-doubly-linked-list-set-1/
  • https://en.wikipedia.org/wiki/XOR_linked_list
  • https://www.techiedelight.com/xor-linked-list-overview-implementation-c-cpp/

Skip List

Involves a list built in layers allowing you to “skip” to the node that is closer to the element for which we are searching. The lowest layer is the complete linked list. The top most is the layer in which each of the elements is a “key” and where each subsequent key is greater than or less than the previous. In this way, we can iterate through the sparsest set of links until we find the element with the starting value that we are looking for and then search downward in the layered lists for the specific element. Removes the performance O(n) bottleneck of having to iterate over the entire list. Insertion and deletion operations remain relatively efficient when compared to the standard Linked List sibling.

Average search is O(log n), worst is O(n) which is similar to linked list, but this outcome is extremely unlikely.

https://en.wikipedia.org/wiki/Skip_list