What Are Data Structures Really? The Fundamental Building Blocks of Computer Science
Published on 2019-11-12
21 min read
Computers. Machines created to do the bidding of thy masters, without questions, without reasons. Sounds great right? It absolutely is! Technology has vastly transformed the world we live in and improved the overall living standard throughout its short history.
This has been possible not because computers are incredibly smart (at this time), but because of the humans who tell computers what and how to do a task. When you truly think about how programming works and the field it has evolved into, there are a few basic elements that make it possible for us humans to tell computers what to do. One type of those elements is called a data structure.
Data structures are a building block to the software we write. They let us organize data in a computer for specific situations, so computers can do our bidding faster or make it easier for us to reason about. There are many different types of data structures with pros and cons in different scenarios, but my goal of this blog post is to write about each one in a way that makes it easy to understand for those interested. I won't be writing about every single data structure out there, but many of them are based on these fundamental ones or a combination of them, all making it easier for humans and computers to create better software!
TODO: GIF LETS GET THIS PARTY STARTED
TODO --- explain memory in a computer
Arrays are collections of items, typically of the same type, that are stored continuously in a block of memory. More specifically, an array is initialized to a specific size, memory for that size is allocated, and then items are added up in different indexes of the array up to that size.
Arrays hold the same type of data because it's easier to allocate memory when that is the case. For example, if you have an array of size 4, with the type of integer in Java, then the computer can allocate 4 bytes (1 byte for each integer) in memory for you. Last but not least, arrays are zero indexed, which means that indexes are starting at 0 for the first item. See the example for static array below!
Static arrays are just arrays in which the size doesn't dynamically adjust based on the amount of elements inserted. Once the array reaches max capacity, then you must manually create a new array and copy elements over, or remove elements from the array you have.
Have you ever tried adding a 13th egg into the carton of eggs that holds one dozen? Of course not, because it won't fit and the lid wouldn't close. That's how static arrays work.
You tell the computer, 'hey there, I'd like room for 12 eggs to use in my program. I don't have all the eggs yet, but when they come I will add them, so please save space for me'.
The computer obliges.
The egg man comes and asks you how many eggs would you like, you say 'one dozen', and he gives you twelves eggs, so you add them into your carton - you have reached full capacity.
Now, say that same egg man gives you a bonus egg for buying twelve. You have a few options:
- Throw it away or tell him no thanks, I have no room
- You remove one egg from the carton, add the bonus egg, and fry the one you removed so you no longer need to store it anywhere but your belly.
- You find an 18 egg carton, so you now have room for the 13th egg, and you throw the original 12 egg carton away.
That's how an egg array works in a computer too :).
TODO create code pen
Add an item
Get an item
const array = [0,1,2,3,4,5] // 0 is pointing to the first index in the array, which then retrieves that value at that location console.log(array) // prints out 0
Delete an item
Update an item
const array = [0,1,2,3,4,5] // 0 is pointing to the first index in the array, which we set to a new value array = -1 console.log(array) // prints out -1
- Easily change values inside the array without needing to create or remove memory at runtime
- Allocation is done at compile time and you know the size of the array
- Easy to handle since you don't need to maintain size if you know the data is static within it
- Fast look ups for values
- Quick to add items to the end of an array (that has space of course)
- Can lead to wasted memory if the memory goes un-used throughout the life of the program
- Runs out of memory if you need extra space and fill it up
- Encompassing the two above, one can just summarize as the inability to change the size of the array as needed
- Resizing requires logic and handling of memory allocation on your own
- Slow to insert update/remove/insert in the middle of the array
- Hold collections of data in programs so you don't have to use a boat load of variables for each
Dynamic arrays are very similar to static arrays, except for one distinguishing feature. They resize automatically for you! This means you don't need to know the size of your data before hand or worry about running out of space, because the array dynamically changes its size based on the amount of data it has.
Generally, dynamic arrays are actually an abstraction on top of arrays to handle the resizing and copying for you. By this I mean that, under the hood of a dynamic array is just a static array managed by own class and functions. As the amount of data grows and reaches the current internal array's capacity, the abstracted data structure then creates a new static array that is double the size of the current one, and copies all values into the new one. You of course, don't need to worry about this part, because that is where the beauty of its abstraction comes into play.
The basic operations are the same as the static array, please scroll up to see them :).
- Adding new values to the end still mostly fast
- Fast lookups and updates
- Resizes for you!
- Resizing can take time at large data sizes when appending new items
- Slow deletion/inserts in the middle of the array* What's the best way to create new array sizes? Doubling can lead to wasted space
Same common uses as static arrays, please check those out above if interested!
You can think of multi-dimensional arrays as just an array, of arrays - like a table. This simple explanation does make it sound like they are easy to grasp, but I find that they can be a bit difficult to wrap your head around the first time.
Why is that?
Because understanding how to find specific indexes within a 2D or 3D array can be confusing when doing so for the first time. I'm hoping I can break it down to help!
Before jumping into the basic operations in a 2D (or more) array, I want to dive deep into the concept of rows and columns, and how I keep track of which I'm indexing. Let's start with a picture.
As you can see, each row and column is representing an index, which can be used to find a specific value.
I think of rows as the first layer and columns as second, which helps me remember that to reference rows, I only need one index, while I need two to reference a column WITHIN a row, because in our code we first create the rows with arrays inside of them for columns.
The basic operations below should help drive this point home, and you will see from the comments when and where to use indexes.
Explain adding / inserting etc at specific rows or columns* Explain creating a 2D array in JS
- Getting neighbors
Linked Lists are linear data structures that don't exist in the same continuous memory location. Instead, each item in the list is called a Node, which then points to its next item. The node structure allows for users to store any type of data, such as other objects, other data structures, or just numbers (Node.data => whatever you would like) and a pointer to its next Node (Node.next => Node or null). The first item in the list is called the head node.
Imagine you are creating your own train, because you have recently installed a beautiful model train track in your basement. A train track is no good without a train, so you think to your self - hmm, maybe I should start building one.
You immediately realize that at the start of every train is the the locomotive, or the car with an engine. So you declare that as the start of your train, buy one, and set it on the track. This is analogous to the head of your train (or train linked list).
The first car is definitely an improvement, but your model track is long and you want to carry use this train to carry toys along the track - to do so, you need more cars. We'll start with the A car to carry all toys that start with the letter A and add it as the next train from the head. You want to carry toys from A to D, but unfortunately the B car hasn't arrive from Amazon yet, so you hook all the other ones up first.
Head -> A -> C -> D ->
Your train and track looks great, the kids are happy with the idea of new toys, and your happy the track is no longer empty. But, it's only been one day since ordering your B car - thus you must wait another day to complete your mission.
Finally! The next day arrives, you unpack the B car, and want to add it into its correct spot. There's one problem, you're having trouble finding exactly where A is on the train, so you start at the engine car and start working your way through the next cars. Well that was easy, it happened to be the next car! So you easily unhook car C from car A, add B to A, and C to B. Your train is complete! What a relief you didn't have to rebuild the entire thing from scratch.
Your final train looks like so:
Head -> A -> B -> C -> D
And as you can see, you now have a linked list train with each trailer hooking the cars together.
- Easy insertion and deletion anywhere in the list without having to re-organize the data structure
- Dynamic size
- Random access is difficult
- Slow to find specific item with specific data
- More memory required to store pointers
- React uses Linked Lists for it's Fiber reconciliation algorithm
- Used to implement stack and queues (see those below)
- Adjacency list implementation for sparse graphs
The idea and concept of Doubly Linked list are very similar, with only two key differences. The first being, instead of each Node pointing to the next Node in the list, it also includes a pointer to it's previous Node as well. Secondly, the last node of the linked list is called the tail node.
Having both of these pointers and both nodes allows us to easily traverse and manipulate the list without keeping track of previous nodes and also the ability to traverse in both directions.
- Inserting / Deleting a node in the middle of the list is very easy if a node is given to insert it before or after
- Traversal from either direction is much easier, and no need to track prev nodes
- More complex logic for maintaining pointers and the lists for certain operations
- Extra memory required for the extra pointer
- Explain notion of prev/next Node structure
- explain why you would want this
- Basic operations
- Common uses
- explain concept of LIFO
- explain concept with stack of plates
- explain how stacks work
- Basic operations
- common uses
- explain how you can create a min/max stack with all O(1) times
- Basic operations
- common uses
- explain concept of FIFO
- explain concept with line at Target
- explain how queues work
- Basic operations
- common uses
So far in the article, we've talked through linear data structures. Structures in which store data in a straight line type of fashion, but alas, we have arrived at our first non linear structure - Trees.
Data within a tree is stored hierarchically, meaning we can organize the data with relationships that have a hierarchy. Things like your family tree or an NFL football team. In a football team, you have the head coach, who under him or her has assistances, who then help coach every player. Even players can have hierarchy, since it is common to have captains or the quarterback be responsible for leadership over the teammates.
In Computer Science, we have MANY different types of data structures that are tees but implement them or organize data in specific ways. But before diving too deeply into the specific structures, there are a few common characteristics and terminology one should be comfortable with first.
Picture first, then definitions :).
The topmost node of the tree
- Team Owner
Any node without children
Given a node that is a child, the parent is the node in which the child is referenced from. An immediate ancestor of a given node.
- Given the QB node, its parent is the Team Captain
An immediate descendant from a parent node.
- Given the Team Captain, a child is the QB node
Given a node, its children are nodes that reference the same parent.
QB nodehas a sibling of the other
Given a node, its ancestors are the nodes reached by traveling towards the root.
- Given the
Head Coach Node, its ancestor is the
Given a node, its descendants are the nodes reached by traveling away the root.
- Given the
Head Coach Node, a descendant is the
A sequential list of nodes and edges that gets you from one node to another.
A data structure that has pointers to other nodes in the tree and a data structure that represents data in the tree node, like a number value.
A connection from one node to another.
The number of children a node has.
The distance of a given node from the root.
Binary Search Trees are a specialized tree data structure, in which each node only points to at most two other nodes, typically tracked with left and right pointers, and includes only unique data points. Their are a few rules mentioned below that must be true for a tree to be a BST.
The letters represent the val of each node, which could be another data structure depending on the use case.
The following must be true for each and every node in a BST:
- All keys in the given node's left subtree are less than the given node's key
- All keys in the given node's right subtree are less than the given node's key
- Each left and right subtree must also be a BST
- Each node must have a unique data point
A BST is complete, if every single level of the tree is complete filled except for possibly the last level, where all nodes are as far left as possible.
A BST is full if every single node other than the leaves of the tree has two children nodes.
The diagram given above for BSTs is both full and complete.
- Insert, delete, search can all be implemented in O(log n) which is very helpful for large data sets
- To maintain balance requires extra effort (see other tree structures below), but should be done to maintain fast operations
- Implementation in general is more complex than some of the other data structures
- Extra memory to store pointers to other tree nodes
- Sorted data structures with large data sets
- Indexing for Databases
- Arithmetic Expressions
- [Geeks for Geeks](https://www.geeksforgeeks.org/binary-- search-tree-data-structure/)
- My Code School
- Princeton CS
The concept of N-Ary trees is very similar to the Binary Search Tree structure above, with a difference of how many nodes a given node points to. Binary Trees are a specialized version of N-Ary trees, where n is two.
Typically, instead of pointing to left and right children, an array of pointers is used to track each child node.
The advantages and disadvantages are not listed below, because they are very similar to the Binary Search Tree. The number of nodes used to store your hierarchies depends on the problem you're trying to solve :).
The example above represents a 3-Ary tree.
One interesting note on these types of trees is that you can convert them to a Binary Tree or create them from one.
- File system hierarchies
- Org structures
- explain concept of using maps to store characters to new maps, end characters, and end words in a tree structure
- explain concept of implementing one and its use
- basic ops
- common uses
- explain concept of max/min heaps
- explain the rules
- explain implementation with array and its formulas
- explain node structure
- common uses
Graph's are a collection of nodes that are linked together to represent some type of relationship - they don't have to be hierarchical relationships like trees, and can represent any type of relationship we want.
A node in the graph
The connection between two vertices. If a graph is directed, then these connections are one way, otherwise they can be bi-directional.
A sequence of vertices that represents the directions one took from one vertex to another.
A path is called simple, if it has no repeated vertices.
A path that creates a circular path is called a cycle.
If a graph is called acyclic, it means there are no cycles within it.
Think about the town you grew up in, all of your favorite places you visited regularly, the special places you went with your family, and places you have memories of.
A lot of great memories huh?
Now if you took all of those places and connected them in different ways, like their location, your joy from visiting them, distance away from each other, then you would have your very own graph of the place you grew up.
The amazing part? You could create many different types of graphs for just the place you grew up to represent it in different ways.
You could create a graph of its population, or its various housing, roads, high schools, or even trees if you wanted.
Graphs represent many of the things in our world, and for that they are amazing!
I really recommend taking time to watch the following videos from YouTube on Graphs. I decided against writing my own based on just how great they are and the visuals are so helpful.
- Intro to Graphs
- Properties of Graphs
- Graph Representation Edge List
- Graph Representation Adjacency Matrix
- Graph Representation Adjacency List
- Graph Traversals