Stay surfing with SQL: The case for relational databases (Part 1)
Disclaimer: I am a university student who is very passionate about a lot of technical topics, one of them being databases. I am in no way, an expert on the subject, but I place my opinions here based on university courses I have taken, research papers I have much enjoyed, personal software projects that I have written and job experiences that I have obtained. If you find any misconstrued information, do not hesitate reaching out to me at firstname.lastname@example.org.
There is quite an enormous amount of hype that NoSQL/non-relational databases have garnered around for the past few years — ranging from whitepapers (claiming that *insert some NoSQL database here* is the fastest and most performant database management system) to personal GitHub projects (and now, even startups) that rely on NoSQL (I am looking at you, MongoDB) as a primary datastore. One has to admit — if I compared the dead-simple and easy-on-the-eyes documentation of MongoDB and Couchbase compared to that of MySQL and PostgreSQL’s, I would probably go with the former. However, looks and statistics can be quite deceiving. I hope to disseminate some information as to why relational database technologies are still largely the most fit for your project.
Into the shallow ends of relational theory and history
Relational database management systems garnered steam after some intensive research starting around the early 1970s, one of the most prominent being that of Edgar Codd’s. Much like the functional programming paradigm, the relational model prided itself in keeping data separate from the constructs that managed it. A declarative and higher-level way of accessing data — decoupling away from the internals and details of how the data is actually stored — would go a long way in helping data be easily digestible (allowing us to munge the data into multiple perspectives) and maintainable (higher-level abstractions arguably leads to a lot less bugs). Thus, the Structured Query Language was born — an accepted industrial abstraction to the internal mechanisms of a database management system, much like the purposes of the standard driving wheel hiding away the details of a car’s internal combustion engine. The Structured Query Language specification allowed for people to use various databases, ranging from free/open-sourced to billion-dollar behemoths, to be all grounded on one usable language. With SQL, you generally have this derived structure:
SELECT <column1, column2, columnN>
GROUP BY <columnN>
ORDER BY <columnN>
SQL, in a sense, is basically a function, where a table of data — a dataset — comes in as input and a table of data — a dataset — comes out as output. SQL is based heavily on relational algebra, which has more details here. In short:
- SELECT statements allows you to project your tables — much like how in functional programming, map() operators allowed you to transform a list of elements into another list of different elements.
- FROM statements allows you to control which database table should you be working on.
- WHERE statements allows you to select which data from the table should you work on, much like how in functional programming, a filter() operator cut down “unnecessary” elements of a list.
- joinCondition statements allows you to unite 2 datasets
- GROUP BY statements allows you to categorize blocks of data within a dataset/table by some attribute (a.k.a column), allowing you to match data based on attributes of the same value.
- HAVING statements, which is much like WHERE statements, but working with smashed data as a result of a GROUP BY operation.
- ORDER BY statements, which allows you to order data within a dataset by the value of some of their attribute/s.
SQL allows you to work with data in practically English — abstracting away the details of lower-level database internals with an interface — much like how we don’t have to know how to operate nuclear power plants to charge our smartphones because we operate on the interface that is our electrical outlets. This provides you a decoupled way to manage your data where you can treat your database system as a blackbox — allowing relatively inexperienced and everyday individuals to manage it for you.
With the relational model, tuples (a.k.a rows) represented an atomic piece of data, much like how in Object-Oriented programming languages, objects were the foundations of state. Each tuple had attributes (a.k.a. columns), much like how objects have properties. Each tuple was generally of the “same type” and thus should be stored in the same relation (a.k.a table) if they had the same ordered attributes. Thus, we can say that:
tuple1<type1, type2, type3> === tuple2<type4, type5, type6>
if and only if
type1 === type4 ^
type2 === type5 ^
type3 === type6Note: === is type equality, not value equality and ^ denotes logical conjunction a.k.a (A^B = true if A=true and B=true)
Tuples are very important as they represent a piece of data’s very existence — much like a social security number. Relational theory highly prioritizes the identification and categorization of data. In a more presentable sense:
name, age, ssn
"john doe", 18, 1234567
"jane doe", 19, 2345678"alien", 20, 1111111, "3 arms" --> This violates the schema and therefore cannot be inserted, when I try to insert it"someHuman", 11, 1234567 --> This abides by the schema and thus, can be inserted
It is highly important that tuples have the same ordered attributes or else you would not be allowed to stored them on the same table! Contrast that with something like MongoDB, where it is easy to do something like this:
"name": "John Doe",
"address": "Bumville Usa"
"name": "Jane Doe",
versus a relational database that enforces a schema. Most businesses derive much of their value from data. Schemas are your first step to organizing your data and allowing you to make generalizations and inferences on its structure. Identifying your data and its structure is a foundation to solving a seemingly incomprehensible problem.
Along with the above concepts, E.F. Codd also emphasized on “normal forms” — a way to eliminate repetitive data and to ensure the avoidance of data insertion, deletion, update anomalies (e.g. updating 3 rows where as you should only be logically updating 1 row (tuple), inserting a logically equivalent piece of data as 2 rows (tuples) because it has contents that are different from one another (e.g. storing the same class information, twice — as 2 rows, because 2 different students are enrolled in the same class). In short, normalization is here to make the task of managing data in a database as easy and logical as possible. I will be covering that in the Part 2 of my write-up.