Paper Review: What Comes Around Goes Around

7 minute read

Published: March 24, 2020

Context

This paper aims to compare the different data model proposals over the last 40-odd years in order to educate the current research era of database researchers what has already been tried. In the paper, the following simple relational schema from Codd’s paper on the relational model is used:

    Supplier(sno, sname, scity, sstate)
    Part(pno, pname, psize, pcolor)
    Supply(sno, pno, qty, price)

Hierarchical Model (IMS)

Each record is stored in a tree structure and has a unique parent record, unless it is the root of a tree. An IMS database is a collection of these trees. In order to sort the data for manipulation, IMS used a hierarchical sequence key (HSK) that essentially ordered all records depth-first, left-to-right.

This model is fairly idosyncratic and has some drawbacks, which include that it repeats information and existence depends on parents. For example, if a tree has a schema where each Part record belongs to a Supplier, that same part record could be repeated for every supplier in the database. Let’s say now that a tree had a schema where each Supplier record instead belonged to a Part. With this schema, a supplier couldn’t exist that had no Parts.

One lesson we can get from IMS is that the tree data structures they used are restrictive. The biggest lesson from this is the concept of independence, both physical and logical. Physical independence is the notion that the database can still work even if the storage layer is modified. Logical independence means having the programmer interact with a representation that is hopefully more flexible than the physical representation, which allows the underlying data to change and for the database to still work.

Network Model (CODASYL)

CODASYL organizes record types into a network where each node is a record and each edge is the relationship between the record types. For example, the Supplier relational schema would be organized so that Supply has two parents: Supplier (which has the relationship “Supplies”) and Part (which has the relationship “Supplied_by”). One of the benefits of this model is that unlike the Hierarchical model, a child can have multiple parent record types. The drawbacks of the CODASYL model are that it still provides no logical or physical independence and it is more complex than IMS. Also loading and recovery were worse in CODASYL because the data was harder to break up.

Relational Model

Ted Codd’s motivation was better data independence. Codd’s overall proposal was

Store data in a simple data structure (tables)
Access through high level data manipulation language (think sql-like languages)
No need for physical storage proposal

The relational model was flexible enough to represent almost anything. The lesson here for physical independence is that set-at-a-time languages are helpful. For logical independence,it is that a simple data model like tables are good. Additionally, relational databases showed that query optimizers can beat almost any DBMS programmers.

Entity-Relationship Model

Peter Chen thought of a database as a collection of entities. Each entity could have attributes, like Supplier has the attributes sno, sname, scity, sstate. There could be relationships between entities, like Supply is the relationship between Part and Supplier, that are 1-1, 1-n, n-1, or m-n. A relationship like Supply could have its own attributes, like quantity and price.

For whatever reason the E-R model stagnated and wasn’t really implemented in a DBMS. But, it found good use for database schema design. The lesson learned from the E-R model is that functional dependencies (the one sproposed by normalization theory) we’re too complicated, so people used the arrows and boxes from the E-R model. Tldr: KISS (Keep It Simple Stupid).

Extended Relational Model (R++ era)

During this period of time (starting in early 80’s), people wrote papers on applications that had poor performance when implemented in a relational DBMS. This lead to some proposals for extensions to the relational model. Some of the requested extensions included set attributes, aggregation (being able to refer to foreign keys without a join), and generalization (inheritance for databases). Relational database companies were more focused on improving the scalability and performance of their systems, and used little of the R++ ideas. The lesson from the R++ era is that new suggestions won’t go anywhere without a big performance or functionality improvement.

Semantic Model

Semantic Data Models (SDM) use the idea of classes, which were a collection of records that follow the same schema. SDM uses the aggregation concept from R++ by letting a record have an attribute that is a set of records from some class. Also, SDM supports multiple inheritance. Classes can be the union, intersection, or difference between multiple classes. For example, American_Oil_Tankers could be a class that is the union of two classes: Oil_tankers and American_ship. Semantic data models had little long term effect because of some of the same reasons as R++. Also, it wasn’t too hard to simulate the good ideas on top of the relational model.

Object-oriented Model

The problem that the community identified was an “impedance mismatch” between relational databases and languages like C++. THe answer to this problem would be a persistent programming language, which would represent disk-based data as well as in-memory data and database operations would be built into programming language.

There were some major efforts to create commercial versions of OODBs. Why did that fail? Some major reasons for the failure were having no large value added and that if there were any applications that were not written in C++, the company couldn’t use the OODB products. The lessons to be learned from the object-oriented model era are that people won’t buy software unless it helps them alleviate a major pain and that persistent languages won’t succeed without the backing of the programming language community.

Object-relational Model

The OR proposal was motivated mainly by the fact that simple GIS queries are hard to express using SQL and they don’t perform well on standard B-trees. Because of the GIS problem, the OR proposal noted that the ability to customize a DBMS to their particular needs. Concretely the OR proposal added the following to a SQL engine:

user-defined data types
user-defined operators
user-defined functions
user-defined access methods

One of the major benefits from the OR model was putting code in the database and allowing people to use user-defined access methods. One of the things that prevented complete market penetration was the lack of standards for UDFs. Without standards, a really big company has to push hard for that new technology. In the relational era, that was IBM.

Semi-structured (XML) Model

Schema last is when a schema isn’t required in advance. Due to this, each record is essentially a “bucket of bits”. In order to make sense of records, each attribute must be tagged with some metadata that describes it. While some people claim that this schema-last data models can be a DBMS in themselves, these are probably a niche market. It has been suggested that XML might be able to fix problems with semantic heterogeneity, which is when the information in an object doesn’t follow a standard convention. This is actually the opposite, XML makes it harder to deal with this issue.

Comments

The idea of this paper was to look at the evolution of data model proposals over time and see which ones bore resemblances to earlier proposals. It makes sense that a lot of data models will just be repackaging old ideas, because quite a bit has been tried since the first databases were created. The paper concludes by saying that seemingly the only really new ideas seem to be: code in the database and schema last systems.

Share on

Twitter Facebook LinkedIn

Peter Griggs