The Relational Model for Database Management: Version 2

Author: E. F. Codd
3.0
This Month Stack Overflow 2

Comments

by anonymous   2019-07-21

Re "atomic"

In Codd's original 1969 and 1970 papers he defined relations as having a value for every attribute in a row. The value could be anything, including a relation. This used no notion of "atomic". He explained that "atomic" meant not relation-valued (ie not table-valued):

So far, we have discussed examples of relations which are defined on simple domains--domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements.

He used "simple", "atomic" and "nondecomposable" as non-relational informal expository notions. He understood that a relation has rows of which each column has an associated name and value; attributes are by definiton "single-valued"; the value is of any type. The only structural property that matters relationally is being a relation. It is also just a value, but you can query it. Then he used "nonsimple" etc meaning relation-valued.

By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:

From a database perspective, data can be classified into two types: atomic and compound. Atomic data cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions). Compound data, consisting of structured combinations of atomic data, can be decomposed by the DBMS.

In the relational model there is only one type of compound data: the relation. The values in the domains on which each relation is defined are required to be atomic with respect to the DBMS. A relational database is a collection of relations of assorted degrees. All of the query and manipulative operators are upon relations, and all of them generate relations as results. Why focus on just one type of compound data? The main reason is that any additional types of compound data add complexity without adding power.

"In the relational model there is only one type of compound data: the relation."

Sadly, "atomic = non-relation" is not what you're going to hear. (Unfortunately Codd was not the clearest writer and his expository remarks get confused with his bottom line.) Virtually all presentations of the relational model get no further than what was for Codd merely a stepping stone. They promote an unhelpful confused fuzzy notion canonicalized/canonized as "atomic" determining "normalized". Sometimes they wrongly use it to define realtion. Whereas Codd used everyday "nonatomic" to introduce defining relational "nonatomic" as relation-valued and defined "normalized" as free of relation-valued domains.

(Neither is "not a repeating group" helpful as "atomic", defining it as not something that is not even a relational notion. And sure enough in 1970 Codd says "terms attribute and repeating group in present database terminology are roughly analogous to simple domain and nonsimple domain, respectively".)

Eg: This misinterpretation was promoted for a long time from early on by Chris Date, honourable early relational explicator and proselytizer, primarily in his seminal still-current book An Introduction to Database Systems. Which now (2004 8th edition) thankfully presents the helpful relationally-oriented extended notion of distinguishing relation, row and "scalar" (non-relation non-row) domains:

This definition merely states that all [relation variables] are in 1NF

Eg: Maiers' classic The Theory of Relational Databases (1983):

The definition of atomic is hazy; a value that is atomic in one application could be non-atomic in another. For a general guideline, a value is non-atomic if the application deals with only a part of the value.

Eg: The current Wikipedia article on First NF (Normal Form) section Atomicity actually quotes from the introductory parts above. And then ignores the precise meaning. (Then it says something unintelligible about when the nonatomic turtles should stop.):

Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS." Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)" meaning a field should not be divided into parts with more than one kind of data in it such that what one part means to the DBMS depends on another part of the same field.

Re "normalized" and "1NF"

When Codd used "normalize" in 1970, he meant eliminate relation-valued ("non-simple") domains from a relational database:

For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization.

Later the notion of "higher NFs" (involving JDs (join dependencies)) arose and "normalize" took on a different meaning. Since Codd's original normalization paper normalization theory always given results applicable to all relations not just those in Codd's 1NF. So one can "normalize" in the original sense of going from just relations to a "normalized"/"1NF" form without relation-valued columns and one can "normalize" in the normalization-theory sense of going from just relations aka "1NF" to higher NFs while ignoring whether domains are relations. And "normalization" is also (ill-)used for designing a relational version of a non-relational database (whether just relations and/or some sense of "1NF").

Relational spirit is to eschew multiple columns with the same meaning or domains with interesting parts in favour of another base table. But we must always come to an informal ergonomic decision about when to stop representing parts and just treat a column as "atomic" (non-relation-valued) vs "nonatomic" (relation-valued).

by anonymous   2019-07-21

Re your original question:

Your organization and reasoning are unsound. First give the all the FDs. Eg this determines the CKs. Eg you cannot reason soundly on just giving the (alleged) CKs (which imply certain FDs) and a couple of non-FDs. Eg "non-transitively dependent" cannot be determined without knowing all the FDs. Only then can you write sound bullets & answer your numbered questions.

But let's assume that {first_name,last_name} and {person_id} really are the only CKs and that there are no FDs other than those implied by the fact that each CK determines every attribute not in it.

Functionally, there is no relationship at all between this id number and the birthday.

I don't know what you mean by "Functionally, there is no relationship at all between". Maybe you are trying to say that {person_id} does not functionally determine {birthday}. But it does, because a CK determines all attributes not in it. Maybe you mean you don't see an application constraint between people ids and birthdays and/or a table constraint involving a table's person_id and a birthday values. But there is: A given person only has one birthday at a time, and in the table a person_id only ever has one birthday at a time. This is a consequence of the meaning of and rules around "people", "birthdays", person_id and birthday. The constraint on person_id and birthday is expressed by "{person_id} -> {birthday}" and you have to know whether it is the case as part of determining the intial list of all FDs (that precedes determining CKs).

S transitively determines T when there exists an X such that S -> X and X -> T and not(X -> S). S non-transitively determines T when it doesn't transitively determine it.

  1. Does this mean that there is a transitive dependency (birthday depends on [first_name, last_name], and each combination [first_name, last_name] maps to an id) and therefore not in 3NF?

I don't know what you are trying to say by "each combination maps to an id" let alone why it implies non-3NF. Maybe you are trying to say that taking {person_id} as S and {birthday} as T and {first_name, last_name} as X we have S -> X and X -> T so (wrongly) a non-prime attribute is transitively dependent on a CK so the relation is not in 3NF. But you did not satisfy not(X -> S).

For {person_id} as S and {birthday} as T the only possibility for X -> T has {first_name,last_name} as X but X -> S because X is a key so S -> T is not transitive.

Similarly for {first_name,last_name} as S and {birthday} as T the only possibility for X -> T has {person_id} as X but X -> S because X is a key so S -> T is not transitive.

  1. Does this mean that there is no dependency at all, and therefore not in 3NF?

Since the relation in in 2NF and every non-prime attribute is non-transitively dependent on every CK, the relation is in 3NF.

  1. Am I misinterpreting the difficult language and is this Table in 3NF?

You didn't claim it was or wasn't, did you?

(Please edit your question to use proper technical terms.)

Re your EDIT version

(You acknowledged in comments that your last bullet was supposed to have CK 2 and that it was unsound. And that my guesses at your unclear phrasings were more or less what you meant.)

  • All fields can only contain atomic values, so the Table is in First normal form.

Normalization only makes sense for relational "tables", ie relations. That means unique unordered attributes ("columns") and tuples ("rows"). With one value per attribute per tuple. All relations are in 1NF.:

A relational table is always in 1NF. Each column of a row has a single value of the column's type. A non-relational database is "normalized" to tables ie 1NF (first sense of "normalized") which gets rid of repeating groups. Then those tables/relations are "normalized" to higher normal forms (second sense of "normalized").

"Atomic" is not helpful: "Atomic" originally meant not a relation.:

In Codd's original 1970 paper he explained that "atomic" meant not a relation (ie not a table):

So far, we have discussed examples of relations which are defined on simple domains--domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements.

By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:

From a database perspective, data can be classified into two types: atomic and compound.

In the relational model there is only one type of compound data: the relation.

And a relation is a single value so there's nothing wrong with relation-valued attributes. (Pace Codd's changing opinion on that.)

  • The candidate keys are 1) {person_id}, 2) {first_name, last_name}
  • The only non-prime attribute is {birthday}.

To normalize you must know for every subset of attributes what attributes are (non-trivially) functionally dependent on it. Although every superset of a determinant determines what it does, so that takes care of a lot of them. You skipped that step.

You cannot show that {first_name,last_name} is a CK without showing that {first_name} and {last_name} aren't CKs via what each determines. Assuming you do, you still won't have considered remaining possible determinants {first_name,birthday} and {last_name,birthday}.

You cannot show that those are the only CKs until you show that there are no other CKs. You must show for every subset of attributes whether it is a CK. Although no superset of a CK is a CK, so that takes care of a lot of them. There are algorithms.

  • There is an FD {person_id} -> {birthday}, so the attribute {birthday} is non-transitively dependent on CK 1
  • There is an FD {first_name, last_name} -> {birthday}, so the attribute {birthday} is non-transitively dependent on CK 2

Your new last two bullets are unjustified. Look at my message's definition and use of "(non) transitively dependent"; just knowing S -> T does not tell you enough. When there's a non-transitive FD S -> X -> T it must also be that S -> T; so knowing S -> T alone does not tell you about whether S transitively or non-transitively determines T. "->" does not mean "directly"; non-transitively is the only meaningful notion of "directly".

Maybe by "so" you mean "so as shown below for the first of these two cases"?

There is a dependency {person_id} -> {first_name, last_name} -> {birthday}, but since there is also a direct dependency {person_id} -> {birthday}, this dependency is not transitively.

See above: "direct" is a misconception. And as I said in my original answer it's since {first_name, last_name} -> {person_id} for CK1 and {person_id} ->{first_name, last_name} for CK 2.

I don't have a predefined set of FDs from a book, so I am not sure whether the FDs are correct. Can someone confirm this, or if they don't look right, show how I can find the FDs in this practical example?

You have to consider every possible value the table can have due to every possible application situation that can come up and the criterion (predicate) by which you are to put rows into the table vs leave them out. You can probably think of counterexamples to putative FDs, where two rows can share the same value for a putative determinant. Eg for {first_name,birthday} and {last_name,birthday} you can expect two different people to have the same name and birthday. (You can check the last two putative FDs.)

(Now your language is clearer. Roughly speaking your errors (still) come from not using definitions and skipping steps.)

Re your second EDIT version:

It now seems like you have probably done everything soundly. (Although I can't know for sure because you don't specifically make clear that there are no more 2-element attribute sets & there are no more attribute sets; why that pair is the set of CKs; and the 2NF/3NF "therefore"s.)

Phrasings like "If you know a person's last_name and birthday, you don't know any other field of this person" are problematic. Me: If I only know two fields, of course I don't know others; so there's never a FD? You: For a person. Me: But if I know the person then I know their first_name; so there is an FD? You: If you know first_name and birthday for one person but not who; you don't know any other field. Me: Sometimes I do know other fields; so the implication is false; so there's an FD? It turns out that "know" is a super-confusing word good to avoid. Write, "Given ... there exists ...". As you did in "(there cannot be multiple ...)".