What is Normalization in RDBMS?
Normalization is the process of reducing or eliminating data redundancy by correcting table structures inside a relational database management system. As of 2022, there are 7 normal forms: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, and 6NF.
To understand how this works, we will take this table example.
We will normalize it from what it is now, unnormalized, up to 3NF. So, it will become this.
The idea here is to get a basic understanding of the process of normalizing data. In each step, we will identify what anomalies are present and how we can reduce or eliminate them by normalizing this table.
What are the types of normalization in RDBMS?
We have already mentioned above the 7 types of normal forms. Repeating here, they are 1NF, 2NF, 3NF, BCNF (3.5NF), 4NF, 5NF, and 6NF.
3NF is the form most commonly required in OLTP (Online Transaction Processing) systems. BCNF, or Boyce-Codd normal form, is considered an upgraded version of 3NF, hence also called 3.5NF. If anomalies are still present after 3NF, BCNF is the next step to consider. However, when we achieve 3NF, most of the time we have achieved BCNF as well; it depends a lot on the complexity of the data.
The higher normal forms are not used that much because they can cause too many joins without much additional help in eliminating redundancy. So, 3NF and 3.5NF are, 99.999% of the time, enough for transactional databases.
For that reason, I will not even mention anything above 3NF. In my 10 years of database development, I never had to handle those situations because they just do not come up.
For OLAP (Online Analytical Processing) systems, 2NF is mostly used. In this case, we will see more data redundancy, but given how the data will be used, that is OK. On top of that, the low cost of storage allows extra redundancy in these analytical tables to speed up data retrieval.
What is the purpose of normalization?
Anomalies are everywhere! When trying to INSERT, UPDATE, or DELETE records from a table, anomalies can be an issue. Anomalies are one of the main culprits in databases. They cause development issues, a lot of maintenance, and consequently poor performance on the database.
A good indication that the tables in a database have anomalies is data redundancy. The more redundant the data is, the more anomalies the structure has. In fact, it is hard to say whether the anomalies cause the data redundancy or the data redundancy causes the anomalies.
I have a good answer, in case you ask me: a bad database design causes both data anomalies and redundancy.
In this article, we will see the what, the when, and the how of these anomalies.
What are they?
When do they mess things up?
How to fix them?
This is essential for database developers and designers. This is the type of work that will make or break you as a database professional. A poorly designed database becomes famous in an organization, and so does its creator.
The Normalization process
To start normalizing a table, we must first assume that no value in the table can be NULL. Let’s keep that in mind! The truth is, columns in a database were not made to hold NULL. However, there are scenarios where it is impossible to have a value. A good example is a middle name: not everyone has one, but most people do. So, what can we do if we need to save that information? We save it as NULL. This is an exceptional case.
If we need a good indication of a lack of normalization, we can check whether the table has NULL values. If it does, it is a good candidate for normalization.
Business rules are also very important to know before starting the process. If the normalization is done correctly, any application developed on top of the database will resemble the business rules and processes so well that it will be much easier for every developer to work with it.
Less maintenance will be required, and when it is, it will be much easier to do. Remember that database structure changes are not welcome once applications are using the database.
First Normal Form (1NF)
Let me tell you a secret, you will find this type of table in many organizations. It is sad, but it is a true story. Don’t blame the current database designer.
These tables are usually legacy tables, and as I said before, it is hard to modify those without breaking an application somewhere.
So, if you see something like this, go with the flow and if the opportunity to redesign comes up, give your best to make it right this time.
The standard says that for a table to be in 1NF it must:
- Not have repeating groups (repeating groups come in two flavors; we will talk about both).
- Have unique rows.
In this table, in the first row as highlighted, notice that the column PhoneNumber has 2 values separated by a comma: (123) 456-7890 and (123) 564-0987. That is one flavor of repeating groups. In this case, they are part of the same column.
Why is that a problem? Let’s say that we want to INSERT another phone number for Joe Black. How hard would it be to do that through the RDBMS? Even though it is possible, it is not a smooth operation. It would take some programming logic. An RDBMS is not made for this; it is made to do transactions in the simplest way possible.
What if we want to UPDATE or DELETE one of those phone numbers? Again, it would take so much work to be able to do that without messing up the other phone number.
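To make the pain concrete, here is a minimal sketch using Python’s sqlite3 (the table name, columns, and data are illustrative, not the exact table from the figures): changing just one phone number inside a comma-separated column forces string surgery with REPLACE instead of a plain UPDATE of a single value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (FirstName TEXT, LastName TEXT, PhoneNumber TEXT)")
conn.execute("INSERT INTO People VALUES ('Joe', 'Black', '(123) 456-7890, (123) 564-0987')")

# To change just the second phone number, we cannot address it directly;
# we have to rewrite the whole string with REPLACE.
conn.execute("""
    UPDATE People
    SET PhoneNumber = REPLACE(PhoneNumber, '(123) 564-0987', '(123) 111-2222')
    WHERE FirstName = 'Joe' AND LastName = 'Black'
""")

row = conn.execute("SELECT PhoneNumber FROM People").fetchone()
print(row[0])  # (123) 456-7890, (123) 111-2222
```

And this only covers an exact-match edit; deleting one number, or inserting a third, needs even more string logic that the database engine was never meant to do.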
Those are the anomalies that this type of design is causing now. There are more anomalies, and we’ll see them as we go along.
But first, let’s try to fix that! We will do it by making each column hold only 1 value. In other words, we will make the values atomic.
We fixed it, right? Not so fast!
We made the columns atomic: we created another column, and each phone number is now in a column of its own.
However, we still have repeating groups. The newly created column is holding the same type of information as the other column. That is the second flavor of repeating groups.
Note that we may find those two repeating-group variations on their own or together.
Well, thanks for nothing! We have fixed nothing.
In fact, we introduced new anomalies. Let’s check it!
First, I had to create a value called “None” for PhoneNumber2, since some members do not have a second phone number. I put None so that I don’t have NULL values.
Remember, we do not want NULL values during this process. But “None” and NULL are pretty much the same thing here, and “None” does not follow the phone number pattern (xxx) xxx-xxxx. Values should be consistent throughout a column.
What if they all had valid phone number values? We would still have problems. Let me show you!
Let’s also try to INSERT another phone number for Joe Black in the first row. How am I going to do that? There is no place left to insert it. So, I can’t. With this design, all users can have at most 2 phone numbers.
In fact, they must have exactly 2 phone numbers; as we discussed in the paragraph above, only 1 will not work either, because of the NULL values.
What if we try to UPDATE one of these phone numbers? We can, as long as we update it to another value in the phone number format (XXX) XXX-XXXX. However, if more people use the same phone number, we need to update it in more than 1 place. We will see in 2NF that the same rule applies there.
What if we try to DELETE a phone number? In this case, it is all or nothing as far as the row goes. To delete that phone number, I would have to DELETE Joe Black as a whole. We probably do not want to do that!
Can we fix it please? You got it!
Cool, we have eliminated the repeating groups and all attributes are, now, atomic.
Now, let’s check if each row is unique. By unique I mean that no two rows can have exactly the same values in every attribute.
The first 2 rows are different because the phone numbers are different. The last row is also unique because phone number and street address are different.
However, we still have issues. Now, we know that the first 2 rows are related to the same person. Will I remember that tomorrow? How will the database know that it is the same person? What about the last row? Are we talking about the same person here as in the first 2 rows?
We have unique rows, so in truth, this is in 1NF. But what if I need to INSERT yet another Joe Black tomorrow? That will add even more complexity to this table.
What if I need to DELETE one of them? How will the database know which one to DELETE? The same goes for UPDATE: how will the database or application know?
What is a candidate key?
A candidate key is an attribute or a set of attributes that can uniquely identify tuples in a table. When identifying candidate keys, one of them will usually become the Primary key of a particular table.
The way the table is laid out now, one candidate key is composed of the attributes (FirstName, LastName, PhoneNumber); another option could be (FirstName, PhoneNumber). Notice that they are responsible for the uniqueness of each row. The first one, though, is the better one, and here is the reason.
I don’t know the history of this table and what can be in it, but imagine that one of those phone numbers was a home phone number and another Joe lives there with the same phone number. We would then have to find another attribute. Maybe the last name would do it, or maybe not!
What if Joe has a son, names him Joe Black as well, and eventually the son inherits his father’s phone number, or the phone number is again a home phone? Some of you are probably asking: what is a home phone?
So you know, a long time ago, when people were very primitive, they used this type of device. The device’s number belonged to everyone who lived in that place. How do I know? I lived in that period.
Now we are talking about different Joes, but the database does not know that. So, how do we fix that? I have an idea: let’s add the date of birth to the table and make it part of the candidate key.
Sure, that would work. However, I would be adding another attribute for the application to manage, and at the end of the day, if the Joe in yellow was born on the same day, we would have the same issue.
Then we can make the StreetAddress part of the candidate key, right? Sure, but this key is getting complicated. It will be hard to manage and to develop applications on top of. Is there an easier way? Yes, there is!
What is a surrogate key?
A surrogate key is a candidate key created by database designers for the purpose of uniquely identifying a row. It has no business meaning attached to it, but it helps developers and applications keep control over the database tables.
In the table above we saw how difficult it was to get a solid candidate key that could eventually be converted to a primary key. That is one of the reasons surrogate keys shine.
We will talk more about this type of key in a bit, but before that, let’s talk about…
What is a natural key?
A natural key is simply an attribute or a set of attributes with business meaning that identifies tuples. The definition sounds a lot like the one for candidate keys. Well, you got it: candidate keys are pulled out of natural keys; in other words, they are the same.
However, we must be careful using them, because they can be misleading when determining a primary key. Some very “good” examples of natural keys are Social Security numbers, VIN numbers, and emails.
They seem to be very natural until they are not anymore. Are Social Security numbers really unique per person? Based on the number of people that have lived, live, and will live here in the US, it is mathematically impossible. And if the government ever changes the number of characters, it can mess up our chosen data type. In other words, it is out of our control.
Want a great example? Think of IP addresses. They are supposed to be unique per network. They still are, so they look like a great natural key; however, because IPv4 did not have enough numbers to identify all the networks in the world, someone had to create IPv6.
What changed? The number of characters, among other things. If a database designer in the past decided to use the old IP addresses as a primary key, they now have to make a huge update to the application and the database structure.
Again, control is lost!
What I am really trying to say is that we should not use natural keys as primary keys!
Let’s create a surrogate key and make this table 1NF once and for all!
Now, we have a table with unique rows of atomic attributes and no repeating groups. This table is in 1NF!
The created surrogate key is not the primary key just yet, but it is part of a candidate key. In this case, the new candidate key is (PersonID, PhoneNumber). There is no need for FirstName and LastName anymore. PersonID made our lives that easy: now we know that the first 2 rows are the same Joe and that the last row is a different Joe.
With those two columns as the candidate key, I am now able to uniquely identify the rows, right? Sure!
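As a quick sketch in Python’s sqlite3 (table and column names assumed from the walkthrough, data illustrative), we can even have the database enforce that uniqueness with a UNIQUE constraint on the candidate key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        PersonID TEXT, FirstName TEXT, LastName TEXT, PhoneNumber TEXT,
        UNIQUE (PersonID, PhoneNumber)  -- the candidate key keeps rows unique
    )
""")
conn.execute("INSERT INTO Person VALUES ('Joe1', 'Joe', 'Black', '(123) 456-7890')")
conn.execute("INSERT INTO Person VALUES ('Joe1', 'Joe', 'Black', '(123) 564-0987')")
conn.execute("INSERT INTO Person VALUES ('Joe2', 'Joe', 'Black', '(123) 555-1234')")

# A duplicate (PersonID, PhoneNumber) pair is rejected by the database itself.
try:
    conn.execute("INSERT INTO Person VALUES ('Joe1', 'Joe', 'Black', '(123) 456-7890')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```

The constraint is doing exactly what a candidate key promises: no two rows may share the same (PersonID, PhoneNumber) pair.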
If that is the case, we can INSERT more rows and be very happy with our accomplishment, let’s do that! I am inserting a few more records.
We added Joe1 again, but this time with work address information. We also added Carl1 for further research.
Going back to Joe1: we still had 1NF when we added him, because the new addresses keep the rows unique. But (PersonID, PhoneNumber) cannot be our candidate key anymore, since it can now yield duplicates.
Second Normal Form (2NF)
The standard says that for a table to be in 2NF it must:
- Be in 1NF
- Eliminate partial dependencies.
The table above, even though it is in 1NF, has a few partial dependencies.
What are functional dependencies?
A functional dependency (FD) is a relationship between table attributes in which the value of one or more attributes determines the value of another attribute.
Let’s think about this mathematically. What does it mean to have f(x) = x + 2?
It means that as I change the value of x, the value of f(x) may change. However, if I use the same value of x, the result will always be the same.
So, try this out. Make x = 2 and the value you get for f(x) is 4. Try that 100 times. You will get the same result 100 times. Don’t believe me? Try it yourself.
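Here is that experiment as a tiny Python sketch:

```python
def f(x):
    # A function in the mathematical sense: the same input
    # always produces the same output.
    return x + 2

# Call it 100 times with x = 2 and collect the distinct results.
results = {f(2) for _ in range(100)}
print(results)  # {4} -- one distinct result, every single time
```

That determinism is exactly what a functional dependency demands of the data.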
The same rule applies here, but in this case, the primary key plays the role of x and the attributes play the role of f(x).
This is the notation we will use for this: A -> B. It means that A determines the value of B.
So, let’s use that logic to evaluate each attribute in our table.
Considering that PersonID = ‘Joe1’
PersonID -> FirstName
Will we always get the same FirstName? The answer is yes, so we know for a fact that PersonID determines FirstName. Let’s try this now.
PersonID -> FirstName, LastName
That is also true, so PersonID determines both attributes. What about this?
PersonID -> FirstName, LastName, PhoneNumber
This is false. We can see that PersonID yields 2 different phone numbers. So, we must find a way to always yield the same results.
What should we do? We just move PhoneNumber to the left, making it part of the candidate key.
PersonID, PhoneNumber -> FirstName, LastName
This is true, so now we are back in business. Let’s keep moving with this.
PersonID, PhoneNumber -> StreetAddress
This is false, so I will move StreetAddress to the left as well.
PersonID, PhoneNumber, StreetAddress -> FirstName, LastName
Ok, got it? Keep doing that with all the attributes and you will eventually get to this.
PersonID, PhoneNumber, StreetAddress -> FirstName, LastName, ZipCode, City, AddressType
I am not sure if you noticed, but you just learned how to find a candidate key.
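This trial-and-error check can be automated. Below is a small Python sketch (the helper and its sample rows are my own illustration, not part of the original walkthrough) that tests whether a functional dependency holds in a set of rows by grouping on the determinant:

```python
def holds(rows, determinant, dependents):
    """Return True if determinant -> dependents holds in this data."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependents)
        if seen.setdefault(key, value) != value:
            return False  # same determinant, different dependents
    return True

rows = [
    {"PersonID": "Joe1", "FirstName": "Joe", "PhoneNumber": "(123) 456-7890"},
    {"PersonID": "Joe1", "FirstName": "Joe", "PhoneNumber": "(123) 564-0987"},
]

print(holds(rows, ["PersonID"], ["FirstName"]))                 # True
print(holds(rows, ["PersonID"], ["FirstName", "PhoneNumber"]))  # False
print(holds(rows, ["PersonID", "PhoneNumber"], ["FirstName"]))  # True
```

Keep in mind that this only tests the data we have; as discussed above, conceptual thinking still decides whether a dependency holds in general.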
Anyway, (PersonID, PhoneNumber, StreetAddress) is my candidate key. Together, these attributes form a superkey.
A candidate key is a minimal superkey, that is, a superkey without any unnecessary attributes.
There is one more term to remember about them. Each of these attributes, separately, is called a prime attribute.
The columns that are not part of the candidate key are called non-prime or non-key attributes. Hold those definitions in your thoughts. We will need them from now on to understand dependencies.
What is a partial dependency?
Partial dependency is when a non-prime attribute depends on only a portion of the candidate key. In other words, the non-prime attribute does not depend on all the prime attributes.
We determined this previously:
PersonID, PhoneNumber, StreetAddress -> FirstName, LastName, ZipCode, City, AddressType
We now need to find out whether each non-prime attribute depends on all the key attributes.
Let’s start with FirstName. Does it depend on the PersonID prime attribute? Yes, it does, because PersonID is a surrogate key used to identify that person. PersonID will always give the same FirstName and LastName.
Does FirstName depend on PhoneNumber? If we look at the data now, it looks like it does, but it does not.
The reason goes back to Joe having a son with the same name, living in the same house, with the same phone number. We can see that the dependency may not always hold.
This is called conceptual design. Sometimes we don’t have all the data, or are not able to analyze it all, so we need to either ask for the business rules or think through the possibilities.
We can clearly see that FirstName does not depend on StreetAddress. So, FirstName and LastName really depend only on PersonID. They do not depend on the whole candidate key, so they are partially dependent.
What about ZipCode? Is it dependent on PersonID? Notice that Joe1 and Carl1 have the same ZipCode. So the answer is no!
The same happens for City and AddressType. They do not depend on PersonID or PhoneNumber; they only depend on StreetAddress.
One more thing to notice here: PhoneNumber stands by itself right now. It is a prime attribute, but if we think conceptually, we can see that the best approach would be to have a surrogate key for the phone as well.
In that case, the surrogate key would be the candidate key and PhoneNumber would be a non-prime attribute. That would make PhoneNumber dependent only on the Phone surrogate key.
The reason I left it this way is that, when structuring data, no one will hand us the perfect dataset. We have to keep doing conceptual thinking and design all the time.
Another good way to see the issue with this phone number is to check whether it repeats in the table. As we can see, it does. So, this phone depends only on itself.
Let’s now understand what issues these partial dependencies cause.
If we try to INSERT another address record for anyone who already exists, we also need to insert the same phone information, FirstName, and so forth. That causes redundancy again. See the last 2 rows! We already have Joe1 4 times because of the different phone numbers and addresses.
If we try to UPDATE Joe1’s addresses, we need to do it in more than 1 place. Also, if we update Jenn1’s City to San Diego, we lose the record of Los Angeles. Los Angeles is no longer in our database.
If we need to DELETE ZipCode 96321, we also need to delete Joe2 and his entire row.
So, those are the most common anomalies when a table is not in 2NF. We need to fix that and make each entity its own table. So, let’s start breaking this table apart.
First, we will create a core table called Person.
Let’s test it!
PersonID -> FirstName (Pass)
PersonID -> FirstName, LastName (Pass)
PersonID is a candidate key, a superkey, a surrogate key, and now my primary key.
FirstName is a non-prime attribute
LastName is a non-prime attribute
Notice that each Person is found only once in the table. There are no redundancies in this table.
Next, we will create a list table for Phone numbers.
There is more information that could be saved here, like a phone type. For simplicity, let’s keep it as-is for now.
Next, we will create another core table, this time for addresses.
Notice the creation of another surrogate key called AddressID. I created it for the same purpose as PersonID. If we use this surrogate key as the primary key, we have this:
AddressID -> StreetAddress, ZipCode, City, AddressType
We can see that AddressID determines every single attribute in the table. All the other columns are now non-prime attributes.
Next table will be a lookup table named AddressType.
This table is very simple and very similar to the Phone table.
We have created those entity tables based on what they really are. There are no partial dependencies in them anymore. Now we need to connect them so that they serve the intended purpose.
To connect phone numbers with people we will create a cross-reference table called PersonPhone.
This table holds the connection between Person and Phone in a many-to-many relationship. If we need to INSERT another phone for Joe1, we just INSERT a record into the Phone table and then INSERT another record into the PersonPhone table.
If we need to DELETE a phone record from anyone, we just DELETE that record from the PersonPhone table.
If we need to UPDATE one phone number, we need to do it in the Phone table and in the PersonPhone table. I will show you later how to update it in one single place only.
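Here is a sketch of those operations in Python’s sqlite3, with table and column names assumed from the walkthrough and illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID TEXT PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Phone  (PhoneNumber TEXT PRIMARY KEY);
    CREATE TABLE PersonPhone (
        PersonID TEXT REFERENCES Person(PersonID),
        PhoneNumber TEXT REFERENCES Phone(PhoneNumber),
        PRIMARY KEY (PersonID, PhoneNumber)  -- composite key
    );
    INSERT INTO Person VALUES ('Joe1', 'Joe', 'Black');
""")

# Give Joe1 another phone: one INSERT into Phone, one into PersonPhone.
conn.execute("INSERT INTO Phone VALUES ('(123) 999-0000')")
conn.execute("INSERT INTO PersonPhone VALUES ('Joe1', '(123) 999-0000')")

# Take that phone away from Joe1: DELETE only the link row.
conn.execute(
    "DELETE FROM PersonPhone WHERE PersonID = 'Joe1' AND PhoneNumber = '(123) 999-0000'"
)

count = conn.execute("SELECT COUNT(*) FROM PersonPhone").fetchone()[0]
print(count)  # 0 -- the link is gone; the Person and Phone rows are untouched
```

Notice how each INSERT and DELETE touches exactly one row per table, with no risk of leaving redundant copies behind.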
The next table will also be a cross-reference table, this time connecting People with addresses and address types.
This one also behaves similarly to the PersonPhone table. We can see that the primary keys of the tables created above came into this table.
The association of all these keys creates the candidate key for this table. In this case, the candidate key is composed of three keys that, together, can become the primary key.
Here is another term to learn: composite key. When a primary key has more than 1 column, we call it a composite key.
There are mainly three keywords to remember when decomposing a table from 1NF to 2NF:
- Partial dependencies
- Composite keys
- Many-to-many relationships.
I know we want to use the relationship connections to connect them, but wait a couple more minutes. We will get there.
Before we do that, let’s pay attention to this particular table that we created.
If we add another row where the address is again in Miami, we will have two addresses with Miami’s name.
If we add millions of addresses in Miami, we will have millions of records with the name Miami.
So, there can be a lot of redundancy in city names. I do not expect Miami to change its name anytime soon, but what if it has to? We would need an UPDATE that changes millions of records.
Let’s try to DELETE addr4. What if that was the only record with Portland as a city? Portland would be gone from our database for good, until another address there was inserted.
So, those anomalies must also be fixed.
Third Normal Form (3NF)
The standard says that for a table to be in 3NF it must:
- Be in 2NF
- Eliminate transitive dependencies
What is transitive dependency?
Transitive dependency is when a non-prime attribute depends on another non-prime attribute. In other words, it is when an attribute depends on another attribute that is not part of the primary key.
To test that, we first identify the non-prime attributes and then check whether any of them depends on another.
Let’s start analyzing the columns in the Address table. We know that the primary key here is the surrogate key AddressID, so for now I will ignore that column.
Moving to StreetAddress: is StreetAddress dependent on ZipCode? How many 100 Washington Avenue addresses are there in the country?
In the U.S.A., I can tell you there are a lot of them. Do they depend on the ZipCode? The answer is no, because all these addresses, with the same name, will have different ZipCodes.
Let’s check StreetAddress’s dependency on City. Does it depend on the city? The answer is, again, no. The same street name may be present in different cities.
What about ZipCode, does that depend on the StreetAddress? No!
What about the City, does it depend on the ZipCode? The answer is: YES, IT DOES!
Every city has its own set of Zipcodes and they must be unique in the U.S.A.
So, we can infer this:
AddressID -> ZipCode
ZipCode -> City
AddressID -> ZipCode -> City
The only reason we really know that the City is X is that the ZipCode, which determines the city, sits between the City and the AddressID. That is called a transitive dependency.
So, we need to break this table again and eliminate this dependency to make it 3NF.
Our core table Address will stay the same, but now without the City column.
We will create another lookup table called ZipCode that will hold city information.
We are now able to connect these two tables in a one-to-many relationship.
We are done normalizing that initial Person table. It became 7 different tables. What are the implications of that?
The more normalized the data is, the more tables we will have, and the more joins will be needed to fetch the same information.
So 3NF should be good enough for OLTP systems. The next step would be to learn BCNF, which is also called 3.5NF. As it turns out, this normalization has achieved BCNF as well, so we don’t have to talk about it. In fact, the majority of the time, getting to 3NF will also get us to BCNF.
There is nothing else to do here, but there may be situations where 4NF can be considered. Considered does not mean done. We need a very good reason to take it to 4NF or above, because those forms will cause even more joins.
The moral of the story is that an under-normalized table and an over-normalized table can both be very bad. So, stop at 3NF or BCNF.
There is one more thing to talk about. Even though it is not part of the normalization process itself, it should be considered during normalization. I am talking about surrogate keys. Yes, we mentioned them before, but now is a good time to complement that.
As you can see, I used surrogate keys for PersonID and AddressID. I used a VARCHAR type, mostly to make the examples more understandable. However, that is not a good practice, for the following reasons:
- An algorithm must be used to generate each PersonID, since primary keys must be unique per table.
- VARCHAR takes more bytes than other possible data types.
- Primary keys are sorted when INSERTed into the database. Every time an INSERT is made, the string needs to be sorted, and strings are harder to sort than numbers.
- If we UPDATE that key, we must change all the related tables as well.
These are the main reasons. There is another data type that some designers use, called uniqueidentifier.
Worse than using a VARCHAR is using a uniqueidentifier for this purpose. The uniqueidentifier generates a unique value on its own, so there is no need to create an algorithm to generate unique values.
However, it takes 16 bytes, and the system still needs to sort it before inserting it into the table. So, don’t use it!
What should be done then?
Integer numbers should be used for that purpose. Most databases auto-generate them for us, so there is no need to create an algorithm.
The sizes may vary from 1 byte (TINYINT) to 8 bytes (BIGINT). Choose the numeric datatype that makes sense.
Core tables usually take INT or BIGINT.
Lookup tables usually take a TINYINT.
List table usually takes INT.
Numbers take much fewer resources to INSERT, and when they are auto-generated, the database already knows where to insert them.
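In SQLite, for example, declaring the surrogate key as INTEGER PRIMARY KEY is enough for the database to auto-generate it; a minimal sketch (table and column names assumed from the walkthrough):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,  -- auto-generated integer surrogate key
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL
    )
""")

# No algorithm needed: omit PersonID and the database assigns 1, 2, 3, ...
conn.execute("INSERT INTO Person (FirstName, LastName) VALUES ('Joe', 'Black')")
conn.execute("INSERT INTO Person (FirstName, LastName) VALUES ('Mary', 'Black')")

ids = [r[0] for r in conn.execute("SELECT PersonID FROM Person ORDER BY PersonID")]
print(ids)  # [1, 2]
```

Other engines spell this differently (IDENTITY, AUTO_INCREMENT, sequences), but the idea is the same: let the database hand out small, sorted integers.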
So, once we have proper surrogate keys in place, this is what we get:
Please tell me how the process went for you!