What are the SQL Server datatypes?
The SQL Server datatypes dictate the type of data an object can hold. Examples are integer, strings, decimals, binary, and so forth.
We can divide them in
- Character Strings
- Binary Strings
- Date & Time
Numeric, as the name say, has to do with numbers.
Character Strings will be a text like.
Binary will hold binary data
Date & Time
The list above is not complete but are the ones that we may see in the real world. We will devote mostly to the main ones in this article.
What is a BIT?
A BIT is the “atom” of computer communication. When I say “atom”, I am trying to say the smallest and indivisible part of computer communication. The values can be interpreted as 0 and 1. In databases, we have a third one, NULL.
What is a BYTE?
A byte is the set of 8 BITs.
What is the difference between BIT and BYTE?
The difference is the size, like said before, a byte has 8 BITS all adjacent.
This byte of information matters when talking about applications and databases.
The main problem when designing a database is the idea of the Black Box. Many database developers believe that a database is this black box to only hold data and retrieve it when needed.
That is not true. It will store and retrieve data, but underneath there some amazing things happening. We must take advantage of it, starting by using the right datatype.
When we decide what datatype to use, we are also telling how the server should allocate storage space.
How to count in Binary?
It is hard to write about this, so I recommend you to watch this video below.
In summary, every bit that is “ON” will be mapped to those numbers in a byte.
--128 64 32 16 8 4 2 1 -- 0 0 0 0 0 0 0 0 = 0 -- 0 0 0 0 0 0 0 1 = 1 -- 0 0 0 0 0 0 1 0 = 2 -- 0 0 0 0 0 0 1 1 = 3 -- 0 0 0 0 0 1 0 0 = 4 -- 0 0 0 0 0 1 0 1 = 5
Notice that it doubles, so 2 bytes will continue the doubling pattern.
--32768 16384 8192 4096 2048 1024 512 256
So, if we want the number 4097 we would need to bytes of data that would be represented like this
32768 16384 8192 4096 2048 1024 512 256 128 64 32 16 8 4 2 1
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 = 4097
For now, keep this in mind.
- Bit -> 0 or 1 -> On or Off
- Byte -> 8 bits
- 1KB -> 1024 Bytes
- 8KB -> 8,192 Bytes
8Kb page is the smallest block of I/O in SQL Server. What does that mean? It means that SQL Server reads pages instead of records. Each page can hold up to 8Kb of data.
If we need to read books, taking into consideration that font size and pages are the same for all books. However, the number of pages is different. Which book, would be read faster? The one with fewer pages, right?
Yes, databases behaves the same way.
So, now we know that SQL Server reads pages and not rows. So, the less page it has to read, the faster it will most likely be, less I/O will be required, fewer chances of locks will be in place and better performance is almost guaranteed.
We should have that in mind when creating datatypes. The more the allocation of space, the more pages SQL Server will need to read and the worst the performance.
The data type bit takes 1 byte of storage data! What in the world? What not just a BIT? The minimum amount that SQL SERVER can allocate is a byte. However, each BIT column will only use 1 BIT portion of the byte.
Since we know that 1 byte has 8 BITs, we also know that in a column with 1 single BIT datatype. There will be another 7 BITS to be used.
Let’s try to use an example to better visualize that.
Consider this table:
CREATE TABLE aBitTable ( BitColumn1 BIT )
How many bytes are being allocated per row in this table? The answer is 1 byte.
What bout this table:
CREATE TABLE aBitTable2 ( BitColumn1 BIT , BitColumn2 BIT , BitColumn3 BIT , BitColumn4 BIT , BitColumn5 BIT , BitColumn6 BIT , BitColumn7 BIT , BitColumn8 BIT )
This table is allocating only 1 byte per row as well. The reason is that all BITs of one single byte can be used here. If we were to add another BIT column, it would now take 2 bytes and that could have up to16 BIT columns.
Possible values for BIT are 0, 1 and NULL.
This data type takes 1 byte of storage data. This is a numeric integer value so the max number it can hold is 255. Let’s check the 8 BITs to understand why.
|128 64 32 16 8 4 2 1|
1 1 1 1 1 1 1 1 = 255
It can hold 2556 values from 0 to 255.
This data type takes 2 bytes of storage data. This is also a numeric integer and the number it can hold ranges from -32,768 to 32,767. How did we get these range? let’s checks the BITS of 2 bytes.
32,768 16,384 8,192 4,096 2,048 1,024 512 256 128 64 32 16 8 4 2 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 = 65,535
The orange color is for the first byte and the color red is for the second byte. So, if they are all on the total number available is 65,535. So, why we can only have a max of 32,767?
Notice that for this data type we also have negative values, so the amount must be divided by 2. So (65,565 / 2) = 32,767.5
Hence, 32,767. Notice that we have a half somewhere in there. So, if we remove for now just 1 number from 65,535 we will have the exact number 32,767. Where will the 1 removed go? It will go to the negative portion, so -32,768.
This datatype takes 4 bytes. The values range from -2147,483,548 to 2,147,483,647. The principal is the same calculated on the previous data type. You may try it yourself.
This datatype takes 8 bytes. The values ranges from -263 to 263 – 1.
DECIMAL OR NUMERIC
These datatypes are not integer numbers. The storage varies from 5 to 13 bytes. It will all depend on the precision.
This is also not an integer and it takes 8 bytes.
Similar to MONEY, but it stores only 4 bytes.
CHAR is a character string data type. It can take from 1 byte to 8000 bytes. it will depend on how many characters will be specified when declaring the datatype. For example, CHAR(100) will take 100 bytes.
The maximum number of characters allowed in this datatype is 8000. It explains why the maximum it can take is 8000 bytes (CHAR(8000)). Notice that 8000 bytes are very close to 8Kb.
The 8000 limitation has to do with 8Kb pages.
This datatype is used for Unicode characters. It behaves exactly like CHAR with the exception that special characters can be used here. For that reason, each character will take 2 bytes. So, it can take from 2 bytes to 8000 bytes. So, NCHAR(100) will take 200 bytes.
Since it takes 2 bytes per character, the maximum amount of characters is 4000 (NCHAR(4000)).
CHAR was created with the English language in mind, but not only that, also because when created, that was all computer could really handle, when talking about mapping character to binary. So, for applications that will only use English, CHAR should be enough. Nowadays, English only is hard to see, so NCHAR is the most appropriate.
This datatype is another character string that takes 2 bytes plus 1 byte per character. It can range from 2 bytes to 8002 bytes. For example VACHAR(100) will take 102 bytes (100 bytes for all characters plus 2 bytes).
This data type is also similar to VARCHAR, but it takes 2 bytes plus 2 bytes per character. The reason is the same as mentioned for NCHAR. It takes special characters. For example, NCHAR(100) will take 202 bytes (200 bytes for all characters plus 2 bytes).
What is the difference among CHAR and VARCHAR?
When we declare a variable CHAR(10), we are telling the server that all the characters will be utilized and it can not be more than 10 characters. So, if the character string used to be inserted is less than ten characters, the server will insert blank spaces up until it gets to ten characters.
When we declare a variable VARCHAR(10), we are telling the server that the character string may have up to 10 characters. If we insert a character string with 8 characters, no blank spaces will be inserted.
The same applies to NCHAR & NVARCHAR.
Which one to use?
If we know for a fact that the number of characters of a string will always be constant, we should use CHAR or NCHAR. For example, Social Security Numbers, as of now, should always be of 9 characters. So, just use CHAR(9).
On the other hand, names can be of all different sizes. For example, FirstName could be from 1 to 40 to 50 characters. Since the number of characters will vary, we should use NVARCHAR instead. So, just use NVARCHAR(50).
Bottom-line is to use both datatypes as needed. If we think about it, they can serve as a constraint since it will limit the number of characters. Some developers say that we should only use VARCHAR or NVARCHAR. I disagree with that because in some situations if we know our data, we should use CHAR or NCHAR.
That matters because CHAR is usually faster to query, VARCHAR can potentially take less storage, consequently less 8Kb pages.
DATETIME takes 8 bytes and simply should not be used anymore nowadays. But you will still see it out there a bunch. It stores dates from January 1, 1753 to December 31, 9999. The time can also be stored, if time is not provided, it will default to 00:00:00.
This datatype takes 4 bytes and should not be used anymore as well. It stores dates from January 1, 1900 to June 6, 2079. Time can also be stored.
This datatype should be the one used to handle dates associated with times. It takes from 6 to 8 bytes depending on time precision. It stores dates from January 1, 0001 to December 31, 9999.
This datatype takes 3 bytes and ranges from January 1, 0001 to December 31, 9999. If no time is needed, this datatype should be the one used.
The datatypes we saw here are the ones most seen on a day to day basis. There are others that we will skip for now because they are not very relevant for beginners.
I will enumerate them here, so you can search for them at your own discretion. But again, they are not relevant as of now.
How big of an impact datatypes can have?
We will use a few examples to give an idea of how these datatypes will impact page reads. There may be somewhat intermediate queries for testing purposes. The intention here is not to introduce any advanced concept, but rather understand datatypes and its consequence when using them. So, ignore anything that may seem advanced for now.
This is what we are going to do, we will create set of 2 tables with different datatypes while inserting the exact same data.
CREATE TABLE Test1 ( TestId SMALLINT IDENTITY (1, 1), DummyType TINYINT ) GO INSERT INTO Test1 VALUES (1) GO 32767
This command is creating a table with 2 columns and inserting 32,767 records. notice the datatype TestId. It is a SMALLINT and only takes 2 bytes. We are only inserting 32,767 records because that is the maximum number it can hold.
Let’s do the same, but now with a BIGINT.
CREATE TABLE Test2 ( TestId BIGINT IDENTITY (1, 1) , DummyType TINYINT) GO INSERT INTO Test2 VALUES (1) GO 32767
The data is the exact same in both tables, however, the datatype is different. How big of a difference will it be in pages?
SELECT t.NAME AS TableName, p.rows AS RowCounts, SUM(a.total_pages) AS TotalPages, SUM(a.used_pages) AS UsedPages FROM sys.tables t JOIN sys.indexes i ON t.OBJECT_ID = i.object_id JOIN sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id JOIN sys.allocation_units a ON p.partition_id = a.container_id WHERE t.is_ms_shipped = 0 AND i.OBJECT_ID > 255 GROUP BY t.Name, p.Rows ORDER BY t.Name
So, even though we have the exact same data, the number of pages is different. Table Test 2 is taking more pages because it has a BIGINT datatype.
Let’s try something similar with CHAR and VARCHAR.
CREATE TABLE Test3 ( TestName CHAR(10) ) GO INSERT INTO Test3 VALUES ('1234567890') GO 10000 CREATE TABLE Test4 ( TestName VARCHAR(10) ) GO INSERT INTO Test4 VALUES ('1234567890') GO 10000
Again, the exact same data was used but the number of used pages was different. In this case, CHAR seems to be better. The Character String has a fixed amount of 10 and all characters are being filled.
Let’s try it again with different datatypes for CHAR and VARCHAR.
CREATE TABLE Test5 ( TestName CHAR(30) ) GO INSERT INTO Test5 VALUES ('1234567890') GO 10000 CREATE TABLE Test6 ( TestName VARCHAR(30) ) GO INSERT INTO Test6 VALUES ('1234567890') GO 10000
This time we changed CHAR and VARCHAR to hold up to 30 characters. We are only inserting 10 characters. But this time, VARCHAR is taking fewer pages. Why?
As we mentioned before, CHAR will use all those characters even if we do not insert them. It will insert blank characters instead. VARCHAR, on the other hand, will only use the number of characters inserted.
You may also notice that UsedPages on Test4 is the same as Test6. So, to reinforce, if we know that the number of characters will always be constant, we should use CHAR. Otherwise, use VARCHAR.
These examples were very simple. We used small tables with few records. In a table with many more columns and millions of records, this can make a huge difference.
So, before we create our tables, we should know the bits and bytes of the datatypes.