Transmokopter database solutions

Tag: t-sql

What is NULL?

If you ever studied normalisation of databases, you have probably come to the same conclusion as I have regarding NULL: It is best if NULL values in the database can be avoided but it is not always easy to achieve a NULL-free database. Let’s look at an example:

An object has certain properties. A vehicle has a number of gears, a certain length and height, a Wheel size etc. Typically there’s only one of these properties – a vehicle has a fixed number of gears, its length is fixed etc. So if you want to create a data model for a vehicle, you probably would create a Vehicles table, with Length, Height, Gears ans so forth as columns. So far, so good. But what if some properties are not applicable for certain Vehicles? Like a Tesla car, which doesn’t use the type of transmission we are used to. What if number of gears is a configurable option, not set until a buyer wants to become the owner of a specific vehicle? If you try to model your database to avoid NULL values in the database, you would have to rethink your data model. The Vehicles table would maybe have to become less wide and certain properties be taken out of the table and put into a VehiclePropertyValues table instead. If a certain property value is missing, there will simply be no row for that property value in the VehiclePropertyValues table. This is all fine, but it makes reporting slightly less intuitive – the SQL Queries against the database would have to contain a JOIN to the VehiclePropertyValues table. Values in the VehiclePropertyValues table would have to be pivoted to fit a report with fixed columns etc.

This is a rather simple example, but already with a simple data model, you are likely to end up with NULL-able columns.

To me, as a database designer and developer, NULL values are not forbidden. They are sometimes a problem, but my reality is that some tables will contain NULL values. In this blog post, I will discuss which possible implications NULL values can have and clear out some misunderstandings about NULL.

First of all – What is NULL? What does it mean?
To me as a user, NULL can mean one of two things. It can be used to indicate that a value is MISSING or NOT APPLICABLE. It can also be used to indicate that a value is UNKNOWN. I can usually understand which type of NULL is where. Sometimes I have to make an assumption, but usually I know if NULL means “MISSING” or “UNKNOWN”. That’s because I know the business requirements. I know that a NULL value in the Gears column means that no Gear has been chosen yet – the value is MISSING (so far). But me knowing if NULL means a value is MISSING or if it means UNKNOWN doesn’t help much when it comes to SQL Server interpreting a Query against a database. SQL Server (or SQL in general, this is not an implementation issue, it’s a SQL language issue) will always treat NULL as UNKNOWN. This is how the SQL language is designed and it’s nothing we can do about it. Whenever possible, I will try to design the database in such a way that I can treat MISSING different from UNKNOWN. In the Gears column, I might put a -1 to indicate that the value is MISSING or Not Applicable. But sometimes (or very often even), I’m not in charge of designing the database and/or the system surrounding it. I’m supposed to work with whatever data is in the database.
Many Words. “Too long, didn’t read” might be your reaction, but just remember this one thing: To a person, NULL would mean either MISSING or UNKNOWN. To SQL, it always means UNKNOWN.

Why does it matter what NULL means?
To answer that question, I would like to keep using the Gears column in the Vehicles table and make an example out of that. I’m faced with answering two questions from the business: How many bikes in our registry has five gears? How many bikes does NOT have five gears? To answer that, most developers would write two queries:

--How many vehicles has five gears?
  COUNT(*) AS WithFiveGears
  Gears = 5;

--How many vehicles does NOT have five gears?
  COUNT(*) AS NotHavingFiveGears_WRONG
  Gears <> 5;

You see that I labelled the count from the second query NotHavingFiveGears_WRONG. Why WRONG? Because it potentially returns the wrong results. And I say potentially. Like so often – “it depends”. This time, it depends on how we want to treat NULLs. Does NULL mean that the value is in fact UNKNOWN (there will be gears in this vehicle but the number of gears is set at a later stage) or does it mean that the value is MISSING or NOT APPLICABLE (A Tesla won’t have gears at all)?
If the NULL value is to be treated as UNKNOWN, we might have to rewrite the first Query as well. It should perhaps take into consideration that a number of the NULL values WILL get the value “5” in the future, and that should be a part of our report. But if the business says that UNKNOWN should be treated as “not 5”, our first Query will return the correct results, and the second won’t. And that’s because NULL is UNKNOWN to SQL. The WHERE clause in the first query gives us the rows where Gear = 5. That’s filtering out anything which is not “5” and therefore NULL values are not returned. Fine. And the second query has a WHERE clause which gives us the rows where Gears is NOT “5”. And that should give us all rows which are not explicitly “5”, right? Well, no. It doesn’t. It gives us the rows where the comparison Gears<>5 equeates to TRUE. All WHERE clauses equate to TRUE or FALSE, right? Well, no. Not always. Anything compared to NULL equeates to UNKNOWN. Because SQL doesn’t know if the column contains a certain value or not. The value is UNKNOWN and therefore it could be TRUE or FALSE. The WHERE clause will only return rows where the comparison is TRUE. And UNKNOWN is not TRUE. Therefore, the rows with NULL values will not be returned for the inequality predicate either.
Do you feel, when reading this, that you have made this error? I know I have. I’m not even sure I get it right every time nowadays either. But I know that I get it right more often then some years ago. I know that because my errors have had implications which were not desirable.

So what do we do about these d**n NULLs?
Well. To begin with, we try as hard as we can to only have NULL values which mean UNKNOWN in the database. That’s not Always easy, and it could mean we have to remodel the database. If we can’t remodel the database, we could use a constant. Like -1 for Gears. But how would that affect an AVG aggregate query? Or a median calculation? Or even “how many transmission boxes do we have to produce based on the number of Vehicles sold last year”? You might end up with one problem solved and a number of new, interesting problems.
In the end, you would have to learn to write SQL code which takes NULL into consideration. And each time you consider NULL, you would have to understand if NULL means UNKNOWN och MISSING/NOT APPLICABLE.
I will not try to answer when NULL means a certain thing. Instead, I will just end with a thing which you probably already know and have seen, but which you might not always remember to use. The IS keyword.
The IS keyword is used to compare something with NULL. It works both for equality (IS NULL) and inequality (IS NOT NULL). And it will return TRUE or FALSE, not UNKNOWN. Because SQL will now if a value is a NULL value or not.
So to finalise, I will rewrite the second reporting query to take the NULL value into consideration and in this case treat NULL as “not 5”.

  COUNT(*) as NotFiveGears_CORRECT
  Gears <> 5 OR Gears IS NULL

An alternative solution would be to use either ISNULL or COALESCE functions, to replace NULL values with something else, and then compare that with “not 5”. This will however be problematic in terms of performance. ISNULL have to be applied to each row, returning either the value for the Gears column of that row OR whatever replacement value we have. The result of that is then compared with “<> 5″. If we have an index on Gears column, we will end up with scanning the whole index, applying the function to each row in the index, thus making the index more or less useless. But semantically, using ISNULL or COALESCE would solve our problem, and therefore I show how it can be done using, in this case, COALESCE. COALESCE and ISNULL are very similar, if you want to compare them, look them up in Books Online. One thing not mentioned there (I Think at least) is how the two behave differently when used together with SELECT INTO.

  COUNT(*) as NotFiveGears_CORRECT
  COALESCE(Gears,-1) <> 5 

DELETE and Non Clustered Indexes

I’ll start off with a disclaimer: I’m going to tell you about something that happened in a specific system Environment. There’s no such thing as  a general advice you can build on this specific scenario. I’m just posting it because I was myself surprised by what order of magnitude I was able to speed up a specific query by slightly removing some of the work in the execution plan.

The other day I helped troubleshooting a database system. In a table with some 400 million records, a subset (50-60 million records) were to be deleted. The application kept timing out on this delete operation so I adviced the developer to split the delete operation into smaller chunks. I even helped writing a T-SQL script to perform the delete in one million row chunks. The script was pretty basic – a WHILE-loop which checked if any rows fulfilling the WHERE-condition of the delete was left in the table, and inside the loop a DELETE TOP(one million) followed by an explicit checkpoint.

I was sure that this script would do the trick, but I found that just deleting one million rows was a 20 minute job. 20 minutes for one million rows. And the table has 50-60 million rows. That’s something like a 15 hour job. No wonder the application kept timing out… Looking at the query plan for a delete of one million rows, I saw that a SORT-operator and a Non Clustered delete took 70% of the execution and I had a warning about spill to tempdb in the SORT operator. Knowing something about the disk latency for tempdb, I could also tell that this part of the execution plan was in fact taking more than 70% of the time.

So I started advicing on partitioning the data, in order to switch out data in specific partition to a staging area and truncate. Though this would be the “ultimate” solution in terms of performance, it should also be said that this is a rather advanced topic and it would require more future maintenance than not partitioning the data.

I had the SORT- and Non Clustered Delete operators in the back of my head and decided to try disabling the Non Clustered Index before performing the the delete. I tested how long a Non Clustered Index rebuild would take and to my surprise I saw that that was only an eight minute job. Eight minute to rebuild a NCI on 400 million rows, compared to 15 hours for deleting 50-60 million rows. So I built another T-SQL script, which disabled NCIs on the table, performed the delete operation and then rebuilt the NCIs. I tested only with deleting one million rows. The result: zero seconds for disabling NCIs. Nine seconds for the delete (!!!). Eight minutes for rebuilding the index. If I could delete one million rows in nine seconds, I was looking at something like 7 minutes for deleting all 50-60 million rows. That’s 0,8% of the time it would have taken to just perform the delete. If I add on the time it takes to rebuild the index, I’m up to a 15 minute job, compared to a 15 hour job before disabling the indexes.

BUT, someone would argue – if I disable and then rebuild the indexes, the table would be slow on Query performance the first seven minutes and in worst case it would be offline the following eight, during the index rebuild (if it’s not done online). But imaging the Query performance on a table in which a delete operation goes on for 15 hours.

The disk latency on tempdb and actually also on the transaction log is pretty high in this machine and that’s something which slows down performance. But even with an ok or even good disk configuration, I would say that disabling NCIs before the DELETE would save us a lot of time.

I’m not going to argue that you should ALWAYS do this or that, but if you find yourself struggling with updates, inserts or deletes which goes on forever, check your execution plan to see how much of the work really has to do with the DML operation and how much has to do with updating/inserting/deleting in non clustered indexes. Before going to your manager and ask for better hardware, look what you can do to optimize large workloads with a mix of DDL and DML operations. In my case, we had spill to tempdb already with one million rows. If I would have tried fighting that with more memory on the server, I would have had to ask for 150GB more memory in a box with 3GB available memory. Hardware is sometimes a real bottleneck, but I would argue that you should fix your code before adjusting the hardware.

Partitioning the table and/or implementing compression on the table would probably help even more. But these are both Enterprise features, which comes with a rather high cost. Not every organisation can afford that. If you don’t have the knowledge to troubleshoot really problematic queries yourself, it’s probably worth investing in a few days of expert help, before wasting money on new hardware or more expensive licenses.

© 2019 tsql.nu

Theme by Anders NorenUp ↑