If you ever studied normalisation of databases, you have probably come to the same conclusion as I have regarding NULL: It is best if NULL values in the database can be avoided but it is not always easy to achieve a NULL-free database. Let’s look at an example:
An object has certain properties. A vehicle has a number of gears, a certain length and height, a Wheel size etc. Typically there’s only one of these properties – a vehicle has a fixed number of gears, its length is fixed etc. So if you want to create a data model for a vehicle, you probably would create a Vehicles table, with Length, Height, Gears ans so forth as columns. So far, so good. But what if some properties are not applicable for certain Vehicles? Like a Tesla car, which doesn’t use the type of transmission we are used to. What if number of gears is a configurable option, not set until a buyer wants to become the owner of a specific vehicle? If you try to model your database to avoid NULL values in the database, you would have to rethink your data model. The Vehicles table would maybe have to become less wide and certain properties be taken out of the table and put into a VehiclePropertyValues table instead. If a certain property value is missing, there will simply be no row for that property value in the VehiclePropertyValues table. This is all fine, but it makes reporting slightly less intuitive – the SQL Queries against the database would have to contain a JOIN to the VehiclePropertyValues table. Values in the VehiclePropertyValues table would have to be pivoted to fit a report with fixed columns etc.
This is a rather simple example, but already with a simple data model, you are likely to end up with NULL-able columns.
To me, as a database designer and developer, NULL values are not forbidden. They are sometimes a problem, but my reality is that some tables will contain NULL values. In this blog post, I will discuss which possible implications NULL values can have and clear out some misunderstandings about NULL.
First of all – What is NULL? What does it mean?
To me as a user, NULL can mean one of two things. It can be used to indicate that a value is MISSING or NOT APPLICABLE. It can also be used to indicate that a value is UNKNOWN. I can usually understand which type of NULL is where. Sometimes I have to make an assumption, but usually I know if NULL means “MISSING” or “UNKNOWN”. That’s because I know the business requirements. I know that a NULL value in the Gears column means that no Gear has been chosen yet – the value is MISSING (so far). But me knowing if NULL means a value is MISSING or if it means UNKNOWN doesn’t help much when it comes to SQL Server interpreting a Query against a database. SQL Server (or SQL in general, this is not an implementation issue, it’s a SQL language issue) will always treat NULL as UNKNOWN. This is how the SQL language is designed and it’s nothing we can do about it. Whenever possible, I will try to design the database in such a way that I can treat MISSING different from UNKNOWN. In the Gears column, I might put a -1 to indicate that the value is MISSING or Not Applicable. But sometimes (or very often even), I’m not in charge of designing the database and/or the system surrounding it. I’m supposed to work with whatever data is in the database.
Many Words. “Too long, didn’t read” might be your reaction, but just remember this one thing: To a person, NULL would mean either MISSING or UNKNOWN. To SQL, it always means UNKNOWN.
Why does it matter what NULL means?
To answer that question, I would like to keep using the Gears column in the Vehicles table and make an example out of that. I’m faced with answering two questions from the business: How many bikes in our registry has five gears? How many bikes does NOT have five gears? To answer that, most developers would write two queries:
--How many vehicles has five gears?
COUNT(*) AS WithFiveGears
Gears = 5;
--How many vehicles does NOT have five gears?
COUNT(*) AS NotHavingFiveGears_WRONG
Gears <> 5;
You see that I labelled the count from the second query NotHavingFiveGears_WRONG. Why WRONG? Because it potentially returns the wrong results. And I say potentially. Like so often – “it depends”. This time, it depends on how we want to treat NULLs. Does NULL mean that the value is in fact UNKNOWN (there will be gears in this vehicle but the number of gears is set at a later stage) or does it mean that the value is MISSING or NOT APPLICABLE (A Tesla won’t have gears at all)?
If the NULL value is to be treated as UNKNOWN, we might have to rewrite the first Query as well. It should perhaps take into consideration that a number of the NULL values WILL get the value “5” in the future, and that should be a part of our report. But if the business says that UNKNOWN should be treated as “not 5”, our first Query will return the correct results, and the second won’t. And that’s because NULL is UNKNOWN to SQL. The WHERE clause in the first query gives us the rows where Gear = 5. That’s filtering out anything which is not “5” and therefore NULL values are not returned. Fine. And the second query has a WHERE clause which gives us the rows where Gears is NOT “5”. And that should give us all rows which are not explicitly “5”, right? Well, no. It doesn’t. It gives us the rows where the comparison Gears<>5 equeates to TRUE. All WHERE clauses equate to TRUE or FALSE, right? Well, no. Not always. Anything compared to NULL equeates to UNKNOWN. Because SQL doesn’t know if the column contains a certain value or not. The value is UNKNOWN and therefore it could be TRUE or FALSE. The WHERE clause will only return rows where the comparison is TRUE. And UNKNOWN is not TRUE. Therefore, the rows with NULL values will not be returned for the inequality predicate either.
Do you feel, when reading this, that you have made this error? I know I have. I’m not even sure I get it right every time nowadays either. But I know that I get it right more often then some years ago. I know that because my errors have had implications which were not desirable.
So what do we do about these d**n NULLs?
Well. To begin with, we try as hard as we can to only have NULL values which mean UNKNOWN in the database. That’s not Always easy, and it could mean we have to remodel the database. If we can’t remodel the database, we could use a constant. Like -1 for Gears. But how would that affect an AVG aggregate query? Or a median calculation? Or even “how many transmission boxes do we have to produce based on the number of Vehicles sold last year”? You might end up with one problem solved and a number of new, interesting problems.
In the end, you would have to learn to write SQL code which takes NULL into consideration. And each time you consider NULL, you would have to understand if NULL means UNKNOWN och MISSING/NOT APPLICABLE.
I will not try to answer when NULL means a certain thing. Instead, I will just end with a thing which you probably already know and have seen, but which you might not always remember to use. The IS keyword.
The IS keyword is used to compare something with NULL. It works both for equality (IS NULL) and inequality (IS NOT NULL). And it will return TRUE or FALSE, not UNKNOWN. Because SQL will now if a value is a NULL value or not.
So to finalise, I will rewrite the second reporting query to take the NULL value into consideration and in this case treat NULL as “not 5”.
COUNT(*) as NotFiveGears_CORRECT
Gears <> 5 OR Gears IS NULL
An alternative solution would be to use either ISNULL or COALESCE functions, to replace NULL values with something else, and then compare that with “not 5”. This will however be problematic in terms of performance. ISNULL have to be applied to each row, returning either the value for the Gears column of that row OR whatever replacement value we have. The result of that is then compared with “<> 5″. If we have an index on Gears column, we will end up with scanning the whole index, applying the function to each row in the index, thus making the index more or less useless. But semantically, using ISNULL or COALESCE would solve our problem, and therefore I show how it can be done using, in this case, COALESCE. COALESCE and ISNULL are very similar, if you want to compare them, look them up in Books Online. One thing not mentioned there (I Think at least) is how the two behave differently when used together with SELECT INTO.
COUNT(*) as NotFiveGears_CORRECT
COALESCE(Gears,-1) <> 5