
Duplicate key in sysclsobjs using DBCC CLONEDATABASE

This blog post is about an error message I got the other day when using DBCC CLONEDATABASE in a T-SQL script. But first, some background on DBCC CLONEDATABASE.

I was pretty excited about the DBCC CLONEDATABASE command, which was introduced in SQL Server 2014 SP2 and SQL Server 2016 SP1. It creates a schema-only copy of a database (that means all the database objects, but no data), keeping all statistics data, so that you can troubleshoot query plans for certain queries without having to copy all the data. Before DBCC CLONEDATABASE (and to be honest probably also afterwards, since DBCC CLONEDATABASE doesn't replace all the needs) one had to make a full copy of a database to get the statistics data along. That copy is usually restored to a test box. If the test box is identical to your production box, you're almost fine. But on your test box, you don't have the cached execution plans from the production box. Therefore, you might end up with very different query plans on your test box. With DBCC CLONEDATABASE, you get a read-only copy of a database on your production box, and you can use that to tweak your queries and see what new estimated execution plans they get.

But there are limitations to what you can do in terms of troubleshooting. You can't really use it to test real queries, since the clone has no data. Not only will you miss the actual data when running queries, you won't know anything about CPU or I/O times for a query execution. And you can't add or change indexes to see how a query will behave – you can of course make the clone a read-write database and alter/add indexes, but any new or altered index will screw up the statistics objects: since the database contains no data, the statistics for those objects will be empty.
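
If you do experiment with that, making the clone writable is a one-liner. A minimal sketch, assuming a clone named AdventureWorks2014_clone has already been created with DBCC CLONEDATABASE:

--Assumption: AdventureWorks2014_clone was created earlier with DBCC CLONEDATABASE
--The clone is created read-only; make it writable to allow index changes
ALTER DATABASE AdventureWorks2014_clone SET READ_WRITE;
--From here you can alter/add indexes, but statistics on the empty tables stay empty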

To me, DBCC CLONEDATABASE is not that useful for troubleshooting. My use for DBCC CLONEDATABASE is as a replacement for scripting a database. I'm creating a database which should represent a subset of another database. When I first developed the scripts for it, I used AdventureWorks2014 and sampled 250 rows from Sales.Customer, and I wanted to get a complete AdventureWorks2014 database for these 250 customers, including their orders, the order lines for these orders, the salespersons for these customers, the rows from Person.Person, Sales.Store, Production.Product etc. I got the script to work OK on my own laptop, but when trying to run it against a server and another database, I got an error message:

Msg 2601, Level 14, State 1, Line 4
Cannot insert duplicate key row in object 'sys.sysclsobjs' with unique index 'clst'. The duplicate key value is (50, 5).

I created a Connect item for this error, but I couldn't really let it go, so I did some research. sys.sysclsobjs is not available unless you open an admin connection (DAC) to your instance. So I did that, and this is what I found:
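
If you want to look for yourself, connect through the dedicated administrator connection (prefix the server name with ADMIN: in SSMS) and query the base table. A minimal sketch – sys.sysclsobjs is undocumented, so take the column list as my assumption from poking around:

--Requires a DAC connection (server name prefixed with ADMIN:)
--class=50 holds the schema rows; id matches schema_id
SELECT class, id, name
FROM sys.sysclsobjs
WHERE class = 50
ORDER BY id;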

I have a row in sys.sysclsobjs with class=50 and id=5. I found out that this row holds information about a schema which exists in my source database. I also found that all rows with class=50 hold information about my schemas (id=1 is dbo etc). The schema_id for each schema is represented by the id column in sys.sysclsobjs. But this didn't help me much – there was nothing strange about the schema. So I let it go for a while, had some coffee, helped some developers troubleshoot queries and went home from work.
But I still couldn't let it go, so I fired up my home computer and tried to figure out what was different on my own laptop's SQL Server instance compared to the server instance. The source database was the same on both instances, and the SQL Server build number was the same, so I started looking at what else could possibly differ between the instances.
Without going through all the attempts leading to a dead end (they were many…) I finally looked at the model database on my two instances. On the instance where I got the error, I had a schema created – a schema created by monitoring software. This schema had schema_id=5.
I removed all the user created schemas from model. And voila, DBCC CLONEDATABASE works.
What does this mean? To me, it looks like (this has not been verified; I draw my conclusions from symptoms, not from any knowledge about the bits and pieces inside DBCC) DBCC CLONEDATABASE creates a new database using a normal CREATE DATABASE command, and then creates database objects and copies statistics. All schema_ids and object_ids are copied to the clone, and this is what causes the command to throw the duplicate key error. When schema_id=5 is copied from AdventureWorks2014 to AdventureWorks2014_clone, there's already a schema with schema_id=5 – the one inherited from the model system database.
I was pleased to find the cause of the error, and on our test boxes I'm OK with dropping schema_id=5 in model and moving on with life. But I won't be equally happy when we start using my script for something more useful than subsetting AdventureWorks2014. Subsetting a production database for troubleshooting purposes won't be possible – unless I let my script drop any non-system schemas in model on our production instances.
Here’s a script to reproduce the error:

--First create a user database
CREATE DATABASE CloneTest;
GO
USE CloneTest;
GO
CREATE SCHEMA CloneTestSchema;
GO
SELECT SCHEMA_ID('CloneTestSchema');
--schema_id=5
GO
--Now create a schema in Model
USE Model;
GO
SELECT * FROM sys.schemas ORDER BY schema_id;
GO
--Gives these first rows:
--name                   schema_id   principal_id
--dbo                    1           1
--guest                  2           2
--INFORMATION_SCHEMA     3           3
--sys                    4           4
--db_owner               16384       16384
GO
CREATE SCHEMA model_schema;
GO
SELECT SCHEMA_ID('model_schema');
--schema_id=5
GO
USE master;
GO
--Now try to clone the CloneTest database
DBCC CLONEDATABASE('CloneTest','CloneTest_Clone');
--Database cloning for 'CloneTest' has started with target as 'CloneTest_Clone'.
--Msg 2601, Level 14, State 1, Line 32
--Cannot insert duplicate key row in object 'sys.sysclsobjs' with unique index 'clst'. The duplicate key value is (50, 5).
GO
--Now remove the schema with schema_id=5 from Model
USE model;
GO
DROP SCHEMA model_schema;
GO
USE master;
GO
--Clone again
DBCC CLONEDATABASE('CloneTest','CloneTest_Clone');
--Database cloning for 'CloneTest' has started with target as 'CloneTest_Clone'.
--Database cloning for 'CloneTest' has finished. Cloned database is 'CloneTest_Clone'.
--Database 'CloneTest_Clone' is a cloned database. A cloned database should be used for diagnostic purposes only and is not supported for use in a production environment.
--DBCC execution completed. If DBCC printed error messages, contact your system administrator.
GO

I thought for a while that it would be OK if the schema in model and the source database had the same schema_id/schema_name combination. But that also fails. So the only way to use DBCC CLONEDATABASE is if there are no user-created schemas in model.
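
A quick sanity check before cloning is to look for user-created schemas in model. A minimal sketch, relying on the fact that the built-in schemas occupy schema_id 1-4 and the fixed database roles start at 16384:

--Any rows returned here indicate schemas that may break DBCC CLONEDATABASE
SELECT name, schema_id
FROM model.sys.schemas
WHERE schema_id BETWEEN 5 AND 16383;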

OPTION(USE HINT) – New SQL Server 2016 SP1 feature

Many SQL Server developers and admins found, after upgrading to SQL Server 2014, that some queries started taking much longer than before. The reason is the new cardinality estimation formula introduced in SQL Server 2014. Cardinality estimation is done all the time by the SQL Server optimizer. To produce a query plan, the optimizer estimates how many rows will match each condition on a table. In most cases, the new cardinality estimation formula in SQL Server 2014 and onwards gives slightly better estimates, and the optimizer therefore produces slightly better plans. In some cases however, mostly when there are predicates on more than one column in a WHERE clause or JOIN clause, the 2014 cardinality estimation is a lot worse than in previous versions of SQL Server.

I'm doing an upgrade of SQL Server, from 2012 to 2016. Some days ago, a stored procedure went from three minutes to 15 minutes. Analyzing query plans for the procedure showed an operator estimating 10,000 rows but with an actual row count of nearly three million rows. In SQL Server 2012, the estimate for the same combination of predicates was two million rows. The difference between two and three million rows is negligible. The difference between 10,000 and three million rows is far from negligible, and in our case it caused the optimizer to choose to do key lookups instead of scanning small and medium sized tables.

A new trace flag, 9481, was introduced in SQL Server 2014; it can be used to instruct SQL Server to use the old cardinality estimation formula. But a trace flag is global, and we probably wouldn't want to change the behaviour of a whole instance when we have a problem with one specific query.
In SQL Server 2016, database scoped configuration was introduced. Database scoped configuration moves some of the configuration options previously only available at the instance level down to the database level. One database scoped configuration option is LEGACY_CARDINALITY_ESTIMATION, which gives the same behaviour as trace flag 9481, but only for a single database. I'm pleased with database scoped configuration, but I still don't want to force a certain behaviour upon ALL queries in a database when I only have a problem with ONE single query.
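
For reference, this is what the database scoped configuration option looks like; it is run in the context of the database you want to affect:

--Run in the problem database; affects only that database
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;
--And to go back to the new cardinality estimation:
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = OFF;
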
In both SQL Server 2014 and SQL Server 2016, one can use the query hint QUERYTRACEON to get the behaviour of certain trace flags for a single query. That's almost good enough. Almost.

USE AdventureWorks2014;
GO
--Create a user with low privileges
CREATE USER u WITHOUT LOGIN;
GO
GRANT SELECT TO u;
GO
--Impersonate user u
EXECUTE AS USER = 'u';
GO
--Query using QUERYTRACEON to enforce legacy cardinality estimation
SELECT * FROM Sales.SalesOrderHeader
OPTION(QUERYTRACEON 9481);
GO
--Stop impersonating user u
REVERT;
GO
--Cleanup
DROP USER u;
GO

And the result is this:

Msg 2571, Level 14, State 3, Line 11
User 'u' does not have permission to run DBCC TRACEON.

QUERYTRACEON requires sysadmin privileges, and I certainly don’t want all my users or application services to run with sysadmin privileges.

Which brings me to the real topic of this post. There has been a LOT of fuss about SP1 for SQL Server 2016, and most of it has been about all the cool Enterprise features that make it into Standard edition. Less talked about is the new query hint USE HINT. It provides a method to add behaviour to a single query – behaviour which was previously only available through trace flags or database scoped configuration options. One option available through USE HINT is FORCE_LEGACY_CARDINALITY_ESTIMATION.

It does the same as QUERYTRACEON 9481, and the syntax is quite simple:

SELECT col1, col2, col3...
FROM SomeTable
WHERE col1='somevalue' AND col2='someothervalue' AND col3='somethirdvalue'
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

And if we run our test using the low-privilege user "u" again, but with USE HINT instead of QUERYTRACEON, it just works.

USE AdventureWorks2014;
GO
--Create a user with low privileges
CREATE USER u WITHOUT LOGIN;
GO
GRANT SELECT TO u;
GO
--Impersonate user u
EXECUTE AS USER = 'u';
GO
--Query using USE HINT to enforce legacy cardinality estimation
SELECT * FROM Sales.SalesOrderHeader
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
GO
--Stop impersonating user u
REVERT;
GO
--Cleanup
DROP USER u;
GO

To me, this is fantastic, because it lets us fix the very, very few queries where the cardinality estimator goes bananas after an upgrade, without having to change the behaviour of the other several thousand queries.

And of course, there's a sibling hint to FORCE_LEGACY_CARDINALITY_ESTIMATION: FORCE_DEFAULT_CARDINALITY_ESTIMATION, which overrides both trace flag 9481 and the database scoped configuration option LEGACY_CARDINALITY_ESTIMATION = ON.

Another really useful hint available through USE HINT is ENABLE_QUERY_OPTIMIZER_HOTFIXES. When a hotfix is released to fix some odd behaviour in the optimizer, Microsoft has decided that it's not going to be enabled by default. That is because a hotfix may fix an odd behaviour for certain queries, but it might introduce other problems for other queries. Instead of globally enabling trace flag 4199 or using the database scoped configuration option QUERY_OPTIMIZER_HOTFIXES = ON, we can now enable query optimizer hotfixes on a per-query basis. Really cool if you ask me!
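
The syntax is the same as for the other USE HINT options:

SELECT col1, col2, col3...
FROM SomeTable
WHERE col1='somevalue' AND col2='someothervalue'
OPTION(USE HINT('ENABLE_QUERY_OPTIMIZER_HOTFIXES'));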

The rest of the hints available through USE HINT (there are nine of them in total) are documented in Books Online, here: https://msdn.microsoft.com/en-us/library/ms181714.aspx

What is NULL?

If you ever studied normalisation of databases, you have probably come to the same conclusion as I have regarding NULL: it is best if NULL values in the database can be avoided, but a NULL-free database is not always easy to achieve. Let's look at an example:

An object has certain properties. A vehicle has a number of gears, a certain length and height, a wheel size etc. Typically there's only one of each of these properties – a vehicle has a fixed number of gears, its length is fixed etc. So if you want to create a data model for a vehicle, you would probably create a Vehicles table, with Length, Height, Gears and so forth as columns. So far, so good. But what if some properties are not applicable for certain vehicles? Like a Tesla car, which doesn't use the type of transmission we are used to. What if the number of gears is a configurable option, not set until a buyer wants to become the owner of a specific vehicle? If you try to model your database to avoid NULL values, you have to rethink your data model. The Vehicles table would maybe have to become less wide, and certain properties would be taken out of the table and put into a VehiclePropertyValues table instead. If a certain property value is missing, there will simply be no row for that property in the VehiclePropertyValues table. This is all fine, but it makes reporting slightly less intuitive – the SQL queries against the database would have to contain a JOIN to the VehiclePropertyValues table, values in the VehiclePropertyValues table would have to be pivoted to fit a report with fixed columns, etc.
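
To make the remodelling concrete, here is a minimal sketch of what such a design could look like. The table and column names are my own invented example, not a recommendation:

--Hypothetical remodelled design: optional properties moved out of the Vehicles table
CREATE TABLE dbo.Vehicles(
  VehicleID int IDENTITY(1,1) PRIMARY KEY,
  VehicleName varchar(100) NOT NULL
);
CREATE TABLE dbo.VehiclePropertyValues(
  VehicleID int NOT NULL REFERENCES dbo.Vehicles(VehicleID),
  PropertyName varchar(100) NOT NULL, --e.g. 'Gears', 'Length'
  PropertyValue int NOT NULL,
  PRIMARY KEY(VehicleID, PropertyName)
);
--A Tesla simply has no 'Gears' row. No NULL needed, but reports now
--need a JOIN (and often a PIVOT) to get back to fixed columns.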

This is a rather simple example, but already with a simple data model, you are likely to end up with NULL-able columns.

To me, as a database designer and developer, NULL values are not forbidden. They are sometimes a problem, but my reality is that some tables will contain NULL values. In this blog post, I will discuss what implications NULL values can have and clear up some misunderstandings about NULL.

First of all – What is NULL? What does it mean?
To me as a user, NULL can mean one of two things. It can be used to indicate that a value is MISSING or NOT APPLICABLE. It can also be used to indicate that a value is UNKNOWN. I can usually tell which type of NULL is which. Sometimes I have to make an assumption, but usually I know whether NULL means "MISSING" or "UNKNOWN". That's because I know the business requirements. I know that a NULL value in the Gears column means that no gear count has been chosen yet – the value is MISSING (so far). But me knowing whether NULL means MISSING or UNKNOWN doesn't help much when it comes to SQL Server interpreting a query against the database. SQL Server (or SQL in general – this is not an implementation issue, it's a SQL language issue) will always treat NULL as UNKNOWN. This is how the SQL language is designed, and there's nothing we can do about it. Whenever possible, I will try to design the database in such a way that I can treat MISSING differently from UNKNOWN. In the Gears column, I might put a -1 to indicate that the value is MISSING or NOT APPLICABLE. But sometimes (or very often, even), I'm not in charge of designing the database and/or the system surrounding it. I'm supposed to work with whatever data is in the database.
Many words. "Too long, didn't read" might be your reaction, but just remember this one thing: to a person, NULL means either MISSING or UNKNOWN. To SQL, it always means UNKNOWN.

Why does it matter what NULL means?
To answer that question, I would like to keep using the Gears column in the Vehicles table and make an example out of it. I'm faced with answering two questions from the business: How many bikes in our registry have five gears? How many bikes do NOT have five gears? To answer that, most developers would write two queries:

--How many vehicles have five gears?
SELECT
  COUNT(*) AS WithFiveGears
FROM
  dbo.Vehicles
WHERE 
  Gears = 5;

--How many vehicles do NOT have five gears?
SELECT
  COUNT(*) AS NotHavingFiveGears_WRONG
FROM
  dbo.Vehicles
WHERE
  Gears <> 5;

You see that I labelled the count from the second query NotHavingFiveGears_WRONG. Why WRONG? Because it potentially returns the wrong results. And I do say potentially. Like so often, "it depends". This time, it depends on how we want to treat NULLs. Does NULL mean that the value is in fact UNKNOWN (there will be gears in this vehicle, but the number of gears is set at a later stage), or does it mean that the value is MISSING or NOT APPLICABLE (a Tesla won't have gears at all)?
If the NULL value is to be treated as UNKNOWN, we might have to rewrite the first query as well. It should perhaps take into consideration that a number of the NULL values WILL get the value 5 in the future, and that should be part of our report. But if the business says that UNKNOWN should be treated as "not 5", our first query will return the correct results, and the second won't. And that's because NULL is UNKNOWN to SQL. The WHERE clause in the first query gives us the rows where Gears = 5. That filters out anything which is not 5, and therefore NULL values are not returned. Fine. And the second query has a WHERE clause which gives us the rows where Gears is NOT 5. And that should give us all rows which are not explicitly 5, right? Well, no. It doesn't. It gives us the rows where the comparison Gears <> 5 evaluates to TRUE. All WHERE clauses evaluate to TRUE or FALSE, right? Well, no. Not always. Anything compared to NULL evaluates to UNKNOWN, because SQL doesn't know whether the column contains a certain value or not. The value is UNKNOWN, and therefore the comparison could be TRUE or FALSE. The WHERE clause will only return rows where the comparison is TRUE. And UNKNOWN is not TRUE. Therefore, the rows with NULL values will not be returned for the inequality predicate either.
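
A tiny repro shows the effect. With three rows, one of them NULL, neither predicate returns the NULL row:

--Mini-repro of three-valued logic (my own example)
CREATE TABLE #Bikes(Gears int NULL);
INSERT INTO #Bikes(Gears) VALUES (5), (3), (NULL);

SELECT COUNT(*) AS WithFiveGears FROM #Bikes WHERE Gears = 5;        --1
SELECT COUNT(*) AS NotHavingFiveGears FROM #Bikes WHERE Gears <> 5;  --1, not 2
--The NULL row is counted in NEITHER query: 1 + 1 <> 3

DROP TABLE #Bikes;
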
Do you feel, when reading this, that you have made this error? I know I have. I'm not even sure I get it right every time nowadays either. But I know that I get it right more often than some years ago. I know that because my errors have had implications which were not desirable.

So what do we do about these d**n NULLs?
Well. To begin with, we try as hard as we can to only have NULL values which mean UNKNOWN in the database. That's not always easy, and it could mean we have to remodel the database. If we can't remodel the database, we could use a constant, like -1 for Gears. But how would that affect an AVG aggregate query? Or a median calculation? Or even "how many transmission boxes do we have to produce, based on the number of vehicles sold last year"? You might end up with one problem solved and a number of new, interesting problems.
In the end, you would have to learn to write SQL code which takes NULL into consideration. And each time you consider NULL, you would have to understand whether NULL means UNKNOWN or MISSING/NOT APPLICABLE.
I will not try to answer when NULL means a certain thing. Instead, I will just end with a thing which you probably already know and have seen, but which you might not always remember to use. The IS keyword.
The IS keyword is used to compare something with NULL. It works both for equality (IS NULL) and inequality (IS NOT NULL). And it will return TRUE or FALSE, not UNKNOWN, because SQL does know whether a value is a NULL value or not.
So to finalise, I will rewrite the second reporting query to take the NULL values into consideration, and in this case treat NULL as "not 5".

SELECT
  COUNT(*) as NotFiveGears_CORRECT
FROM
  dbo.Vehicles
WHERE
  Gears <> 5 OR Gears IS NULL

An alternative solution would be to use either the ISNULL or the COALESCE function to replace NULL values with something else, and then compare that with 5. This can however be problematic in terms of performance. ISNULL has to be applied to each row, returning either the value of the Gears column for that row OR whatever replacement value we have chosen. The result of that is then compared with <> 5. If we have an index on the Gears column, we will end up scanning the whole index, applying the function to each row in the index, thus making the index more or less useless. But semantically, using ISNULL or COALESCE solves our problem, and therefore I show how it can be done using, in this case, COALESCE. COALESCE and ISNULL are very similar; if you want to compare them, look them up in Books Online. One thing not mentioned there (I think, at least) is how the two behave differently when used together with SELECT INTO.

SELECT
  COUNT(*) as NotFiveGears_CORRECT
FROM
  dbo.Vehicles
WHERE
  COALESCE(Gears,-1) <> 5 
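
And since I mentioned it: the SELECT INTO difference is that ISNULL with a non-NULL replacement value produces a NOT NULL column in the target table, while COALESCE produces a nullable one. A minimal sketch to see it for yourself:

--Demo of ISNULL vs COALESCE together with SELECT INTO
SELECT
  ISNULL(Gears, -1) AS GearsIsnull,     --becomes NOT NULL in the new table
  COALESCE(Gears, -1) AS GearsCoalesce  --stays nullable in the new table
INTO #GearsDemo
FROM dbo.Vehicles;

SELECT name, is_nullable
FROM tempdb.sys.columns
WHERE object_id = OBJECT_ID('tempdb..#GearsDemo');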

Archiving with the OUTPUT clause

Most database developers have been faced with the task of archiving old data. It could look something like this:

CREATE TABLE dbo.Cars(
  CarID int IDENTITY(1,1) PRIMARY KEY,
  BrandName varchar(100),
  ModelName varchar(100),
  releaseYear smallint
);

CREATE TABLE dbo.Cars_Archive(
  CarID int,
  BrandName varchar(100),
  ModelName varchar(100),
  releaseYear smallint,
  ArchivedDateTime datetime DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT PK_Cars_Archive PRIMARY KEY(CarID, ArchivedDateTime)
);

And updating a row would often require a stored procedure and some explicit transactions:

--Note that this is very simplified.
CREATE PROC dbo.UpdateCars(@CarID int, @BrandName varchar(100), @ModelName varchar(100), @releaseYear smallint)
AS
BEGIN TRY
  BEGIN TRAN;
  --First archive the current version of the row
  INSERT dbo.Cars_Archive(CarID, BrandName, ModelName, releaseYear)
    SELECT CarID, BrandName, ModelName, releaseYear
    FROM dbo.Cars
    WHERE CarID = @CarID;

  --Then apply the change
  UPDATE dbo.Cars
    SET BrandName = @BrandName,
        ModelName = @ModelName,
        releaseYear = @releaseYear
    WHERE CarID = @CarID;
  COMMIT;
END TRY
BEGIN CATCH
  --You would of course do more than just rolling back the transaction here, but this works as an example
  IF @@TRANCOUNT > 0
    ROLLBACK;
END CATCH

This works. But complexity always grows quite a bit when you have to use explicit transactions. And if you (like me) are lazy, you want to write as little code as possible.
And this is where the OUTPUT-clause comes into play.
The OUTPUT clause can be added to UPDATE, INSERT, DELETE and MERGE statements, and provides you with a way to view how rows look before and after a change.
I will provide an example of how the above UPDATE procedure could be rewritten as a single SQL statement, thus getting rid of the explicit transaction handling (a single statement is atomic and will either commit or fail all by itself).


INSERT dbo.Cars_Archive(CarID, BrandName, ModelName, releaseYear)
SELECT CarID, BrandName, ModelName, releaseYear FROM
(
  UPDATE dbo.Cars SET BrandName = 'SAAB'
  OUTPUT deleted.CarID, deleted.BrandName, deleted.ModelName, deleted.releaseYear
  WHERE CarID = 2
) AS t;

The above code will both update the Cars table and archive the old row from the Cars table into the Cars_Archive table.

The syntax of the OUTPUT clause isn't really that complicated – it is reminiscent of what you can do inside a trigger. You have an inserted and a deleted table, just like in a trigger.
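
As a quick illustration of the inserted and deleted tables (my own example, reusing the Cars table from above), an UPDATE can return both the before and after values in one statement:

--Show before and after values of an update
UPDATE dbo.Cars
SET ModelName = '9-5'
OUTPUT
  deleted.ModelName AS ModelNameBefore,
  inserted.ModelName AS ModelNameAfter
WHERE CarID = 2;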

Read more about the OUTPUT-clause here: https://msdn.microsoft.com/en-us/library/ms177564(v=sql.105).aspx

Generate time slots

Here's an inline table-valued function (TVF) for generating time slots from a start date to an end date, with a given slot length in minutes.

This would be useful for many applications, like scheduling systems, sales statistics broken down into certain slices of time, etc. The function does have some limitations, e.g. it can't generate more than 100,000 time slots between start and end time. This is easily fixed by just adding another CROSS JOIN to cte2, or by changing the DATEADD functions to use hour instead of minute if that fits your purpose.

The function uses a numbers table together with DATEADD to generate the time slots. I've seen tons of numbers table solutions using CROSS JOIN and the table sys.columns. This solution is faster, since it holds the numbers table entirely in memory and won't have to do any logical reads or scans against sys.columns. My solution uses zero logical reads, compared to 12 logical reads for 3,600 time slots generated with sys.columns CROSS JOIN sys.columns. Not a dramatic difference, but if you CROSS APPLY this function against some large dataset, you might end up with a more dramatic performance difference.

Anyway, here's the code for the function.

CREATE FUNCTION dbo.GetTimeSlots(@starttime datetime, @endtime datetime, @interval int)
RETURNS TABLE 
AS
RETURN 
--Generate numbers 0 to N-1 for number of slots N
--Ten rows of 1s; the five CROSS JOINs in cte2 give up to 10^5 = 100,000 numbers
WITH cte AS(
	SELECT 1 AS n
	UNION ALL 
	SELECT 1
	UNION ALL 
	SELECT 1
	UNION ALL 
	SELECT 1
	UNION ALL 
	SELECT 1
	UNION ALL 
	SELECT 1
	UNION ALL 
	SELECT 1
	UNION ALL 
	SELECT 1
	UNION ALL
	SELECT 1
	UNION ALL
	SELECT 1
), cte2 AS (
	--One row per slot; TOP() limits the count to the number of whole intervals in the range
	SELECT TOP(DATEDIFF(minute,@starttime, @endtime) / @interval) ROW_NUMBER() OVER(ORDER BY c1.n)-1 AS rownum
	FROM cte c1
	CROSS JOIN cte c2
	CROSS JOIN cte c3
	CROSS JOIN cte c4
	CROSS JOIN cte c5
)
SELECT
	DATEADD(minute,rownum*@interval,@starttime) AS slotfrom,
	DATEADD(minute,(rownum+1)*@interval,@starttime) AS slotto
FROM
	cte2;
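
A usage example (with invented dates): generating 30-minute slots for a nine-hour working day returns 18 rows.

--Generate 30-minute slots between 08:00 and 17:00
SELECT slotfrom, slotto
FROM dbo.GetTimeSlots('2016-12-01 08:00', '2016-12-01 17:00', 30);
--08:00-08:30, 08:30-09:00, ... , 16:30-17:00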
