Normalization vs Performance – SQL Server

Are you dealing with high volume online data and having trouble in performing online analytics over it?

Normalization could be one possible reason behind poor performance of your queries. Every solution has some trade-off so normalization also has. If your data is extensively normalized then you may be performing a lot of joins in order to query relevant data. Believe me, if your SQL Server database is having millions and billions of rows then joins could cause you heavy penalty.

Normalization is the process of reorganizing data in a database so that it meets two basic requirements:
(1) There is no redundancy of data (all data is stored in only one place), and
(2) data dependencies are logical (all related data items are stored together).

If your data is normalized then you can’t live without join. Although, joins are fantastic feature but they have their own limitations.

If you are developing a critical solution that requires real time analytics and data processing and you have used a lot of joins then may be you will end up dealing with poor performing query which in-turn delay the output of your query.

I am not saying you to believe my words, please go and try it your own. Try to replicate the following simple scenario and experience it.

Let’s create two tables. One will hold the credit card spend category information and another will hold credit card usage information.

Note: This is just an example to evaluate the performance implications with Normalization. It has nothing to do with the actual working of Credit Cards.

CREATE TABLE [dbo].[CreditCardSpendCategory]
(
	[CategoryID]				INT NOT NULl IDENTITY(1, 1) PRIMARY KEY
	, [CategoryDescription]		VARCHAR(50)
)

CREATE TABLE [dbo].[CreditCardUsage]
(
	[TransactionID]				NUMERIC NOT NULl IDENTITY(1, 1) PRIMARY KEY
	, [CreditCardNumber]		VARCHAR(20)
	, [TransactionDate]			DATE
	, [TransactionDateTime]		DATETIME
	, [TransactionAmount]		NUMERIC(18, 2)
	, [MerchantName]			VARCHAR(50)
	, [CategoryID]				INT NOT NULL FOREIGN KEY REFERENCES [dbo].[CreditCardSpendCategory] ([CategoryID])
)

Let’s load these tables with some data for our analysis. Query will execute for a while, please have patience!

INSERT INTO [dbo].[CreditCardSpendCategory] ([CategoryDescription]) VALUES ('Fuel'), ('Dining'), ('Apparel'), ('Travel')

GO

SET NOCOUNT ON
INSERT INTO [dbo].[CreditCardUsage] ([CreditCardNumber], [TransactionDate], [TransactionDateTime], [TransactionAmount], [MerchantName], [CategoryID])
VALUES ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 1)
	, ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 2)
	, ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 3)
	, ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 4)
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 1)
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 2)
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 3)
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 4)
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 1)
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 2)
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 3)
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 4)
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 1)
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 2)
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 3)
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 4)
GO 1000000

I have used 4 credit card spend categories and 4 credit cards for our analysis.

Now, we want to perform some analytics over this data. We want to see the total spend for all the credit card spend categories. Let’s write the select query for it.

SELECT B.[CategoryDescription], SUM(A.[TransactionAmount]) AS [TotalSpend]
FROM [dbo].[CreditCardUsage] A (NOLOCK)
	INNER JOIN [dbo].[CreditCardSpendCategory] B  (NOLOCK)
		ON B.[CategoryID] = A.[CategoryID]
GROUP BY B.[CategoryDescription]

The query will take more than 6 seconds to execute. Let’s write another version of the same query for the same output.

; WITH cte_totalspend AS
(
	SELECT A.[CategoryID], SUM(A.[TransactionAmount]) AS [TotalSpend]
	FROM [dbo].[CreditCardUsage] A (NOLOCK)
	GROUP BY A.[CategoryID]
)

SELECT B.[CategoryDescription], A.[TotalSpend]
FROM cte_totalspend A
	INNER JOIN [dbo].[CreditCardSpendCategory] B  (NOLOCK)
		ON B.[CategoryID] = A.[CategoryID]

In this query we did join after aggregation. This query will take less than 3 seconds. We have more than 100 percent performance improvement.

We seen the example with normalization. Now, we will see the same example without normalization. We will keep the CategoryDescription column in CreditCardUsage table in denormalized form.

CREATE TABLE [dbo].[CreditCardUsage]
(
	[TransactionID]				NUMERIC NOT NULl IDENTITY(1, 1) PRIMARY KEY
	, [CreditCardNumber]		VARCHAR(20)
	, [TransactionDate]			DATE
	, [TransactionDateTime]		DATETIME
	, [TransactionAmount]		NUMERIC(18, 2)
	, [MerchantName]			VARCHAR(50)
	, [CategoryDescription]		VARCHAR(50)
)

GO

SET NOCOUNT ON
INSERT INTO [dbo].[CreditCardUsage] ([CreditCardNumber], [TransactionDate], [TransactionDateTime], [TransactionAmount], [MerchantName], [CategoryDescription])
VALUES ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Fuel')
	, ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Dining')
	, ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Apparel')
	, ('1', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Travel')
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Fuel')
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Dining')
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Apparel')
	, ('2', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Travel')
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Fuel')
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Dining')
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Apparel')
	, ('3', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Travel')
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Fuel')
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Dining')
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Apparel')
	, ('4', GETDATE(), GETDATE(), 10000, 'Merchant 1', 'Travel')
GO 1000000

Let’s write the select query for it.

SELECT A.[CategoryDescription], SUM(A.[TransactionAmount]) AS [TotalSpend]
FROM [dbo].[CreditCardUsage] A (NOLOCK)
GROUP BY A.[CategoryDescription]

This query will take less than 3 seconds.

Conclusion

If you are dealing with the high volume online data and require real time processing then think twice before implementing the normalization to the greater extent. If normalization is mandatory to be implemented then tweak you query or logic to avoid the excessive cost being utilized by joins.

There is no PANACEA solution so it’s always advisable to evaluate every solution and its trade-offs holistically before actually applying it. There could be two methods to deal with the problems – proactive and reactive. It’s always good to be proactive than reactive.

Disclaimer

This is a personal weblog. The opinions expressed here represent my own and not those of my employer.

All data and information provided on this site by Author is for informational purposes only. Author makes no representations as to accuracy, completeness, correctness, suitability, or validity of any information on this site and will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use.

Please don’t use any of the examples discussed in this article in Production without evaluating it based on your need.

Advertisements

2 thoughts on “Normalization vs Performance – SQL Server

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s