Integration Services (SSIS) internals and performance tuning

Back in 2014, I created a 214-page deck for a pre-con at SQLSaturday Johannesburg and Cape Town. Since then I’ve presented it at a good number of large and small conferences.
I don’t often work with SSIS anymore, so I thought I’d publish the entire deck here.
The material is quite text-heavy, so it should be readable. I compiled the deck from several sources (see the last four slides). Also, please note it doesn’t cover SSIS 2016 and the buffer optimizations that came in that version.

Integration Services (SSIS) internals and performance tuning

SQL Server Performance Troubleshooting Free Scripts and Tools List

Back in the day, I used to collect a lot of different scripts, tools, and other goodies for troubleshooting SQL Server performance issues. These days, however, I use what is freely available and keep a list of those in my head. 
I’ve been meaning to write that list down for a while, and today Chrissy asked for one. 


So here it is… 
Disclaimer: While I do work as a Premier Field Engineer for Microsoft, this is my list – this is not an official list from my employer. 

Free scripts and tools from Microsoft

These are scripts and tools provided for free by Microsoft that I use. Some of them come “as-is” with no official support.

  • BPCheck – This is by far my favorite script and one I use all the time. I like to joke that there is one good thing and one bad thing about this script. The good thing is that it gives me a lot of information. The bad thing is that it gives me a lot of information. It can be overwhelming the first time you run it, but once you have used it a few times, it is handy.
  • Pssdiag/Sqldiag Manager – This tool creates a customized pssdiag utility that uses sqldiag to collect diagnostics data, such as profiler traces, extended events, perfmon counters, and DMV information. I use this all the time for collecting data to troubleshoot performance problems. There is also a version for Linux available (see this blog post for how to use it).
  • SQL Nexus – Loads and analyzes performance data collected by Pssdiag. It loads the data into a database and does reporting.
  • PAL – The Performance Analysis of Logs (PAL) tool reads a perfmon counter log and analyzes it using different thresholds. It generates an HTML based report (or XML, if you prefer), which shows which counters are interesting. It has templates for SQL Server workloads but works with a lot of other workloads too.
  • RML – The Replay Markup Language (RML) utilities can analyze profiler traces or replay the trace file against another instance of SQL Server. For how to use it, read this blog post.
  • WinDbg – Windows Debugger. Useful for debugging a memory dump.
  • Tigertoolbox – This is the repository for the SQL Server Tiger Team and contains a lot of good stuff. Maybe you need some SQL Performance Baseline reports, need to fix your VLFs, or want to view waits and latches. It is also the home of BPCheck.
  • diskspd – Storage load generator / performance test tool. It replaces SQLIO. I use this if I want to benchmark the IO subsystem of a new machine / VM. There are two good blog posts about how to use diskspd here and here.
  • SQL Server Management Studio – SSMS is now a standalone application (it used to ship with the SQL Server installation) and got quite a few improvements in the newer versions.
  • sysinternals – Tools for advanced troubleshooting on Windows, which I use from time to time.
  • SQL Server Data Tools – Not a tool I use for performance troubleshooting, but it is in my toolbox for when I have to look at an SSIS package.
  • Visual Studio Code – My go-to editor for… everything.

Free scripts, tools, and resources from the community

The awesome SQL Server community has made a lot of things available.

  • SQL Server Diagnostic Information Queries – Before I discovered Glenn Berry‘s DMV queries, I used to collect scripts myself. Not anymore. Glenn updates his queries every month. While I prefer to use BPCheck to get information from the DMVs, I find that Glenn’s scripts are better structured and easier to use.
  • sp_whoisactive – Think sp_who2… but much better. This script by Adam Machanic shows which statements are being executed and can include the estimated execution plan, acquired locks, tempdb resource allocation, and more.
  • sp_helpindex rewrite – The sp_helpindex that ships with SQL Server does not show newer functionality (such as included columns and filters), so Kimberly Tripp rewrote it to include this information. Also, be sure to read her blog post on indexing strategies.
  • Ola Hallengren’s SQL Server Maintenance Solution – Ola’s scripts are great, easy to use, and make life easier for a lot of DBAs around the world. The index and statistics maintenance part of the script is an easy way to reduce index fragmentation and make sure your statistics are up to date.
  • Plan Explorer – While SSMS can show the execution plan just fine, Plan Explorer makes it a little easier to read.
  • Statistics Parser – Parses the output of SET STATISTICS IO and SET STATISTICS TIME and makes it a lot easier to read and analyze (see the snippet after this list).
  • Poor SQL – Ever gotten a SQL query that is poorly formatted or crammed onto one line? This site can format it into something readable.
  • dbatools – I don’t use these PowerShell modules for performance troubleshooting, but they are useful for checking best practices and other DBA tasks.
  • SQL Server Version List – I use this up-to-date list of SQL Server versions to check if any service packs or cumulative updates are missing.
  • SQL Server Wait Types Library and Latches Classes Library – While I remember the most common wait types and latch classes I see all the time, I sometimes run into one I haven't seen before or can't remember what it is. Paul Randal's libraries are handy resources for looking those up.
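
As a quick illustration of what Statistics Parser consumes, here is a minimal snippet that turns the statistics output on around a query (the table name is just a placeholder, not from any particular database):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Any query you want to analyze; dbo.SalesOrderDetail is only a placeholder table name
SELECT COUNT(*)
FROM dbo.SalesOrderDetail;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

The logical reads, CPU time, and elapsed time printed on the Messages tab can then be pasted into Statistics Parser to get a readable, tabular view.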

That’s it! If you have a free tool or script you use (or wrote yourself), leave a comment – would love to check it out!

SQL Server Performance Troubleshooting Free Scripts and Tools List

Regional and language settings for Oracle client used with SQL Server linked server

Today, I ran into a problem while migrating a Data Warehouse solution to another server. 
The solution has linked servers for extracting data from an Oracle source system. 
The query would go something like this: 

INSERT INTO dbo.MyTable 
SELECT *
FROM OPENQUERY(UNDPFSP1ICC, 'SELECT
 TO_CHAR(COL1_DT, ''YYYY'') AS Col1Year
 , COL2
 , COL3
 , SUM(AMOUNT) AS Amount
 , SYSDATE
FROM MYTABLE
GROUP BY
 TO_CHAR(COL1_DT, ''YYYY'')
 , COL2
 , COL3
')

I had migrated everything, but while running the extract job, which ran queries like the above, we would get errors like these: 
Conversion failed when converting the nvarchar value '123456,78' to data type int. 
Error converting data type nvarchar to numeric. 
The problem was that instead of using a dot (.) as the decimal symbol, a comma (,) was being returned. This would make the data convert to nvarchar, and an implicit conversion to a NUMERIC or INT would fail. 
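
To illustrate, here is a small repro of the two error messages; this is not from the original system, just the same implicit conversions run locally:

-- With a comma as the decimal symbol, the value arrives as a string SQL Server cannot convert
SELECT CONVERT(INT, N'123456,78')            -- Conversion failed when converting the nvarchar value '123456,78' to data type int.
SELECT CONVERT(NUMERIC(10, 2), N'123456,78') -- Error converting data type nvarchar to numeric.
SELECT CONVERT(NUMERIC(10, 2), N'123456.78') -- Works once the decimal symbol is a dot.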
Everything was running fine on the old server. The collation of the server and databases was identical (Latin1_General_CI_AS), but the regional and language settings were different. On the old server they were set to English (United States) and on the new server to Danish. Changing the settings on the new server to English (United States) did not help. 
After troubleshooting a bit on #sqlhelp with Amit Banerjee (twitter | blog) and William Durkin (twitter | blog), as well as via Skype with Jørgen Guldmann (twitter | blog), Jørgen ended up pointing me in the right direction. 
The problem was that the NLS_LANG parameter was set to Danish (where we use a comma (,) as the decimal symbol). It is stored in the registry under the following sub-key: 
HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\HOMEID\NLS_LANG 
Where ID is the unique number identifying the Oracle home. 
The trick was to change this sub-key from: 
DANISH_DENMARK.WE8MSWIN1252 
To: 
AMERICAN_AMERICA.WE8MSWIN1252 
And then everything worked fine, and the decimal symbol was now a dot (.) instead of a comma (,).
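
If you want to verify which NLS settings a session actually picks up through the linked server, a query along these lines should do it (NLS_SESSION_PARAMETERS is a standard Oracle view; I have not captured the output from this particular system, so treat it as a sketch):

SELECT *
FROM OPENQUERY(UNDPFSP1ICC, 'SELECT parameter, value
                             FROM nls_session_parameters
                             WHERE parameter IN (''NLS_LANGUAGE'', ''NLS_TERRITORY'', ''NLS_NUMERIC_CHARACTERS'')')

NLS_NUMERIC_CHARACTERS is the parameter that controls the decimal and group separators, so after the registry change it should show a dot as the decimal symbol.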

Regional and language settings for Oracle client used with SQL Server linked server

MSBuild error (and fix) with Visual Studio 2013 SQL Server tooling and SQL Server 2014

To show the possibilities with continuous integration with SQL Server development, I decided to spin up a virtual machine with Visual Studio 2013, Team Foundation Server 2013, and SQL Server 2014. While setting it up, I ran into an error that I thought I’d share here. 
If you want to know more about how to set up continuous integration for SQL Server development, Jamie Thomson (blog | twitter) wrote a very good blog post about continuous deployment of SSDT database projects to Windows Azure using Team Foundation Service. This approach can also be used for on-premises build servers using Team Foundation Server, which is what I was looking for. 
With Visual Studio 2013 you can no longer download SQL Server Data Tools separately, as SQL Server tooling is included. You will need to use either one of the paid versions of Visual Studio or one of the free Express editions.
If you wish to deploy to SQL Server 2014, you also need to update the SQL Server tooling from within Visual Studio.
While I was trying to set up a build server, I ran into this error: 
C:\Builds\3\MyProject\AdventureWorks Test Server Build\Sources\Main\AdventureWorks\AdventureWorks\AdventureWorks.sqlproj (63): The imported project "C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v11.0\SSDT\Microsoft.Data.Tools.Schema.SqlTasks.targets" was not found. Confirm that the path in the <Import> declaration is correct, and that the file exists on disk. 
In my .sqlproj file, the Import declaration referenced the SSDT targets file under the v11.0 path. The C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v11.0\SSDT path did not exist, but C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v12.0\SSDT did.
It looked like an old version was being picked up, so the solution I found was to override the VisualStudioVersion property in the build definition to use the newest version. This is what I did:
1. Find the build definition under Builds in Team Explorer
2. Click Edit build definition
3. Click Process
4. Click Advanced
5. Add /p:VisualStudioVersion=12.0 under MSBuild arguments
My build arguments ended up looking like this, after adding the Build, Publish, and SqlPublishProfilePath arguments:
/t:Build /t:Publish /p:SqlPublishProfilePath=AdventureWorks.Test.publish.xml /p:VisualStudioVersion=12.0 
After that, the build worked just fine.

MSBuild error (and fix) with Visual Studio 2013 SQL Server tooling and SQL Server 2014

Create Data Warehouse extract tables from SQL Server source system

I have often found that I needed a quick script for giving me CREATE TABLE statements for each table in a source system database, so I could create these tables in my extract schema in my Data Warehouse. This can be done using SQL Server Management Studio, but I found that I would then have to edit each statement to remove keys, indexes, identity columns, and everything else that I rarely want in my extract tables.
Not having been able to find one, I wrote my own and thought I'd share it here. The following script writes a CREATE TABLE statement for each table in a SQL Server source database, so I can use them for creating extract tables in a data warehouse.

DECLARE @ExtractSchema NVARCHAR(100) = 'Extract'

-- Generate one CREATE TABLE statement per source table, columns only
-- (no keys, indexes, identity, or other extras)
SELECT 'CREATE TABLE [' + @ExtractSchema + '].['
      + SCHEMA_NAME(t.schema_id)
      + '_' + t.name + '] (' + LEFT(cols.column_list, LEN(cols.column_list) - 1) + ')'
FROM sys.tables t
CROSS APPLY (SELECT '[' + c.name + '] ' + UPPER(ty.name)
                    + CASE WHEN ty.name IN ('decimal', 'numeric') THEN
                              '(' + CONVERT(NVARCHAR(10), c.precision)
                              + ', '
                              + CONVERT(NVARCHAR(10), c.scale)
                              + ')'
                      WHEN ty.name = 'float' THEN
                         '(' + CONVERT(NVARCHAR(10), c.precision) + ')'
                      WHEN ty.name IN ('datetime2', 'datetimeoffset', 'time') THEN
                         '(' + CONVERT(NVARCHAR(10), c.scale) + ')'
                      WHEN ty.name IN ('char', 'varchar') THEN
                         '(' + CASE WHEN c.max_length = -1
                                   THEN 'MAX'
                                   ELSE CAST(c.max_length AS NVARCHAR(10))
                              END + ')'
                      -- max_length is stored in bytes, so divide by 2 for the Unicode types
                      WHEN ty.name IN ('nchar', 'nvarchar') THEN
                         '(' + CASE WHEN c.max_length = -1
                                   THEN 'MAX'
                                   ELSE CAST(c.max_length / 2 AS NVARCHAR(10))
                              END + ')'
                      ELSE ''
                      END
                    + CASE WHEN c.is_nullable = 0 THEN ' NOT'
                      ELSE ''
                      END
                    + ' NULL, '
               FROM sys.columns c
               INNER JOIN sys.types ty
               ON c.user_type_id = ty.user_type_id
               WHERE t.object_id = c.object_id
               ORDER BY c.column_id
               FOR XML PATH('')
      ) cols (column_list)
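
As an example, for a hypothetical source table dbo.Customer the script would emit something along these lines (line breaks added here for readability; the script outputs each statement on a single line):

-- Hypothetical output for a source table dbo.Customer
CREATE TABLE [Extract].[dbo_Customer] (
    [CustomerID] INT NOT NULL,
    [CustomerName] NVARCHAR(100) NULL,
    [CreatedAt] DATETIME2(7) NULL,
    [Balance] DECIMAL(18, 2) NULL)
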
Create Data Warehouse extract tables from SQL Server source system

When does SQL Server partition split move data?

One of the many benefits of using table partitioning in SQL Server is the metadata-only operations, such as switch, split, and merge. However, if we are not careful, splitting a partition can potentially take a long time if data ends up being moved between two partitions.
In this blog post we will take a look at when a SQL Server partition split is a metadata-only operation, and when data is moved between partitions.
According to Books Online, we should avoid splitting or merging populated partitions:

Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement. Avoid splitting or merging populated partitions. This can be extremely inefficient, as this may cause as much as four times more log generation, and may also cause severe locking.

But let's take a look at what is actually happening when we split a partition. By using fn_dblog to read the transaction log, we can see whether or not a split causes data movement.
To demonstrate this, I have created a partitioned table with two partitions: one empty and one with 20,000 rows. The 20,000 rows are split between 10,000 rows with the date 20120301 and 10,000 rows with the date 20120501. I am using right range partitioning. Using the DMVs (see my previous blog post for a query that gives you information about the partitions), let's take a look at the two partitions:
[Screenshot: viewpartitions1]
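
For reference, a minimal sketch of a comparable setup could look like the following. The table, columns, and initial boundary value are assumptions for illustration; only the partition function name CreditPF is taken from the demo scripts below.

-- Sketch only: names and the initial boundary value (20120101) are assumptions
CREATE PARTITION FUNCTION CreditPF (INT)
AS RANGE RIGHT FOR VALUES (20120101)

CREATE PARTITION SCHEME CreditPS
AS PARTITION CreditPF ALL TO ([PRIMARY])

CREATE TABLE dbo.Credit (
    CreditID INT IDENTITY(1, 1) NOT NULL,
    DateKey  INT NOT NULL,    -- dates stored as YYYYMMDD integers, matching the boundary values below
    Amount   MONEY NOT NULL
) ON CreditPS (DateKey)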

Test #1: Split a partition with data on both sides of the new boundary point

In this test we split with a new boundary point 20120401, which means that we will have data on both sides of the new boundary point.

DECLARE @xact_id BIGINT
BEGIN TRAN
-- This is the actual split
ALTER PARTITION FUNCTION CreditPF()
SPLIT RANGE (20120401)
-- End of split
-- Get the xact ID from our current explicit transaction
SELECT @xact_id = transaction_id
FROM sys.dm_tran_current_transaction
COMMIT TRAN
-- Get the entries in the transaction log related to the
-- above transaction doing a split
SELECT [Current LSN], [Operation], [AllocUnitName], [Context]
     , [Transaction ID] , [Transaction Name], [Xact ID]
FROM ::fn_dblog(NULL,NULL)
WHERE [Transaction ID] = (
     SELECT TOP 1 [Transaction ID]
     FROM ::fn_dblog(NULL,NULL)
     WHERE [Xact ID] = @xact_id)
GO

This results in the following output where we can see that there are 20,258 entries in the transaction log.
[Screenshot: partitionsplit-test1]
If we take a further look, we can see that most of the entries are LOB_INSERT_ROWS and LOB_DELETE_ROWS, which tells us that 10,000 rows are inserted into the new partition and deleted from the old partition.
[Screenshot: partitionsplit-test1-page2]
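
To see this breakdown without scrolling through the raw output, the log records can also be aggregated by operation. A sketch like the following works; note that the @xact_id variable from the split batch does not survive the GO, so plug the Xact ID in manually:

DECLARE @xact_id BIGINT = 0   -- replace 0 with the Xact ID of the split transaction
SELECT [Operation], COUNT(*) AS log_records
FROM ::fn_dblog(NULL,NULL)
WHERE [Transaction ID] = (
     SELECT TOP 1 [Transaction ID]
     FROM ::fn_dblog(NULL,NULL)
     WHERE [Xact ID] = @xact_id)
GROUP BY [Operation]
ORDER BY COUNT(*) DESC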
After this split, the partitions now look like this:
[Screenshot: partitionsplit-test1-partitions]

Test #2: Split a partition with data on the left side of the new boundary point

In this test we split with a new boundary point 20120601, which means that we will have data on the left side of the new boundary point, and the right side will be empty.

DECLARE @xact_id BIGINT
BEGIN TRAN
ALTER PARTITION FUNCTION CreditPF()
SPLIT RANGE (20120601)
SELECT @xact_id = transaction_id
FROM sys.dm_tran_current_transaction
COMMIT TRAN
SELECT [Current LSN], [Operation], [AllocUnitName], [Context]
     , [Transaction ID] , [Transaction Name], [Xact ID]
FROM ::fn_dblog(NULL,NULL)
WHERE [Transaction ID] = (
     SELECT TOP 1 [Transaction ID]
     FROM ::fn_dblog(NULL,NULL)
     WHERE [Xact ID] = @xact_id)
GO

This results in the following output where we can see that there are only 19 entries in the transaction log.
[Screenshot: partitionsplit-test2]
Even though we have data in the partition that we split, it is a metadata-only operation, quite a different case from our previous test. Since we are using right range partitioning and the data is only on the left side of the new boundary point, there is no data movement. This makes sense, as the new partition being created is empty.
[Screenshot: partitionsplit-test2-partitions]

Test #3: Split an empty partition

In this test we split with a new boundary point 20120801, which means that we split the last partition which is empty.

DECLARE @xact_id BIGINT
BEGIN TRAN
ALTER PARTITION FUNCTION CreditPF()
SPLIT RANGE (20120801)
SELECT @xact_id = transaction_id
FROM sys.dm_tran_current_transaction
COMMIT TRAN
SELECT [Current LSN], [Operation], [AllocUnitName], [Context]
     , [Transaction ID] , [Transaction Name], [Xact ID]
FROM ::fn_dblog(NULL,NULL)
WHERE [Transaction ID] = (
     SELECT TOP 1 [Transaction ID]
     FROM ::fn_dblog(NULL,NULL)
     WHERE [Xact ID] = @xact_id)
GO

This results in the following output where we can see that there are only 19 entries in the transaction log.
[Screenshot: test3]
This result is expected as it is an empty partition we are splitting. Let’s take a look at our partitions.
[Screenshot: partitionsplit-test3-partitions]

Test #4: Split a partition with data on the right side of the new boundary point

In this test we split with a new boundary point 20120201, which means that we will have data on the right side of the new boundary point, and the left side will be empty.

DECLARE @xact_id BIGINT
BEGIN TRAN
ALTER PARTITION FUNCTION CreditPF()
SPLIT RANGE (20120201)
SELECT @xact_id = transaction_id
FROM sys.dm_tran_current_transaction
COMMIT TRAN
SELECT [Current LSN], [Operation], [AllocUnitName], [Context]
     , [Transaction ID] , [Transaction Name], [Xact ID]
FROM ::fn_dblog(NULL,NULL)
WHERE [Transaction ID] = (
     SELECT TOP 1 [Transaction ID]
     FROM ::fn_dblog(NULL,NULL)
     WHERE [Xact ID] = @xact_id)
GO

This results in the following output where we can see that there are 20,307 entries in the transaction log.
[Screenshot: partitionsplit-test4]
If we again take a further look, we can see that most of the entries are LOB_INSERT_ROWS and LOB_DELETE_ROWS, which suggests that all rows are inserted into the new partition and deleted from the old partition.
[Screenshot: partitionsplit-test4-page2]
If we look at our partitions, we can see that the old partition 2 is now empty (although 1 page is still allocated for it) and the new partition 3 contains all the rows. So when the data is on the right side of the new boundary point, and the left side is empty, it moves all rows.
[Screenshot: partitionsplit-test4-partitions]

Left range partitioning

If we had used left range partitioning instead of right range, the results would have been the opposite of those shown above. In that case, there would be data movement if there were data only on the left side of the new boundary point, and it would be a metadata-only operation if there were data only on the right side of the new boundary point.

Conclusion

While the recommendation in Books Online is to only split empty partitions, it is possible to split a partition with data and still get a metadata-only operation. However, this behavior is undocumented and could change. I have seen designs where the last (or first) partition was not kept empty, and as of SQL Server 2012 that seems safe, as long as we know that there will only be data on the left side (of the last partition, in right range partitioning) or the right side (of the first partition, in left range partitioning) of the new boundary point for the split.

When does SQL Server partition split move data?

Partitioning on Microsoft SQL Server

Here are a few links to articles, white papers, and best practices on partitioning on Microsoft SQL Server. I have these at the end of my slide deck on layered partitioning, which uses a combination of partitioned tables and a partitioned view, and thought I would put them up here as well.
Partitioning on Microsoft SQL Server

View partitions in SQL Server

This query is quite useful for looking at which partitions are in the database. It was originally taken from the book Microsoft SQL Server 2008 Internals (p. 438); I have since tweaked it a bit and added some more information.
I used this query in my demo today at SQLSaturday #152 in Bulgaria, where I talked about layered partitioning.

DECLARE @TableName NVARCHAR(200) = N'Facts.Credit'
SELECT SCHEMA_NAME(o.schema_id) + '.' + OBJECT_NAME(i.object_id) AS [object]
     , p.partition_number AS [p#]
     , fg.name AS [filegroup]
     , p.rows
     , au.total_pages AS pages
     , CASE boundary_value_on_right
       WHEN 1 THEN 'less than'
       ELSE 'less than or equal to' END as comparison
     , rv.value
     , CONVERT (VARCHAR(6), CONVERT (INT, SUBSTRING (au.first_page, 6, 1) +
       SUBSTRING (au.first_page, 5, 1))) + ':' + CONVERT (VARCHAR(20),
       CONVERT (INT, SUBSTRING (au.first_page, 4, 1) +
       SUBSTRING (au.first_page, 3, 1) + SUBSTRING (au.first_page, 2, 1) +
       SUBSTRING (au.first_page, 1, 1))) AS first_page
FROM sys.partitions p
INNER JOIN sys.indexes i
     ON p.object_id = i.object_id
AND p.index_id = i.index_id
INNER JOIN sys.objects o
     ON p.object_id = o.object_id
INNER JOIN sys.system_internals_allocation_units au
     ON p.partition_id = au.container_id
INNER JOIN sys.partition_schemes ps
     ON ps.data_space_id = i.data_space_id
INNER JOIN sys.partition_functions f
     ON f.function_id = ps.function_id
INNER JOIN sys.destination_data_spaces dds
     ON dds.partition_scheme_id = ps.data_space_id
     AND dds.destination_id = p.partition_number
INNER JOIN sys.filegroups fg
     ON dds.data_space_id = fg.data_space_id
LEFT OUTER JOIN sys.partition_range_values rv
     ON f.function_id = rv.function_id
     AND p.partition_number = rv.boundary_id
WHERE i.index_id < 2
     AND o.object_id = OBJECT_ID(@TableName);

The result is something like the following.
[Screenshot: query results]
The query shows:

  • object – The table name.
  • p# – The partition number.
  • filegroup – The filegroup the partition is located on. Note that in this result, I am using the same filegroup for all partitions, but usually you would probably not do this.
  • rows – Number of rows in the partition.
  • comparison – shows “less than” if you are using right range partitioning or “less than or equal to” if you are using left range partitioning.
  • value – The boundary point between the two partitions. This value is in either the right or the left partition of the boundary point, depending on whether you are using right range or left range partitioning.
  • first_page – The first file:page allocated for the partition (see the example below).
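
If you want to inspect the page behind the first_page value, it can be fed to the undocumented DBCC PAGE command. A hypothetical example, assuming first_page came back as 1:8840 in a database called MyDW:

DBCC TRACEON (3604)            -- send DBCC PAGE output to the client
DBCC PAGE ('MyDW', 1, 8840, 3) -- database, file id, page id, print option
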
View partitions in SQL Server