Windows IT Pro is the leading independent community for IT professionals deploying Microsoft Windows server and client applications and technologies.
  
  
  Advanced Search 



Subqueries and Joins for Running Aggregates

Use these solutions for small partitions
RSS
Subscribe to Windows IT Pro | See More SQL Server and Database Articles Here | Reprints | Or get the Monthly Online Pass—only $5.95 a month!

Download the Code Here

Executive Summary:
Both the subquery and join solutions perform reasonably well for calculating running aggregates when you have very small partitions. Because the subquery-based approach requires a separate scan of the data for each aggregate, the join approach is best for calculating multiple aggregates.

Running aggregates are calculations that are commonly used for data analysis. This article is the first of several articles covering various techniques to calculate running aggregates with a focus on performance. I’ll discuss how to analyze the complexity of a solution, as well as explain how to predict performance changes based on changing variables in the data or the solution. This article focuses on set-based solutions using subqueries and joins.

Running Aggregates
A running aggregate is an aggregate of a measure that keeps accumulating over an ordered sequence, possibly within partitions. Consider a table called Sales that contains employees’ daily sales quantities and values. Run the code in Listing 1 to create the Sales table in the tempdb database for demo purposes. (I provide the code to populate this table with sample data later in the article.)

An example of a running aggregate is a running sum of the quantity (or value) for each employee and day. That is, for each employee and day, calculate the sum of quantity from the beginning of the employee’s activity until the current day. The elements involved in the calculation of a running aggregate include the measure you are aggregating (qty in our example), the ordering attribute (dt in our case), and if relevant, the partitioning attribute (empid). This is the most basic and classic type of a running aggregate. Variations exist that have different boundary points to determine the window of rows that the aggregate operates on. However, I’ll stick to the classic form since my focus is performance and in this sense the variations should be similar for the most part.

Regarding indexing guidelines, typically you need to come up with a solution before you can evaluate indexing strategies for it. With running aggregates, though, most solutions would benefit from the same indexing strategy. The ideal index is a covering one created on the partitioning columns followed by the ordering columns as the index keys, and the measures as included nonkey columns. One way to achieve such an indexing strategy is to create a clustered index on the partitioning columns followed by the ordering columns as the clustered index keys. Since the leaf rows of the clustered index will contain all other columns from the table, you get coverage of the measures as well. Our Sales table has such an index, which was implicitly created because of the primary key defined on (empid, dt). If you need your clustered index to be defined differently for your own reasons, another way to implement this strategy is to create a nonclustered index on the partitioning columns and sort columns as keys, and on the attributes holding the measures as included nonkey columns. The index definition in our case would look like this (don’t run this command, because we already have a clustered index that supports the desired strategy):

CREATE UNIQUE INDEX idx_runagg ON dbo.Sales(empid, dt)
   INCLUDE(qty, val);


Next, I provide instructions to populate the Sales table with sample data before presenting solutions. First, run the code in Listing 2 to create a helper function called GetNums. GetNums accepts a number as input and returns a table result with a sequence of integers in the range 1 and the input number. Run the code in Listing 3 to populate the table with sample data. Notice in Listing 3 that you can set the number of partitions and the partition size. Currently, Listing 3 has 10,000 partitions with a partition size of 10. Later when I want to demonstrate the effect of changes in number of partitions or partition size, I’ll rerun the code in Listing 3 with other values as needed.

Set-Based Solution Using Subqueries
You might wonder why I’m covering both subqueries and joins in this article. The reason for this is that some people simply prefer to use subqueries, whereas others prefer to use joins. In addition, although some performance aspects are similar in both approaches, certain aspects of performance are different between the two approaches.

Let’s start with the subquery approach that Listing 4 shows. The outer query is against the Sales table, aliased as S1. From the outer instance you return the employee ID, quantity, and date. A subquery is in charge of calculating the running aggregate. The subquery is against a second instance of the Sales table, aliased as S2. The subquery filters the rows from S2 where the employee ID is the same as the one in S1, and the date is smaller than or equal to the one in S1. In the SELECT list, the subquery applies the SUM function to the qty attribute to aggregate all quantities from the qualifying rows. To evaluate the performance aspects of the solution, examine the execution plan for the solution query that Figure 1 shows.

As you can see, the plan shows that the query was processed with a Nested Loops join. The outer input of the join is the result of a clustered index scan representing the outer instance of the Sales table that was aliased as S1. For each of those rows, the loop applies activity against the inner input of the join, which is the instance of the Sales table aliased as S2. This activity involves a clustered index seek followed by a partial scan to fetch all rows with the same empid as in the outer row, and a dt that is smaller than or equal to the one in the outer row. Those rows returned from the clustered index seek and partial scan are then aggregated with a Stream Aggregate operator.

   Previous  [1]  2  3  4  5  6  7  8  9  10  Next 


Top Viewed ArticlesView all articles
Microsoft, News Corp. Discuss Locking Out Google

Microsoft and Rupert Murdoch's News Corp. recently discussed an alliance that would counter Google's fledgling online news service. ...

2009 Windows IT Pro Editors' Best and Community Choice Awards

Picking a favorite product from an impressive crowd of competitive offerings is never an easy task, and such was the case with our Editors' Best and Community Choice awards this year. ...

Command Prompt Tricks

One reader shares his tip for setting up the command prompt to reflect a remote path. ...


Related Events SQL Server Unleashed EMEA

The Easiest Way to Save Time and Money on E-mail and SharePoint Management

SQL Server for non-DB Specialists, Part II

Check out our list of Free Email Newsletters!

SQL Server and Database eBooks Safeguarding Your Windows Servers

SQL Server Administration for Oracle DBAs

Taking Control: Monitoring the Windows Platform Proactively

Related SQL Server and Database Resources Introducing Left-Brain.com, the online IT bookstore
Looking for books, CDs, toolkits, eBooks? Prime your mind at Left-Brain.com

Discover Windows IT Pro eLearning Series!
Clear & detailed technical information and helpful how-to's, all in our trademark no-nonsense format


Windows IT Pro Home Register FAQ for Windows WinInfo News
Europe Edition About Us Contact Us/Customer Service Media Kit Affiliates / Licensing  
SQL Server Magazine Office & SharePoint Pro DevProConnections IT Job Hound
Left-Brain.com Technology Resource Directory asp.netPRO ITTV Windows SuperSite 
 
 Windows IT Pro is a Division of Penton Media Inc.
 © 2009 Penton Media, Inc. Terms of Use | Privacy Statement