SQL SERVER - SCD Solution

2023-10-27

Let's look at the SCD solutions available in SQL Server (setting aside the SQL Server 2008 CDC approach). The following is excerpted from Expert SQL Server 2005 Integration Services.

The SCD Type 0, 1, and 2 introduced here differ somewhat from Kimball's Type 1, 2, and 3, so read carefully. The main differences: Kimball's Type 3 handles a change by adding a new attribute column, which the SSIS wizard does not cover; in its place, SSIS adds Type 0 and inferred members. A Type 0 attribute never changes, while an inferred member is a dimension row added while the fact table is being loaded. Ideally such rows would be handled inside a transaction, but much of the time, with very large data volumes, a transaction (especially a serializable one) is not feasible.


SSIS Slowly Changing Dimension Wizard

Now it’s time to look at the built-in support for dimension ETL, called the Slowly Changing Dimension (SCD) Wizard. The SCD Wizard is a data-flow transformation and initially works like all the other transformations: simply drag and drop the transformation into the data flow and connect it to the upstream source or transformation. Figure 4-16 shows the data flow that was used earlier in the chapter with the SCD transformation now connected to the output of the Union All.




Figure 4-16: Data flow with the SCD transformation connected to the output of the Union All

Double-clicking the transformation will invoke the wizard. Like other user interface wizards, it presents several windows that prompt you for configuration in order to build the dimension ETL process. One of the nice advantages of the SCD Wizard is that it allows for very rapid ETL development.

The SCD Wizard supports Type 1 changing attributes, Type 2 historical attributes, inferred members, and Type 0 fixed attributes, all out-of-the-box.

When the source is connected to the SCD component, and the wizard invoked, the first screen will prompt you to identify the target dimension table, then the mapping of source columns from the data flow pipeline to the dimension columns in the dimension table, and, finally, the business keys in the source and dimension table. Figure 4-17 shows the mapping between the source rows generated for the product dimension and the dimension columns themselves.



Figure 4-17: Mapping between source rows generated for the product dimension and the dimension columns

Note that the data types must match in order for the source columns to be compared with the dimension table columns, which may require use of the Data Conversion transformation in the upstream data cleansing logic. For the wizard to handle the matching automatically, the column names must be the same. Matching names is not a requirement, because the matching can be done manually. Furthermore, if you have any columns in the dimension table that are not attributes, but rather management or metadata columns (such as StartDate and EndDate columns that identify when the row is active), these will not be matched. Later options in the wizard will give you the opportunity to specify usage of these.
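
If you prefer not to add a Data Conversion transformation, one alternative is to cast the columns in the source query itself so that the pipeline data types already line up with the dimension table. The following is a minimal sketch against the AdventureWorks Production.Product source; the column list is abbreviated for illustration:

-- Hedged sketch: cast source columns to match the dimension column data types,
-- so no Data Conversion transformation is needed downstream (column list abbreviated)
SELECT  CAST(p.ProductNumber AS NVARCHAR(25)) AS ProductAlternateKey,
        CAST(p.Color         AS NVARCHAR(15)) AS Color,
        CAST(p.StandardCost  AS MONEY)        AS StandardCost
FROM AdventureWorks.Production.Product AS p;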

The final step in the first screen is to identify the business key or keys. In Figure 4-17, the ProductAlternateKey is the business key manually matched to the source column ProductID (which has been aliased as ProductAlternateKey). The business keys help identify dimension records that need to be added as new members. The business keys also provide part of the equation on identifying matching records that need to be evaluated for changes.

The next screen of the wizard is about associating the dimension columns that have just been matched with the dimension change type. The wizard does not use the common dimension changing type numbers (Type 0, Type 1, or Type 2); rather, it uses the descriptive terms to identify the type of change (fixed, changing, or historical).

Figure 4-18 shows the matching of the dimension changes. The column on the left contains any non-business key matches identified in the prior screen, and the column on the right is a drop-down of the three different change types.




Figure 4-18: Matching of the dimension changes

In this example, the following groupings were made:

  • Fixed attributes - Size (a change in Size will generate a new product; therefore, this attribute is fixed)

  • Historical attributes - Class, ModelName, ListPrice

  • Changing attributes - Color, DaysToManufacture, EnglishProductName, FinishedGoodsFlag, FrenchProductName, ProductLine, ProductSubcategoryKey, ReorderPoint, SafetyStockLevel, SizeRange, SizeUnitMeasureCode, SpanishProductName, StandardCost, Style, Weight, WeightUnitMeasureCode

  Important 

The matching shown in Figure 4-18 can be a tiresome process if you have dozens of dimension columns and you are using your mouse to select the column in the drop-down. A quicker way to fill in these values is to use the Tab key and the up and down arrow keys. After finishing a row, the Tab key will send the cursor to the next row and automatically select the next column in the list.

Now that you have defined the column matches and identified the change types, the next few screens will help you to manage the advanced requirements for the dimension ETL process.

In the next screen, shown in Figure 4-19, some specific requirements are asked about Fixed attribute and Changing attribute members.




Figure 4-19: Fixed and Changing Attribute Options screen

If you do not identify any Fixed attribute or Changing attribute columns, then the respective detail questions shown in the screen shot will be grayed out.

The option for fixed attributes asks, “If there is a change in a fixed attribute, what should happen?”

  • By leaving the box unchecked, the change will be ignored and the value in the dimension table will stay the same as it was originally.

  • By selecting the box, if a change is detected in a Fixed attribute column, then the transformation will intentionally generate an error, so the data can be reviewed. This may be useful if, when a change is identified, it indicates a bigger problem.

The option for changing (Type 1) attributes identifies which records to update when a change happens. In other words, if the dimension record has any Type 2 historical attributes, there may be multiple records for the same business key.

  • When the check box is selected, all the related records (the current dimension member and the outdated members) are updated.

  • When the option is cleared, only the current record gets updated with the new value of the column. For example, if there’s a change in Color from blue to aqua, and the particular business key had five historical changes, then only the current record would be updated with aqua; the rest of the historical records remain as blue. The difference is sketched below.
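
To make the difference concrete, here is a rough T-SQL sketch of the two behaviors; the business key value and the use of a NULL EndDate as the current-row marker are assumed for illustration:

DECLARE @BusinessKey NVARCHAR(25);
SET @BusinessKey = N'BK-R93R-62';   -- illustrative business key

-- Check box selected: every row for the business key gets the Type 1 value
UPDATE dbo.DimProduct
SET Color = N'aqua'
WHERE ProductAlternateKey = @BusinessKey;

-- Check box cleared: only the current row is updated
UPDATE dbo.DimProduct
SET Color = N'aqua'
WHERE ProductAlternateKey = @BusinessKey
  AND EndDate IS NULL;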

If any historical attributes have been selected, then the next screen will prompt you to specify how the current row is identified. Figure 4-20 shows the Historical Attribute Options screen.




Figure 4-20: Historical Attribute Options screen

The two choices to identify a current row are:

  • Use a flag column (usually Boolean) to identify whether a row is current or not. Besides just selecting the column that should be used, the SCD Wizard also supports defining what values identify a current record and expired record. Note also that the columns in the drop-down list are any dimension columns that have not been matched from the source. That is because these are considered metadata columns that are used for management purposes like this.

  • Use a StartDate and EndDate combination to manage when a dimension record is active. The StartDate and EndDate column selections need to be dimension table columns defined with a datetime data type. Furthermore, one other option exists if taking the approach of start time and end time columns: choosing which package variable should be used as the value to update the record with. In other words, when a new dimension member needs to be added because there is a change in a Type 2 historical attribute, the prior record needs to be first updated with a new EndDate and the new record needs to be initiated with a new StartDate. Any system or user variable can be used. One good choice is the System::StartTime variable, which is automatically updated when the package is executed. In this example, the StartDate and EndDate columns are used in conjunction with the System::StartTime variable.

  Important 

Choosing to use StartDate and EndDate columns will give you the advantage of knowing exactly when a particular dimension member was active, either for reporting purposes or even to handle the scenario when a fact record arrives late and a historical dimension record needs to be selected instead of the current member.
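
As a hedged illustration of that late-arriving fact scenario (the variable values are made up, and the NULL EndDate convention from this example is assumed), the fact ETL can pick either the current member or the member that was active on the fact date:

DECLARE @BusinessKey NVARCHAR(25), @OrderDate DATETIME;
SET @BusinessKey = N'BK-R93R-62';   -- illustrative business key
SET @OrderDate  = '20030701';       -- the late-arriving fact's date

-- Current member
SELECT ProductKey
FROM dbo.DimProduct
WHERE ProductAlternateKey = @BusinessKey
  AND EndDate IS NULL;

-- Member that was active when the fact occurred
SELECT ProductKey
FROM dbo.DimProduct
WHERE ProductAlternateKey = @BusinessKey
  AND StartDate <= @OrderDate
  AND (EndDate > @OrderDate OR EndDate IS NULL);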

The SSIS inferred member support is found in the next screen. As a reminder, inferred members are added during the fact ETL and updated during the dimension ETL. The built-in support for inferred members revolves around the dimension update. Figure 4-21 shows the options available for inferred members.




Figure 4-21: Options available for inferred members

First of all, inferred member support is either enabled or disabled. If enabled, the SCD Wizard needs to know how to identify whether a record is an inferred member. The two choices are to leave all the dimension columns in the dimension table as NULL, or to use an Inferred Member Boolean column in the table that identifies which rows are inferred members.

Since the product dimension table in the AdventureWorksDW database does not have an inferred member column, the first choice is selected.

  Important 

Using the first choice for inferred members (all attributes contain NULL values) is not often practical because, first of all, it assumes that the columns in the dimension table allow NULL values, and secondly, it makes for difficult querying. Using an unknown value, for example, is often a better way to see data for a reporting system. In addition, if Analysis Services is used for analytics, NULL values are also not a good choice. The best practice is to define an inferred member column and handle the identification by setting the Boolean value to True or False.
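
A minimal sketch of that best practice follows. The InferredMemberFlag column is hypothetical (it does not exist in AdventureWorksDW), and the INSERT assumes the remaining dimension columns either allow NULLs or have defaults:

-- Hypothetical flag column added to the dimension table
ALTER TABLE dbo.DimProduct
ADD InferredMemberFlag BIT NOT NULL CONSTRAINT DF_DimProduct_Inferred DEFAULT (0);

-- During the fact ETL, a placeholder member is inserted with the flag set
INSERT INTO dbo.DimProduct (ProductAlternateKey, EnglishProductName, InferredMemberFlag, StartDate)
VALUES (N'BK-NEW-0001', N'Unknown', 1, GETDATE());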

The final screen, not shown, presents a summary of the outputs that will be created. By selecting Finish on this screen, the SCD Wizard will take all the configuration options and create several downstream transformations and destinations. Figure 4-22 shows the end result of the SCD Wizard: the SCD transformation remains, but it contains several outputs to handle the different attribute types.




Figure 4-22: End result of the SCD Wizard

Since SSIS dynamically builds the data flow, the resulting layout in this example may not look exactly like your testing. However, the functionality is the same. For this example, on the left are changing attributes. Down the center are new members and historical attributes. On the right are inferred members. Starting with the SCD transformation, when a dimension row is processed, the SCD will determine which (if any) attribute changes occur, and whether there are new members to be added or inferred members to update. Consider the SCD transformation to be like a Conditional Split; it evaluates every row, one at a time, and routes the records to different outputs.

The simplest output to understand is the Changing Attributes Updates Output, which is linked to the OLE DB Command 2 transformation, connected to the AdventureWorksDW database. The following code shows the SQLCommand property, which defines the UPDATE statement:


UPDATE [dbo].[DimProduct]
SET [Color] = ?
,[DaysToManufacture] = ?
,[EnglishProductName] = ?
,[FinishedGoodsFlag] = ?
,[FrenchProductName] = ?
,[ProductLine] = ?
,[ProductSubcategoryKey] = ?
,[ReorderPoint] = ?
,[SafetyStockLevel] = ?
,[SizeRange] = ?
,[SizeUnitMeasureCode] = ?
,[SpanishProductName] = ?
,[StandardCost] = ?
,[Style] = ?
,[Weight] = ?
,[WeightUnitMeasureCode] = ?
WHERE [ProductAlternateKey] = ?

What you should note in the UPDATE statement is that only the columns that were defined as changing attributes (Type 1) are included, simply because this output is only for the Type 1 changing attributes. Also notice that the SQL statement is an OLE DB parameterized statement with question marks, which is the way that the OLE DB provider handles parameterization. Figure 4-23 shows the Column Mappings tab, which maps (in order) the pipeline input columns to the parameterized query.


Figure 4-23: Column Mappings tab

The order of the question marks defines the order of the mappings.

The second output is the Inferred Member Updates Output. This output is very similar to the Changing Attributes Updates Output because it also performs an UPDATE statement. Just like the first output, the inferred member output uses an OLE DB Command transformation to handle the updates (in this case, the OLE DB Command 1 transformation). The UPDATE statement defined in the SQLCommand property is as follows:


UPDATE [dbo].[DimProduct]
SET [Class] = ?
,[Color] = ?
,[DaysToManufacture] = ?
,[EnglishProductName] = ?
,[FinishedGoodsFlag] = ?
,[FrenchProductName] = ?
,[ListPrice] = ?
,[ModelName] = ?
,[ProductLine] = ?
,[ProductSubcategoryKey] = ?
,[ReorderPoint] = ?
,[SafetyStockLevel] = ?
,[Size] = ?
,[SizeRange] = ?
,[SizeUnitMeasureCode] = ?
,[SpanishProductName] = ?
,[StandardCost] = ?
,[Style] = ?
,[Weight] = ?
,[WeightUnitMeasureCode] = ?
WHERE [ProductAlternateKey] = ?

The difference, as you would expect, is that there are more columns in the UPDATE statement. Not only are the Type 1 changing attributes updated, but also the Type 2 historical attributes. Included in the UPDATE statement are the Class, ModelName, and ListPrice columns, which were defined as Type 2 historical attributes. These are updated because of the nature of an inferred member, which requires updates to all the columns without generating a new record. Furthermore, if you had defined an Inferred Member Flag column, this is where the inferred member column would also be updated. Just like the first OLE DB Command transformation, the order of the question marks defines the order of the mapping.
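
If such a flag column existed, the generated statement could be extended along these lines (a hedged fragment, with most SET clauses elided and the InferredMemberFlag column assumed):

UPDATE [dbo].[DimProduct]
SET [Class] = ?
   ,[ListPrice] = ?
   -- ...the remaining changing and historical columns...
   ,[InferredMemberFlag] = 0   -- the row is no longer an inferred member
WHERE [ProductAlternateKey] = ?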

The next two outputs to consider are also related. Both the New Output (new members to be added) and the Historical Attribute Inserts Output add rows to the dimension table. What you can see in Figure 4-22 earlier is that a Union All transformation is used to bring these two outputs together for a single destination insert.

If the business key from the source does not exist in the dimension table, it’s identified as a new member that needs to be inserted. The New Output that handles new members goes directly to the Union All transformation. With the Historical Attribute Insert Output, before the records are brought together in the Union All, a couple of metadata management tasks need to happen. Whether the new historical attribute record is marked as current through a combination of dates or a separate column, the old record needs to be updated before the insert can happen. Either the End Date column is updated, or a current flag column is updated, which is handled in a two-step process:

  1. The Derived Column transformation that is attached to the Historical Attribute Insert Output adds either an EndDate column to the data flow (as in this example), or it adds the expired flag value. Figure 4-24 shows the Derived Column editor that defines an EndDate column and then also uses the System::StartTime variable as specified in the SCD Wizard.


    Figure 4-24: Derived Column editor

  2. Another OLE DB Command transformation is used to update the End Date for the expired record, based on the business key and the current indicator (in this case, the record to be expired will currently have an End Date of NULL). The SQLCommand property of this OLE DB Command transformation is as follows:

    
    UPDATE [dbo].[DimProduct]
    SET [EndDate] = ?
    WHERE [ProductAlternateKey] = ? AND [EndDate] IS NULL
    

One nice feature of the OLE DB Command transformation, as this example shows, is that the records from the pipeline can perform the UPDATE statement and then still be available downstream for other purposes. After the expired record has been updated, the record is ready to be “union-ed” with the new member output.

Before the final insert for the new members and Type 2 historical attributes, one final step is necessary: you need to add the Start Date (or current record flag). This is handled by a Derived Column transformation, which adds a column to the pipeline called StartDate (in this example). Figure 4-25 shows the details of the Derived Column transformation that falls just below the Union All transformation and before the OLE DB destination (from the data flow shown earlier in Figure 4-22).


Figure 4-25: Details of the Derived Column transformation

Just like the EndDate, the StartDate uses the System::StartTime variable, which is applied to the new record. New records that are new members or historical-attribute inserts require the EndDate to be NULL. (If you have specified a current indicator, you put the indicator’s current value in this Derived Column transformation.)

When executed, the SCD transformation routes the rows from the source to the different outputs. Notice in Figure 4-26 that not all of the 504 input rows are sent to an output. This is because some of the records do not go through any changes, so they are effectively ignored.



Figure 4-26: Not all input rows are sent to an output

Advanced Properties and Additional Outputs of the SCD

There are two advanced properties of the SCD that can be used to customize how the SCD transformation handles the processing:

  • The CurrentRowWhere property identifies how the current dimension row for a source record is identified. If you have configured the SCD to use a Current Flag column, then, by default, this property would filter on the Current Flag where the value you defined in the wizard is current. Alternately, if you specified Start Date and End Date columns, then, by default, the property would assume that the End Date column IS NULL. This property can be changed if you need to redefine how the SCD transformation searches for the current record. For example, if your organizational standards do not allow NULL values, then you would have to modify this property to check the End Date for the default value set for the column (oftentimes, a date far out in the future is used to accommodate this, such as 1/1/2050). A sketch of this modification appears after this list.

  • The SQLCommand property contains the SQL syntax used for the lookup against the dimension table to either determine if a new member needs to be created, or if there have been any changes in a Type 0 fixed attribute, Type 1 changing attribute, or Type 2 historical attribute.
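
As a hedged illustration of the first property, the CurrentRowWhere filter might change from the default to a sentinel-date check like this (the 1/1/2050 value is only an example):

-- Default filter when StartDate/EndDate columns are used
[EndDate] IS NULL

-- Modified filter when NULL end dates are not allowed and a far-future sentinel is used
[EndDate] = '2050-01-01'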

Both of the properties referenced here can be found in the Properties window when the SCD transformation is selected, as shown in Figure 4-27, or by looking at the advanced editor of the SCD transformation.


Figure 4-27: SCD Transformation Properties window

Furthermore, the SCD transformation contains two additional outputs that are not used by default, but are useful for auditing and data validation.

The first output enables you to capture the rows that have not gone through any change. Very likely, if you are pulling the entire dimension source (as opposed to just targeting new and changed records), you will have many dimension records from the source that are unchanged or are completely in synch with the dimension table. Although a change has not happened, you may have a need to count the number of rows that are unchanged, or capture the unchanged rows in a table or file for review. The Unchanged Output is accessible by selecting the green path output from the SCD transformation and connecting it to another transformation or destination. When this is done, you will be prompted to choose from the remaining outputs that are not yet used. In Figure 4-28, a Row Count transformation is used to capture the number of rows that are unchanged into a variable that is later captured for auditing purposes.


Figure 4-28: Row count transformation used to capture the number of unchanged rows

The second additional output is the Fixed Attribute Output, which sends out any rows where a fixed attribute column has changed when it should not have. Rather than a Row count, a better use of this output is to capture the records to a staging table for review, since a change was not supposed to happen.

  Tip 

Only when the Ignore fixed attribute changes option is selected will this output be used. Otherwise, if a fixed attribute change occurred, the SCD would intentionally fail.

Slowly Changing Dimension Wizard Advantages and Disadvantages

The SCD Wizard is a very powerful tool, and will be appreciated by ETL developers who commonly deal with managing complicated ETL processes for dimensions. Several benefits will be achieved by using the built-in SCD support. However, there are also a few limitations surrounding the SCD that should be mentioned.

The advantages focus on management, development, and standardization, including the following:

  • Simplicity - The SCD Wizard can handle most dimension scenarios. It makes the often complicated dimension processing straightforward, and helps standardize ETL development for dimensions.

  • Rapid development - The SCD can save time in the development lifecycle by reducing the design and development time, and also easing the management and testing. This leaves more time for other areas of an ETL process.

  • Wizard allows changes - If the inputs entered into the SCD Wizard require changing, the wizard can be re-invoked, and these changes will propagate down to the downstream-generated components automatically. A caveat to this is presented in the limitations section.

  • Customized output transformations - Since the SCD Wizard generates transformations rather than taking a black-box approach, the output can be customized. For example, the OLE DB Command transformation used for Type 1 changing attributes can be removed and replaced with a staging table in order to achieve set-based updates, which often perform faster than row-by-row updates.

  • Beyond dimension processing - The SCD transformation can be used beyond just dimension processing, such as table synchronization. Even though the name suggests that the SCD Wizard focuses exclusively on dimension processing, one alternate use is to leverage just the Type 1 changing attribute support (and the included new member support).

The limitations of the SCD support focus mostly on scalability for large-dimension scenarios:

  • Dimension table lookup scalability - The dimension table is not cached in memory. Therefore, for every row coming in from the source, a separate lookup statement is sent to the dimension table in the underlying relational engine.

  • All updates are row-based - Relational updates are required for the Type 1 changing attribute output, the Inferred Member output, and the Type 2 historical attribute output (to expire the previous record). Because the OLE DB Command transformation is employed, every row coming through these transformations sends a separate UPDATE statement to the dimension table, in a cursor-like fashion. When dealing with several thousand updates, this can be limiting.

  • Customized outputs are overwritten by changes - Although the wizard can be re-run (with the prior run’s values remaining), if you have customized the output and then run through the wizard again, the wizard will overwrite any changes you made when it finishes (the customized transformations will be orphaned by a new set of transformations). Be careful with this if you are making customizations.

  • Locking issues and inserts - At the same time, data may be queried from the dimension table for comparison, updated in the dimension table to handle a Type 1 change, and inserted for new members and Type 2 historical records. All this activity on the dimension table at one time can slow down the dimension ETL process. Furthermore, the inserts cannot take advantage of the Fast Load option because of the locking contentions, thus resulting in row-by-row inserts.

Optimizing the Built-in Slowly Changing Dimension Support

Later in this chapter, we examine writing an SSIS package for dimension ETL without using the built-in SCD Wizard support. However, if you are dealing with a large dimension table, a couple of techniques can be used to achieve better performance.

Index Optimizations

Since the dimension lookups and updates both are row-by-row operations, be sure to check the indexes on your dimension table to speed up the identification of the current record. If you are seeing very poor performance with the SCD (anything less than approximately 2,000 rows per minute), then chances are the SCD lookups are requiring relational table scans or bookmark lookups after identifying the record key. For best ETL optimization, create your dimension table’s clustered index on the business key, rather than the dimension surrogate key. Including the current indicator flag or end date as the second column in the index will improve the performance even more.
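
A hedged sketch of such an index follows; it assumes the existing clustered primary key on the ProductKey surrogate key is first rebuilt as nonclustered:

-- Cluster on the business key, with the current-row indicator (EndDate) as the second column
CREATE CLUSTERED INDEX IX_DimProduct_BusinessKey_EndDate
ON dbo.DimProduct (ProductAlternateKey, EndDate);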

  Important 

Index optimization must be balanced between the ETL and the query usage. Optimization for query patterns should take priority over optimizations for ETL. However, some situations may require ETL-focused optimization as a priority to achieve the service-level agreements (SLAs) identified for processing times. Also, be cautious that too many indexes can slow down operations.

Update Optimizations

Dimension table updates to handle Type 1 changing attributes are a common occurrence in any dimension table ETL process. Although the inferred member output also requires updates, the number of rows will typically be a fraction of the number of Type 1 changing attributes, since inferred members are considered an exception to the rule. Because the updates are row by row (this is the way the OLE DB Command transformation works), dealing with thousands of updates will create a processing bottleneck. One way to improve performance is to replace the OLE DB Command update with a set-based update approach. In Figure 4-29, the OLE DB Command that handles the Type 1 changing output has been replaced with an OLE DB Destination to a staging table.


Figure 4-29: OLE DB Destination to a staging table

Using a staging table for this scenario improves performance, because it allows a single set-based UPDATE statement to be run. The UPDATE statement is handled with an Execute SQL Task in the control flow.

The set-based statement consists of an inner join between the staging table and the dimension table across the business key, where any matching rows (already identified as requiring a Type 1 changing update) will have their attributes updated. The following T-SQL code performs this operation:


UPDATE [dbo].[DimProduct]
SET
 [Color] = STG.[Color]
,[DaysToManufacture] = STG.[DaysToManufacture]
,[EnglishProductName] = STG.[EnglishProductName]
,[FinishedGoodsFlag] = STG.[FinishedGoodsFlag]
,[FrenchProductName] = STG.[FrenchProductName]
,[ProductLine] = STG.[ProductLine]
,[ProductSubcategoryKey] = STG.[ProductSubcategoryKey]
,[ReorderPoint] = STG.[ReorderPoint]
,[SafetyStockLevel] = STG.[SafetyStockLevel]
,[SizeRange] = STG.[SizeRange]
,[SizeUnitMeasureCode] = STG.[SizeUnitMeasureCode]
,[SpanishProductName] = STG.[SpanishProductName]
,[StandardCost] = STG.[StandardCost]
,[Style] = STG.[Style]
,[Weight] = STG.[Weight]
,[WeightUnitMeasureCode] = STG.[WeightUnitMeasureCode]
FROM [dbo].[DimProduct]
INNER JOIN [SSISOps].[dbo].[stgDimProductUpdates] STG
ON [DimProduct].[ProductAlternateKey]
 = STG.[ProductAlternateKey]

Be sure to truncate your staging table for every ETL run (by adding an Execute SQL Task at the start of the control flow); otherwise, you will be updating data from old rows from a prior execution.
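
For reference, a minimal sketch of the staging table and the truncate step follows; the table name matches the join above, but the column list is abbreviated and the data types are assumed to mirror the dimension:

CREATE TABLE [SSISOps].[dbo].[stgDimProductUpdates] (
    ProductAlternateKey NVARCHAR(25) NOT NULL,
    Color               NVARCHAR(15) NULL,
    DaysToManufacture   INT          NULL,
    StandardCost        MONEY        NULL
    -- ...one column per Type 1 changing attribute, matching the dimension data types
);

-- Execute SQL Task at the start of the control flow
TRUNCATE TABLE [SSISOps].[dbo].[stgDimProductUpdates];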

Handling Advanced Dimension Processing with the Slowly Changing Dimension Support

This section examines how to handle the ETL for advanced dimension forms, combining the functionality of the SCD with other out-of-the-box transformations, focusing on snowflake dimensions, parent-child dimensions, and date dimensions.

Snowflake Dimension Tables

A snowflake table, as briefly described in the beginning of this chapter with the product dimension (see Figure 4-2 earlier), requires some unique ETL handling aspects. In a snowflake dimension, the higher-level tables (subcategory and category in this example) also have surrogate keys. With the product snowflake dimension, the ProductCategoryKey cascades down as a foreign key in the DimProductSubCategory table, and the ProductSubCategoryKey cascades down to a foreign key relationship in the DimProduct table. The ProductKey itself relates directly to the fact table, whereas the surrogate keys in the category and subcategory tables do not relate directly to the reseller fact table.

The design is called a snowflake because when viewed in relationship to the fact table, the table layout looks like a snowflake as opposed to a star. (A star schema has dimension tables one level out from the fact table. A single table dimension is often called a star dimension.) Generally, most dimensions are designed as a star dimension. However, there are two very valuable reasons to break out a dimension table into a snowflake design:

  • When a dimension table has several attributes that relate directly to a higher level within a dimension hierarchy, managing those dimension changes can be a lot easier with a snowflake design. For example, say the product subcategory table contains the English, Spanish, and French names of the subcategory. If these columns were included in the base product table and the subcategory changed for the dimension, it would be much more difficult to ensure that the Spanish and French names stayed in synch with the subcategory. If they did not, reports on these attributes would carry the incorrect association and be misleading.

  • A second and more compelling reason to use a snowflake is when you have multiple fact tables related to the same dimension table at different levels. For example, if the Reseller Sales fact table tracked sales at the product level, but the Sales Quota facts were assigned to a product category, and the product dimension only had one table, the category could not be effectively joined. Using a snowflake design, the fact tables can be related to one another because they share a common table at different levels of the product dimension hierarchy. Sharing dimension tables between fact tables is a driving emphasis of dimensional modeling called conformed dimensions.

  Important 

Analysis Services supports having a standard dimension relate to different measure groups at different levels. (A measure group in Analysis Services is equivalent to a fact table in the database.) The Analysis Services engine naturally understands how to associate this data and perform the right aggregations.

A straightforward method for processing a snowflake dimension in SSIS is to use multiple SCDs embedded in different data flows, linked by precedence constraints in the control flow, starting at the top level of the snowflake tables and working down to the lowest level. Figure 4-30 shows the control flow of the product dimension package. Note that the very first task is an Execute SQL Task that truncates the staging table used for the set-based update, followed by the three Data Flow Tasks.


Figure 4-30: Control flow of the product dimension package

The first data flow shown is the product category dimension table, followed by the subcategory dimension table, and concluding with the product data flow. The final Execute SQL Task handles the set-based updates for Type 1 changing attributes as discussed earlier.

The product category data flow is the most straightforward because it only requires one Lookup transformation to get the category translation attributes, and, furthermore, it only contains Type 1 changing attributes. Figure 4-31 shows the product category data flow.


Figure 4-31: Product category data flow

Next, you process the product subcategory. When processing data within intermediate snowflake tables in the hierarchy, a lookup is also required to reference the parent table surrogate key. For example, as you process the product subcategory table, include a Lookup transformation to pull the surrogate key of the category, as shown in Figure 4-32.


Figure 4-32: Subcategory lookup to pull the surrogate key of the category
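
The same step expressed in T-SQL looks roughly like the following; the Lookup's reference is the category table, and Staging.ProductSubcategory stands in for whatever cleansed source feeds this data flow:

-- Reference query behind the Lookup transformation
SELECT ProductCategoryKey, ProductCategoryAlternateKey
FROM dbo.DimProductCategory;

-- Equivalent set-based join: each subcategory row picks up the category surrogate key
SELECT s.ProductSubcategoryAlternateKey,
       s.EnglishProductSubcategoryName,
       c.ProductCategoryKey              -- surrogate key carried forward into the SCD
FROM Staging.ProductSubcategory AS s     -- hypothetical cleansed source
JOIN dbo.DimProductCategory AS c
  ON c.ProductCategoryAlternateKey = s.ProductCategoryAlternateKey;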

The ETL processing in the lowest-level product table has already been discussed. Not mentioned, but also included in the data preparation steps for the product dimension, is the need to pull the surrogate key of the subcategory table.

Parent-Child Dimension ETL

The next dimension design that we will consider is the parent-child dimension. A parent-child is a self-referencing dimension table and also has special ETL requirements. Simply put, a parent-child dimension has a surrogate key and a parent surrogate key, which gives the dimension a self reference. In addition, parent-child dimensions usually have a business key and a parent business key. The self-referencing business key nicely illustrates the parent relationship in the table.
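
A minimal sketch of such a table is shown below; the column names follow the AdventureWorksDW DimEmployee table, but the table itself is hypothetical:

CREATE TABLE dbo.DimEmployeeSketch (
    EmployeeKey                          INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    ParentEmployeeKey                    INT NULL,                               -- parent surrogate key (self reference)
    EmployeeNationalIDAlternateKey       NVARCHAR(15) NULL,                      -- business key
    ParentEmployeeNationalIDAlternateKey NVARCHAR(15) NULL,                      -- parent business key
    FullName                             NVARCHAR(100) NULL,
    CONSTRAINT FK_DimEmployeeSketch_Parent
        FOREIGN KEY (ParentEmployeeKey) REFERENCES dbo.DimEmployeeSketch (EmployeeKey)
);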

An organizational structure is a great example. An organization typically has common attributes for all the individuals in the organization, such as location, office, salary, and so on. All of these attributes relate to higher levels in the hierarchy. A parent-child enables you to build a hierarchy where members at different levels have common attributes. Secondly, a parent-child dimension also allows the hierarchy to be unbalanced, where not every drill path in the hierarchy goes down to the same level.

The Employee dimension table as shown in the beginning of the chapter (see Figure 4-2) is an example of an organizational structure. Figure 4-33 shows a subset of data and columns within the Employee dimension table.


Figure 4-33: Subset of data and columns within the Employee dimension table

As you can see, some dimension members relate to other dimension members. For example, Dan Bacon reports to Jean Trenary (Dan’s parent employee key is 44, which is the employee key of Jean). At the top of the table, the Chief Executive Officer, Ken Sanchez has no parent key and, therefore, no manager. Every member in the table is a member at a different level in the hierarchy. Taking this subset of data and building the hierarchy for the parent-child relationship turns into the hierarchy shown inFigure 4-34.


Figure 4-34: Building the hierarchy for the parent-child relationship

Note a few points about this parent-child dimension:

  • This is an unbalanced hierarchy. The levels within this dimension don’t all extend to the lowest level (Laura Norman, for example, has no direct reports). In a USA-based geography dimension, for example, everything typically goes to the same state or city level. Within an unbalanced hierarchy, there are levels that don’t cascade all the way down.

  • A parent-child dimension can also be a ragged hierarchy. A ragged hierarchy has holes in the hierarchy; you could be at the top level, skip the next level, and go directly to the level below. You must be aware of these variations when processing ETL.

  • Parent-child dimensions have shared attributes. Most of the records, except at the top level, share common attributes, such as Employee Address.

When processing parent-child dimensions in SSIS, a couple of methods can be applied for acquiring the parent record key.

The first approach is to use a Lookup transformation to acquire the parent record for the parent-child relationship. Figure 4-35 shows the data flow used to process the Employee dimension table.


Figure 4-35: Data flow used to process the Employee dimension table

This data flow has similarities to the product dimension data flow, as the first several transformations are used to prepare the data for the SCD transformation. Note that the third data flow object is a Lookup transformation used to acquire the parent surrogate key. In this case, the Lookup transformation joins the parent employee business key of the source to the matching employee business key from the employee table. Figure 4-36 shows the Columns tab of the Lookup editor.


Figure 4-36: Columns tab of the Lookup editor

The join is made from the parent business key of the source to the business key in the reference table, and the surrogate key that is returned is aliased to match the ParentEmployeeKey.

A second approach involves using an Execute SQL Task in the control flow and performing a set-based update to associate the employee with the employee’s manager record. This SQL Task needs to be run after the data flow that adds new dimension records. The following SQL code is used to perform the operation:


UPDATE dbo.DimEmployee
   SET ParentEmployeeKey = Parent.EmployeeKey
FROM dbo.DimEmployee
INNER JOIN dbo.DimEmployee Parent
      ON DimEmployee.ParentEmployeeNationalIDAlternateKey =
            Parent.EmployeeNationalIDAlternateKey
      AND Parent.EndDate IS NULL
WHERE DimEmployee.ParentEmployeeKey IS NULL

The query could be handled by a self join or a correlated subquery. In this case, a self join is used between the DimEmployee table and itself (aliased as Parent), matching the child employee with the parent manager record. The ParentEmployeeKey is updated with the EmployeeKey of the DimEmployee table joined as the parent table. Also important to note is that the only records affected are those with the ParentEmployeeKey set to NULL, which targets and, therefore, optimizes the updates.

A final approach is to use a combination of a Lookup transformation and an Execute SQL Task. The Lookup transformation approach may fall short of handling all situations if an employee is added at the same time as the assigned manager. In this case, the parent employee key would not yet be available. This combined approach would give better scalability in higher-volume situations by reducing the rows affected by the update.

Date Dimension ETL

The date dimension is probably the most common and conformed dimension in any data warehouse or data mart structure. In other words, most fact tables have relationships to a Date table. In the sample AdventureWorksDW, the date dimension table is called DimTime. Honestly, it’s poorly named. The word Time has connotations of time of day (hour or minute), not the date grain, which is actually what is stored in the DimTime table. DimDate would have been a better name.

A date dimension can have many different attributes, besides the year, month, and date. It might include the following:

  • Day of week

  • Week of year

  • Holiday

  • Fiscal hierarchy

The attributes combine to create different hierarchies. An example of a natural calendar hierarchy would be Year-Quarter-Month-Date or Year-Week-Date. In the natural calendar, weeks do not line up with months (a week can span two months). Therefore, there are two natural calendar hierarchies. Because of this challenge, organizations have come up with different, and often unique, fiscal hierarchies. Your organization may have its own fiscal hierarchy.

A common approach to a custom fiscal hierarchy is to break a quarter into three periods: four weeks in period 1, four weeks in period 2, and five weeks in period 3 (commonly called a 4-4-5 fiscal date hierarchy). A period essentially replaces a month to accommodate the week-month challenge. One organization we worked with had at least 10 different fiscal hierarchies; every branch of the organization wanted to see the sales data from a different perspective.
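
As a worked example, one simple way to derive the 4-4-5 quarter and period from a fiscal week number is shown below; it assumes the fiscal year starts exactly at week 1 and has no leap week:

DECLARE @FiscalWeek INT;
SET @FiscalWeek = 23;   -- week of the fiscal year, 1..52

SELECT FiscalQuarter = ((@FiscalWeek - 1) / 13) + 1,
       FiscalPeriod  = ((@FiscalWeek - 1) / 13) * 3
                       + CASE WHEN ((@FiscalWeek - 1) % 13) + 1 <= 4 THEN 1
                              WHEN ((@FiscalWeek - 1) % 13) + 1 <= 8 THEN 2
                              ELSE 3
                         END;   -- week 23 falls in fiscal quarter 2, period 6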

When it comes to handling the date dimension ETL, the dimension is relatively easy to process because once a new date record is added, it is usually never updated. The Date dimension is commonly not snowflaked into multiple tables.

  Important 

The date dimension is not typically snowflaked even if you have higher-level fact tables and a multi-grained scenario. For example, account balances might be tracked at a week level, whereas sales come in at a day level, and inventory at a month level. The date dimension almost always uses the day for granularity. For a higher-grain fact, you would use the first day of that grain's period as the key. For example, if inventory is at the month level, use the first day of the month as the key for the month.
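
For example (a hedged sketch using SQL Server 2005 era date functions), the week-grain and month-grain keys can be derived by snapping a date back to the first day of its period:

DECLARE @d DATETIME;
SET @d = '20040719';

SELECT FirstOfWeek  = DATEADD(week,  DATEDIFF(week,  0, @d), 0),   -- Monday-based week start
       FirstOfMonth = DATEADD(month, DATEDIFF(month, 0, @d), 0);   -- first day of that month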

There are a few different ways to process date-dimension ETL. The following are some common mechanisms:

  1. Handle the date dimension population through a stored procedure. Whether you’re calculating one date at a time every day or loading a year into the future once a year, a stored procedure can load all of the records and handle all the date logic (a sketch of this approach appears after this list).

  2. Use a data flow with Derived Column transformations and date part expressions. The data flow generates a set of dates, and additional attributes are added using the date functions in SSIS.

  3. Use a For Loop Container to iterate through a range of dates with SQL inserts. The For Loop Container could be configured to increase the date variable by one on every loop and then leverage an Execute SQL Task to manage the inserts, one row at a time.

Any of these approaches is viable. Since the date dimension usually contains a small number of records, this package will process quickly regardless of the approach chosen.
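
As a hedged sketch of the first approach, a stored procedure can simply walk forward from the last loaded date; the dbo.DimDate table and its columns here are hypothetical, not the AdventureWorksDW DimTime schema:

CREATE PROCEDURE dbo.LoadDimDate
    @ThroughDate DATETIME
AS
BEGIN
    DECLARE @d DATETIME;
    SELECT @d = ISNULL(MAX(FullDateAlternateKey), '20010101') FROM dbo.DimDate;

    WHILE @d < @ThroughDate
    BEGIN
        SET @d = DATEADD(day, 1, @d);
        INSERT INTO dbo.DimDate
            (FullDateAlternateKey, CalendarYear, MonthNumberOfYear, DayNumberOfMonth, DayNumberOfWeek)
        VALUES
            (@d, YEAR(@d), MONTH(@d), DAY(@d), DATEPART(weekday, @d));
    END
END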

As an example, if the second approach is taken, the process would involve identifying the first new date and the last date to add, followed by a data flow to handle the inserts. Figure 4-37 shows the control flow of a package with an Execute SQL Task followed by a data flow.


Figure 4-37: Control flow of a package with an Execute SQL Task followed by a data flow

Defined in the package are two user variables, FromDate and ToDate, which are populated by the Execute SQL Task using a single-row resultset. The FromDate is populated by querying the MAX date in the time dimension plus one day, and the ToDate is populated from the MAX date in the sales header table source. As you would guess, the data flow only needs to run if the ToDate is greater than the FromDate. This is handled through the Precedence Constraint after the Execute SQL Task. Figure 4-38 shows the Precedence Constraint Editor.


Figure 4-38: Precedence Constraint Editor
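
The single-row resultset query behind the Execute SQL Task might look something like this (a hedged sketch; the AdventureWorks sales header table is assumed as the source of the upper bound):

SELECT FromDate = DATEADD(day, 1, MAX(t.FullDateAlternateKey)),
       ToDate   = (SELECT MAX(OrderDate)
                   FROM AdventureWorks.Sales.SalesOrderHeader)
FROM dbo.DimTime AS t;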

If the Execute SQL Task is successful and the @ToDate is greater than @FromDate, then the Precedence Constraint will allow the Data Flow Task to execute. At this point, if the criteria are met, then there are rows that need to be inserted into the time dimension table. The data flow, shown in Figure 4-39, contains five components:

  • A Script Component that generates a row for every day, starting with the FromDate and ending with the ToDate

  • A Derived Column transformation that adds the required date attributes

  • Two Lookup transformations to pull in some date translations from a translation table

  • A destination adapter that inserts the rows into the Time dimension table.


Figure 4-39: Five components of the data flow

Since the Script Component can be used as a source (as described in Chapter 2), it is a great candidate to generate rows from the FromDate to the ToDate values. The following code takes the SSIS variables and, using a Do While loop, iterates over the dates from the FromDate to the ToDate, adding a date for every loop:


Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub CreateNewOutputRows()

        ' Start at the FromDate package variable and emit one output row per day
        Dim vFromDate As Date = Me.Variables.FromDate
        Do While vFromDate <= Me.Variables.ToDate
            With NewDatesOutputBuffer
                .AddRow()
                .FullDateAlternateKey = vFromDate
            End With
            ' Advance to the next calendar day
            vFromDate = DateAdd(DateInterval.Day, 1, vFromDate)
        Loop

    End Sub

End Class

The second component to highlight is the Derived Column transformation. Figure 4-40 shows the Derived Column editor, which contains several additional columns using the DATEPART SSIS expression function to pull out common attributes.


Figure 4-40: Derived Column editor

Before the OLE DB Destination, which is just an insert into the DimTime table, two Lookups handle the translation, because the time dimension table supports multiple languages for the month- and day-related attributes.

Overall, the package is straightforward, performs well, and leverages SSIS capabilities.



