This is the sixth day of my participation in the November Gengwen Challenge. See the event details: The Last Gengwen Challenge 2021.

Translation reference

This article is a translation of Level 3 of the Stairway to Integration Services series, "Adding Rows in Incremental Loads". The goal is a comprehensive and clear understanding of SSIS. All content is based on the original text, but due to version differences and my own questions along the way, it is not a strict word-for-word translation. Hopefully this makes SSIS easier to learn and understand.

Thanks for your support!

Introduction

The previous article covered the basic configuration of an SSIS data flow task that loads data. This article covers incremental loading, which makes the SSIS package created in the previous article re-executable (that is, the package can be run repeatedly without loading duplicate rows).

What is incremental loading?

An incremental load only loads data that is different from the previous load.

“Different” includes:

  • New rows
  • Updated rows
  • Deleted rows

An incremental load is, by nature, re-executable: the loader can be run over and over again without causing harm. More than that, re-executable means the loader is designed to run multiple times without doing unnecessary or repeated work on the server.

Next, take a look at the goal of incremental loading:

  1. We want to insert data that is new in the source table but has not yet been loaded into the target table
  2. We want to update any data that has changed since the last time it was loaded into the target table
  3. We want to delete the data in the target table that has been removed from the source table
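As a preview of all three goals, T-SQL's MERGE statement can express the insert, update, and delete cases in a single pass. This is a sketch only, not from the original article (the SSIS package in this level implements only the insert case), and it assumes the source and target schemas shown later:

```sql
-- Sketch only: one-statement equivalent of the three incremental-load goals.
-- This article's SSIS package implements only the INSERT case.
MERGE dbo.FullName AS dest
USING Person.Person AS src
    ON dest.BusinessEntityID = src.BusinessEntityID
WHEN NOT MATCHED BY TARGET THEN                      -- goal 1: new rows
    INSERT (BusinessEntityID, FirstName, MiddleName, LastName)
    VALUES (src.BusinessEntityID, src.FirstName, src.MiddleName, src.LastName)
WHEN MATCHED AND (dest.FirstName <> src.FirstName
               OR ISNULL(dest.MiddleName, N'') <> ISNULL(src.MiddleName, N'')
               OR dest.LastName <> src.LastName) THEN -- goal 2: changed rows
    UPDATE SET FirstName  = src.FirstName,
               MiddleName = src.MiddleName,
               LastName   = src.LastName
WHEN NOT MATCHED BY SOURCE THEN                      -- goal 3: deleted rows
    DELETE;
```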

Incremental loading of new row data

Adding Rows in Incremental Loads

Running (debugging) the FirstSSIS package created in the previous section loads the selected data into the dbo.FullName table, 19,972 rows in total. However, if the SSIS package is executed again, it loads the same 19,972 rows into the FullName table a second time.

The requirement is not to load duplicate data into the target table every time, so how do we achieve the goal of incremental loading?

Check the new data in the source table

Change the target table FullName of the FirstSSIS package

Here we modify the earlier FirstSSIS package. With only FirstName, MiddleName, and LastName, the FullName table cannot uniquely identify a row, so it cannot determine the differences between the source and target tables. A unique-valued (primary key) column must be introduced.

Recreate the FullName table so that it has the same structure as the source table:

CREATE TABLE [dbo].[FullName](
	[BusinessEntityID] [int] PRIMARY KEY,
	[FirstName] [nvarchar](50) NOT NULL,
	[MiddleName] [nvarchar](50),
	[LastName] [nvarchar](50) NOT NULL
)

Simulate source table new data

Next, we simulate new rows. There are two ways: one is to insert data into the source table; the other is to delete some data from the target table, so that the extra rows in the source table become the "new" data. Here we use the second approach.

Delete the rows where MiddleName is NULL from the target table (8,499 rows):

USE AdventureWorks2012;
GO
DELETE dbo.FullName
WHERE MiddleName IS NULL;
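To verify the simulation, a quick count of both tables shows the effect. This check is my addition, assuming the AdventureWorks2012 sample data; the figures come from the article and may differ in other AdventureWorks versions:

```sql
-- Row counts after the delete: Person.Person should still have 19,972 rows,
-- while dbo.FullName should have 19,972 - 8,499 = 11,473.
SELECT (SELECT COUNT(*) FROM Person.Person) AS SourceRows,
       (SELECT COUNT(*) FROM dbo.FullName)  AS TargetRows;
```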

Find data in one table but not in another

The first step in incrementally loading new rows is to detect the new rows. There are several ways to do this:

Method 1: Find rows that are in the source table but not in the target table. T-SQL:

Select BusinessEntityID,
FirstName,
MiddleName,
LastName
From Person.Person
where BusinessEntityID not in (
	select BusinessEntityID from dbo.FullName 
);

Finding rows that are in one table but not in the other requires that both tables can uniquely identify a row; otherwise the comparison is not possible.

Method 2: Use the LEFT JOIN query to find data that is not in the target table

select 
src.BusinessEntityID,
src.FirstName,
src.MiddleName,
src.LastName
FROM Person.Person src
LEFT JOIN dbo.FullName dest on src.BusinessEntityID=dest.BusinessEntityID
WHERE dest.BusinessEntityID IS NULL;

The second method is generally more efficient than the first; the NOT IN subquery can lead to repeated table scans.
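A third option, not covered in the original article, is NOT EXISTS. Unlike NOT IN, it behaves correctly even when the compared column can contain NULLs, and SQL Server typically optimizes it into the same anti-semi-join plan as the LEFT JOIN form:

```sql
-- Alternative new-row detection using NOT EXISTS (safe with NULLable keys).
SELECT src.BusinessEntityID,
       src.FirstName,
       src.MiddleName,
       src.LastName
FROM Person.Person AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.FullName AS dest
    WHERE dest.BusinessEntityID = src.BusinessEntityID
);
```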

Incremental loading of new data is implemented in SSIS package

We need to add a component between the data flow task's OLE DB source and OLE DB destination (FullName) so that only the incremental data is loaded.

Delete the data flow path

Right-click the source and target data flow paths and click Delete.

The Error List then shows "unmapped input column" errors.

Add the “Lookup” transformation component

From the Toolbox, drag the Lookup transformation component between the source and destination adapters.

Drag the source's data flow path onto the Lookup transformation

A Lookup transformation matches the rows flowing into it against another table, view, or query: it goes to that table, view, or query, checks whether a match is found on the specified columns, and if so brings back the requested columns.

Here are a few caveats:

  1. With the default configuration, the Lookup transformation fails if a row finds no match in the lookup table, view, or query.
  2. If there are multiple matches in the lookup table, the lookup transform simply returns the first match.

These caveats are easy to trip over: with the default configuration, the operation fails outright when no match is found; and when joining rows in the data flow to rows in the lookup table, view, or query, the transformation returns only the first matching row.

Configure Lookup

Double-click the Lookup transformation to open the editor. The General page is displayed by default.

  • Cache mode

The first item on the General page is the Cache Mode property.

The cache mode controls when and how the actual lookup operation is performed.

In No Cache mode, a lookup is performed for every row that flows through the transformation: for each row, the transformation queries the lookup table, view, or query and adds any returned values to the row.

In Full Cache mode, the lookup operation attempts to load all rows from the lookup table, view, or query into a lookup cache in RAM before the data flow task executes. "Attempts" because the load can fail if the lookup table, view, or query returns a very large data set, or if the server is short on RAM. The lookup cache holds the values from the configured table, view, or query, and matches found in the cache are added to the rows flowing through the transformation.

What if the Lookup transformation fails to load the cache in Full Cache mode due to RAM limitations? One option is No Cache mode. A second option is Partial Cache mode.

(There are other options beyond the scope discussed in this article)

In Partial Cache mode, the transformation first checks the lookup cache for a match as each row flows through. If there is no match in the cache, a lookup operation is performed against the database; the matching data is added both to the data row and to the lookup cache. If another row with the same match-column value flows through later, the match is served from the cache.

Keep the default, Full Cache, because the current lookup data set is relatively small (19,972 rows). The other modes are worth considering for large data volumes.

  • Handling rows with no match

Next, set "Specify how to handle rows with no matching entries" to "Redirect rows to no match output".

  • Connection

Click the Connection page and set the OLE DB connection manager property to [server-name].AdventureWorks2012, the same connection manager used by the OLE DB source.

Next, choose whether to use a table or the results of an SQL query against the SQL Server instance and database configured in the connection manager. Here, enter the following T-SQL query:

Select BusinessEntityID,
	FirstName,
	MiddleName,
	LastName
From dbo.FullName

  • Columns

Go to the Columns page. At the top are two grids that look like tables. On the left is a list labeled Available Input Columns, containing the columns entering the Lookup transformation's input buffer (the columns connected from the OLE DB source output). The other is Available Lookup Columns, the columns that exist in the table, view, or query (in this case, the query) configured on the Connection page.

  • Set the lookup match condition

Click the BusinessEntityID column in Available Input Columns, and drag it onto the BusinessEntityID column in Available Lookup Columns.

The lookup works like a join. When BusinessEntityID is dropped onto BusinessEntityID, the line that appears between the BusinessEntityID column in Available Input Columns and the BusinessEntityID column in Available Lookup Columns is similar to the ON clause of a join: it defines the match condition that drives the lookup.

The Available Lookup Columns grid has a checkbox next to each column, plus a select-all checkbox. Continuing the join analogy, the checkboxes are the mechanism for adding columns from the joined table to the SELECT clause. No columns are checked here.

We have now configured a Lookup transformation that checks the records in the data flow pipeline against those in the target table. The records in the data flow pipeline come from the OLE DB source adapter, loaded from the Person.Person table. The target table is dbo.FullName, which we access through the T-SQL query on the Lookup transformation's Connection page.

The Lookup transformation is configured to find matches by comparing the BusinessEntityID column values in the target table with the BusinessEntityID column values coming from the source (through the OLE DB source adapter).

The Lookup transformation is configured to send rows with no match to its no-match output. If a match is found between the BusinessEntityID value in the target table and the BusinessEntityID value from the source, the transformation sends that row to the match output.

  • Connect to the target

Next, build incremental loading.

Click the OK button to close the Lookup transformation editor.

Then click the Lookup transformation and drag the green data flow path beneath it to the OLE DB destination (named FullName). When the dialog box appears, select Lookup No Match Output.

  • Review

To review the work done here, there are a number of important points:

Why select Lookup No Match Output? In SSIS, the Lookup transformation provides this built-in output to catch records that are in the source (here, the Person.Person table) but not in the lookup table (here, the dbo.FullName target table). That is the Lookup No Match Output.

Original text: "the Lookup Transformation provides this built-in output to catch records in the Lookup table (the dbo.Contact destination table, in this case) that do not exist in the source (the Person.Contact table) – it's the Lookup No Match Output."

(Note: the rows must exist in the "source", because the data driving the lookup comes from the source.)

Why would there be no match? Because the BusinessEntityID value does not exist in the target table. A row that is in the source table but not in the target table is a new row: one added to the source since the last load. These are exactly the rows we want to load.

  • View the data flow path

Right-click the data flow path between the source and the Lookup transformation, and click Edit.

On the Metadata page of the Data Flow Path Editor, you can see that this path's metadata is identical to that of the path between the Lookup transformation and the destination.

The Lookup transformation's no-match output is an exact copy of its input: when no match is found, all columns of the row are passed through the no-match output for use downstream.

Close the data flow path editor. The data flow task now looks like this.

Run and execute multiple times

By pressing the F5 key or the “Start Debugging” button on the menu, you will see:

Only 8,499 new rows are loaded (the number of rows deleted earlier when simulating new data).

Now re-execute the load using the Restart button next to the Debug button. The SSIS package stops and runs again; this time the Lookup transformation finds a match for every row in the target table, so no rows are sent to the no-match output.
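An optional sanity check after the second run (my addition, not part of the original article): the target row count should match the source, and the new-row query should come back empty:

```sql
-- After re-running the package, both tables should hold 19,972 rows,
-- and the pending-new-rows query should return 0.
SELECT COUNT(*) AS TargetRows FROM dbo.FullName;

SELECT COUNT(*) AS PendingNewRows
FROM Person.Person AS src
LEFT JOIN dbo.FullName AS dest
       ON src.BusinessEntityID = dest.BusinessEntityID
WHERE dest.BusinessEntityID IS NULL;   -- expect 0
```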

Conclusion

Several goals have been accomplished here.

First, we created a loader that adds only new rows from the source table to the target table. Second, the loader is built to be re-executable: it does not stack duplicate copies of rows into the target table.

The SSIS package just built is re-executable (repeatable). Many production SSIS packages contain data flow tasks that perform similar loads. Some scenarios only ever need new rows in an incremental load: for example, a table of historical daily currency exchange rates (the data never changes once recorded), or a table recording daily high temperatures (likewise never updated).

But when the source data is updated frequently, the current incremental load is clearly not flexible enough.