C# Custom CDC

One of the biggest issues with Data warehouse technologies is the Change data capture functionality. Some companies have a structure that is too complicated for the standard SCD functions for any ETL tool. The one element that stands out with SSIS is the ability to use the programming languages C#/VB.NET in the pipeline. This is a major advantage where developers can use the input and either keep the elements synchronous or they can create a new output buffer and make it asynchronous. What we have tried to accomplish in this demonstration is the ability to take in a source file and create variables to specify the conditions like surrogate keys, business keys, and fields that change. We also have conditions and variables set that will go and give the record the criteria of a new record or an updated record. The idea is different because it promotes a bit more work and there are preprocessing tables used to accomplish this method. But with that being said it is fast, very powerful, and can allow you to accomplish a lot more.

The following illustrations show the ETL job that actually looks at 2 tables to do the inserts and updates. We are using 2 dataflows and then a execute sql task to accomplish this.

The first table is the primary table the holds the information for ROL_PLYR. There are only new records that are added table is never updated. So according to the rules with the script the new records coming in will be added to this table. The Second table EXTRNL_ORG is important because it’s where all the detail information is located. This table is also where the slowly changing dimension activity happens. This example gives us the ability to see type 2 slowly changing dimensions. Type 2 slowly changing dimensions allows for the new records to insert into the ROL_PLYR and EXTRNL_ORG giving them all new surrogate keys. If the record comes and it already exists then NO record is added to the ROL_PLYR table. The old record in the EXTRNL_ORG table is closed out with a date minus 1 day from when you post it and the new EXTRNL_ORG record is added. This may seem complicated but let’s dive into the logic of this with more illustrations.

The first dataflow allows us to create our own data capture. The following gives us an illustration of this main dataflow:

The SRC gives us the SQL statement that allows for us to set tone for the change data capture. This portion of the job is where the source and destination are joined in a SQL statement. We want to get our surrogate keys, business, keys, SCD fields to determine changes. This can be a bit tedious at this level because the staring point is in the SQL where we want to identify these terms. Most of the current change data capture and slowly changing dimension components allow for the source and destination streams to be separate and with that component we set all our rules. This scenario is based off of using staging and similar source and destination databases. So the source elements could be revised based on the systems you’re using. The following is an illustration of the SRC SQL that uses a left outer join to get new and existing data.

We now need to look at the 2 areas that give us our logic. We use a derived column and a script component to set up all our variables.

All of these derived columns were checks that were needed for the script component. Based off the rules we check for new records and compare columns of existing records with new ones coming in. Once we have all this set up we know come to the nuts and bolts of the job. The script component is the key element to separating objects and giving us true change data capture.

The idea with the change data capture is we want to identify the new and changed rows. The variables vNew and vChg are set so as we go row by row we can perform some action. One of the most important factors with this design is that we sorted the data by dbo.STG_H_ORG.H_CONT_ID which is the business key or alternate key and dbo.STG_H_ORG.H_CREATE_DT. Which is the create date. This allows them to be in order so that we can process the logic in our script. Without this we will not be able to properly set the change data capture.

We have set a flag on each record as we read them row by row. If they are a new record we set the custom output field Type to ‘H’ which denotes it’s a header record or the first one and will be a new insert. Ive added a condition at the bottom of the function that keeps track of the business key. The following statement allows us to keep a global copy of that in the function and we compare it to the current value of the row.

UniqueSys = Row.HCONTID;

If you look statement if (UniqueSys.ToString() != Row.HCONTID) we can see that once that changes that where we can figure out the header records and also where to change the surrogate key number. This design is looking at a bunch of things to recap. We first want to know what the new and changed reocrds are. We next want to get the last surrogate key value stored in the database so we can continue with the numbers for new records. We need to also set the Business key to a variable to compare with each iteration so that we keep our grouping of changes. That is so important in this desigh because instead of changing thing row by row and sending the changes right away to the database we are actually preprocessing the data then sending it over. This a far different approach to CDC and SCD and is harder to implement but is far faster as we take advantage of the pipleline processing in memory.

Well getting into the details some more if we want to capture changes we would have our condition based of the variables. The following illustration shows this detail:

As you can see we have Type set to ‘C’ and this is telling us that the row is not new but exists and there is a change.

We have shown that there are new records and the type is set to ‘H’ but what about if we have updates to the Header record even though it’s new. Based on the design we keep are business key global checking that it’s the same so after the first record comes in as the remain records with that same business key are considered updates to the header record. So we denote the type as ‘U’.

Once you have the data set up in the script the way you want it then we can set our output streams to be filtered based off the type.

The two conditional splits are for the 2 different tables and so we want to set the tone for what goes where with these filters. For ROL_PLYR we only want the header record so we set the filter Type == “H”. For the EXTRNL_ORG we set the Type == “H” || Type == “C” || Type == “U” for all three filter because we want to take a further dive into the data in the next process of the job. The data is actually going to 2 stage tables called TMP_ROL_PLYR and TMP_EXTRNL_ORG. These are giving us the proper data we want from the filters.

The next step in the process is to End Date all the updates that we preprocessed so that we can leave the last of the updates open. We need to consider that none of these records from the EXTRNL_ORG data are in the target table yet so we have the header and a bunch of updates. They are coming in chronological order so as we add them we need to look at what will be closed out from the previous record.

We first want to read from the TMP_EXTRNL_ORG table and get all the Header and Update records denoted by the Type=”U or “H”.

We then want to run a script but the key here is the SQL is set up first to reverse the order of the data in descending order by effective from date. The reason this is done this way is to give us the last record from date and we set the global variable to hold this value for comparison row by row.

Again we are checking the business key and making sure that it’s a global value and then we are also setting the Endate to be the fromdate of the previous record so we can use this to close out the record. The date is generally one date early from that date.

The final data as it’s filtered through the script component is sent to the TGT_EXTRNL_ORG table.

Lastly we want to then update the target table where the Type was a ‘C” denoted as a change. We want to overwrite the existing target data with the changed data in the TMP_EXTRNL_ORG table.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s