• Post Categories

  • Browse Blogs

  • Blog Stats

    • 633,809 hits
  • Syndications

    SQLServerPedia Contributor

Change Data Capture as a tool for Business Intelligence, Disaster Recovery, Auditing, and more

This is a summary of the Change Data Capture presentation covered in my sessions at SQL Saturday #38 – Jacksonville and SQLSaturday #40 – South Florida.

Presentation Summary
Change Data Capture (CDC) is one of the new features available for SQL Server 2008 Enterprise and Developer Editions, which allows you to easily track and keep history of all DML (Data Manipulation Language) changes on user tables such as INSERTs, DELETEs, and UPDATEs with little or no overhead. Change Data Capture has been used primarily to track changes for incremental loads using SSIS for the ETL process in Data Warehousing and Business Intelligence because it allows you to identify new data and data that changed or was deleted since last incremental load.

Before Change Data Capture in SQL Server 2005 and prior versions, identifying new or changed data required some level of custom programming, triggers or third party tools. These solutions work well but introduce additional overhead to your production systems. This is the main reason why Change Data Capture was introduced in SQL Server 2008; to eliminate overhead while tracking new and changed data.

But Change Data Capture use is not limited as a tool for ETL Process. It can be used for many other purposes where overhead, simplicity and cost might impose a restriction.  Some of these scenarios include:

1) You are a developer adding or modifying existing code and you need to see the difference in the data output on a particular table.
2) You are a DBA and need to audit several tables in a database and need to report which data was changed, what were the new and previous values, who did the change and when was the change made.
3) You were tasked to delete several records on a highly transactional production database and unintentionally deleted records that you were not supposed to with no possibility of rolling back the changes (no explicit transaction) and no possibility of restoring a backup.
4) You are a DBA and need to know which tables and columns are being written to the most in your database.
5) You are a DBA and need to know how many new records are being entered into your database in a period of time and the percentage of change of existing records.

In most cases you would want to know what changed and which were the values before the change.

 One of the most important benefits of this feature is that it allows you to see the data before & after an UPDATE or DELETE statement, which in turn allows you to query and recover data overwritten or deleted quickly without the need of a database restore. It also captures each new row inserted.

CDC accomplishes this task by reading the committed operations from the log file and inserting the changed records in a tracking table that mirrors the source (tracked) table. The records inserted in this tracking table contain both the value before the change and the value after the change along with the metadata associated with the change. The metadata can be queried to identify the type of DML operation as follows:

1 = Delete
2 = Insert
3 = Update (record’s value before update)
4 = Update (record’s value after update)

More concepts are covered in my PowerPoint presentation which you can download here or by clicking slide below.

3 Responses

  1. […] This post was mentioned on Twitter by Jose Chinchilla, Jose Chinchilla. Jose Chinchilla said: New blog: Change Data Capture as a tool for Disaster Recovery, ETL/DW,BI, Auditing, & more. Session summary and slides http://bit.ly/awg0vp […]

  2. Do you have practical examples of CDC being used as ETL? What distances and latencies are you seeing? Thanks!

    • Hi Buck,
      One of the uses for CDC is in our ETL process for population mobility and demographic analysis. Our current custom web application does not keep history of home addresses or other person attributes. This feature was a major request by our Research Department and due to budget, time and resource restrictions making this change in our custom web application was not feasible. Also adding a timestamp or flag column was not preferred due to the size and width of the table among other considerations.

      I decided to enable CDC on our person table and query the corresponding tracked table for rows that have updates and deletions. As part of our business processes several actions for analysis were requested once a home address and other person attributes change such as, household or family composition (single, dual-parents, separation, divorce, etc). Instead of using triggers, CDC provides the flexibility of tracking the columns I am interested in without additional overhead.

      By querying CDC tracking table every night in my ETL process I reduce the number of rows I have to evaluate for changes (i.e. address lookup match and mismatch) and the associated ETL operations that are fired for each one of them.

      Another practical example for CDC used in ETL process is for creating statistics and metrics I provide through SSRS reports for our Contract Management and Research Departments. Each time we load our Data Warehouse I use tracked tables residing in our DW for daily counts of new people receiving services (“new participants entered”), participants no longer receiving services (“participants closed”), percentage of change of key attributes like the ones mentioned before, etc. It also helps me flag individuals for further analysis that had high mobility rates without having to create additional fact tables.

      In terms of latencies, the highest I experienced is 5s in our Production OLTP environment and around 1m in our Data Warehouse after each load of about 200,000 records.

      I’m in the process of creating a couple of videos exemplifying the use of CDC and some of its uses and benefits and hope to have them completed in a couple of days.

Leave a comment