INTRODUCTION TO SQL SERVER DATA QUALITY SERVICES

November 11, 2014

INTRODUCTION TO SQL SERVER DATA QUALITY SERVICES

SQL SERVER DATA QUALITY SERVICES

Before we begin with Data Quality Services, I think we should be aware of what is Enterprise Information Management System. EIM (Enterprise Information Management) The set of capabilities enabling the enterprise to get the right data to the right consumers, reliably, efficiently & with high confidence. It specializes in finding solutions for optimal use of information within organizations, for instance to support decision-making processes or day-to-day operations that require the availability of knowledge. It tries to overcome traditional IT-related barriers to managing information at an enterprise level. Enterprise information management combines Enterprise Content Management (ECM), Business Process Management (BPM), Customer Experience Management (CEM), and Business Intelligence (BI). Enterprise information management takes these two approaches to managing information one step further, in that it approaches information management from an enterprise perspective. Where BI and ECM respectively manage structured and unstructured information, EIM does not make this "technical" distinction. It approaches the management of information from the perspective of enterprise information strategy, based on the needs of information workers. ECM and BI in a sense choose a denominationalised approach, since they only cover part of the information within an organization. This results in a lack of available information during decision-making processes, market analysis, or procedure definition

Enterprise Information Management in SQL Server

Figure 1 : EIM for SQL Server

The Enterprise Management System in SQL Server is the combination of DQS, MDS and SSIS.
SQL Server Integration Services (SSIS) provides a powerful, extensible platform for integrating data from various sources in a comprehensive extract, transform, and load (ETL) solution that supports business workflows, a data warehouse, or master data management. See Integration Services Overview topic for a quick overview and typical uses of SSIS.

SQL Server Data Quality Services (DQS) enables you to cleanse, match, standardize, and enrich data, so you can deliver trusted information for business intelligence, a data warehouse, and transaction processing workloads. See Introducing Data Quality Services topic for the business need for DQS and how DQS answers the need.

SQL Server Master Data Services (MDS) provides a central data hub that ensures that the integrity of information and consistency of data is constant across different applications. See Master Data Services Overview topic for brief descriptions of important features of MDS.

Data Quality Services

Data Quality Services (DQS) in SQL Server 2012 is a new feature used for improving and maintaining the quality of your data across enterprise.

The idea behind this concept is that there is a big difference between what’s invalid according to a computer system and what’s invalid according to common sense. Some system might have a field for a person’s age and the system thinks a value is valid if the age is given in numerical form as an integer. The system wouldn’t have a problem with an age listed as 222. A human being looking at this would spot this as an anomaly and guess that the person entering the age mistyped 22. Another example could be counties, stored as string values. A human being would know that Nottinghamshire is a valid county but Narnia isn’t. An automated system wouldn’t spot the mistake.

This is where a knowledge-driven solution comes in. The idea is to take these pieces of knowledge that we think of as common sense and store them in such a way that the business applications and databases can check values against a knowledge base. This knowledge base might include a list of counties so you can have the systems check all values entered in the County field of a database and look for anomalies.

You then have various rules for how these anomalies are treated. You might create automatic rules, for example to change the abbreviation Notts to the full Nottinghamshire so that you can have consistent values without needing a human being to make any changes. You might also have people, known as data stewards, who are alerted to suspected anomalies so that they can check whether values need to be fixed.

A data quality project tends to have two elements to it. One is an initial fix to clean up bad data. This is known as a data cleansing project. As the name implies, the end goal is to have a set of clean data. The tools look through the data, transforming values to match a standard, flagging outlying values that might be anomalies and suggesting changes that could be made. It also hunts for possible duplicates through data matching, applying policies to look for entries in the database that might refer to the same thing. For example, the entries for Jess Meats and Jessica Meats might be flagged up as a possible match. After the cleansing and matching, the output is delivered as a set of data that has been cured of as many inaccuracies as possible.

The second part of a data quality project is what happens next to keep the data clean. As with Master Data Management, this isn’t a fix-once act. It’s very easy for data quality issues to creep back in after the cleansing has taken place so an implementation of Data Quality Services needs to bear in mind what should happen next. The processes and policies need to be defined to ensure that the data quality knowledgebase is used in future to maintain the quality of the data. It’s also important to identify the data stewards who will be responsible for fixing any problems the knowledgebase flags.

It’s also important to think of the knowledgebase as an on-going project. Things change, Data changes. The knowledgebase should therefore also change. The set of knowledge and rules within the knowledgebase can grow over time, bringing more control and accuracy to your data. As more data passing through the knowledgebase, it becomes more tuned to picking out anomalies and better at identifying what the correct value should be.

A Data Quality Services project should include both the plan for how to clean the data initially and how to maintain quality moving forward.

Figure 2: Requirements for Data Quality Solutions

Features of DQS

Data quality is not defined in absolute terms. It depends upon whether data is appropriate for the purpose for which it is intended. DQS identifies potentially incorrect data, and provides you with an assessment of the likelihood that the data is in fact incorrect. DQS provides you with a semantic understanding of the data so you can decide its appropriateness. DQS enables you to resolve issues involving incompleteness, lack of conformity, inconsistency, inaccuracy, invalidity, and data duplication.

DQS provides the following features to resolve data quality issues.

Data Cleansing: the modification, removal, or enrichment of data that is incorrect or incomplete, using both computer-assisted and interactive processes. For more information, see Data Cleansing.

Matching: the identification of semantic duplicates in a rules-based process that enables you to determine what constitutes a match and performs de-duplication. For more information, see Data Matching.
Reference Data Services: verification of the quality of your data using the services of a reference data provider. You can use reference data services from Windows Azure Marketplace Data Market to easily cleanse, validate, match, and enrich data. For more information, see Reference Data Services in DQS.

Profiling: the analysis of a data source to provide insight into the quality of the data at every stage in the knowledge discovery, domain management, matching, and data cleansing processes. Profiling is a powerful tool in a DQS data quality solution. You can create a data quality solution in which profiling is just as important as knowledge management, matching, or data cleansing. For more information, see Data Profiling and Notifications in DQS.

Monitoring: the tracking and determination of the state of data quality activities. Monitoring enables you to verify that your data quality solution is doing what it was designed to do. For more information, see DQS Administration.

Knowledge Base: Data Quality Services is a knowledge-driven solution that analyzes data based upon knowledge that you build with DQS. This enables you to create data quality processes that continually enhances the knowledge about your data and in so doing, continually improves the quality of your data.

Figure 3: Illustration of DQS Process

Knowledge Base in DQS

The DQS knowledge base is a repository of three types of knowledge: out-of-the-box knowledge, knowledge generated by Data Quality Server, and knowledge generated by the user. DQS enables you to store knowledge about your data in the knowledge base, add business rules and modify the knowledge as you see fit, and then apply it to test the integrity and correctness of the data. After you build the knowledge base, you can continuously improve it, and then reuse it in multiple data-quality improvement processes.

Knowledge in a knowledge base identifies potentially incorrect data and proposes changes to the data. It can find data matches, enabling you to perform data deduplication. It can compare source data with cloud-based reference data maintained and guaranteed by data quality providers. The data steward or IT professional verifies both the knowledge in the knowledge base and the changes to be made to the data, and executes the cleansing, deduplication, and reference data services.

A knowledge base stores all the knowledge related to a specific type of data source. For example, you could maintain one knowledge base for a customer database and another knowledge base for an employee database. Knowledge is contained in one or more data domains, each of which is a semantic representation of a type of data in a data field. A knowledge base for a customer database may have domains for company names, addresses, contacts, contact information, and so on. A domain contains a list of trusted values, invalid values, and erroneous data. Domain knowledge includes synonym associations, term relationships, validation and business rules, and matching policies. Armed with this knowledge, the data steward can make an informed decision about whether to correct specific instances of the values in a domain.