INTRODUCTION TO SQL SERVER DATA QUALITY SERVICES
SQL SERVER DATA QUALITY SERVICES
Before we begin with Data Quality Services, I think we should be aware of what is Enterprise Information Management System. EIM (Enterprise Information Management) The set of capabilities enabling the enterprise to get the right data to the right consumers, reliably, efficiently & with high confidence. It specializes in finding solutions for optimal use of information within organizations, for instance to support decision-making processes or day-to-day operations that require the availability of knowledge. It tries to overcome traditional IT-related barriers to managing information at an enterprise level. Enterprise information management combines Enterprise Content Management (ECM), Business Process Management (BPM), Customer Experience Management (CEM), and Business Intelligence (BI). Enterprise information management takes these two approaches to managing information one step further, in that it approaches information management from an enterprise perspective. Where BI and ECM respectively manage structured and unstructured information, EIM does not make this "technical" distinction. It approaches the management of information from the perspective of enterprise information strategy, based on the needs of information workers. ECM and BI in a sense choose a denominationalised approach, since they only cover part of the information within an organization. This results in a lack of available information during decision-making processes, market analysis, or procedure definition
Enterprise Information Management in SQL Server
![]() |
|
Figure 1 : EIM for SQL
Server
|
The Enterprise Management System in SQL Server is the combination of DQS, MDS and SSIS.
SQL Server Integration Services (SSIS) provides a powerful, extensible platform for integrating data from various sources in a comprehensive extract, transform, and load (ETL) solution that supports business workflows, a data warehouse, or master data management. See Integration Services Overview topic for a quick overview and typical uses of SSIS.
SQL Server Data Quality Services (DQS) enables you to cleanse, match, standardize, and enrich data, so you can deliver trusted information for business intelligence, a data warehouse, and transaction processing workloads. See Introducing Data Quality Services topic for the business need for DQS and how DQS answers the need.
SQL Server Master Data Services (MDS) provides a central data hub that ensures that the integrity of information and consistency of data is constant across different applications. See Master Data Services Overview topic for brief descriptions of important features of MDS.
Data Quality Services
Data Quality Services (DQS) in SQL Server 2012 is a new feature used for improving and maintaining the quality of your data across enterprise.
The idea behind this concept is that there is a big difference between what’s invalid according to a computer system and what’s invalid according to common sense. Some system might have a field for a person’s age and the system thinks a value is valid if the age is given in numerical form as an integer. The system wouldn’t have a problem with an age listed as 222. A human being looking at this would spot this as an anomaly and guess that the person entering the age mistyped 22. Another example could be counties, stored as string values. A human being would know that Nottinghamshire is a valid county but Narnia isn’t. An automated system wouldn’t spot the mistake.
The idea behind this concept is that there is a big difference between what’s invalid according to a computer system and what’s invalid according to common sense. Some system might have a field for a person’s age and the system thinks a value is valid if the age is given in numerical form as an integer. The system wouldn’t have a problem with an age listed as 222. A human being looking at this would spot this as an anomaly and guess that the person entering the age mistyped 22. Another example could be counties, stored as string values. A human being would know that Nottinghamshire is a valid county but Narnia isn’t. An automated system wouldn’t spot the mistake.
This
is where a knowledge-driven solution comes in. The idea is to take these pieces
of knowledge that we think of as common sense and store them in such a way that
the business applications and databases can check values against a knowledge
base. This knowledge base might include a list of counties so you can have the
systems check all values entered in the County field of a database and look for
anomalies.
You
then have various rules for how these anomalies are treated. You might create
automatic rules, for example to change the abbreviation Notts to the full
Nottinghamshire so that you can have consistent values without needing a human
being to make any changes. You might also have people, known as data stewards,
who are alerted to suspected anomalies so that they can check whether values
need to be fixed.
A
data quality project tends to have two elements to it. One is an initial fix to
clean up bad data. This is known as a data cleansing project. As the name
implies, the end goal is to have a set of clean data. The tools look through
the data, transforming values to match a standard, flagging outlying values
that might be anomalies and suggesting changes that could be made. It also
hunts for possible duplicates through data matching, applying policies to look
for entries in the database that might refer to the same thing. For example,
the entries for Jess Meats and Jessica Meats might be flagged up as a possible
match. After the cleansing and matching, the output is delivered as a set of
data that has been cured of as many inaccuracies as possible.
The
second part of a data quality project is what happens next to keep the data
clean. As with Master Data Management, this isn’t a fix-once act. It’s very
easy for data quality issues to creep back in after the cleansing has taken
place so an implementation of Data Quality Services needs to bear in mind what
should happen next. The processes and policies need to be defined to ensure
that the data quality knowledgebase is used in future to maintain the quality
of the data. It’s also important to identify the data stewards who will be
responsible for fixing any problems the knowledgebase flags.
It’s
also important to think of the knowledgebase as an on-going project. Things
change, Data changes. The knowledgebase should therefore also change. The set
of knowledge and rules within the knowledgebase can grow over time, bringing
more control and accuracy to your data. As more data passing through the
knowledgebase, it becomes more tuned to picking out anomalies and better at
identifying what the correct value should be.
A
Data Quality Services project should include both the plan for how to clean the
data initially and how to maintain quality moving forward.
Features of DQS
Data quality is not defined in
absolute terms. It depends upon whether data is appropriate for the purpose for
which it is intended. DQS identifies potentially incorrect data, and provides
you with an assessment of the likelihood that the data is in fact incorrect.
DQS provides you with a semantic understanding of the data so you can decide
its appropriateness. DQS enables you to resolve issues involving
incompleteness, lack of conformity, inconsistency, inaccuracy, invalidity, and
data duplication.
DQS provides the following
features to resolve data quality issues.
Matching: the identification of semantic duplicates in a rules-based process that enables you to determine what constitutes a match and performs de-duplication. For more information, see Data Matching.
Reference Data Services: verification of the quality of your data using the services of a reference data provider. You can use reference data services from Windows Azure Marketplace Data Market to easily cleanse, validate, match, and enrich data. For more information, see Reference Data Services in DQS.
Profiling: the analysis of a data source to provide insight into the quality of the data at every stage in the knowledge discovery, domain management, matching, and data cleansing processes. Profiling is a powerful tool in a DQS data quality solution. You can create a data quality solution in which profiling is just as important as knowledge management, matching, or data cleansing. For more information, see Data Profiling and Notifications in DQS.
Monitoring: the tracking and determination of the state of data quality activities. Monitoring enables you to verify that your data quality solution is doing what it was designed to do. For more information, see DQS Administration.
Knowledge Base: Data Quality Services is a knowledge-driven solution that analyzes data based upon knowledge that you build with DQS. This enables you to create data quality processes that continually enhances the knowledge about your data and in so doing, continually improves the quality of your data.
![]() |
|
Figure 3: Illustration of DQS Process
|
Knowledge Base in DQS
The
DQS knowledge base is a repository of three types of knowledge: out-of-the-box
knowledge, knowledge generated by Data Quality Server, and knowledge generated
by the user. DQS enables you to store knowledge about your data in the
knowledge base, add business rules and modify the knowledge as you see fit, and
then apply it to test the integrity and correctness of the data. After you
build the knowledge base, you can continuously improve it, and then reuse it in
multiple data-quality improvement processes.
Knowledge
in a knowledge base identifies potentially incorrect data and proposes changes
to the data. It can find data matches, enabling you to perform data
deduplication. It can compare source data with cloud-based reference data
maintained and guaranteed by data quality providers. The data steward or IT
professional verifies both the knowledge in the knowledge base and the changes
to be made to the data, and executes the cleansing, deduplication, and
reference data services.
A
knowledge base stores all the knowledge related to a specific type of data
source. For example, you could maintain one knowledge base for a customer
database and another knowledge base for an employee database. Knowledge is
contained in one or more data domains, each of which is a semantic
representation of a type of data in a data field. A knowledge base for a
customer database may have domains for company names, addresses, contacts,
contact information, and so on. A domain contains a list of trusted values,
invalid values, and erroneous data. Domain knowledge includes synonym
associations, term relationships, validation and business rules, and matching
policies. Armed with this knowledge, the data steward can make an informed
decision about whether to correct specific instances of the values in a domain.
![]() |
|
Figure 4: Data Quality Knowledge Base (DQKB)
|
How
to build a DQ Knowledge Base?
· Define your data quality needs
· Create domains/composite domains
· Discover and import knowledge
· Define rules & policies for data validation, cleansing and matching
· Create domains/composite domains
· Discover and import knowledge
· Define rules & policies for data validation, cleansing and matching
Knowledge
Sources
With this we have covered the basics of the Data quality Services in SQL Server. In the coming blogs we will be dealing "how to work with DQS?".











Comments
Post a Comment