What is Data Quality?

Data quality can be defined as the degree to which data is accurate, relevant, complete, and readily usable for further processing. It can also be defined as the availability of the right data in the right place at the right time. Data quality reflects the foresight and planning applied when designing and implementing any data-capturing application or system. It is not just a preventive measure but also a corrective course of action for legacy systems and their data problems. Data quality efforts have shown strong benefits in improving customer names and addresses, which are among the most valued data in an organization and the basis for CRM and similar initiatives. However, data quality is a generic concept that should not be restricted to customer name and address data; it should be an integral part of any data-centric IT application development, such as ERP, CRM, SCM, and other marketing and sales automation programs.
Data quality has two distinct aspects: one involves the objective "correctness" of data, such as accuracy and consistency; the other involves the appropriateness of data for an intended purpose. As data volumes grow, large companies face the challenge of ensuring that the vast amounts of heterogeneous, complex data they rely on for decision support are accurate, complete, and relevant. This is where data quality (DQ) tools come into the picture. DQ tools are required to cleanse all types of data, from legacy to operational. The ideal situation is an ETL tool seamlessly integrated with a specialized data-cleansing tool, which allows an end-to-end ETL operation combined with data cleansing, as sketched below.
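To make the point concrete, here is a minimal, hypothetical sketch of an extract-cleanse-load flow in Python; the file names, field names, and cleansing rules (whitespace trimming and gender-code normalization) are assumptions for illustration only, not a reference to any particular ETL or cleansing product.

    # A minimal sketch of an extract-transform-load flow with an embedded
    # cleansing step. The source file, field names, and rules are hypothetical.
    import csv

    def extract(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def cleanse(record):
        # Trim whitespace and normalise the gender code as examples of
        # simple, rule-based cleansing applied in line with the ETL flow.
        record = {k: v.strip() for k, v in record.items()}
        gender = record.get("gender", "").lower()
        record["gender"] = {"male": "M", "female": "F"}.get(gender, record.get("gender", ""))
        return record

    def load(records, path):
        records = list(records)
        if not records:
            return
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)

    load((cleanse(r) for r in extract("customers_source.csv")), "customers_clean.csv")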

Different data, different problems

There are many types of data, each with different uses and typical quality problems:

  • Federated data
  • High dimensional data
  • Descriptive data
  • Longitudinal data
  • Streaming data
  • Web (scraped) data
  • Numeric vs. categorical vs. text data

There are many uses of data:

  • Operations
  • Aggregate analysis
  • Customer relations
  • Data interpretation: the data is not useful if we do not know all the rules behind it
  • Data suitability: can we get the answer from the available data?
  • Use of proxy data
  • Relevant data is missing

Data Quality Problems and Constraints

  • Many data quality problems can be captured by static constraints based on the schema: nulls not allowed, field domains, foreign key constraints, etc.
  • Many others are due to problems in workflow and can be captured by dynamic constraints, e.g., orders above $200 are processed by Biller 2.
  • The constraints follow an 80-20 rule: a few constraints capture most cases, while thousands of constraints are needed to capture the last few.
  • Constraints are measurable, so they can be tracked using data quality metrics. A short sketch of both static and dynamic checks follows this list.
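The following sketch shows how such constraints might be checked in code. The field names, the status domain, the customer reference set, and the Biller 2 rule are modelled on the examples above and are assumptions, not a real schema.

    # Illustrative static and dynamic (workflow) constraint checks.
    def check_order(order, valid_customer_ids):
        errors = []
        # Static constraint: nulls not allowed
        if order.get("order_id") in (None, ""):
            errors.append("order_id must not be null")
        # Static constraint: field domain
        if order.get("status") not in {"OPEN", "SHIPPED", "CANCELLED"}:
            errors.append(f"invalid status: {order.get('status')}")
        # Static constraint: foreign key to the customer table
        if order.get("customer_id") not in valid_customer_ids:
            errors.append(f"unknown customer_id: {order.get('customer_id')}")
        # Dynamic (workflow) constraint: orders above $200 go to Biller 2
        if order.get("amount", 0) > 200 and order.get("biller") != "Biller 2":
            errors.append("orders above $200 must be processed by Biller 2")
        return errors

    print(check_order({"order_id": "O-1", "status": "OPEN", "customer_id": "C-9",
                       "amount": 250, "biller": "Biller 1"},
                      valid_customer_ids={"C-1", "C-2"}))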

Data Collection and Usage Process

  • Data gathering
  • Data delivery
  • Data storage
  • Data integration
  • Data retrieval
  • Data mining/analysis

Data Quality Management

Data quality is one of the main reasons for success when moving data from legacy systems to new applications. A well-functioning data quality program simplifies the defect identification and resolution process, generates confidence in the business, and provides for a smoothly reconciled conversion process. The broad categories of data quality issues we have identified are:

  • Fixing data constraint failures
  • Discovering and structuring undocumented information
  • Fixing data that is not “fit-for-use”
  • Issues of standard data format

Characteristics of Data Quality

Data quality rules should be organized within well-defined data quality dimensions, which improves the underlying structure of the data. Defining data quality rules segregated by dimension enables the governance of data quality management. Data quality tools can be used to determine minimum thresholds for meeting business expectations and to monitor whether measured levels of quality meet or exceed those expectations, which in turn provides insight into the root causes preventing quality from meeting them. Dimensions are usually categorized by the contexts in which metrics are associated with business rules, such as measuring the quality of data associated with data values, data models, data presentation, and conformance with governance policies.
Data quality metrics are tools used to evaluate data quality and help meet its objectives. They are management tools that give a quick representation of the expectations placed on the data and the value that can be derived from it, and they provide measurable indicators for the data quality program.
Below are the dimensions commonly used in data quality metrics; a short computation sketch follows this list.

  • Uniqueness: This refers to the requirement that entities are represented uniquely within the relevant application architectures, i.e., no entity exists more than once within the data set. Where uniqueness is expected, new data instances should not be created if a record already exists for that entity.
  • Accuracy: Data accuracy is the degree to which data correctly represents the real-world objects it is intended to describe. In many cases, accuracy is measured by how well the values agree with an identified source of correct information. For example, an accuracy rule might specify that, for transportation providers, the driver's license details must be accurate in the application system. If that data is available as a reference data set, an automated approach should be used to validate accuracy rather than a manual process. Accuracy is the extent to which data is correct and lies within the domain of acceptable values.
  • Consistency: Consistency refers to data values in one set being consistent with those in another. If gender is recorded as 'M' for male and 'F' for female, that convention should be consistent across the application. A strict definition of consistency specifies that two data values drawn from separate data sets must not conflict with each other. Consistency may be defined within different contexts:

    • Record level consistency
    • Cross-record consistency
    • Temporal consistency
  • Completeness: Completeness means that all required attributes in a data set are assigned values and none are left empty. Completeness rules can be assigned to a data set at three levels of constraint:
    • Mandatory attributes that require a value
    • Optional attributes, which may have a value based on some set of conditions, and
    • Inapplicable attributes (such as maiden name for a single male), which may not have a value.

    Completeness may also be seen as encompassing the usability and appropriateness of data values. It is the extent to which data is available, not missing, and sufficient for a particular task.

  • Timeliness: Timeliness refers to the time expectation for the accessibility and availability of information. It can be measured as the time between when information is needed and when it is available for use.
  • Currency: Currency refers to the degree to which information is current with the world it models. Currency measures how up to date information is and whether it remains correct despite time-related changes. Data currency may be measured as a function of the expected frequency at which different data elements should be refreshed, as well as by verifying that the data is up to date; this may require both automated and manual processes. Currency rules may be defined to assert the "lifetime" of a data value before it needs to be checked and possibly refreshed. Currency is the extent to which data is up to date and will continue to be updated.
  • Conformance: Conformance means that instances of data are stored, exchanged, and represented in a format consistent with the definition of the attribute. Each column has numerous metadata attributes associated with it: its data type, precision, format patterns, use of a predefined enumeration of values, domain ranges, underlying storage formats, etc.
  • Referential integrity: Assigning unique identifiers to objects within your environment simplifies the management of data but introduces a new expectation: any time an object identifier is used as a foreign key within a data set to refer to the core representation, that core representation must actually exist. More formally, this is referred to as referential integrity, and rules associated with it are often manifested as constraints against duplication.
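As a rough illustration of turning these dimensions into measurable metrics, the following sketch computes uniqueness, completeness, and domain conformance over a tiny, made-up set of records; the fields and sample values are assumptions for illustration.

    # Illustrative metric computations over a small, hypothetical data set.
    records = [
        {"id": 1, "gender": "M", "email": "a@example.com"},
        {"id": 2, "gender": "F", "email": ""},
        {"id": 2, "gender": "X", "email": "c@example.com"},  # duplicate id, bad code
    ]
    total = len(records)

    # Uniqueness: share of records whose identifier appears exactly once
    ids = [r["id"] for r in records]
    uniqueness = sum(1 for i in ids if ids.count(i) == 1) / total

    # Completeness: share of records with a non-empty mandatory attribute
    completeness = sum(1 for r in records if r["email"]) / total

    # Conformance: share of records whose gender code is within the domain
    conformance = sum(1 for r in records if r["gender"] in {"M", "F"}) / total

    print(f"uniqueness={uniqueness:.0%} completeness={completeness:.0%} "
          f"conformance={conformance:.0%}")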

How Technology Supports Data Quality

Any framework implemented for the betterment of the organization should be supported by technology that can assess and discover data quality issues. Putting data quality into practice means using defined rules to distinguish between correct and incorrect data, cleansing data where possible, and measuring and reporting any violations of those rules.

Assessment

Assessment is the part of the process that refines data quality rules, detects issues early as a proactive measure, and establishes the relationship between recognized data errors and their business impacts. It involves analysis and discovery: an objective review of the data values populating data sets through quantitative measures and analyst review. Data profiling combines algorithms for statistical analysis and assessment of the quality of data values within a data set with exploration of the relationships that exist between value collections within and across data sets. For each column in a table, a data profiling tool provides a frequency distribution of its values, offering insight into the type and use of each column (see the sketch below). Further cross-column analysis can identify embedded value dependencies that represent foreign key relationships between entities.
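A small profiling sketch, assuming pandas is available and a hypothetical customers_source.csv input; it prints per-column null counts, distinct counts, and a frequency distribution of the most common values, which is the kind of output a profiling tool produces.

    # A minimal column-profiling pass over a hypothetical source file.
    import pandas as pd

    df = pd.read_csv("customers_source.csv")

    for column in df.columns:
        print(f"--- {column} ---")
        print("nulls:", df[column].isna().sum())
        print("distinct values:", df[column].nunique())
        # Frequency distribution of the most common values in this column
        print(df[column].value_counts().head(5))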

Definition

Once the analysis is complete, careful examination of the results helps identify the discrepancies that exist within the data sets as well as business rules embedded within the data. The final result is a set of data rules, each of which can be categorized within the framework. This helps refine the data quality rules that will be implemented within the software.

Validation and Cleansing

Validation checks what must be true about the data and confirms whether the data meets the expectations of the business rules. Both data transformation and data profiling products allow IT managers to define validation rules that can be tested against a large set of data instances. For example, having determined through profiling that the values within a specific column should fall within the range "A" to "F", one can specify a rule asserting that all values must be greater than or equal to "A" and less than or equal to "F". The next time data is streamed through the data quality tool, the rule is applied to verify that each value falls within the specified range, and the tool tracks the number of times a value does not. A minimal version of such a rule is sketched below.
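A minimal sketch of the range rule described above, with illustrative column values; real tools would also log which rows violated the rule and when.

    # Count and return values that fall outside an allowed range.
    def validate_range(values, low="A", high="F"):
        violations = [v for v in values if not (low <= v <= high)]
        return len(violations), violations

    count, bad = validate_range(["A", "C", "F", "G"])
    print(f"{count} value(s) fall outside the A-F range: {bad}")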

Monitor and Manage Ongoing Quality of Data

A data quality metrics tool should be able to collect the statistics associated with data quality, report them in a fashion that enables action to be taken, and provide historical tracking of improvement over time (a small sketch follows). Vendors offer different ways to analyze the results of monitoring, so that captured data quality metrics can be presented to the user for analysis of business impact.
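One simple way to keep the historical record described above is to append each measurement to a history file; the file name and metric name here are hypothetical.

    # Append a dated data quality measurement to a flat history file
    # so that improvement can be tracked across runs.
    import csv
    import datetime

    def record_metric(name, value, path="dq_metric_history.csv"):
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([datetime.date.today().isoformat(), name, value])

    record_metric("customer_email_completeness", 0.94)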

How Are Data Quality Checks Performed?

During the testing process, a number of business and cleansing rules are applied to the data, and the data is checked for quality, accuracy, and completeness. The following methods are used for testing the data.

  • Record Count: The number of records extracted from the source application must match the number of records going to the target system plus the number of records going to error log files.
  • Check Sum: Totals of key field values must match when data moves from one stage to another (a reconciliation sketch follows this list).
  • Business Rules Checking: Prepare test plans to ensure all business rules have been correctly implemented in the migration system. Test data should cover all positive and negative scenarios.
  • Data records may be rejected during the ETL process due to quality issues or incorrect or incomplete data.
  • The following is an indicative list of data quality/assessment reports at various stages:
    • Column Profile – By Field: Column profiling results sorted by field name.
    • Column Profile – By File: Column profiling results sorted by file name.
    • Value Frequency: Shows the number of times that specific values appear in a field.
    • Data Anomalies: Pie chart summary where each slice is a count of fields of a specific anomaly type.
    • Missing Chart: 3D bar chart summarizing the number of nulls, blanks, or zeros in each field of a table. Details for each table/file can be shown in a separate chart.
    • Missing Data: Summary report for each field by table, with the count and percentage of null, blank, and zero values for the field.
    • Null Rule Exceptions: List of columns with counts of null, zero, or blank values.
    • Patterns for Attribute: Bar chart displaying the inferred patterns for each column.
    • Percent of Exception Rows: Bar chart of the number of rows that meet the criteria for a data rule. A separate bar can be created for each run of the rule based on the date.
    • Pattern Exceptions: Bar chart displaying the inferred patterns for each column when the pattern percentage meets the minimum selected in the parameter.
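The record-count and check-sum tests above can be reduced to a small reconciliation sketch like the following; the counts and the key field being totalled are illustrative.

    # Record count reconciliation: extracted = loaded + rejected.
    def reconcile(source_count, target_count, error_count):
        return source_count == target_count + error_count

    # Check sum: totals of a key field must match across stages
    # (within a small tolerance for floating-point amounts).
    def checksum_matches(source_amounts, target_amounts, tolerance=0.01):
        return abs(sum(source_amounts) - sum(target_amounts)) <= tolerance

    print(reconcile(source_count=10_000, target_count=9_950, error_count=50))  # True
    print(checksum_matches([100.0, 250.5], [100.0, 250.5]))                    # True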

Value Added through Data Quality Checks

Gartner Group estimates that more than 50 percent of business intelligence and customer relationship management deployments have suffered limited acceptance, if not outright failure, due to lack of attention to data quality issues. Data-driven applications fail to pay off when the underlying data is of poor quality and unfit as input. This forces us to think about data quality before defining the expected returns from data-driven applications. The business value of data quality can be understood through the following attributes of data.

  • Data confidence: Good-quality data leads to good-quality information for decision-making and operations, which in turn generates confidence in decision-making and strategy formulation and lessens the data skepticism often observed around data-driven decisions and strategies in an organization.
  • Critical data requires the best quality: Some product lines or processes in an organization are critical and operate with a low margin for error; data errors in such areas often lead to losses or breakdown of services. A catalogue mail campaign will suffer heavily if the accuracy of its customers' contact addresses is not high, and cleansed, standardized data means reduced waste in postal, email, and telephone campaigns. A recent data quality report found that roughly 15% of the data in a typical US customer database is inaccurate, as confirmed by national data audits; avoiding the expense incurred by poor data is a net saving.
  • Better integration: Marketing and sales managers rely on tools such as market basket analysis, segmentation, and customer profiling to increase sales leads. These analyses are useful only if customers can be unified in a 360-degree view across departments. If a company fails to integrate and link customer records and cannot classify customers as prospective or low-yield, its expensive marketing campaigns and endeavors will fall flat.
  • Good input for sophisticated applications: Business managers increasingly use data-driven analytic applications to understand their customers and business processes. Sophisticated applications such as data warehouses, CRM, business intelligence, marketing and sales automation, and click-stream analysis demand high-quality input data. The effect of bad data is amplified downstream: inaccurate input decreases the reliability of the analyses.
  • Improved customer service: Good data ensures better customer satisfaction, with fewer errors in pre- and post-sales services.

Some major industries that could benefit from Data Quality (DQ) are:

  • Banking and Finance
  • Government
  • Health and Community Services
  • Insurance
  • Tax Systems
  • Law Enforcement
  • Telecommunications
  • Retail
  • Intelligence and Security
  • Manufacturing

The importance of data quality is further highlighted by the criticality of data in the various departments of an organization; any anomaly in their base data may have detrimental effects of high magnitude.

Data Quality Process Model

There are multiple data quality process models adopted by different organizations. Below is a generic process model, along with definitions of the phases that should be followed during a data quality implementation.

  • Data Quality Definition: Data quality definition is the first phase of the process model. This is the planning phase and consists of activities such as identifying the need for a data quality initiative; choosing the improvement approach (a purchased tool-based application or in-house development); defining the requirements; setting data quality expectations, goals, and objectives; estimating ROI; conducting knowledge transfer for the data quality tool if required; defining data quality metrics to monitor and assess data quality levels before and after implementation in the data sources and destinations; and identifying the subject data to be cleansed and enriched.
  • Data Profiling: Data profiling refers to describing the outline of an object, or producing a formal summary of an analysis of the data, which can be presented as a graph or table. Data profiling focuses on the important features of the data that will be considered by different stakeholders or business groups, and it helps define the data's metadata. It is the process of statistically examining data sources residing in any format.
    This is the first step in improving data quality, and it helps identify and validate redundant data across data sources.
  • Data Auditing: Data auditing is an analytical way of finding gaps and anomalies in the data. It helps find errors and discrepancies that are normally not identifiable or are unknown. To effectively manage data holdings and fully realize their potential, an IT manager should first be aware of the location and condition of these assets. A proper data audit raises awareness of collection strengths and data issues, and it highlights redundancy, duplication of effort, and areas that need more investment. Other activities that are part of data auditing are as follows:
    • Verify each data item to check its validity and correctness.
    • Check whether all values in a column are unique and realistic.
    • Validate the rules that define functional dependencies, foreign keys, primary keys, etc.
    • Validate all the rules that apply within a row of data.
    • Validate that the data rules are consistent across all data entities or objects.
  • Data Quality Assessment: Data quality assessment is the phase after profiling and auditing; it examines the anomalies found during auditing and designs strategies to address them. An error detected during the project testing phase can cost far more than the same error found during the design phase.
  • Data Integrity: Data integrity is a major issue in every business; regardless of size or industry, information represents a valuable asset. As businesses increase data volumes every year and automate their processes in more complex environments, data integrity becomes a requirement rather than an option. However, companies are often challenged to assess the reliability of their existing data, particularly when dealing with many business processes, applications, and complex data requirements.
  • Data Enrichment: Data enrichment refers to data augmentation or enhancement, correcting inaccurate values or supplying missing values.
  • Data Re-engineering and Cleansing: This phase includes implementing the business rules and procedures from the DQ assessment to cleanse the data, focusing on corrective action on the existing data. This phase typically contains the following activities:
    • Using a data integrity tool to fill in missing values through cross-column verification (a small sketch follows this section).
    • Assessing whether any particular rule conflicts with another business rule.
    • Identifying the right application flow to implement business rules.
    • Avoiding exceptions and the subsequent breakdown of the application.
    • Handling exceptions.
  • Data Defect Prevention: Unlike data cleansing, which deals with data correction, this phase focuses on defect prevention. It is important because it reduces future cleansing effort by improving the processes and systems that capture, store, or process data. Defect prevention is often a one-time effort that reduces the repetitive work required in data cleansing. This phase may comprise the following activities:
    • Enhancing the application to capture complete data.
    • Enhancing the applications to capture accurate information by providing data references and checks at the point of data capture.
    • Training the data entry operators to ensure the quality of the data.
    • Facilitating the online cleansing of data at the point of entry.
    • Enabling use of suitable character sets in which to capture data.
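As referenced in the cleansing phase above, a cross-column fill of missing values might look like the following sketch; the postal-code-to-city mapping and the record are hypothetical.

    # Fill a missing value in one column from a reference mapping
    # keyed on another column (cross-column verification).
    postal_to_city = {"10001": "New York", "94105": "San Francisco"}

    def fill_missing_city(record):
        if not record.get("city") and record.get("postal_code") in postal_to_city:
            record["city"] = postal_to_city[record["postal_code"]]
        return record

    print(fill_missing_city({"name": "Acme Corp", "postal_code": "10001", "city": ""}))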

What is a Data Quality Framework?

The Data Quality Framework was developed as a voluntary, sector-neutral, standardized solution that enables trading partners to collaborate and achieve the benefits of good-quality data, regardless of their size, role, or activity. It is based on a data quality industry protocol that consists of (i) a data quality management system to validate the existence and effectiveness of key data management business processes and (ii) an inspection procedure to physically validate product attributes. This protocol has been included in the Data Quality Framework along with other elements to further expand the areas where trading partners can collaborate to realize the benefits of good-quality data. In a nutshell, it is a comprehensive checklist of best practices and desirable requirements for the optimal management of data quality. Currently the Framework contains the following components:

  • Data Quality Management System (DQMS): provides guidance for organizations to establish, implement, maintain, and improve the processes and activities related to managing the information and data quality of their master data output. The DQMS is critical to the medium- to long-term vision of consistent, high-quality data flowing through the global supply chain. It focuses on the existence of internal business processes, procedures, and common performance criteria.
  • Self-Assessment Tools: offer organizations a means to perform a self-assessment against the key elements of the DQMS in order to reveal opportunities for improving the management of data quality. The self-assessment procedure can be used as a gap-analysis tool to show and prioritize the areas where an organization could realize improvements.
  • Product Inspection Procedure: defines a standardized approach for inspecting the characteristics of trade items and comparing them to their master data.
  • Reference Documentation: additional appendices that point to external documents or expand on the information provided in the previous sections. Note: besides the elements described above, additional implementation and reference tools are available to support the use and application of the Data Quality Framework; refer to the 'Data Quality Framework Implementation Guides v3.0' for further information on how to use these tools.

Benefits of DQ Framework

  • A checklist derived from best practices in the industry.
  • Adapts well to each organization's circumstances.
  • Helps establish well-defined metrics.
  • Supports the implementation of industry-specific standards.
  • Useful in organizing action plans.
  • Provides tools for organizations to run their own assessment of their situation.
  • Offers an objective reference for different companies and industries.
  • The Data Quality Framework is publicly available for the industry to use to add value to trading partners.

Limitations of DQ framework

  • It cannot solve all data quality needs.
  • It needs a step-by-step manual to define the process.
  • It requires detailed implementation of the process.
  • Each document is specific to only one industry.
  • External audit or certification is needed.
  • It is not, by itself, a guarantee for trading partners.

Challenges in a Data Quality Implementation

The following are general challenges faced during the implementation of data quality.

  • Avoiding data problems: Historically, IT companies and managers have tended to deny data problems in their systems; a data quality initiative threatens to unearth problems in data and systems that were previously well concealed.
  • Lower visibility and intangible benefits: Data quality has intangible benefits and is generally difficult to promote within an organization. It does not always result in new application deployment, so organizations may fail to take note of the importance of DQ improvement programs. Projecting the quantifiable business value to be earned from a DQ implementation helps gain recognition of DQ benefits in the organization.
  • Different character sets across data sources: Organizations engaged in international trade often store or receive data in different languages and character sets, which hampers the management of multilanguage data as well as address enrichment.
  • Parallel approach: Corrective as well as preventive measures are often required for a data quality application to succeed. It is essential that companies not only focus on cleaning existing data but also explore how to prevent data discrepancies, which includes training data entry operators, issuing guidelines, and establishing a mutual understanding of data standards.
  • Unavailability of postal data in many regions: Confirming the correctness of addresses through verification against external sources such as postal directories poses a high challenge for companies with a global presence.
  • Data quality tool identification: It can be difficult to identify readily available tools to implement a data quality system; they need to provide best-in-class matching, verification, standardization, auditing, and profiling of data in a comprehensive manner.
  • High customization: Data quality programs readily available in the market cannot be implemented directly and need a high level of customization to suit the company's requirements.
  • Enterprise integration: Data quality applications provide better results when they are integrated across the enterprise and used by various departments. This makes the investment pay off well, but at the same time it demands enterprise-wide cooperation.
  • Third-party data and standards compliance: IT organizations often use third-party reference databases for conformity with industry standards and to leverage readily available authoritative company databases. This integration, through web services or file-based systems, poses a challenge and needs to be carefully considered.
  • ROI measurement: Data quality should be related directly to business goals in order to evaluate the return on investment. This requires mapping realistic goals and objectives and charting out strategies to achieve the planned returns.
  • Constant monitoring: Data quality metrics implemented against business goals provide control and measurement tools to assess the success of the application and, in case of growing failure, a timely opportunity to unearth the causes.
  • Data ownership and accountability: Data ownership should be assigned to a particular team to ensure data quality is achieved effectively. Any tampering with, or wrong information about, the data will lead to wrong conclusions about the quality of the application.
  • Trickling budget: Long-term data quality objectives require a steady flow of money into the application. A business community with a one-time-investment mindset may be upset by continuing financial commitments in the post-launch period.

An organization should not, however, be deterred by the challenges stated above, since data quality is a key piece of a larger jigsaw puzzle without which information goals will not be accomplished. It is thus important to carefully plan data quality endeavors.
Maintaining high-quality data requires constant monitoring and control, but by introducing data quality monitoring and reporting policies and protocols, the decisions to acquire and integrate data quality technology become much simpler. The circumstances that allow wrong or erroneous data into a system are not unique, so data flaws are constantly being introduced. To ensure that data issues do not negatively impact applications and operations, possible approaches include:

  • Formalize an approach for identifying data quality expectations and defining data quality rules against which the data may be validated.
  • Baseline the levels of data quality and provide a mechanism to identify leakages as well as analyze the root causes of data failures.
  • Establish and communicate to the business client community the level of confidence they should have in their data.
