SAP Training Blog by Michael Management

Effective SAP Data Cleansing

Written by Yoann Bierling | Sep 22, 2021

It's an inevitable problem. Over time, data gets messy. This includes the data in your SAP system. Why does this happen?

During day-to-day operations, data gets duplicated. For example, various users may enter redundant data without knowing that it already exists. For some, systems have been merged together following digital strategy or operational changes. Clients may have also have changed their corporate setup leading to redundant data. Whatever the case, you may end up with data representing the same object, but recorded in different ways.

While it might not seem like a critical problem at first, it can actually lead to huge issues. For example, sales could be reported to the wrong business unit because customers exist twice in the system because they were created by salespeople from different teams. The solution is proper data cleansing.

 

What is Data Cleansing?

Data cleansing is the process of resolving issues with duplicate data and making sure that such issues don't occur in the first place. This includes double-checking the system's data quality and addressing all problems: starting with data extraction from SAP and satellite systems, followed by automatic and manual checks in Microsoft Excel or other data visualization programs, and completed with system update during a business interruption.

 

1. Data extraction

The first step in any data cleansing process is to identify and extract the data.

While it sounds simple at first, it gets more compiles the data cleansing coordinator goes deeper into the system.

Which datasets should be checked, how and by whom? These questions are not trivial, and it usually starts by defining data objects, and their scope.

If a corporation plans on cleansing their customer data, they might want to only include clients that have been active in the past few years, and simply ignore any client that did not interact for more years than the necessary legal duration for invoice archiving.

Older clients can then be avoided, and active ones extracted for further data checks. These checks should be based on defined data definition: which fields are necessary for the checks, can contain wrong information, and will have to be checked.

A data object is a defined way of organizing a subset of data: customer information, bill of materials, are some data objects examples.

The more data objects are identified in the process, the more work will be necessary to cleanse data.

Once the data objects have been defined, extraction processes are validated: how often the data is changed, how long the cleansing will take, and how often will it be extracted?

As business and data entry cannot be stopped, the system data will keep being updated with new – and potentially duplicate – data that must be added to the checklist.

A cleansing process typically takes several months, therefore it is necessary to clearly define these steps.

 

2. Data check

Once the data has been extracted, it can be checked and corrected by data cleansers, based on various data quality criteria:

  • Validity: the data must validate all business rules currently in place, and if it doesn't pass this test, should be double-checked to be either removed from the system, with underlying data being linked to valid data, or updated to meet global requirements,
  • Accuracy: this dimension probably is the hardest one to enforce, as data accuracy means checking in detail every field, and the knowledge not always being in possession of data cleansers. A typical example being zip codes accuracy in customer data. How can a data cleanser be sure that the zip code, especially in another country, can be the right one for a customer address? Double-checking by comparing to zip codes databases, checking address validity on Internet, or validating with customers can quickly become a tedious task,
  • Completeness: while it might start by simply ensuring that all mandatory fields contain data, it doesn't stop there. A manual check of each data set is necessary. It might be helpful to set up automatic checks, such as minimum and maximum field lengths in case of text fields like postal addresses,
  • Consistency: it is where the data cleansers will ensure that no duplicates are existing, like the same customer entered with various name variations, once containing the Corp. mention, and once without for example. This tedious task can be simplified by using automated text checks: values already contained in other entries, X% similar to other entries. Usually, it is also necessary for the data cleansers to double-check with key users which value was the correct value,
  • Uniformity: every piece of data in the same system should be using the same units of measure, and information entered must match corporate policies. It is not possible to compare meters and miles without conversion, and that conversion should be ensured in the system through that dimension.

The ongoing process of data quality validation will help to check metrics such as the percentage of completion of these various checks, that are managed by the data cleansing coordinator, checked by the data cleansers in coordination with business champions, and reported to management until the data quality score matches company's goal, at which point the system can be updated.

 

3. Data update

The final step in the data cleansing activity, at least for a specific project, unless the company has a continuous data quality improvement process, is to proceed to the system update.

It will typically involve a system stop and will be performed over a weekend, during a low business volume period, or by taking advantage of public holidays and business closure.

Depending on the amount of data to be updated, it can take up to several days, and any issue during this critical process might lead to business discontinuity beyond the planned business break.

The technical team will take care of updating the system data based on extracted data and quality check results, and update them one by one using custom-developed programs: archive old and irrelevant data, update identified values, and proceed to the next object.

If any issue occurs at this point in the process, the whole team will most likely be called up to work on it until the update process has been technically finished, and key users can validate data cleansing process completion.

 

4. Set up for a successful data cleansing

Ensuring data correctness isn't only an ongoing process. It might be necessary to run a whole project following technical updates, company mergers, and other situations.

While each data cleansing project is different, it is generally necessary for the whole team to be skilled in data visualization.

Ensuring such a project's success can be accomplished through workforce upskilling not only in data cleansing understanding, but also in system basic usage, and most likely with complete Microsoft Excel skills, as most of the actual checks include setting up formulas that will compare huge amount of data automatically.

Plan your SAP data cleansing project success and save resources time by getting them upskilled with the following certifications: