Data Profiling – Critical for Data Migration

Data migration in simple terms is a process by which data is extracted, transformed and loaded from legacy applications and sources to the target application landscape. Since it is a process its steps must be well defined and in a proper sequence to ensure least disruption to the operations and customer service of the organization. Let us have a brief look at these steps at a high level before we delve deeper into an important step of this process, commonly known as data profiling.

As you can see from the above that data profiling is at the center of the migration process. Let us now look at the definition and details of data profiling.

“Data profiling is defined as the activity of systematically analyzing the data from various sources identified for migration in order to assess the cleanliness, conformity, uniqueness and duplicity of data, in order to ensure smooth migration of data to target landscape and the seamless functioning post migration.”

Now let us look at the purpose and importance of data profiling for a data migration project. During data migration it is important for data to move from source to target in a seamless and error free manner. In order to ensure this, the following data dimensions are critical – Completeness, Conformity, Accuracy and Uniqueness

Dimension Significance Description
Completeness Is all data present and intact? This check identifies the information which is missing in the data attributes This check involves Identification of data fields with no content.
Conformity Does the data conform to the metadata values for each attribute? This check validates, if the data conforms to the allowed metadata format or data type for specific attributes. This check involves: Determining the conformity of the data type format (date, string, currency, number etc.) Determining the conformity of the data pattern (e.g. email, phone, Identifiers etc.)
Accuracy Is the data correct & valid? This check verifies if the information in the data attributesaccurate and maintains the referential integrity with other related attributes. This check involves: Identification of suspect field values (e.g. containing special character, only one character etc.)Ascertaining the data integrity within specific data attributes (Gender & Salutation correlation, Resident Flag and Residence)Ascertaining if junk data has been provided (e.g. XX people with same phone number/email address etc.)Determining the range of value for specific attributes (e.g. date, amount etc.)Date Interdependency Checks between related date attributes (e.g. between Issue Date and Expiry Date etc.)Analysing the valid data values based on reference data set.( e.g. there could be reference static codes which are parameterized which only need to be used)Provision for international address and mobile number within EU.
Uniqueness Is each record unique? This check identifies duplicate records based on business defined rules.

All of the above is handled through the data profiling activity. Hence this activity assumes paramount importance in a data migration project. Now that we know the importance and purpose of data profiling activities, let us look at the steps involved in data profiling.

Sequence of data profiling:

  1. Restoration of source data – The complete data set which needs to be profiled is identified and all application databases identified need to be restored into a staging environment for profiling. Care needs to be taken to ensure that all relevant tables in its entirety are restored and no data is omitted from profiling however un-important it may seem.
  2. Develop Profiling scripts and extract data – Profiling of banking data cannot be done manually as there is a scope for human error and oversight, hence the profiling is better executed through a tool or through automated scripts. These scripts are built by developers based on directions from data analysts with knowledge of functional banking functions and migration. The scripts will extract the data in dimensions of profiling and provide insights to the data analyst about the quality and adequacy of data for migration.
  3. Analyze data – The data once extracted is analyzed by data analysts and statistics of the dimensions are recorded. A sort of data profiling dashboard is an outcome of this analysis. This dashboard is essential for two purposes.
    • The first purpose of the dashboard is, it provides the scope of data cleansing needed by the organization. It shows how many data attributes need to be cleansed in order for the organization to plan the resources needed to complete the activity in time before final rounds of migration are completed.
    • The second purpose of the dashboard is, it provides an insight to the migration team and the management as to the complexity of data, the extent of data and the cleanliness of the data for migration purposes. This helps the management and the project team in better planning and resourcing the project.

A sample of profiling dashboard is shown below:

4. Identify the cleansing actions – Once the cleansing actions are identified as part of the above step a tracker is created and shared with the teams responsible for the cleansing of the data. This tracker will describe the cleansing action, the records needing cleansing, the criticality of the attribute in order to prioritize the cleansing and a status for periodic reporting and monitoring of the cleansing.

Assigning Criticality

5. Perform cleansing actions – Based on the tracker above a plan of action is prepared, resources allocated for cleansing and the cleansing activity is undertaken by the operations teams or data governance teams. This activity may either be done by the BAU teams responsible for creating and maintaining the data or by a dedicating team insourced or outsourced by the bank. The activity is to be properly monitored and the progress is to be regularly reported to the management and the project team. Proper milestone achievement is critical for the alignment of the cleansing and the migration activity.

6. Repeat the steps – Once cleansing activity is reported as completed, the latest cut of data needs to be restored again and steps through 1 to 5 are repeated till there are no further actions pending. Care is to be taken by the management of the bank to ensure that no new data is on-boarded with the same issues as earlier. This defeats the outcome of the profiling activity and leads to continued data erosion while more efforts and spend is needed on cleansing activities.

Challenges to profiling:

  1. Identification and restoration of data –

Data is vast and is often dispersed across an organization in several forms.

  • It may be stored in physical form or digital form.
    • It may be maintained systematically or randomly.
    • It may be stored locally or centrally.

Identification of all the data sources and restoring them to a common repository is the primary challenge.

  • This can be mitigated by running data discovery workshops involving key data owners and consumers in the organization and cataloguing the source with their help.

2. Stakeholder mapping –

It is a key challenge to identify and create ownership for data profiling and cleansing actions.

  • Data can be spread across multiple business and support units, like technology, operations, sales and marketing, management, etc.
    • It is a difficult task to create a complete stakeholder map and ensure participation of all the stakeholders.

This can be mitigated with the help of management teams.

  • The key stakeholders should be mapped and communicated from top level for an effective participation. –

3. Cleansing process and timelines

Cleansing is always a challenging activity and the older the data the more challenging it is to cleanse as volumes get built over time.

  • To plan, identify resources and execute cleansing could prove expensive for an organization without support from technology.

To mitigate this challenge a proper prioritization matrix should be built to identify the critical items for cleansing.

  • Further when volume is high technology should support an automated approach to cleansing.
  • The whole process needs to be planned and monitored by a dedicated team.
  • The progress needs to be reviewed at the management level and necessary impetus be provided to support the completion of the activity.

Benefits of profiling:

  1. Data insights –
    • Profiling provides key insights into an organizations data.
    • It provides the volume and metrics of business data
    • It throws up the levels of inconsistency, incoherence and completeness of the data
    • It gives insight into the complexity of data for a migration
  2. Planning a migration –
    • Profiling helps in better planning of migration by exposing the underlying the complexity of data
    • Profiling helps provide a timeline for data cleansing and aligning it to the larger program  plan
  3. Digitalization and centralization –
    • Profiling helps in undertaking key strategic decisions related to digitalization and centralization of data and operations post data migration
    • It helps drive decisions related to hardware and storage sizing requirements of the organization

As we have seen, the data profiling activity is critical for data migration process. It ensures a seamless, error free migration and smooth operations and customer service post cutover.

Back to Blog Series

If you need more about this topic please drop a mail

    About Author