Real World Data Quality Issues – #10 in This Series

How a Company Ended Up With 1,600 Duplicates of the Same Customer Record
Duplicate records are among the most common and most costly data quality issues in CRM and customer databases. Most organizations struggle with a few duplicate records per customer, but one enterprise software company faced a much larger problem.
The company is one of two dominant vendors in its market. Because of a systems integration problem, it accumulated as many as 1,600 duplicate records for a single customer organization.
This is the largest duplicate record group our team has ever seen.
What Caused the Duplicate Records?
The root cause was an integration between their accounting system and CRM platform.
Each time the accounting system generated a customer invoice, the integration created a brand-new customer record in the CRM instead of linking the transaction to the existing customer account.
Over time, every invoice produced another duplicate customer record.
The result was:
- Massive duplicate account groups
- Fragmented customer information
- Inaccurate reporting
- Reduced CRM usability
- Increased operational costs
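To make the root cause concrete, here is a minimal Python sketch of the two behaviors: an integration that creates a customer for every invoice, and one that looks up the existing customer first. The `Crm` class, the `accounting_key` field, and both function names are hypothetical and exist only for illustration; the real integration and CRM are not shown in this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Crm:
    """Toy in-memory CRM. Every name here is a hypothetical stand-in."""
    customers: dict[int, str] = field(default_factory=dict)       # crm_id -> accounting key
    invoices: dict[int, list[str]] = field(default_factory=dict)  # crm_id -> invoice numbers
    next_id: int = 1

    def create_customer(self, accounting_key: str) -> int:
        crm_id = self.next_id
        self.next_id += 1
        self.customers[crm_id] = accounting_key
        self.invoices[crm_id] = []
        return crm_id

    def find_customer(self, accounting_key: str) -> int | None:
        for crm_id, key in self.customers.items():
            if key == accounting_key:
                return crm_id
        return None


def sync_invoice_faulty(crm: Crm, accounting_key: str, invoice_no: str) -> int:
    # The behavior described above: every invoice creates a new customer record.
    crm_id = crm.create_customer(accounting_key)
    crm.invoices[crm_id].append(invoice_no)
    return crm_id


def sync_invoice_linked(crm: Crm, accounting_key: str, invoice_no: str) -> int:
    # Look up the existing customer first; create one only when no match exists.
    crm_id = crm.find_customer(accounting_key)
    if crm_id is None:
        crm_id = crm.create_customer(accounting_key)
    crm.invoices[crm_id].append(invoice_no)
    return crm_id
```

With the faulty version, three invoices for the same accounting key leave three customer records behind. With the linked version, the same three invoices attach to one record. A real integration would need a more careful match than a single key, which is an assumption this sketch glosses over.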
Like many organizations, they initially attempted to clean up the duplicates manually. However, the volume of records made manual deduplication impractical.
They needed an automated data quality solution.
The Hidden Challenge of Large-Scale CRM Deduplication
When we analyzed the database, we discovered a second problem.
In their CRM system, whenever duplicate records were merged, associated child records—such as activities, transactions, contacts, and historical data—were moved to the surviving master record.
As duplicate records were consolidated, the number of child records attached to the surviving account grew dramatically.
What began as a duplicate account problem evolved into a data structure problem.
Some customer accounts became enormous “data pyramids” with:
- One surviving customer record
- Thousands of related child records
- Complex relationship structures
Every merge operation required the CRM platform to examine and reconcile all related records.
As the number of child records increased, merge performance slowed significantly.
Without a different approach, the cleanup project would have taken weeks, possibly longer, to complete.
Our Solution: Smarter Duplicate Record Merging
The answer was surprisingly straightforward.
Rather than merging all duplicate records directly into a single survivor record, we developed a program that systematically reduced duplicate groups in stages.
The process worked like this:
- Merge pairs of duplicate records that had never been merged before.
- Merge pairs of records that had been merged once.
- Merge pairs of records that had been merged twice.
- Continue reducing the duplicate group in progressively smaller batches.
This strategy ensured that records with the fewest child records were merged first.
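The staged process above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program we actually ran: `merge` is a hypothetical callback standing in for whatever merge call the CRM exposes, the first record of each pair is arbitrarily chosen as the survivor, and a record left without a partner simply waits for the next round.

```python
from typing import Callable, Hashable, Iterable


def staged_merge(
    record_ids: Iterable[Hashable],
    merge: Callable[[Hashable, Hashable], None],
) -> Hashable:
    """Reduce one duplicate group to a single survivor, least-merged records first.

    `merge(survivor, duplicate)` is a hypothetical stand-in for the CRM merge call.
    Returns the id of the final surviving record.
    """
    remaining = list(record_ids)
    if not remaining:
        raise ValueError("duplicate group is empty")

    # How many rounds of merging each record has already absorbed.
    merge_count = {rid: 0 for rid in remaining}

    while len(remaining) > 1:
        # Stable sort: records with the fewest prior merges are paired first.
        remaining.sort(key=lambda rid: merge_count[rid])
        next_round = []
        for i in range(0, len(remaining) - 1, 2):
            survivor, duplicate = remaining[i], remaining[i + 1]
            merge(survivor, duplicate)
            merge_count[survivor] = max(merge_count[survivor], merge_count[duplicate]) + 1
            next_round.append(survivor)
        if len(remaining) % 2:
            # Odd record out: carry it into the next round unmerged.
            next_round.append(remaining[-1])
        remaining = next_round

    return remaining[0]
```

Calling `staged_merge(["a", "b", "c", "d"], merge)` issues the merges (a, b), (c, d), and then (a, c), so no record absorbs a much larger partner until the very end.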
For the largest duplicate group, the process looked roughly like this:
- 800 merges
- Then 400 merges
- Then 200 merges
- Then 100 merges
- And so on
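The round sizes follow from simple halving. The short sketch below computes them for any group size, assuming that a record left without a partner in one round carries over to the next; the function name `merge_rounds` is ours, purely for illustration.

```python
def merge_rounds(group_size: int) -> list[int]:
    """Return the number of pairwise merges in each round for one duplicate group."""
    rounds = []
    remaining = group_size
    while remaining > 1:
        merges = remaining // 2   # pair up as many records as possible
        rounds.append(merges)
        remaining -= merges       # each merge removes one record
    return rounds


print(merge_rounds(1600))
# [800, 400, 200, 100, 50, 25, 12, 6, 3, 2, 1]
```

For a group of 1,600 that is 11 rounds and 1,599 merges in total, the same number of merges as folding everything into one survivor. The saving comes from how small the records are when most of those merges happen.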
By balancing the merge workload, we prevented the CRM system from becoming overwhelmed by extremely large parent-child record structures.
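A rough cost model shows why balancing helps. The sketch below assumes that the cost of a merge is proportional to the number of child records attached to the survivor afterward, and that every duplicate starts with one child record (the invoice that created it). Both are simplifying assumptions for illustration, not measurements from the CRM in this story.

```python
def single_survivor_cost(n: int, children_each: int = 1) -> int:
    """Total cost of merging n records one at a time into a single survivor."""
    survivor_children = children_each
    total = 0
    for _ in range(n - 1):
        survivor_children += children_each
        total += survivor_children  # assumed cost: children on the survivor after the merge
    return total


def staged_cost(n: int, children_each: int = 1) -> int:
    """Total cost of merging n records in rounds of pairs, smallest records first."""
    sizes = [children_each] * n
    total = 0
    while len(sizes) > 1:
        sizes.sort()
        merged = []
        for i in range(0, len(sizes) - 1, 2):
            combined = sizes[i] + sizes[i + 1]
            total += combined       # same assumed cost as above
            merged.append(combined)
        if len(sizes) % 2:
            merged.append(sizes[-1])  # odd record waits for the next round
        sizes = merged
    return total


print(single_survivor_cost(1600))  # 1280799
print(staged_cost(1024))           # 10240
```

Under this model the single-survivor total grows with the square of the group size, while the staged total grows roughly as n log n. The last staged merge still touches every child record in the group, which matches what we saw in practice.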
The final merge operation was still slow, because it combined the two largest remaining records, but the entire deduplication project for the company’s database finished in a few days instead of several weeks.
Help from Acme Data
We are data experts. It’s all we do. Our primary focus is implementing our data quality platform, Data Studio, but we have frequently worked with Global 2000 companies to solve complex, large-scale data problems.
If you could use some help with complex data quality problems, contact us. We’re easy to work with.