Data Uniqueness & Duplicate Detection

What is Uniqueness?

Uniqueness measures whether your data values are distinct and non-duplicated. A field has high uniqueness when each record contains a different value. Uniqueness breaks down when the same value appears across multiple records, or when text fields contain repetitive templated content that adds no analytical value.

Duplicate records cost your organization at every stage. Three Account records for the same company split your pipeline. Two Contact records for the same person mean that person receives every marketing email twice. Boilerplate text pasted into thousands of case descriptions makes it impossible to extract insights. Uniqueness analysis quantifies all of these problems.

Uniqueness Rate = (Records with Unique Values / Total Records) x 100

If 7,800 of 10,000 Contact records have a distinct Email value, your Email uniqueness rate is 78%. The remaining 22% share email addresses with at least one other record. This single metric tells you whether a field that expects unique values actually has them.
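As a rough illustration of the formula above, the following Python sketch computes a Uniqueness Rate for a list of values. The function name uniqueness_rate and the generated email addresses are invented for this example. It is not the DQS implementation, only one plausible reading of the definition, in which a record counts as unique when no other record shares its value.

```python
from collections import Counter


def uniqueness_rate(values):
    """Percentage of records whose value appears exactly once in the list."""
    if not values:
        return 0.0
    counts = Counter(values)
    unique_records = sum(1 for v in values if counts[v] == 1)
    return 100.0 * unique_records / len(values)


# Hypothetical data: 7,800 one-off addresses plus 2,200 records
# that share 400 addresses between them (10,000 records in total).
emails = [f"user{i}@example.com" for i in range(7800)]
emails += [f"shared{i % 400}@example.com" for i in range(2200)]

print(uniqueness_rate(emails))  # 78.0
```

Under this reading, every record that participates in a duplicate group counts against the rate, not just the "extra" copies.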

Why Uniqueness Matters

Reporting

Duplicate records inflate your numbers. When the same company appears as three Accounts, your Account count is overstated by two. Pipeline reports show three deals where one exists. Customer counts used for board presentations and investor reporting are wrong.

Automation

Salesforce automation treats each record independently. A duplicate Account triggers duplicate workflows, sends duplicate notifications, and creates duplicate tasks. A renewal process that fires on every Account matching a company name triggers three times instead of once.

AI and Agentforce

AI models process each record as a separate entity. Duplicate records fragment the model’s view of a customer. Agentforce generates responses using your Salesforce data. When three Account records exist for the same company, Agentforce sees three customers, not one with a complete history. Repetitive boilerplate content in text fields teaches the model your templates, not your business patterns.

System	Uniqueness Impact
Reports	Inflated counts, fragmented metrics
Workflows	Duplicate triggers, redundant notifications
Duplicate Rules	Overwhelmed by existing duplicates if not detected
Agentforce	Fragmented customer view, template-polluted learning

How DQS Measures Uniqueness

DQS produces 6 uniqueness metrics organized around a diagnostic question: “Is the data distinct, how is it distributed, and is the text content original?”

Think of these metrics as a diagnostic flow. Each layer builds on the previous one.

Layer 1: Are Values Unique?

Uniqueness Rate is the headline metric. It calculates the percentage of records where the field value is distinct (not duplicated anywhere else in the dataset). This is the number you put on a dashboard.

You run a scan on the Contact object. The Email field shows a Uniqueness Rate of 78%. That means 22% of Contact records share an email address with at least one other Contact. Some are legitimate (shared department emails like [email protected]), but most are likely duplicate contacts that need merging. This single number justifies a deduplication initiative.

Distinct Count tells you the cardinality of the field: how many different values actually exist. If 10,000 Contact records contain 8,200 distinct email addresses, the Distinct Count is 8,200.

Example: You expect the Lead_Source picklist to have about 12 values (your configured picklist options). But Distinct Count shows 87. Before the picklist was restricted, reps typed in free-text values. “Web”, “Website”, “Webinar”, and “web form” all count as distinct, and with Case Sensitive enabled so does “web”. This metric reveals that your Lead Source data needs normalization, even though the picklist is now locked down.
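To make the Lead Source example concrete, here is a minimal Python sketch of a distinct count with an optional case-sensitivity switch. The function distinct_count and its case_sensitive parameter are hypothetical names chosen for this illustration. They mirror the Case Sensitive setting only loosely and say nothing about how DQS computes the metric internally.

```python
def distinct_count(values, case_sensitive=False):
    """Count distinct non-null values, optionally folding letter case."""
    seen = set()
    for value in values:
        if value is None:
            continue  # nulls are skipped in this sketch
        seen.add(value if case_sensitive else value.casefold())
    return len(seen)


# Free-text variants of the kind reps might have typed in
lead_sources = ["Web", "web", "Website", "Webinar", "web form"]

print(distinct_count(lead_sources, case_sensitive=True))  # 5
print(distinct_count(lead_sources))                       # 4
```

Case folding merges only “Web” and “web”. The remaining variants still need an explicit mapping to the intended picklist values.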

Layer 2: How Is the Data Distributed?

Uniqueness Rate tells you how many values are unique. Distribution metrics tell you how those values are spread across records. Two fields can have the same Uniqueness Rate but very different distributions.

Entropy measures how evenly values are distributed using Shannon entropy. The scale ranges from 0 (every record has the exact same value) to a maximum determined by the number of distinct values. Higher entropy means more diverse, more evenly spread data.

Raw entropy is hard to interpret on its own, because its ceiling depends on how many distinct values the field has. Compare it to the maximum possible entropy for that field. Maximum = log2(Distinct Count), which is the entropy you get if every distinct value appeared exactly the same number of times. The ratio (actual / max) gives you a normalized score from 0 to 1:

Normalized (actual / max)	Interpretation
0.9 or above	Even distribution: values spread uniformly
0.7 to below 0.9	Moderate skew: some values appear more than others
Below 0.7	Dominated: a few values hold most of the records

Example: Your Industry field on Accounts has a Uniqueness Rate of 2% (expected for a picklist) and 24 distinct values. Looks fine. But entropy is 1.3, and maximum entropy for 24 values is 4.6. The normalized score is 0.28. The distribution is severely skewed: 60% of records are “Technology” and “Financial Services.” Your industry-based segmentation is a two-bucket system dressed up as 24 categories.
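The entropy calculation and the interpretation bands above can be sketched in a few lines of Python. The function names and the toy data are assumptions made for this example, and the thresholds are simply the ones from the table. This is an illustration of Shannon entropy in bits, not the DQS code.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the value distribution."""
    total = len(values)
    counts = Counter(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def normalized_entropy(values):
    """Entropy divided by log2(distinct count), giving a score from 0 to 1."""
    distinct = len(set(values))
    if distinct <= 1:
        return 0.0  # zero or one distinct value: no spread to measure
    return shannon_entropy(values) / math.log2(distinct)


def interpret_distribution(score):
    """Map a normalized score to the bands in the table above."""
    if score >= 0.9:
        return "even"
    if score >= 0.7:
        return "moderate skew"
    return "dominated"


uniform = ["A", "B", "C", "D"] * 25          # four values, 25 records each
skewed = ["A"] * 97 + ["B", "C", "D"]        # one value holds 97% of records

print(normalized_entropy(uniform))                          # 1.0
print(interpret_distribution(normalized_entropy(skewed)))   # dominated
print(round(1.3 / math.log2(24), 2))                        # 0.28
```

The last line reproduces the Industry example: an entropy of 1.3 against a maximum of log2(24), roughly 4.6, gives a normalized score of about 0.28.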

Max Frequency gives you the count of occurrences for the single most common value. If “London” appears 8,400 times in the City field, Max Frequency is 8,400.

A single dominant value often signals a default value problem, a migration artifact, or a genuine business concentration that needs investigation. Max Frequency raises the question. A quick check of the actual value answers it.

Example: The Billing_Country field has a Max Frequency of 34,000 out of 40,000 records. That is 85% of records with one country. Either your business is genuinely concentrated in one market, or someone set a default during migration. The metric surfaces the pattern; you determine the cause.
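A small Python sketch of the same check follows. The helper name max_frequency and the country breakdown are made up for illustration. Only the 34,000 of 40,000 figure comes from the example above.

```python
from collections import Counter


def max_frequency(values):
    """Return (most common value, its occurrence count), or (None, 0) if empty."""
    if not values:
        return None, 0
    value, count = Counter(values).most_common(1)[0]
    return value, count


# Hypothetical breakdown that matches the 34,000-of-40,000 example
countries = ["United States"] * 34000 + ["Germany"] * 4000 + ["Japan"] * 2000

value, count = max_frequency(countries)
share = 100 * count / len(countries)
print(value, count, share)  # United States 34000 85.0
```

Reporting the dominant value alongside its count is what lets a reviewer tell a default such as "Unknown" from a genuine concentration.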

Layer 3: Is the Text Content Original?

The first two layers measure whether values are identical. Layer 3 asks a different question: is text content substantially similar? Two case descriptions can be 100% unique (different case numbers, dates) but 90% boilerplate (same template, same phrases).

Boilerplate Rate is the headline metric for text content originality. It measures the percentage of content that is repetitive or templated. A higher rate means more boilerplate and less original content. DQS detects common templates like email signatures, legal disclaimers, and repeated phrases.

Example: Your organization is evaluating whether the Description field on Opportunities is suitable for AI-powered win/loss analysis. Uniqueness Rate is 99% (every description is technically different). But Boilerplate Rate reveals that 65% of the content follows the same template: “Customer: [name]. Need: [product]. Timeline: [date].” The AI model would learn your template structure, not your win patterns. Boilerplate Rate saves you from a garbage-in, garbage-out AI project.

Boilerplate Records Count gives you the cleanup scope as an absolute number. If 12,400 records contain boilerplate, your data steward knows the size of the remediation project. They can estimate hours, assign resources, and set a realistic timeline.

Example: Your support team logs every interaction in Case Comments. Boilerplate Records Count shows 12,400. Investigation reveals that agents paste a standard opening (“Thank you for contacting support. Your case number is…”) and closing (“Please don’t hesitate to reach out…”) into every case. Before using AI to analyze support interactions, those 12,400 records need the boilerplate stripped.
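The source does not describe how DQS detects boilerplate, so the sketch below uses a deliberately simple stand-in: a sentence is treated as boilerplate when it appears in at least min_records different records. The function names, the min_records parameter, the sentence-based rate, and the sample case texts are all assumptions made for this illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip().lower() for p in parts if p.strip()]


def boilerplate_stats(records, min_records=3):
    """Return (boilerplate rate in %, number of records containing boilerplate)."""
    per_record = [split_sentences(r) for r in records]
    # Count in how many records each sentence occurs (once per record)
    doc_freq = Counter(s for sentences in per_record for s in set(sentences))
    boilerplate = {s for s, n in doc_freq.items() if n >= min_records}

    total = sum(len(sentences) for sentences in per_record)
    repeated = sum(1 for sentences in per_record for s in sentences if s in boilerplate)
    records_hit = sum(1 for sentences in per_record if any(s in boilerplate for s in sentences))
    rate = 100.0 * repeated / total if total else 0.0
    return rate, records_hit


cases = [
    "Thank you for contacting support. The printer jams on page two. Please reach out again if needed.",
    "Thank you for contacting support. Login fails after a password reset. Please reach out again if needed.",
    "Thank you for contacting support. The invoice total is wrong. Please reach out again if needed.",
    "Customer asked about pricing tiers.",
]

print(boilerplate_stats(cases))  # (60.0, 3)
```

Here 6 of 10 sentences are repeated openings and closings, and 3 of 4 records contain them. A production detector would likely need fuzzy matching to catch templates with small edits.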

Three Angles of Analysis

Uniqueness metrics cover three distinct concerns, each serving a different stakeholder:

Concern	Metrics	Question	Stakeholder
Duplication	Uniqueness Rate, Distinct Count	Do we have repeated values?	Data stewards (merge candidates, dedup rules)
Distribution	Entropy, Max Frequency	How is data spread across values?	Analysts and data scientists (segmentation, modeling)
Originality	Boilerplate Rate, Boilerplate Records Count	Is text content genuinely original?	AI teams (training data quality, content extraction)

Metric Reference

Foundation Metrics

These 2 metrics form the base of every uniqueness analysis. They work across all 15 supported field types.

Metric	Type	What It Measures
Uniqueness Rate	Percentage	Share of records with non-duplicate values
Distinct Count	Count	Total number of distinct values in the field

Advanced Metrics

These 4 metrics go beyond “are values unique?” to analyze distribution patterns and text originality. They require the Advanced Uniqueness Analysis mode.

Metric	Type	What It Measures
Entropy	Decimal	How evenly values are distributed (Shannon entropy)
Max Frequency	Count	Occurrence count of the single most common value
Boilerplate Rate	Percentage	Degree of templated or repetitive content
Boilerplate Records Count	Count	Number of records with boilerplate content

Field Type Coverage

Different metrics apply to different field types based on what they measure.

Coverage Group	Field Types	Metrics Available
All types (15)	String, TextArea, LongTextArea, Number, Currency, Percent, AutoNumber, Date, DateTime, Picklist, Email, Phone, URL, Lookup, Checkbox	Uniqueness Rate, Distinct Count
Analysis types (9)	String, TextArea, Number, Picklist, Multiselect Picklist, Checkbox, Email, Phone, URL	Entropy, Max Frequency
Text fields (3)	String, TextArea, Html	Boilerplate Records Count
Long text fields (3)	TextArea, LongTextArea, Html	Boilerplate Rate

Foundation metrics work on all 15 field types because any field can have duplicates. Distribution metrics (Entropy, Max Frequency) work on 9 field types that produce countable frequency tables. Boilerplate metrics apply only to text fields because they detect repeated content patterns in free-text data.

Two Analysis Modes

DQS offers two uniqueness analysis modes:

Basic Uniqueness answers the question: “Are values distinct?” It produces the 2 foundation metrics and covers the essentials for a quick duplicate detection check or baseline audit.

Advanced Uniqueness Analysis goes deeper. It produces all 6 metrics, including distribution analysis, frequency patterns, and boilerplate detection. Use this mode when you need to understand the full picture of data distribution and text originality, not just the duplication rate.

Business Need	Recommended Mode
Quick duplicate detection audit	Basic Uniqueness
Data migration assessment	Advanced (Max Frequency catches default values, Entropy reveals skew)
Picklist hygiene check	Advanced (Entropy + Max Frequency reveal skew and normalization needs)
AI training data evaluation	Advanced (Boilerplate metrics assess content originality)
Ongoing data governance	Start with Basic Uniqueness, move to Advanced for deeper analysis

Configuring Uniqueness

DQS provides 2 configuration inputs for uniqueness. Each can be set at the global level (applies to all fields) and overridden at the individual field level.

Setting	What It Controls
Case Sensitive	Controls whether value comparison considers letter casing. When disabled (the default), “Apple” and “apple” count as the same value. When enabled, they count as two distinct values.
Include Blanks	Controls whether null and blank records are counted in uniqueness calculations. When disabled (the default), blanks are excluded from evaluation. When enabled, all blank records share a single “blank” value, which can lower the uniqueness rate on fields with many empty records.

Tip: Leave Case Sensitive disabled (the default) for most fields. Enable it only when casing carries meaning, like product codes where “ABC-100” and “abc-100” are genuinely different items.

When to Enable Include Blanks

By default, DQS excludes blank and null records from uniqueness calculations. This makes sense for optional fields where blanks are expected.

Enable Include Blanks when blanks themselves are the problem you want to measure. If 3,000 Contact records have no Email value, those 3,000 blanks share one “blank” value in the uniqueness calculation. This lowers the Uniqueness Rate and makes the blank problem visible in the headline metric.

Example: You scan Phone on Contacts with Include Blanks disabled. Uniqueness Rate is 91%. You enable Include Blanks and re-scan. Uniqueness Rate drops to 72%. The difference reveals that a large portion of your Contact records share a common trait: no phone number. The field looked healthy when blanks were excluded, but the full picture tells a different story.
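The effect of both settings can be sketched in Python as follows. The parameter names case_sensitive and include_blanks mirror the two settings described above, but the function itself, the blank-detection rule, and the sample phone numbers are assumptions made for illustration rather than DQS behavior.

```python
from collections import Counter

_BLANK = object()  # single shared stand-in for every null or blank record


def uniqueness_rate(values, case_sensitive=False, include_blanks=False):
    """Uniqueness Rate in %, with hypothetical versions of the two settings."""
    cleaned = []
    for value in values:
        if value is None or str(value).strip() == "":
            if include_blanks:
                cleaned.append(_BLANK)  # all blanks collapse into one value
            continue
        text = str(value)
        cleaned.append(text if case_sensitive else text.casefold())
    if not cleaned:
        return 0.0
    counts = Counter(cleaned)
    return 100.0 * sum(1 for v in cleaned if counts[v] == 1) / len(cleaned)


# Six distinct phone numbers and four blank records (made-up data)
phones = ["555-0100", "555-0101", "555-0102", "555-0103", "555-0104", "555-0105",
          None, "", "  ", None]

print(uniqueness_rate(phones))                       # 100.0
print(uniqueness_rate(phones, include_blanks=True))  # 60.0
print(uniqueness_rate(["ABC-100", "abc-100"]))                       # 0.0
print(uniqueness_rate(["ABC-100", "abc-100"], case_sensitive=True))  # 100.0
```

With blanks included, the four empty records form one duplicate group and enter the denominator, which is why the rate falls even though no phone number is repeated.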

Common Uniqueness Issues

Duplicate Records from Bulk Imports

Data migrations and list imports introduce duplicates when matching logic is insufficient. A purchased contact list creates new records for people who already exist. A legacy system export creates Accounts that overlap with current data.

Fix: Audit imports before loading. Use DQS to establish a uniqueness baseline on key identifier fields (Email, Phone, Website) before and after each import.
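One way to audit an import file before loading is to split incoming rows into those that match an existing identifier and those that do not. The Python sketch below does this for Email. The function split_import, the dictionary-shaped records, and the sample addresses are hypothetical, and real matching would normally use more than one field.

```python
def split_import(existing_emails, incoming):
    """Split incoming records into (new, duplicates) by case-insensitive Email."""
    known = {e.strip().casefold() for e in existing_emails if e}
    new, duplicates = [], []
    for record in incoming:
        key = (record.get("Email") or "").strip().casefold()
        if key and key in known:
            duplicates.append(record)
        else:
            # Records without an email cannot be matched here, so they pass through
            new.append(record)
            if key:
                known.add(key)  # also catches repeats inside the import file
    return new, duplicates


existing = ["ana@example.com", "Bo@Example.com"]
incoming = [
    {"Email": "ANA@example.com"},   # already in the org
    {"Email": "cy@example.com"},    # new
    {"Email": "cy@example.com "},   # repeat within the file
    {"Email": None},                # no identifier to match on
]

new, duplicates = split_import(existing, incoming)
print(len(new), len(duplicates))  # 2 2
```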

Default Values Masquerading as Data

Integrations and migrations often write default values into fields. “Unknown”, “N/A”, or a company’s own name appears on thousands of records. These inflate duplicate counts and distort distribution metrics.

Fix: Run Advanced Uniqueness Analysis. Max Frequency reveals the dominant value. If one value appears on 85% of records, investigate whether it is real data or a default.

Free-Text Fields with No Governance

Text fields that lack picklist constraints accumulate variations over time. The Job_Title field on Contacts stores the same role 15 different ways. Distinct Count climbs while the actual business concept set remains small.

Fix: Run Advanced Uniqueness Analysis on text fields you plan to standardize. Use Distinct Count and Entropy to scope the normalization effort. Convert high-value free-text fields to picklists.
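To scope a normalization effort, it can help to group raw values that collapse to the same key once case and punctuation are ignored. The following Python sketch shows one such grouping. The function variant_groups, its ASCII-only normalization rule, and the sample job titles are assumptions chosen for this example.

```python
import re
from collections import defaultdict


def variant_groups(values):
    """Group raw values that share a key after folding case and punctuation."""
    groups = defaultdict(set)
    for value in values:
        # Lowercase, then replace every run of non-alphanumerics with one space
        key = re.sub(r"[^a-z0-9]+", " ", value.casefold()).strip()
        groups[key].add(value)
    # Keep only keys that were written in more than one way
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


titles = ["VP Sales", "vp sales", "VP, Sales", "VP-Sales", "Account Executive"]

for key, variants in variant_groups(titles).items():
    print(key, variants)
```

Each group is a candidate for a single picklist value. Abbreviations and synonyms such as "Vice President of Sales" would still need a manual mapping, since this rule only handles casing and punctuation.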

Boilerplate-Polluted Text Fields

Support agents paste standard openings and closings into every case. Sales reps copy opportunity description templates. The fields are technically “unique” (different case numbers, dates), but the content is 90% identical.

Fix: Run Advanced Uniqueness Analysis with boilerplate detection on text fields. Boilerplate Rate reveals the degree of template pollution. Address this before using these fields for AI training or analysis.

Shared Identifiers That Look Like Duplicates

Department emails ([email protected]), shared phone numbers, and company-wide fax numbers create legitimate duplicate values. Not every low Uniqueness Rate signals a problem.

Fix: Evaluate uniqueness in context. An Email field with 78% uniqueness needs investigation. A Fax field with 40% uniqueness is expected. Set your monitoring thresholds based on what the field represents.
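Per-field thresholds can be expressed as a simple lookup. In the Python sketch below, the threshold values and the function name fields_to_investigate are purely illustrative assumptions. The document does not prescribe specific numbers, so choose them based on what each field represents.

```python
# Hypothetical minimum acceptable Uniqueness Rates per field (in %)
THRESHOLDS = {"Email": 95.0, "Phone": 85.0, "Fax": 30.0}


def fields_to_investigate(scan_results, thresholds):
    """Return the fields whose Uniqueness Rate falls below their own threshold."""
    return sorted(
        field
        for field, rate in scan_results.items()
        if field in thresholds and rate < thresholds[field]
    )


scan_results = {"Email": 78.0, "Fax": 40.0}
print(fields_to_investigate(scan_results, THRESHOLDS))  # ['Email']
```

The Fax field passes despite its low rate because its threshold reflects the expectation that fax numbers are shared.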

Best Practices

Choose the Right Headline by Field Type

Uniqueness Rate is the right headline for identifier fields (Email, Phone, Account Name). For text content fields (Description, Notes, Comments), combine Uniqueness Rate with Boilerplate Rate to get the full picture. A field can score 99% Uniqueness Rate and still be 65% boilerplate.

Use Distribution Metrics for Segmentation Fields

For fields you use in segmentation, filtering, or reporting (Industry, Country, Lead Source), check Entropy and Max Frequency. Low entropy reveals that your “24-category” picklist is really a 2-bucket system. Max Frequency reveals default values that distort your segments.

Track Trends Across Scans

A single scan shows current state. Run scans regularly to detect new duplicate sources, measure the impact of deduplication initiatives, and identify integrations that introduce repetitive data. A field that drops from 90% to 75% uniqueness between scans has a new problem source.
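Trend tracking can be as simple as comparing consecutive scan results and flagging large drops. The Python sketch below assumes scan history is available as a list of (label, rate) pairs. The function name uniqueness_drops, the max_drop tolerance of 5 points, and the sample history are invented for illustration.

```python
def uniqueness_drops(history, max_drop=5.0):
    """Flag scans whose Uniqueness Rate fell by more than max_drop points."""
    alerts = []
    for (_, previous), (label, current) in zip(history, history[1:]):
        if previous - current > max_drop:
            alerts.append((label, previous, current))
    return alerts


# Hypothetical scan history for one field, oldest first
history = [("scan 1", 90.0), ("scan 2", 89.0), ("scan 3", 75.0)]

print(uniqueness_drops(history))  # [('scan 3', 89.0, 75.0)]
```

A flagged scan marks the window in which a new duplicate source appeared, which narrows the search to imports and integrations that ran between the two scans.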

Prioritize by Business Impact

Not every field needs high uniqueness. An Email field with duplicates signals a merge problem. A Country field with duplicates is normal. Focus uniqueness monitoring on fields that serve as identifiers, drive deduplication rules, or feed AI models.

Address Root Causes

Low uniqueness signals a process issue. Investigate whether users are creating records without checking for existing ones, imports lack deduplication logic, or integrations write default values. Fix the source, not just the symptom.

Next Steps

You now understand how to measure and diagnose uniqueness issues. Continue learning about the next dimension:

In Salesforce: Data Quality in Salesforce - Deduplicate Accounts, Contacts, and Leads
Next: Timeliness - Measure data freshness and currency
Previous: Validity - Ensure data follows expected formats
Related: The Five Dimensions - Overview of all dimensions
Action: AI Readiness Assessment - See your current uniqueness scores