Dedupe Rules
The Dedupe Rules section allows you to configure how records are compared and matched to detect duplicates. These rules help improve data quality by identifying records that represent the same entity but may have slight variations or errors.
Rule Fields
- Name – Name of the dedupe rule.
- Project – Select the project where this rule will be applied.
- Datasource – Choose the data source for the rule.
- Table Name – Select the table within the data source.
Once a table is chosen, the rule configuration options are displayed, where you can define how columns are compared using various algorithms, as well as additional rules such as skipping certain rows or ignoring specific words during comparison.
Column Matching Rules
| Column | Algorithm | (+) Percentage | (-) Percentage |
|--------|-----------|----------------|----------------|
| Name | Exact Match | 20% | 10% |
| DOB | Jaro Winkler | 30% | 15% |
| ... | ... | ... | ... |
- Column – Select the column to be compared.
- Algorithm – Choose the comparison algorithm.
- (+) Percentage – Percentage increase in similarity score if values match.
- (-) Percentage – Percentage decrease in similarity score if values don’t match.
You can configure multiple columns with different algorithms and weights, based on how important each column is for identifying duplicates; the sketch below illustrates how the per-column scores combine.
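To make the weighting concrete, here is a minimal Python sketch of how per-column (+) and (-) percentages might combine into an overall similarity score. The field names, rule structure, and additive scoring model are illustrative assumptions, not the product's actual implementation.

```python
# Illustrative sketch only: the product's actual scoring model may differ.
# Each rule adds its (+) percentage when the column values match and
# subtracts its (-) percentage when they do not.

def column_match(a: str, b: str) -> bool:
    """Placeholder comparison; in practice this would be the
    configured algorithm (Exact Match, Jaro Winkler, ...)."""
    return a.strip().lower() == b.strip().lower()

def record_similarity(rec_a: dict, rec_b: dict, rules: list[dict]) -> float:
    score = 0.0
    for rule in rules:
        col = rule["column"]
        if column_match(rec_a.get(col, ""), rec_b.get(col, "")):
            score += rule["plus"]
        else:
            score -= rule["minus"]
    return max(score, 0.0)

rules = [
    {"column": "Name", "plus": 20, "minus": 10},  # Exact Match
    {"column": "DOB",  "plus": 30, "minus": 15},  # Jaro Winkler
]

a = {"Name": "Acme Corp", "DOB": "1990-01-01"}
b = {"Name": "acme corp", "DOB": "1990-01-01"}
print(record_similarity(a, b, rules))  # 50.0: both columns match
```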
Skip Certain Rows
You can define rules to directly classify certain rows as unique based on column values and conditions without comparing them to other records.
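As a rough illustration, such a rule acts as a predicate that short-circuits comparison for matching rows; the column name and condition below are hypothetical.

```python
# Illustrative only: a skip rule marks matching rows as unique so they
# bypass pairwise comparison. "RecordType" and its value are hypothetical.

def is_unique_by_rule(record: dict) -> bool:
    return record.get("RecordType") == "TEST"

records = [
    {"Name": "Jane Doe", "RecordType": "TEST"},
    {"Name": "Jane Doe", "RecordType": "LIVE"},
]

# Rows matching the rule are set aside as unique; only the rest are compared.
unique_by_rule = [r for r in records if is_unique_by_rule(r)]
to_compare = [r for r in records if not is_unique_by_rule(r)]
print(len(unique_by_rule), len(to_compare))  # 1 1
```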
Ignore Rules
| Column | Condition | Value |
|--------|-----------|-------|
| Name | Contains | Pvt |
| Address | Equals | Unknown |
- Column – Choose the column where the rule applies.
- Condition – Select the condition for ignoring the value.
- Value – Provide the value to be ignored during comparison.
This section helps avoid unnecessary comparisons by excluding values that meet predefined conditions from matching.
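Here is a minimal sketch of how the Contains and Equals conditions from the table above might be evaluated; the function and its exact behavior are assumptions for illustration, not the product's API.

```python
# Illustrative only: evaluates the Contains / Equals conditions from the
# table above against a column value.

def value_ignored(value: str, condition: str, target: str) -> bool:
    if condition == "Contains":
        return target.lower() in value.lower()
    if condition == "Equals":
        return value.strip().lower() == target.lower()
    return False

print(value_ignored("Acme Pvt Ltd", "Contains", "Pvt"))  # True
print(value_ignored("Unknown", "Equals", "Unknown"))     # True
print(value_ignored("42 Main St", "Equals", "Unknown"))  # False
```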
Skip Words
This section allows you to define a list of words to be ignored during comparison. These words are removed from the values before applying the matching algorithms.
Examples of words that can be skipped:
- “Private Limited”
- “Mr”
- “Inc.”
- “The”
By ignoring such words, comparisons focus on the significant parts of the data, improving the accuracy of matching.
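A small Python sketch of the idea, assuming skip words are removed as whole words before matching; the normalization details here are illustrative, not the product's implementation.

```python
import re

# Sketch: remove configured skip words before comparison, so values like
# "The Acme Inc." and "Acme" normalize to the same core string.

SKIP_WORDS = ["Private Limited", "Mr", "Inc.", "The"]

def strip_skip_words(value: str) -> str:
    for word in SKIP_WORDS:
        # Match the word only where it stands alone, case-insensitively.
        pattern = rf"\b{re.escape(word)}(?!\w)"
        value = re.sub(pattern, "", value, flags=re.IGNORECASE)
    return " ".join(value.split())  # collapse leftover whitespace

print(strip_skip_words("The Acme Inc."))  # "Acme"
```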
Supported Algorithms
- Person Name – Compares names while accounting for common variations, abbreviations, and misspellings to determine similarity.
- Exact Match – Checks whether two values are exactly the same, including casing and spacing.
- Numeric Match – Compares numeric values, allowing for slight formatting differences.
- Jaro Winkler – Measures similarity by comparing characters and transpositions, giving more weight to matching prefixes.
- Jaro Winkler Tokenized – Tokenizes strings into smaller parts before comparison, improving matching for complex fields.
- Geo Position Match – Compares geographical coordinates, allowing for slight variations or errors in location data.
- Dice Coefficient – Measures similarity based on shared bigrams between two strings, useful for partial matches.
- Jaccard Index – Compares sets of tokens and measures how many items are shared versus unique between two records.
- Levenshtein – Calculates the minimum number of single-character edits needed to transform one string into another.
- Weighted Levenshtein – Similar to Levenshtein but applies different weights to edits for more refined matching.
- Longest Common Substring – Finds the longest sequence of characters appearing in both strings, identifying shared patterns.
- Soundex – Converts words into phonetic codes to match similar-sounding names despite spelling differences.
- Different – Checks whether two values are different, useful for detecting discrepancies.
- Metaphone – Provides a phonetic representation that improves matching for words with similar sounds.
- Norphone – An alternative phonetic algorithm, designed for Norwegian name similarity.
- QGram – Compares overlapping substrings of fixed length, identifying similarities based on character sequences.
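For intuition, the sketch below implements two of the simpler algorithms from scratch: Levenshtein distance and the Dice coefficient. It illustrates the kind of comparison each performs and is not the product's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def dice_coefficient(a: str, b: str) -> float:
    """Similarity in [0, 1] based on shared character bigrams."""
    def bigrams(s: str) -> set[str]:
        return {s[i:i + 2] for i in range(len(s) - 1)}
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(levenshtein("kitten", "sitting"))    # 3
print(dice_coefficient("night", "nacht"))  # 0.25: one shared bigram, "ht"
```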
This comprehensive setup ensures that deduplication rules can be tailored to the specific characteristics of your data, providing flexibility, accuracy, and control in identifying and handling duplicates.