Duplicate detection is accomplished by finding all records that contain exactly or approximately the same data in one or more fields. For example, consider this file fragment of five records containing six fields in each record:
     Name   Address            City          St  ZIP         Phone
     -----  -----------------  ------------  --  ----------  --------------
  1  SMITH  2 E 13TH ST        CHICAGO       IL  60601-2407  (312) 458 9992
  2         2 13 ST EAST       CHICAGO       IL  60605       SMITH
  3  SMTH   2 EAST 13TH        CHICAGO LAWN  IL              312-458-9992
  4  SMITH  2 E THIRTEENTH ST  CHICAGO       IL  60605       458-9992
  5  SMITH  TWO EAST 13TH ST   CHICAGO       IL  60602       312-458-9991

Although all five of the above records are intended to refer to the same person at the same address, no two records are exactly alike. If the above records were part of a larger file, consider the following attempts at finding duplicates in the file:
Attempt 1: Select records with the same address field. Finds none of the above records.
Attempt 2: Select records with the same name and same five-digit ZIP. Misses records 1, 3, and 5.
Attempt 3: Select records with the name "SMITH". Misses records 2 and 3 (while probably matching lots of other SMITHs at other addresses).
Obviously, a good strategy for duplicate detection requires a good strategy for dealing with inconsistent data. An excellent first step toward removing inconsistencies is to standardize addresses with CASS Certified™ address correction software. Another important technique is to use data entry programs that validate field formats, to prevent errors such as allowing names in the phone number field. For example, after address correction and field validation, the above examples become:
     Name   Address      City     St  ZIP         Phone
     -----  -----------  -------  --  ----------  ------------
  1  SMITH  2 E 13TH ST  CHICAGO  IL  60601-2407  312-458-9992
  2  SMITH  2 E 13TH ST  CHICAGO  IL  60601-2407
  3  SMTH   2 E 13TH ST  CHICAGO  IL  60601-2407  312-458-9992
  4  SMITH  2 E 13TH ST  CHICAGO  IL  60601-2407  XXX-458-9992
  5  SMITH  2 E 13TH ST  CHICAGO  IL  60601-2407  312-458-9991

After standardizing as shown above, any attempt at duplicate detection has a much better chance of finding the correct group of duplicates. In fact, selecting "records with the same address, ZIP, and soundex name" is an attempt that works perfectly on the above example. However, that attempt will still fail to find duplicates like the following (one person at one phone number, with two different versions of his name, and a shipping address different from his mailing address):
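The soundex filter mentioned above can be sketched in a few lines. The following is a minimal Python illustration of the classic soundex coding (a textbook variant, not any particular vendor's implementation), showing why SMITH and SMTH compare equal under this filter:

```python
def soundex(name: str) -> str:
    """Classic soundex: keep the first letter, then encode the remaining
    consonants as digits, skipping vowels and collapsing repeated codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    last = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "HW":          # H and W do not break a run of equal codes
            last = code
    return (result + "000")[:4]     # pad or truncate to letter + 3 digits

# Records 1 and 3 in the standardized example get the same code, so a
# "same address, ZIP, and soundex name" test groups all five records.
print(soundex("SMITH"), soundex("SMTH"))   # both S530
```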
Name          Address      City   St  ZIP    Phone
------------  -----------  -----  --  -----  ------------
BOB JAMES     100 MAIN ST  APTOS  CA  95003  408-662-2717
ROBERT JAMES  PO BOX 200   APTOS  CA  95001  408-662-2717

The basic struggle in developing a good duplicate detection procedure is finding the right combination of comparisons to test for. If the duplicate test is too "tight" (for example, selecting only records with the exact same unedited contents in a large number of fields), then too many actual duplicates will be overlooked. For example, here's a file of names, Social Security numbers, and phone numbers:
Name     SSN          Phone
-------  -----------  ------------
Johnson  555-99-2223  408-662-2717
Johnsen  555-99-2223  408-662-2717
Johnson  555-99-2223  408-662-2717

For the above data, the test "records with the same name, Social Security number, and phone number" is too tight, and misses the "Johnsen" record. Testing for only the same Social Security number and phone number would have been better.
But if the duplicate test is too "loose" (for example, testing only one field, or multiple fields without careful regard to field content), then too many actual non-duplicates will incorrectly be considered as duplicates. Obviously, "records with the same last name" (and no other matching criteria) is far too loose for most mailing lists. Another example is "records with the same phone number", which will match two different people who happen to use the same phone number, a common occurrence at many business addresses.
A good approach toward determining the proper "tightness" for a duplicate test is to start with loose tests, then apply more restrictive selections if too many "false" duplicates are found. For example, if selecting "records with the same phone number" in a file of 100,000 records reports 30,000 duplicates (which might happen when the list contains lots of business addresses with more than one employee at each business), you might tighten up the test by changing the selection to "records with the same personal name and phone number". Continue adding restrictions until the number of duplicates drops to a small enough number of records that you are willing to inspect and review.
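This loosen-then-tighten approach is easy to prototype. The Python sketch below (with made-up records and illustrative field names) groups records on a chosen set of key fields, and shows how adding the name field to a phone-only test removes a false duplicate:

```python
from collections import defaultdict

def duplicate_groups(records, key_fields):
    """Group records on the chosen fields; keep only groups of 2 or more."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

records = [
    {"name": "ANN LEE", "phone": "312-555-0101"},
    {"name": "ANN LEE", "phone": "312-555-0101"},
    {"name": "BOB KIM", "phone": "312-555-0101"},  # coworker sharing the phone
]

# Loose test: phone only -- lumps the coworker in with the true pair.
loose = duplicate_groups(records, ["phone"])
# Tighter test: name AND phone -- only the true duplicates remain.
tight = duplicate_groups(records, ["name", "phone"])
print(sum(len(g) for g in loose.values()),
      sum(len(g) for g in tight.values()))   # 3 records flagged vs 2
```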
Most mailing lists contain at least six fields: personal name, company name, address, city, state, and ZIP Code™. (If the city-state-ZIP are in a single field, break them out as separate fields using address correction software.) If addresses have been standardized, the state field is redundant, leaving five fields that can be used in 31 different combinations to test for duplicate records. Of course, if your file contains additional fields, many more combinations are possible. The most frequently used duplicate test is probably "records with the same name and address", but many other tests may be appropriate, depending on the list being processed. See the Appendix for a description of each of the possible field combinations in a typical list, and what that combination finds when it is used to search for duplicates.
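The count of 31 is simply the number of non-empty subsets of five fields. Assuming hypothetical field names, the candidate tests can be enumerated like this:

```python
from itertools import combinations

# Illustrative field names for a typical standardized mailing list.
fields = ["name", "company", "address", "city", "zip"]

# Every non-empty subset of the five fields is a possible duplicate test.
tests = [combo for n in range(1, len(fields) + 1)
               for combo in combinations(fields, n)]
print(len(tests))   # 31
```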
In general, comparing more fields makes a test tighter, and comparing fewer fields loosens a test. However, a test also can vary its tightness, without changing the number of fields being tested, by "filtering" the contents of the fields it is examining. For example, instead of using all characters in a field, the test might keep only the non-blank characters, or only the numeric characters. Consider this example:
Name     Phone
-------  --------------
Johnson  408-662-2717
Johnson  (408) 662-2717
Johnson  408 662 2717

With the above data, the test "records with the same phone" produces no matches if all characters in the phone field are considered significant. But the test "records with the same digits in the phone number" successfully matches all three records. Other common filters are "take the leftmost n characters" and "the equivalent soundex code for a name". In general, applying a filter loosens a test, because more records will match on the filtered field than on the unfiltered field.
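A digits-only filter like the one just described is a one-liner. This Python sketch, using the example phone data above, shows that the unfiltered fields all differ while the filtered fields all match:

```python
def digits_only(phone: str) -> str:
    """Filter: keep only the numeric characters of a phone field."""
    return "".join(ch for ch in phone if ch.isdigit())

phones = ["408-662-2717", "(408) 662-2717", "408 662 2717"]

# Unfiltered, the three fields are all distinct; filtered, they collapse
# to a single key, so all three records match on phone.
print(len(set(phones)), len({digits_only(p) for p in phones}))   # 3 1
```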
Before trying to detect duplicates, it's important that each field in the file is consistently used for only one type of data: names should be in the name field and in no other field, addresses in the address field, ZIP codes in the ZIP field, and so on. Using one field for more than one purpose (sometimes using an address field for company names, sometimes for addresses, sometimes for mailstops, and so on), is bad database maintenance and makes it impossible to do thorough duplicate detection.
Also note that individual fields sometimes make better duplicate detection candidates if the field is split into two or more fields. A common candidate for field splitting is personal names. For example, it's very difficult to automatically determine the duplicates in the following file:
Name
---------------
BOB JAMES
ROBERT JAMES
MR BOB JAMES
ROBERT JAMES JR
DR R JAMES PHD
However, when the above data is split into multiple fields, duplicate detection becomes easier:
Prefix  First   Last   Suffix
------  ------  -----  ------
        BOB     JAMES
        ROBERT  JAMES
MR      BOB     JAMES
        ROBERT  JAMES  JR
DR      R       JAMES  PHD

Be sure to decide carefully exactly when you consider two records to be duplicates. Do you have a duplicate when you find two different employees at the same business address? No, if you're mailing to individuals; yes, if you're mailing to companies. Do you have a duplicate when you find a male and female with the same last name at the same residential address? No, if you're mailing to individuals; yes, if you're mailing to families or households. What about two different last names at the same residential address? Depending on your application, what counts as a duplicate for one mailing might not be a duplicate for the next.
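Name splitting in production software is considerably more involved than this, but the idea can be illustrated with a toy Python splitter that peels off a small, assumed list of prefixes and suffixes (real lists are much larger, and real names are messier):

```python
# Assumed, abbreviated lists -- purely illustrative.
PREFIXES = {"MR", "MRS", "MS", "DR"}
SUFFIXES = {"JR", "SR", "II", "III", "PHD", "MD"}

def split_name(name: str) -> dict:
    """Toy splitter: strip a known prefix and suffix, then treat the last
    remaining word as the last name and the rest as the first name."""
    words = name.upper().split()
    prefix = words.pop(0) if words and words[0] in PREFIXES else ""
    suffix = words.pop() if words and words[-1] in SUFFIXES else ""
    last = words.pop() if words else ""
    first = " ".join(words)
    return {"prefix": prefix, "first": first, "last": last, "suffix": suffix}

for n in ["BOB JAMES", "MR BOB JAMES", "ROBERT JAMES JR", "DR R JAMES PHD"]:
    print(split_name(n))
```

With the names split this way, a duplicate test can compare just the first and last name fields, ignoring prefixes and suffixes entirely.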
Personal name and company name aren't the only fields to consider when determining which records qualify as duplicates. A good example is finding the same name at a street address and at a post office box address. Do you keep both or choose only one? Street addresses usually provide better demographic information, are easier to correct, and are less prone to undeliverability caused by typos; on the other hand, they are sometimes used only for shipping (not mailing), and occupants usually change street addresses more often than box addresses. (Note that because street ZIP codes and post office box ZIP codes are often different within the same city, searching for name-address-ZIP duplicates will often fail to find one person who has both a street address and a box address. Use a name-address-city duplicate test instead.)
After a pair of duplicate records has been found, there still has to be a decision about which data to prefer if the two records don't match exactly. For example, consider these two "duplicates":
Version #1         Version #2
-----------------  -------------------
JANE DONNELY       JANE S DONELLEY
XYZ CO INC M/S 3A  XYZ CORP LEGAL DEPT
100 MAIN ST        100 MAIN ST
DENVER CO 80211    DENVER CO 80211
303-333-3000 X12   303-333-3112

Three of the five fields in the above two records don't match. How do you reconcile the two records? Use the newer version? The older, original version? Somehow merge the two? Or arbitrarily delete one?
Note that regardless of the method you decide on to choose between or resolve ambiguous duplicates, it's often best NOT to simply delete the unwanted records. Instead, mark them as duplicates, but keep them on file. That way, if the duplicate records are ever again reacquired, they won't have to be re-examined to determine if they're really duplicates. For example, imagine a file containing:
ID #  Name        Company   City
----  ----------  --------  --------
 236  EVAN SMITH  IBM CORP  LAKESIDE

A new prospect list is acquired, the two files are merged, and duplicate detection reports that the file now contains this possible match:
ID #  Name        Company                 City
----  ----------  ----------------------  --------
 236  EVAN SMITH  IBM CORP                LAKESIDE
 543  EVAN SMITH  INTL BUSINESS MACHINES  LAKESIDE

If, after review, one of the two records is simply deleted, the review process will have to be repeated if the deleted form ever shows up again. But if both records are retained and one is marked as a duplicate, the file might instead become something like:
ID #  Name        Company                 City      Useable?
----  ----------  ----------------------  --------  -----------
 236  EVAN SMITH  IBM CORP                LAKESIDE  YES
 543  EVAN SMITH  INTL BUSINESS MACHINES  LAKESIDE  NO, USE 236

In the above file, the undesirable record 543 has been marked as a duplicate of record 236. Now, if the prospect "EVAN SMITH, INTL BUSINESS MACHINES, LAKESIDE" is reacquired from some subsequent source, it will exactly and completely match record 543 and can therefore be deleted automatically, without any review. If record 543 had been deleted after the first acquisition and review, the review process would have to be done over again whenever the unwanted version is reintroduced to the file.
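The mark-instead-of-delete bookkeeping might look like the following Python sketch (field names such as "useable" and "use_instead" are illustrative, not part of any actual file layout):

```python
def mark_duplicate(records, dup_id, keep_id):
    """Flag record dup_id as a duplicate of keep_id instead of deleting it."""
    records[dup_id]["useable"] = False
    records[dup_id]["use_instead"] = keep_id

def is_known_duplicate(records, name, company, city):
    """True if an incoming record exactly matches one already marked."""
    return any(not rec["useable"]
               and (rec["name"], rec["company"], rec["city"]) == (name, company, city)
               for rec in records.values())

records = {
    236: {"name": "EVAN SMITH", "company": "IBM CORP",
          "city": "LAKESIDE", "useable": True},
    543: {"name": "EVAN SMITH", "company": "INTL BUSINESS MACHINES",
          "city": "LAKESIDE", "useable": True},
}
mark_duplicate(records, 543, 236)

# Reacquiring the unwanted form is now detected without manual review.
print(is_known_duplicate(records, "EVAN SMITH", "INTL BUSINESS MACHINES",
                         "LAKESIDE"))   # True
```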
The technique of marking duplicates instead of deleting them also pays off when doing change-of-address processing. For example, let's say you acquire a prospect list and mail to the following record:
ID #  Name      Address    City
----  --------  ---------  -----------
 563  JANE DOE  100 MAPLE  LOS ANGELES
ID #  Name      Address    City
----  --------  ---------  -----------
 563  JANE DOE  BOX 5000   NEW YORK
ID #  Name      Address    City
----  --------  ---------  -----------
 563  JANE DOE  BOX 5000   NEW YORK
 847  JANE DOE  100 MAPLE  LOS ANGELES

Instead of wasting money remailing to undeliverable addresses you've already tried, a better scheme is "mark, don't delete". When a forwarding address is received, mark the obsolete address so the file becomes something like this:
ID #  Name      Address    City         Status  Ref
----  --------  ---------  -----------  ------  ---
 563  JANE DOE  100 MAPLE  LOS ANGELES  MOVED   688
 688  JANE DOE  BOX 5000   NEW YORK     OK

With the above scheme, any attempt to reacquire an obsolete prospect can immediately be detected and ignored. Even if an undeliverable address doesn't have a new forwarding address, simply mark the bad address "obsolete" and retain it to serve as a sentinel that can trigger duplicate detection if the obsolete data is ever reacquired.
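The sentinel check can be sketched as a small acquisition routine: before adding an incoming prospect, compare it against every record on file, including the ones retained only as obsolete markers (Python, with illustrative field names):

```python
# The file after change-of-address processing: the old address is kept,
# marked MOVED, and points at the replacement record.
records = {
    563: {"name": "JANE DOE", "address": "100 MAPLE", "city": "LOS ANGELES",
          "status": "MOVED", "ref": 688},
    688: {"name": "JANE DOE", "address": "BOX 5000", "city": "NEW YORK",
          "status": "OK", "ref": None},
}

def acquire(records, name, address, city):
    """Add a prospect unless it matches a record already on file --
    including one retained only as an obsolete-address sentinel."""
    for rec in records.values():
        if (rec["name"], rec["address"], rec["city"]) == (name, address, city):
            return None   # already known (possibly obsolete); ignore it
    new_id = max(records) + 1
    records[new_id] = {"name": name, "address": address, "city": city,
                       "status": "OK", "ref": None}
    return new_id

# Reacquiring the old Los Angeles address is caught by the sentinel:
print(acquire(records, "JANE DOE", "100 MAPLE", "LOS ANGELES"))   # None
```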
To summarize the strategy and tactics described above: standardize addresses with CASS Certified address correction software, validate data entry, use fields consistently, start with loose tests then tighten, test for duplicates with appropriate field combinations, filter fields as necessary, split fields for consistency, mark instead of delete.