Looking Advice on How to Implement a Same-SameOrNull-Different Test

2 posts / 0 new
Last post
SolveItWithAQuery
Offline
Joined: 01/09/2019 - 10:20
Looking Advice on How to Implement a Same-SameOrNull-Different Test

I'm currently trying to identify duplicates in our vendor file. I'm tackling the issue of using the address, because I'm able to find a lot of potential duplicates which we would not otherwise identify (e.g. John Doe and ABC Corp at 1234 Main St).
The manual algotrithm which I'm doing currently works on three criteria:

  1. Formatted Address Number matches. (Formatted address is [ZIP]:[Street, only letters], e.g. 90210:1310)
  2. Vendor ID does NOT match. (This is to exclude vendors which for some reason have the same address used multiple times.)
  3. The TIN matches, or at least one of the TINs used for the comparison is null. Effectively, null should be a wildcard that matches with any TIN.

For example, in the following example:

  • #001 | 90210:1310 | 555-55-5555
  • #002 | 90210:1310 | [null]
  • #003 | 90210:1310 | 000-00-0000

I would expect a match group with 001 and 002, and a match group with 002 and 003. I could potentially be okay with having 001, 002, 003 in a single match group in this case for ease of review. If record 002 did not exist, I do not want 001 and 003 in the same group since they clearly have different TINs.
I have previously used (and extended) some of Brian Element's Same-Same-Differnt code for other projects. https://www.ideascripting.com/Same-Same-Different-Test
This almost works for this issue (and would work if I didn't want the nulls to match with any entry). Any thoughts on how to do this?

SolveItWithAQuery
Offline
Joined: 01/09/2019 - 10:20

Okay, I think I have an algorithm that will work for this, still wrapping my head around the code needed to do this. Talking my way through this here so it's shared, and also so that someone can correct me if I'm approaching this incorrectly (if either my alogrithm is bad or if there's a more efficient way to do this).
Note: The key fields for the source data are VendorID and AddressNumber (not given in above examples)

  1. Run SSDT on Source data: Same Address, Same TIN, Diff VendorID. Summarize the keys in Summary1.
  2. Run Summary on TIN = null, SummaryNullTIN (just need list of VendorID).
  3. Join Source with SummaryNullTIN, then run SSDT on this new table on Same Address, Diff VendorID. Summarize Keys in Summary2.
  4. Append Summary2 to Summary1.
  5. Join Source with Summary1, delete superfluous keys from Summary1.
  6. Primary Sort on Address (which is the match group), Secondary Sort on TIN.