Identifying Duplicates using @SoundsLike

8 posts / 0 new
Last post
siredge's picture
siredge
Offline
Joined: 08/30/2013 - 16:45
Identifying Duplicates using @SoundsLike

I can use @SoundsLike to find matches to static strings I enter into the equation or two fields on the same record.  But I don't know how to use @SoundsLike to match any one record with any similar record in the same file because I don't know how to reference the same field from two records as parameters to the function.  Any suggestions?  Is what I'm trying to do possible?

Brian Element's picture
Brian Element
Offline
Joined: 07/11/2012 - 19:57

Hi siredge,

Unfortunately what you want to do can't be done easily.  There is an alternative.  You can use the @SoundEx function to get a value and then use duplicate key detection or summary to look for like items, I also created a  custom function that does the same thing, more to figure out the coding behind it then for any real reason.  Let me know if this helps you out.  

Brian Element's picture
Brian Element
Offline
Joined: 07/11/2012 - 19:57

Another option to do this.  You need one large file that has a record for each different pairing of fields that you want to compare.  One suggestion is you create two files that have the field and the record number from the original source file, that way if there is a match you can trace it back to the source file.  You take these two files and use my cross join script and create one file where you can have every type of possible match.  Just a warning if you original file has 1000 transaction the final file will have 1,000,000 (1,000 x 1,000).  You would now have both fields in the same record and you can run the @SoundsLike function on it.

Lois's picture
Lois
Offline
Joined: 02/14/2013 - 09:00

Hello Brian,
Looking through the attached, I am trying to extract all entries that have proper names and group the ones with the same name. The objective is to track all payments made to a specific individual. Unfortunately there is no PAYEE field and the names are entered according to the user's descretion. The colored ones show same names but spelt differently. Can the SOUNDSLIKE function still work here? If yes, how?
This is the formula  I used to get out something. But as you can see from the field created, it is not 100% accurate.
@alltrim(@SimpleSplit(TRS_DESC,"",1,"-",0))
Lois

Steven Luciani
Offline
Joined: 07/31/2012 - 13:20

Hi Lois,

You can use IDEA's new Fuzzy Duplicates feature to execute your task. It is found on the analysis tab in the Explore group.  Choose the duplicate key drop down and pick Fuzzy from the pick list.

 

The fuzzy Duplicate window will appear. For your scenario you want to allow records in multiple fuzzy groups and include exact duplicates. The similarity degree may need to be slid lower if there a large variations in the way payee names are spelled. 80% is the default and worke fine for your sample file. you need to choose a match key, in this case your IDEA_PAYEE field and give the file a name.

After clicking OK you will get an output that groups your payees and picks up all exact and 80% or better similar payee names

You then summarize this output file by group name which will give you the number of transactions per group. If the control total of the No_OF_RECS field adds to the total of your main file number of records then you have captured each transaction in a different group. If not, you may have to run the analysis again at a lower similarity degree percentage.

Regards,

Steve

 

 

Lois's picture
Lois
Offline
Joined: 02/14/2013 - 09:00

Hi Steve,
A zillion thanks! I am implementing your suggestion immediately. Will let you know how it goes.
Lois

Lois's picture
Lois
Offline
Joined: 02/14/2013 - 09:00

Hi Steve, 
It worked well but then the following happened:
1. I reduced the similarity to 60% because I noticed some records were treated separately in the main file. However, it then went ahead to bring together names that were similar sounding but definitely different! I guess that means, I cannot eat my cake and have it.
2. But did you notice, Steve, that even my IDEA_PAYEE field is not working well? Maybe if I attack it from that angle, my fuzzy would give a more accurate result. Can you please help with suggestions? I do appreciate!
Lois 

osaajah
Offline
Joined: 05/25/2018 - 02:33

reducing similarity degree / percentage will produce more records. because less similar words will be considered as duplicate, even if it just have a few characters in common. so, it is your judgement to choose percentage of similarity degree to use.also, fuzzy duplicate only checks words in same sequence order. fuzzy duplicate will fail if the words have reverse order as shown in picture below.fuzzy duplicate fails to recognized "ALI BABA" and "BABA ALI" which has reverse order.
Firdaus Sentosa.

Images: