Reverse Duplicate
Hi,
I have two ID fields, say columns A and E. The IDs are sort of duplicated across rows, like this:
In the 1st row, column A has 12345 and column E has 34344.
In the 2nd row, column A has 34344 and column E has 12345.
Both rows hold the same information; only the two IDs are swapped between the first and second row. I need to retain the first row and remove the second row in the output. Can anyone help me achieve this in IDEA?
I am thinking this is the way to go, but I understand the possible problem: the numbers could collide. So why not use an algorithm to transform the numbers into something more unique? My suggestion would be to take the first ID, multiply each digit of that ID by another number, and concatenate the results together. In this example I used the ASCII code of the digit as the multiplier. So if you have the following:
ID = 21771, ID1 = 21318
would be
(2 * 50) & (1 * 49) & (7 * 55) & (7 * 55) & (1 * 49) =
100 & 49 & 385 & 385 & 49 =
So your new ID would be 1004938538549. You do the same for ID1 and then add the two results together, so the combined ID would really be unique. Do you think that would work? If not, I think it comes down to creating a script that takes the first row, looks through the rest of the file for a match, saves the matching record number to another file, then does the second row, and so on. You then use that file in a join to remove all the duplicates.
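To make the transform concrete, here is a minimal Python sketch of it (illustrative only; the function names are my own, and in IDEA this would be built as an equation or script rather than Python):

```python
def ascii_weighted(id_value: int) -> int:
    # Multiply each digit by its own ASCII code and concatenate the products.
    # For 21771: (2*50)(1*49)(7*55)(7*55)(1*49) -> "1004938538549"
    parts = [str(int(ch) * ord(ch)) for ch in str(id_value)]
    return int("".join(parts))

def pair_key(id_a: int, id_e: int) -> int:
    # Adding the two transformed IDs makes the key order-independent,
    # so a row with the two IDs swapped produces the same key.
    return ascii_weighted(id_a) + ascii_weighted(id_e)

print(ascii_weighted(21771))                              # 1004938538549
print(pair_key(12345, 34344) == pair_key(34344, 12345))   # True
```

Note that the addition makes the key symmetric but not strictly collision-proof; in principle two different pairs could still produce the same sum.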
Hi Sri,
OK, you may want to try this script out. It displays a dialog in which you select your file and the two ID columns. The script then adds a record number field to the main file and goes through the file, extracting the record numbers of duplicates to a new file. Once that is complete, it joins the main file with the duplicate record number file and removes all those record numbers.
Some limitations: I did this script fairly quickly, so there is not much in the way of error checking. Also, right now it assumes that the ID columns are numeric; if they are not, that can easily be fixed, but as it stands it will give an error. And depending on how large your file is and the power of your computer, it might take some time to process.
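I don't have the IDEAScript source in this post, but the logic it follows is roughly this (a plain-Python sketch with made-up field names, not the actual script; it uses a lookup table instead of the script's row-by-row rescans):

```python
# Scan each row; whenever a later row has the same two IDs (in either order)
# as an earlier row, record its record number for removal, then drop those rows.
rows = [
    {"ID": 12345, "ID1": 34344, "DATA": "invoice A"},
    {"ID": 34344, "ID1": 12345, "DATA": "invoice A"},  # reversed duplicate
    {"ID": 21771, "ID1": 21318, "DATA": "invoice B"},
]

seen = {}            # unordered ID pair -> record number of first occurrence
dupe_recnos = set()  # record numbers to remove (the "duplicate file")

for recno, row in enumerate(rows, start=1):
    key = frozenset((row["ID"], row["ID1"]))  # same key regardless of order
    if key in seen:
        dupe_recnos.add(recno)
    else:
        seen[key] = recno

# Equivalent of the join that keeps only records with no secondary match.
deduped = [row for recno, row in enumerate(rows, start=1)
           if recno not in dupe_recnos]
print(deduped)  # the second "invoice A" row is gone
```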
So let me know how it works for you.
Brian
Hi Sri,
How about trying something like this?
First, make sure you remove all the exact duplicates. Start by adding a virtual record number field using the @precno function. This way, when we do the analysis, we can trace each record back to the original file.
Then perform Duplicate Key Detection using the ID and ID1 fields as the key.
In your example it will show that records 15, 17, 18 and 20 are duplicates and should be removed from the file.
You can remove them by doing a join, with the main file as the primary and the duplicate file as the secondary. The match is on the record number, and you use the Records with no secondary match option.
So now we have a file with no duplicates.
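Outside of IDEA, those three steps (record number, Duplicate Key Detection, join with no secondary match) look roughly like this in Python; the field names mirror the ones above but are only illustrative:

```python
# Number the records, flag exact (ID, ID1) duplicates, then keep only
# records with no match in the duplicate list (an anti-join).
rows = [
    {"ID": 12345, "ID1": 34344},
    {"ID": 12345, "ID1": 34344},  # exact duplicate, should go
    {"ID": 34344, "ID1": 12345},  # reversed pair, handled in the next step
]

for rec_no, row in enumerate(rows, start=1):  # @precno equivalent
    row["REC_NO"] = rec_no

seen_keys = set()
dupe_rec_nos = set()               # output of Duplicate Key Detection
for row in rows:
    key = (row["ID"], row["ID1"])  # exact key: order matters here
    if key in seen_keys:
        dupe_rec_nos.add(row["REC_NO"])
    seen_keys.add(key)

# Join on REC_NO, keeping records with no secondary match.
no_exact_dupes = [r for r in rows if r["REC_NO"] not in dupe_rec_nos]
print(no_exact_dupes)  # records 1 and 3 remain
```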
Now comes the fun part. I decided to create a virtual field, ID_ADD, adding the ID and ID1 fields together.
Because addition is commutative, this gives a key that is the same whenever one row's ID + ID1 matches another row's, even if the two IDs are swapped; for example, 12345 + 34344 and 34344 + 12345 both give 46689.
I then did a summarization by ID_ADD; in the fields to include I selected REC_NO and used the Use fields from last occurrence option. This gives me the record number of the second transaction whenever there is a match.
So the results show the REC_NO values we must remove from the main file.
In this case they are all duplicates. If there are any non-duplicates, we would need one additional step: extract the records where NO_OF_RECS = 2. We then join this file back to the main file using REC_NO as the match field and Records with no secondary match as the match option, and hopefully you will then have a file with no duplicates.
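For reference, here is the same summarize-and-remove step sketched in Python (again with illustrative field names, not IDEA's actual output layout):

```python
# Group by ID + ID1 (the ID_ADD key), take REC_NO from the last occurrence
# in each group, and remove those records from the main file.
rows = [
    {"REC_NO": 1, "ID": 12345, "ID1": 34344},
    {"REC_NO": 2, "ID": 34344, "ID1": 12345},  # reversed duplicate of row 1
    {"REC_NO": 3, "ID": 21771, "ID1": 21318},
]

groups = {}  # ID_ADD -> {"last_rec_no": ..., "count": ...}
for row in rows:
    id_add = row["ID"] + row["ID1"]  # commutative, so swapped IDs collide
    info = groups.setdefault(id_add, {"last_rec_no": None, "count": 0})
    info["last_rec_no"] = row["REC_NO"]  # "use fields from last occurrence"
    info["count"] += 1                   # NO_OF_RECS

# Only groups with NO_OF_RECS = 2 are reversed duplicates; drop their
# last occurrence, mirroring the extraction and final join described above.
to_remove = {g["last_rec_no"] for g in groups.values() if g["count"] == 2}

final = [r for r in rows if r["REC_NO"] not in to_remove]
print(final)  # records 1 and 3 remain
```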
Let me know how this works out.
Brian