How to find (possible) duplicate entries in your NAB?

Hi,

We are using an additional address book whose entries originate from our AS/400. This NAB contains more than 100,000 person-documents. The problem is that it contains not only exact duplicates, but also ‘possible’ duplicates. Let me explain: if I see a person document for “John Williams”, and another one for “Johan Williams”, aren’t we talking about the same person here?

Has anyone of you found a way to pick out these possible duplicates?

Another example: if I have a person-document for “George W. Bush”, and another one for “George Walker Bush”, how can I determine that these two just might refer to the same person? Does anyone know a good algorithm for this?

Thanx,

Geert.

Subject: How to find (possible) duplicate entries in your NAB?

Would they not have to point to the same mail file?

I normally check for duplication in the ($Users) view. You can see this view by holding CTRL-SHIFT while double-clicking the database to open it.

However, from what you describe, it is less a case of document duplication and more that the administrator has created multiple users for one person.

Subject: RE: How to find (possible) duplicate entries in your NAB?

I’m not talking about NAMES.NSF, I’m talking about another address book, which I have added to our environment using Directory Assistance. This directory (NAMESEXT.NSF?) retrieves its data from an AS/400 using LEI. Now, in the AS/400 application everyone can enter contact data, and one person writes “John Cash”, and another writes “Johnny Cash”. Although they describe the same physical person, the one who creates “Johnny Cash” does not know that “John Cash” is already in the system. Obviously, when looking at those 2 entries, one could assume that they refer to one and the same person. But what if, as in our case, there are thousands of these documents, collected over the years? I know users should always check whether an entry already exists before they create a new one, but we all know how users are. And once an entry has been created, no one will ever go to the trouble of checking whether it already existed in the first place.

To summarize: can anyone come up with a script that can find names that look alike? No, putting the names in a sorted view doesn’t help: although “Spiderman” and “Zpiderman” are the same person, they will not end up next to each other when displayed in a view.

I have tried counting letters, converting the name to a phonetic spelling, and so on: either the algorithm was too slow to be used on thousands of documents, or the results were not very satisfactory.

If you are interested, I can share my letter-counting comparison algorithm (but it’s slow on lots of documents…)

Humans can spot the (almost!) duplicate names in the following list within 2 seconds; how can computers/LotusScript accomplish this? Check it out:

  • Brian adams

  • Bill moneyhat

  • Bob the Kid

  • Bruce rosinski

  • Bryan Addams

  • Bart Simpson

  • Bob Happywater

Did you find the solution?

Regards,

G.

Subject: RE: How to find (possible) duplicate entries in your NAB?

Ahh, OK, I understand now. It sounds a lot like something LanguageWare could do easily. I’m not sure if they have a names dictionary, though.

Even so, just because names are similar doesn’t mean it is the same person.

Using the same example, can you tell me which of these are the same person?

  • Brian adams

  • Brian B adams

  • Brian A adams

  • Bryan Addams

  • Bryan A Addams

  • Bryan Adddams

The truth is that they could in theory be all different people or multiple people. You need a common field like an ID or email address or maybe phone number. Maybe work from that.
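A minimal sketch of that idea, written in Python rather than LotusScript just to keep it short. The record layout (dictionaries with `name` and `email` keys) and the helper name `group_by_key` are assumptions for illustration, not fields from the actual directory:

```python
from collections import defaultdict


def group_by_key(records, key_field):
    """Group names that share the same value in key_field (e.g. an email address)."""
    groups = defaultdict(list)
    for record in records:
        # Normalise the key so "B.Adams@example.com " and "b.adams@example.com" match.
        key = record.get(key_field, "").strip().lower()
        if key:  # records without the field cannot be matched this way
            groups[key].append(record["name"])
    # Only keys shared by more than one record point at possible duplicates.
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Calling `group_by_key(records, "email")` would return only those addresses that appear on more than one person document, together with the names filed under them.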

Subject: RE: How to find (possible) duplicate entries in your NAB?

You are correct, but at least it would give me a list of possible duplicate names! I could then manually check all the “Brians”, and find out if they are really the same person or not. At least that would be a start: to give me a list of possible duplicate person-documents.

But how to find them, that’s the question… What’s a possible algorithm?

Subject: RE: How to find (possible) duplicate entries in your NAB?

One possible approach would be to incorporate an algorithm that uses edit distances (see the Wikipedia article on Levenshtein distance).

In the worst case you could employ a process that, for each of the n names, calculates and records the edit distance between it and each of the other n-1 names, and then reports the pairs whose scores are closest to zero (zero being an exact match). Such a brute-force approach is an O(n^2) algorithm, where n is the number of entries in your directory. Admittedly ugly, but there are some obvious opportunities for improvement.
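Here is a rough sketch of that brute-force pass, in Python rather than LotusScript to keep it compact. The function names `levenshtein` and `closest_pairs`, the `max_distance` cutoff, and the lower-casing of names are my own assumptions; treat it as an illustration, not a tuned implementation:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def closest_pairs(names, max_distance):
    """Brute force: score every name against every other name, O(n^2) calls."""
    hits = []
    for i, first in enumerate(names):
        for j, second in enumerate(names):
            if i == j:
                continue  # never compare a name with itself
            d = levenshtein(first.lower(), second.lower())
            if d <= max_distance:
                hits.append((d, first, second))
    return sorted(hits)  # smallest distances first
```

Note that every pair is reported twice (once in each direction), which is one of the obvious things to improve.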

Let us know how you make out.

Subject: How to find (possible) duplicate entries in your NAB? - Solution

Hi Ken,

Levenshtein is great: it is a quick algorithm, and the results it produces can really be used! Thanx for giving me the tip. Here’s my function:

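Sub Initialize
	Msgbox Func_Compare("String1","String2")
End Sub

Function Func_Compare(Str1 As String, Str2 As String) As Integer
	' Returns the Levenshtein (edit) distance between Str1 and Str2.
	Dim n As Integer, m As Integer
	Dim I As Integer, J As Integer
	n = Len(Str1)
	m = Len(Str2)

	' Row/column 1 stand for the empty prefix; index 0 is unused.
	Redim Array(0 To n+1, 0 To m+1) As Integer

	' Distance from each prefix of Str1 to the empty string.
	For I = 1 To n+1
		Array(I,1) = I - 1
	Next
	' Distance from the empty string to each prefix of Str2.
	For J = 1 To m+1
		Array(1,J) = J - 1
	Next

	' First pass: store the substitution cost (0 = same character, 1 = different).
	For I = 2 To n + 1
		For J = 2 To m + 1
			If Mid$(Str1,I-1,1) = Mid$(Str2,J-1,1) Then
				Array(I,J) = 0
			Else
				Array(I,J) = 1
			End If
		Next
	Next

	' Second pass: cheapest of deletion, insertion and substitution.
	For I = 2 To n + 1
		For J = 2 To m + 1
			Array(I,J) = Func_Min(Func_Min(Array(I-1,J)+1,Array(I,J-1)+1),Array(I,J)+Array(I-1,J-1))
		Next
	Next

	Func_Compare = Array(n+1,m+1)
End Function

Function Func_Min(Int1 As Integer, Int2 As Integer) As Integer
	' Returns the smaller of the two integers.
	If Int1 < Int2 Then
		Func_Min = Int1
	Else
		Func_Min = Int2
	End If
End Function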

Subject: RE: How to find (possible) duplicate entries in your NAB? - Solution

Glad to be of help.

Let us know how you do with your utility that calls Func_Compare to process all the names in your directory.

I’m particularly interested in what approaches you take in order to mitigate any O(n^2) characteristics of the utility’s algorithm. I’m guessing n is represented by the number of different names in the first column of the $Users view. You can easily cut the time in half (but it’s still O(n^2)) by only comparing names in the lower triangular matrix, as illustrated by the table below, where both the X and Y axes represent the n names and each X marks a Func_Compare call to be made.

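|       | name1 | name2 | name3 | . . . | namen |
|-------|-------|-------|-------|-------|-------|
| name1 |       |       |       |       |       |
| name2 |   X   |       |       |       |       |
| name3 |   X   |   X   |       |       |       |
| . . . |   X   |   X   |   X   |       |       |
| namen |   X   |   X   |   X   |   X   |       |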

Also, you will want to take care not to open any documents, since you have 100,000 of them. Again, to retrieve the n names you should only need to access the $Users view, I would think.
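The lower-triangle idea above could be sketched like this, again in Python rather than LotusScript for brevity. The names `candidate_pairs` and `max_distance` are assumptions of mine, and `levenshtein` simply stands in for Func_Compare. The length check is an extra shortcut I am adding: an edit distance can never be smaller than the difference in length between the two strings, so such pairs can be skipped without scoring them.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between a and b; stands in for Func_Compare."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def candidate_pairs(names, max_distance):
    """Score each unordered pair once (the lower triangle of the table)."""
    calls = 0
    hits = []
    for first, second in combinations(names, 2):
        # The distance is at least the length difference, so skip hopeless pairs.
        if abs(len(first) - len(second)) > max_distance:
            continue
        calls += 1
        d = levenshtein(first.lower(), second.lower())
        if d <= max_distance:
            hits.append((d, first, second))
    return sorted(hits), calls  # calls = number of distance computations made
```

This is still O(n^2) in the worst case; it only trims the constant factor, so for 100,000 names some further grouping (for example by first letter or by name length) would probably still be needed.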