Matching family names including typos

In order to build a database of relationships between credit card owners  in my workplace, I’ve come up with the following SAS code on a lazy day’s work.

The problem, as I saw it, was that a lot (20%) of bank accounts had the same family name for two different card holders, hence, it is very reasonable to assume they are family. But what happens if GrinShtein is written as GreenShtein , an infinite amount of typos obscured the true amount of family members.

Given the fact that the bank accounts with more than 1 member were probably family related, I had to see which names were similar up to some metric.

I chose the Levenshtein Distance as the metric to measure the dissimilarity between two names.  Levenshtein’s distance (L) counts the minimum number of edits  (insertions, substitutions and deletions) needed to transform String1 to String2.

I’ve graded each pair of names with :

Grade=L/max(c1,c2)

where c1 and c2 are the number of characters in each String.

Next I had to use a macro to compare names of 3 and more members of a bank account. The macro’s flow is as follows :

  1. I used the %combo macro to make a list of pairs indices to choose from iteratively, those were written into two macro “+” separated macro variables v_1 and v_2.
  2. If the number of distinct ID holders in the bank account is K then the number of iterations is N=k*(k-1)/2, for each iteration I graded a different pair of names and held the result according to a Boolean rule in an array (Pair array) of size N.
  3. For each bank account I calculated the number of similar names by using the sum of the array,  and the number K with the function : min(K,sum(array)). Later, I substituted 1’s to 2’s because a pair is made of two people. So the values that the similar names receives is S=0,2,…,K.
  4. I calculated the homogeneity : H=S/K.

if H=0 then no names are similar,  if H=0 then all names are similar up to a threshold of the Grade.

I used a threshold of both Grade and max(c1,c2), since smaller names tend to be more volatile. I guess each language has its different type of typos and errors in spelling names. Hebrew, in my case, is prone to dropping and adding vowels (which can result in the same name),  substitution of consonants which appear similar to the ear and keyboard.

Results were excellent, Thousands of solid households were discovered and added to the database.

The Code for the program will be shown in a few days…

Advertisements