Matching strings with a combination of tools

I had a case where I had to match some names and needed to find a set of ways to clean and match strings. The solution I ended up with was a combination of regular expressions, the NLSSORT-function and the UTL_MATCH-package with the Jaro-Winkler algorithm.

First – names were written in different ways, as I’ve illustrated in this table creation:

SQL> CREATE TABLE names (
  2    id NUMBER PRIMARY KEY,
  3    firstName VARCHAR2(15) NOT NULL,
  4    lastName VARCHAR2(15) NOT NULL
  5  );

Table NAMES created.

SQL> BEGIN
  2    INSERT INTO names
  3    VALUES (1, 'John André', 'Adams');
  4    INSERT INTO names
  5    VALUES (2, 'John Andrè', 'Adams');
  6    INSERT INTO names
  7    VALUES (3, 'John A', 'Adams');
  8    INSERT INTO names
  9    VALUES (4, 'John-Andre', 'Adams');
 10    INSERT INTO names
 11    VALUES (5, ' John  Andre´', 'Adams');
 12    INSERT INTO names
 13    VALUES (6, 'John', 'André Adams');
 14    COMMIT;
 15  END;
 16  /

PL/SQL procedure successfully completed.

SQL> SELECT * FROM names;

        ID FIRSTNAME       LASTNAME
---------- --------------- ---------------
         1 John André      Adams
         2 John Andrè      Adams
         3 John A          Adams
         4 John-Andre      Adams
         5  John  Andre´   Adams
         6 John            André Adams

6 rows selected.

Removing accents

One challenge was that names contained accented letters and people don’t always get these right. The NLSSORT-function can produce an accent insensitive version of the names:

SQL> WITH
  2  concat_names AS (
  3    SELECT firstName || ' ' || lastname first_last
  4    FROM names)
  5  SELECT
  6    first_last
  7  , nlssort(first_last, 'nls_sort=binary_ai') nls
  8  , utl_raw.cast_to_varchar2(nlssort(first_last, 'nls_sort=binary_ai')) first_last_ai
  9  , length(first_last) len1
 10  , length(utl_raw.cast_to_varchar2(nlssort(first_last, 'nls_sort=binary_ai'))) len2
 11  FROM concat_names;

FIRST_LAST           NLS                                         FIRST_LAST_AI              LEN1       LEN2
-------------------- ------------------------------------------- -------------------- ---------- ----------
John André Adams     6A6F686E20616E647265206164616D7300          john andre adams            16         17
John Andrè Adams     6A6F686E20616E647265206164616D7300          john andre adams            16         17
John A Adams         6A6F686E2061206164616D7300                  john a adams                12         13
John-Andre Adams     6A6F686E2D616E647265206164616D7300          john-andre adams            16         17
 John  Andre´ Adams  206A6F686E2020616E647265C2B4206164616D7300   john  andre´ adams         19         20
John André Adams     6A6F686E20616E647265206164616D7300          john andre adams            16         17

6 rows selected.

Notice the extra NULL character (“00”) at the end of the raw-value. We’ll get rid of that, and more next.

Removing non-letters

I could have created a bunch of regular expressions to filter out punctuations, hyphens, double whitespaces, whitepace at the beginning and end and so forth:

SQL> WITH
  2  trimmed_names AS (
  3  	SELECT
  4  		FIRSTNAME
  5  	 , REGEXP_REPLACE(REGEXP_REPLACE(FIRSTNAME, '[[:space:]]{2,}', ' '), '(^+[[:space:]]|\.|\-|+[[:space:]]$)', '') FNAME
  6  	FROM names
  7  )
  8  SELECT FIRSTNAME, FNAME, LENGTH(FIRSTNAME) LEN, LENGTH(FNAME) LEN_rx
  9  FROM trimmed_names;

FIRSTNAME       FNAME                    LEN     LEN_RX
--------------- ----------------- ---------- ----------
John André      John André                10         10
John Andrè      John Andrè                10         10
John A          John A                     6          6
John-Andre      JohnAndre                 10          9
 John  Andre´   John Andre´               13         11
John            John                       4          4

6 rows selected.

But – as I was only doing machine based matching, not making the text readable, I went for just removing any non-alpa characters. That’s easily done with the \W switch (note uppercase W for non-word characters):

SQL> SELECT
  2    firstName
  3  , lastname
  4  , regexp_replace(firstName || ' ' || lastname, '\W') first_last_rx
  5  FROM names;

FIRSTNAME       LASTNAME        FIRST_LAST_RX
--------------- --------------- ---------------
John André      Adams           JohnAndréAdams
John Andrè      Adams           JohnAndrèAdams
John A          Adams           JohnAAdams
John-Andre      Adams           JohnAndreAdams
 John  Andre´   Adams           JohnAndreAdams
John            André Adams     JohnAndréAdams

6 rows selected.

Putting the accent insensitive into play, we can clearly see that we now have a better basis for matching:

SQL> WITH
  2  combined_names AS (
  3    SELECT id, firstName || ' ' || lastname first_last
  4    FROM names
  5  ),
  6  ai_names AS (
  7    SELECT id
  8         , first_last
  9         , utl_raw.cast_to_varchar2(nlssort(first_last, 'nls_sort=binary_ai') ) first_last_ai
 10    FROM combined_names
 11  )
 12  SELECT id
 13    , first_last
 14    , regexp_replace(first_last_ai, '\W') first_last_rex
 15  FROM ai_names
 16  ORDER BY first_last_rex;

ID FIRST_LAST           FIRST_LAST_REX
-- -------------------- --------------------
 3 John A Adams         johnaadams
 1 John André Adams     johnandreadams
 2 John Andrè Adams     johnandreadams
 4 John-Andre Adams     johnandreadams
 5  John  Andre´ Adams  johnandreadams
 6 John André Adams     johnandreadams

6 rows selected.

UTL_MATCH

Lastly, the UTL_MATCH-package has algorithms to perform similarity calculations on two strings. Applying one or more of these and selecting threshold values, may help you get a bit further.

Note that the Jaro-Winkler algorithm has an offset scale, favouring strings that match in the beginning, while the Levenshtein algorithm (edit_distance_similarity) doesn’t seem to care about where the differences are:

SELECT
  utl_match.jaro_winkler_similarity('Joe DiMaggio', 'John DiMaggio') A,
  utl_match.jaro_winkler_similarity('DiMaggio Joe', 'DiMaggio John') B,
  utl_match.edit_distance_similarity('Joe DiMaggio', 'John DiMaggio') C,
  utl_match.edit_distance_similarity('DiMaggio Joe', 'DiMaggio John') D,
  utl_match.jaro_winkler('Joe DiMaggio', 'John DiMaggio') E,
  utl_match.jaro_winkler('DiMaggio Joe', 'DiMaggio John') F,
  utl_match.edit_distance('Joe DiMaggio', 'John DiMaggio') G,
  utl_match.edit_distance('DiMaggio Joe', 'DiMaggio John') H
FROM dual;

A  B  C  D  E                  F                  G  H
-- -- -- -- ------------------ ------------------ -- --
93 95 85 85 0,9367521367521368 0,9525641025641025  2  2

Meaning: In our name matching example, if the last name has fewer variations than the first name, you may get higher match-scores on the concatenation of lastname, firstname than firstname, lastname.

Putting it all together

Let’s use all these techniques together and see how the different strategies work when we do a cross join on our names. I’ve set the match threshold to 90, which is “just a number”. (scroll right to view the full result-set)

SQL> WITH
  2  ai_names AS (
  3    SELECT id
  4    , utl_raw.cast_to_varchar2(nlssort(firstname, 'nls_sort=binary_ai') ) firstname
  5    , utl_raw.cast_to_varchar2(nlssort(lastname, 'nls_sort=binary_ai') ) lastname
  6    FROM names
  7  ),
  8  cleaned_names AS (
  9    SELECT id
 10    , regexp_replace(firstname, '\W') firstname
 11    , regexp_replace(lastname, '\W') lastname
 12    FROM ai_names
 13  ),
 14  match_results AS (
 15    SELECT
 16     a.firstname a_firstname
 17    , a.lastname a_lastname
 18    , b.firstname b_firstname
 19    , b.lastname b_lastname
 20    , DECODE(a.firstname || a.lastname, b.firstname || b.lastname, 'MATCH', NULL) eq_first_last
 21    , utl_match.jaro_winkler_similarity(a.firstname || a.lastname, b.firstname || b.lastname) jw_first_last
 22    , utl_match.jaro_winkler_similarity(a.lastname || a.firstname, b.lastname || b.firstname) jw_last_first
 23    , utl_match.edit_distance_similarity(a.firstname || a.lastname, b.firstname|| b.lastname) edt_first_last
 24    , utl_match.edit_distance_similarity(a.lastname || a.firstname, b.lastname || b.firstname) edt_last_first
 25    FROM cleaned_names a
 26    CROSS JOIN cleaned_names b
 27    WHERE a.id != b.id
 28  )
 29  SELECT mr.*
 30    , CASE
 31      WHEN jw_last_first > 90 THEN 'MATCH'
 32      ELSE NULL
 33      END jw_lf_match
 34    , CASE
 35      WHEN jw_first_last > 90 THEN 'MATCH'
 36      ELSE NULL
 37      END jw_fl_match
 38  FROM match_results mr;

A_FIRSTNAME     A_LASTNAME      B_FIRSTNAME     B_LASTNAME      EQ_FI JW_FIRST_LAST JW_LAST_FIRST EDT_FIRST_LAST EDT_LAST_FIRST JW_LF JW_FL
--------------- --------------- --------------- --------------- ----- ------------- ------------- -------------- -------------- ----- -----
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johna           adams                            90            94             72             72 MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           john            andreadams      MATCH           100            76            100             29       MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johna           adams                            90            94             72             72 MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           john            andreadams      MATCH           100            76            100             29       MATCH
johna           adams           johnandre       adams                            90            94             72             72 MATCH
johna           adams           johnandre       adams                            90            94             72             72 MATCH
johna           adams           johnandre       adams                            90            94             72             72 MATCH
johna           adams           johnandre       adams                            90            94             72             72 MATCH
johna           adams           john            andreadams                       90            82             72             58
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johna           adams                            90            94             72             72 MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           john            andreadams      MATCH           100            76            100             29       MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           johna           adams                            90            94             72             72 MATCH
johnandre       adams           johnandre       adams           MATCH           100           100            100            100 MATCH MATCH
johnandre       adams           john            andreadams      MATCH           100            76            100             29       MATCH
john            andreadams      johnandre       adams           MATCH           100            76            100             29       MATCH
john            andreadams      johnandre       adams           MATCH           100            76            100             29       MATCH
john            andreadams      johna           adams                            90            82             72             58
john            andreadams      johnandre       adams           MATCH           100            76            100             29       MATCH
john            andreadams      johnandre       adams           MATCH           100            76            100             29       MATCH

30 rows selected.

In this case, we get more matches when we match on lastname, firstname, as I have few lastname variations.

Still, the entry John A Adams is problematic to match. Depending on other information and your search space, you might relax a bit on the matching criteria, like lowering the similarity score threshold or match on a subset of the name (first first name) and so forth. With search space, I’m thinking are you searching through your whole table or just a subset (address, gender, birthdate, …)

Cleanup:

SQL> DROP TABLE names;

Table NAMES dropped.

References

Documentation UTL_MATCH

Documentation NLSSORT

Documentation Regular Expressions

Base-functionality example on UTL_MATCH on Oracle Base

For some impressive performance results in name matching, this presentation on YouTube can be of interest: “What’s in a Name? Fast Fuzzy String Matching” – Seth Verrinder & Kyle Putnam – Midwest.io 2015

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s