I had a case where I had to match some names and needed to find a set of ways to clean and match strings. The solution I ended up with was a combination of regular expressions, the NLSSORT-function and the UTL_MATCH-package with the Jaro-Winkler algorithm.
First – names were written in different ways, as I’ve illustrated in this table creation:
SQL> CREATE TABLE names ( 2 id NUMBER PRIMARY KEY, 3 firstName VARCHAR2(15) NOT NULL, 4 lastName VARCHAR2(15) NOT NULL 5 ); Table NAMES created. SQL> BEGIN 2 INSERT INTO names 3 VALUES (1, 'John André', 'Adams'); 4 INSERT INTO names 5 VALUES (2, 'John Andrè', 'Adams'); 6 INSERT INTO names 7 VALUES (3, 'John A', 'Adams'); 8 INSERT INTO names 9 VALUES (4, 'John-Andre', 'Adams'); 10 INSERT INTO names 11 VALUES (5, ' John Andre´', 'Adams'); 12 INSERT INTO names 13 VALUES (6, 'John', 'André Adams'); 14 COMMIT; 15 END; 16 / PL/SQL procedure successfully completed. SQL> SELECT * FROM names; ID FIRSTNAME LASTNAME ---------- --------------- --------------- 1 John André Adams 2 John Andrè Adams 3 John A Adams 4 John-Andre Adams 5 John Andre´ Adams 6 John André Adams 6 rows selected.
Removing accents
One challenge was that names contained accented letters and people don’t always get these right. The NLSSORT-function can produce an accent insensitive version of the names:
SQL> WITH 2 concat_names AS ( 3 SELECT firstName || ' ' || lastname first_last 4 FROM names) 5 SELECT 6 first_last 7 , nlssort(first_last, 'nls_sort=binary_ai') nls 8 , utl_raw.cast_to_varchar2(nlssort(first_last, 'nls_sort=binary_ai')) first_last_ai 9 , length(first_last) len1 10 , length(utl_raw.cast_to_varchar2(nlssort(first_last, 'nls_sort=binary_ai'))) len2 11 FROM concat_names; FIRST_LAST NLS FIRST_LAST_AI LEN1 LEN2 -------------------- ------------------------------------------- -------------------- ---------- ---------- John André Adams 6A6F686E20616E647265206164616D7300 john andre adams 16 17 John Andrè Adams 6A6F686E20616E647265206164616D7300 john andre adams 16 17 John A Adams 6A6F686E2061206164616D7300 john a adams 12 13 John-Andre Adams 6A6F686E2D616E647265206164616D7300 john-andre adams 16 17 John Andre´ Adams 206A6F686E2020616E647265C2B4206164616D7300 john andre´ adams 19 20 John André Adams 6A6F686E20616E647265206164616D7300 john andre adams 16 17 6 rows selected.
Notice the extra NULL character (“00”) at the end of the raw-value. We’ll get rid of that, and more next.
Removing non-letters
I could have created a bunch of regular expressions to filter out punctuations, hyphens, double whitespaces, whitepace at the beginning and end and so forth:
SQL> WITH 2 trimmed_names AS ( 3 SELECT 4 FIRSTNAME 5 , REGEXP_REPLACE(REGEXP_REPLACE(FIRSTNAME, '[[:space:]]{2,}', ' '), '(^+[[:space:]]|\.|\-|+[[:space:]]$)', '') FNAME 6 FROM names 7 ) 8 SELECT FIRSTNAME, FNAME, LENGTH(FIRSTNAME) LEN, LENGTH(FNAME) LEN_rx 9 FROM trimmed_names; FIRSTNAME FNAME LEN LEN_RX --------------- ----------------- ---------- ---------- John André John André 10 10 John Andrè John Andrè 10 10 John A John A 6 6 John-Andre JohnAndre 10 9 John Andre´ John Andre´ 13 11 John John 4 4 6 rows selected.
But – as I was only doing machine based matching, not making the text readable, I went for just removing any non-alpa characters. That’s easily done with the \W switch (note uppercase W for non-word characters):
SQL> SELECT 2 firstName 3 , lastname 4 , regexp_replace(firstName || ' ' || lastname, '\W') first_last_rx 5 FROM names; FIRSTNAME LASTNAME FIRST_LAST_RX --------------- --------------- --------------- John André Adams JohnAndréAdams John Andrè Adams JohnAndrèAdams John A Adams JohnAAdams John-Andre Adams JohnAndreAdams John Andre´ Adams JohnAndreAdams John André Adams JohnAndréAdams 6 rows selected.
Putting the accent insensitive into play, we can clearly see that we now have a better basis for matching:
SQL> WITH 2 combined_names AS ( 3 SELECT id, firstName || ' ' || lastname first_last 4 FROM names 5 ), 6 ai_names AS ( 7 SELECT id 8 , first_last 9 , utl_raw.cast_to_varchar2(nlssort(first_last, 'nls_sort=binary_ai') ) first_last_ai 10 FROM combined_names 11 ) 12 SELECT id 13 , first_last 14 , regexp_replace(first_last_ai, '\W') first_last_rex 15 FROM ai_names 16 ORDER BY first_last_rex; ID FIRST_LAST FIRST_LAST_REX -- -------------------- -------------------- 3 John A Adams johnaadams 1 John André Adams johnandreadams 2 John Andrè Adams johnandreadams 4 John-Andre Adams johnandreadams 5 John Andre´ Adams johnandreadams 6 John André Adams johnandreadams 6 rows selected.
UTL_MATCH
Lastly, the UTL_MATCH-package has algorithms to perform similarity calculations on two strings. Applying one or more of these and selecting threshold values, may help you get a bit further.
Note that the Jaro-Winkler algorithm has an offset scale, favouring strings that match in the beginning, while the Levenshtein algorithm (edit_distance_similarity) doesn’t seem to care about where the differences are:
SELECT utl_match.jaro_winkler_similarity('Joe DiMaggio', 'John DiMaggio') A, utl_match.jaro_winkler_similarity('DiMaggio Joe', 'DiMaggio John') B, utl_match.edit_distance_similarity('Joe DiMaggio', 'John DiMaggio') C, utl_match.edit_distance_similarity('DiMaggio Joe', 'DiMaggio John') D, utl_match.jaro_winkler('Joe DiMaggio', 'John DiMaggio') E, utl_match.jaro_winkler('DiMaggio Joe', 'DiMaggio John') F, utl_match.edit_distance('Joe DiMaggio', 'John DiMaggio') G, utl_match.edit_distance('DiMaggio Joe', 'DiMaggio John') H FROM dual; A B C D E F G H -- -- -- -- ------------------ ------------------ -- -- 93 95 85 85 0,9367521367521368 0,9525641025641025 2 2
Meaning: In our name matching example, if the last name has fewer variations than the first name, you may get higher match-scores on the concatenation of lastname, firstname than firstname, lastname.
Putting it all together
Let’s use all these techniques together and see how the different strategies work when we do a cross join on our names. I’ve set the match threshold to 90, which is “just a number”. (scroll right to view the full result-set)
SQL> WITH 2 ai_names AS ( 3 SELECT id 4 , utl_raw.cast_to_varchar2(nlssort(firstname, 'nls_sort=binary_ai') ) firstname 5 , utl_raw.cast_to_varchar2(nlssort(lastname, 'nls_sort=binary_ai') ) lastname 6 FROM names 7 ), 8 cleaned_names AS ( 9 SELECT id 10 , regexp_replace(firstname, '\W') firstname 11 , regexp_replace(lastname, '\W') lastname 12 FROM ai_names 13 ), 14 match_results AS ( 15 SELECT 16 a.firstname a_firstname 17 , a.lastname a_lastname 18 , b.firstname b_firstname 19 , b.lastname b_lastname 20 , DECODE(a.firstname || a.lastname, b.firstname || b.lastname, 'MATCH', NULL) eq_first_last 21 , utl_match.jaro_winkler_similarity(a.firstname || a.lastname, b.firstname || b.lastname) jw_first_last 22 , utl_match.jaro_winkler_similarity(a.lastname || a.firstname, b.lastname || b.firstname) jw_last_first 23 , utl_match.edit_distance_similarity(a.firstname || a.lastname, b.firstname|| b.lastname) edt_first_last 24 , utl_match.edit_distance_similarity(a.lastname || a.firstname, b.lastname || b.firstname) edt_last_first 25 FROM cleaned_names a 26 CROSS JOIN cleaned_names b 27 WHERE a.id != b.id 28 ) 29 SELECT mr.* 30 , CASE 31 WHEN jw_last_first > 90 THEN 'MATCH' 32 ELSE NULL 33 END jw_lf_match 34 , CASE 35 WHEN jw_first_last > 90 THEN 'MATCH' 36 ELSE NULL 37 END jw_fl_match 38 FROM match_results mr; A_FIRSTNAME A_LASTNAME B_FIRSTNAME B_LASTNAME EQ_FI JW_FIRST_LAST JW_LAST_FIRST EDT_FIRST_LAST EDT_LAST_FIRST JW_LF JW_FL --------------- --------------- --------------- --------------- ----- ------------- ------------- -------------- -------------- ----- ----- johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johna adams 90 94 72 72 MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams john andreadams MATCH 100 76 100 29 MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johna adams 90 94 72 72 MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams john andreadams MATCH 100 76 100 29 MATCH johna adams johnandre adams 90 94 72 72 MATCH johna adams johnandre adams 90 94 72 72 MATCH johna adams johnandre adams 90 94 72 72 MATCH johna adams johnandre adams 90 94 72 72 MATCH johna adams john andreadams 90 82 72 58 johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johna adams 90 94 72 72 MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams john andreadams MATCH 100 76 100 29 MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams johna adams 90 94 72 72 MATCH johnandre adams johnandre adams MATCH 100 100 100 100 MATCH MATCH johnandre adams john andreadams MATCH 100 76 100 29 MATCH john andreadams johnandre adams MATCH 100 76 100 29 MATCH john andreadams johnandre adams MATCH 100 76 100 29 MATCH john andreadams johna adams 90 82 72 58 john andreadams johnandre adams MATCH 100 76 100 29 MATCH john andreadams johnandre adams MATCH 100 76 100 29 MATCH 30 rows selected.
In this case, we get more matches when we match on lastname, firstname, as I have few lastname variations.
Still, the entry John A Adams is problematic to match. Depending on other information and your search space, you might relax a bit on the matching criteria, like lowering the similarity score threshold or match on a subset of the name (first first name) and so forth. With search space, I’m thinking are you searching through your whole table or just a subset (address, gender, birthdate, …)
Cleanup:
SQL> DROP TABLE names; Table NAMES dropped.
References
Documentation Regular Expressions
Base-functionality example on UTL_MATCH on Oracle Base
For some impressive performance results in name matching, this presentation on YouTube can be of interest: “What’s in a Name? Fast Fuzzy String Matching” – Seth Verrinder & Kyle Putnam – Midwest.io 2015