Skip to content

Fuzzy Searches with SQLite’s SOUNDEX

Though a minor update, Xojo now offers access to another core SQLite function: SOUNDEX.

What is SOUNDEX?

SOUNDEX is a core function in SQLite that encodes a string into a 4-byte sequence, enabling indexing based on pronunciation. While the technical details of the algorithm are available on Wikipedia, the gist is that SOUNDEX(“Smith”) will match SOUNDEX(“Smythe”), making it useful for large searches to identify different spellings of the same term.

How is it used?

For the curious, you can see how words are indexed by simply invoking a select on a SOUNDEX term:

Var db As New SQLiteDatabase
db.Connect

Var rs As RowSet
rs = db.SelectSQL("SELECT SOUNDEX('Smith')")
Var smith As String = rs.ColumnAt(0).StringValue
rs = db.SelectSQL("SELECT SOUNDEX('Smythe')")
Var smythe As String = rs.ColumnAt(0).StringValue

Running this you’ll notice that ‘Smith’ and ‘Smythe’ both return the same SOUNDEX 4-byte sequence of “S530” — Here’s a quick breakdown of how this was generated:

  1. First Letter (S): The first letter of the name stays the same, so “S” from the name “Smith” or “Smythe” becomes the first character in the code.
  2. Numbers (530): The remaining characters in the code represent the sounds of the next letters in the name. Each number is based on a group of similar-sounding consonants:
    • 5 = “M” (because M sounds like “M” in “Smith” or “Smythe”)
    • 3 = “T” (because T sounds like “T” in “Smith” or “Smythe”)
    • 0 = No sound match or a vowel, so it acts as a filler to complete the 4-character code.

Practical Use Case

With a better understanding of the algorithm, we can put it to practical use by searching our database for names that sound similar to “Smith.”

db.ExecuteSQL("CREATE TABLE contacts(id INTEGER PRIMARY KEY, name TEXT)")
db.ExecuteSQL("INSERT INTO contacts (name) VALUES ('Smith'), ('Smythe'), ('Simith'), ('John')")

rs = db.SelectSQL("SELECT name FROM contacts WHERE SOUNDEX(name) = SOUNDEX('Smith')")

// Expected to find 3 names: Smith, Smythe, and Simith
Var namesFound As Integer = rs.RowCount

The Fuzzy Conclusion

Because the SOUNDEX algorithm is designed around English pronunciation patterns, it may not work as well for languages with different sound structures or alphabets, so results may vary depending on the language. It’s also worth noting that this feature is available for all targets, except Android at this time.

William Yu grew up in Canada learning to program BASIC on a Vic-20. He is Xojo’s resident Windows and Linux engineer, among his many other skills. Some may say he has joined the dark side here in the USA, but he will always be a Canadian at heart.