March 27, 2018

Fuzzy Lookup for Business Users: Excel Add-In Makes It Possible

Post by: Core's Data Team

Until recently, my experience with fuzzy logic was deep in IT-driven data flows. That’s all well and good for IT professionals, but it leaves business users high and dry. Aside from learning IT tools, can business users have a tool that allows them to build their own fuzzy lookup solutions?

The answer is yes! And it comes in one of the most dynamic business tools: Microsoft Excel. There is a Fuzzy Lookup Add-In for Excel 2007 and later (you can download it here). This add-in gives you an extra tab in the Excel Ribbon, providing a side panel to configure and run your fuzzy lookup.

The initial setup is the same as the other fuzzy lookup I’ve described in SSIS. You define two tables in an Excel document, and you can select columns to compare to one another to find matches above a similarity threshold. Sounds familiar, right?

The big question is: Does this add-in effectively aid business users?

Here are 5 Things I’ve Learned from Trying it Out:

1. When you click ‘Go’ to run the Lookup, it will generate a table starting wherever your cursor is. As of this writing, I have not found a different way to define where the lookup results are returned. That said, be very, very careful where your cursor is, as you could overwrite data on one of your sheets, or even the tables you are using in the lookup. Not that I’m speaking from experience or anything…  

2. Defining multiple rules might work consecutively, not concurrently. Being used to the SSIS fuzzy lookup, I tried to set up individual rules for each column so I could tweak their threshold levels. When I did, it seemed the rules were evaluated individually. If records match by one or more rules, it will match even if it fails another rule. I’m assuming the add-in expects you to define the columns you want to compare and to use the same rule for all of them.

3. When I define more than six different rules, Excel either takes a long time to run the comparison, or it crashes. I monitored the resources on my laptop, and while there was some elevated CPU and memory usage, it wasn’t nearly enough that I would expect it to crash. When it did crash, I did not receive an error message stating why the error occurred.

4. The results of the lookup come with little supporting data. There are no similarity results returned for the individual columns. You only receive one similarity percentage for the full comparison. This may not be a big of a concern if you only compare a small number of columns, but it may be more of an issue when comparing full datasets to one another.

5. The order of the columns you compare, as well as the names of the columns, does not matter. I reorganized the columns, renamed them, and then selected all columns in my dataset and attempted a comparison. Regardless of how I set it up, the results appeared to be essentially the same (similarity percentages would change slightly, but the changes were minimal). This is possible because the Excel add-in takes the entire record, tokenizes it (aka breaks it into smaller parts), and then uses a Jaccard similarity algorithm, in which case order isn’t as important.

SSIS Fuzzy Lookup vs Fuzzy Lookup Excel Add-In

In the end, I think the Fuzzy Lookup Excel Add-In is an interesting tool, but it was designed for a different purpose than the SSIS Fuzzy Lookup. It simplifies the setup process and would work well with simple data scenarios, comparing one or two columns. This would be very helpful when business users are using VLOOKUPs with data that doesn’t match exactly.

However, typically whenever I’ve had to use a Fuzzy Lookup, I have needed to use more columns in matching or deduplicating the provided datasets. For enterprise-sized lookup or automated scenarios like that, I would continue to recommend the use of SSIS Fuzzy Lookup for its more robust features.

The Fuzzy Lookup Add-In in Excel is a good tool, but in my opinion, it needs further development to enhance the existing documentation, provide more context around possible errors, and improve the overall user experience. Unfortunately, Microsoft’s last update for the add-in was in 2014, so I’m anticipating this solution will remain as is.

Depending on your use case, and especially for simple matching scenarios, the Fuzzy Lookup Excel Add-In can be used to help solve your fuzzy lookup problems.

Stay Fuzzy!

Relevant Insights

Are You Falling into the Cloud Conversation Gap?

Most conversations about migrating services to the cloud focus on migration of application workloads and data. The vision is often...

How Aggregation Tables Improve Performance for Power BI Reports

In this blog, Marcus Radue, Solution Architect, offers high-level guidance in the advantages of aggregation tables in Power BI. For a...

Getting Started with Azure DevOps – Views

Like most Microsoft products, there are many ways to look at or perform the same activities. Azure DevOps is no...