October 30, 2018

When Low Quality Data Strikes, Can Fuzzy Logic Save the Day?

Post by: Core's Data Team

What is Fuzzy Logic?

First off, there are some important things I think you should know:

  1. I was once a member of a group called the Marx Brothers.
  2. I can solve a Rubik’s cube in less than 1,000 seconds.
  3. I like BLT sandwiches.

Now, onto the reason you’re reading this post. What is fuzzy logic? It’s interesting to search online for a definition of fuzzy logic, because you’ll find many different definitions. Some of them are technology-specific; some focus on the mathematical components of building a fuzzy algorithm; and others are all over the place.

To provide clarity on the topic, I decided to add my own definition into the mix, defining fuzzy logic outside of the context of technology or math. Here goes:

Fuzzy logic is taking two pieces of information and identifying a match based on how similar those pieces of information are.

Non-Technical Example of Fuzzy Logic

Suppose I asked you, the reader, to describe me. At the beginning of this post, I shared 3 hypothetical facts about myself. Without scrolling back up, can you recall any of the facts I presented?

If you said, “Marx Brothers”, “Rubik’s cube”, and “BLTs”, you’d be right. That is, you used terminology similar to what I wrote above and, while you may have phrased it differently or only recalled part of it, you remembered enough of the content to consider it a match to what I wrote.

Why am I wasting your time with this? Because this exercise is fuzzy logic in its basic form.

Look at the definition I provided this way: we took two pieces of information (the beginning of this post and your memory) and compared them to see how similar they are, determining whether we can identify a match. Even if the two don’t match exactly, you can use logic or other assumptions to figure out that they refer to the same thing. That’s fuzzy logic in a nutshell. Make sense?

A Practical Matching Case Study

Now you might be asking, “How do we apply this in the technology world?” To answer, I’m going to present a case study. It’s a scenario I’ve run into a couple of times (and perhaps you have, too) where I’ve had to evaluate matching algorithms and, in some cases, determined that taking a fuzzy approach saves time and headaches.

Imagine you’re in a scenario where you’re part of a company’s business intelligence group. As part of that group, you help manage datasets that the company uses for reporting. One of these datasets includes people important to your organization (customers, employees, partners, etc.), and many metrics are calculated from this data to assist in making data-driven decisions. At this time, you have two datasets from different applications, each containing information about these people. Like your typical person dataset, these contain names and demographic information: first names, last names, date of birth, social security number, gender, address, so on and so forth.

Our goal is to merge these two lists into one clean, deduplicated master person dataset. It is important to note that a person can exist in both datasets. This means you can’t merge the datasets without analysis, as doing so would likely create duplicate records. You need to come up with a solution that can identify duplicate records while avoiding bad matches. A bad match would be Joseph Smith in one dataset matching to Johann Smith in the other and the two being identified as the same person. That’s obviously a problem.

The Utopian Dataset

You take some time to review the datasets. One of them is not new to you; it comes from your company’s main application. You’ve worked with this dataset for a while, and you have direct access to the application developers. You’ve already incorporated this data into your warehouse, and it is high quality: all the fields are filled in and formatted consistently. In short, this is the complete, utopian dataset you would like to be dealing with.

The Low-Quality Dataset

Now the second dataset you’re being presented with is coming from a new application. You haven’t had a lot of time to look at it yet and don’t have access to the application developers. After an initial review, you’ve identified the data as low quality.

3 Qualities of Low-Quality Data

When identifying data as “low quality”, there are three qualities to look for: typos, blank fields, and varied formatting.

  1. Typos happen in free-form text fields (e.g., name fields) where the application enforces few rules about what is an acceptable value. With no checks in place, typos inevitably slip in. (Everybody makes typos from time to time.)
  2. Blank fields are another problem. Some of this demographic information (e.g., SSN, birth date) may be optional in the new application, which leaves gaps you can’t match on.
  3. Varied formatting can also be a problem. Gather 100 developers in a room and ask them how to store social security numbers, phone numbers, or any other formatted data, and you may get 15 or 16 different answers. Sometimes developers may not care and leave it as another free-form text field, in which case you get what you get. (See the normalization sketch after this list.)
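
To make the varied-formatting problem concrete, here is a minimal Python sketch (the function name and the nine-digit rule are my own illustrative assumptions, not part of any specific tool) that normalizes SSN-style values so differently formatted entries can be compared on equal footing:

```python
import re

def normalize_ssn(value):
    """Illustrative helper: strip everything except digits so '123-45-6789',
    '123 45 6789', and '123456789' all collapse to the same canonical value."""
    if not value:
        return None  # treat blanks as missing rather than as empty strings
    digits = re.sub(r"\D", "", value)
    return digits if len(digits) == 9 else None  # malformed values become missing too

print(normalize_ssn("123-45-6789"))  # 123456789
print(normalize_ssn("123 45 6789"))  # 123456789
print(normalize_ssn(""))             # None (blank field stays blank)
```

Running this kind of normalization on both datasets before any matching step removes the formatting variance and leaves only the typo and blank-field problems to deal with.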

When these three qualities exist in your data, you can run into a lot of challenges when trying to identify the people that exist in both lists. Additionally, we’re dealing with two separate applications, so we have no guarantee of there being any sort of numerical key that would allow us to easily tie matching records between datasets. Unfortunately, despite the data quality, we need to rely on the business data that exists to find matches.

How can we work around this dilemma?

How to Find Matches When Low-Quality Data Has Struck

The goal is to identify duplicates while avoiding making bad matches, which will be complicated given the quality of data.

How could we go about doing this? There are three basic approaches you could take from an algorithmic perspective to find matches: exact matching, fuzzy matching, and manual matching. Let’s go over each one of these at a high level.

Exact Matching

Exact Matching is probably the most familiar approach to those reading this post. In an Exact Matching algorithm, you define columns that you want to compare between your two datasets. If both datasets have a first name column, you can compare those. You can do the same for any columns that the datasets have in common.

Now, a given set of columns can be grouped into a rule. For the columns defined in that rule, the data must match exactly for two records to be identified as a match – as the name Exact Match implies. This works just like a common SQL join or where clause. The main benefit of the rule concept is the ability to define very strict criteria for when a rule succeeds. If you pick the right five or six columns and the values match exactly across your two datasets, you can be confident you’ve found a true match. Conversely, if no match is found by those criteria, you can be reasonably confident there wasn’t one to find.

The other benefit is that, if you’re defining very strict rules like this, you can define multiple rules and run them in succession to build a more elaborate algorithm. That’s especially beneficial in this scenario because, with low-quality data on one side of our match, finding one exact rule that works for all cases will be difficult (if not impossible).
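
To make the rule concept concrete, here is a minimal Python sketch (the column names, sample records, and rule choices are hypothetical, not from the case study) that runs two exact-match rules in succession, with each rule only seeing the records the previous rules could not match:

```python
# Hypothetical sample records for illustration; each record is a dict,
# and each rule is a tuple of columns whose values must match exactly.
dataset_a = [
    {"first_name": "Joseph", "last_name": "Smith", "dob": "1980-01-15", "ssn": "123456789"},
    {"first_name": "Mary",   "last_name": "Jones", "dob": "1975-06-02", "ssn": None},
]
dataset_b = [
    {"first_name": "Joseph", "last_name": "Smith", "dob": "1980-01-15", "ssn": "123456789"},
    {"first_name": "Mary",   "last_name": "Jones", "dob": "1975-06-02", "ssn": "987654321"},
]

rules = [
    ("first_name", "last_name", "dob", "ssn"),  # strictest rule runs first
    ("first_name", "last_name", "dob"),         # fallback for records missing SSN
]

matches, unmatched = [], list(dataset_a)
for rule in rules:
    index = {tuple(rec[col] for col in rule): rec for rec in dataset_b}
    still_unmatched = []
    for rec in unmatched:
        key = tuple(rec[col] for col in rule)
        if None not in key and key in index:
            matches.append((rec, index[key]))   # exact match under this rule
        else:
            still_unmatched.append(rec)         # try the next, looser rule
    unmatched = still_unmatched

print(len(matches), "matched;", len(unmatched), "left for other approaches")
```

In practice each rule would be a join in SQL or a merge step in your ETL tool; the point is that strict rules can be stacked so the obvious matches are handled before looser criteria ever run.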

Fuzzy Matching

Fuzzy Matching starts like an Exact Match by defining the columns that we want to compare between our two datasets. The difference is that, instead of expecting the data to match exactly, we’re going to use fuzzy logic – expecting data to match based on similarity.

What exactly does that mean? Matching on similarity is going to depend on whatever tool you’re using. Many tools will use some sort of percentage comparison. I’ve seen other tools that use some weird math stuff that I’m not going to pretend to understand. The important thing to note is that, regardless of the tool used, we’re looking for similarity rather than exact matching.
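
To give a rough sense of what a percentage comparison might look like, here is a minimal Python sketch using the standard library’s difflib (the tools referenced above each have their own, typically more sophisticated, similarity measures):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

# Typos still score as "close", unlike an exact comparison.
print(similarity("Joseph Smith", "Joseph Smith"))   # 100
print(similarity("Joseph Smith", "Jospeh Smith"))   # high, despite the typo
print(similarity("Joseph Smith", "Johann Smith"))   # noticeably lower
```

The typo variant still scores high, while the genuinely different name scores noticeably lower, which is exactly the distinction an exact comparison cannot make.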

In my experience working with fuzzy algorithms, they are much faster to set up for complex low-quality scenarios – which just so happens to be what we’re dealing with here. This is because the algorithm, by its nature, anticipates that the data isn’t going to match exactly all the time and is designed to let you define criteria that work around that constraint.

Manual Matching

This is the equivalent of trusting a business user to take two Excel sheets, put them side by side, and identify matches between your two datasets. While it may be tedious, in some matching situations it might be better to trust the human brain to find accurate matches. Unlike any algorithm we define, the brain (especially that of a business user familiar with your data) can account for any number of variances that exist in your data, high quality or low. Ultimately, it is the most accurate form of matching. The obvious downside is that the larger your dataset, the longer manual matching is going to take. Typically, we’re not going to lean on this route much. You may still have it as part of your full matching workflow, but the goal is to minimize the need for it by implementing either an exact algorithm or a fuzzy algorithm.

Pros and Cons of Matching Algorithms

After reviewing the different types of algorithms that exist for matching scenarios, the inevitable question is, “Which matching approach do I pick?” Well, unfortunately, algorithms are tools, so I’m not going to answer this for you. As with any toolset, some tools work better for certain situations than others. It’s really not a question of which tool is best. The better question is, “What tool makes the most sense for the situation I’m dealing with?”

To clarify your needs for an algorithm, I would ask these two questions to figure out which one makes more sense:

1. “How much time do I want to invest in finding accurate matches?”

If you have infinite resources, infinite time, and infinite business users who want to analyze your data, then why bother writing an algorithm? Just let them deal with it. Obviously, we’re not dealing with infinite resources, so that sort of manual matching approach doesn’t make sense.

If you’re very restricted on time, you don’t necessarily have time to refine your algorithm. Refinement is especially important with fuzzy algorithms, as matching by similarity can produce many bad matches if the criteria are too loose. Sometimes you just need something close to what you want, even if you still end up with some duplicates or bad matches in your results. In this case, exact matching might be the better way to go. It’s simpler to set up; it’s closer to what we know as SQL developers; and, ultimately, if you define strict rules, you can run them and feel strongly that whatever matches you found, you should have found – and if you didn’t find a match, there’s a pretty good chance you shouldn’t have.

2. “What resources are available for me to use?”

I’m especially talking about involving business users in the data cleansing workflow. Fuzzy logic used with low-quality data gives you a very good start on finding accurate matches between your two datasets. If there’s a band of similarity scores you’re unsure about, you can easily route that data through a different workflow, send it off to business users, and have them evaluate it. Exact matching would only tell your users that no matches were found with the existing rules, and it would be up to them to figure out why.
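
As an illustration of that kind of workflow, here is a minimal Python sketch (the thresholds and scored pairs are illustrative assumptions, not recommendations) that routes scored candidate pairs into auto-accept, business-user review, and no-match buckets:

```python
def route(pairs, accept_at=90, review_at=70):
    """Split scored candidate pairs into three buckets: auto-accepted matches,
    pairs for business users to review, and non-matches. Thresholds are illustrative."""
    accepted, needs_review, rejected = [], [], []
    for record_a, record_b, score in pairs:
        if score >= accept_at:
            accepted.append((record_a, record_b))
        elif score >= review_at:
            needs_review.append((record_a, record_b, score))  # send to business users
        else:
            rejected.append((record_a, record_b))
    return accepted, needs_review, rejected

# Hypothetical scored pairs: (name from dataset A, name from dataset B, similarity score)
pairs = [
    ("Joseph Smith", "Jospeh Smith", 92),
    ("Mary Jones", "M. Jones", 78),
    ("Joseph Smith", "Johann Smith", 75),
    ("Ann Lee", "Robert King", 20),
]
accepted, needs_review, rejected = route(pairs)
print(len(accepted), "auto-matched,", len(needs_review), "sent for review,", len(rejected), "rejected")
```

Only the uncertain middle band lands in front of your business users, which keeps the manual effort focused where the algorithm genuinely can’t decide.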

If you have the time and resources for fuzzy logic, then you can build effective workflows that you could theoretically automate so your data is continually refined. That’s fantastic if time allows. But if you don’t have that time, then perhaps exact matching is the better route for you.

So, again, what’s the best tool for your situation?

Can Fuzzy Logic Save the Day?

In many industries, the amount of data collected daily is growing at an incredible rate. While this is great news, the additional data only provides value to businesses if it is of high enough quality to be trusted. Ideally, any data gathered falls into the high-quality category. But if we are all honest with ourselves, we must acknowledge that we regularly run into low-quality data scenarios.

Whether it be because of user error, faulty gadgets, unrefined business processes, or migrating data between applications that have different rules, data professionals are routinely called upon to resolve low-quality data scenarios. They need to identify data quality issues, solve them via matching or other approaches, and inspire confidence in their business users that the data feeding their important metrics is accurate and will remain accurate in the future.

So, when low-quality data strikes (like in this blog’s case study), fuzzy logic is a tool that can simplify many matching, deduplication, or data cleansing scenarios. With this tool in hand, you can better ensure that even big data can be clean data.
