Chapter 3: Entity Resolution Analytics

3.1 Introduction

3.2 Defining Entity Resolution

3.3 Methodology Overview

3.3.1 Entity Extraction

3.3.2 Extract, Transform, and Load

3.3.3 Entity Resolution

3.3.4 Entity Network Mapping and Analysis

3.3.5 Entity Management

3.4 Business Level Decisions

3.4.1 Establish Clear Goals

3.4.2 Verify Proper Data Inventory

3.4.3 Create SMART Objectives

3.4 Summary

 

3.1 Introduction

Organizations in every domain need to properly manage information about individuals and assets being tracked within their data infrastructure. These individuals and assets are generically referred to by analytics professionals as “entities,” because the term can apply to people, places, or things.

Entity: something that has a real existence.

An entity can truly be any real person, place, or thing that we want to track. Below are just a few examples.

      Customer

      Mobile phone

      Contract

      House listed by a realtor

      Fleet vehicle

      Bank Account

      Employee  

The specific business need will determine which entities, and which associated data elements, are tracked by an organization’s data infrastructure. The process of tracking those entities will be customized accordingly. Regardless of entity type, all data and information about an entity within databases, free text, or other discoverable sources are known as “entity references.”

Entity reference: information or data about the real-world person, place, or thing being referred to.

For example, an entity reference could include a customer profile that appears as a single record in a customer relationship management database, an online real-estate listing for a specific house, or a phone number listed in a billing statement.

As we can see in the example below, an entity—in this case, SAS Institute Inc.—can have incomplete reference information, making the task of connecting that reference to the entity quite challenging. This is an important element of entity resolution, and one we will spend significant time on in the upcoming chapters.

Figure 3.1: Entity Reference Example¹

Organizations of all kinds struggle with tracking and analyzing entities in different ways, but the underlying technology and analytical tools to overcome their challenges remain the same. Thus, the discussion throughout this book will provide examples across several domains, while keeping the methodology very general.

However, as we will see, there are important decisions to be made throughout this methodology that will drive effective implementation of entity resolution for the particular business context. So, thoroughly understanding the business context is incredibly important, and will ultimately shape the success or failure of entity resolution in practice.

3.2 Defining Entity Resolution

When sifting through the various entity references within institutional data sources, we have to determine which entities are being referred to, and do so in an effective manner. This requires that we apply a rigorous, repeatable process to evaluate each entity reference against every other reference (i.e., pairwise comparison) in our data sources. This rigorous process is called “entity resolution.”

Entity resolution is the act of determining whether or not two entity references in a data system are referencing the same entity.

The concept is actually simple: we compare uniquely identifying attributes of both entity references (e.g., Social Security numbers, or SSNs) to determine a match, and repeat this comparison for every pair of entity references identified in our data sources. However, the implementation is much more complex because of changing business needs, data quality issues, and ever-changing entities, all of which make matches less straightforward. In other words, real-world implementation of entity resolution is difficult, and the particular business application dramatically affects the kinds of challenges we will need to overcome.
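The simple form of the idea can be sketched in a few lines. The example below is in Python purely for illustration; the records, field names, and the `normalize_ssn`, `same_entity`, and `resolve_pairs` helpers are all hypothetical, and it assumes every reference carries an identifier that can be compared exactly once formatting is stripped.

```python
from itertools import combinations

# Hypothetical entity references; field names and values are illustrative only.
references = [
    {"ref_id": "A", "name": "Jane Smith", "ssn": "123-45-6789"},
    {"ref_id": "B", "name": "J. Smith", "ssn": "123456789"},
    {"ref_id": "C", "name": "John Doe", "ssn": "987-65-4321"},
]


def normalize_ssn(ssn):
    """Keep digits only so formatting differences do not block a match."""
    return "".join(ch for ch in ssn if ch.isdigit())


def same_entity(ref_a, ref_b):
    """Declare a match when both references share the same non-empty identifier."""
    a, b = normalize_ssn(ref_a["ssn"]), normalize_ssn(ref_b["ssn"])
    return bool(a) and a == b


def resolve_pairs(refs):
    """Compare every reference against every other reference exactly once."""
    return [
        (a["ref_id"], b["ref_id"])
        for a, b in combinations(refs, 2)
        if same_entity(a, b)
    ]


matches = resolve_pairs(references)
print(matches)
```

With n references this approach performs n(n-1)/2 comparisons, which is one reason real implementations become complex as data volumes grow.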

A robust method for entity resolution is critical for large-scale business operations improvement, and enables a wide variety of applications—some examples of which are below:

Domain        Application
Retail        Social Media Ad Campaign Analysis and Planning
Finance       Know Your Customer (KYC), Insider Trading
Government    Insider Threat Detection
Insurance     Fraud, Waste, and Abuse Prevention

 

I will revisit some of these and other examples over the coming chapters to demonstrate the variety of ways high-quality entity resolution can impact an organization’s day-to-day operations.

3.3 Methodology Overview

As we will see during the remainder of this book, the steps involved in applying Entity Resolution (ER) have evolved in recent years to encompass far more than the simple act of resolving two entity references. Real-world applications of ER have created the need for an analytical framework that covers the end-to-end steps: acquiring data, cleansing it, resolving the entity references, analyzing the results for the specific need, and managing the resulting reference linkages. A robust framework lets us apply ER across numerous subject domains and solve a variety of problems. Along with some other authors in the industry, I refer to this framework as Entity Resolution Analytics (ERA).²

The phases that we will discuss for ERA have been sufficiently generalized to support the application of this methodology to a wide variety of data and domain types. Applying the high-level phases described below will have some underlying variation (and nuance not depicted here), depending on the specific problem being solved; however, the major steps shown in Figure 3.2 will not change.

Figure 3.2: Overview of the ERA Flow

3.3.1 Entity Extraction

We begin with entity extraction in both the business and technical contexts as the initial set of tasks for any ERA project. In the business sense, we must define what kinds of entities we want to extract and for what purpose. The technical elements of the project must then be developed or put in place to support the identified needs.

Example: A hedge fund manager is researching companies.

The manager wants to better understand companies currently in her portfolio, as well as a few she is evaluating for inclusion in the fund. She likely has at her disposal software for financial professionals that gives her access to incredible volumes of information about the companies of interest, including news articles. This is possible only because that software has sophisticated algorithms for extracting and resolving the entity references in those articles effectively. In so doing, the fund manager is able to gain access to information that enables her to make well-informed decisions. 
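To make the extraction step concrete, here is a minimal Python sketch that pulls company mentions out of article text by matching against a fixed list of names. This is a deliberately simple stand-in, not the method used by any commercial product: the company names, the `extract_entity_references` function, and the staging-row layout are all assumptions made for illustration.

```python
import re

# Hypothetical watch list of company names (fictional); a production system
# would rely on far richer techniques than a fixed list.
COMPANIES = ["Acme Corp", "Globex", "Initech"]

# One case-insensitive pattern with word boundaries; longer names are tried first.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(COMPANIES, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def extract_entity_references(doc_id, text):
    """Return one staging row per company mention found in the text."""
    return [
        {
            "doc_id": doc_id,
            "mention": match.group(0),
            "start": match.start(),
            "end": match.end(),
        }
        for match in _PATTERN.finditer(text)
    ]


article = "Shares of Acme Corp rose after Globex announced a deal with acme corp."
rows = extract_entity_references("news-001", article)
for row in rows:
    print(row)
```

Each row records where the mention was found, so later phases can trace a resolved entity back to its source text.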

3.3.2 Extract, Transform, and Load

Extract, Transform, and Load (ETL) are the classic processes for changing and moving data in Relational Database Management Systems (RDBMS). ETL processes enable us to take data from its raw form and migrate it through database systems in a repeatable way to ensure a dependable, predictable approach to moving and shaping data for our end use. In addition to pulling structured sources for an ERA project, ETL will be performed on the staging tables of entity references generated by the entity extraction phase.

As with every phase of the ERA framework, this is a technology function driven by business needs and best practices. Robust business processes will ensure that ETL technology functions are applied properly, delivering the level of quality and consistency that we need for each and every project being executed.
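As an illustration of the "transform" part of ETL applied to staged entity references, the Python sketch below standardizes names and phone numbers with the same rules on every run. The row layout and the `clean_name`, `clean_phone`, and `transform` helpers are hypothetical; real cleansing rules depend on the data sources and business needs.

```python
import re


def clean_name(raw):
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", raw.upper())
    return " ".join(text.split())


def clean_phone(raw):
    """Keep digits only; return None when no digits remain."""
    digits = re.sub(r"\D", "", raw or "")
    return digits or None


def transform(staging_rows):
    """Apply identical cleaning rules to every row so each run is repeatable."""
    return [
        {
            "ref_id": row["ref_id"],
            "name": clean_name(row["name"]),
            "phone": clean_phone(row.get("phone")),
        }
        for row in staging_rows
    ]


# Hypothetical staging rows produced by the extraction phase.
staging = [
    {"ref_id": 1, "name": "  Acme Corp. ", "phone": "(919) 555-0100"},
    {"ref_id": 2, "name": "ACME, Corp", "phone": None},
]
cleaned = transform(staging)
for row in cleaned:
    print(row)
```

After this step the two differently formatted names share one standardized form, which makes the later comparison of references far more reliable.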

3.3.3 Entity Resolution

The next step in our process is entity resolution. This step immediately follows ETL because all the data should now be prepared and staged in the appropriate form to begin evaluating each entity reference against every other reference identified during the extraction phase.

3.3.4 Entity Network Mapping and Analysis

After ER has been completed, we then have to decide which resulting linkages are kept and which are thrown out, as well as what we can do with the results. I’m calling this process “entity network mapping and analysis” because the links we retain effectively map out the network of entities we will want to analyze. As will be discussed in the upcoming chapters, we have a lot of flexibility in determining thresholds for network linkages to be retained. This process is driven by business decisions made at the outset of the project. The variety of analysis that could be performed here is quite broad; however, I will execute some common analytical approaches that are achievable in the scope of this book.

Example: A bank is attempting to identify fraud.

Each individual in the bank’s data stores will be analyzed for risk of fraud based on historical patterns of behavior. However, because of all the work performed prior to this phase, each individual’s risk profile can be enhanced based on their relationship to known bad actors, or other high-risk individuals. This enables us to develop a more complete understanding of each individual’s fraud risk. 
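The following Python sketch shows one way the thresholding and network ideas could fit together for the bank example. It assumes the resolution phase has already produced scored links between individuals; the names, scores, the 0.75 cutoff, and the `build_network` and `hops_to_bad_actor` helpers are all hypothetical, and hop count is only one of many possible ways to express relationship-based risk.

```python
from collections import defaultdict, deque

# Hypothetical scored links from the resolution phase: (entity_a, entity_b, score).
links = [
    ("alice", "bob", 0.92),
    ("bob", "carol", 0.81),
    ("carol", "dave", 0.40),  # falls below the threshold and is dropped
    ("erin", "frank", 0.95),
]
known_bad_actors = {"alice"}
THRESHOLD = 0.75  # assumed cutoff; in practice this is a business decision


def build_network(scored_links, threshold):
    """Keep links at or above the threshold and return an adjacency map."""
    graph = defaultdict(set)
    for a, b, score in scored_links:
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hops_to_bad_actor(graph, bad_actors):
    """Breadth-first search outward from the bad actors; returns hop counts."""
    distance = {actor: 0 for actor in bad_actors}
    queue = deque(bad_actors)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


network = build_network(links, THRESHOLD)
exposure = hops_to_bad_actor(network, known_bad_actors)
print(sorted(exposure.items()))
```

Individuals who appear in `exposure` are connected to a known bad actor through retained links, and the hop count could feed into a risk profile alongside each individual's own behavior. Raising or lowering `THRESHOLD` changes which links survive, and therefore who appears connected.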

3.3.5 Entity Management

For long-term monitoring of the entities tracked in your data stores, it is important to have a gold standard or baseline understanding of said entities. So, after an initial set of verified entity references has been developed, it then becomes the baseline against which we compare new entity references. As genuinely new entities are identified through these new references, we add them to our gold standard database. Redundant references are ignored, while references that augment our understanding of existing entities will be used to edit the database.
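The three outcomes described above (new, redundant, augmenting) can be sketched in Python as follows. The sketch assumes each incoming reference has already been resolved to an `entity_id`; the dictionary layout and the `update_gold_standard` function are hypothetical, and conflicting attribute values are simply left untouched here, whereas a real system would need a rule for handling them.

```python
def update_gold_standard(gold, reference):
    """Classify a resolved reference as 'new', 'redundant', or 'augmented'
    and update the gold standard dictionary in place."""
    entity_id = reference["entity_id"]
    attributes = {
        key: value
        for key, value in reference.items()
        if key != "entity_id" and value is not None
    }

    # A genuinely new entity is added to the gold standard.
    if entity_id not in gold:
        gold[entity_id] = dict(attributes)
        return "new"

    # Only attributes we do not yet hold count as new knowledge.
    known = gold[entity_id]
    additions = {k: v for k, v in attributes.items() if k not in known}
    if not additions:
        return "redundant"

    known.update(additions)
    return "augmented"


# Hypothetical gold standard keyed by entity identifier.
gold = {"E1": {"name": "ACME CORP"}}
incoming = [
    {"entity_id": "E1", "name": "ACME CORP"},
    {"entity_id": "E1", "name": "ACME CORP", "phone": "9195550100"},
    {"entity_id": "E2", "name": "GLOBEX"},
]
outcomes = [update_gold_standard(gold, ref) for ref in incoming]
print(outcomes)
print(gold)
```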

3.4 Business Level Decisions

Every company or agency has a different set of procedures, and those must be followed. But the key decisions that need to be made for an ERA project fit neatly into any existing management framework. So, let’s go through key elements that you need to nail down before jumping into the technical elements of how ERA is executed.

3.4.1 Establish Clear Goals

Determine the ultimate goal for an ERA project. What business goal(s) are you trying to achieve? If you are a manager of such a project, you must be able to identify the goal(s). If you are the person implementing the technical aspects of an ERA project, you still must elicit this information from your management chain. Documentation of the goal(s) ensures there is a record of common understanding. Without commonly agreed-upon goals, projects easily lose focus and success criteria are rarely achieved. Perceived “failure” of these efforts generally stems from a fundamental mismatch in understanding of what is feasible in the period of time allotted for the project.

3.4.2 Verify Proper Data Inventory

Identify all the data sources available for the project, and determine whether the high-level goals are even realistic before starting any project. Once the feasibility is well understood, the specific elements from which entities will be extracted, and the quality of those sources, will need to be determined. Entity references being pulled from structured data will also be documented, but execution of that plan will occur during the ETL phase of the ERA framework. This is a critical item that can’t be emphasized enough, and goes back to goals and expectation management. In many cases, you have structured elements that you are trying to enrich with unstructured sources. However, if you don’t have some key pieces of structured data to combine with the unstructured references, the goals that you have established may never be realized.

3.4.3 Create SMART Objectives

SMART objectives are aligned to the overall goal of the project, and they are Specific, Measurable, Achievable, Relevant, and Time-bound. Setting them directly addresses the cause of project failure described in Section 3.4.1: a mismatch in understanding of what is feasible. We need to ensure that project goals are rationalized via project objectives that can be achieved in a time and performance window acceptable to the project sponsor. This is a critical step for any technical project like ERA because it is the point at which project management and technical staff can thoroughly set expectations and examine trade-offs jointly. Here are some notional examples of the things being defined for each of these elements.

Specific: Capture company names in news source X.

Measurable: Use sample data set x to train capability and holdout set y for testing.

Achievable: Attempt accuracy equal to known/advertised baseline (e.g., 75%). 

Relevant: The data source is the business section of a financial newspaper. 

Time-bound: Complete project development within 90 days.
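The Measurable and Achievable items above lend themselves to a simple check. The Python sketch below splits labeled examples into training and holdout sets and tests whether accuracy on the holdout set reaches the 75% baseline used in the example. The helper names, the 20% holdout fraction, the fixed seed, and the toy data are all assumptions for illustration.

```python
import random


def split_holdout(labeled_examples, holdout_fraction=0.2, seed=42):
    """Shuffle once with a fixed seed, then split into training and holdout sets."""
    shuffled = list(labeled_examples)
    random.Random(seed).shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - holdout_size
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, examples):
    """Share of examples where the prediction equals the known label."""
    if not examples:
        return 0.0
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def meets_objective(predict, holdout, baseline=0.75):
    """True when holdout accuracy reaches the agreed baseline."""
    return accuracy(predict, holdout) >= baseline


# Toy labeled data: (text, is_company_mention). Entirely made up.
examples = [(f"text {i}", i % 2 == 0) for i in range(10)]
train, holdout = split_holdout(examples)


def always_true(text):
    """Naive predictor used only to exercise the check."""
    return True


print(len(train), len(holdout))
print(accuracy(always_true, examples))
```

Fixing the seed keeps the split identical from run to run, so results reported to the project sponsor can be reproduced later.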

 

The particular business problem at hand may make some of these decisions quite obvious, but it is important to be very intentional, documenting every decision as you go through a project. You will want to refer back to that documentation weeks, months, or perhaps years later to understand why something is set up the way it is, or what decisions led to a particular development path for a solution. Hindsight is always perfect, which makes it easy to overlook factors that were critical during project execution but seem less important years later.

Note: An added benefit of this documentation is the opportunity to learn from past project failures and successes.

There is much more that could be said regarding project management best practices, but I don’t want to go too deeply into that topic in this book. The above list is certainly not complete, but contains reminders that can’t be emphasized enough, regardless of the management methodology that you are implementing (e.g., Agile).

3.4 Summary

Whether you are a practitioner of analytics, or a manager of analytics teams, I hope you find this information helpful in getting started down the road of effectively executing ERA projects with SAS. As I said before, we will explore the foundational aspects of ERA, with pointers about how you can expand your knowledge and capabilities to execute much larger scale projects of this type—we have to start somewhere. Now, let’s have some fun!

 

 



1 Groenfeldt, Tom, “Toyota Finance Uses Advanced Analytics to Improve Sales and Profits,” Forbes, April 13, 2017, https://www.forbes.com/sites/tomgroenfeldt/2017/04/13/toyota-finance-uses-advanced-analytics-to-improve-sales-and-profits/#78d9bc655cb7 (accessed August 29, 2018).

2 Talburt, John R., Entity Resolution and Information Quality (Burlington, MA: Morgan Kaufmann, 2010).