A look at artificial intelligence concepts, such as comparing records in a database, and how to integrate these technologies with Salesforce.
When you compare two Salesforce (or any other CRM) records side by side, you can easily determine whether they are duplicates. However, even with a relatively small number of records, say fewer than 100,000, it's nearly impossible to sift through them one by one and make such comparisons. This is why companies have developed various tools to automate the process; but for a machine to do the job well, it needs to be able to identify all the similarities and differences between records. In this article, we'll look more closely at some of the methods data scientists use to train machine learning systems to identify duplicates.
How do machine learning systems compare and contrast records?
One of the main tools researchers use is the string metric. A string metric takes two strings and returns a value: a lower value if the strings are similar, and a higher value if they are different. How does this work in practice?
Consider two records that differ only in a misspelled word, say "burgundy" versus "burgandy". If a person sees these two records, it's obvious they are duplicates. Machines, however, rely on string metrics to replicate that human judgment, which is the essence of artificial intelligence. One of the best-known string metrics is the Hamming distance, which measures the number of substitutions required to transform one string into another. In our example, only one substitution is needed to change "burgandy" into "burgundy", so the Hamming distance is 1.
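As a minimal sketch, the Hamming distance can be computed by lining up two equal-length strings and counting the positions where they differ:

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("burgandy", "burgundy"))  # 1
```

The two spellings differ only at the fifth character, so one substitution turns one into the other.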
There are many other string metrics that measure the similarity between two strings, and each metric is distinguished by the operations it allows. For example, we mentioned the Hamming distance, but that metric only allows substitution, meaning it can only be applied to strings of equal length. Something like the Levenshtein distance also allows deletion and insertion alongside substitution.
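A sketch of the Levenshtein distance, using the standard dynamic-programming formulation (one row of the edit table kept at a time):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein_distance("kitten", "sitting"))  # 3
```

Because insertion and deletion are allowed, the two strings no longer need to be the same length, which makes this metric far more useful on messy CRM data.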
How can all of these be used for deduplication in Salesforce?
There are several methods AI systems use to implement deduplication in Salesforce. One of them is blocking.
Blocking is what makes deduplication scalable. Whenever you upload a new record to your Salesforce account, the system automatically groups together records that look "similar", based on the first three letters of the name or some other criterion.
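A minimal sketch of blocking, assuming a hypothetical list of record dictionaries (in practice these would come from the Salesforce API) and using the first three letters of the name as the blocking key:

```python
from collections import defaultdict

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "name": "Acme Company"},
    {"id": 2, "name": "Acme Co."},
    {"id": 3, "name": "Globex Inc."},
]

def block_key(record: dict) -> str:
    """Blocking key: first three letters of the name, lower-cased."""
    return record["name"][:3].lower()

blocks = defaultdict(list)
for record in records:
    blocks[block_key(record)].append(record)

# Only records sharing a block are ever compared with each other.
for key, group in blocks.items():
    print(key, [r["id"] for r in group])
```

Here the two "Acme" records land in the same block and are candidates for comparison, while "Globex Inc." is never compared against them at all.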
This is highly beneficial because it reduces the number of comparisons required. For example, suppose you have 100,000 records in Salesforce and want to upload an Excel spreadsheet containing 50,000 more. A traditional rule-based deduplication application would need to compare each new record with every existing record, meaning 100,000 x 50,000 = 5,000,000,000 comparisons. Imagine how long that would take and how much it would increase the probability of errors. Furthermore, 100,000 records is only a fairly modest Salesforce org; many organizations have hundreds of thousands or even millions of records. Traditional methods simply scale poorly at those volumes.
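The arithmetic above can be sketched directly; the block count of 1,000 below is an assumption for illustration, and real blocks are rarely this even:

```python
existing, incoming = 100_000, 50_000

# Naive approach: every new record vs. every existing record.
naive = existing * incoming
print(naive)  # 5_000_000_000

# With blocking: assume records split evenly into 1,000 blocks,
# so each new record is only compared within its own block.
num_blocks = 1_000
blocked = (existing // num_blocks) * (incoming // num_blocks) * num_blocks
print(blocked)  # 5_000_000
```

Even under this idealized assumption, blocking cuts the work by a factor of a thousand, which is why it is the standard first step in large-scale deduplication.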
Another option is to compare each field individually.
Once the system groups "similar" records together, it continues by analyzing each record field by field. This is where all the string metrics we discussed earlier come into play. Beyond this, the system assigns a specific "weight", or importance, to each field. For example, suppose the "Email" field is the most important for your dataset. You can tune the weights yourself, or the system can learn them automatically as you mark records as duplicates (or unique). This is called active learning, and it is preferable because the system can accurately calculate the importance of one field relative to another.
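A minimal sketch of weighted field-by-field scoring. The similarity function here is a crude character-overlap stand-in for the string metrics above, and the weights are hypothetical values that an active-learning system would actually fit from user-labeled duplicates:

```python
def field_similarity(a: str, b: str) -> float:
    """Crude per-field similarity in [0, 1]; real systems use proper string metrics."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a.lower(), b.lower()))
    return matches / max(len(a), len(b))

# Hypothetical weights; "email" is assumed to matter most for this dataset.
weights = {"name": 0.3, "email": 0.5, "phone": 0.2}

def duplicate_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities; closer to 1.0 means likely duplicate."""
    return sum(w * field_similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in weights.items())

a = {"name": "Acme Company", "email": "info@acme.com", "phone": "555-0100"}
b = {"name": "Acme Co.", "email": "info@acme.com", "phone": "555-0100"}
print(duplicate_score(a, b))
```

With a matching email and phone, these two records score high despite the abbreviated name, which is exactly the effect field weighting is meant to produce.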
What are the advantages of machine learning methods?
The greatest benefit machine learning offers is that it does the work for you. The active learning we described in the previous section automatically applies the necessary weight to each field, so there's no need to create complex setup processes or rules. Consider the following scenario. A sales representative discovers a duplicate and notifies the Salesforce administrator, who then creates a rule to prevent such duplicates in the future. This process must be repeated every time a new duplicate is discovered, which makes it unsustainable.
Additionally, we need to remember that Salesforce's built-in deduplication functionality is also rule-based, and quite limited: for example, it can only merge three records at a time, doesn't support custom objects, and has many other restrictions. Machine learning is simply a smarter approach, because rule creation is automated while the system attempts to replicate human judgment; this is the difference between machine learning and mere automation. Choosing a deduplication product that merely extends Salesforce's rule-based functionality instead of fixing the whole process makes little sense. This is why machine learning is the best approach.