Effective identification of duplicated content is essential when processing large amounts of text documents, such as web pages, articles, or contracts. Recently, our team developed a solution that exploits contrastive learning to find duplicated content among thousands of construction documents in the blink of an eye. This talk shares our experience and provides an introduction to contrastive learning.
One of our clients is a company processing a tremendous amount of unstructured data collected from the Internet. They fetch construction-related articles to be the first to learn about new opportunities and share this information with construction firms. This way, they offer the most up-to-date information about the market, and their customers get immediate, high-quality news on bids that might be of interest to them.
As the company collects information from different sources, inconsistencies may appear in the collected data. The same construction investment might be expressed in varying words or slightly altered data fields depending on the data source. It's as if you asked two different people to describe the same situation: each description would likely differ somewhat from the other.
In such circumstances, duplicates are expected: the same bid or project may appear multiple times, in moderately different forms, coming from different data sources. Therefore, an efficient deduplication algorithm is needed to ensure the high quality of the offered information. This talk explains how we built one by leveraging deep neural networks and contrastive learning. Moreover, it provides a gentle introduction to the concepts of contrastive techniques.
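To give a flavor of the underlying idea, here is a minimal sketch of the classic pairwise contrastive loss (in the style of Hadsell et al.): an embedding model is trained so that duplicate pairs are pulled close together while non-duplicates are pushed at least a margin apart, after which near-duplicates can be found by a simple distance threshold. This is an illustrative toy in NumPy, not the actual model or loss our solution uses; the vectors and the `margin` value are made up for the example.

```python
import numpy as np

def contrastive_pair_loss(a, b, is_duplicate, margin=1.0):
    """Pairwise contrastive loss: duplicates are penalized by their
    squared distance (pulled together); non-duplicates are penalized
    only while they sit closer than `margin` (pushed apart)."""
    d = np.linalg.norm(a - b)
    if is_duplicate:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy 2-D embeddings standing in for a neural network's outputs.
doc_a = np.array([0.10, 0.90])   # an article about a bid
doc_b = np.array([0.12, 0.88])   # near-duplicate from another source
doc_c = np.array([0.90, 0.10])   # an unrelated article

print(contrastive_pair_loss(doc_a, doc_b, is_duplicate=True))   # small: pair is already close
print(contrastive_pair_loss(doc_a, doc_c, is_duplicate=False))  # zero once pushed past the margin
```

During training, gradients of this loss shape the embedding space; at inference time, deduplication reduces to comparing embedding distances against a threshold.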