Itinerary activities deduplication
Deduplicating itinerary activities using MinHash LSH and semantic clustering
- python
- airtable
- travel
Prelude
In my previous post, I mentioned migrating our Airtable AI extraction to Rails.
This article describes how we handled duplicate content in itineraries extracted by Airtable’s AI — before we migrated the system to Rails.
Context
Due to the unstructured documents from operators and our use of AI for extraction, the same activities may result in different variants. For example, “Lovina Morning Dolphin Watching Tour” and “Dolphin Watching in Lovina” refer to the same activity. This issue affected not only activities, but also accommodations. However, accommodations were less problematic, because we could use the Google Places API to find the canonical name and merge them.
This problem caused our AI chatbot to recommend similar activities during itinerary customization. We have a feature allowing users to chat with our AI chatbot to get activity recommendations based on users’ preferences and specific requests.
Therefore, clean data was necessary to recommend distinct activities.
Approach
I designed a multi-stage deduplication pipeline to balance computational efficiency with semantic accuracy. Initially, I implemented a pure Levenshtein distance algorithm for string similarity matching. However, because it required an O(n^2) all-pairs comparison, processing just 3,000 records took over 20 minutes on my local machine — rendering it completely unviable for a production-ready scheduled job. To solve this performance bottleneck, I restructured the approach into a tiered filtering pipeline:
- MinHash LSH (Locality-Sensitive Hashing): Used as a rapid first-level indexing mechanism. It clusters text based on Jaccard similarity in sub-linear time, shrinking the massive pool of candidates down to small, high-probability duplicate buckets. This successfully compressed the processing runtime from 20 minutes to under 20 seconds.
- Levenshtein Distance: Applied only within those isolated LSH buckets to calculate exact typographical distances, eliminating the expensive global cross-comparisons.
- Mutual Clustering Algorithm: Implemented as the final stage to evaluate the relationships within a cluster. It ensures that every record in a group shares a high mutual similarity score above a strict threshold. This prevents a ‘transitive chain reaction’ where Record A matches Record B, and Record B matches Record C, but Record A and C are entirely distinct entities.
System design
This system automatically detects and merges duplicate records using semantic similarity — meaning it understands meaning, not just exact text matches. “Bali Beach Resort” and “Beach Resort Bali” would be caught as duplicates even though they are worded differently.
How It Works
1. Building a fingerprint for each record Every record is converted into a numerical “fingerprint” (called an embedding) using OpenAI’s language model. Records with similar meaning produce fingerprints that are mathematically close to each other. These fingerprints are stored in a database optimised for this kind of comparison.
2. Finding clusters of duplicates Every hour, the system compares all fingerprints against each other using cosine similarity — a standard measure of how “close” two vectors are in space. Records are first narrowed down by location (country + city) before comparison, which keeps the process efficient. A custom clustering algorithm then groups records that are all mutually similar to each other above a tuned threshold, preventing false positives from transitive matches (A looks like B, B looks like C, but A and C are actually different).
3. Merging intelligently For each detected cluster, the system picks the oldest record as the “primary” and merges the others into it. The merge strategy varies by field type: arrays and attachments are combined and deduplicated, simple text fields fill in blanks without overwriting existing data, and computed fields (formulas, rollups) are never touched to avoid corrupting downstream data.
4. Staying in sync After merging, duplicate records are deleted from Airtable and the team is notified via Slack. A separate cleanup job runs every two hours to remove any stale fingerprints from the database.
Why It Matters
The system runs fully autonomously with zero manual intervention. Similarity thresholds were empirically tuned per entity type to balance precision (avoiding false merges) against recall (catching real duplicates). The result is a consistently clean database that downstream reporting and recommendation features can rely on.
Diagram
flowchart TD
subgraph Ingestion["Embedding Ingestion"]
A1([Airtable Record Event]) --> A2["POST /embeddings/:entity"]
A2 --> A3[Enqueue RQ Job via Redis]
A3 --> A4[Worker: Fetch record from Airtable]
A4 --> A5[Build text from configured columns]
A5 --> A6["OpenAI text-embedding-3-small\n(1536 dimensions)"]
A6 --> A7[("PostgreSQL + pgvector\nupsert embedding")]
end
subgraph Scheduling["Scheduled Trigger (hourly via APScheduler)"]
B1([Cron Job]) --> B2[auto_merge_clusters]
end
subgraph Clustering["Clustering Algorithm"]
C1[Fetch all embeddings from DB] --> C2["Apply exclusion_fn\n(filter ineligible records)"]
C2 --> C3["Group records by grouping_columns\n(e.g. country + location)"]
C3 --> C4{For each group}
C4 --> C5[Compute cosine similarity matrix]
C5 --> C6["Apply pair_exclusion_fn\n(optional per-entity rule)"]
C6 --> C7["mutual_clusters algorithm\n(all-pairs >= threshold)\nAccomm: 0.92 / Op.Acc: 0.87"]
C7 --> C8["Remove singletons\n(keep clusters size >= 2)"]
end
subgraph Merging["Merge Execution"]
D1{For each cluster} --> D2[Fetch full records from Airtable]
D2 --> D3[build_update_fields]
D3 --> D4["Arrays & Attachments:\nmerge + deduplicate"]
D3 --> D5["Simple fields:\nprimary wins, secondary fills blanks"]
D3 --> D6["Computed fields: skip\n(formula, rollup, lookup)"]
D4 & D5 & D6 --> D7[Batch update primary record]
D7 --> D8[Batch delete secondary records]
D8 --> D9[Slack notification]
end
subgraph Cleanup["DB Cleanup (every 2 hours)"]
E1[Remove stale DB records\nno longer present in Airtable]
end
A7 -.->|embeddings stored| C1
B2 --> C1
C8 --> D1
D8 --> E1
Iterations and findings
The difficult part was setting the threshold value for the clustering. Too low, we would merge unrelated activities; too high, we would miss merging similar ones. As a result, we could only do human verification and find a threshold value through an empirical approach.
Furthermore, there were always exceptional cases like activities involving the phrase “national park tour”. Because it is too generic, by embedding similarity, it would be merged with other activities in general. In this case, it was better to merge them manually on Airtable instead of relying on this automation.
Future improvement
Since we are actively migrating our core system to Ruby on Rails, the architecture can be significantly streamlined. Instead of maintaining a standalone Python application and wrestling with Airtable’s strict API rate limits and data synchronization latency, we can leverage our core database infrastructure:
-
PostgreSQL Trigram Extension (
pg_trgm): We can use trigram indexing natively in Rails to execute lightning-fast, first-level fuzzy string lookups directly during data ingestion. This replaces the need for a separate LSH preprocessing service entirely. -
Unified Data Store: Running the secondary embedding clustering jobs via Sidekiq right next to our relational data eliminates the brittle syncing logic. If a duplicate is detected, a Rails database transaction can safely merge the records, update foreign keys, and maintain data integrity instantly without external network calls.