Agentic trip composer

Introduction

At TourHero, we built Maya — an AI-powered agentic chatbot that lets users compose and refine travel itineraries through natural language. Rather than navigating a UI to manually add, remove, or swap activities and accommodations, users simply describe what they want and Maya handles the rest, while respecting the complex constraints of our live inventory: dates, capacity, and pricing.

This was our team’s first AI-native product, and shipping it required solving problems across the full stack — data quality, LLM architecture, prompt engineering, and production-grade reliability.

My contributions

As a senior engineer on this initiative, my work spanned both the LLM execution layer and the underlying data foundation:

JSON output schema design: I designed the structured output schemas for the activities and accommodations searcher agents. Strict schemas were critical for getting reliable, machine-parseable search results that the orchestrator could act on.
Searcher agent prompt engineering: I engineered the prompts for the accommodations and activities searcher agents, handling parameters like price range, location, and trip vibes. The main orchestrator was owned by another engineer.
Tool and API layer: I built the underlying tools and APIs that allowed the model to query and update our production database within safe constraints.
Data-cleaning pipeline: Recognizing that recommendation quality depended entirely on clean catalog data, I also designed and built a deduplication pipeline for our activities and accommodations catalog using vector embeddings and similarity search. This is covered in detail in my activities deduplication post.
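
To illustrate what "within safe constraints" can mean for a tool layer, here is a minimal sketch in Python (our production code was Ruby). The tool names, argument shapes, and the in-memory itinerary are hypothetical stand-ins, not our actual API. The idea shown is that the model can only call allowlisted tools, and that every argument is validated before anything is written.

```python
from typing import Any, Callable

# Hypothetical in-memory stand-in for the production database.
ITINERARY: dict[int, list[str]] = {1: ["city walk"], 2: []}


def list_activities(day: int) -> dict[str, Any]:
    """Read-only tool: returns a copy so callers cannot mutate state."""
    if day not in ITINERARY:
        return {"ok": False, "error": f"day {day} is outside the trip"}
    return {"ok": True, "day": day, "activities": list(ITINERARY[day])}


def add_activity(day: int, activity: str) -> dict[str, Any]:
    """Write tool: validates arguments before touching the data."""
    if day not in ITINERARY:
        return {"ok": False, "error": f"day {day} is outside the trip"}
    if activity in ITINERARY[day]:
        return {"ok": False, "error": "activity already scheduled"}
    ITINERARY[day].append(activity)
    return {"ok": True, "day": day, "activities": list(ITINERARY[day])}


# Allowlist: the model can only reach functions registered here.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "list_activities": list_activities,
    "add_activity": add_activity,
}


def dispatch(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Route a model-issued tool call, returning errors instead of raising."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return tool(**arguments)
    except TypeError as exc:  # missing or unexpected arguments
        return {"ok": False, "error": f"bad arguments: {exc}"}
```

Returning structured errors rather than raising gives the model something it can read and recover from on the next turn.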

Challenges

Beyond being our first AI-native product, Maya was also our first time running the OpenAI Responses API at production scale.

Tooling gaps: At the time, there was no official OpenAI SDK for Ruby, so we used the community ruby-openai gem instead. Because we needed tool calling, guardrails, and multi-agent orchestration, we ended up building our own OpenAI framework in Ruby, modeled loosely on the OpenAI Agents SDK. We leaned on AI coding tools (Claude, among others) to help build it.

Prompt engineering from scratch: Every team member was new to prompt engineering. Getting the model to reliably interpret user intent as discrete, reversible operations — especially edge cases like “move the snorkeling from day 3 to day 5” — required many iterations.
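
To make "discrete, reversible operations" concrete, here is a minimal sketch in Python (our production code was Ruby). The `Move` type and the day-to-activities itinerary shape are hypothetical, chosen for illustration only. Each operation validates its preconditions, returns a new itinerary rather than mutating the old one, and can produce its own inverse.

```python
from dataclasses import dataclass

# Hypothetical itinerary shape: day number -> set of activity names.
Itinerary = dict[int, set[str]]


@dataclass(frozen=True)
class Move:
    activity: str
    from_day: int
    to_day: int

    def apply(self, itinerary: Itinerary) -> Itinerary:
        """Return a new itinerary with the activity moved; never mutates the input."""
        if self.activity not in itinerary.get(self.from_day, set()):
            raise ValueError(f"{self.activity!r} is not on day {self.from_day}")
        if self.to_day not in itinerary:
            raise ValueError(f"day {self.to_day} is outside the trip")
        if self.activity in itinerary[self.to_day]:
            raise ValueError(f"{self.activity!r} is already on day {self.to_day}")
        updated = {day: set(acts) for day, acts in itinerary.items()}
        updated[self.from_day].remove(self.activity)
        updated[self.to_day].add(self.activity)
        return updated

    def inverse(self) -> "Move":
        """The operation that undoes this one."""
        return Move(self.activity, from_day=self.to_day, to_day=self.from_day)


# "move the snorkeling from day 3 to day 5"
trip: Itinerary = {3: {"snorkeling", "beach lunch"}, 5: {"hike"}}
move = Move("snorkeling", from_day=3, to_day=5)
moved = move.apply(trip)
restored = move.inverse().apply(moved)
```

Modeling each request as a small value like this means an undo is just another operation, and a misinterpreted request fails loudly at validation instead of silently corrupting the itinerary.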

Structured output: One thing that helped significantly was OpenAI’s strict JSON schema enforcement on the Responses API. Schema-validated output made Maya’s actions far more predictable and reduced the surface area for unexpected behavior.
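
As an illustration, here is a sketch of what a strict schema for a searcher agent's output could look like, written as a Python dict. The field names are hypothetical, not our production schema. In strict mode, every object must set `additionalProperties` to false and list all of its properties as required; an optional value is expressed as a nullable type instead. The `is_strict` helper is an assumption of mine for checking those two rules locally, and the `RESPONSE_FORMAT` wrapper reflects my understanding of how the Responses API accepts a schema through its `text` parameter.

```python
# Hypothetical output schema for an activities searcher agent.
ACTIVITY_SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "results": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "activity_id": {"type": "string"},
                    "title": {"type": "string"},
                    "price_usd": {"type": "number"},
                    # Optional in spirit: required but nullable.
                    "reason": {"type": ["string", "null"]},
                },
                "required": ["activity_id", "title", "price_usd", "reason"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["results"],
    "additionalProperties": False,
}


def is_strict(schema: dict) -> bool:
    """Check the two strict-mode rules on every nested object schema."""
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            return False
        if set(schema.get("required", [])) != set(props):
            return False
        return all(is_strict(sub) for sub in props.values())
    if schema.get("type") == "array":
        return is_strict(schema.get("items", {}))
    return True


# Assumed shape of the Responses API `text` parameter for structured output.
RESPONSE_FORMAT = {
    "format": {
        "type": "json_schema",
        "name": "activity_search",
        "schema": ACTIVITY_SEARCH_SCHEMA,
        "strict": True,
    }
}
```

A local check like this catches a schema that drifts out of strict mode before it reaches the API.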

Implementation

Maya needed RAG (Retrieval-Augmented Generation) to operate effectively — the chatbot had to reason about both the user’s itinerary request and our catalog of available activities and accommodations.

We used pgvector for embedding storage, enabling semantic search over the catalog. However, embeddings alone were not sufficient. Our matching needed to factor in abstract trip attributes (travel style, activity preferences, pace) that might never appear verbatim in the conversation. We addressed this by having an OpenAI model score both the itinerary context and the user request against a defined set of relevance dimensions, then blending those scores into the ranking to improve recommendation relevance.
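
The sketch below shows one way to blend semantic similarity with dimension scores, in Python for brevity (our production code was Ruby, with similarity computed by pgvector). The dimension names, the 0-to-1 score scale, the alignment formula, and the default weight are all assumptions for illustration, not our production values.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dimension_alignment(wanted: dict[str, float], offered: dict[str, float]) -> float:
    """1.0 when the profiles match on every shared dimension, lower as they diverge."""
    shared = wanted.keys() & offered.keys()
    if not shared:
        return 0.0
    return 1.0 - sum(abs(wanted[d] - offered[d]) for d in shared) / len(shared)


def rank(query_vec, wanted, candidates, semantic_weight=0.5):
    """Order candidate titles by a weighted blend of both signals."""
    scored = []
    for c in candidates:
        sim = cosine_similarity(query_vec, c["embedding"])
        fit = dimension_alignment(wanted, c["dimensions"])
        scored.append((semantic_weight * sim + (1 - semantic_weight) * fit, c["title"]))
    return [title for _, title in sorted(scored, reverse=True)]


# Hypothetical data: both activities sit close to a "beach" query in embedding space,
# but only one matches a low-pace, low-adventure request.
query_vec = [1.0, 0.0]
wanted = {"pace": 0.2, "adventure": 0.1}
candidates = [
    {"title": "jet ski tour", "embedding": [0.9, 0.1],
     "dimensions": {"pace": 0.9, "adventure": 0.9}},
    {"title": "beach picnic", "embedding": [0.8, 0.3],
     "dimensions": {"pace": 0.2, "adventure": 0.1}},
]

semantic_only = rank(query_vec, wanted, candidates, semantic_weight=1.0)
blended = rank(query_vec, wanted, candidates, semantic_weight=0.5)
```

With these numbers, embeddings alone put the jet ski tour first, while the blended score promotes the beach picnic.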

Prompt design for the operations layer was the other major investment. The prompt needed to handle add, remove, replace, and move operations across both activities and accommodations, each with their own constraints around pricing, location, and availability. Getting this to generalize reliably across diverse user phrasings took significant iteration.

Outcomes

Maya moved from concept to production and fundamentally changed how users engage with trip planning on our platform. Users can now compose high-ticket itineraries in plain English, while the system silently enforces all complex inventory constraints in the background.

The result: manual operational bottlenecks in trip planning were eliminated, transaction velocity increased significantly, and the platform scaled to support over 1,000 integrated operators.

Iterations and findings

Relevance scoring calibration: Assigning and weighting relevance scores was not straightforward. Early versions produced recommendations that were semantically close but tonally wrong — a user asking for a “relaxed beach trip” would receive high-energy water sports. Tuning the scoring model and weighting took several rounds of human review.

Prompt stability: Prompts that worked well for common requests often failed on edge cases. The same prompt that correctly handled “add a cooking class on day 2” might misinterpret “swap the sunset cruise for something quieter”. We had to document known failure cases and iterate until the prompts generalized.
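
One way to keep documented failure cases from regressing is to turn them into a small test table, sketched here in Python (our production code was Ruby). The case list, the expected operation labels, and the `naive_interpret` stand-in are hypothetical; in a real harness the `interpret` argument would wrap the model call.

```python
from typing import Callable

# Documented phrasings paired with the operation type we expect back.
KNOWN_CASES = [
    ("add a cooking class on day 2", "add"),
    ("swap the sunset cruise for something quieter", "replace"),
    ("move the snorkeling from day 3 to day 5", "move"),
]


def run_regression(interpret: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Return (phrasing, expected, actual) for every case that fails."""
    failures = []
    for phrasing, expected in KNOWN_CASES:
        actual = interpret(phrasing)
        if actual != expected:
            failures.append((phrasing, expected, actual))
    return failures


def naive_interpret(text: str) -> str:
    """Stand-in for the model call: keys on the first word only."""
    first = text.split()[0].lower()
    return first if first in {"add", "remove", "move"} else "unknown"


failures = run_regression(naive_interpret)
```

The deliberately naive stand-in handles the common phrasings but fails on "swap", which is the kind of gap a table like this makes visible after every prompt change.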

Data quality as a prerequisite: The deduplication work was not optional — without it, Maya would regularly recommend duplicate or near-duplicate activities. Clean catalog data was a hard dependency for recommendation quality, which is why the pipeline work preceded the chatbot’s production launch.
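
For readers who have not seen the deduplication post, the core idea can be sketched in a few lines of Python (our production code was Ruby). This is a simplified, brute-force illustration, not our actual pipeline: the 0.95 threshold, the toy two-dimensional vectors, and the keep-the-first-seen rule are assumptions, and a real catalog would use an indexed similarity search rather than comparing every pair.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_near_duplicates(items, threshold=0.95):
    """Return (kept_id, duplicate_id) pairs; the first item seen is kept."""
    kept = []   # (id, embedding) of items treated as canonical
    pairs = []
    for item_id, vec in items:
        match = next(
            (k_id for k_id, k_vec in kept
             if cosine_similarity(vec, k_vec) >= threshold),
            None,
        )
        if match is None:
            kept.append((item_id, vec))
        else:
            pairs.append((match, item_id))
    return pairs


# Toy embeddings: a1 and a2 describe nearly the same activity, a3 is unrelated.
catalog = [
    ("a1", [1.0, 0.0]),
    ("a2", [0.99, 0.05]),
    ("a3", [0.0, 1.0]),
]
duplicates = find_near_duplicates(catalog)
```

Choosing the threshold is the judgment call: set it too low and distinct activities are merged, too high and near-duplicates slip through.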