PDF extraction

Prelude

AI can play many roles in a product. I categorize them into three layers:

AI tools for planning and development
Prompt-driven automation within the system — text generation, embeddings, and content verification
User-facing AI agents, such as chatbots or assistants

Context

Our Airtable contained many PDF, DOCX, and XLSX documents from operators’ itineraries. We used Airtable’s AI feature to extract the data and convert it into more structured itineraries. However, Airtable offered less control over key concerns such as prompt versioning and structured JSON schema output. Due to lack of versioning such as git, we couldn’t simply roll back the prompt. Furthermore, the existing prompts produced inconsistent results, with occasional hallucinations in the output.

Next, the Airtable Webhook was difficult to work with, because it fires an event for every trivial change. To read the event, our server needed to fetch the entire batch payload — that’s why we implemented a fetcher to poll for changes periodically instead of listening to the webhook.

Alternatively, we could use Airtable Automation to send the event when a record is created or updated. This could replace the webhook, but it was also a problem to set up. Especially when there are field changes, we needed to update the Automation JavaScript accordingly. And again, there was no versioning control for the changes. It is user-friendly, but not developer-friendly.

Moreover, the documents provided by the operators were unstructured and inconsistent. Some used XLSX. Parsing the tables was very challenging, because most of the time the AI could hardly read the columns correctly.

As a result, we decided to migrate the flow to our Ruby on Rails app.

Approach

AI tools usage

To optimize development speed and code reliability, my company adopted AI tools — with Claude Code as the primary one. I also used ChatGPT (web) and Claude (web) for early planning to draft the overall solution. I then used Claude Code, particularly its Plan mode, to structure and execute the implementation.

I prepare a Markdown document to describe the problem, goals, constraints, and edge cases for each task — this gives the AI richer context than a short prompt and produces significantly better plans. I then open a Claude Code session, provide the Markdown document, and use Plan mode to review and refine the proposed approach before writing any code.

Implementation

Migrate every prompt from Airtable to Rails
Re-design the JSON schema output to improve extraction quality
Use the newer model gpt-5.4-mini — chosen for its lower cost and faster response time
Integrate tool calling such as web_search and a custom currency converter tool

Different types of document

One of the challenges was the variety of document formats. Since we were using OpenAI, which does not natively support DOCX and XLSX, I built a companion service using Python and FastAPI, deployed on Heroku, with Pandoc for document conversion. During extraction, the document is sent to the FastAPI app and converted into Markdown format.

Code snippet

class ItineraryExtraction
  def initialize(quote_id)
    @quote = AirtableQuote.find(quote_id)
  end

  def call
    # Step 1: Run through sequential OpenAI JSON-schema extractions
    itinerary_data = ItineraryExtractor.run(documents: @quote.attachments)
    pricing_data   = PricingExtractor.run(documents: @quote.attachments, context: itinerary_data)
    
    # Step 2: Convert and persist to database within a clean transaction
    ActiveRecord::Base.transaction do
      product = ExtractionToProduct.call(itinerary_data, pricing_data)
      enqueue_refinement_workers(product.id)
    end
  rescue ValidationError => e
    Rails.logger.error("Pipeline failed at schema validation: #{e.message}")
    raise e
  end
end

Note: This is not the exact code from the codebase.

System design

Overview

When a tour operator submits a travel itinerary document (PDF, Word, or Excel), manually converting it into a structured product listing is tedious and error-prone. This pipeline automates that process end-to-end using a sequence of specialized AI models.

How it works

The system reads the raw document and passes it through four AI extraction stages, each focused on a distinct concern:

Itinerary Extractor — reads the day-by-day schedule and produces structured daily summaries, a marketing title, and the trip’s inclusions and exclusions
Pricing Extractor — identifies all pricing tiers by month, season, and group size; when prices are in a foreign currency, it automatically fetches the live exchange rate and converts to USD before finalizing
Airports Extractor — determines the arrival and departure airports, inferring from city names when not explicitly stated
Activities Extractor — identifies every bookable experience across the itinerary and structures them as standalone, reusable catalog items

Each extractor enforces a strict JSON output schema, making downstream processing reliable without manual cleanup.

From extraction to live product

Once extraction is complete, the data is persisted to the database and then promoted into a publishable product. This final step assembles the product record, wires up the assigned tour guide’s profile, syncs pricing and event dates from the original quote, and sets cancellation policy — all derived automatically from the extracted data.

Why this approach

Breaking the extraction into separate, single-purpose steps makes each stage easier to test, retrain, and improve independently. If the pricing model needs tuning, only that step changes. The pipeline also supports partial runs — operators can extract first and convert to a product later, which fits naturally into a human review workflow.

Diagram

flowchart TD
    START([Airtable Quote ID]) --> FETCH[Fetch Quote from Airtable]
    FETCH --> CHECK_DOCS{Documents attached?}
    CHECK_DOCS -- No --> DOWNLOAD[Download PDF / Word / Excel\nfrom Airtable]
    CHECK_DOCS -- Yes --> ITINERARY_EX
    DOWNLOAD --> ITINERARY_EX

    subgraph AI_PIPELINE ["AI Extraction Pipeline (OpenAI Responses API)"]
        direction TB
        ITINERARY_EX["ItineraryExtractor  ·  gpt-5.4-mini\nDays, title, accommodations,\ncities, inclusions, exclusions"]
        PRICING_EX["PricingExtractor  ·  gpt-5.4-mini + tool use\nPrices per month / season / pax\nCurrencyFetcher for non-USD conversion"]
        AIRPORTS_EX["AirportsExtractor  ·  gpt-5.2\nArrival and departure airports"]
        ACTIVITIES_EX["ActivitiesExtractor  ·  gpt-5.4-mini\nBookable activities per day"]
        ITINERARY_EX --> PRICING_EX --> AIRPORTS_EX --> ACTIVITIES_EX
    end

    ACTIVITIES_EX --> CHECK_CONVERT{convert: true?}
    CHECK_CONVERT -- Yes --> CONVERTER["ExtractionConverter\nCreate/update AirtableItinerary\nRebuild DailySummaries\nRebuild MonthlyPrices\nFind/create Accommodations"]
    CHECK_CONVERT -- No --> CHECK_ENRICH
    CONVERTER --> CHECK_ENRICH{enrich: true?}
    CHECK_ENRICH -- Yes --> ENRICHER["ActivitiesEnricherAirtable\nEnrich activities via Airtable"]
    CHECK_ENRICH -- No --> CITY_ASSOC
    ENRICHER --> CITY_ASSOC["CityAssociator\nAssociate cities to extraction"]

    CITY_ASSOC --> ETP["ExtractionToProduct"]
    ETP --> VALIDATE{Output valid &\ncountry codes present?}
    VALIDATE -- No --> ERROR(["Raise Error"])
    VALIDATE -- Yes --> HERO_MEDIUM["Build Hero photo medium"]
    HERO_MEDIUM --> CREATE_PRODUCT["Create Product record\nSync itinerary YAML"]
    CREATE_PRODUCT --> CHECK_DATES{Tour dates\nfrom Airtable?}
    CHECK_DATES -- Yes --> SYNC_PRICING["Sync Event, Price\n& Cancellation Policy"]
    CHECK_DATES -- No --> ASYNC_WORKERS
    SYNC_PRICING --> ASYNC_WORKERS["Enqueue async workers\nFillExtractionPrelude\nRefineExtractionTitle"]
    ASYNC_WORKERS --> DONE(["Product Created"])

    classDef ai fill:#dbeafe,stroke:#3b82f6,color:#1e40af
    classDef db fill:#dcfce7,stroke:#22c55e,color:#166534
    classDef async fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef error fill:#fee2e2,stroke:#ef4444,color:#991b1b

    class ITINERARY_EX,PRICING_EX,AIRPORTS_EX,ACTIVITIES_EX ai
    class CONVERTER,CREATE_PRODUCT,SYNC_PRICING db
    class ASYNC_WORKERS async
    class ERROR error

Iterations and findings

The most challenging part of development was quality assurance and testing. My product manager gave invaluable feedback throughout, which shaped many iterations of the extraction pipeline.

Designing the JSON schema upfront proved to be one of the highest-leverage decisions.

Rule 5. Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. (Brooks, 1995, p. 102).

Prompt engineering was the other major challenge, particularly avoiding overfitting. For instance, when generating itinerary titles, providing example outputs caused the model to anchor too closely to those examples rather than generalizing across varied inputs.

The result has significantly reduced our content team’s workload.

Limitations and future improvements

While the new pipeline vastly improved reliability, a few engineering challenges remain for future iterations:

Sequential Latency: Chaining four distinct extractor models sequentially introduces cumulative latency. While acceptable for background processing via async workers, we are exploring parallelizing the execution of independent stages (like Airports and Pricing) to optimize throughput.
Schema Evolution Overhead: Enforcing strict JSON schemas means that when operators introduce new, unexpected document formats, the schema must be updated and backward-compatibly migrated within the Rails codebase.
Edge-Case Hallucinations: Extremely long documents or poorly scanned PDFs occasionally push token limits or result in truncated data, requiring an administrative dashboard UI for manual override and human-in-the-loop validation.

References

Brooks, F. P., Jr. (1995). The mythical man-month: Essays on software engineering (Anniversary ed.). Addison-Wesley.