
Personal project · AI systems

The AI Shopping Buddy
An AI layer for e-commerce, designed from a real frustration.

I was buying a phone. I ended up with a build plan for an AI discovery layer that any e-commerce site could ship.

Origin: Personal project
Role: Product Manager
Type: Product + technical spec
Scope: V1 discovery + recommendations

A shopping problem that became a product question

I needed a new phone. Budget was clear. Use case was clear. Which phone to buy — not clear at all.

So I did what most people do now. I opened ChatGPT and described my situation: daily driver, camera-heavy, under ₹50,000, prefer stock Android. Got a shortlist. Clicked through to Flipkart. Opened the first option. Loaded the spec sheet — 6.7" AMOLED, 50MP main camera, Snapdragon 7 Gen 3, 5000mAh. Looked reasonable. Went back to the LLM. Copy-pasted the entire spec block. Asked: "Given what I told you about my needs, does this one fit?"

Then did the same for the next phone. And the next. ChatGPT for reasoning, Perplexity for research, Claude for deeper dives — and Flipkart as the spec sheet I kept copying out of.

I was the bridge between two systems that should have been talking to each other.

The LLM had the reasoning capability. The e-commerce site had the inventory data. Neither knew the other existed. Every time I moved between them — copy-pasting specs, re-explaining my context, reformatting — I was doing integration work that the product should have done for me.

At some point mid-session I stopped and thought: I've spent 40 minutes on a decision I should have made in five. And as someone who builds product systems for a living, I couldn't let the question go — why does no e-commerce site have this built in? The reasoning exists. The inventory data is already there. The user's context is already being tracked. What would it actually take to close that loop?

Some do, of course. Amazon has Rufus. A few others are building toward it. This isn't a claim that nobody has thought of it. It's my version — what I'd actually decide if I were the PM on this, designed from a real frustration and the constraints I'd actually face.

That question became a weekend rabbit hole. This is what came out of it: a build plan for an AI shopping layer that can be embedded into any e-commerce site.

"I was the bridge between two systems that should have been talking to each other — copy-pasting specs from Flipkart into Claude and asking 'does this fit my needs?'"

Named before anything else was decided

My starting point wasn't a product brief — it was a shopping session that felt more broken than it should have. Before making any design decisions, I named the assumptions I was building on — because every architectural call downstream is contingent on these being true.

Explicit assumptions

  • Indian market. Price sensitivity is high, and order values range from ₹5,000 accessories to ₹2,00,000 laptops. Trust in AI-driven recommendations is still being established.
  • Early-stage startup. Every decision is weighted against shipping speed and cost. Infrastructure requiring a dedicated ML team is out of scope for V1.
  • User persona: mid-intent buyers. They know their use case and budget, not the specific product. "Gaming laptop under 60k" — not "Dell XPS 15 16GB".
  • Build timeline: 2–3 week sprint. PM estimate, not engineering commitment. V1/V2 split is driven by this constraint, not feature priority alone.
  • Usage: 1,000–10,000 queries/day at launch. Model selection and cost architecture calibrated to this range.
  • Opus-class models excluded from V1. Quality delta over Sonnet on structured shopping tasks doesn't justify cost at this scale. Revisited in V2 when cost-per-order data exists.

What this builds — and what it deliberately doesn't

Order management is excluded from V1. Not because users don't need it, but because order flows carry higher risk: "cancel the one I ordered yesterday" requires disambiguation logic, transactional rollback handling, and edge case coverage that adds 3–4 weeks of engineering risk to the first sprint.

More importantly, order management doesn't need AI. The user already knows what they want — their order, their status, their cancellation. There's no ambiguity to resolve. Direct API call, direct response.

Discovery and recommendations are the opposite. The user knows their budget and use case but not the product. This is exactly the problem I was living: the gap between "I know what I need" and "I don't know what to buy." That's the "I don't know what I don't know" problem, and it's where an LLM earns its place in the UX.

Every hard call, made in advance

01

Floating side panel, not a dedicated page

Reads current page context on open. For a user on the HP Omen product page, the buddy infers gaming-laptop intent, mid-to-high budget, and performance focus. No cold start. The user never explains where they are. Context is pre-loaded before the first message.

02

Two modes: authenticated vs guest

Authenticated users get purchase history, wishlist, cart, inferred budget and ecosystem. User bought an iPhone and Sony earbuds — ecosystem compatibility is weighted in earbud recommendations without asking. Guest users get session-only context. Same interface, different context depth.

03

Rule-based pre-filter before the LLM

High-precision intents bypass the model entirely. "Where is my order" routes directly to the order API. No LLM cost, no added latency, no hallucination risk. The LLM only sees queries that genuinely need language understanding.
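A minimal sketch of what that pre-filter could look like, assuming a handful of regex rules are enough for V1. The patterns and the `order_api` route name are illustrative, not taken from a real store; the actual rules would be tuned against query logs.

```python
import re
from typing import Optional

# Hypothetical high-precision patterns for order queries.
ORDER_PATTERNS = [
    re.compile(r"\bwhere('s| is) my order\b"),
    re.compile(r"\btrack (my )?(order|package|delivery)\b"),
    re.compile(r"\border status\b"),
]


def prefilter(query: str) -> Optional[str]:
    """Return a direct route name if a rule matches, else None.

    None means the query carries on to the LLM intent layer.
    """
    text = query.lower().strip()
    for pattern in ORDER_PATTERNS:
        if pattern.search(text):
            return "order_api"
    return None
```

The rules are deliberately narrow: a missed match only costs one LLM call, while a false match would send a discovery query to the wrong place.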

04

Names are fuzzy. IDs are truth.

Transactional actions (add to cart, wishlist) resolve to a product entity ID before execution. If resolution is ambiguous, the system asks. It never guesses. A state-changing action on the wrong product is a trust-breaking bug, not a UX edge case.
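One way to sketch the "ask, never guess" rule, assuming the buddy resolves a mention only against products it has already shown in the session. The `Product` and `Resolution` types and the word-matching rule are placeholders for whatever the catalogue service really provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    product_id: str
    name: str


@dataclass
class Resolution:
    product_id: Optional[str]   # set only when exactly one candidate matches
    candidates: list[Product]   # offered back to the user when clarification is needed


def resolve_product(mention: str, in_context: list[Product]) -> Resolution:
    """Match a user's product mention against products shown in this session."""
    words = mention.lower().split()
    matches = [p for p in in_context if all(w in p.name.lower() for w in words)]
    if len(matches) == 1:
        return Resolution(product_id=matches[0].product_id, candidates=matches)
    # Zero or several matches: do not guess, ask the user to pick.
    return Resolution(product_id=None, candidates=matches)
```

An add-to-cart call would only go ahead when `product_id` is set; otherwise the candidates become the options in a clarifying question.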

05

Hybrid spec passing for electronics

Electronics is spec-heavy — a single laptop spec block is ~500 tokens. Default path passes summary specs (top 4 attributes, use case tags, ~150 tokens). Comparison and deep technical queries escalate to full specs. Tiered by intent, not by default.
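The tiering could be sketched as below, assuming each product record stores its specs ordered by importance and that the intent layer emits labels such as `comparison`. Both the record shape and the intent labels are assumptions made for illustration.

```python
SUMMARY_ATTRIBUTE_COUNT = 4  # "top 4 attributes" on the default path
FULL_SPEC_INTENTS = {"comparison", "deep_technical"}  # hypothetical intent labels


def specs_for_prompt(product: dict, intent: str) -> dict:
    """Choose the spec payload passed to the LLM, tiered by intent."""
    if intent in FULL_SPEC_INTENTS:
        # Escalated path: the full spec block.
        return {"name": product["name"], "specs": product["specs"]}
    # Default path: assumes product["specs"] is ordered by importance.
    top = dict(list(product["specs"].items())[:SUMMARY_ATTRIBUTE_COUNT])
    return {
        "name": product["name"],
        "specs": top,
        "use_case_tags": product["use_case_tags"],
    }
```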

06

Two-layer freshness: vector DB for relevance, live API for truth

Price and stock change constantly in electronics. Vector DB handles semantic match only — it never serves prices. A live API call at retrieval time fetches current price and stock. Eliminates the scenario where the buddy quotes ₹78,990 and the cart shows ₹82,499.
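A rough sketch of the join between the two layers, assuming the vector DB returns only product IDs and a separate `fetch_offer` callable wraps the live price and stock API. The function names, the offer shape, and the choice to drop out-of-stock items are assumptions for illustration.

```python
from typing import Callable, Optional


def attach_live_offers(
    product_ids: list[str],
    fetch_offer: Callable[[str], Optional[dict]],
    max_price: Optional[int] = None,
) -> list[dict]:
    """Join semantic matches with live price and stock.

    product_ids come from the vector DB (relevance only).
    fetch_offer is the live API call (the source of truth).
    """
    results = []
    for product_id in product_ids:
        offer = fetch_offer(product_id)  # live lookup, never a stored vector DB field
        if offer is None or not offer["in_stock"]:
            continue
        if max_price is not None and offer["price"] > max_price:
            continue  # hard filter applied after retrieval
        results.append({"product_id": product_id, **offer})
    return results
```

Because the price is read at retrieval time, the number in the chat and the number in the cart come from the same source.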

The system, not the features

Query flow — end to end

User input: Text query arrives with page context (current product page) and user context (auth state, cart, purchase history).
Pre-filter: Rule-based check for high-precision intents. Order queries bypass the LLM entirely and go direct to the order API. Fast, cheap, zero hallucination risk.
Intent layer: Lightweight model (Gemini Flash) classifies intent and extracts entities. Returns structured JSON: intent, entities (category, price range, use case, brand), confidence score, context_required flag.
Routing: Low confidence → one clarifying question. context_required: true → escalate to strong model (Claude Sonnet) with full session. Out of scope → graceful redirect.
Retrieval: Vector DB returns semantically similar products. Price and stock come from a live API call, never from the vector DB. Hard filters applied post-retrieval.
Generation: LLM formats the response. LLM never generates prices, inventory, delivery timelines, or order details. These values pass unchanged from the APIs to the UI. Every visible fact is traceable to a live source.
Action: Cart and wishlist actions require explicit confirmation against a resolved product ID. Revalidate price + stock at action time. If validation fails, surface alternatives instead of a dead end.
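The intent and routing steps of the flow above can be sketched as follows. The JSON field names follow the spec; the 0.7 threshold, the `out_of_scope` label, the route names, and the order in which the checks run are assumptions I'd expect to tune.

```python
import json
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off for "low confidence"


@dataclass
class IntentResult:
    intent: str
    entities: dict = field(default_factory=dict)
    confidence: float = 0.0
    context_required: bool = False


def parse_intent(raw_json: str) -> IntentResult:
    """Parse the structured JSON returned by the lightweight intent model."""
    data = json.loads(raw_json)
    return IntentResult(
        intent=data["intent"],
        entities=data.get("entities", {}),
        confidence=float(data.get("confidence", 0.0)),
        context_required=bool(data.get("context_required", False)),
    )


def route(result: IntentResult) -> str:
    """Map an intent result to the next step in the query flow."""
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "clarify"        # ask one clarifying question
    if result.intent == "out_of_scope":
        return "redirect"       # graceful redirect
    if result.context_required:
        return "strong_model"   # escalate with the full session
    return "retrieve"           # default path: go straight to retrieval
```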

Three layers of signals, not one

Evals aren't dashboards you build after launch. They're the safety net you build before changing anything: prompts, models, retrieval logic. A golden dataset of 100 queries with expected outputs runs on every change. Any drop in accuracy blocks the release.
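A minimal sketch of that gate, assuming each golden case pairs a query with an expected intent label and that exact match is the scoring rule. The case format and the `predict` callable are placeholders; real expected outputs would likely cover entities and retrieved products too.

```python
from typing import Callable


def golden_accuracy(golden: list[dict], predict: Callable[[str], str]) -> float:
    """Fraction of golden cases where the prediction equals the expected output."""
    if not golden:
        raise ValueError("golden dataset is empty")
    correct = sum(1 for case in golden if predict(case["query"]) == case["expected"])
    return correct / len(golden)


def ship_allowed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Block the release if accuracy drops below the baseline (minus any tolerance)."""
    return candidate >= baseline - tolerance
```

In CI this would run on every prompt, model, or retrieval change, with the baseline taken from the last released version.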

North star

Buddy-assisted conversion rate vs baseline

Did users who used the buddy buy more than those who didn't? This is the single number that justifies the product.

30% faster

Time to purchase

Buddy sessions vs non-buddy. Proves the discovery thesis.

> 70%

Resolution rate

User resolved without leaving chat.

< 2%

Hallucination rate

Responses with data not in retrieved context. Immediate investigation if breached.
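One cheap proxy for this metric is a grounding check on numbers: any figure in the response that does not appear in the retrieved context gets flagged. This is a sketch under that narrow assumption; it ignores non-numeric claims and unit changes, and the function name is my own.

```python
import re

# Matches integers with optional comma grouping (Indian or Western) and decimals.
NUMBER = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?")


def ungrounded_numbers(response: str, context: str) -> list[str]:
    """Return numbers in the response that never appear in the retrieved context."""
    def normalise(number: str) -> str:
        return number.replace(",", "")

    context_numbers = {normalise(n) for n in NUMBER.findall(context)}
    return [n for n in NUMBER.findall(response) if normalise(n) not in context_numbers]
```

A response with a non-empty result counts toward the hallucination rate and is logged for review.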

100%

Spec accuracy

Zero tolerance. A wrong spec causes a wrong purchase — that's a return, a ticket, and a broken trust moment.

< 2s

Latency p95

End to end. Responses stream token-by-token to reduce perceived latency.

10× ratio

AI cost ROI

Revenue influenced by the buddy vs monthly inference cost. The number that goes to the CEO.
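The ratio itself is simple arithmetic. A sketch, assuming both figures are measured over the same month and in the same currency; the function names are illustrative.

```python
TARGET_ROI = 10.0  # the 10× target above


def roi_ratio(revenue_influenced: float, inference_cost: float) -> float:
    """Revenue influenced by the buddy divided by inference cost for the same period."""
    if inference_cost <= 0:
        raise ValueError("inference_cost must be positive")
    return revenue_influenced / inference_cost


def meets_target(revenue_influenced: float, inference_cost: float) -> bool:
    return roi_ratio(revenue_influenced, inference_cost) >= TARGET_ROI
```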

V2 is data-triggered, not assumed

Nothing in the V2 backlog ships on a timeline. It ships when production data creates the case for it.

  • Order management (status, cancellation, returns) with full transactional guardrails
  • Voice input
  • Summarisation layer for sessions exceeding 8 messages, added only when p95 session length data justifies it
  • Embedding-based recommendation engine replacing popularity heuristics when behavioural data exists
  • Collaborative filtering for authenticated users
  • Personalised retrieval re-ranking based on purchase history
  • Multi-modal search (search by image)
  • A/B testing framework for prompt variants on live traffic
  • LLM-as-judge for automated response quality evaluation
  • Opus-class model evaluation when cost-per-order data makes the case

What I'd validate before writing a line of code

The honest caveat

My own frustration is a data point of one. Before committing to a single sprint of engineering, the first thing I'd validate is whether the discovery gap is real for this specific store's users. Are they arriving without knowing what they want — or are they arriving with a specific product already in mind? The entire product thesis rests on that assumption. I'd run a two-week session recording analysis on the existing store before touching the architecture. A buddy built for the wrong problem is a well-architected failure.