Personal project · AI systems
The AI Shopping Buddy
An AI layer for e-commerce, designed from a real frustration.
I was buying a phone. I ended up with a build plan for an AI discovery layer that any e-commerce site could ship.
How it started
A shopping problem that became a product question
I needed a new phone. Budget was clear. Use case was clear. Which phone to buy — not clear at all.
So I did what most people do now. I opened ChatGPT and described my situation: daily driver, camera-heavy, under ₹50,000, prefer stock Android. Got a shortlist. Clicked through to Flipkart. Opened the first option. Loaded the spec sheet — 6.7" AMOLED, 50MP main camera, Snapdragon 7 Gen 3, 5000mAh. Looked reasonable. Went back to the LLM. Copy-pasted the entire spec block. Asked: "Given what I told you about my needs, does this one fit?"
Then did the same for the next phone. And the next. ChatGPT for reasoning, Perplexity for research, Claude for deeper dives — and Flipkart as the spec sheet I kept copying out of.
I was the bridge between two systems that should have been talking to each other.
The LLM had the reasoning capability. The e-commerce site had the inventory data. Neither knew the other existed. Every time I moved between them — copy-pasting specs, re-explaining my context, reformatting — I was doing integration work that the product should have done for me.
At some point mid-session I stopped and thought: I've spent 40 minutes on a decision I should have made in five. And as someone who builds product systems for a living, I couldn't let the question go — why does no e-commerce site have this built in? The reasoning exists. The inventory data is already there. The user's context is already being tracked. What would it actually take to close that loop?
Some do, of course. Amazon has Rufus. A few others are building toward it. This isn't a claim that nobody has thought of it. It's my version — what I'd actually decide if I were the PM on this, designed from a real frustration and the constraints I'd actually face.
That question became a weekend rabbit hole. This is what came out of it: a build plan for an AI shopping layer that can be embedded into any e-commerce site.
"I was the bridge between two systems that should have been talking to each other — copy-pasting specs from Flipkart into Claude and asking 'does this fit my needs?'"
Assumptions
Named before anything else was decided
My starting point wasn't a product brief — it was a shopping session that felt more broken than it should have. Before making any design decisions, I named the assumptions I was building on — because every architectural call downstream is contingent on these being true.
Explicit assumptions
- → Indian market. Price sensitivity is high, and average order values range from ₹5,000 accessories to ₹2,00,000 laptops. Trust in AI-driven recommendations is still being established.
- → Early-stage startup. Every decision is weighted against shipping speed and cost. Infrastructure requiring a dedicated ML team is out of scope for V1.
- → User persona: mid-intent buyers. They know their use case and budget, not the specific product. "Gaming laptop under 60k" — not "Dell XPS 15 16GB".
- → Build timeline: 2–3 week sprint. PM estimate, not engineering commitment. V1/V2 split is driven by this constraint, not feature priority alone.
- → Usage: 1,000–10,000 queries/day at launch. Model selection and cost architecture calibrated to this range.
- → Opus-class models excluded from V1. Quality delta over Sonnet on structured shopping tasks doesn't justify cost at this scale. Revisited in V2 when cost-per-order data exists.
Scope decision
What this builds — and what it deliberately doesn't
Order management is excluded from V1. Not because users don't need it, but because order flows carry higher risk: "cancel the one I ordered yesterday" requires disambiguation logic, transactional rollback handling, and edge case coverage that adds 3–4 weeks of engineering risk to the first sprint.
More importantly, order management doesn't need AI. The user already knows what they want — their order, their status, their cancellation. There's no ambiguity to resolve. Direct API call, direct response.
Discovery and recommendations are the opposite. The user knows their budget and use case but not the product. This is exactly the problem I was living: the gap between "I know what I need" and "I don't know what to buy." That's the "I don't know what I don't know" problem, and it's where an LLM earns its place in the UX.
Key decisions
Every hard call, made in advance
01
Floating side panel, not a dedicated page
Reads current page context on open. If the user is on the HP Omen product page, the buddy infers gaming laptop intent, mid-to-high budget, performance focus. No cold start. User never explains where they are. Context is pre-loaded before the first message.
02
Two modes: authenticated vs guest
Authenticated users get purchase history, wishlist, cart, inferred budget and ecosystem. User bought an iPhone and Sony earbuds — ecosystem compatibility is weighted in earbud recommendations without asking. Guest users get session-only context. Same interface, different context depth.
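A minimal sketch of how the two modes could share one context shape. The `BuddyContext` fields and the `profile` dictionary keys are hypothetical names chosen for illustration, not part of any existing store API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BuddyContext:
    mode: str                          # "authenticated" or "guest"
    page_product_id: Optional[str]     # product page the panel was opened on
    session_views: list = field(default_factory=list)
    purchase_history: list = field(default_factory=list)
    wishlist: list = field(default_factory=list)
    cart: list = field(default_factory=list)


def build_context(page_product_id, session_views, profile=None):
    """Same structure for both modes; guests simply get empty account fields."""
    if profile is None:
        # Guest: session-only context.
        return BuddyContext("guest", page_product_id, list(session_views))
    # Authenticated: add account-level signals (hypothetical profile keys).
    return BuddyContext(
        mode="authenticated",
        page_product_id=page_product_id,
        session_views=list(session_views),
        purchase_history=list(profile.get("purchase_history", [])),
        wishlist=list(profile.get("wishlist", [])),
        cart=list(profile.get("cart", [])),
    )
```

Keeping one shape means the prompt-building code does not branch on login state. It only sees more or fewer populated fields.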
03
Rule-based pre-filter before the LLM
High-precision intents bypass the model entirely. "Where is my order" routes directly to the order API. No LLM cost, no added latency, no hallucination risk. The LLM only sees queries that genuinely need language understanding.
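The pre-filter can be sketched as a small regex router. The patterns and intent names below are illustrative assumptions; a real deployment would tune them against actual query logs.

```python
import re
from typing import Optional

# Hypothetical high-precision patterns. Anything that does not match
# falls through to the LLM.
RULES = [
    ("order_status", re.compile(r"\bwhere\s+is\s+my\s+order\b|\btrack\s+(my\s+)?order\b", re.I)),
    ("open_cart", re.compile(r"^\s*(show|open|view)\s+(my\s+)?cart\s*$", re.I)),
]


def prefilter(query: str) -> Optional[str]:
    """Return a rule-matched intent name, or None if the LLM should handle it."""
    for intent, pattern in RULES:
        if pattern.search(query):
            return intent
    return None


def route(query: str) -> str:
    """Label the destination: a direct API call or the LLM path."""
    intent = prefilter(query)
    return f"api:{intent}" if intent else "llm"
```

The rules are deliberately narrow. A missed match only costs one LLM call, while a false match would send a discovery question to the wrong API.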
04
Names are fuzzy. IDs are truth.
Transactional actions (add to cart, wishlist) resolve to a product entity ID before execution. If resolution is ambiguous, the system asks. It never guesses. A state-changing action on the wrong product is a trust-breaking bug, not a UX edge case.
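A minimal sketch of the resolve-or-ask rule, assuming the buddy keeps a list of products it has shown in the current session. `Product`, `Resolution`, and the token-matching logic are hypothetical stand-ins for whatever entity matcher the store would actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    product_id: str
    name: str


@dataclass(frozen=True)
class Resolution:
    status: str        # "resolved", "ambiguous", or "not_found"
    candidates: tuple  # matching Product objects


def resolve(mention: str, shown: list) -> Resolution:
    """Match a user's mention against products shown in this session."""
    tokens = mention.lower().split()
    hits = tuple(p for p in shown if all(t in p.name.lower() for t in tokens))
    if len(hits) == 1:
        return Resolution("resolved", hits)
    if hits:
        return Resolution("ambiguous", hits)
    return Resolution("not_found", ())


def add_to_cart(mention: str, shown: list) -> str:
    """Execute only on a single unambiguous ID; otherwise ask the user."""
    result = resolve(mention, shown)
    if result.status == "resolved":
        return f"ADD {result.candidates[0].product_id}"
    if result.status == "ambiguous":
        names = ", ".join(p.name for p in result.candidates)
        return f"ASK Which one did you mean: {names}?"
    return "ASK I couldn't find that product. Could you pick one from the list?"
```

The state-changing branch is reachable only when exactly one product ID matches. Every other outcome produces a question, never a guess.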
05
Hybrid spec passing for electronics
Electronics is spec-heavy — a single laptop spec block is ~500 tokens. Default path passes summary specs (top 4 attributes, use case tags, ~150 tokens). Comparison and deep technical queries escalate to full specs. Tiered by intent, not by default.
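The tiering could look roughly like this. The intent labels in `DEEP_INTENTS` and the product dictionary layout are assumptions for illustration; the sketch also assumes the catalogue already stores specs ordered by importance.

```python
SUMMARY_KEYS = 4  # "top 4 attributes" on the default path

# Hypothetical intent labels that justify sending the full spec block.
DEEP_INTENTS = {"compare", "deep_technical"}


def build_spec_context(product: dict, intent: str) -> dict:
    """Return summary specs by default, full specs for deep intents."""
    specs = product["specs"]  # assumed to be ordered by importance
    if intent in DEEP_INTENTS:
        return {"name": product["name"], "specs": dict(specs)}
    top = dict(list(specs.items())[:SUMMARY_KEYS])
    return {"name": product["name"], "specs": top, "tags": product.get("tags", [])}
```

With a shortlist of several products, the default path keeps the prompt to a fraction of the full-spec size, and the escalation only happens when the query type calls for it.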
06
Two-layer freshness: vector DB for relevance, live API for truth
Price and stock change constantly in electronics. Vector DB handles semantic match only — it never serves prices. A live API call at retrieval time fetches current price and stock. Eliminates the scenario where the buddy quotes ₹78,990 and the cart shows ₹82,499.
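A minimal sketch of the two layers, with both backends passed in as plain functions. `search` and `fetch_live` are hypothetical stand-ins for the vector DB client and the store's pricing/stock API; neither name comes from a real library.

```python
from typing import Callable


def retrieve(
    query: str,
    search: Callable[[str], list],      # vector DB: returns product IDs only
    fetch_live: Callable[[str], dict],  # live API: returns current price and stock
) -> list:
    """Combine semantic relevance with live price and stock at retrieval time."""
    results = []
    for product_id in search(query):
        live = fetch_live(product_id)   # never read price from the index
        results.append({
            "product_id": product_id,
            "price": live["price"],
            "in_stock": live["in_stock"],
        })
    return results
```

Because the index never carries a price field, there is no stale price for the model to quote. Whatever reaches the prompt is what the cart will show at that moment.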
Architecture
The system, not the features
Query flow — end to end
Evaluation
Three layers of signals, not one
Evals aren't dashboards you build after launch. They're the safety net you build before changing anything — prompts, models, retrieval logic. A golden dataset of 100 queries with expected outputs runs on every change. Accuracy drop blocks the ship.
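The ship gate can be sketched in a few lines. Exact-match scoring and the `query`/`expected` keys are simplifying assumptions; real expected outputs would more likely be checked on structured fields such as intent or product IDs.

```python
def run_golden_set(cases: list, answer_fn) -> float:
    """Fraction of golden cases where the system output matches the expected output."""
    passed = sum(1 for case in cases if answer_fn(case["query"]) == case["expected"])
    return passed / len(cases)


def ship_allowed(new_accuracy: float, baseline_accuracy: float, tolerance: float = 0.0) -> bool:
    """Block the change if accuracy drops below the baseline by more than the tolerance."""
    return new_accuracy >= baseline_accuracy - tolerance
```

Run in CI, `ship_allowed` returning `False` fails the build, so a prompt or model change cannot merge with a regression.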
North star
Buddy-assisted conversion rate vs baseline
Did users who used the buddy buy more than those who didn't? This is the single number that justifies the product.
30% faster
Time to purchase
Buddy sessions vs non-buddy. Proves the discovery thesis.
> 70%
Resolution rate
User resolved without leaving chat.
< 2%
Hallucination rate
Responses with data not in retrieved context. Immediate investigation if breached.
100%
Spec accuracy
Zero tolerance. A wrong spec causes a wrong purchase — that's a return, a ticket, and a broken trust moment.
< 2s
Latency p95
End to end. Responses stream token-by-token to reduce perceived latency.
10× ratio
AI cost ROI
Revenue influenced by the buddy vs monthly inference cost. The number that goes to the CEO.
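One way to approximate the hallucination rate above is a numeric grounding check: flag any response that contains a number absent from the retrieved context. This is a narrow proxy of my own choosing, not a full hallucination detector. It catches wrong prices and specs but not wrong wording.

```python
import re

# Matches integers and decimals, allowing thousands separators like 82,499.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set:
    """Extract numbers as strings with separators removed."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def ungrounded_numbers(response: str, retrieved_context: str) -> set:
    """Numbers the response states that the retrieved context does not contain."""
    return numbers_in(response) - numbers_in(retrieved_context)


def hallucination_rate(pairs: list) -> float:
    """Share of (response, context) pairs with at least one ungrounded number."""
    flagged = sum(1 for response, context in pairs if ungrounded_numbers(response, context))
    return flagged / len(pairs)
```

For example, a response quoting ₹78,990 against a context that only contains ₹82,499 is flagged, which is the price-mismatch scenario the freshness decision is meant to prevent.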
What ships next
V2 is data-triggered, not assumed
Nothing in the V2 backlog ships on a timeline. It ships when production data creates the case for it.
- — Order management (status, cancellation, returns) with full transactional guardrails
- — Voice input
- — Summarisation layer for sessions exceeding 8 messages — added only when p95 session length data justifies it
- — Embedding-based recommendation engine replacing popularity heuristics when behavioural data exists
- — Collaborative filtering for authenticated users
- — Personalised retrieval re-ranking based on purchase history
- — Multi-modal search (search by image)
- — A/B testing framework for prompt variants on live traffic
- — LLM-as-judge for automated response quality evaluation
- — Opus-class model evaluation when cost-per-order data makes the case
Reflection
What I'd validate before writing a line of code
The honest caveat
My own frustration is a data point of one. Before committing to a single sprint of engineering, the first thing I'd validate is whether the discovery gap is real for this specific store's users. Are they arriving without knowing what they want — or are they arriving with a specific product already in mind? The entire product thesis rests on that assumption. I'd run a two-week session recording analysis on the existing store before touching the architecture. A buddy built for the wrong problem is a well-architected failure.