Shipping AI Features in Next.js: RAG, Embeddings, and Lessons From Real Builds

Most AI demos look great in a five-minute screen recording and fall apart the moment a real user asks a real question. I have now built a few of these features into client products, and the gap between a demo and something that ships is mostly about grounding, cost, and latency. Here is what I have learned wiring large language models into Next.js apps, and where the work actually goes.

The chatbot is the easy part

Calling an LLM from a Next.js route handler is a few lines. You set up a server-side endpoint, stream the response back, and the UI feels responsive. The hard part is making the model answer from your data instead of whatever it happened to learn during training. A general chatbot that hallucinates product details or invents a return policy is worse than no chatbot, because it sounds confident while being wrong.

On a recent project for a clinic client, the first version answered medical-adjacent questions from general knowledge. That is exactly what you do not want. The fix was not a better prompt. It was retrieval: pull the actual approved content first, then ask the model to answer using only that content.

RAG is just retrieval plus discipline

Retrieval-augmented generation sounds heavier than it is. You chunk your source documents, turn each chunk into an embedding (a vector of numbers that captures meaning), and store those vectors. When a user asks something, you embed their question with the same model, find the closest chunks by vector similarity, and paste those chunks into the prompt as context. The model answers from what you handed it.
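To make the "find the closest chunks" step concrete, here is a minimal in-memory sketch using cosine similarity. It assumes the embeddings already exist as plain number arrays; the `StoredChunk` shape and the function names are made up for illustration, and a real build would usually let the database do this ranking.

```typescript
// Hypothetical shape for a stored chunk; the field names are assumptions.
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
}

interface ScoredChunk {
  chunk: StoredChunk;
  score: number;
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the question embedding and keep the k best.
function topKChunks(
  queryEmbedding: number[],
  chunks: StoredChunk[],
  k: number,
): ScoredChunk[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The text of the returned chunks is what gets pasted into the prompt as context.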

The two decisions that actually matter are chunk size and what you retrieve. Chunks that are too big bury the relevant sentence in noise and burn tokens. Chunks that are too small lose context and the model stitches together fragments badly. I usually land on a few hundred tokens per chunk with some overlap, then test against real questions rather than guessing. For storage, I have used pgvector inside an existing Postgres instance on smaller projects because it means one less service to run, and a dedicated vector database only when scale or filtering needs justify it.
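Chunking with overlap can be sketched as a sliding window. This version counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; a real tokenizer will give different counts. The function name and parameters are illustrative only.

```typescript
// Split text into overlapping windows of words.
// Words approximate tokens here; this is an assumption, not a tokenizer.
function chunkWords(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be at least 0 and smaller than chunkSize");
  }

  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time

  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end, so no redundant tail chunk is emitted.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

Something like `chunkWords(doc, 300, 50)` is a reasonable place to start, and then the real test questions decide whether the numbers move.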

Grounding is the discipline part. The system prompt has to tell the model to answer from the provided context and to say it does not know when the context does not cover the question. That single instruction is the difference between a feature you can put in front of customers and one your support team has to apologize for.
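As a rough illustration of that instruction, here is one way the prompt could be assembled. The wording of the instructions, the `GroundedPrompt` shape, and the function name are all assumptions for the sketch, not a fixed recipe.

```typescript
// Illustrative grounding instructions; the exact wording is an assumption.
const GROUNDING_INSTRUCTIONS = [
  "Answer using only the information in the provided context.",
  "If the context does not cover the question, say that you do not know.",
  "Do not use outside knowledge.",
].join("\n");

interface GroundedPrompt {
  system: string; // stable instructions, identical on every request
  user: string; // retrieved context plus the user's question
}

function buildGroundedPrompt(
  question: string,
  contextChunks: string[],
): GroundedPrompt {
  // Number the chunks so they are easy to tell apart in the prompt.
  const context = contextChunks
    .map((chunk, i) => `[${i + 1}] ${chunk}`)
    .join("\n\n");

  const contextBlock = context.length > 0 ? context : "(no relevant context found)";
  const user = `Context:\n${contextBlock}\n\nQuestion: ${question}`;

  return { system: GROUNDING_INSTRUCTIONS, user };
}
```

The `system` and `user` strings map onto whatever message format your provider expects.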

Streaming, prompt caching, and the cost math

Streaming is not a nice-to-have; it is what makes the feature feel alive. In Next.js I stream tokens from a server route straight to the client so the answer appears as it is generated instead of after a long pause. It also sidesteps request timeouts on longer answers. The Anthropic SDK has a helper that collects the full message at the end, so you get the streamed experience and still have the complete response to log or store.
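The "stream it and still keep the whole thing" idea can be sketched without any SDK. This is not the Anthropic helper; it is a plain async generator that forwards each token and hands the accumulated text to a callback at the end. `fakeModelStream` is a stand-in for a real model stream, and every name here is hypothetical.

```typescript
// Forward each token as it arrives while accumulating the full text.
// onComplete runs once, after the source is exhausted.
async function* forwardAndCollect(
  source: AsyncIterable<string>,
  onComplete: (fullText: string) => void,
): AsyncGenerator<string> {
  let fullText = "";
  for await (const token of source) {
    fullText += token;
    yield token; // the caller can write this to the response immediately
  }
  onComplete(fullText); // a good place to log or store the full answer
}

// Stand-in for a model's token stream, for illustration only.
async function* fakeModelStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    yield token;
  }
}
```

In a route handler, each yielded token would typically be encoded and enqueued into a `ReadableStream` that is returned as the response body.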

Cost and latency are where projects quietly get expensive. The retrieved context and your system prompt are often the same across many requests, and you pay to process them every single time. Prompt caching fixes this: a cached prefix reads back at roughly a tenth of the input price. The catch is that caching is a prefix match, so any byte change near the front invalidates everything after it. I keep the stable parts (the instructions, the frozen system prompt) at the very start and put the volatile part (the user question) last. Drop a timestamp into the system prompt and you have silently turned caching off.
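The cost math is simple enough to write down. This sketch compares input cost per request with and without a cache hit on the stable prefix. The 0.1 multiplier reflects the "roughly a tenth" figure above; the price and token counts are placeholders, and the sketch ignores any extra charge for writing the cache in the first place.

```typescript
interface CostInputs {
  prefixTokens: number; // stable part: instructions, frozen system prompt
  volatileTokens: number; // changing part: the user question
  pricePerMillionInput: number; // placeholder price in dollars, not a real quote
  cacheReadMultiplier: number; // fraction of the input price paid on a cache hit
}

// Input cost in dollars for one request. Output tokens are not included.
function inputCostPerRequest(c: CostInputs, cacheHit: boolean): number {
  const prefixRate = cacheHit
    ? c.pricePerMillionInput * c.cacheReadMultiplier
    : c.pricePerMillionInput;
  const total =
    c.prefixTokens * prefixRate + c.volatileTokens * c.pricePerMillionInput;
  return total / 1e6;
}

// Example with made-up numbers: a 4,000-token prefix and a 100-token question.
const exampleInputs: CostInputs = {
  prefixTokens: 4000,
  volatileTokens: 100,
  pricePerMillionInput: 3,
  cacheReadMultiplier: 0.1,
};
const uncachedCost = inputCostPerRequest(exampleInputs, false);
const cachedCost = inputCostPerRequest(exampleInputs, true);
```

With those placeholder numbers the input cost drops from about $0.0123 to about $0.0015 per request, which is why a timestamp in the prefix is such an expensive mistake.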

Model choice is the other lever. You do not need your most capable model for every call. Routing simple classification or short replies to a cheaper, faster model and reserving the expensive one for the actual reasoning keeps both the bill and the latency down. I decide this per route, not once for the whole app.
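One possible shape for that routing is a small lookup keyed by the kind of work a route does. The task kinds and model identifiers below are placeholders, not real model names; swap in whatever your provider offers.

```typescript
// Hypothetical task categories; adjust to the routes in your own app.
type TaskKind = "classification" | "short_reply" | "reasoning";

// Placeholder model identifiers, not real model names.
const MODEL_BY_TASK: Record<TaskKind, string> = {
  classification: "small-fast-model",
  short_reply: "small-fast-model",
  reasoning: "large-capable-model",
};

// Each route handler asks for the model that matches the work it does.
function pickModel(task: TaskKind): string {
  return MODEL_BY_TASK[task];
}
```

Keeping the mapping in one place makes it easy to move a route to a different tier after measuring quality and latency.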

What separates shipped from demoed

The features that survive contact with users all share the same traits. They ground answers in real data so the model cannot freelance. They handle the empty case, where retrieval finds nothing relevant, by saying so instead of inventing an answer. They stream so the wait feels short. And they cache aggressively so a popular feature does not become a budget problem.
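The empty case is worth handling in code before the model is ever called. A minimal sketch, assuming retrieval returns similarity scores and that a cutoff such as 0.75 has been tuned against real questions; the threshold, the types, and the fallback message are all illustrative.

```typescript
// Hypothetical retrieval result: chunk text plus its similarity score.
interface RetrievalHit {
  text: string;
  score: number;
}

type RetrievalDecision =
  | { kind: "answer"; context: string[] }
  | { kind: "no_context"; message: string };

// Illustrative fallback wording; use whatever fits your product.
const NO_CONTEXT_MESSAGE = "I could not find that in our documentation.";

// Keep only hits above the threshold. If none survive, skip the model call
// and return a fixed message instead of letting the model improvise.
function decideFromRetrieval(
  hits: RetrievalHit[],
  minScore: number,
): RetrievalDecision {
  const relevant = hits.filter((hit) => hit.score >= minScore);
  if (relevant.length === 0) {
    return { kind: "no_context", message: NO_CONTEXT_MESSAGE };
  }
  return { kind: "answer", context: relevant.map((hit) => hit.text) };
}
```

Skipping the model call on `no_context` also saves the cost of a request that could only have produced a guess.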

None of that shows up in a demo. A demo just needs one good question and one good answer. A shipped feature needs to be right, affordable, and fast across thousands of questions you did not anticipate. That is the work, and it is mostly engineering judgment rather than model magic.
