I have a confession to make - I’m a full-blown coffee nerd. The kind who reads tasting notes like wine reviews, gets excited about new Ethiopian naturals, and treats “limited micro-lot” drops like concert tickets.
I’m always hunting for the latest single-origin or blended drops from my favorite roasters, looking for that one bag that tastes like blueberries, chocolate, or something weird and wonderful.
But here’s the frustration: when I ask an AI assistant those coffee-nerd questions - like “What is my favorite roaster currently offering?” - I usually get the digital equivalent of a shrug: “I don’t have information about their current inventory.”
The internet updates every day, and roasters release new coffees weekly - but LLMs? Their knowledge is frozen in time. Even the newest models are trained on data that’s already months old. For someone trying to track fresh releases, that makes them weirdly out of touch.
So I did what any data-obsessed coffee geek would do - I built my own expert coffee chatbot; one that doesn’t rely only on old training data but goes out, fetches the latest offerings, and answers based on what’s actually available right now. It’s part web scraper, part AI assistant, and surprisingly practical for anyone who wants their LLM connected to the real, constantly-changing world.
The big idea
What we are building is a RAG (Retrieval-Augmented Generation) system - one that grounds an LLM in your own preferred, real-world data, reducing hallucination. Here's the architecture in a nutshell:
Scrape fresh data from a coffee roaster's website using Scrapy and Zyte API,
Store it in a vector database (ChromaDB), where semantically similar content is clustered together,
Let users query naturally ("Show me coffees with fruity notes grown above 2,000 meters"),
Retrieve relevant context from the vector store,
Generate answers using OpenAI's GPT-4 with that context.
But you could swap out coffee for restaurant menus, real estate listings, product catalogs, or any domain where information changes frequently.

Part 1: The scraper - handling JavaScript-heavy sites
To arm my bot with the knowledge I want to interrogate it about, I will be scraping my favorite coffee roasting store, Dak Coffee Roasters (I love their freshly roasted Colombian coffee; Milky Cake is my favorite).

But first, let's talk about the elephant in the room: modern websites are JavaScript nightmares for traditional scrapers, and Dak’s is no exception. It's built with all the modern web conveniences that make life difficult for scraping.
This is where Zyte API becomes your best friend.
The spider implementation
Here's the core of my spider (dak_coffee.py):
That browserHtml: true parameter? That's Zyte spinning up a real browser, executing all the JavaScript, and handing you back fully rendered HTML. No Selenium gymnastics, no Puppeteer configuration hell, no worrying about bot detection. It just works.
The secret sauce: Auto-extraction
But here's where it gets really interesting. Instead of writing fragile CSS selectors or XPath expressions that break every time the site redesigns, I'm using Zyte's pageContent auto-extraction.
Think of pageContent as a one-call content fetcher: URLs go in; clean, structured data comes out. While a regular Zyte API request returns a target page’s HTML, passing pageContent: true will strip out all the noise and return just the text you want wrapped in JSON.
This is perfect for an LLM project where we are playing with limited tokens and don’t want to send the entire HTML for processing.
What you get back is structured, LLM-ready content without writing a single selector:
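The exact payload depends on your Zyte API configuration, but the shape is roughly the following. The values here are made up for illustration; only the item_main key is taken from the article:

```python
# Illustrative shape of a pageContent-style extraction result.
# Field values are invented; "item_main" holds the cleaned page text.
sample_item = {
    "url": "https://www.dakcoffeeroasters.com/shop/milky-cake",
    "pageContent": {
        "item_main": (
            "Milky Cake | Colombia | Thermal shock process | "
            "Notes of cake batter, panela, vanilla | 1,700 masl"
        ),
    },
}
```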
That item_main key contains all the semantically important content from the page - coffee descriptions, tasting notes, origins, processing methods, and so on - already extracted and cleaned.
For a RAG pipeline, this is gold. You don't need to teach your scraper about DOM structure; you just need the content.
Why this matters for RAG
Traditional scraping forces you to think in terms of page structure: "The coffee name is in an H2 with class 'product-title'." But LLMs don't care about your DOM tree; they care about semantic content.
Zyte's automatic extraction bridges that gap. One API call gets you:
Rendered JavaScript content (handling the modern web).
Structured extraction (no selector maintenance).
LLM-ready text (semantic content, not HTML soup).
For my coffee bot, this means I can scrape the entire catalog in minutes and get data that's immediately useful for embeddings.
Part 2: The RAG pipeline - from text to intelligence
Now that we've got fresh coffee data in JSON format, it’s time to make it queryable. I used LangChain, an LLM application framework that makes it easy to wire up the RAG pipeline.
Building the vector store
The RAG pipeline (coffeebot_RAG_Pipeline.ipynb) follows a straightforward flow:
1. Load and structure the data:
Our scraped data is stored as JSON, which we load for processing. LangChain provides document loaders for many data types, including PDF, HTML, and CSV. If your source is a different format, swap this segment for the appropriate LangChain document loader.
2. Chunk it intelligently: LLMs have finite context windows
Chunking is the preprocessing step of splitting large documents into smaller, manageable text segments. It is essential for staying within LLM context-window limits and for optimizing Retrieval-Augmented Generation (RAG): the model focuses on the most relevant, context-rich segments instead of processing entire files.
Small chunks (400 characters) work well here because coffee descriptions are naturally concise. No text overlap is needed, since each product is self-contained in JSON. But if you’re using another document type, like PDF, you’ll need to define a sensible overlap so sentences aren’t cut off between chunks.
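LangChain's text splitters (e.g. RecursiveCharacterTextSplitter with chunk_size=400, chunk_overlap=0) handle this step; the core idea, in plain Python, is just a sliding window:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 0) -> list[str]:
    """Split text into character chunks; overlap > 0 repeats the tail of each
    chunk at the head of the next so sentences aren't cut off mid-thought."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

LangChain's splitters add refinements on top of this (splitting on paragraph and sentence boundaries before falling back to raw characters), which matters most for prose-heavy formats like PDF.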
3. Generate embeddings and store them in a vector database (here named vector_store):
Embeddings are mathematical representations of the data, typically stored as rows of a matrix in a vector database.
LLM application workflows become seamless with embeddings and vector databases. To generate ours, we’re using an existing embedding model from OpenAI. One rule to remember: queries must be embedded with the same model used at indexing time - otherwise the query and document vectors won’t live in the same semantic space, and retrieval breaks.
I'm using OpenAI's text-embedding-3-large model for embeddings. At 3,072 dimensions, it captures nuanced semantic relationships. When someone asks for "coffees with fruity notes," the embedding model understands that "notes of strawberry and citrus" is semantically similar.
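In the notebook this is a couple of LangChain calls (OpenAIEmbeddings plus a Chroma vector store), which need an API key to run. To show what the vector store is actually doing, here is the retrieval mechanic with a toy stand-in embedding - hashed bag-of-words, good enough to demonstrate nearest-neighbor search, useless for real semantics:

```python
import math


def toy_embed(text: str, dims: int = 256) -> list[float]:
    # Stand-in for a real embedding model (the article uses
    # text-embedding-3-large). Each word bumps one bucket of the vector.
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[hash(word) % dims] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding."""
    q = toy_embed(query)
    return sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)[:k]
```

The real pipeline does exactly this, except the embedding function understands meaning, so "fruity" lands near "notes of strawberry and citrus" even with zero shared words.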
The retrieval chain
Here's where LangChain shines. The retriever pulls relevant context, and the LLM generates coherent answers:
That k=50 is intentional. I want to retrieve all potentially relevant coffees, not just the top few. The system prompt template then instructs the LLM to list everything that matches:
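The project's actual prompt isn't reproduced here; a hypothetical version in its spirit looks like this. In LangChain you would feed it through a ChatPromptTemplate, with the context coming from vector_store.as_retriever(search_kwargs={"k": 50}):

```python
# Hypothetical system prompt in the spirit of the article's instructions:
# exhaustive matching, grounded answers, no invention.
SYSTEM_PROMPT = """You are a specialty-coffee catalog assistant.
Answer ONLY from the context below. List EVERY coffee that matches the
user's criteria - do not stop after a few examples. If nothing in the
context matches, say so plainly; never invent a coffee.

Context:
{context}"""


def build_prompt(retrieved_chunks: list[str], question: str) -> str:
    """Fill the template with retrieved context, then append the question."""
    context = "\n\n".join(retrieved_chunks)
    return SYSTEM_PROMPT.format(context=context) + f"\n\nQuestion: {question}"
```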
This is crucial for a product recommendation system. Users don't want "Here are three examples." They want the full catalog of options that match their criteria.
Taste-testing my coffee bot
I run inference straight from a Jupyter notebook inside Visual Studio Code: I send a prompt and get a response from OpenAI, right in the notebook.
Asking my chatbot to show the latest washed Ethiopian coffees was an “a-ha!” moment.

I tried going one step further, asking which two coffees I could blend to get acidic and floral notes in the final cup. The assistant delivered!
Now that I have a companion I can geek out with, I’ll add more roasters down the line and can’t wait to learn more about coffee in an interactive way.
Why this architecture works
Let me break down what makes this concept powerful:
1. Fresh data, always
Run the scraper daily (or hourly), and your bot always knows the current inventory. Sold out of that Ethiopian Yirgacheffe? The bot knows. New Guatemala Huehuetenango just dropped? The bot knows.
2. Semantic search over keyword matching
Traditional databases require exact matches. Vector stores understand meaning:
"Fruity" matches "notes of berry and citrus."
"High altitude" matches "grown at 2,000 meters."
"Washed process" matches "wet-processed."
3. Scalability
This same architecture scales from 50 coffees to 50,000 products. The scraper runs independently, the vector store handles millions of embeddings efficiently, and the LLM only sees relevant context.
4. No hallucinations
By grounding the LLM in retrieved context and explicitly telling it not to invent information, you get answers anchored in real inventory. The bot won't recommend coffees that don't exist.
Real-world applications beyond coffee
This pattern isn't just for coffee nerds like me. You might create a web-data-fuelled RAG engine for:
E-commerce assistants: "Find me Bluetooth headphones under $100 with noise cancellation."
Real estate bots: "Show me three-bedroom apartments near transit in my price range."
Restaurant recommendation systems: "Vegetarian-friendly Italian restaurants with outdoor seating."
Documentation search: Keep your LLM updated on your ever-changing API docs.
Market research: Track competitor products and pricing automatically.
This architecture fits any domain where:
Information changes frequently.
The source is web-based (even JavaScript-heavy).
Users need natural language querying.
Accuracy matters (no hallucinations).
Getting started
Want to build your own? The complete source code for this project is available on GitHub. Feel free to fork it, break it, and build something interesting with it. That's what demo projects are for.
Here's what you need:
1. Clone and setup:
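A hypothetical setup sequence - substitute the repository URL from the GitHub link above, and adjust paths to match the repo layout:

```shell
# Repo URL is a placeholder; dependency list is an assumption.
git clone <repo-url> coffee-rag && cd coffee-rag
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt   # scrapy, scrapy-zyte-api, langchain, chromadb, openai
```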
2. Add your API keys:
ZYTE_API_KEY in scraper/coffee_scraper/settings.py
OPENAI_API_KEY in your environment
3. Scrape fresh data:
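Something along these lines - the spider name is assumed from dak_coffee.py, and the output path is a guess at the repo layout:

```shell
# -O overwrites the output file on each run, so the data stays fresh.
cd scraper
scrapy crawl dak_coffee -O ../rag-pipeline/data/coffees.json
```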
4. Run the notebook:
Open rag-pipeline/notebook/coffeebot_RAG_Pipeline.ipynb and execute the cells.
The entire pipeline from scraping to querying takes less than five minutes to run for the first time.
The bigger picture
Here's what this project taught me: the gap between LLM knowledge and current reality is an opportunity which web scraping beautifully bridges .
We don't need artificial general intelligence (AGI) to have useful, intelligent assistants. We need:
Tools to fetch current data (web scraping).
Ways to make that data semantically searchable (embeddings in a vector database).
Methods to ground LLM responses in facts (RAG).
Zyte API handles the first part elegantly, especially for modern JavaScript-heavy sites. The pageContent automatic extraction feature means you spend less time fighting with selectors and more time building intelligence on top of your data.
ChromaDB and LangChain handle the second and third parts with minimal boilerplate.
The result? A chatbot that actually knows what it's talking about.
What's next?
This is a starting point, not a finished product. Some ideas for extending it:
Add image search (multi-modal RAG): Zyte API can extract product images, too.
Multi-source scraping: Combine data from multiple roasters.
Preference learning: Remember user preferences across sessions.
Price tracking: Alert users when coffees go on sale.
Brew method recommendations: Match coffees to brewing equipment.
The architecture supports all of this. That's the power of separating concerns: scraping, storage, and generation each do one thing well.
Final thoughts
AI is powerful, but it's not omniscient. The internet is vast, but it's not static. Web scraping and RAG pipelines are how we bridge that gap - keeping AI grounded in current reality.
Whether you're building a coffee bot, a customer service assistant or a market research tool, the pattern is the same: scrape fresh data, embed it, retrieve what's relevant, and generate answers grounded in it.
And, if your data source happens to be a JavaScript-heavy modern website? Well, that's what Zyte API is for.
Now if you'll excuse me, I need to ask my bot which Ethiopian coffees are currently in stock. Because, unlike my LLM's training data, those change weekly.
