How to Build a RAG App, for Beginners: Local LLMs, Ollama, and LangChain

This tutorial is for developers, designers who code, or anyone new to AI who wants a hands-on introduction to building a custom AI chatbot that can search and answer questions using your own data.


I wanted to build an AI-powered tool for our team, but I had zero experience building AI applications—so I decided to figure it out. While researching, I found this YouTube video by Santiago Valdarrama on building a Retrieval-Augmented Generation (RAG) system with LangChain, and it turned out to be a great starting point.

Instead of just following along, I broke everything down step by step, adding explanations and extra context to help me understand what was happening. This walkthrough is my way of organizing what I learned—and hopefully making it easier for anyone else figuring this out for the first time!

Fair warning—this is long. Like, really long. If you’re looking for a quicker, more to-the-point version, go watch the video—it’s great: concise and easy to follow. But if you’re like me and want to really understand what’s happening under the hood, then roll up your sleeves.

This tutorial will walk through how to build a simple LangChain app that:

  1. Loads a PDF document.
  2. Splits it into pages.
  3. Converts each page into embeddings.
  4. Uses a retriever to fetch the most relevant pages based on a question.
  5. Invokes a chain (with a prompt template and parser) to answer the question.

In this tutorial, you will learn how to:

  • Install and run an open‐source local LLM using Ollama.
  • Switch between GPT models (via an API) and local LLMs without changing your application code.
  • Use LangChain to build chains, incorporate custom prompt templates, and retrieve relevant documents.
  • Build a simple retrieval-augmented generation (RAG) system to answer questions from a PDF.
  • And, if you’re not already familiar, you’ll learn what some of these words mean.

Part One: Project Setup in VS Code

In this section, we’ll set up our development environment to run a local large language model (LLM). By the end, you’ll have a working Jupyter Notebook where you can seamlessly switch between OpenAI’s API and a locally running model like LLaMA 2 or Mixtral.

Already familiar with this setup? You might want to skip ahead and check out Santiago Valdarrama’s GitHub project (llm), which accompanies his YouTube tutorial on building a Retrieval-Augmented Generation (RAG) system with LangChain. Otherwise, let’s get started!


This tutorial assumes you have a basic familiarity with programming concepts, command-line usage, and Visual Studio Code (VS Code). It also uses Jupyter notebooks inside VS Code.

⛔ Dependencies: Here’s what you’ll need to get up and running:

  • Visual Studio Code
  • Jupyter (extension for Visual Studio Code)
  • Python (extension for Visual Studio Code)

1. Installing Ollama and Downloading Models

Ollama is a lightweight tool that acts as a wrapper around several open-source LLMs (like Llama 2, Mixtral, etc.) to run them via a common interface.

Download and Install Ollama:

  • Visit https://ollama.com/ (or the appropriate download page) and download the version for your operating system (Mac, Linux, or Windows).
  • Follow the installation instructions. The first time you run it, it may prompt you to install command-line tools or download a model (e.g., Llama 2).
  • Using the Command Line (Terminal on Mac):
# List available commands by typing:
ollama --help

# Install a model (for example, Llama 2) by running:
ollama pull llama2

# Verify your installed models:
ollama list

# To start serving a model locally, run:
ollama run llama2

You can now interact with your model through the command line. Try typing a prompt like “tell me a joke”. For other models, view the list here: https://ollama.com/search
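Beyond the CLI, Ollama also exposes a local HTTP API (by default at http://localhost:11434), which is what tools like LangChain talk to under the hood. As a rough sketch of what a request looks like, here’s a helper that builds the JSON payload for the /api/generate endpoint—the model name and prompt are just examples:

```python
import json
from urllib import request

def build_generate_request(model: str, prompt: str) -> request.Request:
    """Build a POST request for Ollama's local /api/generate endpoint."""
    payload = {
        "model": model,    # e.g., "llama2"
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of a token stream
    }
    return request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("llama2", "Tell me a joke.")
# To actually send it (requires `ollama run llama2` to be active):
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The actual network call is commented out so the snippet runs even when Ollama isn’t serving.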

2. Creating a Project Directory:

First we need to set up a project workspace in VS Code, ensuring there is a clean and isolated environment to work with. We’ll create a dedicated directory, set up a Jupyter Notebook, and configure a virtual environment along with environment variables for sensitive data like your API keys.

Create a new directory (e.g., local-model) and open it in VS Code. You can do this manually or via the command line: mkdir local-model

3. Set Up a Jupyter Notebook:

  • Create a new Jupyter Notebook (e.g., notebook.ipynb) in your project.
  • If you haven’t already, install the Jupyter and Python extensions for VS Code.

4. Creating a Virtual Environment:

We’re using a virtual environment to keep dependencies isolated and ensure that installing packages doesn’t interfere with other projects.

  • Open the terminal in VS Code and run: python3 -m venv .venv
  • Activate the virtual environment: source .venv/bin/activate

5. Making Sure It All Runs:

Now, let’s confirm that everything is set up correctly. Open your Jupyter Notebook, add print("Hello World") to a cell, and run it. When it runs, you’ll be prompted to select a kernel. Choose Python Environments, then select the .venv interpreter you just created. If “Hello World” prints, your environment is working.

6. Setting Up Libraries & Environment Variables:

Create an environment file (e.g., .env) inside your project folder to store your API keys and other configuration. Get your OpenAI API key and store it in a variable:

OPENAI_API_KEY=your_openai_key_here

In your notebook, load these variables using Python’s os library or a dedicated library (like dotenv).
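If you’re curious what load_dotenv does behind the scenes, it essentially reads KEY=value lines from the file and puts them into the process environment. Here’s a minimal stdlib-only sketch of the idea—not a replacement for python-dotenv, which also handles quoting and other edge cases (DEMO_API_KEY and the file name are made up for the demo):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Read simple KEY=value lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file
with open(".env.demo", "w") as f:
    f.write("# a comment\nDEMO_API_KEY=abc123\n")
load_env_file(".env.demo")
os.remove(".env.demo")
print(os.getenv("DEMO_API_KEY"))  # abc123
```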

#################################################################################
### IMPORT LIBRARIES
#################################################################################
import os 

# Library that reads environment variables in the .env files
from dotenv import load_dotenv 
load_dotenv()

# print("Hello")

#################################################################################
### IMPORT - Load environment variables
#################################################################################

# Get the OpenAI Key
# Why do we need to use an openai api when we are running models locally?
# We want to test everything we are doing locally with openai/gpt to see how they compare. 
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

#################################################################################
### DEFINE MODELS
#################################################################################
MODEL = "gpt-3.5-turbo"  # OpenAI chat model
# MODEL = "mixtral:8x7b"  # Locally running open-source model
# MODEL = "llama2"

In order to read environment variables, install python-dotenv in the terminal by running pip install python-dotenv

Why use a local LLM?

Before diving into the code, here are some reasons why you might want to run a local LLM:

  • Cost Efficiency: Open-source models can be significantly cheaper than using external APIs.
  • Privacy: Keeping everything in-house avoids sending data to third-party APIs.
  • Offline Usage: Local models are ideal for edge devices, robotics, or environments with no internet connectivity.
  • Backup: They can serve as a backup if the external API is unavailable.

At this point, we have:

  • A locally running AI model using Ollama.
  • A working Jupyter Notebook environment.
  • A virtual environment to manage dependencies.
  • The ability to switch between OpenAI’s API and a local model like LLaMA 2.

Did You Know That…

Before we move on to the next step, let’s review some vocabulary.

  • Python – Widely used in artificial intelligence and machine learning due to its large ecosystem of libraries (such as TensorFlow, PyTorch, and scikit-learn), ease of prototyping, and strong community support. For beginners, Python is a great starting point because it allows you to focus on learning AI concepts without getting bogged down in complex syntax. Most AI tutorials, courses, and research papers use Python, making it easier to find resources, examples, and help as you learn.
  • Jupyter Notebook – An interactive development environment (IDE) that runs Python in structured cells.
  • Large Language Model (LLM) – An AI model trained to process and generate human-like text.
  • Local LLM – A language model (like LLaMA 2 or Mixtral) that runs on your local machine instead of an API-based model.
  • Ollama – A tool that simplifies running open-source large language models (LLMs) on your local machine.
  • Llama2 – An open-source LLM that can be run locally with Ollama.
  • API Key – A unique credential used to authenticate access to external services (like OpenAI’s API).

Part Two: Setting Up LangChain

What is LangChain?

LangChain is an open-source framework designed to help developers build applications powered by Large Language Models (LLMs) like GPT, Llama, and Claude. In LangChain, a Chain is a structured sequence of operations that process inputs (e.g., user queries) through one or more steps before producing an output. Chains allow you to combine multiple components, such as LLMs, retrieval systems, APIs, and logic, into a pipeline.

https://www.youtube.com/watch?v=1bUy-1hGZpI

http://youtube.com/watch?v=cDn7bf84LsM

Setting Up LangChain

First, install LangChain in the terminal so we can use it.

pip install langchain_openai
pip install langchain
pip install langchain_community

Using LangChain, create a very simple model in the notebook to make sure that the API is working.

from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(api_key=OPENAI_API_KEY,model=MODEL)
model.invoke("Tell me a joke.")

Creating a Prompt Template and Parser

When working with different language models, it’s important to understand how they return results:

  • Chat Models (e.g., ChatGPT Turbo): These models are designed for conversational interactions. Their outputs are typically wrapped in an AIMessage object (for example, AIMessage(content="...")). This structure is useful when you want to keep track of conversation context, such as differentiating between user and assistant messages.
  • Completion Models (e.g., Llama2): In contrast, completion models return a plain string as their output. They are optimized for generating text completions rather than managing dialogue context.

First, we define a function or use a conditional check to select the correct model type based on our configuration. In our code, this means checking whether the model’s name starts with “gpt” (indicating a chat model) or not (indicating a local completion model like Llama2). Depending on this check, we instantiate the model accordingly and ensure that a parser is applied to handle the differences in output formats.

This approach allows your chain to work seamlessly with either the GPT model or your local LLM without needing to change the downstream logic.

from langchain_openai.chat_models import ChatOpenAI
from langchain_community.llms import Ollama

if MODEL.startswith("gpt"):
    model = ChatOpenAI(api_key=OPENAI_API_KEY, model=MODEL)
else:
    model = Ollama(model=MODEL)

model.invoke("Tell me a joke.")

Because our application requires a consistent output format regardless of which model is used, we need to introduce a parser. The parser will convert the output—whether it’s an AIMessage or a plain string—into a standardized format (a simple string) that the rest of our pipeline can process uniformly.

from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

# Create a langchain chain
# Langchain sends a request to the model, and gets the output of the model
# Pipe the output of the model into the input of the parser. 
chain = model | parser

chain.invoke("tell me a joke")
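To make the parser’s job concrete, here’s a plain-Python sketch (no LangChain) of what normalizing the two output shapes looks like. FakeAIMessage is a made-up stand-in for LangChain’s AIMessage; the point is that chat models return an object with a .content attribute, while completion models return a bare string:

```python
from dataclasses import dataclass

@dataclass
class FakeAIMessage:
    """Stand-in for LangChain's AIMessage, for illustration only."""
    content: str

def to_text(output) -> str:
    """Normalize either a message object or a plain string to a string."""
    if hasattr(output, "content"):
        return output.content  # chat-model style (AIMessage)
    return output              # completion-model style (plain str)

print(to_text(FakeAIMessage(content="Why did the chicken...")))  # chat model
print(to_text("Why did the chicken..."))                         # completion model
```

This is essentially what StrOutputParser gives us for free: the rest of the chain can assume a plain string either way.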

⛔ If you run into errors running the above, try updating langchain:

pip install --upgrade langchain

 

Let’s recap. By this point:

  • We know how to run a model
  • We can run a model locally
  • We know how to create a LangChain chain

Did You Know That…

Before we move on to the next step, let’s review some vocabulary.

  • LangChain – LangChain is an open-source framework for building applications powered by large language models (LLMs) like GPT, Llama, and Claude. It lets you combine things like LLMs, APIs, and logic into chains—structured pipelines that take in a prompt, process it, and return a response. It’s designed to make working with LLMs more modular and flexible.
  • Prompt Template – A prompt template is a reusable format for structuring the input you send to a language model. It can include placeholders (like {question}) that get filled in at runtime. This ensures consistency and allows you to customize prompts without rewriting them every time.
  • Invoke – invoke() is a function used to call a language model and get a response. In LangChain, you use model.invoke("your input") to send a message to the model and receive the output. It’s a simple way to run the chain and see what the model returns.
  • AIMessage – An AIMessage is a special type of output object returned by chat-based models like GPT. It helps keep track of the model’s response in a conversation. Instead of just returning a string, it wraps the text in an object with metadata—like who said it and when. You can extract the actual text using .content.

 

Part Three: Building a Simple RAG System

Next, we’ll build a simple RAG (Retrieval-Augmented Generation) system that retrieves information from a PDF and uses it to answer questions.

What is a RAG system?

A RAG (Retrieval-Augmented Generation) system is an AI framework that enhances text generation by retrieving knowledge from an external source before generating a response.

It combines the strengths of:

Retrieval-based models → Finds relevant information from a database.

Generation-based models → Uses an LLM (Large Language Model) to generate an answer.

This approach improves accuracy and reduces hallucinations in AI responses.
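To make “retrieval, then generation” concrete, here’s a toy sketch of the retrieval half using simple keyword overlap. Real RAG systems rank chunks with embeddings instead, and the chunks and question below are made up for illustration:

```python
import string

def _words(text: str) -> set[str]:
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def retrieve(chunks: list[str], question: str, k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    return sorted(
        chunks,
        key=lambda c: len(_words(c) & _words(question)),
        reverse=True,
    )[:k]

chunks = [
    "The company was founded in 2020.",
    "Our offices are in Grand Rapids and Ann Arbor.",
    "We specialize in AI research.",
]
print(retrieve(chunks, "When was the company founded?"))
# → ['The company was founded in 2020.']
```

The retrieved chunk then becomes the context handed to the generation model, which is exactly the structure we’ll build with LangChain below.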

Installing PyPDF

PyPDF is a Python library for reading, manipulating, and extracting data from PDFs. However, it is NOT a document loader. Instead, LangChain’s PyPDFLoader builds on top of PyPDF to integrate it into AI-powered workflows.

  • First, find a multi-page PDF to use for this experiment. Drop it into your project folder.
  • Next, install pypdf: pip install pypdf

Setting Up a Document Loader

A Document Loader is a component that processes documents from various sources (e.g., PDFs, text files, web pages, databases) and converts them into a structured format for retrieval. It will allow us to load our PDF into our application.

LangChain provides a document loader called PyPDFLoader, which is built on top of PyPDF to facilitate PDF text extraction for AI applications.

Now, let’s load a PDF and extract text using LangChain’s PyPDFLoader:

from langchain_community.document_loaders import PyPDFLoader

# **Stores the file reference** (but doesn’t load it yet).
loader = PyPDFLoader("2025-why-work-with-atomic.pdf")

# load_and_split() loads the PDF and splits it into chunks or pages.
# 'pages' stores the extracted text in memory as a list of Document objects.
pages = loader.load_and_split() 

# Print the extracted pages as Document objects
pages 

When you run this code, you should see a list of pages from your PDF that look something like this:

[ADD SCREENSHOT OF OUTPUT]

How PyPDFLoader Uses PyPDF in LangChain

LangChain’s PyPDFLoader uses PyPDF internally to read and extract text from PDFs, making it easier to integrate PDFs into AI chatbots, search engines, and RAG systems.

What is load_and_split() doing?

  • This method loads the PDF file and splits it into smaller text chunks (usually page-by-page or based on a chunking strategy).
  • It prepares the data for retrieval-based AI models.
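Page-by-page splitting works for small PDFs, but many pipelines split on character count instead. As a rough sketch of one common chunking strategy—fixed-size chunks with overlap, in the spirit of LangChain’s text splitters—the sizes here are arbitrary:

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, which helps retrieval later.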

Understanding the pages variable

  • The pages variable stores a list of chunks (Document objects) in memory.
  • A Document is a LangChain object. Each Document object contains:
    • page_content: The extracted text.
    • metadata: Information like page number, file name, and source.

Common Operations on pages

  • pages – prints the full list of Document objects (one per page/chunk)
  • pages[0] – prints the first Document object
  • pages[0].page_content – prints the text from the first page
  • pages[0].metadata – prints metadata (e.g., page number)
  • len(pages) – prints the total number of pages extracted
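To try these operations without loading a real PDF, here’s a small stand-in: a dataclass that mimics the shape of LangChain’s Document (the sample text and metadata below are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for LangChain's Document, for illustration."""
    page_content: str
    metadata: dict = field(default_factory=dict)

pages = [
    Doc("Atomic Object is a software consultancy.", {"page": 0, "source": "demo.pdf"}),
    Doc("We build custom software products.", {"page": 1, "source": "demo.pdf"}),
]

print(len(pages))                 # total number of pages
print(pages[0].page_content)      # text of the first page
print(pages[0].metadata["page"])  # metadata lookup
```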

Create a Prompt Template

Before our app can start answering questions, we need to give it some clear instructions. Basically, we want to tell the model to only answer based on the specific information we provide—nothing from its built-in knowledge, nothing from what it learned before. Just stick to the context we give it and don’t make stuff up. If the answer isn’t in the provided information, it should just say so instead of guessing.

By default, AI models (like GPT-4) use both pre-trained knowledge and the input they receive. However, in our RAG system, we want the AI to only use the retrieved PDF data to answer questions accurately.

from langchain.prompts import PromptTemplate

# Define a custom prompt template
custom_template = """
You are an AI assistant instructed to answer questions about Atomic Object.
- Answer all questions based STRICTLY on the provided context.
- ONLY use the provided context to answer the question.
- DO NOT use prior knowledge or external sources.
- If the answer is not found in the context, say: "I couldn't find the answer in the provided text."

Context:
{context}

Question:
{question}
"""

context = "The company was founded in 2020 and specializes in AI research."
question = "When was the company founded?"

# Fill in the template
prompt = PromptTemplate.from_template(custom_template)
prompt.format(context=context, question=question)

Pass the prompt template back into our chain.

Let’s test it out.

chain = prompt | model | parser

chain.invoke(
    {
        "context": "My name is Alecia",
        "question": "What is my name?"
    }
)

When running this, the model should follow the instructions in the template (the prompt part of our chain) and answer the question based upon the context.

When we invoke our chain in our application, it may be helpful to understand what the input of the chain looks like. Running chain.input_schema.schema() will show you the schema of the chain and the inputs it expects.

Let’s Recap. By this point:

  • We have a chain that has a prompt, model, and parser.
  • We have our PDF document pages loaded into memory

Did You Know That…

Before we move on to the next step, let’s review some vocabulary.

  • RAG (Retrieval-Augmented Generation) – An approach that combines document search (retrieval) with language generation to improve answer accuracy. It pulls relevant info from external sources (like PDFs) before generating a response.
  • Retrieval-Based Model – Part of a RAG system that searches your documents to find the most relevant chunks based on the user’s question.
  • Generation-Based Model – The large language model (like GPT or LLaMA) that takes in context and generates a natural-language answer.
  • PyPDF – A Python library used to read and extract text from PDF files. LangChain builds on it for AI document processing.
  • PyPDFLoader – A document loader from LangChain that uses PyPDF under the hood. It loads PDF content and turns it into chunks that an AI can use.
  • Document Loader – A component in LangChain that pulls in data from files, URLs, or databases and turns it into a structured format (like Document objects) for AI workflows.
  • load_and_split() – A method that loads a document (like a PDF) and splits it into smaller parts—typically pages—so they can be searched or retrieved more efficiently.
  • Document Object – A LangChain object that represents a single chunk of content. Each one includes page_content or the actual text, and metadata info like page number or source.

Wrapping Up

If you made it this far, you’ve already built the foundation of a Retrieval-Augmented Generation (RAG) application! Starting from scratch, we:

  • set up a development environment
  • ran a local large language model with Ollama
  • and connected everything through LangChain to create a simple AI workflow.

Along the way, we explored how to switch between API-based models like GPT and locally running models such as Llama2, how LangChain chains structure interactions with language models, and how prompt templates and parsers help standardize outputs. We also walked through the basics of a RAG system by loading a PDF, breaking it into retrievable chunks, and instructing the model to answer questions using only the provided context.

While this tutorial focuses on the core concepts and setup, it’s really just the beginning. The full project can be extended in many ways—adding embeddings and vector databases for smarter retrieval, building a user interface, connecting multiple data sources, or deploying the system as a real tool.

If you’d like to keep building on what we started here, I highly recommend continuing with the original YouTube video by Santiago Valdarrama. His video walks through the remaining steps of the project and provides a great visual guide for expanding the RAG system further. You can follow along with the video using the setup and explanations we covered here.

The important takeaway is that building AI applications doesn’t require deep machine learning expertise to get started. With tools like Ollama, LangChain, and open-source models, it’s possible to experiment, learn, and build useful AI-powered tools step by step—and I hope this guide helped make that first step a little clearer.
