Retrieval Augmented Generation (RAG)
If you're using Ollama, note that it defaults to a 2048-token context length. This severely limits Retrieval-Augmented Generation (RAG) performance, especially for web search, because retrieved data may not be used at all or only partially processed.
Retrieval Augmented Generation (RAG) is a cutting-edge technology that enhances the conversational capabilities of chatbots by incorporating context from diverse sources. It works by retrieving relevant information from a wide range of sources such as local and remote documents, web content, and even multimedia sources like YouTube videos. The retrieved text is then combined with a predefined RAG template and prefixed to the user's prompt, providing a more informed and contextually relevant response.
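Conceptually, the "template plus retrieved context" step can be sketched as follows. Note that the template text and function name here are illustrative only, not Open WebUI's actual default RAG template:

```python
def build_rag_prompt(retrieved_chunks, user_query):
    """Combine retrieved context with a RAG template and the user's prompt.

    The template below is a hypothetical example for illustration,
    not Open WebUI's built-in default.
    """
    context = "\n\n".join(retrieved_chunks)
    template = (
        "Use the following context to answer the question.\n"
        "### Context:\n{context}\n\n"
        "### Question:\n{query}"
    )
    return template.format(context=context, query=user_query)

# The retrieved chunks are prefixed to the user's question in one prompt.
prompt = build_rag_prompt(["Doc A says X.", "Doc B says Y."], "What is X?")
```

The real template is configurable (see RAG Template Customization below), but the principle is the same: retrieved text is placed ahead of the user's question so the model answers with that context in view.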
One of the key advantages of RAG is its ability to access and integrate information from a variety of sources, making it an ideal solution for complex conversational scenarios. For instance, when a user asks a question related to a specific document or web page, RAG can retrieve and incorporate the relevant information from that source into the chat response. RAG can also retrieve and incorporate information from multimedia sources like YouTube videos. By analyzing the transcripts or captions of these videos, RAG can extract relevant information and incorporate it into the chat response.
Local and Remote RAG Integration
To access local documents, first upload them via the Documents section of the Workspace area. You can then reference them by typing # before a query and selecting the document from the box that appears above the chat box. Once selected, a document icon appears above Send a message, indicating successful retrieval.
Need to clean up multiple uploaded documents or audit your storage? You can now use the centralized File Manager located in Settings > Data Controls > Manage Files. Deleting files there will automatically clean up their corresponding RAG embeddings.
You can also load documents into the workspace area and then access them by starting a prompt with #, followed by a URL. This can help incorporate web content directly into your conversations.
Web Search for RAG
Context Length Warning for Ollama Users: Web pages typically contain 4,000-8,000+ tokens even after content extraction, including main content, navigation elements, headers, footers, and metadata. With only 2048 tokens available, you're getting less than half the page content, often missing the most relevant information. Even 4096 tokens is frequently insufficient for comprehensive web content analysis.
To Fix This: Navigate to Admin Panel > Models > Settings (of your Ollama model) > Advanced Parameters and increase the context length to at least 8192 tokens (ideally 16000 or more). This setting applies specifically to Ollama models. For OpenAI and other integrated models, ensure you're using a model with a sufficient built-in context length (e.g., GPT-4 Turbo with 128k tokens).
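As a quick sanity check, you can estimate whether extracted page text will fit in your configured context window. The 4-characters-per-token rule of thumb below is only an approximation for English text, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real tokenizer gives exact counts; this is only a quick
    # sanity check against your configured context length.
    return max(1, len(text) // 4)

page_text = "word " * 4000          # ~20,000 characters of extracted page text
needed = estimate_tokens(page_text)
fits_in_2048 = needed <= 2048       # clearly not, for a page this size
```

With Ollama's 2048-token default, a page of this size would be truncated to a fraction of its content before the model ever sees it.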
For web content integration, start a query in a chat with #, followed by the target URL. Click on the formatted URL in the box that appears above the chat box. Once selected, a document icon appears above Send a message, indicating successful retrieval. Open WebUI fetches and parses information from the URL if it can.
Web pages often contain extraneous information such as navigation and footer. For better results, link to a raw or reader-friendly version of the page.
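The kind of boilerplate stripping involved can be sketched with Python's standard-library HTML parser. This is a simplified stand-in for readability-style extraction, not Open WebUI's actual parsing code:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers.

    A simplified illustration; real content extraction is far more
    sophisticated than skipping a fixed set of tags.
    """
    SKIP = {"nav", "footer", "header", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside skipped containers
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

html = "<nav>Home | About</nav><p>The actual article text.</p><footer>(c) 2024</footer>"
extractor = MainTextExtractor()
extractor.feed(html)
main_text = " ".join(extractor.parts)
```

Even with stripping like this, residual boilerplate costs tokens, which is why reader-friendly source pages retrieve better.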
RAG Template Customization
Customize the RAG template from the Admin Panel > Settings > Documents menu.
Markdown Header Splitting
When enabled, documents are first split by markdown headers (H1-H6). This preserves document structure and ensures that sections under the same header are kept together when possible. The resulting chunks are then further processed by the standard character or token splitter.
Use the Chunk Min Size Target setting (found in Admin Panel > Settings > Documents) to intelligently merge small sections after markdown splitting, improving retrieval coherence and reducing the total number of vectors in your database.
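The core idea of header-based splitting can be sketched in a few lines. This is a simplified illustration of the concept; Open WebUI's implementation also tracks header metadata and hands sections to the secondary character/token splitter:

```python
import re

def split_by_headers(markdown: str):
    """Split a markdown document into sections at H1-H6 headers.

    Illustrative sketch only: each section keeps its header together
    with the body text beneath it.
    """
    sections, current = [], []
    for line in markdown.splitlines():
        # A new H1-H6 header closes the previous section.
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

doc = "# Intro\nHello.\n## Details\nMore text.\n### Sub\nTiny."
sections = split_by_headers(doc)
```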
Chunking Configuration
Open WebUI allows you to fine-tune how documents are split into chunks for embedding. This is crucial for optimal retrieval performance.
- Chunk Size: Sets the maximum number of characters (or tokens) per chunk.
- Chunk Overlap: Specifies how much content is shared between adjacent chunks to maintain context.
- Chunk Min Size Target: Although Markdown Header Splitting is excellent for preserving structure, it can often create tiny, fragmented chunks (e.g., a standalone sub-header, a table of contents entry, a single-sentence paragraph, or a short list item) that lack enough semantic context for high-quality embedding. You can counteract this by setting the Chunk Min Size Target to intelligently merge these small pieces with their neighbors.
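The interaction of Chunk Size and Chunk Overlap can be illustrated with a minimal character-based splitter. This is a sketch of the concept, not Open WebUI's actual splitter:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int):
    """Character-based splitting with overlap (illustrative sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Each chunk starts `step` characters after the previous one,
    # so adjacent chunks share `chunk_overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
# each chunk shares its last 2 characters with the start of the next
```

Larger overlap preserves more cross-chunk context at the cost of more chunks (and therefore more vectors) for the same document.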
Why use a Chunk Min Size Target?
Intelligently merging small sections after markdown splitting provides several key advantages:
- Improves RAG Quality: Eliminates tiny, meaningless fragments, ensuring better semantic coherence in each retrieved chunk.
- Reduces Vector Database Size: Fewer chunks mean fewer vectors to store, reducing storage costs and memory usage.
- Speeds Up Retrieval & Embedding: A smaller index is faster to search, and fewer chunks require fewer embedding API calls (or less local compute). This significantly accelerates document processing when uploading files to chats or knowledge bases, as there is less data to vectorize.
- Efficiency & Impact: Testing has shown that a well-configured threshold (e.g., 1000 for a chunk size of 2000) can reduce chunk counts by over 90% while improving accuracy, increasing embedding speed, and enhancing overall retrieval quality by maintaining semantic context.
How the merging algorithm works (technical details)
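A minimal sketch of such a merge pass is shown below. This is a plausible illustration of the idea behind Chunk Min Size Target, not Open WebUI's exact implementation; the real merge heuristics may differ:

```python
def merge_small_chunks(chunks, min_size: int):
    """Greedy merge: chunks below min_size are folded into a neighbor.

    Illustrative sketch only; the actual Open WebUI algorithm may use
    different merge rules (e.g. respecting header boundaries).
    """
    merged = []
    for chunk in chunks:
        # Merge whenever either side of the boundary is undersized.
        if merged and (len(merged[-1]) < min_size or len(chunk) < min_size):
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged

# A standalone sub-header and a one-line paragraph get merged into
# their neighbor instead of becoming tiny, low-context vectors.
sections = ["## Sub-header", "A short line.", "A much longer section " * 20]
merged = merge_small_chunks(sections, min_size=50)
```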
RAG Embedding Support
Change the RAG embedding model directly in the Admin Panel > Settings > Documents menu. This feature supports Ollama and OpenAI models, enabling you to enhance document processing according to your requirements.
Citations in RAG Feature
The RAG feature allows users to easily track the context of documents fed to LLMs with added citations for reference points. This ensures transparency and accountability in the use of external sources within your chats.
File Context vs Builtin Tools
Open WebUI provides two separate capabilities that control how files are handled. Understanding the difference is important for configuring models correctly.
File Context Capability
The File Context capability controls whether Open WebUI performs RAG (Retrieval-Augmented Generation) on attached files:
| File Context | Behavior |
|---|---|
| ✅ Enabled (default) | Attached files are processed via RAG. Content is retrieved and injected into the conversation context. |
| ❌ Disabled | File processing is completely skipped. No content extraction, no injection. The model receives no file content. |
When to disable File Context:
- Bypassing RAG entirely: When you don't want Open WebUI to process attached files at all.
- Using Builtin Tools only: If you prefer the model to retrieve file content on-demand via tools like query_knowledge_bases rather than having content pre-injected.
- Debugging/testing: To isolate whether issues are related to RAG processing.
When File Context is disabled, file content is not automatically extracted or injected. Open WebUI does not forward files to the model's native API. If you disable this, the only way the model can access file content is through builtin tools (if enabled) that query knowledge bases or retrieve attached files on-demand (agentic file processing).
Individual files and knowledge bases can also be set to bypass RAG entirely using the "Using Entire Document" toggle. This injects the full file content into every message regardless of native function calling settings. See Full Context vs Focused Retrieval for details.
The File Context toggle only appears when File Upload is enabled for the model.
Builtin Tools Capability
The Builtin Tools capability controls whether the model receives native function-calling tools for autonomous retrieval:
| Builtin Tools | Behavior |
|---|---|
| ✅ Enabled (default) | In Native Function Calling mode, the model receives tools like query_knowledge_bases, view_knowledge_file, search_chats, etc. |
| ❌ Disabled | No builtin tools are injected. The model works only with pre-injected context. |
When to disable Builtin Tools:
- Model doesn't support function calling: Smaller or older models may not handle the tools parameter.
- Predictable behavior needed: You want the model to work only with what's provided upfront.
Combining the Two Capabilities
These capabilities work independently, giving you fine-grained control:
| File Context | Builtin Tools | Result |
|---|---|---|
| ✅ Enabled | ✅ Enabled | Full Agentic Mode: RAG content injected + model can autonomously query knowledge bases |
| ✅ Enabled | ❌ Disabled | Traditional RAG: Content injected upfront, no autonomous retrieval tools |
| ❌ Disabled | ✅ Enabled | Tools-Only Mode: No pre-injected content, but model can use tools to query knowledge bases or retrieve attached files on-demand |
| ❌ Disabled | ❌ Disabled | No File Processing: Attached files are ignored, no content reaches the model |
- Most models: Keep both enabled (defaults) for full functionality.
- Small/local models: Disable Builtin Tools if they don't support function calling.
- On-demand retrieval only: Disable File Context, enable Builtin Tools if you want the model to decide what to retrieve rather than pre-injecting everything.
Enhanced RAG Pipeline
The togglable hybrid search sub-feature enhances RAG functionality by combining BM25 keyword search with vector retrieval, re-ranking powered by CrossEncoder, and configurable relevance score thresholds. This provides a more precise and tailored RAG experience for your specific use case.
KV Cache Optimization (Performance Tip)
For professional and high-performance use casesβespecially when dealing with long documents or frequent follow-up questionsβyou can significantly improve response times by enabling KV Cache Optimization.
The Problem: Cache Invalidation
By default, Open WebUI injects retrieved RAG context into the user message. As the conversation progresses, follow-up messages shift the position of this context in the chat history. For many LLM enginesβincluding local engines (like Ollama, llama.cpp, and vLLM) and cloud providers / Model-as-a-Service providers (like OpenAI and Vertex AI)βthis shifting position invalidates the KV (Key-Value) prefix cache or Prompt Cache, forcing the model to re-process the entire context for every single response. This leads to increased latency and potentially higher costs as the conversation grows.
The Solution: RAG_SYSTEM_CONTEXT
You can fix this behavior by enabling the RAG_SYSTEM_CONTEXT environment variable.
- How it works: When RAG_SYSTEM_CONTEXT=True, Open WebUI injects the RAG context into the system message instead of the user message.
- The Result: Since the system message stays at the absolute beginning of the prompt and its position never changes, the provider can effectively cache the processed context. Follow-up questions then benefit from instant responses and cost savings because the "heavy lifting" (processing the large RAG context) is only done once.
If you are using Ollama, llama.cpp, OpenAI, or Vertex AI and frequently "chat with your documents," set RAG_SYSTEM_CONTEXT=True in your environment to experience drastically faster follow-up responses!
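The position effect can be sketched with plain message lists. The function below is an illustration of where the context lands in each mode, not Open WebUI's actual message-assembly code:

```python
def build_messages(rag_context, history, system_context=True):
    """Show where RAG context lands depending on RAG_SYSTEM_CONTEXT.

    Illustrative only: real message assembly in Open WebUI is more
    involved (templates, citations, multiple sources).
    """
    if system_context:
        # Stable prefix: the large context sits at position 0 on every
        # turn, so the engine's KV/prompt cache remains valid as the
        # chat history grows.
        return [{"role": "system", "content": rag_context}] + history
    # Default behavior: context is glued onto the latest user message,
    # so its position shifts each turn and invalidates the cached prefix.
    earlier, last = history[:-1], history[-1]
    return earlier + [
        {"role": "user", "content": rag_context + "\n\n" + last["content"]}
    ]

history = [{"role": "user", "content": "Summarize the doc."}]
stable = build_messages("<large retrieved context>", history, system_context=True)
default = build_messages("<large retrieved context>", history, system_context=False)
```

In the `stable` variant every follow-up turn reuses the identical prompt prefix, which is exactly what prefix/prompt caches require.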
YouTube RAG Pipeline
The dedicated RAG pipeline for summarizing YouTube videos via video URLs enables smooth interaction with video transcriptions directly. This innovative feature allows you to incorporate video content into your chats, further enriching your conversation experience.
Document Parsing
A variety of parsers extract content from local and remote documents. For more, see the get_loader function.
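Conceptually, a loader is picked per document type. The dispatch table below is a hypothetical sketch of the idea; the parser names are invented for illustration, and the real logic lives in Open WebUI's get_loader function:

```python
from pathlib import Path

def pick_loader(filename: str) -> str:
    """Map a file extension to a parser name.

    Illustrative only: the loader names here are made up, and the real
    dispatch in Open WebUI's get_loader covers many more formats.
    """
    loaders = {
        ".pdf": "pdf-parser",
        ".docx": "docx-parser",
        ".md": "text-loader",
        ".txt": "text-loader",
    }
    # Fall back to plain-text handling for unknown extensions.
    return loaders.get(Path(filename).suffix.lower(), "fallback-text-loader")

loader = pick_loader("report.PDF")
```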
When using Temporary Chat, document processing is restricted to frontend-only operations to ensure your data stays private and is not stored on the server. Consequently, advanced backend parsing (used for formats like complex DOCX files) is disabled, which may result in raw data being shown instead of parsed text. For full document support, use a standard chat session.
Google Drive Integration
When paired with a Google Cloud project that has the Google Picker API and Google Drive API enabled, this feature lets users access their Drive files directly from the chat interface and upload documents, slides, sheets, and more as context for the chat. It can be enabled from the Admin Panel > Settings > Documents menu, and requires the GOOGLE_DRIVE_API_KEY and GOOGLE_DRIVE_CLIENT_ID environment variables to be set.
Detailed Instructions
1. Create an OAuth 2.0 client and configure both the Authorized JavaScript origins and the Authorized redirect URIs to be the URL (including the port, if any) you use to access your Open-WebUI instance.
2. Make a note of the Client ID associated with that OAuth client.
3. Make sure that you enable both the Google Drive API and the Google Picker API for your project.
4. Set your app (project) to Testing and add your Google Drive email to the user list.
5. Set the permission scope to include everything those APIs have to offer. Because the app is in Testing mode, Google requires no verification for it to access the data of the limited test users.
6. Go to the Google Picker API page and click the Create credentials button.
7. Create an API key, and under Application restrictions choose Websites. Then add your Open-WebUI instance's URL, matching the Authorized JavaScript origins and Authorized redirect URIs settings from step 1.
8. Set up API restrictions on the API key so it only has access to the Google Drive API and the Google Picker API.
9. Set the GOOGLE_DRIVE_CLIENT_ID environment variable to the Client ID of the OAuth client from step 2.
10. Set the GOOGLE_DRIVE_API_KEY environment variable to the API key value from step 7 (NOT the OAuth client secret from step 2).
11. Set GOOGLE_REDIRECT_URI to your Open-WebUI instance's URL (including the port, if any).
12. Relaunch your Open-WebUI instance with those three environment variables set.
13. Finally, make sure Google Drive is enabled under Admin Panel > Settings > Documents > Google Drive.