Making Sense of AI Terminology: LLM & RAG Basics That Matter

Here's a plain English explanation of key Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) concepts.

Large Language Model (LLM) A sophisticated neural network trained on vast amounts of text data, enabling it to understand and generate human-like text. Examples include GPT-4 and Claude.

Basic LLM Architecture

Small Language Model (SLM) A more compact version of an LLM that trades some capabilities for efficiency, making it faster, cheaper to run, and practical for specific business tasks. Examples include Mistral 7B, Gemma 9B, Phi-3.5 Mini, Llama3.2-1B.

Prompt The input text provided to an LLM or SLM that guides what the model generates in response. A prompt can range from a simple question to complex instructions with examples and context. The quality and structure of a prompt significantly affect the usefulness of the model's output, making effective prompt design a key skill for working with LLMs: a well-crafted prompt can make the difference between a vague, generic response and a precise, accurate one.

Tokens and Tokenization How AI models break text into processable pieces. Tokenization directly affects costs for proprietary and hosted models (you pay per token) as well as system performance. For example, "Strawberry" might be split into three tokens: "str", "aw", and "berry".

Tokens in Strawberry
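
To see this in practice, here is a minimal sketch using the tiktoken library (an assumption: you have it installed; other models use different tokenizers, so the splits will vary):

    import tiktoken

    # Load the tokenizer used by many recent OpenAI models
    enc = tiktoken.get_encoding("cl100k_base")

    token_ids = enc.encode("Strawberry")
    print(token_ids)                             # the numeric token IDs
    print([enc.decode([t]) for t in token_ids])  # the text piece behind each ID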

Context Window The amount of text an LLM can consider at once. Larger windows enable processing more information but increase costs for proprietary and paid models. Modern models range from 4,000 to 2,000,000 tokens.

Context Window Visualization

Input/Output Tokens Distinguished because pricing often differs between tokens used in prompts versus those generated in responses. Input tokens are those in the prompt; output tokens are those in the model's response.
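
Because the two are priced differently, cost estimates need to treat them separately. A minimal sketch with placeholder prices (substitute your provider's actual rates):

    # Hypothetical prices -- check your provider's current rate card
    INPUT_PRICE_PER_M = 3.00    # USD per million input tokens
    OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens

    def estimate_cost(input_tokens: int, output_tokens: int) -> float:
        """Estimate the cost of one request in USD."""
        return (input_tokens * INPUT_PRICE_PER_M
                + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

    # A 2,000-token prompt that produces a 500-token response
    print(f"${estimate_cost(2000, 500):.4f}")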

Temperature A setting controlling how creative versus focused responses will be. Lower settings (near 0) are better for factual tasks, higher settings for creative tasks.

Temperature Effect
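
In practice, temperature is a single request parameter. A minimal sketch using the OpenAI Python client (the model name is illustrative):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Suggest a name for a hiking app."}],
        temperature=0.9,      # near 0 for factual tasks; higher for creative ones
    )
    print(response.choices[0].message.content)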

Inference When the model is actually processing inputs and generating responses - the operational phase of using an LLM - rather than the training phase.

Training The resource-intensive process of teaching a model by exposing it to vast amounts of text data and adjusting its parameters.

Training Data The information that the language model was trained upon and has knowledge of. This is commonly large datasets based on the content of the Internet.

Training Data Cutoff The date when a model stopped learning new information, after which it has no knowledge of world events or developments. This varies by model and, perhaps counterintuitively, the latest models don't always have the latest cutoff date.

Model Licensing The legal framework for using AI models:

  • Commercial models require payment and have usage restrictions
  • "Open weights" models provide more flexibility but often have limitations
  • Truly open source models would include all training code, data, and weights

Model Family A collection of related AI models that often share the same underlying architecture but come in different sizes or with different specializations. Understanding model families helps you make informed decisions about which version to implement:

  • Models within a family typically share similar capabilities but differ in size and performance
  • Larger family members generally offer better performance but require more resources and usually higher cost
  • Smaller variants trade some capabilities for faster speed and lower cost
  • Specialized variants may be optimised for specific tasks, for example coding or reasoning

Examples include the OpenAI GPT family (GPT-3.5, GPT-4), the Claude family (Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus), and open weights models like the Llama family (Llama 3 8B, 70B, 405B), the Qwen family and the DeepSeek family.

Model Versions Different iterations of the same model family, each with specific capabilities and limitations (e.g., GPT-3.5 vs GPT-4).

Hallucination When an AI confidently states something that isn't true. LLMs inherently generate plausible-sounding but potentially incorrect information. This is a major challenge, although thoughtful system architecture can significantly reduce and mitigate the risk. Hallucination leads many users to conclude, often incorrectly, that LLMs are unsuitable for many tasks.

Prompt Engineering The art of crafting effective instructions for communicating with AI models such as LLMs to achieve desired outcomes. This is often an iterative process, sometimes performed by a specialist Prompt Engineer. Unlike traditional programming with deterministic results, prompt engineering is an exploratory process.

Foundation Models Base versions of AI models trained on vast, diverse datasets without specific task orientation. These models provide general language capabilities and broad knowledge that can be adapted for specific applications. Examples include base versions of GPT, Claude, and Llama before specialized training.

Instruct Models Versions of foundation models specifically trained to follow instructions reliably. They respond more consistently to prompts, maintain better focus on the requested task, and generally produce more useful outputs for business applications. Most commercial models, such as OpenAI GPT and Claude, are instruction-tuned, while open weights families like Llama offer instruction-tuned versions such as Llama 3 8B Instruct.

Specialized Models: Code Generation Models optimised specifically for programming tasks. These models understand programming languages and patterns, can generate functional code from descriptions, and assist with debugging and code explanation. Examples include GitHub Copilot, specialized versions of GPT-4 and Claude, and purpose-built models like CodeLlama.

Multimodal Models AI systems that can process and understand multiple types of information beyond just text, such as images, audio, or video. Unlike text-only LLMs, multimodal models can "see" images, analyze visual content, and integrate this understanding with text processing. Examples include GPT-4V, Claude Opus, and Gemini, which can analyze charts, diagrams, screenshots, and photos while maintaining their language capabilities. This expansion beyond text opens new possibilities for more comprehensive AI applications that can interact with the world more like humans do.

Vision Language Model (VLM) A type of multimodal AI system specifically designed to understand and process both images and text together. Unlike text-only LLMs, VLMs can "see" images and reason about visual content in conjunction with language. These models can describe images, answer questions about visual content, and understand instructions that reference visual elements. Popular examples include GPT-4V, Claude models, and Gemini Pro Vision. VLMs represent a significant step toward more human-like AI that can perceive and communicate about the visual world.

Proprietary Models Commercial AI models like GPT-4, Claude and Gemini that offer high performance but operate as "black boxes" with subscription or usage fees.

Open Weights Models Models from families like Llama, Mistral, Qwen and DeepSeek that provide their trained model files for download and use, though often with specific terms.

Chat Templates The specific formats different models expect for conversations, crucial for reliable interaction.
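
For open weights models, libraries such as Hugging Face Transformers can apply the correct template for you. A sketch (the model name is illustrative, and some models require access approval):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is RAG?"},
    ]

    # Render the conversation in the exact format this model was trained on
    prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                           add_generation_prompt=True)
    print(prompt)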

Parameter Count The number of adjustable variables in a model (measured in millions or billions). More parameters generally indicate greater capabilities but require more computational resources.

Fine-tuning The process of adapting a pre-trained model to specific tasks or domains by training it on additional targeted data.

Transformer Architecture The fundamental design behind modern language models, using attention mechanisms to process text.

Embeddings Numerical representations of text that capture meaning in a way computers can process, crucial for search and comparison operations.
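
A sketch of how similarity between embeddings is typically measured; the three-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Close to 1.0 means similar meaning; near 0 means unrelated."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    car = np.array([0.90, 0.10, 0.30])      # made-up embedding for "car"
    vehicle = np.array([0.85, 0.15, 0.25])  # made-up embedding for "vehicle"
    banana = np.array([0.05, 0.90, 0.40])   # made-up embedding for "banana"

    print(cosine_similarity(car, vehicle))  # high -- related concepts
    print(cosine_similarity(car, banana))   # low -- unrelated concepts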

Zero-shot Learning A model's ability to perform tasks or make predictions on categories or tasks it never saw during training. The model draws on knowledge of related concepts and semantic relationships, often using descriptions or attributes of the unseen data or scenarios to make predictions.

Few-shot Learning Providing examples within the prompt to guide the model toward desired outputs.
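
For example, a few-shot prompt for sentiment labelling might look like this; the examples teach the model the expected format:

    prompt = """Classify the sentiment of each review as Positive or Negative.

    Review: The battery lasts all day and the screen is gorgeous.
    Sentiment: Positive

    Review: It stopped working after a week and support never replied.
    Sentiment: Negative

    Review: Setup took five minutes and it just works.
    Sentiment:"""
    # The model should continue the pattern and answer "Positive"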

System Prompt Instructions given to the model that define its behavior and role, setting the foundation for all subsequent interactions.
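
In chat-style APIs, the system prompt is simply the first message with the "system" role. A sketch (the company name is hypothetical):

    messages = [
        # The system prompt sets behaviour for the entire conversation
        {"role": "system", "content": "You are a concise support agent for Acme Ltd. "
                                      "Only answer questions about Acme products."},
        {"role": "user", "content": "How do I reset my Acme router?"},
    ]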

Evaluation Metrics The measurements used to assess how well an AI system is performing against business objectives. These include relevance (how directly the system addresses user queries), coherence (whether responses flow logically), accuracy (factual correctness), latency (response time), and consistency (reliability across similar queries). For RAG systems specifically, metrics also measure how well responses align with the retrieved information.

Establishing clear metrics is essential for organizations to objectively evaluate AI performance, guide improvements, and determine when systems are ready for production use. Different applications prioritize different metrics—customer service chatbots might emphasize response speed and satisfaction, while research tools would prioritize factual accuracy and thoroughness.

Batch Processing Running multiple requests simultaneously for efficiency, as opposed to processing one at a time.

API Keys Security credentials required to access LLM services, which must be properly managed to control access and costs.

Agent An AI-powered system that can take actions to accomplish goals, typically by using tools, planning multiple steps, and adapting to feedback. Agents can extend beyond simple text generation by connecting LLMs to external capabilities like web searches, API calls, or database queries. They can maintain state across interactions, break complex tasks into manageable steps, and make decisions about which tools to use when. Basic agent frameworks include ReAct, function calling chains, and tool-augmented LLMs.
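
A heavily simplified sketch of an agent loop, assuming a hypothetical call_llm() helper that returns either a final answer or a tool request:

    # Toy tools the agent may call; real agents wire these to live systems
    TOOLS = {
        "search_web": lambda query: f"(search results for {query!r})",
        "get_weather": lambda city: f"(weather report for {city})",
    }

    def run_agent(goal: str, call_llm, max_steps: int = 5) -> str:
        """Loop: ask the LLM what to do next, run the tool, feed results back."""
        history = [f"Goal: {goal}"]
        for _ in range(max_steps):
            # call_llm is hypothetical: returns {"answer": ...}
            # or {"tool": ..., "input": ...}
            action = call_llm(history)
            if "answer" in action:
                return action["answer"]
            result = TOOLS[action["tool"]](action["input"])
            history.append(f"Tool {action['tool']} returned: {result}")
        return "Step limit reached without a final answer."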

Retrieval-Augmented Generation (RAG)

A technique that enhances LLMs by retrieving relevant information from external sources before generating responses, significantly improving accuracy. Think of it as researching paragraphs from documents that contain the facts or knowledge needed to accurately answer your question. Often these facts are specific to your domain and not contained within the language model's training data.

Basic RAG Process Flow

This is how Retrieval-Augmented Generation (RAG) differs from a normal query asked of a language model:

RAG vs. Traditional LLM
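
Stripped to its essentials, the flow is retrieve-then-generate. A sketch assuming hypothetical retrieve() and generate() helpers:

    def answer_with_rag(question: str, retrieve, generate, k: int = 3) -> str:
        """retrieve() returns the k most relevant chunks; generate() calls the LLM."""
        chunks = retrieve(question, k=k)
        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using only the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return generate(prompt)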

Chunking The process of breaking down documents into smaller, manageable pieces before storing them for retrieval. Rather than treating entire documents as single units, chunking divides them into paragraphs, sections, or semantic units that can be individually embedded and retrieved. Effective chunking is crucial for RAG systems because it determines what information can be found and how precisely it can be retrieved. Chunks that are too large may contain irrelevant information, while chunks that are too small might lose important context. The way documents are chunked significantly impacts both retrieval quality and system performance.
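
A sketch of simple fixed-size chunking with overlap; production systems often split on sentence or section boundaries instead:

    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into overlapping chunks; overlap preserves context at boundaries."""
        assert 0 <= overlap < chunk_size
        step = chunk_size - overlap
        return [text[start:start + chunk_size]
                for start in range(0, len(text), step)]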

Document Processing Pipeline The series of steps that prepare documents for use in RAG systems. This typically includes collecting documents from various sources, cleaning and normalizing the text, breaking documents into chunks, creating embeddings, and storing them in a searchable database.

Vector Search The process of finding information based on meaning rather than exact keyword matches. In vector search, text is converted into numerical representations (vectors) that capture semantic meaning. When a query is made, the system finds content whose vectors are nearby, and therefore semantically similar, even if it doesn't share the same words. This allows RAG systems to retrieve information based on conceptual similarity rather than just keyword overlap. For example, a query about "vehicle maintenance schedules" might retrieve relevant documents about "car service intervals" even though the exact words don't match. This capability is what makes modern AI retrieval systems much more effective than traditional keyword search.
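
A sketch of top-k retrieval over a small in-memory collection, assuming a hypothetical embed() function that returns unit-length vectors:

    import numpy as np

    def top_k(query: str, chunks: list[str], chunk_vectors: np.ndarray,
              embed, k: int = 3) -> list[str]:
        """Return the k chunks whose embeddings are most similar to the query's."""
        q = embed(query)            # hypothetical embedding function
        scores = chunk_vectors @ q  # dot product == cosine for unit vectors
        best = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in best]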

Hybrid Search A retrieval approach that combines keyword-based search with semantic (vector) search to get the best of both worlds. While semantic search excels at understanding meaning, it might miss exact terms. Hybrid search ensures both conceptually similar and exact-match content is found, improving overall retrieval quality and addressing the limitations of either approach alone.
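
One common implementation is a weighted blend of the two scores; the 0.5 weighting below is illustrative and usually tuned per application:

    def hybrid_score(keyword_score: float, vector_score: float,
                     alpha: float = 0.5) -> float:
        """Blend keyword and semantic relevance; both inputs assumed in the 0-1 range."""
        return alpha * keyword_score + (1 - alpha) * vector_score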

Knowledge Freshness The degree to which a RAG system's information is current and up-to-date. Unlike traditional LLMs that are limited to information from their training data, RAG systems can continuously incorporate new documents, allowing them to stay current with the latest information. This makes RAG particularly valuable for domains where information changes rapidly.

Domain Adaptation The process of tailoring a RAG system to a specific industry, company, or knowledge domain. By focusing on relevant content and terminology, domain-adapted RAG systems can provide more accurate and contextually appropriate responses than general-purpose systems, making them particularly valuable for specialized business applications.

RAG Efficiency Benefits

Energy and Cost Efficiency

A consideration not commonly discussed is that RAG systems can be significantly more energy and cost efficient than relying solely on larger models, especially when combined with Small Language Models (SLMs). By retrieving specific information rather than encoding everything in model parameters, organisations can:

  • Reduce computational requirements for inference
  • Lower energy consumption and decrease the carbon footprint of AI operations
  • Maintain high-quality outputs with smaller, more efficient models
  • Achieve faster response times with lower hardware requirements

This efficiency advantage is particularly valuable for organisations balancing performance needs with sustainability goals and operational costs.

Understanding Model Types in Depth

The different model types mentioned in this glossary represent important distinctions in how AI systems are trained and optimized.

Foundation models provide the base capabilities through extensive training on diverse data, learning language patterns and general knowledge without specific focus. While powerful, they're often less predictable for business use without additional optimization.

Instruction-tuned models build on this foundation with additional training specifically designed to make them follow directions accurately. This specialized training significantly improves their reliability for practical applications, making them respond more predictably to specific requests.

Specialized models like code generators take this optimization further, with focused training on programming languages and development patterns. This specialization trades general capabilities for exceptional performance in specific domains.

Understanding these distinctions helps select the right tools for the task, balancing general capabilities against specialized performance as your AI implementation strategy evolves.

LLM Chat Interfaces

The conversational front-ends that allow non-technical users to interact with large language models through natural dialogue:

Commercial Proprietary Chat Interfaces Consumer-friendly applications that provide access to powerful LLMs through simple conversation, such as ChatGPT, Claude Chat, and Gemini. These interfaces handle the technical aspects of prompt formatting automatically and maintain conversation history for contextual understanding. They often offer both free tiers with basic capabilities and premium subscriptions with enhanced features. Commercial interfaces typically include content moderation and safety guardrails to prevent misuse and protect users.

Enterprise Chat Solutions Organisation-specific implementations of LLM technology tailored to business needs. These solutions often connect to company data sources through RAG for accurate, organisation-specific responses and enforce company-specific security and compliance policies. They can integrate with existing workflows and tools to enhance productivity within established systems, and provide custom capabilities aligned with specific business requirements and industry contexts.

Local LLM Frontends/Clients Software, often Open Source, that runs locally and provides a chat interface for LLMs and SLMs, such as LM Studio, Ollama and GPT4All.

API vs. Chat Interface Direct API access, typically used by developers to build custom applications, allows customisation of all these parameters and more, whereas chat interfaces are designed for end-user interaction.

Continue with intermediate terminology in the next article: Improving LLM & RAG Systems: Essential Concepts for Practitioners


Chris Thomas is an AI consultant helping organisations validate and implement practical AI solutions, specialising in LLMs and RAG.

Connect with me on LinkedIn

Follow me on X (Twitter)

Subscribe to Updates