Swapan Rajdev

The Only Real Moat in AI: Speed

Swapan — Fri, 25 Jul 2025 05:22:39 GMT

This Isn’t a Typical Post (But I Think It’s Important)

Typically, I write technical analyses, breakdowns, and implementation ideas. Today’s post is a little different. It’s more of a thought I’ve been sitting with, one I keep repeating to founders, builders, and myself.

If you like (or don’t like) this type of note, hit the like button or leave a quick comment. That way, I’ll know whether to share more of these in the future.

The Only Real Moat Left: Speed

Lots of people keep asking me:

“How do I build a moat in today’s AI-everywhere world?”

My answer is simple: speed of execution.

OpenAI took Codex from idea to public product in seven weeks. Eight engineers. Zero quarterly-planning detours.

Replit’s AI agent once nuked a customer’s production database. The team pulled an all-nighter, restored the data, and shipped new guardrails, all within 24 hours.

On the funding side, Dharmesh Shah jumped on a Sunday Zoom with Lovable founder Anton Osika and wired the check Monday morning.

Even the giants get it. Google and Meta are scooping up top AI talent in 48–72 hours. No red tape, just action.

For every one of these success stories, there’s another team killing ideas faster than slower teams can finish their pitch decks. That’s product velocity, deal velocity, hiring velocity, compounding quietly while others complain that “AI is too crowded.”

At Haptik, we had a rule:
Inaction is worse than failure.
In 2025, inaction isn’t just worse, it’s lethal.

If your AI idea needs 4–5 months of “alignment etc.” before users can touch it, someone faster has already eaten your lunch.

So let’s build.

Drop the overthinking. Quit waiting for permission. Ship something today.
Then learn, iterate, and ship again.

LLM Evaluation Metrics: Your Production Scorecard

Swapan — Thu, 17 Jul 2025 06:38:18 GMT

The Enterprise LLM Scaling Reality Check

You launched the GenAI app, the demo impressed everyone, but then day-to-day use exposed the cracks. Some replies are spot-on, others fail in odd ways, and every prompt tweak feels risky because you can't tell what else will break.

After shipping more than 500 enterprise chatbots, I learned the solution to all of these is a clear evaluation scorecard you can track and share. It has four simple parts:

Quality – Are the answers correct and helpful?
Outcome – Do users finish their tasks and drive real value?
Performance – Do they arrive quickly and stay within budget?
Safety & Compliance – Do they avoid toxic or private content?

The rest of this guide breaks down how to measure each part so you can improve with confidence and show the proof.

Why LLM Evaluation Metrics Are Critical for Enterprise Success

Without proper evaluation frameworks, enterprises face:

🔄 Change paralysis: No confidence in making prompt improvements without breaking functionality
📉 Quality degradation: Issues go unnoticed until users complain or satisfaction drops
🚫 No optimization path: Can't improve what you can't measure
💸 Cost explosions: Costs growing to $8, $15, even $20 per interaction

Bottom line: Evaluation isn't overhead; it's how you scale profitably and safely.

The 4 Essential LLM Evaluation Categories for Production

Think of evaluation as your executive dashboard. These four categories tell you everything you need to know about your LLM application's health:

Quality Metrics: answer correctness, relevance, context quality, and hallucination rates.
Outcome Metrics: user satisfaction, task completion, and use-case-specific success metrics.
Performance Metrics: cost per query, response time, uptime, and system reliability.
Safety & Compliance: toxicity detection, security violations, and data leak prevention.

When to Measure Each Category

Key Insight: Quality and safety metrics are your insurance policies; test them thoroughly before launch, then monitor continuously. Outcome (business) and performance metrics are the heartbeat of your production system; track them on live traffic.

12 Essential LLM Metrics Everyone Should Track

🎯 Quality Metrics

Ensure accurate, relevant responses. Test thoroughly before production, validate regularly with sample data.

Note: Answer Correctness measures "Is it TRUE?" while Answer Relevancy measures "Does it answer the QUESTION?" You need both: a response can be factually accurate but completely off-topic, or relevant but factually wrong.

💰 Outcome Metrics

Track these in real time on production traffic; they directly impact your P&L.

⚡ Performance Metrics

Ensure speed and reliability within budget. Monitor continuously in production.
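To make the performance category concrete, here is a minimal sketch of how cost per query and latency percentiles could be computed from request logs. The `QueryLog` fields and the token prices are my own illustrative assumptions, not any provider's actual rates or schema.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryLog:
    latency_ms: float        # end-to-end response time for one request
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K_PROMPT = 0.0025
PRICE_PER_1K_COMPLETION = 0.01


def cost_usd(log: QueryLog) -> float:
    """Token cost of a single request under the assumed prices."""
    return (log.prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
            + log.completion_tokens / 1000 * PRICE_PER_1K_COMPLETION)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(logs):
    latencies = [log.latency_ms for log in logs]
    return {
        "avg_cost_per_query_usd": sum(cost_usd(log) for log in logs) / len(logs),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


logs = [
    QueryLog(820, 1200, 300),
    QueryLog(640, 900, 150),
    QueryLog(2900, 4000, 800),
    QueryLog(710, 1000, 200),
]
summary = summarize(logs)
print(summary)
```

Tracking a tail percentile such as p95 alongside the median matters because a single slow, expensive request (the third one here) barely moves the average experience but is exactly what users complain about.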

🛡️ Safety & Compliance Metrics

Protect the brand and ensure compliance. Implement before launch, monitor continuously.

Enterprise Use-Case Metrics Mapping

Choose your metrics based on your primary use case. Start with "Must Have" metrics for production readiness, then add "Good to Have" for optimization.

Metric Implementation Priority Framework

🚀 Start Here (Week 1): Pick one metric from each category

💰 Business: Cost per Query OR Task Completion Rate
🎯 Quality: Answer Correctness OR Answer Relevancy
⚡ Performance: Response Time
🛡️ Safety: Toxicity Detection OR Security Violations

📈 Scale Up (Month 2): Add remaining "Must Have" metrics from your use case

🎯 Optimize (Month 3+): Layer in "Good to Have" metrics for comprehensive monitoring

Common Implementation Pitfalls to Avoid

Don't measure everything at once - Start with 4-5 core metrics
Don't ignore edge cases - Test with adversarial inputs
Don't skip baseline measurement - Establish benchmarks before optimization
Don't forget user feedback - Implicit signals matter as much as explicit ratings
Don't over-rely on manual review - Automate evaluation where possible, use human-in-the-loop only for edge cases and quality spot-checks
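The baseline point is the one I see skipped most often, so here is a rough sketch of what a baseline check before shipping a prompt change might look like. The metric names, the 5% tolerance, and the numbers are all hypothetical; the only assumption that matters is that baseline values are non-zero.

```python
# Hypothetical metric names; True means a higher value is better.
HIGHER_IS_BETTER = {
    "answer_correctness": True,
    "task_completion_rate": True,
    "p95_latency_ms": False,
    "cost_per_query_usd": False,
}
TOLERANCE = 0.05  # allow up to a 5% relative regression per metric


def regressions(baseline, candidate, tolerance=TOLERANCE):
    """Return the metrics where the candidate is worse than baseline by more than tolerance."""
    failed = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        base, cand = baseline[name], candidate[name]
        change = (cand - base) / base              # relative change vs. baseline
        worse = -change if higher_is_better else change
        if worse > tolerance:
            failed.append(name)
    return failed


baseline = {
    "answer_correctness": 0.86,
    "task_completion_rate": 0.72,
    "p95_latency_ms": 2400,
    "cost_per_query_usd": 0.012,
}
candidate = {
    "answer_correctness": 0.88,
    "task_completion_rate": 0.71,
    "p95_latency_ms": 3100,
    "cost_per_query_usd": 0.011,
}

failed = regressions(baseline, candidate)
print("Blocked by:" if failed else "Safe to ship", failed)
```

In this made-up run the new prompt improves correctness and cost but pushes tail latency up by roughly 29%, so the change is held back on that one metric.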

Conclusion: Building Your Production-Ready LLM Evaluation System

Companies succeeding with LLMs in production don't have better models; they have better measurement systems. With proper evaluation metrics, you can deploy changes confidently, optimize systematically, and scale user satisfaction.

Your Action Plan:

Choose your use case from the mapping table above
Implement "Must Have" metrics first; these are critical for production success
Set up technical infrastructure using the implementation details provided
Start optimizing based on real production data

Start measuring today. Your evaluation system is your competitive advantage.

Multi-Layered Guardrails for AI Agents

Swapan — Thu, 01 May 2025 07:42:06 GMT

Why Guardrails Matter More Than Ever

In early 2025, Meta’s AI assistant confidently told users that conservative activist Robby Starbuck had taken part in the January 6 Capitol riot. He hadn’t. The AI hallucinated it. And the damage, reputational, personal, legal, was very real.

This isn’t just a glitch. As AI agents become more autonomous, planning, executing, and acting on behalf of users, we need serious guardrails. And not the kind most people are used to.

Most organizations treat guardrails like seatbelts, a last-minute safety feature bolted onto an otherwise complete system. But in today's complex AI agents, effective guardrails must be woven into the system architecture itself.

The Narrow View: How Most People Think About Guardrails

Most guardrail implementations focus on a limited set of basic protections:

Input filters – Scan prompts for banned content
Output moderation – Flag risky responses
Prompt sanitization – Strip out sensitive data

But these are single-point fixes, better suited to chatbots than agentic systems. Today's agents reason, plan, use tools, store memory, and interact with APIs. Limiting guardrails to input/output creates massive blind spots.

A Broader Approach: Guardrails Must Span the Full Agent System

Think of agents as systems, not models. Each part needs its own protections:

Goals: Ensure objectives remain aligned with human values and organizational policies
Context: Verify that environmental information isn't misleading or manipulated
Memory: Protect against persistence of sensitive data or memory poisoning
Reasoning: Guard against logical fallacies and confirmation bias
Planning: Validate that steps align with goals and don't violate safety constraints
Tools: Control access to external systems and verify proper use
Knowledge bases: Prevent retrieval of sensitive information or toxic content

This is why we need multi-layered guardrails that span the entire agent architecture. The Swiss Cheese Model from safety engineering, recently adapted for AI systems by Shamsujjoha et al. in their paper 'Swiss Cheese Model for AI Safety' (2025), provides a useful metaphor: multiple protective layers with different strengths and weaknesses ensure that holes in any single layer won't lead to catastrophic failure.
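A back-of-the-envelope way to see why layering helps: if each layer independently misses some fraction of bad actions, the fraction that slips through every layer is the product of those miss rates. The layer names and rates below are made-up numbers, and independence is a strong assumption (real layers often fail on the same inputs), so treat this as intuition rather than a risk model.

```python
from math import prod


def combined_miss_rate(layer_miss_rates):
    """Probability that a bad action passes every layer, assuming independent layers."""
    return prod(layer_miss_rates)


# Hypothetical per-layer miss rates (fraction of bad actions each layer fails to catch).
layers = {
    "input filter": 0.10,
    "plan validation": 0.20,
    "tool permission check": 0.05,
    "output moderation": 0.10,
}

overall = combined_miss_rate(layers.values())
print(f"Best single layer misses {min(layers.values()):.0%}; all four together miss {overall:.2%}")
```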

Key Attributes of Good Guardrails

Effective guardrails protect several critical attributes:

Accuracy: Guardrails must prevent hallucinations, misinformation, and factually incorrect steps. As we saw with Meta's AI falsely linking Robby Starbuck to events he had no connection with, accuracy failures can have serious real-world consequences.
Privacy: AI systems must avoid leaking sensitive information. Samsung learned this lesson when employees inadvertently leaked proprietary code through ChatGPT, leading to a company-wide ban on AI tools.
Security: Guardrails need to prevent injection attacks, unsafe tool usage, or malicious code execution when AI agents access internal systems and resources.
Safety: AI systems should prevent harm, whether emotional, physical, or reputational, that could result from their behavior, from refusing to generate malicious code to avoiding dangerous medical advice.
Compliance: Outputs must adhere to legal and regulatory constraints, from financial disclosure requirements to HIPAA compliance in healthcare to GDPR in European operations.
Fairness: AI agents should avoid biased behavior based on characteristics like race, gender, or geography that could undermine ethical standards and organizational diversity goals.

Guardrail Actions: The Essential Toolkit

Now that we understand what to guard, let's explore the actions to perform:

Block: Prevents harmful inputs or outputs entirely, from rejecting jailbreak attempts to restricting access to sensitive data fields.
Modify: Transforms inputs or outputs to remove sensitive information or add necessary context, such as automatically redacting PII before passing data to external services.
Flag: Alerts human reviewers when situations require attention rather than automatic intervention, like detecting potential self-harm signals in customer interactions.
Validate: Ensures plans, tool calls, or outputs meet specific requirements before proceeding, such as confirming financial transactions fall within authorized limits.
Defer / Human-in-the-loop: Delays high-risk actions until explicit human approval is provided, particularly for irreversible operations or sensitive communications.
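As a small illustration of these actions, here is a sketch of a check that runs before an agent executes a tool call. The tool names, the transfer limit, and the email-only redaction are hypothetical placeholders; a real system would rely on a proper policy engine and PII detector.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    MODIFY = "modify"
    FLAG = "flag"
    DEFER = "defer"


@dataclass
class Decision:
    action: Action
    payload: str = ""
    reason: str = ""


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TRANSFER_LIMIT = 1000  # hypothetical authorized limit


def check_tool_call(tool: str, args: dict) -> Decision:
    """Validate a proposed tool call and decide which guardrail action applies."""
    if tool == "delete_database":
        # Block: destructive operations are never allowed for the agent.
        return Decision(Action.BLOCK, reason="destructive tool is not permitted")
    if tool == "transfer_funds":
        # Validate the amount; defer to a human when it exceeds the limit.
        if args["amount"] > TRANSFER_LIMIT:
            return Decision(Action.DEFER, reason="amount exceeds limit; needs human approval")
        return Decision(Action.ALLOW, reason="within authorized limit")
    if tool == "send_to_external_api":
        # Modify: redact email addresses before data leaves the system.
        redacted = EMAIL.sub("[REDACTED_EMAIL]", args["text"])
        if redacted != args["text"]:
            return Decision(Action.MODIFY, payload=redacted, reason="PII redacted")
        return Decision(Action.ALLOW, payload=args["text"])
    # Flag: anything unrecognized goes to a human reviewer.
    return Decision(Action.FLAG, reason="unknown tool; route to human review")


print(check_tool_call("transfer_funds", {"amount": 5000}))
print(check_tool_call("send_to_external_api", {"text": "Contact jane@example.com today"}))
```

Returning a decision object instead of raising an error keeps the choice of action separate from how the agent runtime responds to it, which makes each rule easy to log and audit.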

Looking Ahead

Implementing these layers takes effort, but the tools are maturing. Frameworks like NeMo Guardrails, Portkey, and GuardrailsAI offer building blocks, and in my next post, I’ll walk through real-world implementations with architecture patterns and code.

Conclusion

AI agent safety requires shifting from narrow output filtering to comprehensive, multi-layered protection. By implementing guardrails that address key attributes across your entire agent architecture, you can build systems that remain helpful, honest, and safe—even as they grow increasingly autonomous. Remember: guardrails aren't afterthoughts but core architectural principles that should evolve alongside your AI capabilities.

Acknowledgment

This article draws from the research paper "Swiss Cheese Model for AI Safety: A Taxonomy and Reference Architecture for Multi-Layered Guardrails of Foundation Model Based Agents" by Md Shamsujjoha, Qinghua Lu, Dehai Zhao, and Liming Zhu (Data61, CSIRO, Australia, 2025). Their comprehensive framework for AI guardrails informed many of the concepts presented here. Read the full paper for a deeper technical exploration.

A Summary of LLM Post-Training Techniques

Swapan — Thu, 27 Mar 2025 03:33:35 GMT

Fig. 1: Taxonomy of post-training methods, taken from “LLM Post-Training: A Deep Dive into Reasoning Large Language Models”

Introduction: The Foundation and the Fine-Tuning

This post is based on insights from the recent survey “LLM Post-Training: A Deep Dive into Reasoning Large Language Models”, which organizes and explains the landscape of post-training techniques. I’ve distilled the core methodologies from the paper into this concise summary to make the concepts more accessible for practitioners and builders.

Large Language Models (LLMs) have transformed what machines can do with text. They can write, summarize, translate, and even reason to a degree. But raw pretraining, even on trillions of tokens, is only the beginning.

Post-training is where LLMs gain purpose: the ability to align with human goals, specialize in domains, and generate trustworthy outputs. This “second phase” includes fine-tuning, reinforcement learning, and smart techniques applied at inference time to push performance without retraining the model.

Why Post-Training is Essential for Your AI Applications

Pretrained models are generalists; they're “trained on everything,” but often confused by specifics. Without post-training:

They hallucinate facts
Fail to follow instructions
Ignore ethical and safety boundaries
Struggle with complex, multi-step reasoning

Post-training solves these problems by:

Aligning the model with human preferences
Specializing in tasks or domains
Boosting reasoning and factuality
Making smaller models act smarter via test-time enhancements

If pretraining is learning language, post-training is learning usefulness.

Key Post-Training Methodologies Explained

We group post-training into three big buckets:

Fine-Tuning: Specializes the model using curated data.
Test-Time Scaling (TTS): Boosts performance during inference
Reinforcement Learning (RL): Aligns behavior using feedback signals

Each approach solves different challenges, and together, they unlock powerful capabilities.
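To give a feel for the test-time scaling bucket, here is a tiny sketch of one common technique, majority voting over several sampled answers (often called self-consistency). The hard-coded `samples` list stands in for repeated model calls, which are not shown; it is an illustrative assumption rather than output from any real model.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and the share of samples that agree with it."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Stand-in for final answers extracted from five sampled reasoning chains.
samples = ["42", "42", "41", "42 ", "40"]

answer, agreement = majority_vote(samples)
print(answer, agreement)
```

The agreement share doubles as a cheap confidence signal: low agreement is a reasonable trigger for sampling more answers or escalating to a stronger model.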

Looking Ahead: The Evolving Landscape of Post-Training

The future of post-training is rich and experimental. We're already seeing:

AI feedback replacing human feedback (RLAIF, Constitutional AI)
Search-based reasoning (Tree-of-Thoughts, Graph-of-Thoughts)
Self-critiquing and refinement loops for continuous output polishing
Inference-level intelligence replacing brute-force scaling

The real magic now lies not in growing models, but in teaching them how to think better.

Conclusion

The best AI models today aren’t just big. They’re well-refined. Post-training is how we bridge the gap between raw capabilities and trustworthy, helpful, aligned agents.

Mastering the post-training stack is essential whether you're building a customer assistant, a reasoning agent, or a specialized chatbot.

It’s not about more data. It’s about smarter training.

Building Moats in the Age of AI

Swapan — Wed, 19 Mar 2025 05:04:53 GMT

The AI landscape is evolving at breakneck speed, making traditional moats increasingly difficult to build and maintain. This post offers a perspective on building defensible AI companies today.

The Challenge of Building Moats in Today's AI Ecosystem

The AI landscape has fundamentally altered the dynamics of competitive advantage. Traditional moats are increasingly difficult to establish and maintain for several compelling reasons:

Unprecedented Technology Democratization: Once-proprietary AI capabilities are now available as APIs or open-source implementations, dramatically lowering barriers to entry. What was cutting-edge six months ago is now a commodity.
Product Lifecycle Compression: The time from idea to market has shrunk dramatically. What once took years now takes weeks or months, and successful products are quickly replicated.
Open Source Momentum: The open-source AI community is advancing at a staggering pace, with models like Llama, Mistral, and others rapidly closing performance gaps with proprietary alternatives. This creates a powerful "free alternative" for many use cases.

The Three Viable Moats for Companies

1. Speed

In today's AI landscape, velocity is perhaps the most powerful competitive advantage:

Small, nimble teams can outmaneuver larger organizations by shipping faster
Rapid iteration allows for quick feedback loops and product-market fit discovery
Fast execution creates breathing room before competitors catch up

The companies winning today demonstrate exceptional execution velocity - they ship weekly or monthly updates that meaningfully improve their products.

2. Quality

Despite the democratization of AI technology, quality remains a significant differentiator:

Exceptional UX/UI that makes complex AI capabilities accessible
Superior performance on key metrics that matter to users
Thoughtful implementation that addresses edge cases and failure modes

Quality encompasses not just technical performance but the entire user experience - how the product feels, how it handles errors, and how it delights users.

3. Distribution

Perhaps the most durable moat in today's landscape:

Network effects that increase value as more users join
Strategic partnerships that provide exclusive access to customers
Community building that creates advocates and reduces churn
Novel go-to-market strategies that bypass established channels

Embracing the "Wrapper" Mindset

The stigma around being a "wrapper" company is misplaced:

Many of today's most successful AI companies (Perplexity, Cursor, etc.) are effectively wrappers around foundational models
The value isn't just in the base technology but in the thoughtful application layer
Building a great wrapper requires deep product thinking and user empathy

Creating an effective wrapper involves:

Identifying specific workflows where AI can create 10x improvements
Building intuitive interfaces that abstract complexity
Automating repetitive tasks completely
Deeply understanding domain-specific challenges

AI-Native Thinking

The companies that will dominate in this new era are approaching problems with an AI-native mindset:

Rather than incrementally improving existing solutions, they're reimagining entire workflows
They're not constrained by industry conventions or legacy thinking
They focus on outcomes rather than methods

This creates a unique opportunity for outsiders to disrupt established industries. Those without industry baggage can envision entirely new approaches that established players might miss.

Building for the Long Term

While the pace of change is rapid, this AI wave is here to stay:

The foundational technology continues to improve at a remarkable pace
Enterprise and consumer adoption is just starting
The economic impact is becoming increasingly clear

For entrepreneurs, this means:

It's worth investing in building expertise and capabilities now
Early failures can be valuable learning experiences
Position yourself for long-term success as the market matures

Conclusion

Building a moat in today's AI landscape requires a combination of execution speed, product quality, and distribution strategy. While the challenges are significant, the opportunities for companies that navigate these waters successfully are enormous.

The most successful AI companies will embrace wrapper strategies while bringing AI-native thinking to their domains. They'll focus on creating exceptional user experiences, move with extraordinary speed, and develop innovative distribution channels.

The window is open now for entrepreneurs willing to experiment, learn quickly, and persevere through early challenges. Those who build the right capabilities today will be positioned to lead as the market matures.

The AI Stack: Understanding the Key Layers Powering the AI Ecosystem

Swapan — Wed, 26 Feb 2025 05:28:25 GMT

AI is transforming industries at an unprecedented pace, but to truly understand where value is created and how businesses differentiate, it's essential to break down the AI stack. While this might be obvious to some, having a clear framework helps founders, builders, and investors pinpoint opportunities, competition, and where real moats can be built.

Here’s how I see the full AI stack, from the foundational infrastructure to the applications that deliver AI-powered experiences to users.

1. Infrastructure Layer (The Hardware & Compute Backbone)

At the base of the AI stack lies the infrastructure layer, which includes the hardware and low-level software required to run AI models. This consists of:

GPUs & AI Accelerators – NVIDIA, AMD, Intel, Google TPUs, AWS Trainium, etc.
Compute & Cloud Providers – AWS, Google Cloud, Azure, Oracle Cloud, etc.
Software Libraries – CUDA (for NVIDIA GPUs), ROCm (for AMD), oneDNN (for Intel), etc.

Without this layer, no AI models can be trained or deployed efficiently. The companies operating here are capital-intensive but provide the critical foundation for the AI ecosystem.

(Note: Electricity, cooling, and data center infrastructure also play a crucial role in this layer, but we have not gone deep into that here.)

2. Model Layer

The model layer consists of two key types of AI models:

Foundation Models (Pre-Training) – These are large-scale models trained on vast datasets and used as the base for further adaptation. Companies building these include OpenAI (GPT-4, DALL·E), Anthropic (Claude), Google (Gemini), Mistral, Meta (Llama), etc.
Post-Trained & Fine-Tuned Models – These are models that have been adapted using proprietary or domain-specific data for targeted use cases. Examples include Cohere, Hugging Face, Stability AI, Replit (code models), and others focusing on specific use cases like legal, medical, or scientific AI.

This distinction is crucial, as foundation models provide a general base, while post-trained models refine them for real-world applications.

The foundation model layer is where most of the cutting-edge AI research happens, and while only a few companies have the resources to train these models from scratch, many build upon them.

3. Data Layer

Like the model layer, the data layer can be divided into two parts: data for foundation models (pre-training) and data for post-training/inferencing. Each of these requires different types of datasets:

Data for Foundation Models (Pre-Training) – Companies like Scale AI, Karya, and Labelbox specialize in data annotation and labeling, while Common Crawl provides open web-crawl data used for large-scale pre-training.
Data for Post-Training & Inferencing – Companies like Snowflake, Databricks (DBRX), and synthetic data providers focus on fine-tuning and real-time inferencing needs.

This layer is crucial because high-quality, diverse, and well-labeled data directly impacts model performance and reliability.

4. Tooling Layer

Once foundation models exist, they need to be adapted, fine-tuned, and deployed efficiently. The tooling layer includes:

MLOps & AI Deployment – Portkey, Weights & Biases, MLflow, etc.
Vector Databases & Retrieval-Augmented Generation (RAG) – Pinecone, Weaviate, ChromaDB, etc.
AI API Marketplaces – Hugging Face Spaces, Replicate, AssemblyAI, etc.
Agentic & Workflow Tools – Composio, LangChain, LlamaIndex and orchestration frameworks that make AI more useful in workflows.

This layer is about enabling businesses to use AI effectively—ensuring reliability, monitoring, and deployment at scale.

5. Application Layer

At the top of the stack sits the application layer, where AI is delivered to end users. These applications integrate models, data, and infrastructure to solve real-world problems. Examples include:

AI Assistants & Chatbots – ChatGPT, Perplexity, Jasper, Copy.ai, etc.
B2B AI SaaS – Haptik, Glean, Decagon, Cursor, etc.
AI in Vertical Markets – Legal (Harvey AI), Healthcare (Qure.ai), etc.

A key aspect of succeeding in this layer is seamless integration with other systems and workflows. AI applications need to be embedded into existing software stacks, enterprise systems, or consumer applications to maximize utility and adoption. Companies that effectively integrate AI into industry-specific workflows or existing SaaS products have a better chance of owning this layer and delivering differentiated value.

Why This Matters

The AI ecosystem is rapidly evolving, and companies that fail to clearly define their place within the stack risk being outpaced by more strategic players. Whether you're a founder, investor, or engineer, understanding the layers you operate in is essential for long-term success. Here’s why this matters:

Identify Your Unique Edge – Are you competing in a commoditized layer, or are you creating unique value? Differentiation is key to building a sustainable business.
Anticipate Market Shifts – AI is not static. New technologies, open-source advancements, and regulatory changes constantly reshape the landscape. Knowing your layer helps you stay ahead.
Know Your True Competitors – Your biggest challenge might not be another startup but a hyperscaler, a foundation model provider, or an incumbent integrating AI more effectively.
Build Defensible Moats – True AI moats come from ownership—of data, distribution, or deep integrations within enterprise workflows. Relying too much on other layers can leave you vulnerable.

Different companies operate at different layers of the AI stack. For example, an AI application company might use proprietary customer data to fine-tune models but still rely on foundation models from OpenAI or Anthropic and infrastructure from AWS. Clearly defining which layers you are leveraging vs. where you are creating unique differentiation helps businesses make strategic decisions on partnerships, investments, and long-term vision.

This is just the beginning. In a future post, I’ll explore AI moats—how companies can build lasting defensibility in an industry that is evolving faster than ever.

Thanks to Pratyush and Aakrit for reviewing and providing valuable feedback on this post.

Understanding GPU Components for LLMs

Swapan — Wed, 08 Jan 2025 08:14:04 GMT

In the fast-evolving world of artificial intelligence, each new GPU release is met with a mix of excitement and scrutiny. Terms like cores, memory, and bandwidth are frequently discussed, but what do they mean for training and inferencing large language models? For anyone working in this space, understanding these GPU components is not just helpful—it’s essential. This blog unpacks why GPU specs matter and how they influence the performance and scalability of LLMs.

The Math Behind LLMs

Large Language Models are essentially sophisticated layers of mathematical operations, primarily centered around matrix multiplications. Let’s explore this in simple terms:

Matrix Multiplication Basics

Consider two matrices:

A (3x2): [[1, 2], [3, 4], [5, 6]]
B (2x3): [[7, 8, 9], [10, 11, 12]]

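To calculate A×B, take the dot product of each row of A with each column of B. The result is a 3x3 matrix:

A×B: [[27, 30, 33], [61, 68, 75], [95, 106, 117]]

For just this one multiplication of two small matrices, the operation counts are: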

Multiplications: 18
Additions: 9
Total Operations: 27
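If you want to verify these counts yourself, here is a small pure-Python sketch that multiplies the two matrices and tallies every multiplication and addition along the way. The function name is mine; real workloads would use an optimized library rather than nested loops.

```python
def matmul_with_counts(a, b):
    """Multiply matrices a (m x k) and b (k x n), counting scalar multiplications and additions."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    mults = adds = 0
    for i in range(rows):
        for j in range(cols):
            total = a[i][0] * b[0][j]          # first product, no addition needed
            mults += 1
            for k in range(1, inner):
                total += a[i][k] * b[k][j]     # one multiplication and one addition
                mults += 1
                adds += 1
            result[i][j] = total
    return result, mults, adds


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]

product, mults, adds = matmul_with_counts(A, B)
print(product)
print(mults, adds, mults + adds)
```

Each of the 9 output cells needs 2 multiplications and 1 addition, which is where 18, 9, and 27 come from.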

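This is straightforward math, but when scaled to LLMs, the weight matrices involved together contain billions of elements. Multiplying an m×k matrix by a k×n matrix takes m×n×k multiplications, so the operation count grows with the product of all three dimensions, making computational efficiency critical.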

Quick note: while our example used integers for simplicity, LLMs rely on floating-point numbers for greater precision—a topic we’ll cover shortly.

Training and Inference Operations

According to the scaling laws for training LLMs, the total operations required can be approximated using:

Training Operations = 6 × N × D

Inference Operations (per generated token) = 2 × N

Where:

N: Number of model parameters
D: Number of tokens in the training dataset
The factor 6 comes from roughly 2 operations per parameter per token for the forward pass and about 4 for the backward pass.

For instance, training a LLaMA 7B model on about 300 billion tokens (the D implied by this figure) requires 6 × 7×10⁹ × 3×10¹¹ ≈ 1.26×10²² operations, while inference for the same model uses 2 × N, equating to 1.4×10¹⁰ operations per generated token. Imagine the computational demand for models with 70 billion or even 400 billion parameters!
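The two formulas are easy to plug numbers into. In this sketch, N = 7 billion parameters and D = 300 billion tokens; the token count is my assumption, chosen because it is the value that reproduces the 1.26×10²² figure.

```python
def training_ops(n_params: float, n_tokens: float) -> float:
    """Approximate total training operations: 6 * N * D."""
    return 6 * n_params * n_tokens


def inference_ops_per_token(n_params: float) -> float:
    """Approximate operations to generate one token: 2 * N."""
    return 2 * n_params


N = 7e9    # model parameters
D = 3e11   # training tokens (assumed; this is the D that yields 1.26e22)

train = training_ops(N, D)
infer = inference_ops_per_token(N)

print(f"Training:  {train:.2e} operations")
print(f"Inference: {infer:.2e} operations per token")
print(f"70B model, same data: {training_ops(70e9, D):.2e} operations")
```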

Floating-Point Precision in LLMs

Floating-point numbers are indispensable in LLMs, offering the flexibility to represent extremely large or small values with fractional precision. Here’s a quick breakdown of the most common numeric formats (INT8 and INT4 are integer formats used for quantized inference):

| Precision | Usage | Advantages | Disadvantages | Relevance in LLMs |
| --- | --- | --- | --- | --- |
| FP32 | Training | High precision and numerical stability | High memory and computational cost | Essential for critical calculations like gradient updates. |
| FP16 | Training & Inference | Faster computations, lower memory usage | Susceptible to underflow/overflow | Widely used in mixed-precision training. |
| BF16 | Training & Inference | Combines FP32 range with lower precision | Slightly less precise than FP32 | Preferred for training due to stability and efficiency. |
| INT8 | Inference | Low memory and latency | Potential accuracy loss | Ideal for deploying LLMs on resource-constrained systems. |
| INT4 | Experimental | Ultra-compressed, hardware efficient | Significant accuracy loss | Emerging for deploying very large models. |

Memory Usage

Each parameter consumes memory when the model is loaded, and the amount depends on its precision:

FP32: 4 bytes (32 bits) per parameter.
FP16: 2 bytes (16 bits) per parameter.
BF16: 2 bytes (16 bits) per parameter.
INT8: 1 byte (8 bits) per parameter.
INT4: 0.5 bytes (4 bits) per parameter.

Here’s how memory usage scales with precision for a model like LLaMA 7B:

FP32: 7×10^9 × 4 bytes = 28 GB
FP16/BF16: 7×10^9 × 2 bytes = 14 GB
INT8: 7×10^9 × 1 byte = 7 GB
INT4: 7×10^9 × 0.5 bytes = 3.5 GB
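The same arithmetic as a reusable helper, in case you want to try other model sizes. It counts weights only, using 1 GB = 10^9 bytes as above; activations and the KV cache are not included.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")

print(f"70B @ FP16: {weight_memory_gb(70e9, 'FP16'):.1f} GB")
```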

Note: Additional memory is consumed for optimizations like K-V caching, which will be explored in future posts.

Key GPU Components for LLMs

To meet the massive computational and storage requirements of LLMs, GPUs are designed with specialized components. Let’s break down their roles:

Tensor Cores

Optimized for matrix multiplications, Tensor Cores handle mixed-precision computations (e.g., FP16/FP32) with high throughput, drastically reducing training and inference time.

FLOPS (Floating Point Operations Per Second)

As the name suggests, FLOPS measures how many floating-point operations a GPU can perform per second; it is the headline figure for the GPU's raw computational power. Higher FLOPS generally translates to faster training and inference, provided memory bandwidth can keep the compute units fed, making it a critical spec for LLM workloads.

Memory (VRAM) and Bandwidth

VRAM: The GPU's memory capacity, which determines how large a model (and how much working data) it can hold.
Bandwidth: The rate at which data moves between VRAM and the compute cores, critical during forward and backward passes.
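To see how FLOPS and bandwidth show up in practice, here is a rough estimator. Every hardware number in it is a hypothetical placeholder, not a spec for any real GPU, and the formulas are deliberately simplified: training time is total operations divided by achieved throughput, and single-stream decoding speed is capped by how fast the weights can be streamed from VRAM once per token. It ignores the KV cache, batching, and communication overhead.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_ops, peak_flops, utilization, n_gpus):
    """Days to finish training if each GPU sustains utilization * peak_flops."""
    seconds = total_ops / (peak_flops * utilization * n_gpus)
    return seconds / SECONDS_PER_DAY


def max_decode_tokens_per_second(bandwidth_bytes_per_s, model_bytes):
    """Upper bound at batch size 1: every token requires reading all weights once."""
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical hardware figures, for illustration only.
PEAK_FLOPS = 1e15    # operations per second per GPU
UTILIZATION = 0.4    # fraction of peak actually achieved
N_GPUS = 8
BANDWIDTH = 4e12     # bytes per second

days = training_days(1.26e22, PEAK_FLOPS, UTILIZATION, N_GPUS)
tokens_per_s = max_decode_tokens_per_second(BANDWIDTH, 14e9)  # 7B model in FP16

print(f"Training: about {days:.1f} days")
print(f"Decoding: at most about {tokens_per_s:.0f} tokens per second")
```

The takeaway is that training tends to be limited by compute, while generating one token at a time tends to be limited by memory bandwidth, which is why both numbers on a spec sheet matter.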

NVLink and Interconnects

In multi-GPU setups, NVLink ensures efficient communication, a necessity for distributed training of large models.

Power and Thermal Efficiency

Every GPU operation consumes power. Efficient designs minimize cost and ensure thermal stability during intensive tasks.

Decoding GPU Tech Sheets

To illustrate, let’s consider the data sheet of an NVIDIA H200 GPU. Equipped with the knowledge of FLOPS, Tensor Cores, and memory bandwidth, you can now interpret what makes these GPUs optimal for AI workloads.

Closing Thoughts

GPUs are the backbone of modern AI, particularly for LLMs. Understanding key specifications like cores, FLOPS, memory, and bandwidth empowers developers to make informed choices, maximizing performance and scalability.

Whether you're training multi-billion parameter models or optimizing for real-time inference, the right GPU can unlock new possibilities in AI innovation. As the AI landscape evolves, staying informed about GPU advancements will keep you ahead in this exciting field.

DeepSeek-V3: The Pinnacle of Open-Source AI

Swapan — Fri, 03 Jan 2025 09:17:42 GMT

DeepSeek-V3 currently stands as the best open-source AI model, decisively outperforming competitors such as Llama 3.1 405B, Qwen, and Mistral in benchmarks. Remarkably, it is on par with leading closed-source models like OpenAI’s GPT-4o and Claude 3.5 Sonnet. Not only does it excel in performance, but it was also trained at an astonishingly low cost of $5.576 million, significantly less than its counterparts. For example, GPT-4 reportedly cost hundreds of millions of dollars to train, showcasing DeepSeek-V3’s efficiency. Furthermore, it offers unparalleled inference speed and can be hosted inexpensively, making it an ideal choice for organizations of all sizes.

1. Introduction

About DeepSeek

DeepSeek, a Chinese artificial intelligence company founded in 2023, has made remarkable advancements within just one year of its inception. Established by Liang Wenfeng, the founder of the Chinese quantitative hedge fund High-Flyer, DeepSeek focuses on developing advanced AI models with a vision of achieving artificial general intelligence (AGI). In this short span, the company has launched multiple groundbreaking models, including DeepSeek-V3, which boasts 671 billion parameters and rivals industry leaders like GPT-4. DeepSeek’s rapid progress and commitment to open-sourcing its models have positioned it as a disruptive force in the global AI landscape.

Why DeepSeek-V3 Matters

DeepSeek-V3 is more than just an AI model; it’s a statement about the future of open-source AI. By addressing both performance and cost barriers, it has the potential to reshape AI adoption across industries.

2. Architectural Innovations

Mixture-of-Experts (MoE) Framework

At the heart of DeepSeek-V3 lies its Mixture-of-Experts (MoE) architecture. With 671 billion parameters, this model activates only 37 billion parameters per token. This selective activation ensures a balance between computational efficiency and model depth, making it one of the most advanced architectures in AI.
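The idea of selective activation can be sketched with a toy top-k router. This is a minimal illustration, not DeepSeek-V3's actual gating function: the expert count, the value of k, and the random scores standing in for router outputs are all assumptions.

```python
import math
import random


def route_token(scores: list[float], k: int) -> dict[int, float]:
    """Pick the top-k experts for one token and normalize their gate weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
num_experts, k = 16, 2   # toy sizes, not DeepSeek-V3's configuration
scores = [random.gauss(0, 1) for _ in range(num_experts)]  # stand-in router scores
gates = route_token(scores, k)
print(gates)  # only k of the 16 experts would run for this token

# The same idea at the scale reported for DeepSeek-V3:
active_share = 37 / 671
print(f"active parameters per token: {active_share:.1%}")
```

Because only the selected experts run, the compute per token tracks the 37 billion active parameters rather than the full 671 billion.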

Multi-head Latent Attention (MLA)

The introduction of MLA optimizes attention mechanisms, allowing the model to infer faster while maintaining accuracy. This innovation ensures that even complex tasks are processed with remarkable speed and precision.

Auxiliary-Loss-Free Load Balancing

Traditional load balancing relies heavily on auxiliary losses, which can increase training complexity. DeepSeek-V3’s novel approach eliminates this dependency, streamlining operations and improving performance.
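As described in the DeepSeek-V3 technical report, the approach adds a per-expert bias to the routing scores and nudges it after each step: down for overloaded experts, up for underloaded ones. The simulation below is a simplified sketch of that idea; the expert count, the skewed toy router, and the step size are illustrative assumptions.

```python
import random


def batch_load(skew, bias, k=2, batch_size=256):
    """Route one batch and count how many tokens each expert receives.

    The bias only influences which experts are selected; in the real method
    the gate weights are still computed from the unbiased scores.
    """
    n = len(skew)
    load = [0] * n
    for _ in range(batch_size):
        scores = [random.gauss(skew[i], 1.0) + bias[i] for i in range(n)]
        for i in sorted(range(n), key=scores.__getitem__)[-k:]:
            load[i] += 1
    return load


random.seed(0)
skew = [0.3 * i for i in range(8)]   # toy router that favors later experts
bias = [0.0] * len(skew)
gamma = 0.01                         # bias update step (illustrative value)

first = batch_load(skew, bias)
for _ in range(300):
    load = batch_load(skew, bias)
    mean = sum(load) / len(load)
    # Overloaded experts get a lower bias, underloaded ones a higher bias.
    bias = [b - gamma if n > mean else b + gamma for b, n in zip(bias, load)]
last = batch_load(skew, bias)

print("load spread before:", max(first) - min(first))
print("load spread after: ", max(last) - min(last))
```

No extra loss term enters the training objective here; balance comes purely from adjusting which experts get selected.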

3. Training Methodology

Data Scale and Quality

DeepSeek-V3 was trained on an extensive dataset comprising 14.8 trillion tokens, sourced from diverse, high-quality domains, including curated web content, academic papers, and proprietary datasets. The company prioritized filtering and preprocessing to ensure data quality, reducing noise and improving contextual relevance. DeepSeek leveraged partnerships and open-source contributions to access a wide variety of training data while adhering to ethical data usage practices.

FP8 Mixed Precision Training

Using FP8 mixed precision significantly enhanced the training efficiency of DeepSeek V3 by reducing computational overhead and memory consumption without compromising model accuracy. FP8 (8-bit floating point) mixed precision allowed for faster matrix computations and optimized GPU utilization, enabling the training to scale effectively across massive datasets and complex architectures. This precision format also minimized energy costs and reduced hardware strain, contributing to the cost-efficient training of DeepSeek V3.

HAI-LLM Framework

The HAI-LLM framework is a training system that combines hardware-aware strategies with algorithmic innovations to ensure efficient and stable training across diverse environments. Key features include tailored hardware optimizations for NVIDIA H800 GPUs, support for FP8 mixed precision to reduce memory requirements, and advanced gradient stability techniques to prevent training interruptions. Additionally, its robust data pipeline minimizes bottlenecks, enabling seamless handling of the massive 14.8 trillion-token dataset used for DeepSeek-V3. This framework reduced training costs and ensured scalability and reliability, cementing its role in delivering state-of-the-art AI performance.

Post-Training Distillation

DeepSeek-V3’s advanced reasoning capabilities are a result of a specialized post-training process. By distilling reasoning skills from its DeepSeek-R1 series models, the company has incorporated verification and reflection patterns into the V3 model. This enhances the model’s ability to perform complex reasoning tasks while maintaining precise control over output style and length. These post-training innovations ensure that DeepSeek-V3 not only excels in benchmarks but also delivers practical value in real-world applications requiring high-level reasoning.

4. Performance and Benchmarking

Unmatched Inference Speed

Generating 60 tokens per second, DeepSeek-V3 delivers three times the speed of its predecessor, making it an efficient choice for real-time applications.

Industry Benchmarks

DeepSeek-V3 has redefined the benchmarks for open-source AI, outperforming all its open-source competitors such as Llama 3.1 405B, Qwen, and Mistral. Its exceptional performance places it on par with industry titans like OpenAI’s GPT-4o and Claude 3.5 Sonnet. Below, you can find a detailed comparison of benchmark performances that illustrate its dominance in the field.

5. Efficiency

Optimized Training Resources

DeepSeek-V3, trained using 2.788 million H800 GPU hours, demonstrates remarkable efficiency compared to many leading large language models (LLMs). Meta’s Llama 3.1 405B, for instance, required approximately 31 million GPU hours on H100-80GB GPUs, roughly 11 times DeepSeek-V3’s total. Similarly, NVIDIA’s Nemotron-4 utilized 6,144 H100 GPUs over six months, processing a similar scale of data but at a far greater cost and resource intensity. Even OpenAI’s GPT-3, despite being an older model, required over 1 million GPU hours, and newer GPT versions are believed to demand far more, though exact figures remain undisclosed. DeepSeek’s cost-effective training, achieved at $5.576 million, showcases its highly optimized processes, including FP8 mixed precision and the HAI-LLM framework, making it both resource-efficient and high-performing. While other models like DBRX and Mistral have notable achievements, they lack the demonstrated balance of efficiency and performance seen in DeepSeek-V3. These comparisons highlight DeepSeek’s ability to achieve state-of-the-art performance with fewer resources, positioning it as a leader in advancing LLM development.
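The headline cost follows directly from the GPU hours. A small sketch of the arithmetic, using only the figures quoted above (the comparison ignores the fact that H800 and H100 GPUs are not identical, so treat the ratio as indicative only):

```python
gpu_hours = 2.788e6         # H800 GPU hours reported for DeepSeek-V3
total_cost_usd = 5.576e6    # reported training cost

implied_rate = total_cost_usd / gpu_hours
print(f"implied rental rate: ${implied_rate:.2f} per GPU hour")

llama_gpu_hours = 31e6      # approximate figure quoted above for Llama 3.1
ratio = llama_gpu_hours / gpu_hours
print(f"Llama 3.1 used about {ratio:.1f}x the GPU hours")
```

In other words, the $5.576 million figure is a rental-price estimate of the final training run, not a full accounting of research, staff, or hardware ownership costs.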

Competitive API Pricing and Hosting

Currently, DeepSeek is offering promotional pricing until February 8, 2025, with rates set at $0.014 per million input tokens (cache hits), $0.14 per million input tokens (cache misses), and $0.28 per million output tokens.

After the promotional period, the pricing will adjust to $0.07 per million input tokens (cache hits), $0.27 per million input tokens (cache misses), and $1.10 per million output tokens.

Compared to other large language model (LLM) APIs, DeepSeek V3's pricing is competitive. For instance, OpenAI's GPT-4o is priced at $2.5 per million input tokens and $10 per million output tokens, while Anthropic's Claude 3.5 Sonnet is offered at $3 per million input tokens and $15 per million output tokens.
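To compare providers on a real workload rather than on list prices alone, a small calculator helps. The sketch below uses the GPT-4o and Claude 3.5 Sonnet prices quoted above; the workload size is hypothetical, and the optional `input_cached` price and `cache_hit_rate` parameter are an assumed way to model DeepSeek's cache-hit and cache-miss tiers.

```python
def workload_cost_usd(prices: dict, input_tokens: int, output_tokens: int,
                      cache_hit_rate: float = 0.0) -> float:
    """Cost of a workload, with prices given in USD per million tokens.

    `prices` needs "input" and "output"; "input_cached" is optional and
    applies to the share of input tokens served from the provider's cache.
    """
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    cached_price = prices.get("input_cached", prices["input"])
    total = (cached * cached_price + fresh * prices["input"]
             + output_tokens * prices["output"])
    return total / 1e6


# List prices quoted above for the two closed-source APIs.
gpt_4o = {"input": 2.5, "output": 10.0}
claude_sonnet = {"input": 3.0, "output": 15.0}

# Hypothetical monthly workload: 200M input tokens and 50M output tokens.
for name, prices in [("GPT-4o", gpt_4o), ("Claude 3.5 Sonnet", claude_sonnet)]:
    cost = workload_cost_usd(prices, 200_000_000, 50_000_000)
    print(f"{name}: ${cost:,.0f}")
```

Plugging DeepSeek's rates into the same function, with an estimate of how often your prompts hit the cache, gives a like-for-like monthly figure.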

6. Features of DeepSeek-V3

Advanced Reasoning and Deep Thinking

DeepSeek-V3 excels in complex reasoning tasks, thanks to its post-training enhancements that integrate verification and reflection techniques. This allows the model to solve intricate problems in mathematics, logic, and programming with exceptional precision. Whether you're debugging code or solving a logic puzzle, DeepSeek-V3 consistently delivers structured and reliable solutions, making it a strong contender against ChatGPT in technical and analytical domains.

Search-Enhanced Responses

DeepSeek-V3’s built-in search capabilities provide real-time, context-aware answers by fetching and synthesizing information from external sources. Unlike static models, it can validate data, ensuring the accuracy of responses. This feature is particularly useful for fact-checking, generating up-to-date content, or tackling domain-specific queries, offering a functionality that extends beyond ChatGPT's standard outputs.

Multi-Modal Input Support

DeepSeek-V3 supports multi-modal inputs, enabling users to interact with the model using text, images, or diagrams. For example, you can upload a chart or a photo, and the model will analyze the visual data to extract meaningful insights or summarize patterns. While ChatGPT’s multi-modal capabilities are available in its Pro version, DeepSeek-V3 provides a more comprehensive approach to interpreting and responding to visual content.

128k Token Context Window

With a massive 128k token context window, DeepSeek-V3 can process lengthy and complex documents without losing coherence. This feature is ideal for tasks like summarizing books, analyzing contracts, or handling multi-turn conversations. While ChatGPT offers extended context windows in its higher-tier models, DeepSeek-V3 democratizes this capability with its open-source accessibility.

Coding and Data Analysis

DeepSeek V3 is purpose-built to excel in tasks requiring precision and intelligence, such as coding and data analysis. Leveraging its advanced pre-trained capabilities, developers can seamlessly debug complex code, generate optimized database queries, and interpret intricate algorithms with clarity and reliability.

7. Examples

Search Example

Creative Writing Example

Coding Example

Reasoning Example

8. Open-Source Commitment

DeepSeek-V3 exemplifies a balanced approach to open-source innovation by providing accessibility while ensuring responsible usage through its licensing arrangements. The model and its accompanying codebase are distributed under two distinct licenses:

Model License

The model is provided under DeepSeek's proprietary Model License, which permits use, reproduction, and distribution, including for commercial purposes, subject to specific terms and conditions. These terms include use-based restrictions to promote ethical and responsible AI deployment.

Code License

The accompanying codebase is released under the MIT License, a permissive open-source license that allows for extensive reuse with minimal restrictions, fostering broad adoption and adaptation by the developer community.

By combining these licensing strategies, DeepSeek-V3 ensures open access while addressing ethical considerations in AI development. This dual licensing model not only accelerates innovation but also empowers organizations to leverage state-of-the-art AI responsibly. True to its open-source ethos, DeepSeek-V3 fosters a spirit of collaboration, enabling developers and organizations to build on its robust foundation and drive advancements across industries.

9. Conclusion

DeepSeek-V3 stands out as a versatile and powerful tool, offering unmatched reasoning and mathematical capabilities that redefine expectations for open-source AI. Its advanced post-training enhancements make it excel in tasks requiring complex problem-solving and logic, outshining models like GPT-4o and rivaling closed-source leaders such as Claude 3.5 Sonnet. While Claude 3.5 Sonnet may hold an edge in creative writing and coding finesse, DeepSeek-V3’s strengths in reasoning and mathematics firmly establish its dominance in analytical and technical domains. Additionally, its open-weight nature allows organizations to host the model themselves, ensuring greater flexibility and control—a distinct advantage for developers building custom applications on top of LLMs. For those seeking a cost-effective, high-performance model capable of delivering state-of-the-art results, DeepSeek-V3 is an unparalleled choice.

Understanding Agentic AI Architecture

Swapan — Tue, 26 Nov 2024 06:32:27 GMT

A Comprehensive Guide for Beginners to Experts

Introduction

Agentic AI represents a fascinating and complex frontier in the development of artificial intelligence. Unlike traditional machine learning models that passively receive inputs and produce outputs, agentic AI involves entities, called agents, that can make decisions, interact with their environment, and achieve specific goals. These agents utilize components such as reasoning, memory, and planning to perform tasks autonomously, even collaborating with other agents to solve larger problems. This guide provides a detailed exploration of agentic AI, covering concepts from beginner to expert level, while offering practical examples and applications to illustrate these concepts effectively.

What Is Agentic AI?

Agentic AI refers to AI systems that demonstrate autonomous behavior, actively working to achieve goals with minimal direct human intervention. An agentic AI system is capable of understanding its environment, reasoning about available options, planning its actions, executing tasks, and learning from its experiences. The term "agentic" suggests that these systems can take initiative, make decisions, and even communicate with other agents to achieve complex tasks.

Examples

Customer Support Agents: Customer support agents use agentic AI to interact with customers, answer queries, and escalate issues when needed. These agents can provide round-the-clock support, adapt to the customer's needs, and learn from previous interactions to improve their responses.
Marketing Agents: Marketing agents are designed to autonomously manage marketing campaigns, including segmenting audiences, scheduling posts, and optimizing ads based on real-time performance data. They help businesses achieve better outreach with minimal human intervention.
Coding Agents: Coding agents are designed to autonomously generate code, scripts, or perform debugging tasks. These agents are useful for automating repetitive coding tasks, generating boilerplate code, or creating custom functions as needed. They can operate within an integrated development environment (IDE) to execute specific commands, optimizing the software development lifecycle.
Virtual Assistants: Personal assistants like Alexa or Google Assistant use agentic AI to respond to user requests, set reminders, make recommendations, and control smart devices, adapting to users' preferences.
Financial Trading Bots: Automated trading agents are capable of making trading decisions in real time, based on market conditions, predefined rules, and evolving strategies.

Agents and System 1 vs. System 2 Thinking

In the book Thinking, Fast and Slow by Daniel Kahneman, human cognition is described in terms of two systems: System 1 and System 2. System 1 thinking is fast, intuitive, and automatic, while System 2 thinking is slower, more deliberate, and logical. This concept can be applied to understanding the operation of agentic AI.

System 1 Agents: These agents are designed to make rapid decisions and respond to simple, routine tasks. They operate in a way similar to System 1 thinking—quickly and efficiently. For example, a customer support agent that gives instant responses based on predefined rules or patterns is akin to System 1 thinking, where speed and immediacy are prioritized.
System 2 Agents: These agents handle more complex tasks requiring careful reasoning, planning, and evaluation. They embody System 2 thinking, which involves a deeper, more thoughtful approach to decision-making. Agents like the Planning agent are meant to exhibit System 2 behavior, where multiple layers of analysis, planning, and evaluation come into play before a decision is made.

Agentic AI often blends these two modes of operation—rapid, instinctual responses for well-defined tasks (System 1) and more comprehensive, deliberate actions for complex decision-making (System 2). Understanding this distinction helps to design agents that can leverage both approaches effectively, depending on the nature of the task at hand.
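One way to picture this blend is a small router that tries the fast path first and falls back to the deliberate path. This is only a sketch: the FAQ entries, function names, and the planning steps are hypothetical, and a real System 2 path would call an LLM rather than return a placeholder string.

```python
from typing import Optional

FAQ = {  # System 1: canned answers for routine questions (hypothetical data)
    "opening hours": "We are open 9am-6pm, Monday to Friday.",
    "reset password": "Use the 'Forgot password' link on the login page.",
}


def fast_path(message: str) -> Optional[str]:
    """System 1: instant pattern match, no planning."""
    text = message.lower()
    for key, answer in FAQ.items():
        if key in text:
            return answer
    return None


def deliberate_path(message: str) -> str:
    """System 2 stand-in: a real agent would plan and reason here."""
    steps = ["understand request", "gather context", "draft answer", "review"]
    return f"[planned in {len(steps)} steps] {message}"


def handle(message: str) -> str:
    """Use the fast path when it applies, otherwise think it through."""
    return fast_path(message) or deliberate_path(message)


print(handle("What are your opening hours?"))
print(handle("Compare your two plans for a 40-person team."))
```

The design choice is to keep the cheap path cheap: only requests that fall outside the well-defined patterns pay the cost of deliberate reasoning.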

Agentic AI Architecture and Components

Agentic AI Framework Architecture

Agent Component Architecture

Above is a diagram depicting the components of an agentic framework.

Triggering Mechanisms

Any agentic system can be initiated by a user interacting with it or by another system triggering an action, either through a webhook or a frequency timer.

LLM

Every agentic framework depends on a large language model (LLM). Each component can use the same LLM or a different one to complete its goal, providing flexibility and scalability in handling various tasks.

Planning Agent

The planning agent serves as the orchestration component, which includes reasoning, planning, and task decomposition. This agent knows all other agents and, through proper planning, reasoning, and task decomposition, decides which agents to execute and in what order. LLMs with reasoning capabilities (such as OpenAI’s o1-preview) are best suited for such agents but should be used carefully to balance other considerations of an agentic system.

Agents

Agents encapsulate a set of instructions and tools to achieve a given task:

Prompts: Commands given to a language model along with the tools it has access to.
Tools: Execution blocks that perform actions, such as simple code blocks, API calls, or integrations with other systems.
Environments: Tools may also have environments associated with them to execute specific tasks, such as IDEs or general computer use.
Complex Agents: Agents can also be entire architectures, such as Retrieval-Augmented Generation (RAG), which include embeddings and vector databases.
Memory: Memory in agentic AI allows agents to store information and recall it in future interactions. Memory is available to all components at all times and can include different types:
User Profile: User-specific information that helps agents create personalized experiences.
Chat History: The history of conversations, allowing agents to pull context from past interactions.
Chat State: Tracking the workflows that have been executed to avoid duplicating tasks.
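The pieces listed above can be sketched as plain data structures. This is a minimal illustration rather than any framework's API: the class and field names are assumptions, and the tool is called explicitly here, whereas in a real agent the LLM would choose which tool to call based on the prompt.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    user_profile: dict = field(default_factory=dict)
    chat_history: list = field(default_factory=list)
    chat_state: set = field(default_factory=set)   # workflows already executed


@dataclass
class Agent:
    name: str
    prompt: str
    tools: dict[str, Callable[..., str]]
    memory: Memory

    def call_tool(self, tool_name: str, **kwargs) -> str:
        """Run a tool once per conversation and record it in shared memory."""
        if tool_name in self.memory.chat_state:
            return f"{tool_name} already ran; skipping"
        result = self.tools[tool_name](**kwargs)
        self.memory.chat_state.add(tool_name)
        self.memory.chat_history.append((self.name, tool_name, result))
        return result


shared = Memory(user_profile={"name": "Ada"})
support = Agent(
    name="support",
    prompt="Answer billing questions politely.",
    tools={"lookup_invoice": lambda invoice_id: f"Invoice {invoice_id}: paid"},
    memory=shared,
)
first_call = support.call_tool("lookup_invoice", invoice_id="A-17")
second_call = support.call_tool("lookup_invoice", invoice_id="A-17")
print(first_call)
print(second_call)
```

Because the `Memory` object is shared, any other agent given the same instance sees the history and the chat state, which is how duplicate work is avoided.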

Guardrails

Guardrails are safety mechanisms that prevent harmful behaviors while ensuring robustness in handling unforeseen inputs or scenarios. Rules such as “Ensure there is no mention of competitors in the reply” and “Avoid discussing religion or politics” are examples that should exist at the framework level. These constraints are crucial for deploying agents in dynamic environments, providing default safety checks that can be edited if needed.
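A framework-level guardrail can be as simple as a check that every reply passes through before it reaches the user. The sketch below uses keyword matching as a stand-in; production guardrails more often rely on classifiers or a second LLM, and the rule names and blocked terms here are hypothetical.

```python
import re

BLOCKED_TOPICS = {   # hypothetical framework-level defaults, editable per deployment
    "competitors": ["acme corp", "globex"],
    "religion_politics": ["religion", "election", "political party"],
}


def check_reply(reply: str, rules: dict = BLOCKED_TOPICS) -> list[str]:
    """Return the names of the guardrail rules the reply violates."""
    text = reply.lower()
    return [
        rule for rule, terms in rules.items()
        if any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)
    ]


print(check_reply("Our plan is cheaper than Acme Corp's."))
print(check_reply("Happy to help with your invoice."))
```

A reply that returns a non-empty list would be blocked or regenerated before it is sent.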

Agent Observability

Observability allows developers and users to understand what an agent is doing and why. Providing transparency in agent behavior helps diagnose issues, optimize performance, and ensure that the agent's decisions are aligned with the desired outcomes.

Adaptation and Learning

Adaptation involves an agent's ability to modify its behavior based on feedback from the environment. This includes reinforcement learning or other adaptive techniques that enable agents to optimize their decision-making over time. For example, a marketing agent may adapt its strategy based on changing customer preferences.

AI Agents Ecosystem

The agentic AI infrastructure has evolved tremendously over the last year and is expected to continue evolving rapidly. This growth has brought many new tools and components that facilitate building agentic AI systems.

Agentic AI Infrastructure

The diagram provided by Madrona showcases the different parts of the ecosystem, including components such as agent hosting, orchestration, memory, platform tools, and more.

Pros

Diverse Tooling: There are numerous good tools available for every part of building agentic AI. This diversity allows developers to pick specialized tools for their specific requirements.
Designed for Simplicity: Many of these tools are designed with user-friendly interfaces or straightforward APIs, making it easy to get started and develop prototypes quickly in each area.

Cons

Fragmented Ecosystem: The large number of specialized tools results in a highly fragmented ecosystem, which requires a lot of effort to stitch together. Developers often spend more time integrating and managing multiple tools than on core development.
Integration Complexity: Ensuring compatibility and efficient communication between different tools can be complex and time-consuming, adding significant overhead to projects.

While this diversity in tooling enables a high degree of customization and specialization, it also presents a significant challenge for developers, who must figure out how to integrate them seamlessly into their frameworks.

This fragmentation often makes development cumbersome and time-consuming, especially as developers have to manage multiple tools that may or may not work well together. As the ecosystem continues to grow, consolidation into unified frameworks or more comprehensive tools will be key to simplifying the development process. Such consolidation would make it easier for developers to implement agentic AI systems without constantly dealing with integration issues and compatibility challenges.

Ultimately, while the pace of evolution in agentic AI infrastructure is encouraging, the need for cohesive and consolidated tools is crucial to ensuring that development remains accessible, efficient, and scalable for a wide range of use cases.

Example: Hedge Fund Agent

To evaluate agent frameworks, we use the example of a hedge fund agent that runs several analyses and produces a summary and a decision about a public company's stock.

Hedge Fund Agent Architecture

Analyst Agent Architecture

The above is the architecture of a hedge fund team of analysts:

Portfolio Manager: Acts as a planning agent, orchestrating the overall decision-making process by deciding which agents to call and in what order.
Fundamental Analyst: Performs fundamental analysis of a stock, evaluating financial statements, revenue, profitability, and other indicators.
Technical Analyst: Conducts technical analysis using historical market data, price trends, and chart patterns.
Sentiment Analyst: Focuses on news articles, social media, and other public sentiment sources to gauge market emotions.
Summary Analyst: Takes inputs from other analysts and decides whether the stock should be a Buy, Hold, or Sell.
User-Asking Analyst: Reaches out to the user when additional information is required to proceed with further analysis.
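The roles above can be wired together in a few lines. The sketch below is framework-free and purely illustrative: each analyst is a stub returning a fixed signal where a real system would call an LLM with tools, the ticker is a placeholder, and the majority-vote rule in the summary analyst is an assumption.

```python
from collections import Counter


# Stub analysts: each would wrap an LLM call plus tools in a real system.
def fundamental_analyst(ticker: str) -> dict:
    return {"analyst": "fundamental", "signal": "buy"}


def technical_analyst(ticker: str) -> dict:
    return {"analyst": "technical", "signal": "hold"}


def sentiment_analyst(ticker: str) -> dict:
    return {"analyst": "sentiment", "signal": "buy"}


def summary_analyst(reports: list[dict]) -> str:
    """Majority vote over analyst signals; ties fall back to 'hold'."""
    counts = Counter(r["signal"] for r in reports).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "hold"
    return counts[0][0]


def portfolio_manager(ticker: str) -> dict:
    """Planning agent: decides which analysts run and in what order."""
    plan = [fundamental_analyst, technical_analyst, sentiment_analyst]
    reports = [analyst(ticker) for analyst in plan]
    return {"ticker": ticker, "reports": reports,
            "decision": summary_analyst(reports)}


result = portfolio_manager("EXAMPLE")
print(result["decision"])
```

In a real implementation the plan would be produced dynamically by the planning LLM, and the User-Asking Analyst would be added to it whenever information is missing.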

Frameworks

Various frameworks exist to make it easy to build agents. Below is a quick evaluation of several of them, based on implementing the hedge fund example in most of the frameworks listed:

Framework Notes

Most frameworks offer broadly the same functionality and differ mainly in implementation details.

LangChain: Easy to get started but hard to build complex flows. Best used alongside LangGraph.
LangGraph: Along with LangChain, LangGraph is a very powerful tool. It takes a little getting used to, and visualizing the graph is hard, which makes it challenging to understand a flow without deep exploration.
CrewAI: Overall a great framework that covers all of the functionality. It also has a SaaS platform that makes it easier to use. It allows hierarchical flows, but more complex flows are still on the roadmap.
Swarm: Primarily an educational framework and not yet ready for production. It has most functionalities, but combining them to build complex flows seems limited.
Microsoft AutoGen: To be evaluated soon.

Observations

None of the frameworks provides a way to build guardrails at the framework level. NVIDIA's NeMo Guardrails offers programmable guardrails, but it must be plugged in separately.
Memory in most frameworks is handled through states and context variables, all of which must be managed manually.
Planning is possible via regular agents, but state handling needs to be done manually.
Dynamic workflows are built using handoffs, and LangGraph and CrewAI make it easier to build graphs.

Evaluation of Agents

Evaluating agentic AI is crucial for ensuring reliability, performance, and safety. Due to the need for fast iteration and the subjective nature of evaluating agents, there is often insufficient focus on testing and evaluation. Below, we discuss insights and best practices for effective testing of agentic AI systems.

Unit Evaluation

Ideally, each agent should have its own evaluations, akin to unit testing in software engineering. Unit evaluation ensures that individual agents behave as expected, meeting their functional requirements. Agents should be tested across different scenarios to validate their reasoning, planning, and execution accuracy. It is beneficial to keep the output structured for each agent, as structured output makes validation easier.
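Structured output is what makes such a unit evaluation cheap to write. As a minimal sketch, suppose an agent is expected to return JSON with a `decision` and a `rationale` field (this schema is an assumption for illustration); the check then becomes an ordinary function that can run in a test suite.

```python
import json

ALLOWED_DECISIONS = {"buy", "hold", "sell"}


def validate_output(raw: str) -> list[str]:
    """Return a list of problems found in an agent's JSON output (empty = pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        problems.append("decision must be buy, hold, or sell")
    rationale = data.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        problems.append("rationale must be a non-empty string")
    return problems


print(validate_output('{"decision": "buy", "rationale": "Strong cash flow."}'))
print(validate_output('{"decision": "maybe"}'))
print(validate_output("I think you should buy."))
```

Running each agent against a fixed set of scenarios and asserting that this list is empty gives a regression check for every prompt change.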

Integration Evaluation

The entire system should also undergo evaluations, akin to integration testing. Integration evaluation examines how well agents work together as a complete system, verifying that the interactions between agents lead to correct outcomes. This type of testing is crucial for identifying issues that arise from communication failures or unexpected dependencies among agents.

Runtime Verification

Runtime verification involves assessing the agents' outputs in real time to improve their adaptation and learning capabilities. There are two ways to do this: keeping a human in the loop, which is only practical for small amounts of data, or using larger models as verifiers. Larger models can help ensure that agents make decisions aligned with their intended goals, but running them is computationally expensive. To mitigate costs, batched runtime verification or selective runtime verification (e.g., based on user feedback or critical events) can be employed. These verification methods help maintain system quality while adapting to new data over time.
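Selective verification can be sketched as a simple filter in front of the expensive verifier. The event fields, the 5% sample rate, and the rule that every 50th event is critical are all assumptions made for illustration.

```python
import random


def should_verify(event: dict, sample_rate: float, rng: random.Random) -> bool:
    """Always verify critical events and negative feedback; sample the rest."""
    if event.get("critical") or event.get("user_feedback") == "negative":
        return True
    return rng.random() < sample_rate


rng = random.Random(42)
events = [{"id": i, "critical": i % 50 == 0} for i in range(1000)]
to_verify = [e for e in events if should_verify(e, sample_rate=0.05, rng=rng)]
print(f"verifying {len(to_verify)} of {len(events)} outputs")
```

Only the selected outputs are sent to the larger model or to a human reviewer, which keeps verification cost roughly proportional to the sample rate.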

Red Teaming

Red teaming is a critical component of evaluating agentic frameworks, especially for identifying vulnerabilities and ensuring robustness. This type of evaluation involves simulating adversarial conditions to determine how well the system can handle unexpected or potentially harmful inputs.

Red team testing helps expose weaknesses in agents' reasoning, memory handling, and interaction strategies. It is essential for understanding how an agent may fail under adversarial conditions and for building better safety measures. Incorporating red team testing into the evaluation process ensures that agents are resilient to attacks and can operate safely in diverse environments.

Best Practices

Structured Outputs: Ensure each agent's output is well-structured. Structured output makes it easier to validate correctness and promptly identify issues.
Testing at Scale: Use larger language models to test final outcomes to ensure scalability. Larger models can simulate a variety of user behaviors, which helps stress-test the system effectively.
Iterative Evaluation: Agents should be evaluated iteratively, allowing developers to identify weaknesses early and make improvements swiftly. Each iteration helps refine agents and contributes to more stable outcomes over time.

By incorporating these testing practices, developers can enhance the reliability, adaptability, and robustness of agentic AI systems.

Considerations in Developing Agentic AI

Despite the potential, agentic AI systems face several limitations that restrict their applicability.

Accuracy of function calling

While LLMs can determine which tools to call, the accuracy of these decisions is still limited. According to the UC Berkeley Function-Calling Leaderboard, the best-in-class LLM at the time of writing has an accuracy of 68.9% and still shows a high rate of hallucination. This limitation makes LLMs less reliable for high-risk use cases.

Function calling accuracy

Best Practices for Improving Accuracy

Limit the Tools: Limit the number of tools to 4-5 per prompt. Beyond this, hallucinations increase, and accuracy decreases significantly.
Specific Prompts with Examples: Make the prompt very specific and try to provide examples of when to call which tool. This can help the LLM better understand the context, reducing errors and improving decision accuracy.
Good Evaluations for Tool Calling: Having thorough evaluations for tool calling is essential, as it helps measure and improve accuracy. This is achievable since the output can be structured, making it easier to validate if the correct tools were used effectively.
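Because tool calls are structured, an evaluation for them can be a plain comparison against labeled examples. In this sketch the tool names and cases are hypothetical, and the `predicted` values are hard-coded where a real harness would record what the LLM actually returned.

```python
def tool_call_accuracy(cases: list[dict]) -> float:
    """Share of cases where the predicted tool and arguments match exactly."""
    correct = sum(c["predicted"] == c["expected"] for c in cases)
    return correct / len(cases)


cases = [  # hypothetical labeled examples
    {"expected": {"tool": "get_price", "args": {"ticker": "ABC"}},
     "predicted": {"tool": "get_price", "args": {"ticker": "ABC"}}},
    {"expected": {"tool": "get_news", "args": {"ticker": "ABC"}},
     "predicted": {"tool": "get_price", "args": {"ticker": "ABC"}}},  # wrong tool
    {"expected": {"tool": "get_price", "args": {"ticker": "XYZ"}},
     "predicted": {"tool": "get_price", "args": {"ticker": "XY"}}},   # wrong argument
    {"expected": {"tool": None, "args": {}},
     "predicted": {"tool": None, "args": {}}},                        # correctly calls nothing
]
accuracy = tool_call_accuracy(cases)
print(f"tool-calling accuracy: {accuracy:.0%}")
```

Including cases where no tool should be called is worthwhile, since calling a tool unnecessarily is one of the ways hallucination shows up.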

Cost

Running agentic AI can be expensive, especially given the number of language model (LLM) calls and the involvement of multiple agents. Managing memory, reasoning, and tool usage adds to computational overhead, making the operation costly. However, there are several strategies that can help mitigate these costs:

Pricing Dynamics: While agentic AI might be expensive today, prices are expected to fall significantly over time (by roughly 10x every year). Advances in model optimization and increasing competition among providers are leading to lower costs, making agentic AI more feasible in the near future.
LLM Selection: Each agent can use a different type of LLM that is best suited for its purpose. For instance, a planning agent may require a more powerful LLM capable of complex reasoning, while simpler agents can rely on lightweight LLMs that are less expensive. This selective usage can help control costs while maintaining overall system efficiency.
Hybrid Models: Utilizing a combination of LLMs and smaller, specialized models (such as rule-based systems) for tasks that do not require complex language understanding can also help reduce the reliance on costly LLM calls. Agents can offload simpler tasks to non-LLM models, thereby reducing the frequency of expensive operations.
Optimized Token Usage: Using tokens and prompts wisely can significantly reduce costs. For example, sending summaries or only the relevant portions of a code base can help reduce the token count, thereby minimizing LLM usage expenses.
Batch Processing and Shared Resources: Agents can be designed to share resources when possible. For instance, memory and intermediate results can be cached and reused by multiple agents, reducing redundant computations and LLM calls. This type of optimization helps cut down on both computational and financial costs.
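Two of these strategies, per-agent model selection and shared caching, fit in a short sketch. The model names are placeholders, `fake_llm` stands in for a real client call, and the in-memory dictionary is an assumption; a production system would more likely use a shared cache service.

```python
import hashlib

MODEL_FOR_AGENT = {   # hypothetical model tiers per agent
    "planner": "large-reasoning-model",
    "sentiment": "small-fast-model",
}
cache: dict[str, str] = {}
stats = {"llm_calls": 0}


def fake_llm(model: str, prompt: str) -> str:
    """Stand-in for a real (and billable) LLM client call."""
    stats["llm_calls"] += 1
    return f"{model} answer to: {prompt}"


def cached_call(agent: str, prompt: str) -> str:
    """Pick the agent's model tier and reuse earlier results when possible."""
    model = MODEL_FOR_AGENT.get(agent, "small-fast-model")
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key not in cache:
        cache[key] = fake_llm(model, prompt)
    return cache[key]


cached_call("sentiment", "Summarize today's news on EXAMPLE")
cached_call("sentiment", "Summarize today's news on EXAMPLE")  # served from cache
cached_call("planner", "Plan the analysis for EXAMPLE")
print(stats["llm_calls"], "billable calls for 3 requests")
```

Exact-match caching only helps when prompts repeat verbatim, so it pairs well with prompts that keep the shared instructions fixed and the variable parts small.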

Latency

Agentic AI often involves multi-step reasoning and interactions, which can introduce significant latency, especially when agents communicate or make calls to external tools. Below are some best practices for reducing latency:

Streaming Last Layer: For user-facing agents, streaming the output of the last LLM call in the chain (the one whose output the user actually sees) can significantly reduce perceived latency. The system delivers tokens as they are generated, so the end user sees a response sooner and the overall experience improves.
Prompt Optimization: Prompts and results can often be identical across different users. To reduce latency, prompt caching can be employed to reuse previously computed results whenever possible. By caching common prompts, agents can bypass unnecessary recomputation, delivering faster responses.
Specialized LLMs: Not every agent requires the use of a large, complex LLM. By utilizing simpler LLMs for straightforward tasks and reserving the more resource-intensive models for complex reasoning, latency can be effectively minimized.
Parallel Execution: When possible, agents should run in parallel rather than sequentially. For example, a sentiment analysis and a technical analysis could be conducted simultaneously, reducing the total processing time for a task.
Efficient Runtime Strategies: Using batched or selective runtime verification can also reduce latency. Instead of verifying every output, only critical outputs or random samples may be checked, saving processing time and resources.
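Parallel execution is the easiest of these to demonstrate. In the sketch below, `asyncio.sleep` stands in for an LLM or API call, and the agent names and delays are hypothetical; the point is that independent agents started together wait concurrently instead of one after the other.

```python
import asyncio
import time


async def run_analyst(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)   # stands in for an LLM or API call
    return f"{name} done"


async def run_parallel() -> list[str]:
    # Both analysts start immediately; their waits overlap.
    return await asyncio.gather(
        run_analyst("sentiment", 0.2),
        run_analyst("technical", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(run_parallel())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

This only works for agents that do not depend on each other's output; anything downstream, such as a summary step, still has to wait for all of its inputs.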

Alignment

Ensuring that agents are aligned with human values and objectives is an ongoing challenge in the development of agentic AI systems. Misaligned agents may make decisions that, while logical from their perspective, could be harmful or undesirable for users. Below, we explore best practices and approaches for improving alignment:

Fine-Tuning for Task-Specific Goals: Fine-tuning language models to specific tasks helps improve the alignment of agents with desired outcomes. By training on task-specific data, agents can better understand and meet the expectations of particular applications, ensuring their behavior is more predictable and aligned with user goals.
Reinforcement Learning from Human Feedback (RLHF): Reinforcement Learning from Human Feedback can be used to fine-tune agents based on user preferences and values. RLHF enables agents to iteratively improve their behavior by learning from user feedback, making them more attuned to nuanced requirements and avoiding undesired behaviors.
Consistency Checks: Regular consistency checks should be performed to ensure agents are making decisions in line with expected behavior. Consistency checks can include comparing agent outputs against a set of pre-defined acceptable responses, which helps identify misalignment early on.
Explainability for Better Alignment: Agents that can explain their reasoning are easier to align with human expectations. Explainable AI enables users to understand why an agent made a particular decision, providing transparency and making it easier to detect when an agent is deviating from its intended purpose.
Human-in-the-Loop Oversight: For critical applications, incorporating human-in-the-loop oversight is crucial. This allows for real-time intervention when agents make decisions that deviate from intended goals, ensuring safety and alignment with human values. Human oversight is especially important in high-risk scenarios where autonomous decision-making could have significant consequences.
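
As a rough illustration of how consistency checks and human-in-the-loop oversight can work together, the sketch below compares an agent's output against a pre-defined set of acceptable responses and queues anything unexpected, or anything flagged high-risk, for human review. The class name ConsistencyChecker, the status strings, and the exact-match comparison are all assumptions made for this example; real checks are usually fuzzier than a set lookup.

```python
from dataclasses import dataclass, field


@dataclass
class ConsistencyChecker:
    """Hypothetical checker: maps each task to its set of acceptable responses."""

    acceptable: dict
    review_queue: list = field(default_factory=list)

    def check(self, task: str, output: str, high_risk: bool = False) -> str:
        allowed = self.acceptable.get(task, set())
        # Assumption: responses are compared case-insensitively after trimming.
        consistent = output.strip().lower() in allowed
        if high_risk or not consistent:
            # Escalate instead of acting autonomously.
            self.review_queue.append((task, output))
            return "needs_human_review"
        return "auto_approved"


if __name__ == "__main__":
    checker = ConsistencyChecker({"refund": {"approve", "deny"}})
    print(checker.check("refund", " Approve "))
    print(checker.check("refund", "delete account"))
    print(checker.check("refund", "approve", high_risk=True))
    print(checker.review_queue)
```

High-risk decisions go to the review queue even when the output looks consistent, which mirrors the point above that human oversight matters most where autonomous decisions carry significant consequences.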

Conclusion

In this guide, we explored the fundamental concepts and components of agentic AI, including its architecture, examples, and frameworks. Agentic AI represents a distinct approach to building autonomous systems capable of reasoning, planning, and adapting to achieve specific goals. We discussed how agents can operate like System 1 and System 2 thinking, blending fast, intuitive actions with deliberate, complex decision-making. We also analyzed frameworks for implementing agentic systems and shared best practices for testing and evaluation.

While there are successful case studies, such as Amdocs using Nvidia NIM and Wiley working with Salesforce, challenges remain in making agents fully autonomous. Trust and scalable runtime verifiability are still significant obstacles, as is the question of how best to architect agents to achieve optimal results. Developing robust frameworks can play a key role in overcoming these challenges, and ongoing improvements in testing, including runtime verification and red team testing, will help ensure agents perform reliably and safely.
