Most developers first encounter LLMs through chat interfaces such as ChatGPT, Claude, or Gemini. It’s a natural starting point, but it creates a specific mental model: LLMs are conversational partners, digital assistants you interact with through natural language. This mental model works fine when you’re chatting, but it can lead you astray when you start integrating LLMs into production systems.
The problem? When you chat with ChatGPT, you’re having a conversation—you ask for help, clarify when needed, and iterate together toward a solution. But when building systems, you need predictable, repeatable behavior. You can’t have your production system ask “Did you mean X or Y?” or respond with “I’d be happy to help you with that! Here’s your JSON data:” followed by the actual JSON.
The shift I want to explore isn’t revolutionary—it’s about helping developers recognize that the skills they already have apply directly to LLM development, just with a probabilistic twist. The core insight is simple: stop thinking of LLMs as entities to chat with, and start thinking of them as functions to call.
Why This Mental Model Matters
When developers think “chatbot,” they build chatbot-shaped solutions. When they think “function,” familiar patterns emerge: composition, testing, versioning, and integration. The difference? These functions are fuzzy — probabilistic rather than deterministic. But that doesn’t make them less useful; it just means adapting our practices slightly.
The Function Analogy (With Important Caveats)
Let’s be clear: LLMs aren’t deterministic functions. Even with temperature=0, outputs might vary slightly. But they’re still functions in the architectural sense:
import re

# Traditional function
def extract_email(text: str) -> str:
    # Deterministic regex parsing
    match = re.search(r'[\w\.-]+@[\w\.-]+', text)
    return match.group(0) if match else ""

# LLM as a function
def extract_email_llm(text: str) -> str:
    # `llm` stands in for whichever client you use; the call assumes it
    # returns structured output with an "email" field
    response = llm.invoke(
        prompt=f"Extract the email address from: {text}",
        temperature=0
    )
    return response["email"]
Both take input and produce output. The LLM version handles cases the regex can’t: “my email is john dot doe at company dot com” or “reach me at jdoe[at]example[dot]com”.
Familiar Practices, New Context
The good news: most of what you already know still applies. Let’s explore how familiar practices translate to this probabilistic world.
1. Your Prompts Are Your Code
In traditional development, we version source code religiously. With LLMs, the prompt IS your implementation. A seemingly minor prompt change can alter behavior dramatically, just like changing a RegEx pattern or SQL query.
But there’s a twist: the “runtime” (the model) evolves underneath you. Your model may be deprecated, or you may want to optimize costs with a cheaper model, and suddenly your carefully crafted prompt behaves differently. This is like having your Python interpreter randomly updated in production—except it’s expected behavior.
What this means for developers:
- Version your prompts like source code
- Document why specific wording was chosen
- Plan for model migrations like framework upgrades
- Keep a regression suite to catch behavioral drift
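To make that concrete, here is a minimal sketch of prompt versioning plus a tiny regression suite; the registry shape, model name, and golden cases are illustrative assumptions, not a specific library:

# prompts.py -- prompts live in version control next to the code that uses them
PROMPTS = {
    "extract_email": {
        "version": "2.1.0",
        "model": "gpt-4.1-mini",  # pin the model version you tested against
        "template": (
            "Extract the email address from the text below. "
            "Return only the address, nothing else.\n\nText: {text}"
        ),
        # Document *why* the wording is what it is
        "notes": "v2.1.0: added 'nothing else' to stop chatty preambles.",
    },
}

# test_prompts.py -- a small regression suite to catch behavioral drift
GOLDEN_CASES = [
    ("Contact me at jane@example.com", "jane@example.com"),
    ("my email is john dot doe at company dot com", "john.doe@company.com"),
]

def test_extract_email_regression():
    for text, expected in GOLDEN_CASES:
        assert extract_email_llm(text) == expected

Run the suite whenever the prompt text or the underlying model changes, exactly as you would for a framework upgrade.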
2. Test-Driven Development Becomes Test-Driven Prompt Development
Traditional TDD follows red-green-refactor: write a failing test, make it pass, then improve. With LLMs, the cycle is similar but the assertions differ.
Instead of:
assert extract_age("I am 25 years old") == 25
You write:
result = extract_age_llm("I am 25 years old")
assert isinstance(result, int)
assert 20 <= result <= 30 # reasonable bounds
assert result == 25 # might work, might be brittle
Your tests become evaluations — they grade quality rather than enforce exact outputs. This feels foreign at first, but it’s similar to testing systems with inherent variability:
- UI responsiveness (must load in < 2 seconds)
- Search relevance (top result should be relevant)
- Compression algorithms (output should be smaller than input)
Practical approach:
- Start with property-based tests (“output should be valid JSON”)
- Add golden datasets—curated examples that should always work
- Use similarity metrics where exact matches are too strict
- Run evaluations continuously as models evolve
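A rough sketch of what such an evaluation might look like, reusing extract_age_llm from above; the golden cases, property checks, and pass threshold are assumptions for illustration rather than a specific eval framework:

GOLDEN_DATASET = [
    {"input": "I am 25 years old", "expected": 25},
    {"input": "Turned thirty last week", "expected": 30},
]

def evaluate_extract_age(threshold: float = 0.9) -> float:
    """Grade quality across a dataset instead of asserting a single exact output."""
    passed = 0
    for case in GOLDEN_DATASET:
        result = extract_age_llm(case["input"])
        # Property checks first (type, plausible range), then the golden answer
        ok = isinstance(result, int) and 0 <= result <= 120 and result == case["expected"]
        passed += ok
    score = passed / len(GOLDEN_DATASET)
    assert score >= threshold, f"Eval score {score:.2f} fell below {threshold}"
    return score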
3. Defensive Programming in a Probabilistic World
Remember writing code for unreliable networks? The patterns are remarkably similar. Just as you wouldn’t trust a network call to always succeed, you shouldn’t trust an LLM to always return perfect output.
The graceful degradation pattern:
def extract_data_with_fallback(text: str) -> dict:
    try:
        # Start with the fast, cheap model
        result = gpt41_nano_extract(text)
        if validate_structure(result):
            return result
    except (TimeoutError, RateLimitError):
        # RateLimitError comes from your provider's SDK (e.g. openai.RateLimitError)
        pass

    try:
        # Escalate to more capable model if needed
        result = gpt41_extract(text)
        if validate_structure(result):
            return result
    except (TimeoutError, RateLimitError):
        pass

    # Last resort: simple heuristics or human review
    return manual_review_queue.add(text)
This mirrors patterns you already use:
- CDN → Origin server → Cached fallback
- Automated processing → Manual review → Exception handling
- Real-time data → Recent cache → Historical average
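The validate_structure call above carries a lot of weight. Here is a minimal sketch of what it might check, with the required keys chosen purely for illustration:

REQUIRED_KEYS = {"name", "email", "amount"}  # illustrative schema

def validate_structure(result) -> bool:
    """Cheap structural checks before trusting an LLM's output."""
    if not isinstance(result, dict):
        return False
    if not REQUIRED_KEYS.issubset(result):
        return False
    # Field-level sanity checks
    amount = result.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0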
4. Human-in-the-Loop: From Exception to Rule
In traditional software, human intervention is often the exception—what happens when automation fails. With LLMs, HITL (Human-in-the-Loop) becomes a first-class pattern, not just a fallback.
Consider how this changes the development cycle:
def process_with_confidence(text: str) -> dict:
    result = llm_extract(text)
    confidence = calculate_confidence(result)

    if confidence < 0.8:
        # Route to human review
        human_result = human_review_queue.add(text, result)
        # Crucially: add to training data
        add_to_golden_dataset(text, human_result)
        return human_result

    return result
The key insight: humans aren’t just fixing errors—they’re continuously improving your evaluation suite. Each human correction becomes a test case, gradually raising the automation ceiling. This is different from traditional software where human intervention rarely feeds back into the test suite automatically.
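One way that feedback loop can look in practice, assuming a simple JSON Lines file as the golden dataset store (the path and record shape are illustrative):

import json
from datetime import datetime, timezone

GOLDEN_PATH = "golden_dataset.jsonl"  # illustrative location

def add_to_golden_dataset(text: str, human_result: dict) -> None:
    """Persist each human correction as a future regression/eval case."""
    record = {
        "input": text,
        "expected": human_result,
        "source": "human_review",
        "added_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(GOLDEN_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")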
5. Composition: Building Complex Systems from Simple Functions
One common mistake is creating “god prompts” that do everything—the LLM equivalent of a 500-line function. Instead, embrace composition:
def process_customer_email(email: str) -> dict:
    # Each step is a focused, testable function
    metadata = extract_email_metadata(email)
    sentiment = analyze_sentiment(email)
    intent = classify_intent(email)

    # Combine the results
    return {
        "metadata": metadata,
        "sentiment": sentiment,
        "intent": intent,
        "priority": calculate_priority(sentiment, intent)
    }
Each function has a single purpose, can be tested independently, improved without affecting others, and can even use different models or non-LLM implementations. This is just good software design, applied to probabilistic components.
6. Observability: You Can’t Step Through an LLM with a Debugger
When traditional functions fail, you can step through with a debugger. With LLMs, you need different tools:
Log everything (with PII awareness):
- Input prompt (with sensitive data masked)
- Model name and version
- Temperature and parameters
- Raw output
- Parsed/validated output
- Latency and token usage
Tip: Hash or redact emails, IDs, and other personal data before writing logs to disk.
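A minimal sketch of what one such log record could contain; mask_pii and log_llm_call are hypothetical helpers, not part of any particular SDK:

import hashlib
import json
import logging
import re
import time

logger = logging.getLogger("llm_calls")

def mask_pii(text: str) -> str:
    # Placeholder redaction; swap in whatever masking your compliance rules require
    return re.sub(r'[\w\.-]+@[\w\.-]+', "<email>", text)

def log_llm_call(prompt, model, params, raw_output, parsed_output, started_at, usage):
    logger.info(json.dumps({
        "prompt_masked": mask_pii(prompt),
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "model": model,
        "params": params,            # temperature, max_tokens, ...
        "raw_output": mask_pii(raw_output),
        "parsed_output": parsed_output,
        "latency_ms": round((time.time() - started_at) * 1000),
        "usage": usage,              # prompt/completion token counts
    }, default=str))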
Monitor for drift:
- Track validator success rates
- Monitor output format compliance
- Watch for semantic drift in responses
- Set up alerts for cost spikes or unusual patterns
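One lightweight way to track this is a sliding window of validation outcomes with an alert threshold; the window size and threshold below are arbitrary assumptions:

from collections import deque

class DriftMonitor:
    """Tracks recent validator outcomes and flags drops in the pass rate."""

    def __init__(self, window: int = 500, alert_below: float = 0.95):
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, passed: bool) -> None:
        self.outcomes.append(passed)

    def pass_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noisy cold starts
        return len(self.outcomes) == self.outcomes.maxlen and self.pass_rate() < self.alert_below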
Common Pitfalls and How to Address Them
Pitfall 1: “Let me have a conversation with the API”
Symptom: Prompts that say “Please analyze this and tell me about…”
Fix: Define the exact output format, e.g. Return only: {"summary": "...", "category": "..."}
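A sketch of one way to pin down that contract on both sides of the call; the prompt wording and keys are illustrative:

import json

def build_prompt(ticket: str) -> str:
    return (
        "Summarize the ticket below.\n"
        'Return only JSON in exactly this shape: {"summary": "...", "category": "..."}\n\n'
        f"Ticket: {ticket}"
    )

def parse_response(raw: str) -> dict:
    data = json.loads(raw)                       # fails loudly on chatty preambles
    assert set(data) == {"summary", "category"}  # enforce the exact contract
    return data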
Pitfall 2: “It worked in testing!”
Symptom: Surprise when production behavior differs.
Fix: Test against multiple model versions; build in graceful degradation.
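A small sketch of what testing against multiple model versions can mean, using pytest parametrization and assuming extract_email_llm accepts a model argument (the model names are placeholders):

import pytest

MODELS = ["gpt-4.1-nano", "gpt-4.1", "a-cheaper-candidate-model"]  # placeholders

@pytest.mark.parametrize("model", MODELS)
def test_extract_email_across_models(model):
    result = extract_email_llm("Contact me at jane@example.com", model=model)
    assert result == "jane@example.com"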
Pitfall 3: “LLMs are magic/useless”
Symptom: Either over-trusting or completely dismissing LLMs.
Fix: Understand them as powerful but imperfect tools, like OCR or speech recognition.
The Path Forward
The mental shift from “chatbot” to “fuzzy function” is key to building robust systems with LLMs. It’s not about abandoning what we know—it’s about adapting proven practices to probabilistic components.
These adaptations — evaluations instead of assertions, confidence thresholds instead of binary success, human-in-the-loop as a feature not a bug — prepare you for the next level of AI development. Once you’re comfortable with LLMs as functions, you’re ready to compose them into agents, workflows, and increasingly sophisticated systems.
The future of software isn’t deterministic OR probabilistic—it’s both, working together. Your existing engineering skills are the foundation. The probabilistic twist is just that—a twist, not a rewrite.
Start small. Find one text transformation in your current system that could benefit from some fuzziness. Wrap it in a function. Test it like you would any external service. Monitor it in production. Let humans improve it. Then build from there.
Your decades of engineering wisdom still apply. You just have a new, powerful, slightly unpredictable tool in your toolkit.