Most developers first encounter LLMs through chat interfaces such as ChatGPT, Claude, or Gemini. It’s a natural starting point, but it creates a specific mental model: LLMs are conversational partners, digital assistants you interact with through natural language. This mental model works fine when you’re chatting, but it can lead you astray when you start integrating LLMs into production systems.
The problem? When you chat with ChatGPT, you’re having a conversation—you ask for help, clarify when needed, and iterate together toward a solution. But when building systems, you need predictable, repeatable behavior. You can’t have your production system ask “Did you mean X or Y?” or respond with “I’d be happy to help you with that! Here’s your JSON data:” followed by the actual JSON.
The shift I want to explore isn’t revolutionary—it’s about helping developers recognize that the skills they already have apply directly to LLM development, just with a probabilistic twist. The core insight is simple: stop thinking of LLMs as entities to chat with, and start thinking of them as functions to call.
Why This Mental Model Matters
When developers think “chatbot,” they build chatbot-shaped solutions. When they think “function,” familiar patterns emerge: composition, testing, versioning, and integration. The difference? These functions are fuzzy — probabilistic rather than deterministic. But that doesn’t make them less useful; it just means adapting our practices slightly.
The Function Analogy (With Important Caveats)
Let’s be clear: LLMs aren’t deterministic functions. Even with temperature=0, outputs might vary slightly. But they’re still functions in the architectural sense:
import re

# Traditional function
def extract_email(text: str) -> str:
    # Deterministic regex parsing
    match = re.search(r'[\w\.-]+@[\w\.-]+', text)
    return match.group(0) if match else ""

# LLM as a function
def extract_email_llm(text: str) -> str:
    # `llm` stands in for whichever client you use; the call assumes it
    # returns structured output with an "email" field
    response = llm.invoke(
        prompt=f"Extract the email address from: {text}",
        temperature=0
    )
    return response["email"]
Both take input and produce output. The LLM version handles cases the regex can’t: “my email is john dot doe at company dot com” or “reach me at jdoe[at]example[dot]com”.
Familiar Practices, New Context
The good news: most of what you already know still applies. Let’s explore how familiar practices translate to this probabilistic world.
1. Your Prompts Are Your Code
In traditional development, we version source code religiously. With LLMs, the prompt IS your implementation. A seemingly minor prompt change can alter behavior dramatically, just like changing a RegEx pattern or SQL query.
But there’s a twist: the “runtime” (the model) evolves underneath you. Your model may be deprecated, or you may want to optimize costs with a cheaper model, and suddenly your carefully crafted prompt behaves differently. This is like having your Python interpreter randomly updated in production—except it’s expected behavior.
What this means for developers:
- Version your prompts like source code
- Document why specific wording was chosen
- Plan for model migrations like framework upgrades
- Keep a regression suite to catch behavioral drift
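To make that concrete, here is a minimal sketch of prompt versioning plus a tiny regression suite; the registry shape, model name, and golden cases are illustrative assumptions, not a specific library:

# prompts.py -- prompts live in version control next to the code that uses them
PROMPTS = {
    "extract_email": {
        "version": "2.1.0",
        "model": "gpt-4.1-mini",  # pin the model version you tested against
        "template": (
            "Extract the email address from the text below. "
            "Return only the address, nothing else.\n\nText: {text}"
        ),
        # Document *why* the wording is what it is
        "notes": "v2.1.0: added 'nothing else' to stop chatty preambles.",
    },
}

# test_prompts.py -- a small regression suite to catch behavioral drift
GOLDEN_CASES = [
    ("Contact me at jane@example.com", "jane@example.com"),
    ("my email is john dot doe at company dot com", "john.doe@company.com"),
]

def test_extract_email_regression():
    for text, expected in GOLDEN_CASES:
        assert extract_email_llm(text) == expected

Run the suite whenever the prompt text or the underlying model changes, exactly as you would for a framework upgrade.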
2. Test-Driven Development Becomes Test-Driven Prompt Development
Traditional TDD follows red-green-refactor: write a failing test, make it pass, then improve. With LLMs, the cycle is similar but the assertions differ.
Instead of:
assert extract_age("I am 25 years old") == 25
You write:
result = extract_age_llm("I am 25 years old")
assert isinstance(result, int)
assert 20 <= result <= 30 # reasonable bounds
assert result == 25 # might work, might be brittle
Your tests become evaluations — they grade quality rather than enforce exact outputs. This feels foreign at first, but it’s similar to testing systems with inherent variability:
- UI responsiveness (must load in < 2 seconds)
- Search relevance (top result should be relevant)
- Compression algorithms (output should be smaller than input)
Practical approach:
- Start with property-based tests (“output should be valid JSON”)
- Add golden datasets—curated examples that should always work
- Use similarity metrics where exact matches are too strict
- Run evaluations continuously as models evolve
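A rough sketch of what such an evaluation might look like, reusing extract_age_llm from above; the golden cases, property checks, and pass threshold are assumptions for illustration rather than a specific eval framework:

GOLDEN_DATASET = [
    {"input": "I am 25 years old", "expected": 25},
    {"input": "Turned thirty last week", "expected": 30},
]

def evaluate_extract_age(threshold: float = 0.9) -> float:
    """Grade quality across a dataset instead of asserting a single exact output."""
    passed = 0
    for case in GOLDEN_DATASET:
        result = extract_age_llm(case["input"])
        # Property checks first (type, plausible range), then the golden answer
        ok = isinstance(result, int) and 0 <= result <= 120 and result == case["expected"]
        passed += ok
    score = passed / len(GOLDEN_DATASET)
    assert score >= threshold, f"Eval score {score:.2f} fell below {threshold}"
    return score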
3. Defensive Programming in a Probabilistic World
Remember writing code for unreliable networks? The patterns are remarkably similar. Just as you wouldn’t trust a network call to always succeed, you shouldn’t trust an LLM to always return perfect output.
The graceful degradation pattern:
def extract_data_with_fallback(text: str) -> dict:
    try:
        # Start with the fast, cheap model
        result = gpt41_nano_extract(text)
        if validate_structure(result):
            return result
    except (TimeoutError, RateLimitError):
        # RateLimitError comes from your provider's SDK (e.g. openai.RateLimitError)
        pass

    try:
        # Escalate to more capable model if needed
        result = gpt41_extract(text)
        if validate_structure(result):
            return result
    except (TimeoutError, RateLimitError):
        pass

    # Last resort: simple heuristics or human review
    return manual_review_queue.add(text)
This mirrors patterns you already use:
- CDN → Origin server → Cached fallback
- Automated processing → Manual review → Exception handling
- Real-time data → Recent cache → Historical average
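The validate_structure call above carries a lot of weight. Here is a minimal sketch of what it might check, with the required keys chosen purely for illustration:

REQUIRED_KEYS = {"name", "email", "amount"}  # illustrative schema

def validate_structure(result) -> bool:
    """Cheap structural checks before trusting an LLM's output."""
    if not isinstance(result, dict):
        return False
    if not REQUIRED_KEYS.issubset(result):
        return False
    # Field-level sanity checks
    amount = result.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0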
4. Human-in-the-Loop: From Exception to Rule
In traditional software, human intervention is often the exception—what happens when automation fails. With LLMs, HITL (Human-in-the-Loop) becomes a first-class pattern, not just a fallback.
Consider how this changes the development cycle:
def process_with_confidence(text: str) -> dict:
    result = llm_extract(text)
    confidence = calculate_confidence(result)

    if confidence < 0.8:
        # Route to human review
        human_result = human_review_queue.add(text, result)
        # Crucially: add to training data
        add_to_golden_dataset(text, human_result)
        return human_result

    return result
The key insight: humans aren’t just fixing errors—they’re continuously improving your evaluation suite. Each human correction becomes a test case, gradually raising the automation ceiling. This is different from traditional software where human intervention rarely feeds back into the test suite automatically.
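One way that feedback loop can look in practice, assuming a simple JSON Lines file as the golden dataset store (the path and record shape are illustrative):

import json
from datetime import datetime, timezone

GOLDEN_PATH = "golden_dataset.jsonl"  # illustrative location

def add_to_golden_dataset(text: str, human_result: dict) -> None:
    """Persist each human correction as a future regression/eval case."""
    record = {
        "input": text,
        "expected": human_result,
        "source": "human_review",
        "added_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(GOLDEN_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")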
5. Composition: Building Complex Systems from Simple Functions
One common mistake is creating “god prompts” that do everything—the LLM equivalent of a 500-line function. Instead, embrace composition:
def process_customer_email(email: str) -> dict:
    # Each step is a focused, testable function
    metadata = extract_email_metadata(email)
    sentiment = analyze_sentiment(email)
    intent = classify_intent(email)

    # Combine the results
    return {
        "metadata": metadata,
        "sentiment": sentiment,
        "intent": intent,
        "priority": calculate_priority(sentiment, intent)
    }
Each function has a single purpose, can be tested independently, improved without affecting others, and can even use different models or non-LLM implementations. This is just good software design, applied to probabilistic components.
6. Observability: You Can’t Step Through an LLM with a Debugger
When traditional functions fail, you can step through with a debugger. With LLMs, you need different tools:
Log everything (with PII awareness):
- Input prompt (with sensitive data masked)
- Model name and version
- Temperature and parameters
- Raw output
- Parsed/validated output
- Latency and token usage
Tip: Hash or redact emails, IDs, and other personal data before writing logs to disk.
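A minimal sketch of what one such log record could contain; mask_pii and log_llm_call are hypothetical helpers, not part of any particular SDK:

import hashlib
import json
import logging
import re
import time

logger = logging.getLogger("llm_calls")

def mask_pii(text: str) -> str:
    # Placeholder redaction; swap in whatever masking your compliance rules require
    return re.sub(r'[\w\.-]+@[\w\.-]+', "<email>", text)

def log_llm_call(prompt, model, params, raw_output, parsed_output, started_at, usage):
    logger.info(json.dumps({
        "prompt_masked": mask_pii(prompt),
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "model": model,
        "params": params,            # temperature, max_tokens, ...
        "raw_output": mask_pii(raw_output),
        "parsed_output": parsed_output,
        "latency_ms": round((time.time() - started_at) * 1000),
        "usage": usage,              # prompt/completion token counts
    }, default=str))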
Monitor for drift:
- Track validator success rates
- Monitor output format compliance
- Watch for semantic drift in responses
- Set up alerts for cost spikes or unusual patterns
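One lightweight way to track this is a sliding window of validation outcomes with an alert threshold; the window size and threshold below are arbitrary assumptions:

from collections import deque

class DriftMonitor:
    """Tracks recent validator outcomes and flags drops in the pass rate."""

    def __init__(self, window: int = 500, alert_below: float = 0.95):
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, passed: bool) -> None:
        self.outcomes.append(passed)

    def pass_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noisy cold starts
        return len(self.outcomes) == self.outcomes.maxlen and self.pass_rate() < self.alert_below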
Common Pitfalls and How to Address Them
Pitfall 1: “Let me have a conversation with the API”
Symptom: Prompts that say “Please analyze this and tell me about…”
Fix: Define the exact output format, e.g. Return only: {"summary": "...", "category": "..."}
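A sketch of one way to pin down that contract on both sides of the call; the prompt wording and keys are illustrative:

import json

def build_prompt(ticket: str) -> str:
    return (
        "Summarize the ticket below.\n"
        'Return only JSON in exactly this shape: {"summary": "...", "category": "..."}\n\n'
        f"Ticket: {ticket}"
    )

def parse_response(raw: str) -> dict:
    data = json.loads(raw)                       # fails loudly on chatty preambles
    assert set(data) == {"summary", "category"}  # enforce the exact contract
    return data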
Pitfall 2: “It worked in testing!”
Symptom: Surprise when production behavior differs.
Fix: Test against multiple model versions; build in graceful degradation.
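A small sketch of what testing against multiple model versions can mean, using pytest parametrization and assuming extract_email_llm accepts a model argument (the model names are placeholders):

import pytest

MODELS = ["gpt-4.1-nano", "gpt-4.1", "a-cheaper-candidate-model"]  # placeholders

@pytest.mark.parametrize("model", MODELS)
def test_extract_email_across_models(model):
    result = extract_email_llm("Contact me at jane@example.com", model=model)
    assert result == "jane@example.com"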
Pitfall 3: “LLMs are magic/useless”
Symptom: Either over-trusting or completely dismissing LLMs.
Fix: Understand them as powerful but imperfect tools, like OCR or speech recognition.
The Path Forward
The mental shift from “chatbot” to “fuzzy function” is key to building robust systems with LLMs. It’s not about abandoning what we know—it’s about adapting proven practices to probabilistic components.
These adaptations — evaluations instead of assertions, confidence thresholds instead of binary success, human-in-the-loop as a feature not a bug — prepare you for the next level of AI development. Once you’re comfortable with LLMs as functions, you’re ready to compose them into agents, workflows, and increasingly sophisticated systems.
The future of software isn’t deterministic OR probabilistic—it’s both, working together. Your existing engineering skills are the foundation. The probabilistic twist is just that—a twist, not a rewrite.
Start small. Find one text transformation in your current system that could benefit from some fuzziness. Wrap it in a function. Test it like you would any external service. Monitor it in production. Let humans improve it. Then build from there.
Your decades of engineering wisdom still apply. You just have a new, powerful, slightly unpredictable tool in your toolkit.