Role: Sole Conversation Designer |
Responsibilities: Diagnose AI behavioral inconsistencies, design a consistent behavioral system, and establish shared standards across teams. |
Collaborators: 10+ teams including product, data science, conversation teams, accessibility, legal, and engineering. |
Timelines: Q4, 2025 |
Overview
Oracle Health's clinical AI agent was live across multiple workflows and entry points, but without a defined behavioral system, it produced inconsistent responses that eroded clinician trust. This project established the conversational and behavioral foundation that made the AI predictable, clinically appropriate, and trustworthy at scale.
Goals
Make AI behavior consistent and predictable across the product
Earn and maintain clinician trust through transparent, reliable responses
Establish a shared behavioral standard that scales across teams and workflows
Ensure the AI communicates in a way that is clinically appropriate and safe
The Problem
Oracle Health's clinical AI agent supported clinicians across a wide range of tasks and workflows. The same clinical question could return different answers from the AI at different times. Not always wrong, just unpredictable. In a clinical environment, unpredictable AI behavior is a trust problem. Clinicians depend on reliable information to make fast, confident decisions. When the AI became inconsistent, it lost their trust.

Research
Before jumping into solutions, I interviewed clinicians to understand their needs and expectations around the AI agent. The goal was to ground the design in real user behavior, not assumptions.
Key insights:
Efficiency over everything. Clinicians are busy. They need fast, accurate answers they can act on immediately. Long, verbose responses added cognitive load at exactly the wrong moment.
Scannability matters. Responses that were dense or repetitive got ignored. Clinicians needed to extract information at a glance, not read through paragraphs to find the answer.
Tone should be clinical and professional. Overly formal or overly friendly language felt out of place. Clinicians wanted the AI to communicate the way a knowledgeable colleague would: clear, direct, and to the point.
Trust is built on transparency. Clinicians needed to know where information came from and why it was surfaced. An answer without a source or rationale was an answer they could not rely on. Hedging language like "it looks like" or "I think" undermined confidence rather than building it.
Clinical decisions belong to clinicians. There was a clear and consistent expectation that the AI supports decision-making, not replaces it. Any response that felt like a clinical recommendation created discomfort and distrust.
When things go wrong, clarity is critical. Clinicians need consistent, clear guidance when the AI cannot help or something fails. Leaving them to guess what happened or what to do next was not acceptable.
These insights helped shape the prompt architecture, voice and behavior guidelines, and confidence signaling approach in the design.
Diagnosing the Failure
With research in hand, I audited conversations across the main spaces: chart review, Q&A, and main workflows. The goal was to understand exactly where and how the AI was falling short. Key findings:
No shared scope definition. There were no clear boundaries defining what the AI should and shouldn't handle. Without that, each context behaved differently, and clinicians had no consistent mental model of what the AI was actually for.
No confidence calibration. The AI expressed the same certainty whether it was retrieving a documented chart fact or generating a synthesized response. Clinicians couldn't tell the difference, and in healthcare, that distinction matters.
Inconsistent voice and behavior. Without defined guidelines, the AI communicated differently across experiences. Tone, structure, and style shifted depending on where clinicians interacted with it, creating a fragmented experience that was hard to learn and harder to trust.
No standard for handling limits. When the AI couldn't answer a question, the experience fell apart. Some areas redirected cleanly, others returned irrelevant or confusing responses, and a few generated answers that had no business being there. The inconsistency made it clear there was no intentional design behind how the AI handled its own limitations.
The findings reinforced what the research surfaced: the AI had no defined behavioral system to work from. That became the focus of the design.
The Design
With research insights and audit findings combined, I designed a five-layer behavioral system. Each layer addressed a specific dimension of the consistency problem, and together they formed the foundation for how the AI should behave across every context.
Layer 1: System Prompt Architecture
The first step was translating the behavioral rules into a structured prompt architecture that engineers and AI/ML could implement consistently across every entry point.
Four functional zones:
Role and identity.
A precise definition of what the AI is doing in this context. Not a personality description but a behavioral contract. Includes the never-attempt list in explicit language.
Input handling rules.
Defined expected behavior for each input type: direct factual question, ambiguous query, boundary question, out-of-scope question. Behavior specs, not response scripts.
Confidence and uncertainty signaling.
Three tiers: high confidence (retrieved, verifiable), moderate confidence (synthesized, flagged), out-of-scope (redirect, no attempt). Each tier has a defined behavior so the AI can express it naturally while staying consistent.
Escalation and handoff rules.
What the AI does when it can't help. Every dead end has a defined next step: a resource, a workflow path, or a point of contact, based on why it couldn't help.
Layer 2: Voice and Behavior Guidelines
Research made it clear that tone and communication style had a direct impact on trust. Clinicians needed the AI to feel like a reliable clinical tool, not a consumer chatbot. I drafted a detailed set of voice and behavior guidelines covering two areas: safety and interaction behavior, and tone and communication style.
The guidelines were written as explicit rules with prohibited and acceptable examples, so there was no room for interpretation across teams. Key rules included:
Stay within clinical boundaries. The AI never diagnoses, interprets clinical results, or suggests treatment. It never impersonates a clinician or expresses personal opinions about care decisions. Every response had to be traceable to a source, not inferred or assumed.
Handle incomplete or uncertain information explicitly. Partial data had to be flagged as incomplete. Relative time references without anchor dates were not acceptable. If the AI didn't know something, it said so clearly rather than guessing.
Always give a path forward. Rejecting a request without explaining why or offering a next step was not acceptable. Dead ends erode trust faster than wrong answers.
Avoid repeating the same response. If a clinician asked a follow-up question and got the identical response, that was a failure. The AI had to recognize when a different approach was needed.
Communicate like a clinical tool, not a chatbot. No casual greetings, filler phrases, or social chatter. No humor, sarcasm, or emojis. No exclamation marks. No directive or commanding language. No over-apologizing for system limitations. Responses had to be direct, scannable, and free of hedging language.
These guidelines were applied across all interfaces and fed directly into the system prompt architecture.
Layer 3: Response Standards for Scope Boundaries
Each team had defined their own scope boundaries, but there was no shared standard for how the AI should respond when it reached them. That gap was where the inconsistency lived.
I worked with the cross-functional team to define a consistent response standard that every area had to follow. The standard had three requirements: decline clearly without ambiguity, explain why in plain language, and always provide a useful next step. Leaving clinicians with a dead end was not acceptable.
This gave clinicians a predictable experience at the boundaries regardless of which part of the product they were in. They always knew what to expect when the AI couldn't help, and they always had somewhere to go next.
Layer 4: Error and Recovery Patterns
Good behavioral rules need defined failure handling. I built a named pattern library for the failure states that appeared most in the audit. Naming matters because it gives the team shared vocabulary, and means future decisions can reference existing patterns instead of redesigning from scratch.
The Boundary Pattern Trigger: question outside defined scope. Behavior: decline clearly, explain why in one sentence, offer the most relevant alternative. No partial attempts. No excessive apology. Acknowledge and redirect, efficiently.
The Uncertainty Signal Pattern Trigger: AI is synthesizing or inferring rather than retrieving. Behavior: complete the response, surface the confidence tier as a clear signal. Not buried in hedging language, not at the end. Consistent enough that clinicians learn to read it.
The Recovery Pattern Trigger: follow-up signals the previous response didn't land, including when a clinician repeats the same question. Behavior: do not repeat the same response. Acknowledge the gap, offer a different approach (simpler explanation, more specific path, direct handoff) and give the clinician a clear choice. A repeated question is a signal the first response failed, not that the question changed.
The Conflict Pattern Trigger: query where available data contains conflicting information. Behavior: surface both data points explicitly. Do not resolve the conflict autonomously. In clinical contexts, data conflicts are always a human decision.
Layer 5: Design Validation
Before the design was finalized, I ran validation sessions with clinicians and the cross-functional team to test whether the behavioral system held up in practice.
With clinicians, the focus was on whether the AI's responses felt clear, trustworthy, consistent, and appropriate. The sessions surfaced a significant pattern: many of the error states the AI could produce were technical in nature, reflecting system or model-level failures that meant nothing to a clinician mid-workflow. Exposing that language directly created confusion and eroded trust.
The solution was to design a layer of generic, user-facing error messages that translated technical failures into clear, actionable responses. Clinicians didn't need to know why the AI failed. They needed to know what to do next.
Validation with the team focused on whether the behavioral rules were implementable and whether the logic held up across edge cases. Several gaps were identified and fed back into the design before implementation.
Cross-Team Adoption
A behavioral system only works if the whole team builds from it. I packaged the design as a working reference for three audiences:
Data scientists: system prompt architecture with annotated rationale
Product and engineering: behavioral rules and decision standards
QA: validation findings and audit criteria
I ran a working session with the cross-functional team to walk through the design and establish shared vocabulary. The pattern names were finalized in that session. Named patterns only work if everyone uses the same names.
The design became the baseline for subsequent implementations. New flows were designed against it rather than from scratch.
Impact
Consistency
86% of audited conversations followed the defined response standards post-implementation
Trust and usability
Reduction in clinician-reported confusion around AI responses by 42%
Reduction in "why did it say that" escalations by 29%
Error handling
92% of technical error messages replaced with user-facing responses
Reduction in dead-end responses (no next step offered) from 63% to 9%
Deliverables
System prompt architecture: four-zone structure with annotated rationale.
Voice and behavior guidelines: a comprehensive set of interaction and communication rules that became the behavioral baseline for the Oracle Clinical AI Agent persona.
Response standards for scope boundaries: consistent decline and redirect behavior across all entry points.
Error and recovery pattern library: four named patterns with trigger conditions and behavioral specs.
Design validation findings: session insights and iterations fed back into the design.
Team reference guide: packaged for data scientists, product/engineering, and QA.