When The Copilot Forgets
Why Understanding Enterprise AI Limits Can Be As Important As Knowing Its Strengths
It’s mid-2025, so of course here’s yet another article banging on about AI in the workplace. But at this point, that’s because it really is here, albeit in its early incarnations, and nowhere is that more obvious than in the rise of enterprise AI copilots. In large organisations, it’s already becoming the new norm to have copilots in pretty much every major application or platform used in the back office. Tools like Microsoft Copilot have slipped into the daily apps many of us use (Outlook, Word, Teams), and they’re already summarising meetings, assisting with drafting, pulling insights from documents, and doing it all through an intuitive natural language interface.
But alongside all that promise, we’re already encountering some issues and risks. One such issue that hit me between the eyes recently is the risk that comes from using these tools beyond their limits because we don’t fully understand where those limits lie.
When AI Doesn’t Deliver, Even When You Do Everything Right
I recently ran into this exact problem. After a 90-minute internal meeting covering different types of contracts and their respective processes, I used Microsoft Copilot to help generate a follow-up report.
This wasn’t a naive first test or casual experiment. I used a detailed, structured prompt consistent with best-practice guidance, the same one I had successfully used in several prior meetings of the same nature. The only difference was the specifics of what was discussed.
The meeting itself wasn’t highly technical, but it was nuanced. And while Copilot’s output looked impressive at first glance, it quickly became clear that it had missed or ignored critical portions of the conversation.
In several cases, when I prompted Copilot to recall something that I knew had been discussed, it replied along the lines of: “That wasn’t covered in the meeting.” And only after multiple follow-ups where I asked things like “Are you sure that X wasn’t covered? Didn’t someone say Y and Z about X?” and in one case quoted the transcript directly in my prompt, did it acknowledge: “You’re right, that was discussed, and here’s what was said…”.
Oh erm, shit. I thought you had this covered Copilot, I believed in you, and now I’m a tad nervous you’re not across all the detail.
Not a Bug
This wasn’t a glitch per se. It was a fundamental reminder that LLMs are generative in nature, not record-keeping systems with verbatim recall. General-purpose LLMs are designed to provide fluent, helpful responses, not necessarily to perfectly extract or reproduce facts.
But that nuance is often lost in how these tools are described. One of the most persistent metaphors has been the idea that tools such as Microsoft Copilot are like “a junior lawyer who never sleeps, has read everything, and is eager to please.” It’s a simple and evocative explanation, but it’s somewhat misleading.
Because, unlike a junior team member tackling this kind of task, the LLM:
Doesn’t know or vocalise when it’s unsure.
Doesn’t keep track of source fidelity.
Isn’t optimising for accuracy; it’s optimising for linguistic coherence and plausibility.
This oversimplification feeds into binary thinking that can become insidious for users with low-to-mid tech literacy. If the AI got it right last time, surely it will get it right this time. If it works for a 15-minute team meeting, it should work for a 90-minute client meeting or contract negotiation. They’re both just recorded conversations with a transcript for the LLM to reference. So why shouldn’t it just work?
With this mindset, we forget that these tools need to be deployed thoughtfully if we want reliable outcomes. It’s like expecting your household electric lawnmower to handle a 100-acre farm: the scale, context, and demands are completely different. AI copilots work best within certain boundaries, and when we push them into bigger or more complex tasks without adjusting our expectations, the cracks start to show. We need to ask not just “Can it do this?” but “How does it work, where does it struggle, and what changes when we scale up or shift the task?”
We treat these tools as if they have intent and reliability, rather than remembering that, primarily, under the hood, they are probabilistic language engines.
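To make “probabilistic language engine” a little more concrete, here’s a deliberately toy sketch (this is not how Copilot works internally, just an illustration of the principle): a system that produces text by sampling the next word from a probability distribution will generate something fluent and confident every time, whether or not it corresponds to any record of what was actually said.

```python
import random

# A deliberately toy "language model": it picks the next word by sampling
# from a hand-written probability table, so the output sounds fluent but
# is not tied to any record of what was actually discussed.
NEXT_WORD = {
    "<start>":   {"the": 1.0},
    "the":       {"indemnity": 0.4, "liability": 0.4, "warranty": 0.2},
    "indemnity": {"clause": 1.0},
    "liability": {"cap": 1.0},
    "warranty":  {"period": 1.0},
    "clause":    {"was": 1.0},
    "cap":       {"was": 1.0},
    "period":    {"was": 1.0},
    "was":       {"not": 0.3, "briefly": 0.7},
    "not":       {"discussed.": 1.0},
    "briefly":   {"discussed.": 1.0},
}

def generate(seed=None):
    """Sample one fluent-sounding sentence from the toy model."""
    rng = random.Random(seed)
    word, out = "<start>", []
    while word in NEXT_WORD:
        options = NEXT_WORD[word]
        word = rng.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    return " ".join(out)

# The same "prompt", three runs, three different but equally confident answers.
for _ in range(3):
    print(generate())
```

Run it a few times and you get different, equally assured claims about what “was discussed”, which is exactly the failure mode that makes a confident-sounding summary risky.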
Why Copilot Struggled With My Meeting
Several factors likely contributed to the tool’s failure in this case:
Length and Complexity: The meeting was longer and more nuanced than typical calls, and LLM performance tends to degrade as inputs become more complex or lengthy (one mitigation, chunking, is sketched after this list).
Semantic Similarity: The discussion involved multiple contract types that sounded similar but were substantively different. LLMs often flatten nuance and misrepresent subtle distinctions.
Transcript Limitations: Copilot relies heavily on the quality of the Teams transcript. Overlaps in speech, inconsistent phrasing, or unclear labelling can severely impact fidelity.
Delayed Recall Issues: Copilot was less effective when asked about details that weren’t mentioned recently in the conversation or summary window, so prompting it didn’t always help unless I was very specific.
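There are ways to reduce these risks. One common mitigation for the length problem, sketched roughly below (the chunk sizes are arbitrary and summarise() is a placeholder for whatever model or tool you actually use, not a Copilot API), is to split a long transcript into overlapping chunks and summarise each one separately, so no single request has to hold the entire 90 minutes in view and gaps are easier to spot.

```python
def chunk_transcript(transcript: str, max_words: int = 1500, overlap: int = 150):
    """Split a long transcript into overlapping word-based chunks so no
    single summarisation request has to cover the whole meeting."""
    words = transcript.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

def summarise(chunk: str) -> str:
    # Placeholder: call whatever summarisation model or tool you actually use.
    raise NotImplementedError

def summarise_long_meeting(transcript: str) -> list[str]:
    # Per-chunk summaries keep each request small, and an omission is then
    # confined to one section of the meeting rather than lost in a single
    # end-to-end summary.
    return [summarise(chunk) for chunk in chunk_transcript(transcript)]
```

You still have to review the per-chunk notes and stitch them together, but at least each section of the meeting gets its own pass rather than competing for attention in one long request.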
The most concerning part from a busy user’s perspective? It was all presented confidently. And because I had mentally “outsourced” that part of the task to the AI, I almost missed those errors too.
Not All AI Is Built the Same: General-Purpose vs Task-Specific Tools
Whilst my focus here is the enterprise AI that most large businesses now have easy access to, it’s also worth reflecting on something that often gets overlooked when people discuss AI tooling: the difference between broad platform-based copilots and purpose-built, task-specific applications.
Tools like Microsoft Copilot are designed to be multi-purpose, embedded across a range of enterprise contexts, from Excel formulas to email drafts to meeting summaries. That general-purpose nature is a strength, but it also comes with trade-offs: more probabilistic responses, more variability, and the need for the user to apply judgment and structure around its outputs.
By contrast, tools designed for a single task often perform more reliably, simply because their constraints are tighter and more thoughtfully engineered. A good example from our workflow is our use of Granola for preparing and reviewing episodes of our Law://WhatsNext podcast. Granola is designed specifically for meeting transcription and summarisation, and it handles episode prep and post-recording notes far more reliably than I’ve ever seen from Copilot, mainly because the application is tailored for that job and has been developed in a way that hard-codes out the uncertainty wherever possible.
We trust Granola in that setting not because it’s “smarter,” but because the job is narrower, and the tool is built around that narrow job.
Matching the Tool to the Task: Context Matters
On a related note, another important nuance: not every use of AI demands the same level of accuracy or certainty.
When we use AI tools for podcast research or prep, whether it’s summarising a guest’s blog post with Claude, running an outline through Granola, or generating show notes in Riverside, we’re happy to embrace a bit of roughness. These are ancillary tasks, not mission-critical ones. A bit of hallucination or fluff doesn’t hurt anything. In fact, there can be a degree of fun or creative flexibility in seeing where the tools go.
But that tolerance for error doesn’t carry over to tasks like summarising a three-hour legal negotiation or drafting a report on contractual processes. In those contexts, the output is the product, not just support material. The room for ambiguity narrows. And the requirement for confidence in accuracy rises.
That’s why understanding tool intent, what it's built for, where it’s strong, and where it’s inherently fuzzy, is so essential.
Just because something works brilliantly in one domain doesn’t mean it’s fit for another.
Cognitive Offloading: The Human Blind Spot
My Copilot experience revealed another more subtle issue: how we, as users, change our own behaviour when AI is involved.
The MIT Media Lab’s recent study, Your Brain on ChatGPT, found that people engage less cognitively when using LLMs. Brain scans showed reduced activity in memory and reasoning regions. In other words, when you expect AI to remember for you, you stop remembering yourself.
Fortunately, I reviewed the Copilot output immediately after the meeting, while my memory was fresh, and the call had followed a defined structure, so the reply “that wasn’t covered in the meeting” jumped out at me as something that couldn’t be right. But had I waited even a day or two, or had the meeting been less structured, I might not have caught it. I might’ve accepted the summary as complete, accurate, and ready to share.
The risk here isn’t just technical; it’s cognitive. The more we rely on AI to think for us, the less we think alongside it, so the human-in-the-loop safety net starts to get significant holes in it.
The Overreliance Trap: From Resistance to Blind Trust
In legal and enterprise settings, many professionals go through a typical digital adoption curve: initial resistance, followed by enthusiastic acceptance, and eventually, overconfidence.
We start small: Copilot helps summarise a 15-minute sync or team meeting. It does well. We use it again. Eventually, we assume that if it can do X, it can do 10X. We apply the same tool to a three-hour negotiation or a detailed process planning call, without reconsidering whether the context or complexity might push it beyond its capabilities.
But the tools don’t warn us when they’re struggling. They still speak with authority. And unless you’re reviewing outputs with care, or still holding a mental map of the original discussion, you may not catch the misses.
Training: The Silent Crisis
One of the most under-addressed issues in AI adoption is the training paradox. Because tools like Copilot are intuitive, respond to natural language, and are embedded in familiar platforms like Word or Teams, most users skip formal training altogether.
A 2025 study titled Beyond Training: Social Dynamics of AI Adoption in Industry revealed that 9 out of 10 users ignored official Copilot training, instead relying on trial and error or informal peer learning. While that works for discovering features, it doesn’t help users understand:
When and why tools fail
How to prompt for precision (an example template follows this list)
What to verify before acting on AI-generated output
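On the second point, here is one hypothetical precision-first template (illustrative wording only, not official guidance for any particular tool); the aim is to force grounding in the transcript and to make “I can’t find it” an acceptable answer.

```python
# A hypothetical precision-first prompt template; adapt the wording to
# your own tool, matter, and house style.
PRECISION_PROMPT = """\
Summarise the attached meeting transcript.

Rules:
1. Only state points that appear in the transcript, and quote or closely
   paraphrase the relevant passage for each point.
2. If you cannot find something in the transcript, say "not found in the
   transcript" rather than guessing.
3. List each contract type discussed and, for each one, the process steps
   that were agreed.
4. Flag any parts of the transcript that were unclear, overlapping, or
   attributed to the wrong speaker.
"""
```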
This blind spot is amplified in legal teams, where digital literacy varies, and where the cost of error can be high. When we don’t know the boundaries of a tool, we inevitably overstep them; usually without realising it.
Some Thoughts on How to Stay Grounded with AI Tools
So how do we keep the benefits of AI copilots without falling into these traps?
Train for Limits, Not Just Features
Prioritise awareness of known failure points and breakdown conditions, especially around summarisation, long inputs, and sensitive content.
Treat AI Outputs as Drafts, Not Truths
Copilot’s tone is confident, but that doesn’t mean it’s right. Use its output as a starting point, not a final answer.
Keep Light Mental Anchors
Even when delegating to AI, stay mentally present in meetings. Flag two or three key points you expect to see later in the summary, and check whether they appear (a small checklist sketch follows this list).
Recognise Scope Creep
Don’t assume success on a short call means readiness for a multi-hour negotiation. Evaluate each use case on its own merit.
Create a Culture of Prompting and Probing
Build habits that encourage users to ask follow-ups, test assumptions, and challenge overly polished outputs.
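And on the mental anchors point, even a crude check helps. A minimal sketch is below, assuming you jotted a few anchor phrases down during the call and exported the copilot’s summary to a file (both the phrases and the file name here are made up); naive keyword matching will miss paraphrases, so treat a “missing” flag as a cue to re-check the transcript, not as proof of an omission.

```python
# Hypothetical anchor phrases noted down during the meeting.
ANCHORS = [
    "framework agreement renewal",
    "indemnity cap",
    "escalation path for disputed invoices",
]

def missing_anchors(summary: str, anchors: list[str]) -> list[str]:
    """Return the anchor phrases that don't appear in the AI-generated
    summary, as a prompt to go back to the transcript before sharing."""
    text = summary.lower()
    return [phrase for phrase in anchors if phrase.lower() not in text]

# Hypothetical file name for the exported copilot summary.
with open("copilot_summary.txt", encoding="utf-8") as f:
    missing = missing_anchors(f.read(), ANCHORS)

if missing:
    print("Re-check the transcript for:", ", ".join(missing))
```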
Conclusion: Understand the Tool or Risk Being Misled by It
Enterprise AI is here to stay. And tools like Copilot can genuinely enhance productivity when used well. But if we continue to treat them like magical assistants, rather than language models with specific strengths and clear weaknesses, we risk losing more than we gain.
Understanding what these tools can’t do is just as important as knowing what they can. That doesn’t mean we stop using them. It means we start using them more intentionally. It means making space for learning, asking better questions, and knowing when to step in and verify.
The horse has bolted. AI tools like Copilot will increasingly be part of how legal teams operate. The opportunity now is to help people become not just users of AI, but competent users. People who understand the benefits and the boundaries.
These tools work best when paired with informed oversight.
If you’re interested in how Tom & I have set up the podcast (and the workflows around it), let us know; we’re happy to share here or privately!