Why CFOs Don’t Trust AI, and What Would Actually Change That

In July 2025, Deloitte Australia delivered a 237-page report to the Department of Employment and Workplace Relations. The contract was worth about US$290,000. Nothing odd, right? Yet, by October, Deloitte was issuing a refund.

A University of Sydney researcher had spotted that the report contained citations to academic papers that did not exist, a quoted passage from a Federal Court judgment that had never been written, and references to court cases that were entirely fictional. The revised version of the report quietly disclosed, in a new footnote, that an LLM had been used in its preparation.

This wasn’t a scrappy startup or an unsupervised intern. It was a Big Four firm, charging premium consulting rates, producing analysis for a government client. Only, the work product was, in places, invented.

This is the story every CFO has been carrying in the back of their mind whenever someone walks into the office with a deck that says “AI-powered.” It is also the story that explains why, even as, according to Kyriba’s 2025 CFO Survey, 96% of CFOs say AI integration is a priority, only 59% are actually using it in their finance function, per Gartner’s 2025 AI in Finance Survey. Adoption has been roughly flat for two years, after a sharp jump from 37% in 2023.

The conventional explanation is that CFOs are risk-averse. The actual explanation is more interesting, and worth understanding if you want to know what would change.

The boardroom-safe answer

If you ask a CFO why they’re cautious about AI in finance, you’ll get a tidy answer. Security. Privacy. Data residency. Vendor lock-in. The Kyriba survey put 76% CFOs citing security and privacy concerns as a main barrier. That answer is true. It is also, in many cases, not the real reason.

The real reason is harder to articulate, which is why it tends to come out as something else. It goes roughly like this. A CFO is the last person in the company whose signature ends up on a number. If the number is wrong, the consequence sits with them, not with the system that produced it. Auditors will not accept “the model said so” as a defense. Boards will not accept it. Regulators certainly will not. Yet most AI tools currently being marketed to finance teams produce outputs whose correctness cannot be verified by inspection.

The skepticism isn’t about whether AI is impressive. CFOs have watched AI write surprisingly competent first drafts of board memos. They’ve seen it summarize ten-year filings in seconds. The skepticism is about a much narrower question, which is whether you can rely on a number that came out of a system you cannot fully trace.

This is the question that doesn’t get asked in pilots. Pilots ask: did the tool produce something useful? They rarely ask the harder one: would you defend this number under questioning, and on what basis?

Probability is the wrong primitive

Accounting is a deterministic discipline.

Given the same inputs and the same rules, two finance teams should arrive at the same numbers. If they don’t, someone has made an error, and the error is the kind of thing that gets surfaced, corrected, and reconciled. The entire architecture of finance, from GAAP to IFRS to the internal controls framework you write up for SOX, assumes that numbers are reproducible.

Large language models are not deterministic in this sense. Outputs can drift across runs, across model updates, across the specific phrasing of a prompt. More importantly, the way they produce a number is not by following a rule. It is by predicting what number would plausibly follow the preceding context. Those are very different operations, and the difference matters in ways that don’t show up, until they do.

A model that estimates next month’s revenue by drawing on statistical patterns in historical text is not doing the same thing as a system that calculates next month’s revenue by applying a rule to actual data. Both can produce a number that looks right. Only one of them produces a number you can defend. The Deloitte incident is the clearest possible illustration. The model generated content that looked exactly like the kind of citation an expert would provide. None of the citations were real. The output was indistinguishable from competent work until someone bothered to check.

For most use cases in the world, plausibility is good enough. For finance, plausibility is the failure mode.

A CFO’s job, in the part where their judgment actually counts, is to notice when a number that looks plausible is wrong. A tool that produces plausible numbers very fast doesn’t help with that. In fact, it makes things more difficult because it dresses up uncertainty in the visual language of certainty.

What “auditable” actually means

Almost every AI tool sold into finance now claims to be auditable. The word has been so thoroughly drained of meaning that it tells you almost nothing about what the tool actually does. There are at least three different things people mean when they say “auditable,” and only one of them is what a CFO needs.

The first level is logging. The system records what was asked, what was retrieved, and what was returned. This is the minimum, and it’s what most “auditable AI” actually delivers. It’s useful for forensic review after something goes wrong. It is not a control. It is a record of how the failure happened.

The second level is source attribution. The system not only logs the interaction but also tells you which documents or data sources informed the output. This is better, but it has a subtle weakness. Knowing the source doesn’t mean the model used it correctly. A model can cite a real document and still misinterpret it, conflate it with another, or invent specifics the source didn’t contain. The Deloitte report cited a Federal Court judgment. The judgment existed. The quote attributed to it did not.

The third level, and the only one that meets the bar a CFO actually needs, is deterministic traceability. Every number in the final output can be traced backward through a chain of rule-based transformations to the underlying source rows. The logic is inspectable. The same inputs always produce the same outputs. If you change a definition, you can see exactly which numbers change and why. There is no point in the pipeline where a probabilistic step decides what the answer was.

Most AI tools today sit at level one, occasionally at level two, and market themselves as if they were at level three. The language for all three sounds identical from twenty feet away. Distinguishing them requires asking specific questions that most pilots don’t get around to. Things like: if I run the same query twice, do I get the same result? Can you show me, for any number in this report, the exact SQL or rule that produced it? If a metric definition changes, what’s the propagation path?

These aren’t exotic questions. They’re the questions an internal auditor would ask. They are almost never asked during AI procurement.

When traceability is real, and when it’s theater

There’s a difference between a tool that can show you its work and a tool that did the work in a way worth showing.

Consider two systems that both produce an MRR retention curve. The first uses a model to read your raw subscription data, interpret what each row means, and generate a chart. When asked how it arrived at the numbers, it produces a fluent natural-language explanation of its reasoning. The explanation may even be accurate. But it was generated after the fact, by the same probabilistic system that produced the numbers. The traceability is a story the model tells about its own work, and stories are exactly the kind of thing models are good at producing whether or not they are true.

The second system uses defined logic. Subscription events are classified by deterministic rules. Cohorts are constructed by a SQL query you can inspect. The retention calculation is a specific formula applied consistently. When asked how the number was derived, the system shows you the rule and the rows. The narrative layer, if there is one, describes work that has already been done deterministically. It does not justify the work in the same breath as producing it.

The first system feels more sophisticated. It can answer questions in plain language and explain its reasoning. The second system is the one a CFO can actually use, because the audit trail it produces is the audit trail of the actual computation, not a plausible reconstruction of it. The first system’s reasoning is a hallucination risk. The second system’s reasoning is a property of the code.

This distinction is what separates AI tools that finance teams quietly stop using after the pilot from the ones that survive. It is also the distinction most procurement processes are not equipped to detect, because the tools that fail at it are usually better at the demo.

Where the model actually belongs

None of this is an argument against using AI in finance. The 67% of finance leaders who say they’re more optimistic about AI this year than last year are not wrong. The argument is about which parts of the work AI should be doing in the first place.

There is a clean line. On one side are tasks where the answer needs to be exactly right and defensible: calculating MRR, building a P&L, computing a variance, generating the underlying data for a QBR. These should run deterministically, on rules and SQL, against your actual data. There is no good reason to introduce probability into a calculation that doesn’t need it. The work of finance is, in its core, the work of getting numbers right and being able to show why.

On the other side are tasks where probability is acceptable and often useful: drafting the narrative summary, suggesting which anomalies look worth flagging, proposing an explanation for a variance the user can then verify. These are interpretation tasks. The cost of a small error is low, the speed gain is real, and the user is going to read the output before it leaves the building anyway. A model that drafts a paragraph explaining why bookings dropped, grounded in numbers that were produced deterministically, is doing something useful. A model that calculated the bookings number itself and then explained the result is doing something dangerous, even if the explanation sounds excellent.

This is the architecture AskEnola is built around. The analytics layer is deterministic. SQL queries run against your actual warehouse. Business logic is encoded in rules you can inspect, including the BADIR-style structured analytical patterns that come pre-built. The semantic layer that defines what MRR or churn or gross margin means in your business is a stable artifact, not something the model reinvents on each query. The model’s role is to translate your question into the right computation and write the narrative around the result. It does not get to decide what the numbers are.

The phrase that sometimes gets used for this is “zero hallucination,” which is a strong claim worth interrogating carefully. It does not mean the model never produces text that contains an error. It means the numbers are not produced by the model in the first place, so they cannot be hallucinated. If a CFO asks what Q3 ARR was, the system runs a query, gets a deterministic answer from the warehouse, and the model presents it. The number is as trustworthy as the underlying SQL and the source data. It is not subject to the failure mode where the model generates a plausible figure because something plausible was what the prompt called for.

The cost of this architecture is that the system cannot answer every conceivable question. If a rule isn’t defined, the answer isn’t available. This is often framed as a limitation. For a CFO, it is the feature. A system that says “I don’t have logic for this yet” is more useful than one that confidently makes something up. The first one is a colleague. The second one is a liability.

The trust threshold

The CFOs who will adopt AI most aggressively over the next two years are not the ones who become more confident in models. They are the ones who get clearer about which parts of the pipeline they’re willing to let a model touch. The trust gap closes not when the technology gets better, but when the architecture stops asking finance teams to trust the wrong things.

The Deloitte report did not fail because AI is bad. It failed because the parts of the work that required deterministic correctness, the citations and quotes and references to real cases, were produced by a system whose job is to generate plausible text. The system did its job. The mistake was using it for work where plausibility wasn’t the standard.

Finance leaders who get this right will end up with tools that automate the drudgery without inheriting the risk. They will run their close cycle faster, produce their CFO package automatically, generate their QBR pack without rebuilding the underlying analysis every quarter. When an auditor or a board member asks where a number came from, they will point at a rule, not a model.

That is the threshold. Nothing below it will hold. Everything above it changes how finance works.