Innovation Automation Is Here. Now Sponsors Must Raise the Evidence Standard.
The easiest story about AI and innovation is already everywhere.
AI helps teams generate ideas faster. It helps non-technical innovators build prototypes. It reduces dependency on developers. It can create mockups, landing pages, product flows, interview guides, business model options, research summaries, and coded demos in minutes.
That story is true. I use AI for parts of this work myself.
It is useful when I do not have immediate access to a broad range of experts. It helps me prepare interviews, sharpen assumptions, formulate hypotheses, and think through different user groups and their possible points of view before I enter the field.
In cases where only a few real interviews are possible, that preparation matters. You cannot afford to waste those conversations on vague questions and lazy assumptions.
But that is also where the line sits.
AI can prepare the work. It cannot replace the work.
A synthetic user can help you ask better questions. It can help you anticipate objections, test different framings, and expose weak spots in your own thinking. But it does not buy your product. It does not change its workflow. It does not fight for budget. It does not risk credibility in an internal meeting. It does not ignore your solution because the timing is wrong, the politics are difficult, or the switching effort is simply not worth it.
Real life is not made of artificial users.
This distinction sounds obvious until a polished AI-generated validation deck enters a steering committee. Synthetic feedback, prototype reactions, and research summaries start to look like evidence. If nobody asks where the findings came from, the room accepts what it sees and funds the project without any meaningful reduction in uncertainty.
That is the problem.
AI will not make innovation predictable. It will make weak evidence easier to produce, easier to package, and easier to mistake for progress.
Unless leaders change how they judge it.
The wrong question
A lot of current AI-in-innovation talk starts with the wrong question: how can we build faster?
That question is not useless. Speed matters. Cost matters. Access to technical skills matters. A team that can move from idea to prototype without waiting three weeks for internal resources is in a better position than one that cannot.
You can see this logic in how AI product development is now being taught and sold: AI user research, synthetic interviews, no-code prototyping, micro-experiments, and faster movement from idea to customer signal [1]. Wharton and many other business schools have also framed generative AI as a force that changes how organizations conceive, shape, and select ideas at scale [2].
But faster building can also make weak innovation work worse.
Teams can now move from vague idea to polished prototype before they understand the problem. They can generate evidence-shaped material before they define what evidence would actually change the decision. They can produce confidence faster than they can earn it.
That is not progress. It is acceleration without discipline.
The better question is not how fast a team can build. The better question is what becomes worth testing when the cost of testing drops.
That is where AI becomes strategically useful. Not because it can predict which innovation will win. It probably cannot. Innovation outcomes depend on user motivation, timing, trust, budget cycles, switching costs, procurement, internal politics, regulation, channel access, and the painful details of adoption.
You cannot prompt your way around that complexity.
But you can structure ignorance better.
Synthetic users are not customers
One tempting use of AI is to simulate users.
Create artificial personas. Let them react to a concept. Ask whether they would buy. Run synthetic interviews. Simulate objections before spending money on real discovery.
I understand the appeal because I have used this myself.
It can be useful, especially when the number of real interviews is limited. If you only get access to a few relevant people, preparation becomes critical. Synthetic interviews can help you form hypotheses, improve your questions, and enter real conversations with sharper assumptions.
But they must remain assumptions.
They are not validation.
This is where the misuse begins. A team runs synthetic interviews, collects plausible feedback, turns it into a clean summary, adds a few quotes, and presents it as if the market has spoken. The deck looks professional. The logic sounds coherent. The feedback fits the story. The project moves to the next gate.
But no real customer has changed anything.
No buyer has given time. No budget owner has engaged. No user has shared a painful workaround. No team has changed its workflow. No switching effort has appeared. No internal sponsor has taken a risk. No one has paid, committed, or reorganized around the problem.
A synthetic user can give a plausible answer. But plausibility does not reveal causation, and it certainly does not prove demand.
Demand shows up through real-world commitment. It shows up when someone takes a second meeting, shares internal constraints, introduces a colleague, accepts a pilot, reveals budget ownership, spends time on integration, or changes an existing routine. These signals are imperfect, but they touch reality. That is why they matter.
Real users also surprise you. They contradict themselves. They rationalize decisions after the fact. They say something is important and then do nothing.
They ignore the feature the team loves. They care about small frictions the team barely noticed. They protect status, avoid embarrassment, and follow habits.
They make decisions inside constraints that are not visible from the outside.
That is where qualitative discovery earns its place.
Not because interviews are magic. They are not. Interviews can be badly designed, badly interpreted, and easily abused. But good qualitative work can reveal causal mechanisms behind a struggle. It can help explain why someone behaves the way they do, what they are trying to avoid, what must be true for them to act, and where the real obstacle sits.
The current research on LLMs as substitutes for human participants should make innovation teams careful here. Recent work argues that LLMs are useful but unreliable tools for simulating human psychology, and that they need to be validated against human responses for every new application [3]. Another study found that LLMs failed to reproduce the full range of human behavioral variability in a cognitive task, even when prompts, model configurations, and sampling settings were varied [4].
Synthetic feedback can help prepare for real discovery. It should not replace it.
Prediction gets cheaper. Judgment does not.
Agrawal, Gans, and Goldfarb make a distinction that matters here: prediction helps estimate what may happen, while judgment is about deciding what matters when outcomes are uncertain [5][6]. AI lowers the cost of prediction. It does not replace judgment.
That distinction matters for innovation.
AI may help predict which message is clearer, which onboarding flow creates less friction, which segment resembles a known pattern, or which assumptions conflict with available data. It can help generate options and compare scenarios. It can help a team avoid obvious blind spots before spending scarce time with real customers.
But the harder questions remain human and strategic.
What would this signal mean? Which assumption is being tested? What would count as failure? Which evidence is strong enough to justify the next funding step? Which result should stop the project? Which uncertainty can we still afford to carry? Which one can we not?
AI lowers the cost of producing options. It does not lower the need for judgment.
In fact, it raises the premium on judgment because teams now have access to a larger volume of variants, signals, summaries, and polished artifacts. Without expertise, that volume becomes dangerous. A tool that creates value in the right place creates blind confidence in the wrong one.
That is why steering committees and boards need to become stricter, not less strict.
The key question is no longer only what the team learned. It is where the evidence came from.
Was it synthetic or real? Was it based on simulated users or actual customers? Was it opinion or observed action? Was it stated interest or demonstrated commitment? Did the signal involve time, budget, switching effort, workflow change, sponsor risk, or money?
That question is not a detail. It is governance.
Gates should reduce uncertainty, not reward activity
Innovation projects need gates, not because companies need extra bureaucracy, but because uncertainty has to be reduced before capital keeps flowing.
Each phase should have a return.
Not always revenue. Not yet. Early innovation phases rarely produce financial return in the usual sense. Their return is validated learning, better evidence, and a clearer decision about whether to continue, change direction, or stop.
A gate should therefore not ask only whether the team has been active. It should ask whether the team has reduced the right uncertainty.
This is where AI can either improve innovation governance or make it worse.
Used well, AI helps teams prepare stronger hypotheses, design better tests, compare variants, and make assumptions explicit before entering the real world. Used badly, AI helps teams produce convincing evidence-shaped material without touching reality.
The difference sits in how gates are managed.
If a team presents synthetic interviews, the steering committee should ask what they were used for. If the answer is preparation, good. If the answer is validation, the gate should not pass.
If a team presents positive feedback, the committee should ask whether it came from real customers and what kind of commitment was involved. If a team presents prototype reactions, the committee should ask whether anyone had to change anything, give anything, risk anything, or pay anything.
Without these questions, gates become performance rituals.
The team shows progress. The stakeholders see movement. The project gets another round of funding. Uncertainty remains mostly intact.
That is cash burn with better slides.
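The gate questions in this section can be written down as a simple check. The following is a minimal sketch, not a governance tool: the names `EvidenceItem`, `Source`, `Use`, `COMMITMENTS`, and `gate_objections` are hypothetical, and the two rules are only one possible reading of the questions above (synthetic material is acceptable as preparation but not as validation, and real feedback counts only when some commitment was involved).

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SYNTHETIC = "synthetic"  # simulated users, AI-generated feedback
    REAL = "real"            # actual customers, buyers, or users


class Use(Enum):
    PREPARATION = "preparation"
    VALIDATION = "validation"


# Assumed list of commitment types, taken from the questions in this article.
COMMITMENTS = frozenset(
    {"time", "budget", "switching_effort", "workflow_change", "sponsor_risk", "money"}
)


@dataclass(frozen=True)
class EvidenceItem:
    description: str
    source: Source
    claimed_use: Use
    commitments: frozenset = frozenset()


def gate_objections(items):
    """Return objections a committee might raise; an empty list means none were found."""
    objections = []
    for item in items:
        if item.claimed_use is not Use.VALIDATION:
            continue  # preparation material is fine, whatever its source
        if item.source is Source.SYNTHETIC:
            objections.append(
                f"{item.description}: synthetic evidence presented as validation"
            )
        elif not (item.commitments & COMMITMENTS):
            objections.append(
                f"{item.description}: real feedback, but no commitment involved"
            )
    return objections
```

A deck containing synthetic interviews labeled as validation would produce one objection, while the same interviews labeled as preparation would produce none. The value is not in the code itself but in forcing every item on the slide to declare its source and its claimed use.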
From prototype to disciplined evidence
The prototype has become the default artifact of innovation work.
Build the smallest version. Show it to users. Learn.
That logic still has value, but AI makes a different discipline possible. Instead of building one favored version, teams can create a small set of tests that compare competing assumptions before they commit to one path.
Not one landing page, but several versions, each tied to a different belief about the customer problem. Not one prototype, but several variants, each isolating a different assumption about urgency, workflow, switching pain, willingness to pay, or adoption friction. Not one interview guide, but different guides designed to test different explanations for the same struggle.
The point is not to create additional material. The point is to create cleaner comparison.
Before the test, the team defines the evidence parameters. What are we trying to learn? Which assumption are we exposing? What would make the signal valid? What would make it invalid? What could create a false positive? What could create a false negative?
This matters because weak evidence rarely arrives with a warning label. It usually looks useful.
A false positive might be a customer saying the concept is interesting while having no budget, no urgency, and no willingness to change anything. It might be a high click rate from the wrong audience. It might be a synthetic user confirming what the team hoped to hear.
A false negative can be just as dangerous. A rough prototype may fail because the workflow is unclear, not because the problem is unimportant. A buyer may reject a proposal because procurement timing is wrong, not because the need is weak. A user may struggle to explain the problem because it is embedded in routines they no longer notice.
Evidence discipline means looking at both risks.
It means not accepting positive signals too quickly and not killing real opportunities for the wrong reason.
AI can help with this. Research on AI and Lean Startup methods makes a useful distinction between discovery-oriented AI, which helps reduce uncertainty in novel areas, and optimization-oriented AI, which improves existing processes [7]. The goal is not only to produce an MVP faster. The goal is to understand which uncertainty the next test is supposed to reduce.
AI can generate test variants, identify assumption types, draft interview guides, compare feedback patterns, and suggest what to test next. It can help a team prepare better before using scarce customer access.
But the decision logic still belongs to the team.
Especially the kill logic.
Without kill logic, AI will not improve innovation. It will industrialize confirmation bias.
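One way to make kill logic concrete is to fix the evidence parameters before the test runs and compare the result against them afterwards. The sketch below is an illustration under stated assumptions: `ExperimentPlan`, `verdict`, the field names, and the four verdict labels are all hypothetical, and real thresholds would have to come from the team and its sponsors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentPlan:
    assumption: str            # which assumption the test exposes
    signal: str                # what is being counted, e.g. pilots accepted
    continue_at: int           # minimum count that supports the assumption
    kill_below: int            # count under which the assumption is treated as refuted
    min_qualified_sample: int  # guards against verdicts from too few or wrong-audience participants

    def __post_init__(self):
        if self.kill_below > self.continue_at:
            raise ValueError("kill_below must not exceed continue_at")


def verdict(plan, qualified_sample, observed):
    """Compare an observed result against thresholds fixed before the test ran."""
    if qualified_sample < plan.min_qualified_sample:
        return "inconclusive"  # not enough qualified participants to read the signal
    if observed >= plan.continue_at:
        return "continue"
    if observed < plan.kill_below:
        return "kill"
    return "rework"  # ambiguous zone: revise the test or the assumption
```

For example, with a plan of `continue_at=3`, `kill_below=1`, and `min_qualified_sample=8`, ten qualified conversations that produce zero accepted pilots return "kill", while four acceptances from only five qualified conversations return "inconclusive". The numbers are invented for illustration. What matters is that they are written down before the results arrive.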
Evidence theater at scale
Lower experiment cost sounds good.
It is good.
But it creates a new problem.
When so-called evidence becomes cheap to produce, companies may produce evidence-shaped material without improving decision quality.
More dashboards. More synthetic interviews. More AI-generated research summaries. More prototype reactions. More validation decks. More confident narratives.
Before AI, weak evidence at least required effort.
Now it can be produced at scale.
That changes the leadership task.
The scarce resource is no longer only research budget, design capacity, or engineering time. The scarce resource becomes evidence discipline.
This is where I think steering committees and boards need to be much more demanding. Not anti-AI. Not conservative. Not hostile to experimentation. Just more precise about what counts as evidence at each gate.
Synthetic evidence can justify preparation. It can justify better fieldwork. It can justify sharper hypotheses. It can justify exploring a segment. It can justify improving the prototype before exposing it to real customers.
It should not justify validation.
And it should not justify the next major funding step without real-world evidence.
If a team wants more funding, it needs to show what uncertainty was reduced in the real world. If the project remains uncertain, that is fine. Innovation is uncertain by nature. But the uncertainty should become more explicit, not hidden behind polished artifacts.
A good gate does not ask for certainty. It asks for better uncertainty.
The real shift
AI will not make innovation predictable.
The work is too contextual, too causal, and too dependent on human choices, timing, incentives, trust, budgets, politics, and internal constraints.
But AI can make ignorance cheaper to structure.
It can help teams move from one favored idea to a set of testable assumptions. From polished concepts to controlled variation. From vague conviction to sharper evidence standards before they spend time with real customers.
But only if leaders ask harder questions.
What decision will this evidence change? Which assumption is being tested? What would count as disconfirming evidence? Are we looking at opinions or action? Where did real customers enter the process? Where did real money, time, switching effort, workflow change, or internal sponsor risk show up?
These are not methodological details.
They are funding questions.
The risk is not that AI will make innovation too experimental. The risk is that it will make weak evidence look professional enough to pass the next gate.
That standard will not raise itself. The people controlling the capital have to set it.
Sources
[2] Wharton Executive Education: Supercharging Innovation with AI
[3] arXiv: Large Language Models Do Not Simulate Human Psychology
[4] arXiv: Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
[6] NBER: Prediction, Judgment and Complexity: A Theory of Decision Making and Artificial Intelligence
[7] arXiv: Artificial Intelligence, Lean Startup Method, and Product Innovations



