Why I'm Betting on Generative UI


I’m joining CopilotKit as Director of Product. It simultaneously feels like a homecoming and an exciting new adventure.

I ran the Angular Portland meetup for five years while working as a fullstack developer building C#/.NET and JavaScript applications, designing APIs, and leading our team’s AngularJS-to-Angular migration at a renewable energy nonprofit. I built a 200-lesson course on that migration, created courses for egghead.io on Angular, React, and AWS, and was a Google Developer Expert for web technologies.

From there I went into developer advocacy at Auth0, where I scaled developer marketing from startup through acquisition. Most recently I spent two years at WRITER, going from building their first developer GTM motion to building developer experience from scratch to product managing their AI platform: public API, LLM Gateway migration, third-party model and guardrail integrations with Bedrock and Azure, SDK integrations with LangChain and AWS Strands, observability, enterprise AI governance.

CopilotKit builds the open-source infrastructure that connects AI agents to user-facing applications. The frontend frameworks I spent years working with were all solving the same core problem: how to keep the interface in sync with what’s happening on the backend, and how to give users meaningful control over the result. Generative UI is that same problem at a completely different scale. The backend is an LLM. The state being managed is an agent’s reasoning process. The interface needs to let a human review, approve, edit, and redirect that reasoning in real time.

I think fixing that interface layer is the most important unsolved problem in the agentic stack. Here’s why.

The third UI paradigm

Jakob Nielsen argues that generative AI is only the third user interface paradigm in computing history, after batch processing and command-based interaction. The user no longer tells the computer what to do. The user tells the computer what outcome they want. Nielsen Norman Group frames the design shift as moving from designing interfaces to designing outcomes.

You can see this in the products growing fastest. Claude Artifacts, Vercel v0, Bolt.new, Replit Agent: in all of these, the interface itself is generated by the model. You describe what you want, and the UI materializes. Bolt.new went from zero to $40M in annual revenue in five months. Replit went from $2.8M to $150M ARR in under a year. These products succeeded because they understood that the interface is the product. As Anshul Ramachandran of Codeium put it on Latent Space, the moat of ChatGPT was the user experience.

A Google Research study quantified the difference: participants preferred AI-generated interfaces over markdown chat responses 83% of the time.

Three patterns are converging under the generative UI label:

  • Static generative UI: the developer pre-registers UI components and the agent decides which to render with which data. This is the most production-ready pattern and the model behind CopilotKit’s useCopilotAction hook.
  • Declarative generative UI: the agent emits a structured JSON description of forms, tables, and multi-step flows, and the client renders them natively. Google’s A2UI spec works this way.
  • Open-ended generative UI: the agent generates arbitrary HTML, CSS, and JavaScript rendered in a sandboxed iframe. Claude Artifacts and Gemini’s dynamic views live here.
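To make the declarative pattern concrete, here is a minimal sketch in TypeScript. The `UISpec` shape is invented for illustration (it is not the actual A2UI schema); the point is that the agent emits structured data, not markup, and the client decides how to render it natively.

```typescript
// Hypothetical component spec -- NOT the real A2UI schema, just an
// illustration of the declarative pattern: the agent emits structured
// JSON and the client maps it to native UI.
type UISpec =
  | { kind: "form"; title: string; fields: { name: string; label: string }[] }
  | { kind: "table"; columns: string[]; rows: string[][] };

// Render a spec to plain HTML. A real client would map each node to a
// native component (React, Angular, etc.) instead of strings.
function render(spec: UISpec): string {
  switch (spec.kind) {
    case "form":
      return (
        `<form><h2>${spec.title}</h2>` +
        spec.fields
          .map((f) => `<label>${f.label}<input name="${f.name}"></label>`)
          .join("") +
        `</form>`
      );
    case "table":
      return (
        `<table><tr>${spec.columns.map((c) => `<th>${c}</th>`).join("")}</tr>` +
        spec.rows
          .map((r) => `<tr>${r.map((c) => `<td>${c}</td>`).join("")}</tr>`) +
        `</table>`
      );
  }
}

// The agent's response is just data the client can validate before rendering.
const approvalForm = render({
  kind: "form",
  title: "Approve supplier",
  fields: [{ name: "supplier", label: "Supplier name" }],
});
```

Because the spec is data, the client can validate it against a schema before rendering, which is what makes this pattern safer than letting the model emit raw HTML.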

What AG-UI actually is

If generative UI is the what, AG-UI is the how.

Three open protocols now define how agents connect to everything else:

  1. MCP (Anthropic) connects agents to tools and data.
  2. A2A (Google) handles agent-to-agent coordination.
  3. AG-UI (CopilotKit) connects agents to user-facing applications. Google, Microsoft, AWS, and Oracle have all adopted AG-UI with first-party integrations.

AG-UI is an event-driven, transport-agnostic streaming protocol. Any agent backend implements a single method (run(input) → Observable<BaseEvent>) and emits typed events over SSE, WebSockets, or HTTP. The spec defines roughly sixteen standard event types grouped into five families:

  • Lifecycle: RUN_STARTED, RUN_FINISHED, RUN_ERROR, STEP_STARTED, STEP_FINISHED
  • Text messages: TEXT_MESSAGE_START, TEXT_MESSAGE_CONTENT (streaming deltas), TEXT_MESSAGE_END
  • Tool calls: TOOL_CALL_START, TOOL_CALL_ARGS (streaming JSON deltas), TOOL_CALL_END, TOOL_CALL_RESULT
  • State: STATE_SNAPSHOT (full JSON), STATE_DELTA (JSON Patch per RFC 6902)
  • Special: RAW (passthrough), CUSTOM
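A sketch of what consuming these events looks like on the client. The field names (`type`, `messageId`, `delta`) are assumptions based on the event names above, not the exact AG-UI wire format, but they show the shape of the pattern: a discriminated union plus an ordinary reducer.

```typescript
// Discriminated union over a few of the event types listed above.
// Field names are illustrative, not the exact AG-UI wire format.
type AgentEvent =
  | { type: "TEXT_MESSAGE_START"; messageId: string }
  | { type: "TEXT_MESSAGE_CONTENT"; messageId: string; delta: string }
  | { type: "TEXT_MESSAGE_END"; messageId: string }
  | { type: "RUN_FINISHED" };

// Fold streaming deltas into complete messages -- an ordinary reducer,
// the same state-management pattern frontend developers already use.
function foldMessages(events: AgentEvent[]): Map<string, string> {
  const messages = new Map<string, string>();
  for (const e of events) {
    if (e.type === "TEXT_MESSAGE_START") messages.set(e.messageId, "");
    if (e.type === "TEXT_MESSAGE_CONTENT")
      messages.set(e.messageId, (messages.get(e.messageId) ?? "") + e.delta);
  }
  return messages;
}

const stream: AgentEvent[] = [
  { type: "TEXT_MESSAGE_START", messageId: "m1" },
  { type: "TEXT_MESSAGE_CONTENT", messageId: "m1", delta: "Hello, " },
  { type: "TEXT_MESSAGE_CONTENT", messageId: "m1", delta: "world." },
  { type: "TEXT_MESSAGE_END", messageId: "m1" },
  { type: "RUN_FINISHED" },
];
const folded = foldMessages(stream);
```

The discriminant field lets the type system enforce that every handler accounts for every event family, which is most of the reliability argument for typed events.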

The state sync model is what makes human-in-the-loop tractable at the protocol level. The agent opens with a full STATE_SNAPSHOT, streams incremental STATE_DELTA patches as state mutates, and re-snapshots periodically. The stream is bi-directional: the frontend can send user decisions, interface context, and cancellation signals back to the agent through the same channel.
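The snapshot-plus-delta model maps directly onto RFC 6902. Here is a minimal sketch of applying a delta to a snapshot, supporting only `add`/`replace` on object paths; a production client would use a complete JSON Patch library rather than this hand-rolled version.

```typescript
// A subset of RFC 6902: add/replace operations on simple object paths.
type PatchOp = { op: "add" | "replace"; path: string; value: unknown };

// Apply a STATE_DELTA (a list of patch operations) to a STATE_SNAPSHOT
// without mutating the original snapshot.
function applyDelta(snapshot: any, delta: PatchOp[]): any {
  const state = structuredClone(snapshot);
  for (const { path, value } of delta) {
    const keys = path.split("/").slice(1); // "/plan/step" -> ["plan", "step"]
    let node = state;
    for (const key of keys.slice(0, -1)) node = node[key];
    node[keys[keys.length - 1]] = value;
  }
  return state;
}

// Full snapshot from the agent, then an incremental delta as its plan
// advances to a step that needs human approval.
const snapshot = { plan: { step: 1, status: "drafting" } };
const next = applyDelta(snapshot, [
  { op: "replace", path: "/plan/step", value: 2 },
  { op: "replace", path: "/plan/status", value: "awaiting-approval" },
]);
```

Because deltas are small and snapshots periodically re-baseline, a client that joins mid-run or drops a packet can always recover by waiting for the next snapshot.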

The UI and the agent share one consistent state. Both sides read from and write to the same event stream, so state can’t diverge. When a user approves a step, edits a field, or rejects a suggestion, that interaction flows through the same typed events as the agent’s output. Framework developers (and I say this from experience) can build reliable agent interactions using patterns they already understand: observable streams, typed events, state management. The architecture feels familiar even when the backend is fundamentally different from anything we’ve connected to before.

The AG-UI repo has roughly 13,000 stars and 1,200 forks at the time I’m writing this, and community SDKs in Kotlin, Go, Java, Rust, Dart, and Ruby. Combined with CopilotKit, the protocol handles over 2 million agent-user interactions per week.

Why now

I’ve been watching this space for a long time. The reason I’m moving now is that several forces are converging at once. Any one of them would be interesting. Together they create a window.

Human oversight is a permanent layer

For a while it looked like models had hit a wall. GPT-5’s launch in August 2025 was underwhelming enough that OpenAI had to publicly apologize, and the plateau debate kicked off in earnest. OpenAI responded with rapid incremental updates (GPT-5.1, then 5.2 less than a month later under a “code red”), and the naming convention itself felt like an admission that the big leaps weren’t coming as fast.

Then November happened. Gemini 3 Pro and Claude Opus 4.5 launched within a week of each other and the plateau narrative broke. Nathan Lambert described Claude Code with Opus 4.5 as “a watershed moment, moving software creation from an artisanal, craftsman activity to a true industrial process.” Opus 4.6 followed in February with adaptive thinking and a 1M token context window. And this month, Claude Mythos scored 93.9% on SWE-bench Verified, a 13-point jump over the previous generation, and 77.8% on the contamination-resistant SWE-bench Pro.

Models are getting dramatically better again. And we still need human-in-the-loop.

Even at 93.9%, the model still fails about 6% of the time, and in enterprise production that 6% is where the expensive mistakes live. Sierra’s τ³-Bench tells a sharper story: voice agents hit 26-38% success rates on realistic scenarios. Berkeley’s 2026 audit found that multiple top benchmarks were exploitable to near-100% without solving a single task, which means even the impressive scores need to be taken with caution. And Anthropic explicitly won’t release Mythos broadly because it’s too capable in offensive cybersecurity. The most powerful model ever built can’t be deployed without new safeguards, which is itself an argument for human oversight infrastructure.

The need for HITL doesn’t shrink as models improve. It shifts from catching basic errors to governing increasingly consequential decisions. The approval checkpoints, correction interfaces, and decision gates that enable that governance are generative UI surfaces, and they sit on AG-UI.

RL training data is being repriced

The AI industry’s spending on training data tells you where the value is shifting. Scale AI grew from $870M to $2B in ARR and commands a $29B valuation after Meta took a 49% stake. Surge AI is targeting $15-25B. Datacurve raised a $15M Series A explicitly for SWE datasets built for RL environments. Scale launched RL Environments in February 2026 and reported in the announcement that nearly half of all new data-training projects now involve reinforcement learning.

The results justify the spending. DeepSeek-R1’s pure-RL training lifted its AIME math benchmark score from 15.6% to 71%. RL works, and the industry is pouring money into it.

The highest-value training signal comes from real users approving, rejecting, and editing real agent actions in real workflows. Labeling vendors sell synthetic or off-distribution preference data. An in-product HITL interface captures in-domain, on-distribution, task-specific interaction traces that those vendors can’t replicate. The UI where humans review agent work is also the UI that produces the training data for the next generation of agents.

The money flowing into adjacent layers tells you how seriously the industry is taking this. LangChain closed a $125M Series B at a $1.25B valuation with Sequoia, Benchmark, and IVP. Braintrust holds an $800M valuation. Temporal raised $300M at $5B. Observability, orchestration, durable execution, training data: all well-funded. The layer where the human feedback is actually produced, where the user says “yes,” “no,” or “let me fix that,” is still wide open.

The SaaSpocalypse and the enterprise gap

The previous two forces are technical. This one is market-driven: the entire enterprise software industry is being forced to rethink how humans interact with their products, and it’s happening fast. In February 2026, $285 billion of software market cap evaporated in 48 hours after Anthropic launched Claude Cowork. This week, Anthropic launched Claude Design and Figma’s stock fell 7% in a single session. Anthropic’s CPO had quietly resigned from Figma’s board three days before the announcement. Satya Nadella put it plainly: SaaS applications are essentially CRUD databases with business logic, and the business logic is all going to agents. One analyst described the SaaSpocalypse as “a $2 trillion pricing error, not an extinction event.”

The popular narrative says developers will vibe code replacements for existing enterprise software. I think this is wrong for anything with real complexity. Try vibe coding Workday. Try vibe coding Coupa, or ServiceNow, or a custom ERP with twenty years of business rules baked into it. These systems have compliance requirements, audit trails, role-based access, integrations with dozens of other systems, and domain logic that took decades to accumulate. A weekend with Bolt.new will not replace them (as great as that product is).

The smartest incumbents already know what’s coming. Recently at TrailblazerDX, Salesforce announced Headless 360: they’re exposing the entire Salesforce platform as APIs, MCP tools, and CLI commands so AI agents can operate the system without ever opening a browser. The release includes sixty new MCP tools, thirty preconfigured coding skills, and a new experience layer that renders rich, native interactions across surfaces from Slack to WhatsApp to voice.

Think about what that means for pricing. The per-seat model was always a proxy for “a human using the software.” When agents can operate the entire platform headlessly, the seat becomes meaningless as a billing unit. Salesforce has already pivoted Agentforce pricing three times in under a year: $2/conversation, then $0.10/action Flex Credits, then $125+/user AELA. Autonomous agents are dismantling the per-seat model that served as the bedrock of the entire SaaS industry. HubSpot launched outcome-based pricing at $1 per lead. Atlassian pivoted Rovo from $20-24/seat to free-in-bundle plus metered credits in six months.

But going headless doesn’t eliminate the need for an interface. It eliminates the need for the old interface. The human still needs to review what the agent did, approve the important decisions, and redirect when something is wrong. That review surface is generative UI. A procurement manager reviewing a supplier risk analysis from an AI agent needs a generated dashboard with the right filters and drill-downs for their specific question. A chat window with a wall of JSON won’t cut it.

This is the enterprise case for generative UI. The browser-based seat is dying, but the human oversight surface still needs to exist somewhere. That somewhere is a dynamic, task-specific interface generated on the fly from the data and actions the agent is working with. CopilotKit’s hooks (useCopilotReadable, useCopilotAction, useAgent) exist for exactly this: they let any existing application become agent-native by feeding real-time application state into the agent’s context and letting the agent render rich UI components back to the user. CX Today’s analysis of Headless 360 arrives at the same conclusion: the future of CRM interfaces is dynamic, generated, and decoupled from the traditional application shell.

Every enterprise software company is on this path. Microsoft Copilot has 15 million paid seats. Salesforce Agentforce hit $540M in standalone ARR. ServiceNow Now Assist is over $600M in ACV. They’re all rebuilding so agents can work inside them and humans can interact through generated interfaces. Gartner expects that by 2030, software companies that bolt AI on top of legacy applications will face margin compression of up to 80%.

Aaron Levie framed the transition as a bridge: if your company doesn’t make it to the other end, you’re out of business. Just as every company needed payments infrastructure but shouldn’t have built Stripe themselves, every company shipping agents needs this interface layer.

Why CopilotKit

I looked at this space for a while before deciding. A few things stood out about CopilotKit that I haven’t seen anywhere else.

Protocol position

CopilotKit authored AG-UI. Google ADK, Microsoft Agent Framework, AWS Bedrock AgentCore, and Oracle have adopted it with first-party integrations. On the agent backend side, CopilotKit integrates with LangGraph, CrewAI, Mastra, Pydantic AI, Agno, LlamaIndex, Google ADK, AWS Strands, and MCP servers. That coverage means CopilotKit sits at the boundary between agent runtime and end user for essentially every serious agent framework in production. Defining the open protocol for that boundary and shipping the production implementation creates a structural advantage that compounds with every new integration.

Traction

CopilotKit has over 30,000 GitHub stars and 170,000 weekly npm downloads, up from 7,000 weekly downloads seventeen months ago, with over 10% of the Fortune 500 in production including Cisco, Deloitte, DocuSign, and TripAdvisor. I built a complete working app with CopilotKit using Claude Code and the MCP server with two compound prompts and got zero build errors. The developer experience has a solid foundation that I can’t wait to expand into agent experience.

Multi-framework

CopilotKit started as a React library and recently brought Hashbrown into the fold, expanding beyond React. Angular will be a focus area, and that matters because enterprise development still relies heavily on Angular. An IDC/AWS survey found 97% of enterprises have not figured out how to scale agents across their organizations, and 55% of IT leaders said adoption would accelerate with better user experiences. I spent years deep in the Angular community (organizing the meetup, teaching migration courses, building Angular applications professionally) and I care about bringing these capabilities to those developers.

The commercial layer

The Enterprise Intelligence Platform includes persistence, A/B testing, observability, and self-improving agents through reinforcement learning on user interactions. Every thumbs-up, edit, and rejection through a CopilotKit-powered interface becomes a training signal. The company that builds the interface where humans review agent outputs captures the feedback loop that trains the next generation of agents. There’s a lot more to come on the enterprise product, so stay tuned.

What’s ahead

I’m incredibly grateful for my two years at WRITER. Building their developer GTM motion from scratch, shipping SDK integrations with LangChain and AWS Strands, product managing the public API and LLM Gateway migration, and shipping their observability platform taught me more about operating at scale than anything else in my career. I accomplished more in those two years than in the previous five. They’re an elite team and I’ll keep rooting for them.

At CopilotKit, I’ll be ensuring the developer experience is a first-class product and the red carpet into the enterprise platform, and shaping the product strategy that connects open-source adoption to commercial growth. Every stage of my career led to this intersection: the fullstack apps, the Angular migration courses, the developer advocacy, the API and SDK product management, the enterprise AI governance work. I know what developer adoption friction feels like from every angle.

If you’re building agent-powered products and wrestling with the interface layer, I’d love to hear what you’re running into. I’ll be writing more about generative UI, AG-UI, and agentic interfaces in the coming months.

