Alibaba Page Agent - AI Navigation


Page Agent is Alibaba's open-source, in-page GUI agent that lets users operate web pages in natural language. Its core value is not as a traditional crawler or back-end RPA tool, but as a way to add "AI operator / AI Copilot" capabilities to web applications such as SaaS, ERP, CRM, and admin consoles. For pre-sales, it suits scenarios such as lowering the learning curve of complex systems, intelligent form filling, product tutorials, upgrading customer service bots from answering questions to operating on the user's behalf, and accessibility enhancement.

1. Project Overview

Project	Information
GitHub	alibaba/page-agent
Official Demo/Documentation	https://alibaba.github.io/page-agent/
Project Positioning	In-page JavaScript GUI agent that controls web interfaces with natural language
Open Source License	MIT
Primary Language	TypeScript
npm package	'page-agent'
Latest npm version	'1.10.0', check date: 2026-06-27
Latest GitHub Release	'v1.10.0', released on 2026-06-15, check date: 2026-06-27
GitHub Popularity	About 20.3k stars, 1.75k forks, check date: 2026-06-27
Key Components	'page-agent', '@page-agent/core', '@page-agent/page-controller', '@page-agent/llms', '@page-agent/ui', '@page-agent/mcp', Chrome Extension

![[17-Temporary Attachments/Page-Agent/banner-light.png]]

2. One-Sentence Explanation

Page Agent can be understood as an "AI operator embedded in your web page". After the developer adds a script or npm package to the page and configures an LLM that supports tool calls, users can enter natural-language instructions such as "help me open the settings and change the notification method", "fill in this reimbursement form", or "search for an order and export the results". The Agent reads the page's DOM, plans its actions, and carries out operations such as clicking, typing, selecting, and scrolling.

It differs from traditional browser automation tools in that:

Dimension	Page Agent	Traditional browser automation / RPA / browser-use-style tools
Target Users	Website developers, SaaS product teams	Automation script developers, crawler/Agent developers
Deployment Mode	Runs embedded in the page, or as a Chrome extension	Usually runs outside the browser, on the server side, or in an automation runtime
Main Purpose	Enhance the product's user experience and turn the web app into a natural-language application	Automate tasks, collect data, and control the browser
Perception Mode	Mainly DOM text and structure; does not rely on screenshots	May use the DOM, screenshots, multimodal models, or browser control protocols

3. Key Screenshots

Demo Home Page

![[17-Temporary Attachments/Page-Agent/page-agent-home.png]]

This screenshot works well in pre-sales materials to show what Page Agent feels like as a product: the user enters a natural-language task on the current page, and the in-page agent completes the operation for them. The selling points highlighted by the project are also visible in the screenshot: a pure front-end solution, support for private models, no need for desensitization, and the MIT open-source license.

Chrome Extension

![[17-Temporary Attachments/Page-Agent/page-agent-chrome-extension.png]]

The Chrome Extension is an optional enhancement. PageAgent.js itself handles in-page automation; the extension adds multi-page tasks, browser-level control, and the ability to start tasks from outside the browser.

Model configuration and support

![[17-Temporary Attachments/Page-Agent/page-agent-models.png]]

The official documentation emphasizes support for models that follow the OpenAI API specification and support tool calls, covering both public-cloud and local/private deployment paths. In pre-sales conversations, check three things that directly affect real-world results: whether the model makes tool calls reliably, whether its context length is sufficient, and whether an enterprise-side proxy is needed to forward LLM requests.

4. What Does It Mainly Do?

Capability	Description	Pre-Sales Value
Natural-language web operation	The user enters a task; the Agent automatically clicks, types, selects, scrolls, and submits forms	Lowers the learning cost of complex systems and reduces training and customer service load
DOM text understanding	Understands the interface through DOM structure and page text; does not rely on screenshots or multimodal models	More controllable cost; suits business systems, form systems, and admin consoles
AI Copilot embedded in pages	Integrated into web applications through CDN or npm	Lower cost of retrofitting existing SaaS or internal systems
Built-in UI Panel	Displays task execution, progress, and an interaction panel	Makes product demos easier and closes the user-experience loop
Bring-your-own LLM	Supports OpenAI-compatible APIs, including routes such as Qwen, OpenAI, Claude, DeepSeek, Gemini, and local Ollama/LM Studio	Fits customers' existing model resources and private-deployment requirements
Data masking	Supports masking page content before it is sent to the model	Suits enterprise scenarios that are sensitive to privacy and compliance
Custom instructions and knowledge injection	Agent behavior can be constrained through system-level and page-level instructions	Business rules, operating procedures, and permission boundaries can be built into the Agent
Custom Tools	Extends the set of tools the Agent can call	Connects to business APIs, validation logic, and audit actions
Chrome Extension	Supports multi-page, multi-tab, and browser-level control	Suits cross-system and cross-page processes, but requires higher security authorization
MCP Server Beta	Lets local Agent clients send browser tasks to Page Agent Ext through MCP	Suits connecting browser control to Claude Desktop, Copilot, enterprise Agent platforms, etc.
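
The data-masking row in the table can be pictured with a small sketch. This is not Page Agent's own masking API: the `MaskRule` shape, the `maskText` helper, and the two patterns are assumptions made up for illustration. It only shows the general idea of replacing sensitive substrings before page text is sent to a model.

```typescript
// Hypothetical masking rules for illustration; real rules depend on the customer's data.
interface MaskRule {
  name: string;
  pattern: RegExp;
  replacement: string;
}

const defaultRules: MaskRule[] = [
  { name: "email", pattern: /[\w.+-]+@[\w-]+(\.[\w-]+)+/g, replacement: "[EMAIL]" },
  // Runs of 11 or more digits, e.g. phone or card numbers.
  { name: "long-number", pattern: /\d{11,}/g, replacement: "[NUMBER]" },
];

// Apply every rule in order and return the masked text.
function maskText(text: string, rules: MaskRule[] = defaultRules): string {
  return rules.reduce((acc, rule) => acc.replace(rule.pattern, rule.replacement), text);
}

const masked = maskText("Contact alice@example.com, card 6222021234567890");
// masked === "Contact [EMAIL], card [NUMBER]"
```

In a PoC, rules like these would normally be agreed with the customer's compliance team rather than hard-coded.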

5. Typical applicable scenarios

Scenario	Customer Pain Points	How Page Agent Fits In
SaaS AI Copilot	The product has many features and complex pages that new users struggle with; frequent support questions are about "how do I do this"	Add a natural-language entry point to the page so the AI can walk users through the operation directly
ERP / CRM / OA / HR / financial systems	Many forms, long processes, and many fields; users easily fill fields in wrongly or miss them	The user describes the goal; the Agent locates the fields, fills them in, and submits or asks for confirmation
Intelligent upgrade of admin consoles	Reworking the UI of a legacy system is costly, yet there is a desire to improve the experience	Add a Copilot layer through page embedding first, without immediately rebuilding the core business system
Customer service bot upgrade	The bot can only answer "please click such-and-such button", and the user still has to do it themselves	Combine the Q&A bot with Page Agent to move from "telling the user how to do it" to "doing it for the user on the spot"
Product tutorials / onboarding	New features require training, screen recordings, and document maintenance after launch	Let the AI demonstrate the full process live, for example "show how to submit a reimbursement request"
Accessibility	Elderly users, visually impaired users, and users with low digital skills find complex pages hard to use	Lower the barrier through entry points such as natural language, voice assistants, and screen readers
Internal operations efficiency	Operators repeatedly search, filter, enter data, and export in the admin console	Let the Agent perform controlled, semi-automatic operations to cut repeated clicks
Connecting an enterprise Agent platform to the browser	The enterprise's existing Agents can answer questions and call APIs but cannot operate web GUIs	Bring browser tasks into the Agent toolchain through the MCP Server and the Chrome Extension

6. Unsuitable Scenarios

Scenario	Reason
Large-scale web crawling, server-side automation	The project states clearly that Page Agent is for client-side web page enhancement, not a server-side automation tool
Pages that rely mainly on visual recognition of images, Canvas, WebGL, or SVG	Page Agent does not use multimodal models or screenshots; it understands pages mainly from DOM text and structure
Processes that need drag-and-drop, hover, right-click menus, keyboard shortcuts, or coordinate-level control	The official limitations state that these interactions are not currently supported
Complex cross-origin or nested iframes	The official documentation only states support for same-origin, single-level iframes; cross-origin and nested iframes are outside its boundary
Fully automatic trading/approval under strict regulation and high financial risk	Can serve as an assisted entry point, but confirmation, permissions, audit, risk control, and rollback mechanisms must be added
Legacy systems with poor page semantics	DOM structure, accessibility, and page semantics directly affect the Agent's success rate; page cleanup must come first
Customer environments that cannot provide a model with reliable tool calls	The Agent depends on the model generating tool calls reliably; small models or models with weak tool-call ability usually perform poorly

7. Architecture and Component Understanding

The official development guide shows that Page Agent is an npm workspaces monorepo. The core packages can be understood as follows:

Package/Module	Role
'page-agent'	Main entry point, including the built-in UI Panel, for application integration
'@page-agent/core'	Core Agent logic without UI, suitable for custom UIs or programmatic calls
'@page-agent/llms'	LLM client that wraps OpenAI-compatible calls, retries, tool calls, etc.
'@page-agent/page-controller'	DOM manipulation, page structure extraction, and visual feedback; decoupled from the LLM
'@page-agent/ui'	UI capabilities such as the Panel and internationalization
'@page-agent/mcp'	MCP Server, which lets local Agent clients control browsers through the extension
'packages/extension'	Chrome Extension, built with WXT and React
'packages/website'	Website, documentation, and development playground

Simplified workflow

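flowchart
  U["User enters a natural-language task"] --> P["Page Agent UI / API"]
  P --> D["Read page DOM and semantic information"]
  D --> L["LLM plans the next action"]
  L --> T["Tool call: click / input / select / scroll / JS"]
  T --> W["Page executes the action and reports status"]
  W --> D
  W --> R["Result / user confirmation needed / failure reason"]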
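
The loop in the flowchart above (read the page, plan, act, observe, repeat) can be sketched in a few lines of TypeScript. Everything here is a simplified assumption for illustration: the `Action` and `PageState` types, `runTask`, and the scripted planner that stands in for the LLM are hypothetical and are not Page Agent's real API.

```typescript
// Hypothetical types for illustration; these are NOT Page Agent's real API.
type Action =
  | { tool: "click"; target: string }
  | { tool: "input"; target: string; text: string }
  | { tool: "done"; summary: string };

interface PageState {
  fields: Record<string, string>;
  clicked: string[];
}

// A planner maps the current page snapshot to the next action.
// In a real system this would be an LLM tool call; here it is a stub.
type Planner = (snapshot: string, task: string) => Action;

// Apply one action to the simulated page and return the new state.
function applyAction(state: PageState, action: Action): PageState {
  switch (action.tool) {
    case "click":
      return { ...state, clicked: [...state.clicked, action.target] };
    case "input":
      return { ...state, fields: { ...state.fields, [action.target]: action.text } };
    case "done":
      return state;
  }
}

// Observe, plan, act, and repeat until the planner reports completion
// or the step budget runs out.
function runTask(task: string, initial: PageState, plan: Planner, maxSteps = 10) {
  let state = initial;
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = plan(JSON.stringify(state), task);
    history.push(action);
    if (action.tool === "done") {
      return { status: "completed" as const, state, history };
    }
    state = applyAction(state, action);
  }
  return { status: "step_limit_reached" as const, state, history };
}

// Scripted stand-in for the LLM: fill the name field, click save, then finish.
const scriptedPlanner: Planner = (snap) => {
  const state = JSON.parse(snap) as PageState;
  if (!state.fields["name"]) return { tool: "input", target: "name", text: "Alice" };
  if (!state.clicked.includes("save")) return { tool: "click", target: "save" };
  return { tool: "done", summary: "Saved the form" };
};

const result = runTask("Fill in the name and save", { fields: {}, clicked: [] }, scriptedPlanner);
```

The step limit is the important design choice in this sketch: a planner that never reports completion ends with an explicit status instead of looping forever, which maps to the "failure reason" exit in the flowchart.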

Relationship between extension and MCP

flowchart TD
  A["Local Agent client"] --> M["Page Agent MCP Server"]
  M --> H["Chrome Extension Hub"]
  H --> E["Page Agent Ext"]
  E --> B["Browser tabs / multi-page tasks"]
  B --> J["PageAgent.js in-page automation"]

In pre-sales conversations, it can be explained as follows: the basic version embeds an in-page AI operator in the customer's own web application; to control the browser across pages or tabs, or to let an external Agent drive it, you need to add the Chrome Extension and the MCP Server.

8. How to Use

Method 1: Demo CDN quick trial

The project provides a one-line script for technical evaluation:

China Mirror:

Note: The official demo CDN uses a free test LLM API. It is suitable only for technical evaluation and R&D testing; do not use it directly in production, and do not enter personally identifiable information or sensitive data.

Method 2: npm integration

npm install page-agent

import { PageAgent } from 'page-agent'

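// Configure an OpenAI-compatible endpoint; the model must support tool calls.
const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY', // for local testing only; do not ship a real key in front-end code
  language: 'zh-CN',
})

// The instruction means "Click the login button".
await agent.execute('点击登录按钮')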

Method 3: Production recommendations

When integrating into an enterprise web application, the official documentation advises against putting a real LLM API key in front-end code. A more sensible approach is to forward LLM requests through an enterprise back-end proxy and handle authentication, auditing, rate limiting, data masking, and model routing at the proxy layer.

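// Point the agent at a same-origin back-end proxy instead of the LLM provider.
const agent = new PageAgent({
  baseURL: '/api/llm-proxy',
  model: 'gpt-5.1',
  // Send the user's session cookies so the proxy can authenticate the request.
  customFetch: (url, init) =>
    fetch(url, { ...init, credentials: 'include' }),
})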
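
On the server side, the proxy described above might do something like the following before forwarding a request. This is a minimal sketch under stated assumptions: the `ProxyConfig`, `IncomingRequest`, and `buildUpstreamRequest` names are hypothetical, the upstream is assumed to expose the OpenAI-style `/chat/completions` path, and the example host is a placeholder. It is not taken from Page Agent or from any particular gateway.

```typescript
// Hypothetical shapes for illustration; not part of Page Agent or any specific gateway.
interface ProxyConfig {
  upstreamBaseURL: string;
  apiKey: string; // stays on the server, never shipped to the browser
  allowedModels: string[];
}

interface IncomingRequest {
  userId: string | null; // resolved from the session cookie by the back end
  body: { model: string; messages: unknown[] };
}

type ProxyDecision =
  | { ok: true; url: string; headers: Record<string, string>; body: string }
  | { ok: false; status: number; reason: string };

// Authenticate, enforce the model allowlist, then attach the server-side key.
function buildUpstreamRequest(req: IncomingRequest, cfg: ProxyConfig): ProxyDecision {
  if (!req.userId) return { ok: false, status: 401, reason: "not authenticated" };
  if (!cfg.allowedModels.includes(req.body.model)) {
    return { ok: false, status: 400, reason: `model not allowed: ${req.body.model}` };
  }
  return {
    ok: true,
    url: `${cfg.upstreamBaseURL}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${cfg.apiKey}`,
    },
    body: JSON.stringify(req.body),
  };
}

const config: ProxyConfig = {
  upstreamBaseURL: "https://llm.example.internal/v1", // placeholder host
  apiKey: "server-side-secret",
  allowedModels: ["gpt-5.1"],
};

const decision = buildUpstreamRequest(
  { userId: "u-1001", body: { model: "gpt-5.1", messages: [] } },
  config,
);
const rejected = buildUpstreamRequest(
  { userId: null, body: { model: "gpt-5.1", messages: [] } },
  config,
);
```

Auditing, rate limiting, and masking would hook in at the same point, before the request leaves the enterprise network.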

Method 4: MCP Server Beta

If the customer already has a local Agent client, it can connect through '@page-agent/mcp':

{
  "mcpServers": {
    "page-agent": {
      "command": "npx",
      "args": ["-y", "@page-agent/mcp"],
      "env": {
        "LLM_BASE_URL": "https://api.openai.com/v1",
        "LLM_API_KEY": "sk-xxx",
        "LLM_MODEL_NAME": "gpt-5.2"
      }
    }
  }
}

The project notes that the MCP Server is still in Beta. It can be used for pre-sales demos, but production rollout requires special attention to security authorization, stability, and version compatibility.

9. Pre-Sales Talking Points

Business-oriented

Customer Concern	Recommended Wording
Users cannot figure out complex systems	"Rather than building yet another Q&A bot, it lets the AI operate the system directly for the user, turning training documents and operating steps into executable interactions."
Form filling / approval processes are very long	"It can compress 20 clicks and multi-field entry into a single natural-language instruction, which suits form-heavy scenarios such as ERP, CRM, OA, and expense reimbursement."
We don't want to make major changes to the legacy system	"It is a page-embedded solution. You can add an intelligent operation layer on top of the existing web application and reduce rebuild costs."
Customer service load	"Traditional customer service bots can only tell users what to do. Page Agent lets the bot go one step further and complete the operation for the user."
Requires a domestic or private model	"It supports OpenAI-compatible interfaces and local model routes, so in theory it can connect to the customer's existing LLM platform, but tool calls, context length, and CORS/proxy configuration need to be verified."

Technology-oriented

Technical Question	Recommended Answer
How is it integrated?	"The CDN allows a quick trial; for production, npm package integration is recommended, with LLM requests proxied through the back end."
Is a browser extension required?	"Automation within the current page does not need an extension; cross-page, multi-tab, and external Agent calls require the Chrome Extension."
Does it rely on screenshots/visual models?	"It does not rely on screenshots or multimodal models and works mainly from DOM text and structure, so it places higher demands on semantic HTML and accessibility."
How is security handled?	"It needs to be designed together with an operation whitelist, data masking, user confirmation, permission control, a back-end proxy, and audit logs."
How to choose a model?	"Prefer a model with reliable tool calls, fast responses, and sufficient context. Small models or models with weak tool-call ability are usually unsuitable for complex page operations."

10. PoC Recommendations

Recommended PoC Target

Choose a process that the customer knows well, that has clear pain points, and whose risk is controllable, such as:

PoC Process	Value Validated
Create a new customer in the CRM and complete its fields	Validates form filling, field location, and business-rule prompts
Submit a leave/reimbursement request in the OA system	Validates multi-step processes, confirmation actions, and user interaction
Search for orders in the admin console and export them	Validates search, filter, click, and confirm-before-download
Product tutorial: "demonstrate how to configure a feature"	Validates teaching by doing and reduced training cost
Customer service bot performs operations on the user's behalf	Validates the experience upgrade from Q&A to execution

PoC scope control

Item	Recommendation
Page selection	Prioritize pages with clear DOM semantics, standard form structure, and stable interaction paths
Number of tasks	Start with 3-5 high-frequency processes; do not try to cover the whole system at first
Model selection	First use a model with strong tool-call ability to prove the effect, then evaluate domestic/private model replacements
Security boundary	High-risk actions such as submit, delete, transfer, and approve must require a second confirmation
Data	Use masked test data and avoid entering sensitive information in the demo CDN
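
The second-confirmation rule for high-risk actions could be enforced with a small gate placed in front of action execution. The sketch below is an assumption for illustration only: the `PlannedAction` shape, the `HIGH_RISK_KINDS` list, and the `gate` function are hypothetical names, not Page Agent configuration options.

```typescript
// Hypothetical action shape and risk list, for illustration only.
interface PlannedAction {
  kind: string;   // e.g. "click", "input", "submit", "delete"
  target: string; // human-readable description of the element
}

const HIGH_RISK_KINDS = new Set(["submit", "delete", "transfer", "approve"]);

// Callback that asks the user; returns true if the user approves.
type Confirm = (action: PlannedAction) => boolean;

// Low-risk actions pass straight through; high-risk ones need user approval.
function gate(action: PlannedAction, confirm: Confirm): "allowed" | "blocked" {
  if (!HIGH_RISK_KINDS.has(action.kind)) return "allowed";
  return confirm(action) ? "allowed" : "blocked";
}

const declineAll: Confirm = () => false;
const typing = gate({ kind: "input", target: "customer name" }, declineAll);
const deleting = gate({ kind: "delete", target: "order 42" }, declineAll);
// typing === "allowed", deleting === "blocked"
```

In a real PoC the `confirm` callback would open a dialog in the page; keeping it as a parameter makes the rule easy to test without a browser.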

Example acceptance metrics

Metric	Proposed Target
Task completion rate	The selected processes reach 80% or more before moving to the next round of optimization
Reduction in operation steps	More than 50% fewer than manual clicking/entry
Manual takeover	On failure or uncertainty, the Agent prompts clearly and hands control back to the user
Risk action confirmation	100% of actions such as delete, submit, and approve require confirmation
Log traceability	Tasks, actions, results, failure reasons, and user confirmations can be recorded
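
The numeric targets above can be checked mechanically from a PoC task log. The following is a minimal sketch: the `TaskRun` record, its field names, and the sample data are hypothetical, and the thresholds simply restate the proposed targets from the table.

```typescript
// Hypothetical log record for one PoC task run; field names are illustrative.
interface TaskRun {
  completed: boolean;
  manualSteps: number;          // clicks/entries needed without the Agent
  assistedSteps: number;        // user actions needed with the Agent
  riskActions: number;          // delete/submit/approve actions attempted
  confirmedRiskActions: number; // of those, how many the user confirmed first
}

interface PocSummary {
  completionRate: number;
  stepReduction: number;
  confirmationCoverage: number;
}

function summarize(runs: TaskRun[]): PocSummary {
  if (runs.length === 0) throw new Error("no runs to summarize");
  const sum = (pick: (r: TaskRun) => number) => runs.reduce((acc, r) => acc + pick(r), 0);
  const completed = runs.filter((r) => r.completed).length;
  const manual = sum((r) => r.manualSteps);
  const risk = sum((r) => r.riskActions);
  return {
    completionRate: completed / runs.length,
    stepReduction: manual === 0 ? 0 : 1 - sum((r) => r.assistedSteps) / manual,
    confirmationCoverage: risk === 0 ? 1 : sum((r) => r.confirmedRiskActions) / risk,
  };
}

// Thresholds restate the table: at least 80% completion, more than 50% fewer
// steps, and every risk action confirmed.
function meetsTargets(s: PocSummary): boolean {
  return s.completionRate >= 0.8 && s.stepReduction > 0.5 && s.confirmationCoverage === 1;
}

// Made-up sample data: five runs, four of them completed.
const sampleRun: TaskRun = {
  completed: true, manualSteps: 20, assistedSteps: 5, riskActions: 1, confirmedRiskActions: 1,
};
const runs: TaskRun[] = [sampleRun, sampleRun, sampleRun, sampleRun, { ...sampleRun, completed: false }];
const summary = summarize(runs);
const passed = meetsTargets(summary);
```

The manual-takeover and log-traceability rows are qualitative and still need a human review of the recorded runs.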

11. Risks and Considerations

Risk	Description	Recommendation
API key exposed in the front end	Configuring a real LLM key directly in the front end risks leakage	Production must go through a back-end proxy
Page DOM quality affects results	Unclear semantics, buttons without text, and unstable dynamic elements lower the success rate	Improve page accessibility and semantics
Unstable model output	Malformed tool calls or unstable planning cause failures	Choose a model with strong tool-call ability and configure retries and error recovery
Sensitive data leaving the environment	Page content may be included in LLM requests	Data masking, field masking, private models, or a private gateway
Unauthorized operations	The Agent may attempt actions the user did not intend to authorize	Permissions, whitelist, second confirmation, and audit
Broader cross-page permissions	The Chrome Extension has more permissions; abuse would create privacy risks	Token authorization, trusted application list, least privilege, and user-visible confirmation
Beta capability maturity	The MCP Server is marked Beta	Use it first for demos and internal PoCs; evaluate carefully before production
Interaction boundary	Does not support dragging, hovering, right-clicking, complex visual operations, etc.	Avoid these interactions when selecting a process, or modify the page
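
For the unauthorized-operation risk, the whitelist and audit recommendations can be combined in one small policy object. This is a sketch under assumptions: `createPolicy`, `AuditEntry`, the action names, and the user ID are hypothetical and do not correspond to a Page Agent API.

```typescript
// Hypothetical policy helper for illustration; not part of Page Agent.
interface AuditEntry {
  time: string;
  user: string;
  action: string;
  target: string;
  allowed: boolean;
}

function createPolicy(allowedActions: string[]) {
  const allowed = new Set(allowedActions);
  const log: AuditEntry[] = [];
  return {
    // Returns true only for whitelisted actions and records every attempt.
    check(user: string, action: string, target: string): boolean {
      const ok = allowed.has(action);
      log.push({ time: new Date().toISOString(), user, action, target, allowed: ok });
      return ok;
    },
    // Read-only view of the audit trail.
    entries(): readonly AuditEntry[] {
      return log;
    },
  };
}

const policy = createPolicy(["click", "input", "select", "scroll"]);
const canClick = policy.check("u-1001", "click", "#search-button");
const canRunJs = policy.check("u-1001", "execute-js", "window");
// canClick === true, canRunJs === false; both attempts are in policy.entries()
```

Recording denied attempts as well as allowed ones is deliberate: the denied entries are often the most useful part of an audit review.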

12. Relationship with Related Solutions

Category	Representative	Relationship with Page Agent
browser-use-style browser Agents	browser-use	Page Agent borrows its DOM processing and prompt ideas, but targets client-side web page enhancement
Traditional RPA	UiPath, Shadow Knife, etc.	RPA leans toward cross-application process automation; Page Agent is better suited to embedding in web products to enhance user experience
Multimodal browser Agents	Vision/screenshot-based models	Multimodal models can understand visual content, but cost and permission requirements are higher; Page Agent is lighter but limited by DOM semantics
SaaS built-in Copilot	Salesforce Copilot, Microsoft Copilot, and similar	Page Agent can serve as the underlying GUI operation capability of a self-developed SaaS Copilot
Enterprise Agent platforms	Dify, Coze, LangGraph, MCP clients, etc.	Browser GUI operation can be connected to the existing Agent toolchain through MCP and the extension

13. My Pre-Sales Judgment

The value of Page Agent is that it moves from "AI that answers questions" to "AI that performs operations", which suits enterprise-facing (B-side) web systems especially well. The problem with many enterprise systems is not a lack of features but too many of them, buried in paths so deep that users do not know how to use them. Page Agent's page-embedded mode is well suited to wrapping complex operations in a natural-language entry point and validating the real value of an AI Copilot at low retrofit cost.

Its best pre-sales entry point is not "automation that replaces people" but "helping users complete complex page operations". The proposal should emphasize that the Agent is controllable, auditable, confirmable, and open to user takeover. Commitments to customers should also set clear boundaries: it is not good at visual recognition, not suited to large-scale server-side crawling, and not suited to executing high-risk actions without safeguards.

For the pre-sales demo, choose a high-frequency process the customer knows well, such as "Create Customer Profile", "Submit Reimbursement", "Configure Product Parameters", or "Search Orders and Generate Reports". In the demo, first show the natural-language entry point, then show how the Agent understands the page, operates step by step, and asks the user for confirmation when necessary. This helps business stakeholders grasp the value more easily than a discussion of technical architecture alone.

14. Reusable Customer Q&A


Customer Question	Suggested Answer
Is this RPA?	"Not exactly. It is more like an AI operator embedded in a web application, focused on improving user experience and operational efficiency. For large-scale automation across systems and browsers, it still needs to be evaluated together with the extension, MCP, or RPA capabilities."
Does the back end need to change?	"Not for a quick trial. For production, it is recommended that the back end provide an LLM proxy, authentication, auditing, data masking, and model routing."
Can it be deployed privately?	"The project itself is open source under MIT, and the model side supports OpenAI-compatible interfaces and local runtime routes. The actual results of a private deployment, however, depend on the customer model's tool-call ability, context length, latency, and page complexity."
Can it operate any website?	"The basic PageAgent.js runs in the current page only after the website integrates it. The Chrome Extension can extend it to arbitrary web pages and multiple tabs, but with higher permission and security requirements."
Will it act erratically?	"It needs to be governed by an operation whitelist, system instructions, confirmation of risky actions, user takeover, and audit logs. A pre-sales PoC should not be designed to run completely unsupervised."
Can it recognize images or charts?	"It works mainly from DOM text and structure and does not rely on screenshots or multimodal models. Images, Canvas, WebGL, and purely visual cues are not its strengths."
Which pages work best?	"Pages with good semantic HTML, clear button/field labels, standardized form structure, and stable processes work best."

15. Follow-up recommendations

Select one of the customer's real processes for a 1-2 week PoC, prioritizing pages that are form-heavy, frequently repeated, and of controllable risk.
Review the page's DOM and accessibility quality, and add missing button text, form labels, and ARIA attributes.
Decide the model route: public cloud, Qwen, an OpenAI-compatible gateway, local Ollama/LM Studio, or the customer's existing model platform.
Design the security policies: back-end proxy, data masking, operation whitelist, second confirmation, audit log, and exception takeover.
If cross-page tasks or external Agent platforms are involved, then also evaluate the Chrome Extension and MCP Server.
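
The DOM and accessibility review mentioned above (button text, form labels, ARIA attributes) can start from a simple check for interactive elements that have no readable name. The sketch below works on a simplified, hypothetical `ElementInfo` descriptor so that it runs without a browser; a real audit would walk the live DOM or use an accessibility testing tool.

```typescript
// Simplified, hypothetical element descriptor; a real audit would inspect the live DOM.
interface ElementInfo {
  id: string;
  tag: "button" | "input" | "select";
  text?: string;      // visible text content (buttons)
  ariaLabel?: string; // value of aria-label, if any
  labelText?: string; // text of an associated <label>, if any
}

const hasText = (s?: string): boolean => s !== undefined && s.trim() !== "";

// Returns the ids of elements that a DOM-text-based agent would struggle to identify.
function findUnlabeled(elements: ElementInfo[]): string[] {
  return elements
    .filter((el) => {
      const named =
        el.tag === "button"
          ? hasText(el.text) || hasText(el.ariaLabel)
          : hasText(el.labelText) || hasText(el.ariaLabel);
      return !named;
    })
    .map((el) => el.id);
}

const unlabeled = findUnlabeled([
  { id: "btn-save", tag: "button", text: "Save" },
  { id: "btn-icon", tag: "button", text: "  " },        // icon-only button, no aria-label
  { id: "customer", tag: "input", labelText: "Customer name" },
  { id: "amount", tag: "input" },                        // no label at all
  { id: "currency", tag: "select", ariaLabel: "Currency" },
]);
// unlabeled is ["btn-icon", "amount"]
```

The resulting list gives the front-end team a concrete backlog to fix before the PoC starts.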

16. References

- GitHub repository: alibaba/page-agent
- Page Agent Official Demo
- Official Document: Overview
- Official Document: Quick Start
- Official Document: Limitations
- Official Document: Chrome Extension
- Official Document: MCP Server Beta
- Official Document: Models
- npm: page-agent
- Release v1.10.0
- Terms and Privacy