1. Project Overview
| Project | Information |
|---|---|
| GitHub | alibaba/page-agent |
| Official Demo/Documentation | https://alibaba.github.io/page-agent/ |
| Project positioning | An in-page JavaScript GUI agent that controls web interfaces with natural language |
| Open-source license | MIT |
| Primary language | TypeScript |
| npm package | 'page-agent' |
| Latest npm version | '1.10.0', checked on 2026-06-27 |
| Latest GitHub release | 'v1.10.0', released on 2026-06-15, checked on 2026-06-27 |
| GitHub popularity | About 20.3k stars and 1.75k forks, checked on 2026-06-27 |
| Key components | 'page-agent', '@page-agent/core', '@page-agent/page-controller', '@page-agent/llms', '@page-agent/ui', '@page-agent/mcp', Chrome Extension |
![[17-Temporary Attachments/Page-Agent/banner-light.png]]
2. One-sentence explanation
Page Agent can be understood as an "AI operator embedded in your web page". After the developer adds a script or npm package to the page and configures an LLM that supports tool calls, the user can enter natural-language instructions such as "help me open the settings and change the notification method", "fill in this reimbursement form", or "search for an order and export the results". The Agent then reads the page's DOM, plans the actions, and performs the clicks, text input, selections, scrolling, and so on.
It differs from traditional browser automation tools as follows:
| Dimension | Page Agent | Traditional browser automation / RPA / browser-use-style tools |
|---|---|---|
| Target users | Website developers, SaaS product teams | Automation script developers, crawler/Agent developers |
| Deployment mode | Runs embedded in the page, or as a Chrome extension | Usually runs outside the browser, on the server side, or in an automation runtime |
| Main purpose | Improve the product's user experience and turn the web application into a natural-language application | Automate tasks, collect data, and control the browser |
| Perception mode | Mainly DOM text and structure; does not rely on screenshots | May use the DOM, screenshots, multimodal models, or browser control protocols |
3. Key screenshots
Demo home page
![[17-Temporary Attachments/Page-Agent/page-agent-home.png]]
This screenshot works well in pre-sales materials to show how the product feels in use: the user enters a natural-language task on the current web page, and the in-page agent carries out the operation. The official selling points highlighted in the screenshot are also directly visible: a pure front-end solution, support for private models, no need for desensitization, and the MIT open-source license.
Chrome Extension
![[17-Temporary Attachments/Page-Agent/page-agent-chrome-extension.png]]
The Chrome Extension is an optional enhancement. PageAgent.js itself handles in-page automation; the extension additionally provides multi-page tasks, browser-level control, and the ability to start tasks from outside the browser.
Model configuration and support
![[17-Temporary Attachments/Page-Agent/page-agent-models.png]]
The official documentation emphasizes support for models that comply with the OpenAI API specification and support tool calls, covering both public-cloud and local/private deployment paths. In pre-sales conversations, check three things that directly affect results in deployment: whether the model supports tool calls reliably, whether its context length is sufficient, and whether an enterprise-side proxy is needed to forward LLM requests.
4. What does it mainly do?
| Capability | Description | Pre-sales value |
|---|---|---|
| Natural-language web page operation | The user enters a task and the Agent automatically clicks, types, selects, scrolls, and submits forms | Lowers the learning cost of complex systems and reduces training and customer-service load |
| DOM text understanding | Understands the interface through DOM structure and page text; does not rely on screenshots or multimodal models | Costs are more controllable; suits business systems, form systems, and admin consoles |
| AI Copilot embedded in pages | Integrated into web applications via CDN or npm | Lower cost of retrofitting existing SaaS or internal systems |
| Built-in UI panel | Shows task execution, progress, and an interaction panel | Makes product demos easier and closes the user-experience loop |
| Bring your own LLM | Supports OpenAI-compatible APIs, including Qwen, OpenAI, Claude, DeepSeek, Gemini, and local Ollama/LM Studio routes | Fits customers' existing model resources and private-deployment requirements |
| Data masking (desensitization) | Supports masking page content before it is sent to the model | Suits enterprise scenarios that are sensitive to privacy and compliance |
| Custom instructions and knowledge injection | Agent behavior can be constrained through system-level and page-level instructions | Business rules, operating procedures, and permission boundaries can be built into the Agent |
| Custom tools | Extends what the Agent can call | Connects to business APIs, validation logic, and audit actions |
| Chrome Extension | Supports multi-page, multi-tab, and browser-level control | Suits cross-system and cross-page processes, but requires broader security authorization |
| MCP Server (Beta) | Lets local Agent clients send browser tasks to Page Agent Ext through MCP | Suits connecting browser control to Claude Desktop, Copilot, enterprise Agent platforms, and similar |
5. Typical applicable scenarios
| Scenario | Customer pain points | How Page Agent helps |
|---|---|---|
| SaaS AI Copilot | The product has many features and complex pages that new users struggle with; most support questions are "how do I do this?" | Add a natural-language entry point to the page so the AI can walk users through the operation directly |
| ERP / CRM / OA / HR / financial systems | Forms are numerous, processes long, and fields many, so users easily fill in fields incorrectly or miss them | The user describes the goal; the Agent locates the fields, fills them in, and submits or asks for confirmation |
| Making admin consoles intelligent | Reworking the UI of a legacy system is costly, yet the experience still needs to improve | Add a Copilot layer through page embedding first, without immediately rebuilding the core business system |
| Customer-service bot upgrade | The bot can only answer "please click such-and-such button", and the user still has to do it | Combine the Q&A bot with Page Agent to move from "telling the user how" to "doing it for the user on the spot" |
| Product tutorials / onboarding | New features require training, screen recordings, and documentation upkeep | Let the AI demonstrate the full process live, for example "show how to submit a reimbursement request" |
| Accessible interaction | Elderly users, visually impaired users, and users with low digital skills find complex pages hard to use | Lower the barrier through natural language, voice assistants, screen readers, and similar entry points |
| Internal operations efficiency | Operators repeatedly search, filter, enter, and export in the admin console | Let the Agent perform controlled, semi-automated operations to cut repetitive clicks |
| Connecting an enterprise Agent platform to the browser | Existing enterprise Agents can answer questions and call APIs but cannot operate web GUIs | Bring browser tasks into the Agent toolchain through the Chrome Extension and MCP Server |
6. Scenarios where it is not a good fit
| Scenario | Reason |
|---|---|
| Large-scale web crawling, server-side automation | The official documentation states that Page Agent is for client-side web page enhancement, not a server-side automation tool |
| Pages that mainly depend on visual recognition of images, Canvas, WebGL, or SVG | Page Agent does not use multimodal models or screenshots and understands pages mainly through DOM text and structure |
| Processes that need drag-and-drop, hover, right-click menus, keyboard shortcuts, or coordinate-level control | The official limitations state that these interactions are not currently supported |
| Complex cross-origin or nested iframes | The official documentation only states support for same-origin, single-level iframes; cross-origin and nested iframes are outside that boundary |
| Fully automatic transactions or approvals under strict regulation or with high financial risk | It can serve as an assistive entry point, but confirmation, permissions, audit, risk control, and rollback mechanisms must be added |
| Legacy systems with poor page semantics | DOM structure, accessibility, and page semantics directly affect the Agent's success rate, so the pages need to be cleaned up first |
| Customer environments that cannot provide a model with reliable tool calling | The Agent depends on the model generating tool calls reliably; small models or models with weak tool calling usually perform poorly |
7. Architecture and Component Understanding
The official development guide shows that Page Agent is an npm workspaces monorepo. Its core packages can be understood as follows:
| Package/Module | Role |
|---|---|
| 'page-agent' | Main entry point, including the built-in UI panel, for application integration |
| '@page-agent/core' | Core Agent logic without UI, suited to custom UIs or programmatic calls |
| '@page-agent/llms' | LLM client that wraps OpenAI-compatible calls, retries, tool calls, etc. |
| '@page-agent/page-controller' | DOM manipulation, page structure extraction, and visual feedback, decoupled from the LLM |
| '@page-agent/ui' | UI capabilities such as the panel and internationalization |
| '@page-agent/mcp' | MCP Server, which lets local Agent clients control the browser through the extension |
| 'packages/extension' | Chrome Extension, built with WXT + React |
| 'packages/website' | Website, documentation, and development playground |
Simplified workflow
Relationship between the extension and MCP
In pre-sales terms: the basic version embeds an in-page AI operator in the customer's own web application. To control the browser across pages or tabs, or to let an external Agent drive it, you also need the Chrome Extension and the MCP Server.
8. How to use
Method 1: Quick trial via the demo CDN
The project provides a one-line script for technical evaluation:
China mirror:
Note: The official demo CDN uses a free test LLM API. It is suitable only for technical evaluation and R&D testing. Do not use it directly in production, and do not enter personally identifiable information or sensitive data.
Method 2: npm integration
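
    npm install page-agent

    // Main entry point, including the built-in UI panel
    import { PageAgent } from 'page-agent'

    const agent = new PageAgent({
      model: 'qwen3.5-plus',
      baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1', // OpenAI-compatible endpoint
      apiKey: 'YOUR_API_KEY', // for evaluation only; do not ship a real key in front-end code
      language: 'zh-CN',
    })

    // Natural-language instruction: "Click the login button"
    await agent.execute('点击登录按钮')
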
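The call above can reject when the model or the page misbehaves. The following is a minimal sketch, not part of Page Agent, of wrapping `execute` so that a failure becomes an explicit handover to the user. `TaskAgent`, `TaskOutcome`, and `runWithHandover` are hypothetical names introduced for illustration; only the `execute(task)` call comes from the snippet above, and its return value is treated as unknown.

```typescript
// Minimal shape of the agent used here. Only `execute` is taken from the
// integration snippet; the interface itself is an assumption for this sketch.
interface TaskAgent {
  execute(task: string): Promise<unknown>
}

type TaskOutcome =
  | { status: 'done'; task: string; result: unknown }
  | { status: 'handover'; task: string; reason: string }

// Runs one natural-language task and never throws: a failure is converted
// into a 'handover' outcome so the UI can ask the user to take over.
async function runWithHandover(agent: TaskAgent, task: string): Promise<TaskOutcome> {
  try {
    const result = await agent.execute(task)
    return { status: 'done', task, result }
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err)
    return { status: 'handover', task, reason }
  }
}
```

The application can render the `handover` outcome as a clear prompt and store both outcomes in its audit log.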
Method 3: Recommendations for production
When integrating into an enterprise web application, the official documentation advises against putting the real LLM API key in front-end code. A more reasonable approach is to forward LLM requests through an enterprise back-end proxy and to handle authentication, auditing, rate limiting, data masking, and model routing at the proxy layer.
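
    const agent = new PageAgent({
      baseURL: '/api/llm-proxy', // same-origin back-end proxy; no API key in the browser
      model: 'gpt-5.1',
      // Send the user's session cookies so the proxy can authenticate the request
      customFetch: (url, init) =>
        fetch(url, { ...init, credentials: 'include' }),
    })
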
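The proxy side of this setup can be sketched as a small, framework-agnostic function. This is an illustration under stated assumptions, not Page Agent code: it assumes the browser posts an OpenAI-style chat body and that the upstream exposes the OpenAI-style `/chat/completions` path. `ProxyConfig`, `UpstreamRequest`, and `buildUpstreamRequest` are hypothetical names. Authentication, auditing, and rate limiting are left to the surrounding server code.

```typescript
// Hypothetical server-side settings; none of these names come from Page Agent.
interface ProxyConfig {
  upstreamBaseURL: string // real OpenAI-compatible endpoint, known only to the server
  apiKey: string // real LLM key, never sent to the browser
  allowedModels: string[] // simple model allowlist / routing table
}

interface UpstreamRequest {
  url: string
  headers: Record<string, string>
  body: string
}

// Builds the request the proxy sends upstream. Client headers are not
// forwarded, so a key or cookie sent by the browser never reaches the LLM.
function buildUpstreamRequest(
  clientBody: { model?: unknown; [key: string]: unknown },
  cfg: ProxyConfig,
): UpstreamRequest {
  const model = clientBody.model
  if (typeof model !== 'string' || !cfg.allowedModels.includes(model)) {
    throw new Error(`model not allowed: ${String(model)}`)
  }
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${cfg.upstreamBaseURL.replace(/\/+$/, '')}/chat/completions`,
    headers: {
      'content-type': 'application/json',
      authorization: `Bearer ${cfg.apiKey}`,
    },
    body: JSON.stringify(clientBody),
  }
}
```

Keeping this step pure makes it easy to unit test; the route handler then only has to authenticate the session, call this function, and stream the upstream response back.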
Method 4: MCP Server (Beta)
If the customer already has a local Agent client, it can connect through '@page-agent/mcp':

    {
      "mcpServers": {
        "page-agent": {
          "command": "npx",
          "args": ["-y", "@page-agent/mcp"],
          "env": {
            "LLM_BASE_URL": "https://api.openai.com/v1",
            "LLM_API_KEY": "sk-xxx",
            "LLM_MODEL_NAME": "gpt-5.2"
          }
        }
      }
    }

The official documentation notes that MCP Server is still in Beta. It can be used for pre-sales demos, but a production rollout needs particular attention to security authorization, stability, and version compatibility.
9. Pre-sales talking points
For business audiences
| Customer concern | Suggested wording |
|---|---|
| Users cannot figure out complex systems | "Rather than building yet another Q&A bot, this lets the AI operate the system directly on the user's behalf, turning training documents and operating steps into executable interactions." |
| Form filling and approval processes are long | "It can compress 20 clicks and multi-field entry into one natural-language instruction, and it fits form-heavy scenarios such as ERP, CRM, OA, and expense reimbursement especially well." |
| We do not want major changes to the legacy system | "It is a page-embedded solution. You can add an intelligent operation layer on top of the existing web application and reduce rebuild cost." |
| Customer-service workload | "Traditional customer-service bots can only tell users what to do. Page Agent lets the bot go a step further and complete the operation for the user." |
| A domestic or private model is required | "It supports OpenAI-compatible interfaces and local model routes, so in principle it can connect to the customer's existing LLM platform, but tool calling, context length, and CORS/proxy configuration need to be verified." |
For technical audiences
| Technical question | Suggested answer |
|---|---|
| How to integrate | "The CDN gives a quick trial; for production, npm package integration is recommended, with LLM requests proxied through the back end." |
| Whether a browser plug-in is required | "Automation on the current page does not require a plug-in; cross-page, multi-tab, and external Agent calls require the Chrome Extension." |
| Does it rely on screenshots or visual models? | "It does not rely on screenshots or multimodal models and mainly reads the DOM text structure, so it places higher demands on semantic HTML and accessibility." |
| How is security handled? | "Security needs to be designed as a whole: operation whitelist, data masking, user confirmation, permission control, back-end proxy, and audit logs." |
| How to choose a model | "Prefer a model with reliable tool calling, fast responses, and sufficient context. Small models or models with weak tool calling are usually unsuitable for complex page operations." |
10. PoC Recommendations
Recommended PoC targets
Choose a process that the customer knows well, that has clear pain points, and whose risk is controllable, such as:
| PoC process | What it validates |
|---|---|
| Create a new customer in the CRM and complete the fields | Form filling, field location, and business-rule prompts |
| Submit a leave or reimbursement request in the OA system | Multi-step processes, confirmation actions, and user interaction |
| Search for orders in the admin console and export them | Search, filter, click, and confirmation before download |
| Product tutorial: "show how to configure a feature" | Teaching by doing and reduced training cost |
| Customer-service bot performs the operation | The experience upgrade from Q&A to execution |
PoC scope control
| Item | Recommendation |
|---|---|
| Page selection | Prioritize pages with clear DOM semantics, standard form structure, and stable interaction paths |
| Number of tasks | Start with 3-5 high-frequency processes; do not try to cover the whole system at the outset |
| Model selection | First use a model with strong tool calling to prove the result, then evaluate replacing it with a domestic or private model |
| Security boundary | High-risk actions such as submit, delete, transfer, and approve must require secondary confirmation |
| Data | Use masked test data, and do not enter sensitive information into the demo CDN |
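The security-boundary row above combines two checks: an operation whitelist and secondary confirmation for high-risk actions. A minimal sketch of that decision logic follows. All names (`ActionPolicy`, `PlannedAction`, `decide`) and the keyword-matching approach are assumptions for illustration; how such a check is wired into the Agent's action pipeline depends on the hooks the installed version exposes, which is not shown here.

```typescript
type Decision = 'allow' | 'confirm' | 'block'

// Hypothetical policy shape for a PoC; not a Page Agent API.
interface ActionPolicy {
  allowedActions: string[] // operation whitelist, e.g. ['click', 'input']
  highRiskKeywords: string[] // target labels that always need user confirmation
}

interface PlannedAction {
  type: string // what the agent wants to do, e.g. 'click'
  label: string // visible text of the target element
}

// Blocks anything outside the whitelist, asks for confirmation when the
// target label contains a high-risk keyword, and allows everything else.
function decide(action: PlannedAction, policy: ActionPolicy): Decision {
  if (!policy.allowedActions.includes(action.type)) return 'block'
  const label = action.label.toLowerCase()
  const risky = policy.highRiskKeywords.some((k) => label.includes(k.toLowerCase()))
  return risky ? 'confirm' : 'allow'
}
```

Keyword matching on labels is deliberately crude. In a real PoC the high-risk list would more likely be tied to specific element selectors or business APIs.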
Example acceptance metrics
| Metric | Proposed target |
|---|---|
| Task completion rate | The selected processes reach more than 80% before moving to the next round of optimization |
| Reduction in operation steps | More than 50% fewer steps than manual clicking and entry |
| Manual takeover | On failure or uncertainty, the Agent prompts clearly and hands control back to the user |
| Confirmation of risky actions | 100% of actions such as delete, submit, and approve require confirmation |
| Log traceability | Tasks, actions, results, failure reasons, and user confirmations can all be recorded |
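The quantitative targets in the table can be computed from PoC task logs. The sketch below assumes a hypothetical log record (`TaskLog`) kept by the PoC team; the field names are illustrative, and treating "more than 80%" and "more than 50%" as greater-than-or-equal thresholds is a simplification.

```typescript
// Hypothetical per-task log record for a PoC; not produced by Page Agent itself.
interface TaskLog {
  completed: boolean
  agentSteps: number // user-visible steps when using the Agent
  manualSteps: number // baseline clicks/entries when done by hand
  riskyActions: number // delete/submit/approve actions attempted
  confirmedRiskyActions: number // of those, how many the user confirmed
}

interface PocMetrics {
  completionRate: number
  stepReduction: number
  confirmationCoverage: number
}

function summarize(logs: TaskLog[]): PocMetrics {
  const sum = (pick: (l: TaskLog) => number) => logs.reduce((acc, l) => acc + pick(l), 0)
  const completed = logs.filter((l) => l.completed).length
  const manual = sum((l) => l.manualSteps)
  const risky = sum((l) => l.riskyActions)
  return {
    completionRate: logs.length === 0 ? 0 : completed / logs.length,
    stepReduction: manual === 0 ? 0 : 1 - sum((l) => l.agentSteps) / manual,
    // With no risky actions attempted, coverage is trivially complete.
    confirmationCoverage: risky === 0 ? 1 : sum((l) => l.confirmedRiskyActions) / risky,
  }
}

// Thresholds follow the table above (80% completion, 50% step reduction,
// 100% confirmation of risky actions).
function meetsTargets(m: PocMetrics): boolean {
  return m.completionRate >= 0.8 && m.stepReduction >= 0.5 && m.confirmationCoverage === 1
}
```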
11. Risks and Considerations
| Risk | Description | Recommendation |
|---|---|---|
| API key exposed in the front end | Configuring the real LLM key directly in front-end code risks leaking it | Production traffic must go through a back-end proxy |
| Page DOM quality affects results | Unclear semantics, buttons without text, and unstable dynamic elements lower the success rate | Improve page accessibility and semantics |
| Unstable model output | Malformed tool calls or unstable plans cause failures | Choose a model with strong tool calling and configure retries and error recovery |
| Sensitive data leaving the environment | Page content may be included in LLM requests | Data masking, field masking, a private model, or a private gateway |
| Unauthorized operations | The Agent may attempt actions the user did not intend to authorize | Permissions, whitelist, secondary confirmation, and audit |
| Broader cross-page permissions | The Chrome Extension holds more permissions, and abuse would create privacy risks | Token authorization, a trusted-application list, least privilege, and user-visible confirmation |
| Maturity of Beta capabilities | MCP Server is marked Beta | Use it for demos and internal PoCs first; evaluate carefully before production |
| Interaction boundaries | Drag, hover, right-click, complex visual operations, and similar are not supported | Avoid these interactions when selecting a process, or modify the page |
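For the sensitive-data row above, field masking can be illustrated with a small text filter applied before page content is placed in an LLM request. This is a standalone sketch under stated assumptions: `MaskRule`, `defaultRules`, and `maskPageText` are hypothetical names, the regular expressions are simplified examples rather than complete validators, and the sketch does not use or describe Page Agent's own masking mechanism.

```typescript
interface MaskRule {
  name: string
  pattern: RegExp // must carry the global flag so every match is replaced
  replacement: string
}

// Example rules only; real deployments need patterns agreed with compliance.
const defaultRules: MaskRule[] = [
  { name: 'email', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, replacement: '[EMAIL]' },
  { name: 'cn-mobile', pattern: /\b1[3-9]\d{9}\b/g, replacement: '[PHONE]' },
  { name: 'card-number', pattern: /\b\d{16,19}\b/g, replacement: '[CARD]' },
]

// Applies each rule in order and returns the masked text.
function maskPageText(text: string, rules: MaskRule[] = defaultRules): string {
  return rules.reduce((out, rule) => out.replace(rule.pattern, rule.replacement), text)
}
```

Pattern-based masking only catches what the rules anticipate, so it complements a private model or private gateway and does not replace them.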
12. Relationship to related solutions
| Category | Examples | Relationship with Page Agent |
|---|---|---|
| browser-use-style browser Agents | browser-use | Page Agent borrows its DOM processing and prompt ideas, but targets client-side web page enhancement |
| Traditional RPA | UiPath, Yingdao (ShadowBot), etc. | RPA leans toward cross-application process automation; Page Agent is better suited to embedding in web products to improve user experience |
| Multimodal browser Agents | Vision/screenshot models | Multimodal models can understand visual content but cost more and need more permissions; Page Agent is lighter but limited by DOM semantics |
| SaaS built-in Copilots | Salesforce Copilot, Microsoft Copilot, and similar | Page Agent can serve as the underlying GUI operation capability of a self-built SaaS Copilot |
| Enterprise Agent platforms | Dify, Coze, LangGraph, MCP clients, etc. | Browser GUI operation can be plugged into an existing Agent toolchain through MCP and the extension |
13. My Pre-Sales Judgment
The value of Page Agent is that it moves from "AI that answers questions" to "AI that performs operations", which suits B2B web systems especially well. The problem with many enterprise systems is not missing functionality but too much of it: paths are too deep and users do not know how to use them. Page Agent's page-embedded mode is well suited to wrapping complex operations behind a natural-language entry point and validating the real value of an AI Copilot at low retrofit cost.
Its best pre-sales entry point is not "automation that replaces people" but "helping users complete complex page operations". The proposal should emphasize that the Agent is controllable, auditable, subject to confirmation, and can be taken over by the user. Commitments to customers should also set clear boundaries: it is not good at visual recognition, it is not suited to large-scale server-side crawling, and it is not suited to executing high-risk actions without safeguards.
For the pre-sales demo, choose a high-frequency process the customer knows well, such as "Create Customer Profile", "Submit Reimbursement", "Configure Product Parameters", or "Search Orders and Generate Reports". In the demo, show the natural-language entry point first, then show how the Agent understands the page, operates step by step, and asks the user for confirmation where necessary. Business stakeholders grasp the value more easily this way than from a discussion of technical architecture alone.
14. Reusable customer Q&A
| Customer question | Suggested answer |
|---|---|
| Is this RPA? | "Not exactly. It is more like an AI operator embedded in a web application, focused on improving user experience and operating efficiency. If you want large-scale automation across systems and browsers, you still need to evaluate it together with the extension, MCP, or RPA capabilities." |
| Do you need to change the back end? | "Not for a quick trial. For production, it is recommended that the back end provide an LLM proxy, authentication, auditing, data masking, and model routing." |
| Can it be deployed privately? | "The project itself is open source under MIT, and the model side supports OpenAI-compatible interfaces and local runtime routes. The actual result of a private deployment depends on the customer model's tool-calling capability, context length, latency, and the complexity of the pages." |
| Can it operate any website? | "The basic PageAgent.js runs on the current page once the website has integrated it. The Chrome Extension can extend this to any web page and to multiple tabs, but the permission and security requirements are higher." |
| Will it perform unintended operations? | "It needs to be governed through an operation whitelist, system instructions, confirmation of risky actions, user takeover, and audit logs. A pre-sales PoC should not be designed to run completely unsupervised." |
| Can it recognize images or diagrams? | "It is based primarily on DOM text and structure and does not rely on screenshots or multimodal models. Images, Canvas, WebGL, and purely visual cues are not its strengths." |
| Which pages work best? | "Pages with good semantic HTML, clear button and field labels, a standardized form structure, and a stable process work best." |
15. Follow-up recommendations
- Select one real customer process for a 1-2 week PoC, prioritizing pages that are form-heavy, high-frequency, and controllable in risk.
- Review the page's DOM and accessibility quality, and add missing button text, form labels, and ARIA attributes.
- Decide the model route: public cloud, Qwen, an OpenAI-compatible gateway, local Ollama/LM Studio, or the customer's existing model platform.
- Design the security policy: back-end proxy, data masking, operation whitelist, secondary confirmation, audit logs, and exception takeover.
- If cross-page work or external Agent platforms are involved, evaluate the Chrome Extension and MCP Server as well.