Page Agent is Alibaba's open-source, in-page GUI agent that lets users operate web pages in natural language. Its core value is not to act as a traditional crawler or back-end RPA tool, but to add "AI operator / AI Copilot" capabilities to web applications such as SaaS, ERP, CRM, and admin back ends. In pre-sales, it suits scenarios such as lowering the barrier to entry of complex systems, intelligent form filling, product teaching, upgrading customer service bots from answering questions to operating on the user's behalf, and enhancing accessible interaction.

1. Project Overview

| Item | Information |
| --- | --- |
| GitHub | alibaba/page-agent |
| Official Demo/Documentation | https://alibaba.github.io/page-agent/ |
| Project positioning | JavaScript in-page GUI agent that controls the web interface with natural language |
| Open-source license | MIT |
| Primary language | TypeScript |
| npm package | `page-agent` |
| Latest npm version | `1.10.0`, check date: 2026-06-27 |
| Latest GitHub Release | `v1.10.0`, released on 2026-06-15, check date: 2026-06-27 |
| GitHub popularity | About 20.3k stars, 1.75k forks, check date: 2026-06-27 |
| Key components | `page-agent`, `@page-agent/core`, `@page-agent/page-controller`, `@page-agent/llms`, `@page-agent/ui`, `@page-agent/mcp`, Chrome Extension |

![[17-Temporary Attachments/Page-Agent/banner-light.png]]

2. One-Sentence Explanation

Page Agent can be understood as an "AI operator embedded in your web page". After the developer adds a JS script or npm package to the page and configures an LLM that supports tool calls, the user can enter natural language instructions such as "help me open the settings and modify the notification method", "fill in this reimbursement form", or "search for an order and export the results". The agent then reads the page's DOM, plans the actions, and carries out the clicking, typing, selecting, scrolling, and so on.

It differs from traditional browser automation tools in that:

| Dimension | Page Agent | Traditional browser automation / RPA / browser-use style tools |
| --- | --- | --- |
| Target users | Website developers, SaaS product teams | Automation script developers, crawler/agent developers |
| Deployment mode | Runs embedded in the page, or as a Chrome extension | Usually executed outside the browser, on the server side, or in an automation runtime |
| Main purpose | Enhance the product's user experience and turn the web app into a natural language application | Automate tasks, collect data, and control the browser |
| Perception mode | Mainly based on DOM text and structure; does not rely on screenshots | May use the DOM, screenshots, multimodal models, or browser control protocols |

3. Key Screenshots

Demo Home Page

![[17-Temporary Attachments/Page-Agent/page-agent-home.png]]

This screenshot works well in pre-sales materials to convey what Page Agent feels like as a product: the user enters a natural language task in the current web page, and the in-page agent completes the operation for them. The official selling points highlighted in the screenshot are also directly visible: pure front-end scheme, support for private models, no need for desensitization, and MIT open source.

Chrome Extension

![[17-Temporary Attachments/Page-Agent/page-agent-chrome-extension.png]]

The Chrome Extension is an optional enhancement. PageAgent.js itself handles in-page automation; the extension additionally provides multi-page tasks, browser-level control, and the ability to initiate tasks from outside the browser.

Model configuration and support

![[17-Temporary Attachments/Page-Agent/page-agent-models.png]]

The official documentation emphasizes support for models that comply with the OpenAI API specification and support tool calls, covering both public cloud and local/private deployment paths. In pre-sales conversations, check three things, because each directly affects real-world results: whether the model supports stable tool calls, whether its context length is sufficient, and whether LLM requests need to be forwarded through an enterprise-side proxy.
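
One way to probe "stable tool calls" early in an evaluation is to inspect what a candidate model returns from its OpenAI-compatible chat completions endpoint. The sketch below assumes the standard OpenAI response shape (`choices[].message.tool_calls[].function`, with `arguments` as a JSON string). The helper name `extractToolCalls` and the tool names are hypothetical, and this is not Page Agent's internal code.

```typescript
// Minimal shape of an OpenAI-compatible chat completion response (only the fields used here).
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON-encoded string
}
interface ChatCompletion {
  choices: { message: { content: string | null; tool_calls?: ToolCall[] } }[];
}

interface ParsedCall {
  name: string;
  args: Record<string, unknown>;
}

// Hypothetical helper: returns the parsed tool calls, or throws if the model
// produced none or produced malformed ones.
function extractToolCalls(res: ChatCompletion, allowedTools: string[]): ParsedCall[] {
  const calls = res.choices[0]?.message.tool_calls ?? [];
  if (calls.length === 0) throw new Error("model answered without a tool call");
  return calls.map((call) => {
    if (!allowedTools.includes(call.function.name)) {
      throw new Error(`unknown tool: ${call.function.name}`);
    }
    let args: unknown;
    try {
      args = JSON.parse(call.function.arguments);
    } catch {
      throw new Error(`arguments for ${call.function.name} are not valid JSON`);
    }
    if (typeof args !== "object" || args === null || Array.isArray(args)) {
      throw new Error(`arguments for ${call.function.name} must be a JSON object`);
    }
    return { name: call.function.name, args: args as Record<string, unknown> };
  });
}
```

Running a helper like this over a few dozen recorded responses gives a rough failure rate for a model before any page integration work starts.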

4. What does it mainly do?

| Capability | Description | Pre-sales value |
| --- | --- | --- |
| Natural language operation of web pages | The user enters a task; the agent automatically clicks, types, selects, scrolls, and submits forms | Reduces the learning cost of complex systems and eases training and customer service pressure |
| DOM text understanding | Understands the interface through the DOM structure and page text; does not rely on screenshots or multimodal models | Cost is more controllable; suits business systems, form systems, and admin back ends |
| AI Copilot embedded in pages | Integrated into web applications through CDN or npm | Lowers the cost of retrofitting existing SaaS or internal systems |
| Built-in UI panel | Can display task execution, progress, and an interaction panel | Makes product demos easier and closes the user experience loop |
| Bring your own LLM | Supports OpenAI-compatible APIs, including Qwen, OpenAI, Claude, DeepSeek, Gemini, local Ollama/LM Studio, and other routes | Adapts to the customer's existing model resources and private deployment needs |
| Data masking | Supports masking page content before sending it to the model | Suits enterprise scenarios that are sensitive to privacy and compliance |
| Custom instructions and knowledge injection | Agent behavior can be constrained through system-level/page-level instructions | Business rules, operating procedures, and permission boundaries can be built into the agent |
| Custom tools | Extends what the agent can call | Connects to business APIs, validation logic, and audit actions |
| Chrome Extension | Supports multi-page, multi-tab, and browser-level control | Suits cross-system, cross-page processes, but requires stricter security authorization |
| MCP Server (Beta) | Lets local agent clients initiate browser tasks to Page Agent Ext through MCP | Suits connecting browser control capabilities to Claude Desktop, Copilot, enterprise agent platforms, etc. |
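
To make the data masking (desensitization) capability concrete, the following is a minimal sketch of masking page text before it is placed in an LLM request. The rule list, the regular expressions (email addresses and 11-digit mainland China mobile numbers), and the function name `maskSensitiveText` are illustrative assumptions, not Page Agent's built-in masking API.

```typescript
// Illustrative masking rules; real deployments would define these per business field.
const MASK_RULES: { label: string; pattern: RegExp }[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "PHONE", pattern: /\b1[3-9]\d{9}\b/g }, // 11-digit mainland China mobile numbers
];

// Replaces every match with a placeholder such as [EMAIL], so the raw value
// is not included in the text that is sent to the model.
function maskSensitiveText(text: string): string {
  return MASK_RULES.reduce(
    (masked, rule) => masked.replace(rule.pattern, `[${rule.label}]`),
    text,
  );
}
```

Under these rules, `maskSensitiveText("Contact: li@example.com, 13812345678")` yields `"Contact: [EMAIL], [PHONE]"`. Pattern-based masking only catches formats it knows about, so field-level masking driven by the page's own data model is usually the safer complement.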

5. Typical applicable scenarios

| Scenario | Customer pain points | How Page Agent fits in |
| --- | --- | --- |
| SaaS AI Copilot | The product has many functions and complex pages that new users cannot figure out; most support requests are about "how do I do this" | Add a natural language entry point to the page so the AI can take users through the operation directly |
| ERP / CRM / OA / HR / financial systems | Many forms, long processes, and many fields, so users easily fill in wrong values or miss fields | The user describes the goal; the agent locates the fields, fills them in, and submits or prompts for confirmation |
| Making admin back ends intelligent | Rebuilding the UI of a legacy system is costly, but the customer still wants a better experience | Add a Copilot layer through page embedding first, without having to rebuild the core business system immediately |
| Customer service bot upgrade | The bot can only answer "please click such-and-such button", and the user still has to do it themselves | Combine the Q&A bot with Page Agent, upgrading from "telling the user how to do it" to "doing it for the user on the spot" |
| Product teaching / onboarding | New features require training, screen recordings, and documentation upkeep after launch | Let the AI demonstrate the complete process live, e.g. "demonstrate how to submit a reimbursement application" |
| Accessible interaction | Elderly users, visually impaired users, and users with low digital skills struggle with complex pages | Lower the barrier through natural language, voice assistants, screen readers, and other entry points |
| Internal operations efficiency | Operators repeatedly search, filter, enter, and export data in the admin back end | Let the agent carry out controlled, semi-automatic operations to cut repeated clicks |
| Connecting an enterprise agent platform to the browser | Existing enterprise agents can answer questions and call APIs but cannot operate web GUIs | Bring browser tasks into the agent tool chain through the Chrome Extension and MCP Server |

6. Unsuitable scenarios

| Scenario | Reason |
| --- | --- |
| Large-scale web crawling, server-side automation | The project explicitly positions Page Agent as client-side web page enhancement, not a server-side automation tool |
| Pages that mainly rely on visual recognition of images, Canvas, WebGL, or SVG | Page Agent does not use multimodal models or screenshots; it understands pages mainly from DOM text and structure |
| Processes that require drag-and-drop, hover, right-click menus, keyboard shortcuts, or coordinate-level control | The official limitations state that these interactions are not currently supported |
| Complex cross-origin iframes and nested iframes | The official documentation only states support for same-origin, single-level iframes; cross-origin and nested iframes are outside that boundary |
| Fully automatic transactions/approvals under strict regulation or with high financial risk | It can serve as an assisting entry point, but confirmation, permissions, auditing, risk control, and rollback mechanisms must be added |
| Legacy systems with poor page semantics | DOM structure, accessibility, and page semantics directly affect the agent's success rate, so the pages need to be cleaned up first |
| Customer environments that cannot provide a model with stable tool calls | The agent relies on the model to generate tool calls reliably; small models or models with weak tool-call capabilities usually perform poorly |

7. Architecture and Component Understanding

The official development guide shows that Page Agent is an npm workspaces monorepo, and its core packages can be understood as follows:

| Package/Module | Role |
| --- | --- |
| `page-agent` | Main entry point, including the built-in UI panel, for application integration |
| `@page-agent/core` | Core agent logic without UI, suitable for custom UIs or programmatic calls |
| `@page-agent/llms` | LLM client, which encapsulates OpenAI-compatible calls, retries, tool calls, etc. |
| `@page-agent/page-controller` | DOM manipulation, page structure extraction, and visual feedback, decoupled from the LLM |
| `@page-agent/ui` | UI capabilities such as the panel and internationalization |
| `@page-agent/mcp` | MCP Server, which allows local agent clients to control browsers through the extension |
| `packages/extension` | Chrome Extension, based on WXT + React |
| `packages/website` | Website, documentation, and development playground |

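Simplified workflow

flowchart TD
    U["User enters a natural language task"] --> P["Page Agent UI / API"]
    P --> D["Read page DOM and semantic information"]
    D --> L["LLM plans the next action"]
    L --> T["Tool call: click / input / select / scroll / JS"]
    T --> W["Page executes the action and reports state"]
    W --> D
    W --> R["Result: done / needs user confirmation / failure reason"]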
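
The loop in this flowchart can be sketched as observe, plan, act, repeated until the task is done or a step budget runs out. Everything here (`Action`, `Page`, `Planner`, `runAgentLoop`) is hypothetical naming for illustration, and the planner is a synchronous stub for readability, whereas a real LLM call would be asynchronous. It is not Page Agent's actual implementation.

```typescript
type Action =
  | { kind: "click"; target: string }
  | { kind: "input"; target: string; text: string }
  | { kind: "done"; summary: string };

interface Page {
  observe(): string;           // simplified DOM text snapshot
  apply(action: Action): void; // executes a click or input on the page
}

// Stands in for the LLM: maps task + current observation + history to the next action.
type Planner = (task: string, observation: string, history: Action[]) => Action;

// Repeats observe -> plan -> act until the planner reports completion
// or the step budget is exhausted (at which point the user should take over).
function runAgentLoop(task: string, page: Page, plan: Planner, maxSteps = 10) {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = plan(task, page.observe(), history);
    history.push(action);
    if (action.kind === "done") return { ok: true, summary: action.summary, history };
    page.apply(action);
  }
  return { ok: false, summary: "step budget exhausted; hand over to the user", history };
}
```

The step budget matters in practice: without it, a model that keeps proposing the same ineffective click would loop indefinitely instead of returning control to the user.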

Relationship between extension and MCP

flowchart TD
    A["Local agent client"] --> M["Page Agent MCP Server"]
    M --> H["Chrome extension Hub"]
    H --> E["Page Agent Ext"]
    E --> B["Browser tabs / multi-page tasks"]
    B --> J["PageAgent.js in-page automation"]

In pre-sales, this can be explained as follows: the basic version embeds an in-page AI operator in the customer's own web application. To work across pages or tabs, or to let an external agent control the browser, you need to add the Chrome Extension and MCP Server.

8. How to use

Method 1: Quick trial via the demo CDN

The project provides a one-line script for technical evaluation:

China Mirror:

Note: The official demo CDN uses a free test LLM API. It is only suitable for technical evaluation and R&D testing. Do not use it directly in production, and do not enter personally identifiable information or sensitive data.

Method 2: npm integration

npm install page-agent

import { PageAgent } from 'page-agent'

const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  language: 'zh-CN',
})

await agent.execute('点击登录按钮') // "Click the login button"

Method 3: Production recommendations

When integrating into an enterprise web application, the official documentation advises against putting the real LLM API key in front-end code. A more reasonable approach is to forward LLM requests through an enterprise back-end proxy, and to handle authentication, auditing, rate limiting, data masking, and model routing at the proxy layer.

const agent = new PageAgent({
  baseURL: '/api/llm-proxy',
  model: 'gpt-5.1',
  customFetch: (url, init) =>
    fetch(url, { ...init, credentials: 'include' }),
})
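
On the server side of the `/api/llm-proxy` endpoint used above, the core job is to ignore whatever credentials the browser sent and attach the real key before forwarding. The following is a minimal, framework-agnostic sketch of that step. `ProxyConfig`, `buildUpstreamRequest`, and the model allow-list are assumptions for illustration; the only external convention relied on is the usual `Authorization: Bearer` header of OpenAI-compatible APIs.

```typescript
interface ProxyConfig {
  upstreamBaseURL: string; // the provider's OpenAI-compatible base URL
  apiKey: string;          // kept on the server only
  allowedModels: string[]; // simple model routing / allow-list
}

interface UpstreamRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the request that the proxy forwards upstream.
// Throws if the client asks for a model that is not on the allow-list.
function buildUpstreamRequest(path: string, clientBody: string, config: ProxyConfig): UpstreamRequest {
  const payload = JSON.parse(clientBody) as { model?: string };
  if (!payload.model || !config.allowedModels.includes(payload.model)) {
    throw new Error(`model not allowed: ${String(payload.model)}`);
  }
  return {
    url: config.upstreamBaseURL.replace(/\/+$/, "") + path,
    // Client-supplied headers are deliberately not copied,
    // so no browser-side key or cookie is forwarded upstream.
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify(payload),
  };
}
```

Authentication of the end user, audit logging, and rate limiting would wrap around this function in whatever web framework the customer already runs; they are left out here to keep the key-handling step visible.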

Method 4: MCP Server Beta

If the customer already has a local agent client, it can connect through `@page-agent/mcp`:

{
  "mcpServers": {
    "page-agent": {
      "command": "npx",
      "args": ["-y", "@page-agent/mcp"],
      "env": {
        "LLM_BASE_URL": "https://api.openai.com/v1",
        "LLM_API_KEY": "sk-xxx",
        "LLM_MODEL_NAME": "gpt-5.2"
      }
    }
  }
}

According to the official documentation, the MCP Server is still in Beta. It can be used for pre-sales demos, but production rollout requires special attention to security authorization, stability, and version compatibility.

9. Pre-sales talking points

Business-oriented

| Customer concern | Recommended wording |
| --- | --- |
| Users cannot figure out complex systems | "Instead of building yet another Q&A bot, it lets the AI operate the system directly for the user, turning training documents and operating steps into executable interactions." |
| Form filling / approval processes are very long | "It is suited to compressing 20 clicks and multi-field entry into one natural language instruction, especially in form-heavy scenarios such as ERP, CRM, OA, and financial reimbursement." |
| We do not want to make major changes to the legacy system | "It is a page-embedded approach. You can add an intelligent operation layer on top of the existing web application, which reduces the cost of rebuilding." |
| Customer service pressure | "Traditional customer service bots can only tell users how to do something. Page Agent lets the bot go one step further and complete the operation for the user." |
| We require a domestic or private model | "It supports OpenAI-compatible interfaces and local model paths, so in theory it can connect to the customer's existing large-model platform, but tool calls, context length, and CORS/proxy configuration must be verified." |

Technology-oriented

| Technical question | Recommended answer |
| --- | --- |
| How to integrate | "The CDN gives a quick trial; for production, npm package integration is recommended, with LLM requests proxied through the back end." |
| Whether a browser plug-in is required | "Automation within the current page does not require a plug-in; cross-page, multi-tab, and external agent calls require the Chrome Extension." |
| Does it rely on screenshots/visual models? | "It does not rely on screenshots or multimodal models. It mainly works from DOM text and structure, so it places higher demands on semantic HTML and accessibility." |
| How to handle security | "It needs to be designed together with an operation whitelist, data masking, user confirmation, permission control, a back-end proxy, and audit logs." |
| How to choose a model | "Prefer a model with stable tool calls, fast responses, and sufficient context. Small models or models with weak tool calls are usually not suitable for complex page operations." |

10. PoC Recommendations

Recommended PoC Target

Choose a process that the customer is familiar with, that has clear pain points, and whose risks can be controlled, such as:

| PoC process | What it validates |
| --- | --- |
| Create a new customer in the CRM and complete the fields | Form filling, field positioning, business rule prompts |
| Submit a leave/reimbursement application in the OA system | Multi-step process, confirmation actions, user interaction |
| Search for orders in the admin back end and export them | Search, filter, click, confirmation before download |
| Product teaching: "Demonstrate how to configure a function" | Teaching by doing, reduced training cost |
| Customer service bot performing operations | Experience upgrade from Q&A to execution |

PoC scope control

| Item | Recommendation |
| --- | --- |
| Page selection | Give priority to pages with clear DOM semantics, standard form structure, and stable interaction paths |
| Number of tasks | Start with 3-5 high-frequency processes; do not try to cover the whole system at the beginning |
| Model selection | First prove the effect with a model that has strong tool-call capabilities, then evaluate domestic/private model replacements |
| Security boundary | High-risk actions such as submit, delete, transfer, and approve must require a second confirmation |
| Data | Use masked test data, and avoid entering sensitive information when using the demo CDN |
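
The "second confirmation for high-risk actions" rule can be sketched as a small gate in front of action execution. The keyword list mirrors the examples in the table, while `PlannedAction`, `isHighRisk`, and `executeWithConfirmation` are hypothetical names. A real PoC would more likely classify risk by element or business operation than by label text alone.

```typescript
type PlannedAction = { tool: string; label: string }; // label: visible text of the target element

// Illustrative keyword list; a real project would maintain this per business system.
const HIGH_RISK_KEYWORDS = ["submit", "delete", "transfer", "approve"];

function isHighRisk(action: PlannedAction): boolean {
  const label = action.label.toLowerCase();
  return action.tool === "click" && HIGH_RISK_KEYWORDS.some((word) => label.includes(word));
}

// Runs the action only if it is low-risk or the user explicitly confirms it.
function executeWithConfirmation(
  action: PlannedAction,
  confirm: (action: PlannedAction) => boolean,
  execute: (action: PlannedAction) => void,
): "executed" | "rejected" {
  if (isHighRisk(action) && !confirm(action)) return "rejected";
  execute(action);
  return "executed";
}
```

In a browser, `confirm` could be backed by a dialog in the agent panel; keeping it as a callback makes the gate easy to test and to log for audit purposes.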

Example acceptance metrics

| Metric | Proposed target |
| --- | --- |
| Task completion rate | Selected processes reach over 80% before moving to the next round of optimization |
| Reduction in operation steps | Over 50% fewer steps than manual clicking/entry |
| Manual takeover | On failure or uncertainty, the agent gives a clear prompt and hands control back to the user |
| Confirmation of risky actions | 100% of actions such as delete, submit, and approve require confirmation |
| Log traceability | Tasks, actions, results, failure reasons, and user confirmations can all be recorded |
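
The first two metrics can be computed directly from PoC run logs. The sketch below assumes a hypothetical `TaskRun` record per attempt; how "steps" are counted on each side is a choice the PoC team would need to agree on with the customer beforehand.

```typescript
interface TaskRun {
  completed: boolean;  // task finished without manual takeover
  agentSteps: number;  // user-side steps with the agent (e.g. one instruction plus one confirmation)
  manualSteps: number; // clicks/entries the same task needs when done by hand
}

// Completion rate over all runs; step reduction over completed runs only,
// since a failed run does not save the user any steps.
function summarizeRuns(runs: TaskRun[]) {
  const completed = runs.filter((run) => run.completed);
  const manualTotal = completed.reduce((sum, run) => sum + run.manualSteps, 0);
  const agentTotal = completed.reduce((sum, run) => sum + run.agentSteps, 0);
  return {
    completionRate: runs.length === 0 ? 0 : completed.length / runs.length,
    stepReduction: manualTotal === 0 ? 0 : 1 - agentTotal / manualTotal,
  };
}
```

For example, if 4 of 5 runs complete, and each completed run takes 2 user-side steps instead of 20 manual ones, the completion rate is 4/5 = 80% and the step reduction is 1 - 8/80 = 90%.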

11. Risks and Considerations

| Risk | Description | Recommendation |
| --- | --- | --- |
| API key exposed on the front end | Configuring the real LLM key directly in front-end code risks leaking it | Production traffic must go through a back-end proxy |
| Page DOM quality affects results | Unclear semantics, buttons without text, and unstable dynamic elements reduce the success rate | Improve page accessibility and semantics |
| Unstable model output | A malformed tool call or an unstable plan causes failures | Choose a model with strong tool-call capabilities and configure retries and error recovery |
| Sensitive data leaving the environment | Page content may end up in LLM requests | Data masking, field masking, a private model, or a private gateway |
| Unauthorized operations | The agent may attempt actions the user did not intend to authorize | Permissions, whitelist, second confirmation, and auditing |
| Broader cross-page permissions | The Chrome Extension holds more permissions, which brings privacy risks if abused | Token authorization, trusted application list, least privilege, and user-visible confirmation |
| Maturity of Beta capabilities | The MCP Server is marked Beta | Use it for demos and internal PoCs first; evaluate carefully before production |
| Interaction boundaries | Dragging, hovering, right-clicking, complex visual operations, etc. are not supported | Avoid these interactions when selecting a process, or modify the page |
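
The "retries and error recovery" recommendation for unstable model output can be sketched as a bounded retry around a validator. Both function names and the expected output shape (a JSON object with a string `name` field) are assumptions for illustration. The producer is a synchronous stub here, where a real model call would be asynchronous.

```typescript
// Calls produce() until validate() accepts the output or the attempt budget is used up.
function retryUntilValid<T>(
  produce: (attempt: number) => string,
  validate: (raw: string) => T | null,
  maxAttempts = 3,
): { value: T; attempts: number } {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const value = validate(produce(attempt));
    if (value !== null) return { value, attempts: attempt };
  }
  throw new Error(`no valid output after ${maxAttempts} attempts; hand over to the user`);
}

// Example validator: accepts only a JSON object with a string "name" field.
function parseNamedCall(raw: string): { name: string } | null {
  try {
    const parsed = JSON.parse(raw);
    return typeof parsed?.name === "string" ? { name: parsed.name } : null;
  } catch {
    return null; // invalid JSON counts as malformed output
  }
}
```

Bounding the attempts keeps latency and token cost predictable, and the final error gives the UI a clear point at which to ask the user to take over.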

12. Relationship with related programs

| Category | Representative | Relationship with Page Agent |
| --- | --- | --- |
| browser-use style browser agents | browser-use | Page Agent borrows its DOM processing and prompt ideas, but its goal is client-side web page enhancement |
| Traditional RPA | UiPath, Yingdao (影刀), etc. | RPA leans toward cross-application process automation; Page Agent is better suited to being embedded in web products to enhance user experience |
| Multimodal browser agents | Visual screenshot models | Multimodal agents can understand visual content, but cost and permission requirements are higher; Page Agent is lighter, but limited by DOM semantics |
| SaaS built-in Copilot | Salesforce Copilot, Microsoft Copilot, and similar | Page Agent can serve as the underlying GUI operation capability of a self-developed SaaS Copilot |
| Enterprise agent platforms | Dify, Coze, LangGraph, MCP clients, etc. | Browser GUI operation can be connected to the existing agent tool chain through MCP/the extension |

13. My Pre-Sales Judgment

The value of Page Agent is that it moves from "AI that answers questions" to "AI that executes operations", which makes it especially suitable for enterprise-facing (B-side) web systems. The problem with many enterprise systems is not a lack of functions, but too many functions and paths that are too deep, so users do not know how to use them. Page Agent's page-embedded mode is well suited to wrapping complex operations behind a natural language entry point and to verifying the actual value of an AI Copilot at a low retrofit cost.

Its best pre-sales entry point is not "automation that replaces people", but "assisting users in completing complex page operations". The proposal should emphasize that the agent can be controlled, audited, confirmed, and taken over. Commitments to customers should also state clear boundaries: it is not good at visual recognition, not suited to large-scale server-side crawling, and not suited to executing high-risk actions without safeguards.

For the pre-sales demo, choose a high-frequency process that the customer is familiar with, such as "Create Customer Profile", "Submit Reimbursement", "Configure Product Parameters", or "Search Orders and Generate Reports". In the demo, first show the natural language entry point, then show how the agent understands the page, operates step by step, and asks the user for confirmation where necessary. This makes the value easier for business stakeholders to grasp than talking only about technical architecture.

14. Reusable customer Q & A

| Question | Suggested answer |
| --- | --- |
| Is this RPA? | "Not exactly. It is more like an AI operator embedded in a web application, focused on enhancing user experience and operational efficiency. If you want large-scale automation across systems and browsers, you still need to evaluate it together with the extension, MCP, or RPA capabilities." |
| Do you need to change the back end? | "Not for a quick trial. For production, it is recommended that the back end provide an LLM proxy, authentication, auditing, data masking, and model routing." |
| Can it be deployed privately? | "The project itself is MIT open source, and the model side supports OpenAI-compatible interfaces and local runtime paths. However, the actual results of a private deployment depend on the customer model's tool-call capability, context length, and latency, and on page complexity." |
| Can it operate any website? | "The basic PageAgent.js has to be integrated into the website and runs on the current page. The Chrome Extension extends it to arbitrary web pages and multiple tabs, but the permission and security requirements are higher." |
| Will it perform unintended operations? | "That needs to be managed with an operation whitelist, system instructions, confirmation of risky actions, user takeover, and audit logs. A pre-sales PoC should not be designed to run completely unsupervised." |
| Can it recognize images or diagrams? | "It is based primarily on DOM text and structure and does not rely on screenshots or multimodal models. Images, Canvas, WebGL, and purely visual cues are not its strong points." |
| Which pages work best? | "Pages with good semantic HTML, clear button/field labels, standardized form structure, and stable processes work best." |

15. Follow-up recommendations

  1. Select a real customer process for a 1-2 week PoC, giving priority to pages that are form-heavy, frequently repeated, and low-risk.
  2. Review the DOM and accessibility quality of the pages, and add missing button text, form labels, and ARIA attributes.
  3. Decide on the model route: public cloud, Qwen, OpenAI-compatible gateway, local Ollama/LM Studio, or the customer's existing model platform.
  4. Design the security policies: back-end proxy, data masking, operation whitelist, second confirmation, audit log, and exception takeover.
  5. Evaluate the Chrome Extension and MCP Server separately if cross-page tasks or external agent platforms are involved.
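
For the DOM and accessibility review in step 2, a quick first pass is to list interactive elements that expose no usable name. The sketch below works on plain element descriptors (so it runs without a browser). `ElementInfo` and `findUnnamedElements` are hypothetical, and collecting the descriptors from the live DOM is assumed to be done by a separate script.

```typescript
// Simplified description of an interactive element, e.g. collected from the DOM by a separate script.
interface ElementInfo {
  tag: "button" | "input" | "select" | "a";
  text: string;       // visible text content
  ariaLabel?: string; // aria-label attribute, if any
  labelText?: string; // text of an associated <label>, for form fields
}

// Returns elements that expose no name an agent (or a screen reader) could use to identify them.
function findUnnamedElements(elements: ElementInfo[]): ElementInfo[] {
  return elements.filter((el) => {
    const name = [el.text, el.ariaLabel, el.labelText].find(
      (s) => s !== undefined && s.trim() !== "",
    );
    return name === undefined;
  });
}
```

An icon-only button with no `aria-label` or an input without an associated label would show up in the result. Tracking the ratio of such elements per page is a simple way to decide which pages are ready for a PoC and which need cleanup first.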