1. One sentence positioning
Scrapling is a data collection framework for modern websites.
It is not a simple HTML parsing library, nor is it an automated tool that will only open the browser, but it combines the following capabilities:
-High performance HTML parsing and selectors.
-HTTP request, dynamic browser, stealth browser three Fetcher.
-Session of sustainable operation.
-Spider framework for large-scale tasks.
-Adaptive element positioning, still try to find the target element after the page structure changes.
-Agent rotation, block detection, pause resume, development cache.
-CLI and Interactive Web Scraping Shell.
-MCP Server for AI Agents.
Pre-sales can be expressed like this:
Scrapling is suitable for upgrading "web page data collection" from scattered scripts to maintainable data collection projects: simple pages go through fast HTTP, dynamic pages go through browser, protected pages go through stealth mode, large-scale tasks go through Spider, and can provide AI Agent with controllable and low token web page extraction capability through MCP.
2. What does it mostly do?
2.1 fast page capture and analysis
Scrapling provide a selector experience similar to Scrapy/Parsel and BeautifulSoup, supporting:
-CSS selector.
-XPath.
-BeautifulSoup style 'find_all '.
-Text search.
-Regular search.
-Chain selector.
-Parent, child, sibling node navigation.
-Similar element lookup.
-Automatic selector generation.
Typical example:
from scrapling.fetchers import Fetcher, FetcherSession
with FetcherSession(impersonate='chrome') as session:
page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
quotes = page.css('.quote .text::text').getall()
Pre-Sales Interpretation:
-For static pages and ordinary list pages, it can be used as a lighter collection method than Selenium/Playwright.
-For the team, the API is more complete than the naked 'requests lxml', and the maintenance cost is lower.
2.2 adaptive element positioning
The Scrapling README emphasizes that parser learns from site changes and automatically repositions elements when pages are updated. Typical writing:
from scrapling.fetchers import StealthyFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
products = p.css('.product', auto_save=True)
products = p.css('.product', adaptive=True)
This is important to the business, because one of the biggest maintenance costs of Web Capture is page revamped:
-class name change.
-DOM level changes.
-List structure slightly adjusted.
-Merchandise card style update.
Scrapling does not guarantee that all revisions will be fixed automatically, but it attempts to reduce selector vulnerability and is suitable for pre-sales talk about "reducing the maintenance cost of acquisition scripts".
2.3 Three types of Fetcher cover different websites
The official documentation divides Fetcher into three categories:
| Fetcher | Fit to Scenarios | Features | Pre-Sales Understanding |
|---|---|---|---|
| 'Fetcher' | Ordinary static page, page that can be completed by HTTP request | Fastest speed, low resource consumption, browser TLS/headers simulation, HTTP/3 support | Priority, lowest cost |
| 'DynamicFetcher' | JS dynamic loading, SPA, small automation, medium protection | Use Playwright Chromium/Chrome | Use when the page needs to execute JS |
| 'StealthyFetcher' | Dynamic pages, anti-crawling protection, Cloudflare Turnstile/Interstitial, etc. | Stealth browser, fingerprint disguise, anti-robot bypass capabilities are stronger | Higher cost, but suitable for complex sites |
Pre-sales advice:
Don't use the heaviest browser and stealth mode as soon as you come up. The collection scheme should be layered first: ordinary requests are given priority, dynamic pages are reused in browsers, and protection is indeed reused in StealthyFetcher. This controls cost, speed and stability.
2.4 Spider Massive Crawling Framework
The Scrapling Spider system is similar to Scrapy, but integrates its own parser and fetcher to support:
-'start_urls '.
-async 'parse' callback.
-'Request'/'Response' object.
-Concurrent crawl.
-Throttling and download delay by domain.
-Session more.
-Request priority queue.
-URL to go heavy.
-blocked request detection and retry.
-robots.txt optional compliance.
-checkpoint pause/resume.
-streaming output.
-Development mode cache response.
-JSON/JSONL export.
! Scrapling Spider Architecture
The official Spider data stream can be understood:
- Spider generates an initial request.
- Scheduler enter the priority queue and do fingerprint de-duplication.
- Crawler Engine fetching requests based on concurrency, domain name limit, download delay, and robots.txt rules.
- Session Manager choose HTTP, Dynamic or Stealthy Session based on 'sid.
- Session crawl the page and return to the Response.
- callback output item or follow-up Request.
- If 'crawldir' is set, the system will save the checkpoint and restore it later.
2.5 MCP Server for AI Agent
The Scrapling MCP Server is an important highlight that distinguishes it from traditional crawler libraries. It can directly expose the Scrapling acquisition capability to MCP-supporting Agents such as Claude, Cursor, and Claude Code.
The official MCP Server provides 10 types of tools:
| Tools | Purpose |
|---|---|
| 'get' | Fast HTTP crawl, support browser fingerprint simulation |
| 'bulk_get' | Multi-URL Concurrent HTTP Fetch |
| 'fetch' | Chromium/Chrome dynamic content crawling |
| 'bulk_fetch' | Multi-page browser concurrent crawling |
| 'stealthy_fetch' | incognito browser crawl, handle Cloudflare and other protection |
| 'bulk_stealthy_fetch' | Multi-URL stealth Concurrent Crawling |
| 'screenshot' | Session a screenshot of the opened browser and return the visible image content of the model |
| 'open_session' | Create a persistent browser Session |
| 'close_session' | Close Session |
| 'list_sessions' | View activity Session |
The official document emphasizes an important value: Scrapling MCP allows CSS selector to be used to narrow the scope of content before giving the content to the AI, thus reducing irrelevant content from entering the context and saving token.
This is very suitable for pre-sales talk about AI Agent scenarios:
Many web extraction tools will stuff the entire page into the large model, and then let the model find the field. Scrapling MCP can extract the target area with a selector at the tool layer and then hand it over to the model for understanding, which is faster, less token, and less affected by page noise.
2.6 CLI and Interactive Shell
Scrapling can also be used directly from the terminal:
scrapling shell
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
This is useful for pre-sales PoC because you can demonstrate without having to write the complete project first:
-The text of the webpage is Markdown.
-Pumping the content of a CSS selector.
-Extract dynamic web pages by browser.
-Verify feasibility with stealth mode for protected pages.
3. Applicable Scenario
3.1 Open Data Acquisition and Data Pipeline
Suitable for:
-News, announcements, bidding, policy web page collection.
-Brand official website, product page and document page are regularly captured.
-Industry intelligence and public web archiving.
-Web data entry for data lakes/knowledge bases.
Pre-sales value:
-More standardized than handwritten scripts.
-Lighter than browser automation.
-Support for progressive scaling from single page to crawler framework.
3.2 price monitoring and competition intelligence
Suitable for:
-E-commerce commodity title, price, inventory, evaluation data collection.
-Competition official website activity page monitoring.
-Hotel, airline, ticketing, SaaS pricing page change tracking.
Why appropriate:
-Adaptive selector can reduce the maintenance cost of page revision.
-Multi-Session and agent rotation are suitable for medium-sized grabs.
-Spider supports checkpoint and streaming, suitable for long tasks.
Pre-Sales Reminder:
-Such scenarios are prone to service terms, frequency restrictions, and commercial data compliance. It is necessary to clearly collect only allowed data and control the frequency of requests.
3.3 SEO, Public Opinion and Content Monitoring
Suitable for:
-Search results page or public page content extraction.
-Brand reference monitoring.
-Site structure, title, body, link check.
-Content change monitoring.
Scrapling CLI and Markdown output for building content monitoring pipelines.
3.4 AI Agent Web Page Extraction Tool
Suitable for:
-Make up a web page reading/extracting MCP tool for the enterprise Agent platform.
-Let Agent draw goods, articles and lists according to specified selector.
-Do conversational analysis of dynamic and complex pages.
-Let the Agent take screenshots and combine visual understanding.
Why it is valuable:
-MCP Server built-in tools, low access cost.
-The bulk tool is supported to prevent the Agent from slowly grasping one URL at a time.
-Support session reuse, reduce repeated open browser.
-Support prompt injection cleaning, the default 'main_content_only 'will remove hidden elements, HTML comments, zero-width characters and other potential injection content.
Development and debugging of 3.5 acquisition scripts
Suitable for data engineers and crawler engineers:
-Try selector quickly with an interactive shell.
-Converts a curl request into a Scrapling request.
-Development mode cache response to avoid repeatedly hit the target site during debugging.
-Use the browser to open the request result and quickly confirm the page status.
4. Not quite the scene
| Not suitable for the scene | Reason | Suggestion |
|---|---|---|
| Unauthorized collection of sensitive data | README explicitly requires compliance with laws, privacy, ToS and robots.txt | Compliance assessment required |
| Extremely large-scale commercial crawler platform | The Scrapling is a framework, not a complete distributed scheduling platform | Need to cooperate with queues, task scheduling, agent pools, monitoring, and storage systems |
| Only need to parse local HTML | It may be lighter to directly use lxml, BeautifulSoup, Parsel | Small tasks do not need to introduce full fetcher/spider |
| Strong verification code/login/wind control link | StealthyFetcher it has the ability, it does not mean that it bypasses all protections indefinitely | requires manual verification, proxy service and legal authorization |
| Customers with unclear collection legality boundary | High data compliance risk | Do legal and business boundary confirmation first |
5. Architecture and core competencies
5.1 ability stratification
HTTP / Dynamic / Stealthy"] Fetcher --> Parser["Parser / Selector
CSS / XPath / Text / Similarity"] Parser --> Spider["Spider Framework
Scheduler / Engine / Session Manager"] Spider --> Output["Items / JSON / JSONL / Markdown / HTML"] Spider --> Ops["Proxy / Checkpoint / Cache / Stats / Retry"]
Relationship between 5.2 and Scrapy
Scrapling's Spider borrows from Scrapy, but with a modernizing trade-off:
| Concept | Scrapy | Scrapling |
|---|---|---|
| Spider | 'scrapy. Spider' | 'scrapling.spiders.Spider' |
| Callback | Sync 'parse' as the main | async 'parse' |
| Downloader | Downloader Middlewares | Session Manager, Multiple Session |
| Pause/Resume | 'JOBDIR' | 'crawldir' |
| Export | Feed exports | 'to_json()' / 'to_jsonl()' or hooks |
| Streaming | Non-core competencies | 'async for item in spider.stream()' |
| Multi-Session | Requires customization | Native support for routing with different session IDs |
| blocked Detection | Custom Middleware | Built-in is_blocked() and retry hook |
Pre-sales judgment:
-If the customer already has a mature Scrapy system, it is not necessary to replace it.
-If customers want to start with a new project and need dynamic web pages, stealth, MCP and AI Agent access at the same time, the Scrapling will be more integrated.
How to use #6.
6.1 installation
The basic installation contains only the parser engine:
pip install scrapling
If you want to use fetchers / spiders:
pip install "scrapling[fetchers]"
scrapling install
If you want to use MCP:
pip install "scrapling[ai]"
scrapling install
scrapling mcp
If you want to use shell and extract commands:
pip install "scrapling[shell]"
Install All:
pip install "scrapling[all]"
scrapling install
Docker:
docker pull pyd4vinci/scrapling
docker pull ghcr.io/d4vinci/scrapling:latest
6.2 Spider Example
from scrapling.spiders import Spider, Request, Response
class QuotesSpider(Spider):
name = "quotes"
start_urls = ["https://quotes.toscrape.com/"]
concurrent_requests = 10
async def parse(self, response: Response):
for quote in response.css('.quote'):
yield {
"text": quote.css('.text::text').get(),
"author": quote.css('.author::text').get(),
}
next_page = response.css('.next a')
if next_page:
yield response.follow(next_page[0].attrib['href'])
result = QuotesSpider().start()
result.items.to_json("quotes.json")
6.3 Multi-Session Example
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSessionSpider(Spider):
name = "multi"
start_urls = ["https://example.com/"]
def configure_sessions(self, manager):
manager.add("fast", FetcherSession(impersonate="chrome"))
manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
async def parse(self, response: Response):
for link in response.css('a::attr(href)').getall():
if "protected" in link:
yield Request(link, sid="stealth")
else:
yield Request(link, sid="fast", callback=self.parse)
6.4 MCP Server Configuration Example
Claude Desktop configuration example:
{
"mcpServers": {
"ScraplingServer": {
"command": "scrapling",
"args": ["mcp"]
}
}
}
HTTP transport:
scrapling mcp --http --host '127.0.0.1' --port 80007. What can I say before sales
7.1 for business
Scrapling can help us to stably connect public web page data into business systems, such as price monitoring, competition tracking, announcement collection and content monitoring. It is more maintainable than one-off scripts, lighter than pure browser automation, and can handle dynamic pages and partial anti-crawl scenarios.
7.2 for Technical Leader
its advantage is clear layering: light HTTP for ordinary pages, browser for dynamic pages, stealth for complex protection pages, Spider for large-scale tasks, bulk/session for multi-URL, and MCP for Agent scenarios. This allows you to select the lowest-cost but sufficiently stable crawl path by scenario.
7.3 AI-oriented Agent Platform Leader
The highlight of Scrapling MCP is that the tool layer first extracts the target element and then hands it over to the model to avoid stuffing the whole page of irrelevant content into the context. For web page reading, product information extraction, competition page analysis, and knowledge base collection, this is more controllable and token-saving than ordinary web page reading tools.
8. Frequently Asked Customer Questions
| Scrapy is a mature crawler framework; Scrapling emphasizes modern web pages, dynamic pages, stealth fetcher, adaptive selector, MCP, and AI Agent integration. The existing Scrapy system may not be replaced, and new projects can be evaluated for Scrapling. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Can it bypass anti-crawl? | It provides StealthyFetcher, fingerprint disguise, Cloudflare Turnstile/Interstitial related capabilities, but it is not a universal bypass. The frequency must be legally authorized, controlled, and verified against the target site. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Can I directly use it for the Agent? | Yes, the Scrapling provides MCP Server and supports tools such as get/fetch/stealthy_fetch/bulk/session/screenshot. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Is it very heavy? | The basic parser is very light. fetchers, browser, MCP, and Shell are optional dependencies. Can be installed by scene. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Can I do large-scale collection? | Spider supports concurrency, multi-session, checkpoint, streaming, and agent rotation, but a complete large-scale platform also requires scheduling, queuing, storage, monitoring, and agent pools. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Can the website be automatically restored after revision? | The adaptive selector can reduce maintenance costs, but it cannot guarantee that all revisions will be restored without feeling. Critical collection tasks still require monitoring and exception alarms. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Is it compliant? | The tool itself is neutral; the project README explicitly requires compliance with laws, privacy, terms of service and robots.txt. Confirm the compliance boundary before commercial use. |
9. PoC Recommendations
9.1 PoC Topic Selection
It is recommended to choose a legal, well-defined scenario that can measure value:
-Capture the title, price, and inventory of 100 public product pages.
-Monitor 20 competitive pricing page changes.
-- Grab a batch of public announcements and turn them into Markdown.
-Extract MCP from a web page to the enterprise Agent and compare token consumption.
-For a dynamic web page to do ordinary HTTP, DynamicFetcher, StealthyFetcher layered comparison.
9.2 PoC Acceptance Index
| Indicator | Description |
|---|---|
| Acquisition Success Rate | Whether multiple runs are stable gets field |
| Field Accuracy | Whether the title, price, date, etc. are correct. |
| recovery rate of structural changes | whether the page class/level can still be extracted after slight changes |
| Speed | Single page time, batch time, concurrent throughput |
| Resource usage | Common request vs CPU/memory requested by browser |
| Blocking Rate | Percentage of blocked/captcha/403 by target station |
| Compliance and Controllability | Whether the robots.txt, frequency, data range, and log are clear |
| Agent token cost | Difference in token consumption before and after MCP selector extraction |
9.3 demo script
- Use 'Fetcher' to grab a normal page.
- Draw fields with CSS selector.
- Use 'DynamicFetcher' to grab a JS page.
- Use 'Spider' to do multi-page crawling and export JSON.
- Use CLI to turn the page Markdown.
- Use MCP to let Agent draw specified fields by selector.
- Display the compliance settings: frequency limit, robots.txt, request log, and data field whitelist.
10. Risks and Considerations
10.1 compliance risk is number one
The disclaimer Scrapling README clearly emphasizes that:
-For education and research.
-Users should comply with local and international data capture and privacy laws.
-Respect website terms of service and robots.txt.
-The author is not responsible for the abuse.
This must be put up front in pre-sales, and projects should not be packaged as "unlimited crawl of any website" tools.
10.2 anti-climbing ability is not equal to commercially available license
Even if it is technically accessible, it does not mean that it can be collected commercially. Customer needs to confirm:
-Whether the data is public.
-Whether there is authorization.
-Whether commercial use is allowed.
-Whether personal information is involved.
-Whether there is a frequency limit.
-Whether a stop mechanism is required.
10.3 Stealth and browser modes are more expensive
Browser mode is slower and more memory intensive than normal HTTP; stealth mode is heavier. Large-scale tasks have to be layered:
- First ordinary request.
- No more dynamic.
- Still no stealth.
- Proxy pools and scheduling are introduced when multiple sites and multiple areas are required.
10.4 MCP Session should pay attention to resource release
Official documents remind: close the persistent Session when it runs out, otherwise the browser will always be open. Agent workflows need to include 'close_session and exception pockets.
11. My Pre-Sales Judgment
Scrapling is a very suitable for the "data collection AI agent" cross scene of the project.
Its advantage is not a single point of functionality, but the combination is complete:
-Lightweight HTTP.
-Dynamic browser.
-Stealth mode.
-Adaptive selector.
-Spider concurrency framework.
-CLI.
-MCP.
-Docker.
From a pre-sales perspective, the recommended positioning is:
Adaptive data collection framework for modern web pages, suitable for building an engineering platform for legal public data collection, competition monitoring, AI Agent web page extraction and collection scripts.
Not recommended positioning is:
The promise of "universal anti-crawl bypass tool" or "any website can catch.
Best for advancing customers:
- There is a need for public web page data collection, but the current script maintenance cost is high.
- An enterprise knowledge base/data lake is being made and web data entry is required.
- To be a AI Agent, you need a controllable web page extraction tool.
- Do price/competition/SEO/content monitoring.
- Python data engineering capabilities, hope to build their own rather than fully purchase SaaS.
12. REFERENCE
-GitHub repository:D4Vinci/Scrapling
-Official README:README.md
-Chinese README:README_CN.md
-Official Documentation:Scrapling Docs
-Fetcher selection:Fetchers basics
-Spider Architecture:Spiders architecture
-MCP Server:Scrapling MCP Server Guide
-Spider architecture diagram:spider_architecture.png
-Cover diagram:cover_light.svg
Information verification date: 2026-06-30. This note has not been written into real-time stars/forks due to anonymous access to GitHub API triggering stream restriction. Project capabilities, installation methods, benchmarks, MCP tools and compliance reminders are mainly based on official README, Chinese README and ReadTheDocs documents.