Scrapling - AI Navigation

Scrapling is an adaptive web scraping framework that covers the full chain from single-page fetching, dynamic page fetching and stealth browser fetching to concurrent Spiders, proxy rotation, pause/resume, a CLI, an interactive shell and an MCP Server. For pre-sales, the point most worth attention is that it packages real collection problems such as page structure changes, anti-crawling restrictions, dynamic loading and low-token extraction for AI Agents into one framework that is friendly to Python developers. It suits public data collection, competitor and price monitoring, SEO and public-opinion monitoring, AI Agent web page extraction and data pipeline prototyping. In every case, legal authorization, robots.txt, terms of service and privacy compliance must be emphasized.

1. One-sentence positioning

Scrapling is a data collection framework for modern websites.

It is not just an HTML parsing library, nor an automation tool that only opens a browser. It combines the following capabilities:

-High-performance HTML parsing and selectors.

-Three Fetchers: HTTP request, dynamic browser and stealth browser.

-Persistent sessions.

-A Spider framework for large-scale tasks.

-Adaptive element positioning that still tries to find the target element after the page structure changes.

-Proxy rotation, block detection, pause/resume and a development cache.

-A CLI and an interactive web scraping shell.

-An MCP Server for AI Agents.

For pre-sales, it can be put like this:

Scrapling suits upgrading "web page data collection" from scattered scripts to a maintainable data collection project: simple pages go through fast HTTP, dynamic pages through a browser, protected pages through stealth mode and large-scale tasks through Spider, while MCP gives AI Agents a controllable, low-token web page extraction capability.

2. What does it mainly do?

2.1 Fast page fetching and parsing

Scrapling provides a selector experience similar to Scrapy/Parsel and BeautifulSoup, supporting:

-CSS selector.

-XPath.

-BeautifulSoup-style 'find_all'.

-Text search.

-Regular-expression search.

-Chained selectors.

-Parent, child, sibling node navigation.

-Similar element lookup.

-Automatic selector generation.

Typical example:

from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

Pre-sales interpretation:

-For static pages and ordinary list pages, it is a lighter collection method than Selenium/Playwright.

-For a team, the API is more complete than bare 'requests' + 'lxml', and the maintenance cost is lower.

2.2 Adaptive element positioning

The Scrapling README emphasizes that the parser learns from site changes and automatically relocates elements when pages are updated. Typical usage:

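from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True  # turn on adaptive element tracking
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
# First run: save the matched elements so they can be relocated later.
products = p.css('.product', auto_save=True)
# Later run, after the site structure changed: relocate them from the saved data.
products = p.css('.product', adaptive=True)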

This matters to the business, because one of the biggest maintenance costs of web scraping is page redesigns:

-Class name changes.

-DOM hierarchy changes.

-Slightly adjusted list structure.

-Updated product card styles.

Scrapling does not guarantee that every redesign is handled automatically, but it tries to reduce selector fragility, which supports the pre-sales message of "reducing the maintenance cost of collection scripts".
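
To make the idea of relocating an element after a redesign concrete, here is a toy sketch. It is not Scrapling's algorithm: the property names ('tag', 'classes', 'parent', 'text'), the 'difflib' similarity measure and the 0.6 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def signature(element: dict) -> str:
    """Flatten a few element properties into one comparable string."""
    return " ".join([element["tag"], element["classes"], element["parent"], element["text"]])

def relocate(saved: dict, candidates: list, threshold: float = 0.6):
    """Return the candidate most similar to the saved element, or None."""
    target = signature(saved)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, target, signature(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

# Properties saved on an earlier run (hypothetical data).
saved = {"tag": "div", "classes": "product card", "parent": "ul.list", "text": "Blue Mug 9.99"}

# The same page after a redesign: class and parent names changed.
page_after_redesign = [
    {"tag": "nav", "classes": "menu", "parent": "header", "text": "Home About"},
    {"tag": "div", "classes": "product-item card", "parent": "ul.items", "text": "Blue Mug 9.99"},
]

match = relocate(saved, page_after_redesign)
```

The old selector '.product' would no longer match the renamed class, but the similarity comparison still picks the product card. When nothing is similar enough the function returns None, which is the case that monitoring and alerts have to catch.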

2.3 Three types of Fetcher cover different websites

The official documentation divides Fetcher into three categories:

Fetcher	Suitable scenarios	Features	Pre-sales takeaway
'Fetcher'	Ordinary static pages and pages that a plain HTTP request can retrieve	Fastest, low resource usage, browser TLS/header impersonation, HTTP/3 support	First choice, lowest cost
'DynamicFetcher'	JS-loaded pages, SPAs, light automation, medium protection	Uses Playwright with Chromium/Chrome	Use when the page needs to execute JS
'StealthyFetcher'	Dynamic pages with anti-crawling protection, Cloudflare Turnstile/Interstitial, etc.	Stealth browser, fingerprint spoofing, stronger anti-bot bypass capabilities	Higher cost, but suited to complex sites

Pre-sales advice:

Don't start with the heaviest browser and stealth mode. Layer the collection plan first: try plain HTTP requests first, switch to a browser for dynamic pages, and use StealthyFetcher only when protection really requires it. This keeps cost, speed and stability under control.
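
The layering advice can be sketched as a small escalation loop. This is a minimal illustration under assumptions: 'fetch_layered', the tier names and the stub fetch functions are hypothetical, and in a real project each stub would wrap 'Fetcher', 'DynamicFetcher' or 'StealthyFetcher'.

```python
from typing import Callable, Optional

# A tier is (name, fetch function). The fetch functions here are placeholders.
Tier = tuple[str, Callable[[str], Optional[str]]]

def fetch_layered(url: str, tiers: list[Tier], is_usable: Callable[[str], bool]):
    """Try tiers from cheapest to most expensive; stop at the first usable result."""
    for name, fetch in tiers:
        html = fetch(url)
        if html is not None and is_usable(html):
            return name, html
    return None, None

# Stubs standing in for the three fetchers (hypothetical responses).
def plain_http(url):
    return "<div id='app'></div>"  # JS shell only, no data yet

def browser(url):
    return "<div id='app'><p class='price'>9.99</p></div>"

def stealth(url):
    raise AssertionError("stealth tier should not be needed for this page")

tier_used, html = fetch_layered(
    "https://example.com/item",
    [("http", plain_http), ("dynamic", browser), ("stealth", stealth)],
    is_usable=lambda page: "price" in page,
)
```

The 'is_usable' check is where project-specific knowledge lives, for example "the price field is present". Recording which tier succeeded per site also gives the PoC a direct cost comparison.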

2.4 Spider large-scale crawling framework

Scrapling's Spider system is similar to Scrapy but integrates its own parser and fetchers. It supports:

-'start_urls'.

-Async 'parse' callback.

-'Request'/'Response' objects.

-Concurrent crawling.

-Per-domain throttling and download delay.

-Multiple sessions.

-Request priority queue.

-URL deduplication.

-Blocked-request detection and retry.

-Optional robots.txt compliance.

-Checkpoint pause/resume.

-Streaming output.

-Development-mode response caching.

-JSON/JSONL export.

[Figure: Scrapling Spider architecture]

The official Spider data flow can be understood as follows:

The Spider generates the initial requests.
The Scheduler puts them into a priority queue and deduplicates them by fingerprint.
The Crawler Engine fetches requests subject to concurrency, per-domain limits, download delay and robots.txt rules.
The Session Manager chooses an HTTP, Dynamic or Stealthy session based on 'sid'.
The session fetches the page and returns a Response.
The callback yields items or follow-up Requests.
If 'crawldir' is set, the system saves checkpoints and can resume from them later.
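
The scheduler step (priority queue plus fingerprint deduplication) can be illustrated with the standard library. This is a sketch of the general technique, not Scrapling's scheduler: the class, the method names and the fingerprint recipe (SHA-1 of method and URL) are assumptions.

```python
import hashlib
import heapq
import itertools

class Scheduler:
    """Priority queue with fingerprint deduplication (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: FIFO within one priority

    @staticmethod
    def fingerprint(method: str, url: str) -> str:
        return hashlib.sha1(f"{method.upper()} {url}".encode()).hexdigest()

    def enqueue(self, url: str, priority: int = 0, method: str = "GET") -> bool:
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return False  # duplicate request, dropped
        self._seen.add(fp)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

scheduler = Scheduler()
first = scheduler.enqueue("https://example.com/a")
urgent = scheduler.enqueue("https://example.com/b", priority=5)
duplicate = scheduler.enqueue("https://example.com/a")
order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
```

The higher-priority URL is popped first and the repeated URL is never queued a second time, which is the behavior the data flow describes.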

2.5 MCP Server for AI Agent

The Scrapling MCP Server is an important highlight that sets it apart from traditional crawler libraries. It exposes Scrapling's collection capabilities directly to MCP-capable agents such as Claude, Cursor and Claude Code.

The official MCP Server provides 10 tools:

Tool	Purpose
'get'	Fast HTTP fetch with browser fingerprint impersonation
'bulk_get'	Concurrent HTTP fetch of multiple URLs
'fetch'	Dynamic content fetch with Chromium/Chrome
'bulk_fetch'	Concurrent browser fetch of multiple pages
'stealthy_fetch'	Stealth browser fetch that handles Cloudflare and other protections
'bulk_stealthy_fetch'	Concurrent stealth fetch of multiple URLs
'screenshot'	Take a screenshot of a page in an open browser session and return it to the model as visible image content
'open_session'	Create a persistent browser session
'close_session'	Close a session
'list_sessions'	List active sessions

The official documentation stresses one key benefit: Scrapling MCP lets a CSS selector narrow the content before it is handed to the AI, which keeps irrelevant content out of the context and saves tokens.

This fits the pre-sales pitch for AI Agent scenarios well:

Many web extraction tools stuff the entire page into the large model and then let the model find the fields. Scrapling MCP can extract the target region with a selector at the tool layer and then hand it to the model, which is faster, uses fewer tokens and is less affected by page noise.
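
The effect of narrowing before handing content to the model can be shown with a small standard-library sketch. It is not the MCP Server's implementation: an 'id' lookup stands in for a CSS selector, the sample page is made up, and character count is used only as a rough proxy for tokens.

```python
from html.parser import HTMLParser

class RegionText(HTMLParser):
    """Collect the text inside the element whose id equals target_id."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # > 0 while the parser is inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())

def region_text(html: str, target_id: str) -> str:
    parser = RegionText(target_id)
    parser.feed(html)
    return " ".join(parser.parts)

page = (
    "<html><body><nav>Home | Deals | Login</nav>"
    '<div id="products"><h2>Blue Mug</h2><span>9.99</span></div>'
    "<footer>Terms, privacy, cookie banner text</footer></body></html>"
)
narrowed = region_text(page, "products")
saving = 1 - len(narrowed) / len(page)  # share of characters kept out of the context
```

On a real page the navigation, footer and scripts are usually far larger than the target region, so the saving measured in a PoC is the number worth showing the customer.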

2.6 CLI and Interactive Shell

Scrapling can also be used directly from the terminal:

scrapling shell

scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare

This is useful for a pre-sales PoC because you can demo the following without writing a complete project first:

-Converting a web page's text to Markdown.

-Extracting the content matched by a CSS selector.

-Extracting dynamic web pages with a browser.

-Verifying feasibility on protected pages with stealth mode.

3. Applicable scenarios

3.1 Public data collection and data pipelines

Suitable for:

-Collecting news, announcements, tenders and policy pages.

-Regularly fetching brand websites, product pages and documentation pages.

-Industry intelligence and public web archiving.

-Web data ingestion for data lakes/knowledge bases.

Pre-sales value:

-More standardized than handwritten scripts.

-Lighter than browser automation.

-Supports gradual scaling from a single page to a crawler framework.

3.2 Price monitoring and competitive intelligence

Suitable for:

-Collecting e-commerce product titles, prices, stock and review data.

-Monitoring competitors' promotional pages.

-Tracking changes on hotel, airline, ticketing and SaaS pricing pages.

Why it fits:

-Adaptive selectors can reduce the maintenance cost of page redesigns.

-Multi-session and proxy rotation suit medium-scale crawls.

-Spider supports checkpoints and streaming, which suits long-running tasks.

Pre-sales reminder:

-Such scenarios often run into terms of service, rate limits and commercial data compliance. Make clear that only permitted data is collected and that the request frequency is controlled.
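
Controlling request frequency per domain can be sketched as follows. This is a standalone illustration, not Scrapling's built-in throttling: the 'DomainThrottle' class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # domain -> time of the most recent permitted request

    def wait(self, url: str) -> float:
        """Block until the URL's domain may be requested again; return the wait."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(domain)
        delay = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last[domain] = now + delay
        return delay

# Demo with a frozen clock and a recording sleep, so nothing actually blocks.
slept = []
throttle = DomainThrottle(2.0, clock=lambda: 100.0, sleep=slept.append)
waits = [
    throttle.wait("https://a.example/1"),
    throttle.wait("https://a.example/2"),  # same domain, must wait
    throttle.wait("https://b.example/1"),  # different domain, no wait
]
```

In production the defaults ('time.monotonic' and 'time.sleep') apply, and the interval would come from the agreed frequency limit for each site.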

3.3 SEO, public opinion and content monitoring

Suitable for:

-Extracting content from search results pages or public pages.

-Brand mention monitoring.

-Checking site structure, titles, body text and links.

-Content change monitoring.

Scrapling's CLI and Markdown output suit building content monitoring pipelines.
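
A content-change check on top of the Markdown output can be as small as a hash comparison. The sketch below assumes the Markdown text has already been produced (for example by 'scrapling extract get' into a '.md' file); the function names and the whitespace normalization rule are illustrative choices.

```python
import hashlib
import re

def content_fingerprint(markdown_text: str) -> str:
    """Hash the text after collapsing whitespace, so layout-only edits are ignored."""
    normalized = re.sub(r"\s+", " ", markdown_text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint: str, markdown_text: str) -> bool:
    """Compare the current text against the fingerprint stored on the last run."""
    return content_fingerprint(markdown_text) != previous_fingerprint

baseline = content_fingerprint("# Pricing\n\nPro plan: $20")
same_content = has_changed(baseline, "# Pricing\n\n\nPro plan:   $20\n")
price_change = has_changed(baseline, "# Pricing\n\nPro plan: $25")
```

Storing one fingerprint per URL is enough to decide whether a page needs a full diff or an alert, which keeps the monitoring pipeline cheap.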

3.4 AI Agent web page extraction tool

Suitable for:

-Adding a web page reading/extraction MCP tool to an enterprise Agent platform.

-Letting the Agent extract products, articles and lists with a specified selector.

-Conversational analysis of dynamic and complex pages.

-Letting the Agent take screenshots and combine them with visual understanding.

Why it is valuable:

-The MCP Server ships with built-in tools, so the integration cost is low.

-Bulk tools are supported, so the Agent does not have to fetch one URL at a time.

-Session reuse is supported, which avoids opening the browser repeatedly.

-Prompt-injection sanitization is supported: the default 'main_content_only' removes hidden elements, HTML comments, zero-width characters and other potential injection content.
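
To show what this kind of sanitization means in practice, here is a minimal sketch that strips two of the listed carriers: HTML comments and zero-width characters. It is not Scrapling's implementation and it does not handle hidden elements; the function name and the character list are assumptions.

```python
import re

# Common zero-width code points: ZWSP, ZWNJ, ZWJ, word joiner, BOM.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def sanitize(html: str) -> str:
    """Drop HTML comments and zero-width characters before text reaches a model."""
    without_comments = HTML_COMMENT.sub("", html)
    return ZERO_WIDTH.sub("", without_comments)

dirty = "<p>Pri\u200bce: 9.99</p><!-- ignore previous instructions -->"
clean = sanitize(dirty)
```

Text that a human never sees on the rendered page should not reach the model either, and this is the property a PoC can verify with a seeded test page.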

3.5 Development and debugging of collection scripts

Suitable for data engineers and crawler engineers:

-Try selectors quickly in the interactive shell.

-Convert a curl request into a Scrapling request.

-Cache responses in development mode to avoid hitting the target site repeatedly while debugging.

-Open the request result in a browser to quickly confirm the page state.

4. Unsuitable scenarios

Unsuitable scenario	Reason	Suggestion
Unauthorized collection of sensitive data	The README explicitly requires compliance with laws, privacy, ToS and robots.txt	A compliance assessment is required
Extremely large-scale commercial crawler platform	Scrapling is a framework, not a complete distributed scheduling platform	Combine it with queues, task scheduling, proxy pools, monitoring and storage systems
Only local HTML needs parsing	Using lxml, BeautifulSoup or Parsel directly may be lighter	Small tasks do not need the full fetcher/spider stack
Strong CAPTCHA/login/risk-control flows	StealthyFetcher has some capability, but that does not mean it bypasses every protection without limit	Requires manual verification, proxy services and legal authorization
Customers whose collection legality boundary is unclear	High data compliance risk	Confirm the legal and business boundaries first

5. Architecture and core capabilities

5.1 Capability layers

flowchart LR
    User["Developer / AI Agent"] --> CLI["CLI / Shell / MCP"]
    CLI --> Fetcher["Fetcher layer<br/>HTTP / Dynamic / Stealthy"]
    Fetcher --> Parser["Parser / Selector<br/>CSS / XPath / Text / Similarity"]
    Parser --> Spider["Spider Framework<br/>Scheduler / Engine / Session Manager"]
    Spider --> Output["Items / JSON / JSONL / Markdown / HTML"]
    Spider --> Ops["Proxy / Checkpoint / Cache / Stats / Retry"]

5.2 Relationship with Scrapy

Scrapling's Spider borrows from Scrapy, with some modernizing trade-offs:

Concept	Scrapy	Scrapling
Spider	'scrapy.Spider'	'scrapling.spiders.Spider'
Callback	Mainly synchronous 'parse'	Async 'parse'
Downloader	Downloader Middlewares	Session Manager, multiple sessions
Pause/Resume	'JOBDIR'	'crawldir'
Export	Feed exports	'to_json()' / 'to_jsonl()' or hooks
Streaming	Not a core capability	'async for item in spider.stream()'
Multi-session	Requires customization	Native routing by session ID
Blocked detection	Custom middleware	Built-in 'is_blocked()' and retry hook

Pre-sales judgment:

-If the customer already has a mature Scrapy setup, there is no need to replace it.

-If the customer is starting a new project and needs dynamic pages, stealth, MCP and AI Agent integration together, Scrapling is the more integrated choice.

6. How to use

6.1 Installation

The basic installation contains only the parser engine:

pip install scrapling

If you want to use fetchers / spiders:

pip install "scrapling[fetchers]"
scrapling install

If you want to use MCP:

pip install "scrapling[ai]"
scrapling install
scrapling mcp

If you want to use shell and extract commands:

pip install "scrapling[shell]"

Install All:

pip install "scrapling[all]"
scrapling install

Docker:

docker pull pyd4vinci/scrapling
docker pull ghcr.io/d4vinci/scrapling:latest

6.2 Spider Example

from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }

        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

result = QuotesSpider().start()
result.items.to_json("quotes.json")

6.3 Multi-Session Example

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)

6.4 MCP Server Configuration Example

Claude Desktop configuration example:

{
  "mcpServers": {
    "ScraplingServer": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}

HTTP transport:

scrapling mcp --http --host '127.0.0.1' --port 8000

7. What to say in pre-sales

7.1 For business stakeholders

Scrapling can help us feed public web page data reliably into business systems, for uses such as price monitoring, competitor tracking, announcement collection and content monitoring. It is more maintainable than one-off scripts, lighter than pure browser automation, and can handle dynamic pages and some anti-crawling scenarios.

7.2 For technical leads

Its advantage is clear layering: light HTTP for ordinary pages, a browser for dynamic pages, stealth for pages with complex protection, Spider for large-scale tasks, bulk/session tools for multiple URLs and MCP for Agent scenarios. This lets you pick the lowest-cost crawl path that is still stable enough for each scenario.

7.3 For AI Agent platform leads

The highlight of Scrapling MCP is that the tool layer extracts the target element first and then hands it to the model, instead of stuffing a whole page of irrelevant content into the context. For web page reading, product information extraction, competitor page analysis and knowledge base collection, this is more controllable and saves more tokens than ordinary web page reading tools.

8. Frequently asked customer questions


Question	Answer
How does it differ from Scrapy?	Scrapy is a mature crawler framework; Scrapling emphasizes modern web pages, dynamic pages, stealth fetchers, adaptive selectors, MCP and AI Agent integration. An existing Scrapy setup does not need to be replaced, and new projects can evaluate Scrapling.
Can it bypass anti-crawling?	It provides StealthyFetcher, fingerprint spoofing and Cloudflare Turnstile/Interstitial handling, but it is not a universal bypass. Use must be legally authorized, rate-controlled and verified against the target site.
Can it be used directly by an Agent?	Yes. Scrapling provides an MCP Server with tools such as get/fetch/stealthy_fetch/bulk/session/screenshot.
Is it heavy?	The basic parser is very light. Fetchers, browsers, MCP and the shell are optional dependencies that can be installed per scenario.
Can it do large-scale collection?	Spider supports concurrency, multi-session, checkpoints, streaming and proxy rotation, but a complete large-scale platform also needs scheduling, queuing, storage, monitoring and proxy pools.
Does extraction recover automatically after a site redesign?	Adaptive selectors can reduce maintenance cost, but they cannot guarantee seamless recovery from every redesign. Critical collection tasks still need monitoring and failure alerts.
Is it compliant?	The tool itself is neutral; the project README explicitly requires compliance with laws, privacy, terms of service and robots.txt. Confirm the compliance boundary before commercial use.

9. PoC recommendations

9.1 PoC topic selection

Choose a legal, well-defined scenario whose value can be measured:

-Fetch the title, price and stock of 100 public product pages.

-Monitor changes on 20 competitor pricing pages.

-Fetch a batch of public announcements and convert them to Markdown.

-Provide MCP web page extraction to an enterprise Agent and compare token consumption.

-Compare plain HTTP, DynamicFetcher and StealthyFetcher on the same dynamic page.

9.2 PoC acceptance metrics

Metric	Description
Collection success rate	Whether the fields are obtained reliably across multiple runs
Field accuracy	Whether the title, price, date, etc. are correct
Recovery rate after structural changes	Whether extraction still works after slight changes to page classes/hierarchy
Speed	Single-page time, batch time, concurrent throughput
Resource usage	CPU/memory of plain requests vs. browser requests
Block rate	Share of requests that the target site blocks, challenges with a CAPTCHA or answers with 403
Compliance and controllability	Whether robots.txt handling, request frequency, data scope and logs are clear
Agent token cost	Difference in token consumption before and after MCP selector extraction
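
Several of these metrics can be computed from a simple run log. The sketch below assumes a hypothetical 'RunRecord' structure and treats HTTP 403 and 429 as "blocked"; both are illustrative choices, and CAPTCHA detection would need a site-specific check.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    url: str
    status: int        # HTTP status code of the final response
    fields_ok: bool    # all expected fields were extracted and validated
    seconds: float     # wall-clock time for this page

def summarize(records: list) -> dict:
    """Compute success rate, block rate and average time for one PoC run."""
    total = len(records)
    if total == 0:
        return {"success_rate": 0.0, "block_rate": 0.0, "avg_seconds": 0.0}
    blocked = sum(r.status in (403, 429) for r in records)
    succeeded = sum(r.status == 200 and r.fields_ok for r in records)
    return {
        "success_rate": succeeded / total,
        "block_rate": blocked / total,
        "avg_seconds": sum(r.seconds for r in records) / total,
    }

run = [
    RunRecord("https://example.com/p/1", 200, True, 1.0),
    RunRecord("https://example.com/p/2", 200, True, 1.0),
    RunRecord("https://example.com/p/3", 403, False, 2.0),
    RunRecord("https://example.com/p/4", 200, False, 4.0),  # page loaded, fields missing
]
report = summarize(run)
```

Keeping "page loaded" and "fields correct" separate matters: the fourth record returns 200 but still counts as a failure, which is the case adaptive selectors are meant to reduce.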

9.3 Demo script

Use 'Fetcher' to fetch an ordinary page.
Extract fields with CSS selectors.
Use 'DynamicFetcher' to fetch a JS page.
Use 'Spider' for multi-page crawling and export JSON.
Use the CLI to convert a page to Markdown.
Use MCP to let the Agent extract specified fields by selector.
Show the compliance settings: frequency limit, robots.txt, request log and data field whitelist.

10. Risks and considerations

10.1 Compliance risk comes first

The disclaimer in the Scrapling README states clearly that:

-The project is for education and research.

-Users should comply with local and international data scraping and privacy laws.

-Users should respect website terms of service and robots.txt.

-The author is not responsible for misuse.

This must be stated up front in pre-sales, and the project must not be packaged as a tool for "unlimited crawling of any website".
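
Respecting robots.txt can be demonstrated with Python's standard 'urllib.robotparser', independent of any crawler framework. The robots.txt content, the 'poc-bot' user agent and the URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical site policy).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

policy = build_policy(ROBOTS_TXT)
allowed = policy.can_fetch("poc-bot", "https://example.com/products/1")
blocked = policy.can_fetch("poc-bot", "https://example.com/private/report")
delay = policy.crawl_delay("poc-bot")  # seconds requested by the site, or None
```

Logging the 'can_fetch' decision and the crawl delay for every target gives the customer an auditable record that the PoC stayed inside the site's published rules.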

10.2 Anti-crawling capability is not a license for commercial use

Being technically accessible does not mean the data may be collected commercially. The customer needs to confirm:

-Whether the data is public.

-Whether there is authorization.

-Whether commercial use is allowed.

-Whether personal information is involved.

-Whether there is a frequency limit.

-Whether a stop mechanism is required.

10.3 Stealth and browser modes are more expensive

Browser mode is slower and more memory-intensive than plain HTTP, and stealth mode is heavier still. Large-scale tasks should be layered:

Try a plain request first.
If that fails, use dynamic mode.
If that still fails, use stealth mode.
Introduce proxy pools and scheduling when multiple sites and regions are required.

10.4 Release MCP sessions after use

The official documentation warns: close a persistent session when you are done with it, otherwise the browser stays open. Agent workflows need to include 'close_session' and an exception fallback.
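
The "always close, even on failure" rule maps naturally onto a context manager. In this sketch the 'client' object and its method signatures are hypothetical: only the names 'open_session' and 'close_session' come from the MCP tool list, and 'FakeClient' exists solely to show the cleanup behavior.

```python
from contextlib import contextmanager

@contextmanager
def managed_session(client, **options):
    """Open a session and always close it, even if the body raises."""
    session_id = client.open_session(**options)
    try:
        yield session_id
    finally:
        client.close_session(session_id)

class FakeClient:
    """Stand-in for an MCP client; real call signatures may differ."""

    def __init__(self):
        self.active = set()

    def open_session(self, **options):
        session_id = f"s{len(self.active) + 1}"
        self.active.add(session_id)
        return session_id

    def close_session(self, session_id):
        self.active.discard(session_id)

client = FakeClient()
try:
    with managed_session(client) as sid:
        raise RuntimeError("page timed out")  # simulated failure mid-workflow
except RuntimeError:
    pass
leaked = set(client.active)  # sessions still open after the failure
```

Because the close call sits in a 'finally' block, the simulated timeout leaves no session behind. An Agent workflow can apply the same pattern around any sequence of tool calls.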

11. My pre-sales judgment

Scrapling is a very good fit for projects at the intersection of data collection and AI Agents.

Its advantage is not any single feature but the completeness of the combination:

-Lightweight HTTP.

-Dynamic browser.

-Stealth mode.

-Adaptive selector.

-Spider concurrency framework.

-CLI.

-MCP.

-Docker.

From a pre-sales perspective, the recommended positioning is:

An adaptive data collection framework for modern web pages, suited to lawful public data collection, competitor monitoring, AI Agent web page extraction and building an engineering platform for collection scripts.

The positioning that is not recommended is:

A promise of a "universal anti-crawling bypass tool" or "any website can be scraped".

Customers best suited for follow-up:

They need public web page data collection, but their current scripts are costly to maintain.
They are building an enterprise knowledge base/data lake and need web data ingestion.
They are building an AI Agent and need a controllable web page extraction tool.
They do price/competitor/SEO/content monitoring.
They have Python data engineering capability and want to build in-house rather than buy SaaS outright.

12. References

-GitHub repository: D4Vinci/Scrapling

-Official README: README.md

-Chinese README: README_CN.md

-Official documentation: Scrapling Docs

-Fetcher selection: Fetchers basics

-Spider architecture: Spiders architecture

-MCP Server: Scrapling MCP Server Guide

-Spider architecture diagram: spider_architecture.png

-Cover diagram: cover_light.svg

Information verified on 2026-06-30. Real-time star/fork counts are not included in this note because anonymous access to the GitHub API was rate-limited. Project capabilities, installation methods, benchmarks, MCP tools and compliance reminders are based mainly on the official README, the Chinese README and the ReadTheDocs documentation.