1. Project Overview
| Dimension | Information |
|---|---|
| Project Name | Hyper-Extract |
| GitHub | yifanfeng97/Hyper-Extract |
| Official Documentation | Hyper-Extract Docs |
| Chinese Documentation | Hyper-Extract Chinese Documentation |
| Project Positioning | LLM-driven intelligent knowledge extraction and knowledge evolution framework |
| Main Entry Points | CLI: 'he'; MCP Server: 'he-mcp'; Python API |
| Latest Version | v0.3.0, Published: 2026-06-19 |
| PyPI package | 'hyperextract', requires Python '>= 3.11' |
| License | Apache License 2.0 |
| Project Maturity | 'pyproject.toml' declares Development Status :: 3 - Alpha |
| GitHub Popularity | About 2.8k stars, 324 forks, 3 issues, 0 PRs (as of 2026-06-30) |
The project's official tagline is "Smart Knowledge Extraction CLI". The core idea is to go beyond having a large model "summarize the document" and to reliably extract the document into a knowledge object that is persistent, retrievable, evolvable, and exportable.
The difference from ordinary RAG tools is that ordinary RAG leans toward "chunk, vectorize, recall, and answer", whereas Hyper-Extract puts more emphasis on turning text into structured knowledge, such as person relationship graphs, event timelines, spatial graphs, financial entity relationships, and contractual obligation relationships, and then searching, displaying, answering, and exporting around these structures.
2. Key Official Diagrams
The following images come from the project repository itself and are suitable for pre-sales materials that explain product capabilities and workflow.
2.1 Project Logo
2.2 Overall Workflow Schematic
This diagram is suitable for explaining the core value of "from document to knowledge structure": unstructured text goes in, passes through LLM extraction and structured processing, and ends up as knowledge assets that can be queried, displayed, and exported.
2.3 Auto-Types Knowledge Structure Matrix
This diagram illustrates one of the project's most important capabilities: Hyper-Extract supports not only traditional knowledge graphs but also a variety of knowledge structures such as List, Set, Model, Hypergraph, Temporal Graph, Spatial Graph, and Spatio-Temporal Graph.
2.4 Chinese Knowledge Graph Visualization
[Figure: Hyper-Extract Chinese Graph Show]
This image is suitable for showing to Chinese-speaking customers: the extracted knowledge structure can be visualized directly, not just output as JSON.
2.5 Architecture Diagram
The architecture diagram can be used in technical discussions: after text processing, LLM extraction, and structure merging, an Auto-Type instance is generated, which supports operations such as search, chat, visualization, and save/load.
2.6 CLI Usage Interface
3. What It Can Do
3.1 Extracts Unstructured Documents into Strongly Typed Knowledge Objects
The basic Hyper-Extract operation is:
he parse input.md -t general/biography_graph -o ./output/ -l zh
Based on the template and the target knowledge structure, it extracts the input document into a Knowledge Abstract, that is, a knowledge object that can be saved, queried, and evolved.
Structures that can be extracted include:
| Type | Description | Typical Use |
|---|---|---|
| AutoModel | Single Structured Object | Person Profile, Product Specification, Financial Report Summary |
| AutoList | Ordered List | Step, Process, Timeline, Event Sequence |
| AutoSet | Deduplicated Set | Labels, Keywords, Capability Items, Risk Items |
| AutoGraph | Binary-Relation Knowledge Graph | Person Relationships, Organization Relationships, Concept Relationships |
| AutoHypergraph | Multi-Entity Relation Hypergraph | Multi-Party Collaboration, Complex Business Events, Joint Risks |
| AutoTemporalGraph | Temporal Graph | Historical Events, Project Milestones, Financial Changes |
| AutoSpatialGraph | Spatial Graph | Geographical Locations, Facility Networks, Supply Chain Spatial Relationships |
| AutoSpatioTemporalGraph | Spatio-Temporal Graph | Who was where, when, and what happened |
For pre-sales purposes this can be simplified to: it does not just "read" the document, it "extracts, connects, and saves" the entities, relationships, events, times, places, and business structures in the document.
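To make the difference between a binary graph and a hypergraph concrete, the following is a minimal sketch in plain Python. The class names, field names, and sample companies are hypothetical illustrations and are not Hyper-Extract's actual Auto-Type classes.

```python
from dataclasses import dataclass


# Hypothetical illustration types; not the Hyper-Extract API.
@dataclass(frozen=True)
class Edge:
    """Binary relation: exactly two endpoints."""
    source: str
    relation: str
    target: str


@dataclass(frozen=True)
class HyperEdge:
    """Multi-entity relation: one event linking any number of participants."""
    relation: str
    participants: frozenset[str]


# A three-party deal expressed as binary edges needs three separate edges,
# and nothing records that they belong to the same event.
binary_edges = [
    Edge("Company A", "signs_with", "Company B"),
    Edge("Company B", "signs_with", "Bank C"),
    Edge("Company A", "signs_with", "Bank C"),
]

# A single hyperedge keeps the whole event together.
joint_contract = HyperEdge(
    "joint_financing",
    frozenset({"Company A", "Company B", "Bank C"}),
)

print(len(binary_edges), "binary edges vs 1 hyperedge with",
      len(joint_contract.participants), "participants")
```

This is why multi-party business events are usually a better fit for a hypergraph-style structure than for a plain entity-relationship graph.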
3.2 Provides 80 Domain Templates
The project has 80 built-in YAML templates that cover several domains:
| Domain | Customer scenarios to mention |
|---|---|
| Finance | Financial report analysis, announcement analysis, risk factor extraction, indicator relationship combing |
| Legal | Contract Terms, Obligations, Legal Facts, Compliance Review |
| Medical | Medical history, guidelines, medical knowledge, disease/symptom/treatment relationship |
| TCM | Relationship between traditional Chinese medicine, prescriptions, symptoms and medicinal materials |
| Industry | Industrial Documentation, Equipment Descriptions, Fault Reports, Process Procedures |
| General | Papers, People, Organizations, Encyclopedia Documents, General Knowledge Graph |
This is very valuable for pre-sales, because there is no need to design a schema from scratch for a PoC. You can first run a sample end to end with a built-in template, and then customize the template for the customer's business.
3.3 Supports multiple extraction engines and RAG/GraphRAG methods
The project claims to support 10 Extraction Engines, including GraphRAG, LightRAG, Hyper-RAG, KG-Gen, Cog-RAG, etc.
In pre-sales conversations you can say:
- For simple field extraction, use Model/List/Set.
- For entity-relationship-intensive material, use AutoGraph.
- For business with complex multi-party relationships, such as multiple companies, contracts, and events, try Hypergraph.
- For business involving time and place, such as public opinion monitoring, legal cases, accidents, and supply chains, try Temporal/Spatial/Spatio-Temporal Graph.
3.4 Supports Incremental Knowledge Evolution
In addition to the initial extraction with 'he parse', you can also use:
he feed ./output/ new_document.md
This means that the knowledge object is not a one-time artifact but can be updated as new documents arrive. For corporate knowledge bases, industry intelligence, and research repositories, this fits well with the message of "continuously evolving knowledge assets".
3.5 Supports Search, Q&A, and Visualization
Common commands include:
he search ./output/ "苏轼有哪些重要作品?"
he talk ./output/ -i
he show ./output/
Among them:
- 'he search' performs semantic search.
- 'he talk' provides chat-style Q&A over the knowledge object.
- 'he show' visualizes the knowledge graph.
3.6 Supports Obsidian Export
Hyper-Extract has a capability that is well suited to knowledge management customers:
he export obsidian ./output/ -o ./vault/
It exports graph-type knowledge structures as Obsidian Markdown:
- Each node becomes a Markdown note.
- Node fields are written to the YAML frontmatter.
- Relationships are converted into '[[wikilink]]' links.
- Index pages are generated automatically.
This is an easy value proposition for internal knowledge management, research teams, consulting teams, pre-sales document libraries, and personal knowledge bases.
Note: The official documentation indicates that Obsidian export mainly supports graph-type Auto-Types, such as AutoGraph, AutoHypergraph, AutoTemporalGraph, AutoSpatialGraph, and AutoSpatioTemporalGraph. The non-graph types AutoList, AutoSet, and AutoModel are not suitable for direct export as an Obsidian graph.
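The export behavior described above (one note per node, fields in frontmatter, relationships as wikilinks) can be pictured with a small sketch. The function name, note layout, and sample node are assumptions for illustration only; the files that 'he export obsidian' actually writes may be laid out differently.

```python
def render_note(name: str, fields: dict[str, str], links: list[tuple[str, str]]) -> str:
    """Render one node as an Obsidian-style Markdown note (hypothetical layout).

    Values are written as plain YAML scalars; a real exporter would also
    need to quote or escape special characters.
    """
    lines = ["---"]
    lines += [f"{key}: {value}" for key, value in fields.items()]
    lines.append("---")
    lines.append(f"# {name}")
    lines.append("")
    # Each relationship becomes a wikilink to the target node's note.
    lines += [f"- {relation}: [[{target}]]" for relation, target in links]
    return "\n".join(lines) + "\n"


note = render_note(
    "Su Shi",
    {"type": "person", "era": "Song dynasty"},
    [("wrote", "Red Cliff Ode")],
)
print(note)
```

Because the result is ordinary Markdown, the customer can open, edit, and move the notes without any dependency on the tool that produced them.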
3.7 Supports MCP Server
The MCP Server was introduced in v0.3.0 and is available through:
pip install 'hyperextract[mcp]'
he-mcp
Tools provided by the MCP Server include:
- 'list_templates'
- 'info'
- 'search'
- 'ask'
- 'export_obsidian'
The official positioning is read and export only: the MCP Server does not create, modify, or delete a Knowledge Abstract. In pre-sales conversations you can put it this way: Claude Desktop, Cursor, Codex, or other MCP clients can access existing knowledge objects, providing the link "from knowledge base to Agent tool".
4. Applicable Scenarios
4.1 Understanding Research Papers and Technical Documents
Suitable for:
- University research teams
- Enterprise research institutes
- Technical consulting teams
- Investment research and industry research teams
Problems solved:
- There are too many papers to read.
- Teams want to quickly extract the relationships among authors, methods, tasks, datasets, metrics, and conclusions.
- Teams want to consolidate a paper library into a knowledge graph that supports search and Q&A.
Example pitch:
We can batch-convert papers, white papers, and technical reports into structured knowledge graphs, and then use semantic search and Q&A to quickly find "what problems a method solves, which datasets it uses, and which methods are related".
4.2 Financial Results and Announcement Analysis
Suitable for:
- Securities research
- Investment analysis
- Risk management teams
- Corporate financial analysis teams
Problems solved:
- Financial reports, announcements, and conference call minutes are information-dense.
- Risk factors, business segments, core metrics, and management statements are scattered across long documents.
- Teams want to move from "reading documents" to "an entity relationship and risk knowledge base".
Example commands:
he parse earnings.md -t finance/earnings_graph -o ./finance_kb/
he search ./finance_kb/ "What are the key risk factors?"
Pre-sales value:
- Build a knowledge graph of financial events faster.
- Support cross-report search.
- Usable for investment research assistants, risk review, announcement summaries, and industry knowledge accumulation.
4.3 Legal Contracts and Compliance Review
Suitable for:
- Legal teams
- Compliance departments
- Law firms
- Contract management system vendors
Problems solved:
- Contracts contain many clauses, and the relationships among obligations, parties, conditions, and liability for breach are complex.
- Traditional keyword search struggles to identify "who owes what obligation to whom, and under what conditions".
- Contract text needs to be converted into a structured review object.
Pre-sales value:
- Extract contract parties, clauses, obligations, terms of validity, and risk points.
- Map the relationships between compliance documents and regulatory clauses.
- Provide a knowledge structure as the basis for contract Q&A, risk warnings, and clause comparison.
4.4 Medical, TCM, and Professional Knowledge Bases
Suitable for:
- Medical IT vendors
- Medical knowledge base teams
- Traditional Chinese medicine research institutes
- Medical content operations teams
Problems solved:
- Medical documents contain complex entity relationships, including diseases, symptoms, drugs, treatments, contraindications, and guideline evidence.
- TCM documents contain special relationships among medicinal materials, prescriptions, symptoms, and treatment methods.
Pre-sales reminder:
Medical scenarios must emphasize "assisted analysis" and "manual review"; the output cannot be used as the basis for automatic diagnosis or automatic prescription. Hyper-Extract is suitable for knowledge organization and retrieval enhancement, not for direct high-risk medical decision-making.
4.5 Industrial Documentation and Equipment Knowledge Bases
Suitable for:
- Manufacturing enterprises
- Industrial software vendors
- Equipment operation and maintenance teams
- Quality and safety management teams
Problems solved:
- There are large volumes of equipment manuals, fault reports, maintenance records, and SOPs.
- Fault symptoms, causes, handling steps, components, and operating conditions are related in complex ways.
- Front-line personnel need fast lookup and Q&A.
Pre-sales value:
- Extract operation and maintenance documents into a fault knowledge graph.
- Support relational queries of the form "symptom -> cause -> remedy".
- Combine with an Agent or customer service system to act as a maintenance assistant.
4.6 Enterprise Knowledge Management and Obsidian/Markdown Knowledge Base
Suitable for:
- Consulting firms
- Pre-sales teams
- R&D knowledge management teams
- Heavy users of personal knowledge bases
Problems solved:
- There is a lot of documentation, but the knowledge in it is not connected.
- Manually creating bidirectional links in Markdown/Obsidian is expensive.
- Teams want to break documents down automatically into nodes, relationships, and indexes.
Pre-sales value:
- Outputs readable, editable, and portable Markdown via 'he export obsidian'.
- Customer information, industry reports, and product documents can be converted into an Obsidian knowledge base with bidirectional links.
- Not tied to a specific commercial platform, which makes it easier for customers to accept.
4.7 Agent Knowledge Tool Access
Suitable for:
- Teams building an enterprise Agent platform
- AI assistant and knowledge assistant development teams
- Technical teams using Claude Desktop, Cursor, Codex, or other MCP clients
Problems solved:
- Agents require stable, structured, and queryable external knowledge.
- Having the Agent read the original documents directly is costly and unreliable.
- Existing knowledge objects need to be exposed to the Agent as tools.
Pre-sales value:
- Expose knowledge objects through MCP as tools such as 'search', 'ask', and 'export_obsidian'.
- More stable than having the Agent read the documents from scratch every time.
- Suitable as a knowledge base PoC for enterprise Agents.
5. Unsuitable Scenarios
| Scenario | Reason |
|---|---|
| Only full-text search is needed | Elasticsearch, a vector database, or plain RAG may be simpler |
| Only fixed-field ETL is needed | Traditional rules, OCR table extraction, and data pipelines are more deterministic |
| High-risk automated decision making | LLM extraction produces errors and hallucinations and must be manually reviewed |
| Hyperscale production graph database | Hyper-Extract is more of an extraction and knowledge object framework than a Neo4j/JanusGraph-level graph database |
| High demand for determinism | LLM output is affected by the model, prompts, and context |
| The model does not support structured output | The project depends strongly on structured JSON and schema output capability |
| Customer does not accept LLM calls to external APIs at all | Local vLLM is available, but requires GPUs and deployment capabilities |
6. Core Capability List
| Capability | Description | Pre-Sales Value |
|---|---|---|
| Strongly typed knowledge extraction | Extracts text into Pydantic/Auto-Type objects | Upgrade from summaries to knowledge assets |
| 8 types of knowledge structures | Model/List/Set/Graph/Hypergraph/Temporal/Spatial/SpatioTemporal | Can cover more complex business relationships |
| Template system | 80 YAML templates covering finance, legal, medical, traditional Chinese medicine, industry, and general use | Fast PoC start, customizable |
| Multiple extraction methods | Supports approaches such as GraphRAG, LightRAG, Hyper-RAG, and KG-Gen | Different extraction strategies can be selected per scenario |
| Incremental evolution | 'he feed' continuously feeds in new documents | Knowledge base can be continuously updated |
| Semantic search | 'he search' | Quickly find answers from structured knowledge |
| Knowledge Q&A | 'he talk' or MCP 'ask' | Suitable for enterprise knowledge assistants |
| Visualization | 'he show' | Intuitive demo effect, suitable for PoC reporting |
| Obsidian export | 'he export obsidian' | Outputs Markdown with bidirectional links, reducing customer migration concerns |
| MCP Server | 'he-mcp' | Easy access to the Agent ecosystem |
| Cloud and local models | OpenAI, Anthropic, Alibaba Cloud, and local vLLM | Can cover both public cloud and private deployment requirements |
7. Architecture and How It Works
The official architecture can be summarized as follows:
7.1 Text Processing
The document is split into chunks. The official schema documentation mentions that the default chunk size is about 2048 characters with an overlap of about 256. Large documents are split into multiple chunks for parallel processing.
Pre-sales reminder:
- Long documents can be processed, but cost and time will rise.
- The official documentation suggests that documents over 50KB may produce 25 chunks and slow down significantly.
- For large-scale production use, you need to plan for concurrency, caching, model costs, and failure retries.
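The chunking step can be sketched as a simple character-based sliding window. This is an illustration under stated assumptions, not Hyper-Extract's implementation: only the approximate size (2048) and overlap (256) come from the text above, while the function name and the purely character-based splitting are hypothetical.

```python
def split_into_chunks(text: str, size: int = 2048, overlap: int = 256) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Assumption: plain character windows; a real splitter may respect
    sentence or paragraph boundaries instead.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(5000))
chunks = split_into_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, which helps when an entity or sentence straddles a boundary but also increases the total number of characters sent to the model.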
7.2 LLM Extraction
The system generates prompts from the template, calls the LLM's structured output capability, and then parses the result into the Pydantic/Auto-Type structure.
This means that model capabilities are critical:
- Models that support 'json_schema' are a better fit.
- Models with unstable output can degrade extraction quality.
- For local models, the model version and serving method must be chosen to support structured output.
7.3 Merging and Deduplication
After extracting multiple chunks, the system needs to merge entities, relationships, and fields:
- Entity deduplication
- Relationship merging
- Conflict handling
- Index construction
This is one of the key values of the project, because in real documents the same entity often appears repeatedly in different paragraphs.
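The merge step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the dictionary layout, the name-normalization rule, and the "first value wins" conflict policy are hypothetical, and the source does not describe how Hyper-Extract actually resolves conflicts.

```python
def normalize(name: str) -> str:
    """Collapse case and whitespace so that 'Su Shi' and 'su  shi' match."""
    return " ".join(name.casefold().split())


def merge_chunks(chunk_results: list[dict]) -> tuple[dict, set]:
    """Merge per-chunk extraction results into one entity table and edge set."""
    entities: dict[str, dict] = {}
    relations: set[tuple[str, str, str]] = set()
    for result in chunk_results:
        for entity in result["entities"]:
            key = normalize(entity["name"])
            merged = entities.setdefault(key, {"name": entity["name"]})
            # Hypothetical conflict policy: keep the first value seen,
            # later chunks only fill in fields that are still missing.
            for field, value in entity.items():
                merged.setdefault(field, value)
        for source, relation, target in result["relations"]:
            # Storing edges in a set removes exact duplicates across chunks.
            relations.add((normalize(source), relation, normalize(target)))
    return entities, relations


chunk_results = [
    {"entities": [{"name": "Su Shi", "born": "1037"}],
     "relations": [("Su Shi", "wrote", "Red Cliff Ode")]},
    {"entities": [{"name": "su  shi", "style_name": "Dongpo"}],
     "relations": [("su shi", "wrote", "Red Cliff Ode")]},
]
entities, relations = merge_chunks(chunk_results)
print(entities)
print(relations)
```

Exact-match normalization like this misses aliases and abbreviations, which is one reason merge and deduplication errors appear among the risks to sample and review.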
7.4 Operation Layer
After the extraction is complete, the following operations are available:
- 'search': semantic search
- 'chat/talk': Q&A over knowledge objects
- 'show': graph visualization
- 'dump/load': save and load
- 'export obsidian': export as a knowledge base
8. How to Use
8.1 CLI Installation
The official recommendation is 'uv':
uv tool install hyperextract
he --version
You can also use 'pipx':
pipx install hyperextract
he --version
8.2 Initial Configuration
he config init -k YOUR_OPENAI_API_KEY
By default, the configuration is saved in:
~/.he/config.toml
8.3 Chinese Document Quick Example
he parse examples/zh/sushi.md -t general/biography_graph -o ./output/ -l zh
he search ./output/ "苏轼有哪些重要的作品?"
he show ./output/
he export obsidian ./output/ -o ./vault/
8.4 English Paper/Technical Document Example
he parse paper.pdf -t general/academic_graph -o ./paper_kb/ -l en
he search ./paper_kb/ "What are the key methods and datasets?"
he show ./paper_kb/
8.5 Financial Report Analysis Example
he parse earnings.md -t finance/earnings_graph -o ./finance_kb/ -l en
he search ./finance_kb/ "What are the key risk factors?"
he talk ./finance_kb/ -i
8.6 Incremental Update
he feed ./finance_kb/ new_earnings_report.md
he build-index ./finance_kb/ -f
This is suitable for continuously adding new announcements, reports, contracts, and cases.
8.7 Python API
uv pip install hyperextract
from hyperextract import Template
ka = Template.create("general/biography_graph")
with open("examples/en/tesla.md", encoding="utf-8") as f:
    result = ka.parse(f.read())
result.show()
The Python API is suitable for embedding into existing customer systems, such as documentation platforms, knowledge base backends, data processing pipelines, or internal Agent platforms.
8.8 MCP Server
Installation:
pip install 'hyperextract[mcp]'
Start:
he-mcp
Or:
python -m hyperextract.mcp_server
Any client that supports MCP can connect to it and use the knowledge objects generated by Hyper-Extract as Agent tools. The official documentation emphasizes that the MCP Server is limited to reading and exporting and will not create, modify, or delete knowledge objects.
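As a rough sketch of what connecting a client can look like, the snippet below generates a client configuration entry. It assumes the client reads a JSON file with a top-level 'mcpServers' mapping, which is the layout Claude Desktop uses; the entry name "hyper-extract" is an arbitrary label chosen here, and whether 'he-mcp' needs extra arguments or environment variables is not covered by this briefing and should be checked in the official MCP documentation.

```python
import json

# Assumption: the MCP client reads a JSON config with a top-level
# "mcpServers" mapping (the layout Claude Desktop uses).
# "hyper-extract" is a hypothetical label; "he-mcp" is the launch
# command named in this section, started here without arguments.
config = {
    "mcpServers": {
        "hyper-extract": {
            "command": "he-mcp",
            "args": [],
        }
    }
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON would then be merged into the client's own configuration file, whose location depends on the client and operating system.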
8.9 Local Model Deployment
The official documentation mentions that local vLLM deployments are possible, for example:
- LLM: Qwen3.5-9B GPTQ-Marlin
- Embedding: BAAI/bge-m3
In pre-sales this can be presented as a "feasible path for private deployment", but note that:
- GPU resources are required.
- The customer needs the capability to operate and maintain a local model service.
- The official recommendation is to use a non-thinking model or turn off thinking with local vLLM, because thinking tags may interfere with constrained JSON output.
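The following sketch shows why thinking tags are a problem for JSON parsing. It assumes the model wraps its reasoning in '<think>...</think>' tags, which varies by model; the helper name and sample reply are hypothetical. It is a client-side illustration only, not how Hyper-Extract handles the issue, and the official recommendation remains to disable thinking.

```python
import json
import re

# Assumption: reasoning is wrapped in <think>...</think>; other models
# may use different markers or none at all.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_json_reply(raw: str) -> dict:
    """Strip thinking blocks, then parse the remainder as JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)


raw_reply = '<think>The text mentions one person.</think>\n{"entities": ["Su Shi"]}'

try:
    json.loads(raw_reply)  # fails: the reply does not start with JSON
except json.JSONDecodeError as exc:
    print("direct parse failed:", exc.msg)

print(parse_json_reply(raw_reply))
```

Post-hoc stripping like this only works when the JSON itself is intact, so it does not replace a model and serving setup that produce schema-constrained output directly.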
9. Supported Models and Providers
Typical supported setups listed in the official documentation:
| Provider | Model | Description |
|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-5 | Native json_schema support for structured output |
| Anthropic | Claude Opus/Sonnet/Haiku Series | Can serve as the LLM, but has no embedding API and must be paired with another embedding provider |
| Alibaba Cloud Bailian | 'qwen-plus', 'qwen-turbo', 'deepseek-r1', etc. | Friendly to customers in China |
| Local vLLM | Qwen3.5-9B GPTQ-Marlin | Suitable for private deployment or data that must not leave the customer's environment |
| Embedding | OpenAI 'text-embedding-3-small', Bailian text-embedding-v4, local bge-m3 | Used for semantic search and indexing |
Pre-sales advice:
- If the customer allows cloud APIs, prefer a cloud model with stable structured output support for the PoC.
- If the customer is concerned about data security, prepare a local vLLM option, but evaluate GPU, throughput, latency, and extraction quality separately.
- If the customer is in China, Alibaba Cloud Bailian is one of the easier options to discuss.
10. What to Say in Pre-Sales
10.1 Business-Oriented Pitch
You can say:
Hyper-Extract can turn a large number of unstructured documents into knowledge assets that can be queried, linked, and continuously updated. It does not just summarize the document; it extracts key objects such as people, organizations, events, metrics, clauses, times, and places, and establishes the relationships among them to support subsequent search, Q&A, visualization, and knowledge base accumulation.
Business value:
- Reduce the cost of manually reading long documents.
- Improve the efficiency of cross-document retrieval and knowledge discovery.
- Consolidate expert experience and historical data into structured knowledge.
- Provide foundational data for enterprise knowledge assistants, investment research assistants, contract assistants, and operation and maintenance assistants.
10.2 Technology-Oriented Pitch
You can say:
Hyper-Extract is an LLM knowledge extraction framework built around Pydantic/Auto-Type. It defines the target schema through templates, calls an LLM that supports structured output for extraction, then performs chunk-level merging, deduplication, and index construction, and provides a CLI, a Python API, Obsidian export, and an MCP Server.
Technical value:
- Typed output that downstream systems can consume easily.
- Supports a variety of complex graph structures, not only ordinary knowledge graphs.
- Templates are extensible and suitable for domain customization.
- Supports both cloud model and local model routes.
- Can be integrated with the Agent/MCP ecosystem.
10.3 Management-Oriented Pitch
You can say:
It can be used as a rapid trial tool for turning enterprise knowledge into assets. In the early stage, use a small number of documents and templates to quickly verify whether documents can become structured knowledge. In the middle stage, add knowledge Q&A and visualization. In the later stage, decide whether to integrate with the enterprise knowledge base, graph database, search platform, and Agent platform.
Management value:
- Low PoC cost.
- Intuitive demo effect.
- Can be scaled gradually, without building a large knowledge graph platform up front.
- Open output formats reduce the risk of technology lock-in.
11. Recommended PoC Plans
PoC 1: Financial Report/Announcement Knowledge Extraction
Target customers:
- Investment research teams
- Financial information service providers
- Corporate strategy and finance teams
Input material:
- 3-5 company annual reports
- 3-5 announcements
- 1-2 sets of conference call minutes
Verification points:
- Whether companies, business segments, financial metrics, risk factors, and management statements can be extracted.
- Whether cross-document search can answer "major risk changes at a company in recent periods".
- Whether a graph can show the relationships among companies, businesses, and metrics.
Demo commands:
he parse annual_report.md -t finance/earnings_graph -o ./finance_kb/ -l zh
he feed ./finance_kb/ announcement.md
he search ./finance_kb/ "这家公司当前最主要的经营风险是什么?"
he show ./finance_kb/
PoC 2: Extracting Contractual Obligations and Risk Relationships
Target customers:
- Legal departments
- Contract management system vendors
- Compliance teams
Input material:
- 5-10 sample contracts
- A customer-defined risk list
Verification points:
- Whether contract parties, obligations, terms of validity, conditions, and liability for breach can be identified.
- Whether dependencies between clauses can be extracted.
- Whether it can answer "What are Party A's payment obligations?" and "Which clauses are high-risk?".
Output format:
- Contract knowledge graph
- Risk point list
- Obsidian/Markdown knowledge base
- Can later be connected to a contract assistant
PoC 3: Knowledge Mapping of Scientific Papers
Target customers:
- Scientific research teams
- Research institutes
- Technical strategy teams
Input material:
- 10 papers on one topic
Verification points:
- Extract authors, institutions, tasks, methods, datasets, metrics, and conclusions.
- Search for "which methods use the same dataset".
- Visualize the relationships among technical approaches.
Demo commands:
he parse paper.pdf -t general/academic_graph -o ./paper_kb/ -l en
he feed ./paper_kb/ paper2.pdf
he search ./paper_kb/ "Which methods are related to retrieval augmented generation?"
he export obsidian ./paper_kb/ -o ./vault/
PoC 4: Internal Enterprise Knowledge Base with Obsidian/MCP
Target customers:
- Consulting firms
- Pre-sales teams
- Enterprise knowledge management teams
- Agent platform teams
Input material:
- Product documentation
- FAQ
- Solution materials
- Customer cases
Verification points:
- Whether the relationships among products, capabilities, industries, customer pain points, solutions, and cases can be extracted.
- Whether an Obsidian knowledge base with bidirectional links can be exported.
- Whether an Agent can query the knowledge through MCP.
Demo path:
he parse solution_docs.md -t general/knowledge_graph -o ./solution_kb/ -l zh
he export obsidian ./solution_kb/ -o ./vault/
he-mcp
12. Differences from Similar Tools
The project README compares GraphRAG, LightRAG, KG-Gen, ATOM, Hyper-Extract, and other tools. The official materials emphasize the following differences:
| Dimension | Characteristics of Hyper-Extract |
|---|---|
| Knowledge Structure | Not only ordinary knowledge graphs, but also Temporal, Spatial, Hypergraph, and Spatio-Temporal |
| Domain Templates | 80 built-in templates covering multiple industries |
| Entry Point | CLI-friendly, good for fast PoCs |
| Knowledge Evolution | Supports incremental updates with 'feed' |
| Knowledge Management | Supports Obsidian export |
| Agent Integration | Supports MCP Server |
| Local Deployment | Supports local vLLM and local embedding |
In pre-sales, do not pitch it as "replacing all RAG/knowledge graph platforms". A more appropriate positioning is:
An extraction layer and PoC tool that quickly turns documents into structured knowledge objects and can be combined with existing vector stores, graph databases, knowledge bases, and Agent platforms.
13. Risks and Considerations
13.1 Project Still in Alpha Phase
'pyproject.toml' marks the project as Alpha. Although it is popular on GitHub, from a production system perspective you need to check:
- Whether the API is stable
- Whether template quality is stable
- How mature large-scale document processing is
- Whether error handling and logging meet enterprise requirements
- Whether community maintenance is sustained
13.2 LLM extraction is not 100% accurate
Possible problems include:
- Missed entities
- Misjudged relationships
- Date or value errors
- Merge and deduplication errors
- Model hallucinations
It is therefore suitable for "assisted analysis, knowledge organization, and manual review", and not suitable for feeding high-risk business decisions without review.
13.3 Strong Dependence on Structured Output Capability
If the model does not support JSON schema, extraction stability will decrease. The official Provider documentation also distinguishes between json_schema and json_object capabilities.
Pre-sales advice:
- Prefer the officially recommended models for the PoC.
- Test the structured output capability of any local model.
- Set up manual sampling evaluation for key fields.
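Manual sampling evaluation can be as simple as drawing a reproducible random sample of extracted records, having a reviewer mark each key field as correct or not, and computing per-field accuracy. The sketch below assumes a hypothetical record and review format; it is not part of Hyper-Extract.

```python
import random


def sample_for_review(records: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of extracted records for human review."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def field_accuracy(reviews: list[dict[str, bool]]) -> dict[str, float]:
    """Compute per-field accuracy from reviewer verdicts.

    Each review maps a field name to True (correct) or False (wrong).
    """
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for review in reviews:
        for field, ok in review.items():
            totals[field] = totals.get(field, 0) + 1
            correct[field] = correct.get(field, 0) + int(ok)
    return {field: correct[field] / totals[field] for field in totals}


# Hypothetical reviewer verdicts for two sampled records.
reviews = [
    {"date": True, "amount": False},
    {"date": True, "amount": True},
]
accuracy = field_accuracy(reviews)
print(accuracy)
```

Tracking accuracy per field rather than per document makes it easier to see whether errors concentrate in dates, amounts, or relationships, and to decide which fields need mandatory review.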
13.4 Costs and Latency Need to Be Assessed
A long document is split into multiple chunks, each of which may trigger an LLM call. Items to assess:
- Token cost
- Concurrency capability
- Failure retries
- Vector index cost
- Document update frequency
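A back-of-the-envelope estimate helps frame the cost discussion. In the sketch below, only the chunk size and overlap come from this briefing; the characters-per-token ratio, prompt overhead, output size, and prices are placeholder assumptions that must be replaced with the real tokenizer behavior and the provider's current price list.

```python
import math


def estimate_extraction_cost(
    doc_chars: int,
    *,
    chunk_size: int = 2048,              # default mentioned in the docs
    overlap: int = 256,                  # default mentioned in the docs
    chars_per_token: float = 4.0,        # assumption; much lower for Chinese text
    prompt_overhead_tokens: int = 300,   # assumption: template prompt per call
    output_tokens_per_chunk: int = 500,  # assumption
    usd_per_1k_input: float = 0.001,     # placeholder price
    usd_per_1k_output: float = 0.002,    # placeholder price
) -> dict:
    """Rough estimate assuming one LLM call per chunk, each chunk full-sized."""
    step = chunk_size - overlap
    chunks = max(1, math.ceil((doc_chars - overlap) / step))
    input_tokens = round(chunks * (chunk_size / chars_per_token + prompt_overhead_tokens))
    output_tokens = chunks * output_tokens_per_chunk
    usd = (input_tokens / 1000 * usd_per_1k_input
           + output_tokens / 1000 * usd_per_1k_output)
    return {"chunks": chunks, "input_tokens": input_tokens,
            "output_tokens": output_tokens, "usd": usd}


estimate = estimate_extraction_cost(5000)
print(estimate)
```

The estimate ignores retries, embedding and index costs, and re-processing when documents are updated, so it should be treated as a lower bound when sizing a PoC budget.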
13.5 Obsidian export is not an enterprise permission system
Obsidian export is great for knowledge management and presentation, but it is not a platform for permission management, auditing, version approval, or data governance. Enterprise production deployment still needs to be combined with a document management system, knowledge base platform, or permission system.
13.6 It Is Not a Full Graph Database
Hyper-Extract can generate and display graphs, but do not equate it with graph databases such as Neo4j, JanusGraph, and NebulaGraph. A more accurate positioning is a "graph extraction and knowledge object generation layer".
14. Frequently Asked Customer Questions
Q1: How is it different from normal RAG?
Ordinary RAG mainly consists of chunking, vector retrieval, and answering. Hyper-Extract puts more emphasis on abstracting documents into structured knowledge, such as entities, relationships, events, times, places, and hyperedges, and then searching, answering, and exporting on top of it. It is better suited to scenarios that require relational understanding and knowledge accumulation.
Q2: Can it be deployed privately?
Yes, by combining local vLLM with local embedding. The official documentation mentions the combination of Qwen3.5-9B GPTQ-Marlin and bge-m3. However, private deployment requires GPUs, model serving, performance tuning, and extraction quality assessment.
Q3: Can it handle Chinese?
Yes. Chinese documentation, Chinese examples, and Chinese graph visualizations are available, and the CLI supports '-l zh'.
Q4: Can it directly connect to the enterprise knowledge base?
It can be integrated through the Python API, CLI output, JSON/YAML, Obsidian Markdown, MCP, and similar routes. However, connecting to enterprise permissions, auditing, approval, and search platforms requires custom development.
Q5: Is it suitable for direct production?
Committing to direct production use is not recommended. The project is currently labeled Alpha. It is better to start with PoCs, internal tools, and knowledge extraction trials, and then decide whether to move to production based on stability, accuracy, throughput, and maintainability.
Q6: Does it support Claude?
Claude is supported as an LLM through Anthropic, but Anthropic does not provide an embedding API, so it must be paired with an OpenAI-compatible or other embedding provider.
Q7: Can the output be imported into Obsidian?
Yes. 'he export obsidian' exports graph-type knowledge objects as Markdown notes with bidirectional links, one note per node, and generates an index page.
15. Pre-sales scoring
| Dimension | Rating | Description |
|---|---|---|
| Demo Appeal | 4.5/5 | Graphs, Obsidian, and MCP are easy to demonstrate |
| PoC Startup Speed | 4/5 | CLI and templates reduce startup cost |
| Enterprise Deployment Maturity | 3/5 | Alpha phase, further verification required |
| Scenario Coverage | 4.5/5 | Finance, legal, medical, industrial, scientific research, knowledge management |
| Private Deployment Friendliness | 3.5/5 | Local vLLM is supported, but operation and maintenance costs exist |
| Differentiation | 4/5 | Multiple Auto-Types and Obsidian/MCP are highlights |
16. My judgment
Hyper-Extract is well suited to pre-sales "document knowledge extraction PoC" work. Its biggest highlight is not a single algorithm, but a combination of several capabilities that are easy to explain in pre-sales:
- Document extraction
- Multiple knowledge structures
- Template-based domain adaptation
- Visualization
- Semantic search and Q&A
- Obsidian export
- MCP access for Agents
- Both cloud model and local model routes
For customers who just want an ordinary document Q&A bot, Hyper-Extract may not be the simplest route. However, if customers care about "how the relationships, events, entities, metrics, and clauses in documents are consolidated into knowledge assets", it is well worth using as a PoC tool.
The most recommended pre-sales approach:
- First select a long-document scenario that the customer is familiar with, such as financial reports, contracts, papers, or equipment manuals.
- Use a built-in template to produce a knowledge graph.
- Use 'he show' to display the graph.
- Use 'he search' or 'he talk' for Q&A.
- Use 'he export obsidian' to export a knowledge base with bidirectional links.
- If the customer has an Agent platform, then demonstrate querying through MCP.
This allows a natural transition from "too many documents to read" to "structured knowledge assets" and then to "knowledge tools available to Agents", which keeps the sales narrative smooth.