1. Project/Product Overview
| Dimension | Information |
|---|---|
| Project name | DSPy (Declarative Self-improving Python) |
| Developer | Stanford NLP Group (Omar Khattab, Christopher Potts, Matei Zaharia, and others) |
| Open Source License | MIT |
| Main Language | Python |
| GitHub Stars | 35,718 (queried 2026-07-02) |
| Forks | 3,039 |
| Commits | 4,550 |
| Created | 2023-01-09 |
| Last Updated | 2026-07-01 (activity: high, 523+ PRs/year) |
| Latest Version | v3.3.0b1 (ReActV2 module, improved LM/BaseLM); 109 releases in total |
| Monthly Downloads | 6.4 million+ (PyPI) |
| Contributors | 433+ |
| Discord | 8,400+ members |
| Academic Papers | 9 (including the ICLR 2024 paper and the ICLR 2026 GEPA Oral), plus 60+ tutorials and cookbooks |
| Official Website | https://dspy.ai |
| Production Users | Shopify (75x cost reduction), Databricks (90x cost reduction, Genie platform), Dropbox (Dash relevance scoring, 45% NMSE reduction), AWS (Nova model migration), JetBlue (multiple chatbots on Databricks), Replit (code repair), Sephora, VMware, Moody's, Nous Research (Hermes Agent self-evolution) |
| Dependents | 1,800+ downstream projects |
| Observability | Native MLflow Tracing (OpenTelemetry-based) integration |
| Enterprise Tools | MLflow Model Serving deployment, PyPI publishing, MCP Server export |
| Related projects | dspy-go (compatible Go implementation), standalone GEPA library |
2. What does it mainly do?
DSPy's core innovation is to move LLM calls from handwritten prompt strings to declarative programs that a compiler optimizes. It is not another prompt template library; it is a complete compiler framework.
2.1 Three Core Elements
DSPy builds and optimizes LLM systems through three layers of abstraction:
| Element | Description | Technical Features |
|---|---|---|
| Signature | Declares the inputs and outputs of a task with typed Python fields instead of a handwritten prompt string | Supports type constraints, description fields, default values, Literal types, and Image multimodal fields |
| Module | Composable components that control the LLM's execution strategy. The same Signature can be paired with different Modules | Predict (direct), ChainOfThought (step-by-step reasoning), ReAct (tool-using reasoning loop), ReActV2, MultiChainComparison, ProgramOfThought, RLM (recursive language model) |
| Optimizer | Automatic compilation: searches for the best prompt and examples based on labeled data and an evaluation metric | GEPA (ICLR 2026 Oral, reflective prompt evolution, up to 35x fewer rollouts than GRPO), MIPROv2 (joint optimization of instructions and examples), BootstrapFewShot, etc. |
2.2 Compiler Workflow
Traditional prompt engineering:
Write a prompt string → test → poor results → edit the string → test again → start over when the model changes
DSPy workflow:
Define a Signature (declare the task) → choose a Module (execution strategy) → compile with an Optimizer (automatic optimization) → save as JSON
Key difference: the result of DSPy optimization is a JSON file that can be saved, reused, and versioned, rather than a piece of text that happened to work. After switching models you only need to recompile once; the code does not change.
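
To make the "compile, then save as JSON" idea concrete, here is a deliberately tiny, library-free sketch. It is not DSPy's optimizer: `fake_model`, the candidate instructions, the training pairs, and the file layout are all hypothetical stand-ins. It only shows the shape of the workflow: score each candidate against labeled data with a metric, keep the best one, and persist it as a JSON artifact.

```python
import json
import os
import tempfile

# Hypothetical labeled data: (input text, expected label).
TRAINSET = [
    ("I love this product!", "positive"),
    ("This is terrible.", "negative"),
    ("Great value, works well.", "positive"),
]

# Hypothetical candidate instructions an optimizer might propose.
CANDIDATES = [
    "Answer with any word.",
    "Classify the sentiment as positive or negative.",
]


def fake_model(instruction: str, text: str) -> str:
    """Stand-in for an LLM call; a real system would query a model here."""
    if "sentiment" not in instruction:
        return "unknown"
    negative_words = {"terrible", "awful", "bad"}
    words = {w.strip(".,!").lower() for w in text.split()}
    return "negative" if words & negative_words else "positive"


def accuracy(instruction: str) -> float:
    """Metric: fraction of training examples the candidate gets right."""
    hits = sum(fake_model(instruction, x) == y for x, y in TRAINSET)
    return hits / len(TRAINSET)


def compile_program(candidates: list[str]) -> dict:
    """Pick the highest-scoring candidate and return a serializable artifact."""
    best = max(candidates, key=accuracy)
    return {"instruction": best, "train_accuracy": accuracy(best)}


artifact = compile_program(CANDIDATES)

# Persist and reload the artifact; this is the step that makes results versionable.
path = os.path.join(tempfile.mkdtemp(), "compiled_v1.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(artifact, f, indent=2)
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

Because the artifact is plain JSON, it can be committed to Git and diffed between compilations, which is the property the paragraph above highlights.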
2.3 Core Concept: From Prompt to Program
DSPy moves LLM application development from "assembly language" (prompt strings) to a "high-level language" (declarative Python):

    import dspy

    # Assumes an LM has already been set with dspy.configure(lm=...)

    # Traditional approach: a brittle prompt string
    prompt = """You are a helpful assistant. Given the following email, extract
    the event name and date. Return in JSON format. Email: {email}"""

    # DSPy approach: a declarative, typed signature
    class ExtractEvent(dspy.Signature):
        """Extract event details from an email."""
        email: str = dspy.InputField()
        event_name: str = dspy.OutputField()
        date: str = dspy.OutputField()

    extract = dspy.Predict(ExtractEvent)
    result = extract(email="Team offsite this Thursday at 2pm")
    # Illustrative result: Prediction(event_name="Team Offsite", date="Thursday")

When you need step-by-step reasoning, swap the Module; the rest of the code stays the same:

    extract = dspy.ChainOfThought(ExtractEvent)  # adds a reasoning step automatically

When tool calls are required (search and calc are user-defined Python functions):

    agent = dspy.ReAct("question -> answer", tools=[search, calc])

2.4 Advanced Capabilities
- Multimodal: `dspy.Image` as a field type for processing pictures and charts directly
- Assertions: runtime constraint checks with automatic retry and error correction
- PythonInterpreter: sandboxed code execution, useful for mathematical reasoning
- RLM (Recursive Language Model): recursive reasoning module (paper published December 2025)
- Fine-tuning integration: the optimizer can generate fine-tuning data as well as tune prompts (BetterTogether paper)
- Asynchronous support: native `async` execution, thread-safe, suited to high-throughput scenarios
- Observability: MLflow Tracing (based on OpenTelemetry); every call can be traced
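
The "constraint check with automatic retry" idea in the list above can be sketched without any library. This is a generic illustration, not DSPy's assertion mechanism: `fake_generate`, `is_iso_date`, and the feedback wording are hypothetical. The loop validates each output and, on failure, feeds the violation back into the next attempt.

```python
import re
from typing import Callable


def call_with_retries(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    max_retries: int = 2,
) -> str:
    """Call generate, re-prompting with feedback until check passes."""
    feedback = ""
    for _ in range(max_retries + 1):
        output = generate(prompt + feedback)
        if check(output):
            return output
        # Tell the next attempt what went wrong with the previous one.
        feedback = f"\nPrevious output {output!r} violated the constraint; try again."
    raise ValueError("constraint still violated after retries")


def fake_generate(prompt: str) -> str:
    """Hypothetical model: fails first, succeeds once feedback is present."""
    return "2024-05-01" if "violated" in prompt else "next Thursday"


def is_iso_date(text: str) -> bool:
    """Constraint: output must look like YYYY-MM-DD."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None


result = call_with_retries(fake_generate, is_iso_date, "Extract the date.")
```

The first attempt returns free text and fails the check; the second attempt sees the feedback and returns a well-formed date.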
3. Applicable Scenarios
| Scenario | Description | Typical Customer/Case |
|---|---|---|
| Prompt optimization projects | With labeled data available, systematically search for the best combination of prompt and examples | Any team whose LLM application is live but falls short of its performance targets |
| Classification/extraction tasks | Text classification, entity extraction, intent recognition, structured information extraction | Shopify (platform-wide merchant metadata extraction, 75x cost reduction) |
| RAG pipeline optimization | Optimize the full retrieve → rank → generate pipeline with automatic parameter tuning | Dropbox Dash (relevance scoring, 45% NMSE reduction) |
| Model migration/cost reduction | Migrate from a large model to a small one, compiling so that the small model's output approaches the large model's | AWS Nova migration, Databricks (90x cost reduction) |
| Agent behavior optimization | Automatically optimize the agent's reasoning chain and tool-calling strategy | Databricks Genie (table search), Replit (code repair) |
| Multimodal tasks | Image understanding, chart analysis (Image field type) | Medical imaging (Stanford prompt triage, up to +3400% improvement) |
| Security detection | Jailbreak detection, prompt injection detection | DSPy security pipeline (including cryptographically tamper-proof state) |
| Quality assessment automation | Build automated scoring/assessment pipelines to reduce manual labeling | Enterprise LLM-as-Judge systems |
| Research and experimentation | Systematic experiments that quickly compare models, strategies, and optimization methods | Academic research, A/B testing |
4. Unsuitable Scenarios
| Scenario | Reason | Alternative Suggestions |
|---|---|---|
| Simple one-off prompts | DSPy's compilation requires labeled data and an evaluation metric, which is over-engineering for a single call | Write a few lines of prompt or call the API directly |
| No labeled data at all | The Optimizer relies on computable metrics (accuracy, F1, etc.) | Go live with manual evaluation first, accumulate labeled data, then introduce DSPy |
| Low-code/drag-and-drop development | DSPy is a pure Python framework with no GUI | Dify / Coze / MaxKB |
| High demands on metric design | Optimization quality depends heavily on the quality of the evaluation metric; a wrong metric leads to degradation | Someone needs to own the design of the metric function |
| Compilation-cost sensitive | GEPA/MIPROv2 optimization requires a large number of LLM calls (although GEPA is reported to be up to 35x more efficient than RL with GRPO) | For small tasks, use Signature and Module without compiling |
| Extremely low latency requirements | A compiled module still calls the LLM, so inference latency itself is not reduced | Model distillation, quantization |
| Pure non-LLM tasks | DSPy is an LLM programming framework and does not fit traditional ML or rule-based systems | Scikit-learn, XGBoost, etc. |
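
The "computable metrics" that the table refers to can be very small functions. As a sketch, assuming gold labels and predictions are plain strings, macro-averaged F1 can be computed with the standard library alone; the function names and the sample data are illustrative.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Average the per-label F1 scores, giving every label equal weight."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of exact matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
f1_score = macro_f1(gold, pred)
acc_score = accuracy(gold, pred)
```

On this sample, accuracy is 0.75 while macro-F1 is 0.6, because the missed "neutral" class counts as a full third of the macro average. Which of the two better reflects task quality is exactly the metric-design decision the table warns about.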
5. Core Capability List
5.1 Signature Capabilities
| Capability | Description |
|---|---|
| Inline shorthand signature | `"question -> answer"`, a one-line definition |
| Class-based signature | Subclass `dspy.Signature`; supports `InputField`/`OutputField` type declarations |
| Type constraints | `str`, `int`, `float`, `bool`, `Literal["a", "b"]`, `list[dict]`, etc. |
| Multimodal fields | `dspy.Image`, `dspy.Video` as field types |
| Field descriptions | `desc="often between 1 and 5 words"` constrains the output format |
| Default values | `dspy.InputField(default="N/A")` |
| Docstring | `"""Task description."""` serves as the semantic description of the task |
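
To show what the inline shorthand conveys, here is a toy parser that splits a string such as `"question -> answer"` into input and output fields. It is a hypothetical illustration, not DSPy's own parser, and it deliberately ignores types that contain commas (for example `dict[str, int]`). Untyped fields default to `str`, as the shorthand in this document implies.

```python
def parse_signature(spec: str) -> dict[str, dict[str, str]]:
    """Split 'a, b: int -> c' into input and output fields (type defaults to str)."""
    if spec.count("->") != 1:
        raise ValueError("expected exactly one '->'")
    left, right = spec.split("->")

    def fields(part: str) -> dict[str, str]:
        result = {}
        for item in part.split(","):
            name, _, type_name = item.partition(":")
            name = name.strip()
            if not name:
                raise ValueError(f"empty field name in {part!r}")
            result[name] = type_name.strip() or "str"
        return result

    return {"inputs": fields(left), "outputs": fields(right)}


parsed = parse_signature("question, context: list[str] -> answer")
```

The class-based form carries the same information, plus descriptions and a docstring, which is why it is usually preferred once a task needs more than field names.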
5.2 Module Matrix
| Module | Purpose | Applicable scenarios |
|---|---|---|
| `dspy.Predict` | Calls the LLM directly without additional reasoning | Simple classification, extraction |
| `dspy.ChainOfThought` | Asks the LLM to reason before producing output | Tasks that require logical deduction |
| `dspy.ReAct` | Reason-and-act loop with tool calling | Agents, multi-step reasoning, search |
| `dspy.ReActV2` | Improved ReAct with a better reasoning structure (new in v3.3.0b1) | Complex agent scenarios |
| `dspy.MultiChainComparison` | Compares multiple reasoning chains and outputs the best result | Tasks that benefit from consensus |
| `dspy.ProgramOfThought` | Reasons by generating and executing code | Mathematical reasoning, programming tasks |
| `dspy.RLM` | Recursive Language Model, multi-level nested reasoning | Deep reasoning, complex logic |
| `dspy.Module` (custom) | Combines multiple Signatures and Modules | Complex multi-step pipelines |
5.3 Optimizer Matrix
| Optimizer | Principle | Applicable Scenarios |
|---|---|---|
| GEPA | Genetic-Pareto prompt evolution guided by natural-language reflection (ICLR 2026 Oral) | Currently the most recommended production-grade optimizer; up to 35x fewer rollouts than GRPO |
| MIPROv2 | Jointly optimizes instruction text and few-shot examples | General optimization needs, from zero-shot to few-shot |
| BootstrapFewShot | Automatically selects the best examples from the training set | Quickly building a few-shot program |
| BootstrapFewShotWithRandomSearch | Random search over bootstrapped examples | Exploring a larger search space |
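
The "Pareto" part of GEPA refers to keeping candidates that win on at least one example, rather than only the candidate with the best average. The sketch below illustrates that selection idea in isolation with made-up candidate names and scores; it is a simplified illustration, not the GEPA implementation.

```python
def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Return candidates that achieve the best score on at least one example."""
    n_examples = len(next(iter(scores.values())))
    frontier: set[str] = set()
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        frontier |= {name for name, s in scores.items() if s[i] == best}
    return frontier


# Hypothetical per-example scores for three candidate prompts.
scores = {
    "prompt_a": [1.0, 0.0, 1.0],  # best on examples 1 and 3
    "prompt_b": [0.0, 1.0, 0.0],  # low average, but the only winner on example 2
    "prompt_c": [0.0, 0.0, 0.5],  # never the best
}

frontier = pareto_frontier(scores)
best_on_average = max(scores, key=lambda k: sum(scores[k]) / len(scores[k]))
```

Selecting only by average would discard `prompt_b`, even though it solves a case that `prompt_a` misses. Keeping it on the frontier preserves that complementary behavior for later rounds.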
5.4 Enterprise and Engineering Capabilities
| Capability | Description |
|---|---|
| MLflow Tracing | Native observability based on OpenTelemetry; traces every LLM call |
| MLflow Model Serving | Deploys a compiled program as a production API with one click |
| Asynchronous execution | `dspy.asyncify()` supports high-concurrency scenarios |
| Thread safety | Safe in multi-threaded environments, suitable for production services |
| Program save/load | `program.save("model.json")` / `program.load("model.json")` on a freshly constructed instance |
| Multiple LM backends | Unified `dspy.LM()` interface supporting OpenAI, Anthropic, local models, etc. |
| LiteLLM integration | Lazy-loaded, activated on demand |
| Go compatibility | The community dspy-go project runs DSPy-compatible modules in Go |
6. Architecture, Deployment, and Integration
6.1 Installation

    # Stable release
    pip install -U dspy
    # Latest development version (from the main branch)
    pip install git+https://github.com/stanfordnlp/dspy.git

6.2 LM Configuration (Unified Model Interface)
DSPy manages all LLM backends through the unified `dspy.LM()` interface, so switching models does not require changes to program code:

    import dspy

    # OpenAI model
    lm = dspy.LM("openai/gpt-4o-mini")
    dspy.configure(lm=lm)
    # Anthropic model
    lm = dspy.LM("anthropic/claude-sonnet-4-20250514")
    # Local open-source model (via LiteLLM or called directly)
    lm = dspy.LM("ollama/llama3.1")
    # Azure OpenAI
    lm = dspy.LM("azure/gpt-4o", api_key="...", api_version="...")

6.3 Deployment Modes
| Mode | Description |
|---|---|
| Local OSS | `pip install dspy`; pure Python, suitable for development and experiments |
| MLflow Model Serving | Packages the compiled program as an MLflow model and deploys it as a REST API |
| MLflow Tracing | OpenTelemetry-based call tracing for production monitoring |
| Asynchronous deployment | `dspy.asyncify(program)` converts a program to an asynchronous version that works with asyncio |
| MCP Server | DSPy programs can be exported as MCP-compatible service endpoints |
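
The asynchronous row is about running many program calls concurrently. As a generic standard-library illustration (this is not `dspy.asyncify` itself, and `slow_program` is a hypothetical stand-in for a blocking LLM-backed call), a synchronous function can be moved onto worker threads and awaited in a batch:

```python
import asyncio
import time


def slow_program(question: str) -> str:
    """Hypothetical blocking call standing in for an LLM-backed program."""
    time.sleep(0.1)
    return f"answer to: {question}"


async def run_batch(questions: list[str]) -> list[str]:
    """Run the blocking calls in worker threads and wait for all of them."""
    tasks = [asyncio.to_thread(slow_program, q) for q in questions]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


questions = ["q1", "q2", "q3"]
answers = asyncio.run(run_batch(questions))
```

Because the calls overlap instead of running back to back, the batch finishes in roughly the time of one call when the work is I/O-bound, which is the typical case for remote LLM requests.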
6.4 Integration Ecosystem
- Observability: MLflow Tracing (OpenTelemetry standard)
- Deployment: MLflow Model Serving, PyPI release
- Model providers: OpenAI, Anthropic, Cohere, Mistral, Google Gemini, AWS Bedrock, Azure OpenAI, Ollama, locally hosted HuggingFace models, and more
- Vector databases: native integration with ColBERTv2 and other retrievers; any vector store can also be reached through tool functions
- Other frameworks: LangChain has an official DSPy integration, so DSPy modules can be embedded in a LangChain pipeline; LlamaIndex can also be used alongside DSPy
- Go: the community dspy-go project provides a compatible implementation
7. How to Use
7.1 Minimal Complete Example

    import dspy

    # 1. Configure the language model
    lm = dspy.LM("openai/gpt-4o-mini")
    dspy.configure(lm=lm)

    # 2. Define the task signature (replaces a handwritten prompt)
    class SentimentClassifier(dspy.Signature):
        """Classify the sentiment of a sentence as positive, negative, or neutral."""
        sentence: str = dspy.InputField()
        sentiment: str = dspy.OutputField(desc="one of: positive, negative, neutral")

    # 3. Choose an execution strategy
    classify = dspy.ChainOfThought(SentimentClassifier)

    # 4. Call it directly (zero-shot)
    result = classify(sentence="The new feature is amazing and works flawlessly!")
    print(result.sentiment)  # expected: "positive"

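7.2 Complete Training Process with Optimization

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    # Prepare training data (SentimentClassifier is the Signature defined in 7.1)
    trainset = [
        dspy.Example(sentence="I love this product!", sentiment="positive").with_inputs("sentence"),
        dspy.Example(sentence="This is terrible.", sentiment="negative").with_inputs("sentence"),
        dspy.Example(sentence="It's okay, nothing special.", sentiment="neutral").with_inputs("sentence"),
        # ... more training examples
    ]

    # Define the evaluation metric. GEPA passes two extra optional arguments
    # (assumption based on the DSPy 3.x GEPA API; check your installed version).
    def sentiment_accuracy(example, pred, trace=None, pred_name=None, pred_trace=None):
        return pred.sentiment.lower() == example.sentiment.lower()

    # Create the classifier and the optimizer. GEPA also expects a reflection LM
    # (same assumption as above).
    classify = dspy.ChainOfThought(SentimentClassifier)
    optimizer = dspy.GEPA(
        metric=sentiment_accuracy,
        auto="medium",
        reflection_lm=dspy.LM("openai/gpt-4o"),
    )

    # Compile (optimize)
    optimized_classify = optimizer.compile(classify, trainset=trainset)
    # Save the compiled result
    optimized_classify.save("sentiment_classifier_v1.json")
    # Load and use
    loaded = dspy.ChainOfThought(SentimentClassifier)
    loaded.load("sentiment_classifier_v1.json")
    result = loaded(sentence="Absolutely fantastic experience!")
    print(result.sentiment)  # expected: "positive"
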
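
Before compiling, it usually helps to hold out part of the labeled data so the optimized program is scored on examples it was not tuned on. The helper below is a minimal, library-free sketch; `split_examples`, the 20% fraction, and the sample tuples are illustrative assumptions, and the same idea applies to a list of `dspy.Example` objects.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle deterministically and return (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical labeled examples: (sentence, sentiment).
examples = [
    (f"sentence {i}", "positive" if i % 2 else "negative") for i in range(10)
]
train, dev = split_examples(examples)
```

A fixed seed keeps the split reproducible, so a later recompilation after a model change is compared on the same held-out examples.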
7.3 Agent Example (ReAct Tool Calling)

    import dspy

    lm = dspy.LM("openai/gpt-4o")
    dspy.configure(lm=lm)

    # Define tools (plain Python functions are enough)
    def search(query: str) -> list[str]:
        """Search a knowledge base for relevant documents."""
        # knowledge_base is a placeholder for your own retrieval client
        return knowledge_base.query(query, k=3)

    def calculator(expression: str) -> float:
        """Evaluate a mathematical expression."""
        return dspy.PythonInterpreter({}).execute(expression)

    # Create the agent
    agent = dspy.ReAct("question -> answer", tools=[search, calculator])

    # Use the agent
    result = agent(question="What is the GDP per capita of France based on 2024 data?")
    # The agent reasons on its own: it needs GDP and population data → search → divide
    print(result.answer)  # illustrative output: "$46,029"

7.4 Multimodal Example

    import dspy

    class AnalyzeChart(dspy.Signature):
        """Describe the trend and key data points in a chart."""
        chart: dspy.Image = dspy.InputField()
        title: str = dspy.OutputField()
        trend: str = dspy.OutputField()
        data_points: list[dict] = dspy.OutputField()

    analyze = dspy.Predict(AnalyzeChart)
    result = analyze(chart=dspy.Image("quarterly_revenue.png"))

8. Pre-Sales Talking Points
8.1 One-Sentence Positioning
**"DSPy is Stanford's LLM compiler: it turns your AI system from a handcrafted prompt workshop into an automatically optimized production line. Shopify cut costs 75x and Databricks 90x."**
8.2 Customer Pain Points → Solutions
| Customer Pain Point | DSPy Solution | Quantified Effect |
|---|---|---|
| "We have tuned the prompt for two months and results are still unstable" | The Optimizer compiles automatically and, within a few hours, finds prompts far better than manually tuned ones | Typical GEPA improvement: +14% accuracy, prompts 9.2x shorter |
| "Rewriting every prompt when we switch models is too painful" | Signatures are decoupled from the model. After switching models you recompile once and the code stays untouched | AWS Nova migration case: smooth switch from a large model to a small one |
| "We have labeled a lot of data but don't know how to use it to improve results" | The Optimizer uses labeled data directly as the optimization signal | 200 samples can raise the baseline from 62% to 89% |
| "Large models are too expensive, but small models perform poorly" | Compilation lets a small model's prompt quality approach or exceed a large model's zero-shot performance | Databricks: open-source model + GEPA optimization > Claude Opus 4.1 (zero-shot) |
| "The AI system is a black box; we can't tell which step is failing" | MLflow Tracing tracks every LLM call, so each step is observable | Native OpenTelemetry standard; plugs into existing monitoring |
| "Three people maintain 50 prompts and it is driving us crazy" | Signature + Module are managed as code, and prompts become version-controlled JSON | Git management, diffs, CI/CD |
| "Agent behavior is unstable: sometimes it works, sometimes it produces nonsense" | The Optimizer can optimize the agent's tool-calling policy and reasoning chain | Replit code repair and the Databricks Genie agent are validated in production |
8.3 Differentiated Selling Points
vs LangChain (the LLM framework with the broadest ecosystem):
- DSPy does not do orchestration or integration; it focuses on optimizing the quality of the LLM calls themselves, which makes it complementary to LangChain
- LangChain has an official DSPy integration, so a DSPy module can serve as an optimization node in a LangChain pipeline
- LangChain has no equivalent of DSPy's optimization: LangChain helps you build the pipeline, DSPy helps you tune it
- Positioning: LangChain as the framework, DSPy as the optimization layer
vs LlamaIndex (RAG-focused framework):
- LlamaIndex is strong at data connectors and index management; DSPy is strong at compiling and optimizing LLM calls
- LlamaIndex helps the LLM find the right data; DSPy helps the LLM process that data correctly
- DSPy's RAG optimization (retrieval strategy, generation quality) works on top of LlamaIndex's index layer
vs handwritten prompts (the traditional way):
- Handwriting prompts is a craft; DSPy is engineering: reproducible, versionable, automatable
- Handwritten prompts rely on personal experience; DSPy relies on data and metrics
- With handwritten prompts, changing the model means rewriting; with DSPy, it means recompiling
- Research finding: an optimized prompt is not necessarily effective when used outside the framework (some scenarios degrade), which indicates that DSPy is a complete programming model, not just a prompt generator
vs China-based platforms (Dify / Coze / MaxKB, etc.):
- DSPy is a low-level LLM compiler; the China-based platforms lean toward the application layer
- DSPy suits medium and large teams that have ML engineering capacity and need fine-grained control
- DSPy's academic backing (Stanford, ICLR) appeals strongly to technical decision makers
8.4 Customer Value Storyline (5-Step Approach)
- Opening: "Has the effect been stable since you launched your LLM application? How long did prompt debugging take? Do you have to start over when you change models?"
- Resonance: "Handwriting prompts is like hand-tuning assembly code: switch to another CPU (model) and you do it all again. The large models keep evolving, but prompt engineering practice is still stuck in 2023."
- Demo: define a classification task in 5 lines of Python on site → switch to ChainOfThought without changing the rest of the code → compile and optimize automatically with GEPA → show the improvement → save as JSON to prove reproducibility
- Deep dive: show production cases, such as Shopify moving from a single GPT-5 call to a self-hosted Qwen-3-9B multi-agent architecture, with a 2x quality improvement and cost reduced from $5 million/year to zero
- Closing: "DSPy is not another prompt template library; it is the compiler methodology behind an ICLR 2026 Oral paper. It comes from Stanford, is MIT-licensed open source, and Shopify, Databricks, Dropbox, and AWS all run it in production."
9. Frequently Asked Customer Questions
| Question | Answer |
|---|---|
| How does DSPy relate to LangChain/LlamaIndex: competitor or complement? | Complement. LangChain handles orchestration (organizing chains and agents), LlamaIndex handles indexing (data connection and retrieval), and DSPy handles optimization (improving LLM call quality). LangChain has an official DSPy integration. Best practice: use LangChain/LlamaIndex for architecture and DSPy as the optimization layer. |
| Is labeled data required? | Optimization (the Optimizer) requires a computable evaluation metric and labeled data. Without optimization, writing declarative programs with Signature + Module still brings type safety, modularity, and better maintainability, with results comparable to a good zero-shot prompt. |
| How much does one optimization run cost? | It depends on the training set size and the optimizer. GEPA is far more efficient than traditional methods (up to 35x fewer rollouts than GRPO). Typical scenario: 200 training samples, gpt-4o-mini, GEPA with auto="medium", for a compilation cost of about $2-5. Databricks reports a 90x total cost reduction (optimization cost versus long-term inference savings). |
| Are private deployment and domestically developed (Chinese) models supported? | Yes. Through the unified `dspy.LM()` interface, DSPy connects to any model backend, including local models served by Ollama (such as Qwen or Llama), vLLM, HuggingFace TGI, and others. Shopify replaced GPT-5 with a self-hosted Qwen-3-9B. |
| Can compiled results be used outside the framework? | Possible in some scenarios, but not recommended. One study (University of Minho, 2025) found that optimized prompts may degrade once they leave the DSPy framework, because DSPy does not just generate prompts; it is a complete execution model. Keep compiled modules loaded and used within DSPy. |
| What team size does DSPy suit? Is the technical barrier high? | It suits medium-sized teams with 1-2 ML engineers. The core concepts (Signature → Module → Optimizer) take about half a day to understand. Basic Python plus some LLM experience is enough to get started. 60+ tutorials and cookbooks cover common scenarios. |
| Which is better, DSPy or fine-tuning? | They do not conflict. DSPy's BetterTogether paper shows that combining prompt optimization with fine-tuning works best. The DSPy optimizer can generate fine-tuning data or optimize prompts only. For teams with limited resources, optimizing just the prompt (zero/few-shot) can already yield significant gains. |
| How is data security ensured? Does the tuning process upload data? | DSPy is a local Python library, and data does not pass through any remote server other than the LLM API you configure. LLM calls made by the Optimizer during compilation go through the model backend you configured (which can be a local model). MLflow Tracing data is stored in your own environment. |
| How well are Chinese-language tasks supported? | The framework itself is language-independent. Results in Chinese depend on the LLM used. With strong Chinese models such as Qwen and DeepSeek, DSPy optimization is equally effective. Signature docstrings and field descriptions can be written in Chinese. |
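
For the cost question, a back-of-the-envelope estimate is often enough in a pre-sales conversation. The sketch below multiplies call count, tokens per call, and per-token price. Every number in it is a placeholder assumption (the call count, token sizes, and prices are not measurements from DSPy or any provider); substitute the customer's real figures and the provider's current price list.

```python
def estimate_compile_cost(
    n_calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the one-time LLM spend of a compilation run in USD."""
    input_cost = n_calls * input_tokens_per_call * usd_per_million_input / 1_000_000
    output_cost = n_calls * output_tokens_per_call * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder assumptions for a mid-sized run on a small model.
cost = estimate_compile_cost(
    n_calls=5_000,
    input_tokens_per_call=1_500,
    output_tokens_per_call=300,
    usd_per_million_input=0.15,
    usd_per_million_output=0.60,
)
```

With these assumed inputs the estimate comes to about $2.03. The call count is the dominant unknown, so it is worth measuring on a small pilot run before quoting a figure.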
10. PoC Recommendations
Recommended PoC direction: from handwritten prompts to DSPy compilation and optimization
Select an LLM task (classification/extraction/RAG) for which the customer already has labeled data, and compare the handwritten prompt against the DSPy-compiled version.
| Phase | Content | Time | Output |
|---|---|---|---|
| 1. Environment setup | `pip install dspy`, configure the LM (the model the customer currently uses) | 0.5 days | Runnable environment |
| 2. Task migration | Convert the prompt logic of an existing task into a DSPy Signature (without changing business logic) | 0.5 days | Working Signature version |
| 3. Baseline test | Run Predict zero-shot on the training set and record baseline accuracy | 0.5 days | Baseline metrics |
| 4. GEPA optimization | Configure the metric and run GEPA compilation (auto="medium") | 0.5 days | Compiled, optimized version |
| 5. Effect evaluation | Compare the handwritten prompt with the compiled version on the test set and produce an improvement report | 0.5 days | Comparative evaluation report |
| 6. Model migration verification | Switch to a cheaper model (e.g. gpt-4o-mini → gpt-5.4-nano), recompile, and compare | 0.5 days | Cost optimization plan |
| 7. Production demo | Show JSON save/load and MLflow Tracing observability | 0.5 days | Full system demo |
Overall time: 3.5 days
Validation Metrics:
- Compiled accuracy > handwritten prompt accuracy
- Post-compilation stability (variance across multiple runs)
- After model migration, the compiled version performs close to or better than the handwritten version on the original model
- Compile time < 30 minutes (200 samples)
- The compiled program saves to JSON and is ready to use after loading
PoC Success Criteria (Go/No-Go):
- ✅ Go: the compiled version is at least as effective as the handwritten version and significantly reduces prompt maintenance work.
- ✅ Go: the model migration experiment succeeds, showing that switching models only requires recompiling.
- ❌ No-Go: the customer has no labeled data at all and is unwilling to invest in labeling (the DSPy optimizer's value cannot be demonstrated in that case).
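
The accuracy and stability checks above can be reduced to a small decision helper. This is a sketch under stated assumptions: `poc_verdict`, the 0.02 standard-deviation threshold, and the sample accuracies are all hypothetical, and a real PoC would agree the thresholds with the customer up front.

```python
from statistics import mean, pstdev


def poc_verdict(
    handwritten_acc: float,
    compiled_runs: list[float],
    max_stdev: float = 0.02,
) -> dict:
    """Go if the compiled program matches the baseline and is stable across runs."""
    avg = mean(compiled_runs)
    spread = pstdev(compiled_runs)
    go = avg >= handwritten_acc and spread <= max_stdev
    return {"mean": avg, "stdev": spread, "go": go}


# Hypothetical numbers: handwritten baseline vs three compiled evaluation runs.
verdict = poc_verdict(0.80, [0.86, 0.88, 0.87])
unstable = poc_verdict(0.80, [0.95, 0.70, 0.90])
```

The second call shows why stability is listed as its own criterion: its mean (0.85) beats the baseline, but the spread across runs is too wide to count as a Go.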
11. Risks and Considerations
| Risk | Level | Description | Mitigation |
|---|---|---|---|
| Dependence on metric design | High | DSPy optimization depends entirely on the quality of the metric function. A wrong metric yields "high scores, low ability": the program optimizes the metric without actually improving task quality. | Prioritize validating the metric design during the PoC; cross-validate automated metrics with manual evaluation. |
| Labeled data threshold | High | Optimizer compilation needs labeled data (at least dozens to hundreds of items). Scenarios without labeled data cannot realize DSPy's core value. | Help the customer set up a labeling process first (or use existing user feedback data); Signature + Module still brings code-organization benefits without optimization. |
| Compilation cost | Medium | GEPA/MIPROv2 compilation requires many LLM calls, which can mean a one-time cost of $10 to $100 on a large training set. | Use auto="medium" to limit the search space; pilot with a small sample, then scale up; compilation is a one-time cost that is small relative to long-term inference savings. |
| Degradation outside the framework | Medium | Research shows that prompts optimized by DSPy may perform worse when used outside the framework. | Keep compiled results loaded and used within DSPy; do not export them as plain-text prompts. |
| Learning curve | | Moving from "writing prompt strings" to "declarative programming" takes time, and the team needs to understand the three-layer Signature/Module/Optimizer abstraction. | 60+ tutorials cover common tasks; the learning path resembles PyTorch: get Predict running end to end first, then move gradually into optimization. |
| Research origins | Low-Medium | DSPy comes from academia. Some modules (such as RLM) are marked experimental, and the API may have breaking changes between versions. | Use a stable version (v3.x); follow the release notes; use mature modules such as Predict/ChainOfThought/ReAct on the critical production path. |
| Weak Chinese-language ecosystem | Medium | The community and documentation are mainly in English, with few Chinese tutorials and case studies. | Pair a Chinese LLM with the DSPy framework; read the English documentation with a translation tool. |
| Depth of integration with LangChain and other frameworks | Low | Although an official integration exists, the two frameworks have different design philosophies, so mixing them requires attention to concept mapping. | Divide the work clearly: orchestration to LangChain, optimization to DSPy; start with DSPy on its own and integrate when necessary. |
| Lock-in risk | Low | MIT license, pure Python, independent of any specific model. | Compiled results are saved as auditable JSON; Signatures are standard Python classes; fork-friendly. |
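
The first mitigation, cross-validating the automated metric against manual evaluation, can be quantified with an agreement statistic. The sketch below computes Cohen's kappa between the metric's pass/fail decisions and a human reviewer's decisions; the two label lists are made-up sample data.

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on the same 8 outputs (1 = pass, 0 = fail).
metric_pass = [1, 1, 0, 1, 0, 1, 1, 0]
human_pass = [1, 1, 0, 0, 0, 1, 1, 1]

kappa = cohens_kappa(metric_pass, human_pass)
```

Here the raw agreement is 75%, but kappa is only about 0.47 once chance agreement is removed. A low value is a signal to revisit the metric before spending compilation budget optimizing against it.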
12. My Pre-Sales Judgment
Recommendation: Highly recommended (especially for customers whose LLM applications deliver unstable results or need large-scale cost reduction, in particular medium and large enterprises with ML engineering capability)
Reason:
- Leading methodology: "Programming, not prompting" is not a marketing slogan; it is a compiler paradigm validated at ICLR 2024 and ICLR 2026 (Oral). As AI infrastructure matures, the handcrafted-workshop style of prompt tuning will eventually be phased out, and DSPy is at the forefront of that shift.
- Solid production cases: Shopify (75x cost reduction, migration from GPT-5 to a self-hosted Qwen-3-9B), Databricks (90x cost reduction, open-source model + optimization > Claude Opus 4.1), Dropbox (45% NMSE reduction). These are not PoC experiments but production systems handling on the order of 10 million requests. Each case has a public technical blog post or video with details.
- Clear ROI narrative: a one-time compilation cost ($2-500) buys long-term inference cost savings ($500000/year → a fraction of that). This is a strong business argument in front of any CFO.
- Engineering-friendly: MIT-licensed open source, JSON save/load, MLflow Tracing (OpenTelemetry standard), asynchronous support, thread safety, and MCP Server export; the infrastructure required for production is in place.
- Academic backing plus commercial validation: the Stanford team publishes continuously: 2023 DSPy → 2024 MIPROv2/BetterTogether → 2025 GEPA (ICLR 2026 Oral)/RLM. 523+ PRs per year and 109 releases in total show that the community is very active.
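
The ROI argument in the list above boils down to a break-even calculation: how many requests it takes for per-request savings to repay the one-time compilation cost. The sketch below uses hypothetical prices; `break_even_requests` and all three input figures are illustrative assumptions, not numbers from the cited cases.

```python
import math
from typing import Optional


def break_even_requests(
    compile_cost_usd: float,
    old_cost_per_request_usd: float,
    new_cost_per_request_usd: float,
) -> Optional[int]:
    """Requests needed before savings cover the compile cost (None if no savings)."""
    saving = old_cost_per_request_usd - new_cost_per_request_usd
    if saving <= 0:
        return None
    return math.ceil(compile_cost_usd / saving)


# Hypothetical: $50 compile run, large model at $0.01/request, small model at $0.001.
requests_needed = break_even_requests(50.0, 0.01, 0.001)
no_saving = break_even_requests(50.0, 0.001, 0.001)
```

Under these assumptions the compile cost is repaid after 5,556 requests, which a production service typically reaches quickly. The function returns `None` when the new model is not cheaper, since no request volume would recover the cost.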
Recommended Customer Persona:
- The LLM application is already live, but prompt maintenance is costly and results are unstable
- Has an ML/engineering team (1-2 people) with Python skills
- Wants to migrate from large models to small models to reduce costs
- Works on RAG/classification/extraction/agent scenarios and has labeled data (or is willing to invest in labeling)
- Medium or large enterprise that values observability and engineering rigor
- Uses multiple model providers and needs a unified LLM call layer
Not recommended situations:
- No labeled data at all and no plan to label in the short term (the Optimizer's value cannot be demonstrated)
- Needs a low-code/no-code platform (Dify/Coze/MaxKB recommended)
- The team has no Python capability at all (direct API calls or SaaS solutions recommended)
- Only 1-2 simple prompts with no need for optimization (write them by hand)
- Extremely cost-sensitive and unable to afford the LLM calls made during compilation
13. References
- GitHub repository: https://github.com/stanfordnlp/dspy
- Official website: https://dspy.ai
- Installation guide: https://dspy.ai/getting-started/installation/
- Getting started tutorial: https://dspy.ai/getting-started/program-dont-prompt/
- Production deployment: https://dspy.ai/production/
- Use case summary: https://dspy.ai/community/use-cases/
- DSPy framework paper (ICLR 2024): https://arxiv.org/abs/2310.03714
- GEPA paper (ICLR 2026 Oral): https://arxiv.org/abs/2507.19457
- MIPROv2 paper: https://arxiv.org/abs/2406.11695
- BetterTogether (fine-tuning + prompt optimization): https://arxiv.org/abs/2407.10930
- RLM (Recursive Language Model): https://arxiv.org/abs/2512.24601
- DSP paper (origin of the framework): https://arxiv.org/abs/2212.14024
- Shopify production case: https://www.youtube.com/watch?v=bxToahwOVpY
- Dropbox Dash case: https://dropbox.tech/machine-learning/optimizing-dropbox-dash-relevance-judge-with-dspy
- AWS Nova migration case: https://aws.amazon.com/blogs/machine-learning/improve-amazon-nova-migration-performance-with-data-aware-prompt-optimization/
- Databricks 90x cost reduction: https://www.databricks.com/blog/building-state-art-enterprise-agents-90x-cheaper-automated-prompt-optimization
- Databricks Genie: https://www.databricks.com/blog/pushing-frontier-data-agents-genie
- JetBlue case: https://www.databricks.com/blog/optimizing-databricks-llm-pipelines-dspy
- Replit code repair: https://blog.replit.com/code-repair
- MLflow integration: https://mlflow.org/docs/latest/llms/dspy/index.html
- Observability tutorial: https://dspy.ai/tutorials/observability/
- DeepWiki DSPy: https://deepwiki.com/stanfordnlp/dspy
- Discord community: https://discord.gg/XCGy2WDCQB
- PyPI: https://pypi.org/project/dspy/
- dspy-go (Go implementation): https://github.com/darwishdev/dspy-go
- GEPA standalone library: https://github.com/gepa-ai/gepa
- GEPA use case collection: https://gepa-ai.github.io/gepa/guides/use-cases/
- Medical imaging DSPy application: https://arxiv.org/abs/2511.11898
- Multi-scenario DSPy optimization study: https://www.alphaxiv.co/overview/2507.03620

Analysis date: 2026-07-02 | Data freshness: GitHub figures from live access, official website content as of the access date, v3.3.0b1