SkillOpt is Microsoft's open-source "natural-language skill optimizer". It does not train model weights. Instead, it treats the Markdown skill document used by an agent as the trainable state and repeatedly runs a loop of rollout, reflection, editing, and validation gating, ultimately producing a deployable `best_skill.md`. It suits capturing agent know-how as reusable skills, automating prompt/skill engineering, continuously improving tasks that can be evaluated, and R&D proofs of concept (PoCs). It is not, however, an out-of-the-box enterprise agent platform: it requires datasets, scorers, model APIs, a running-cost budget, and engineering integration.

1. Project Overview

| Project | Information |
| --- | --- |
| GitHub | microsoft/SkillOpt |
| Project page | https://microsoft.github.io/SkillOpt/ |
| Paper | arXiv:2605.23904 |
| PyPI | skillopt |
| Project positioning | Optimizing Natural Language Agent Skills with Deep Learning Training Loops |
| Open-source license | MIT |
| Main language | Python |
| Python requirement | Python >= 3.10 |
| Latest PyPI version | `0.1.0` (checked 2026-06-27) |
| Latest GitHub release | `v0.1.0`, published 2026-06-02 (checked 2026-06-27) |
| GitHub popularity | About 9.5k stars, 903 forks, 12 open issues (checked 2026-06-27) |
| Main topics | `agent-skills`, `self-evolving-agents` |
| Core output | A deployable, trained Markdown skill document, `best_skill.md` |

2. Key Diagrams

Project home page

![[17-TEMPORARY ATTACHMENT/SkillOpt/skillopt-home.png]]

The project page positions SkillOpt as "text-space optimization for frozen agents": the target model is frozen, its weights are left unchanged, and only the natural-language skill document is optimized.

Method overview

![[17-TEMPORARY ATTACHMENT/SkillOpt/teaser-1.png]]

This diagram works well for explaining the idea to business stakeholders: SkillOpt does not train a large model; it trains the "skill manual" that the agent uses. The final deliverable is a short, portable, versionable skill document.

Training pipeline

![[17-TEMPORARY ATTACHMENT/SkillOpt/pipeline-1.png]]

This pipeline diagram is the most important one for pre-sales and solution discussions. It shows SkillOpt's basic closed loop: frozen agent plus current skill document → rollout on the training set → the optimizer model proposes add/delete/replace edits based on the trajectories → edits are merged and clipped → a candidate skill is generated → validation-set gate → accept or reject → output `best_skill.md`.
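
To make the add/delete/replace step concrete, here is a minimal sketch of applying structured edits to a skill held as a list of Markdown lines. The `Edit` dataclass and `apply_edits` function are hypothetical names chosen for illustration; they are not SkillOpt's actual API, and the real edit format may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    """A hypothetical structured edit proposed by the optimizer model."""
    op: str                       # "add", "delete", or "replace"
    target: Optional[str] = None  # existing line to delete or replace
    text: Optional[str] = None    # new line for add or replace


def apply_edits(skill_lines: List[str], edits: List[Edit]) -> List[str]:
    """Return a new skill document with the edits applied in order."""
    lines = list(skill_lines)  # copy so the current skill stays untouched
    for edit in edits:
        if edit.op == "add" and edit.text is not None:
            lines.append(edit.text)
        elif edit.op == "delete" and edit.target in lines:
            lines.remove(edit.target)
        elif edit.op == "replace" and edit.target in lines and edit.text is not None:
            lines[lines.index(edit.target)] = edit.text
        # Edits whose target line is missing are skipped rather than raising.
    return lines


skill = ["# Search skill", "- Answer in one word."]
candidate = apply_edits(skill, [
    Edit("replace", target="- Answer in one word.",
         text="- Answer with the shortest exact phrase."),
    Edit("add", text="- Re-check dates before answering."),
])
```

Because `apply_edits` returns a copy, the current skill remains available if the candidate is later rejected by the validation gate.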

Training Trends

![[17-TEMPORARY ATTACHMENT/SkillOpt/epoch-trends-1.png]]

The trend chart illustrates SkillOpt's "continuous optimization" idea: rather than writing a prompt once and for all, task performance, failure samples, and validation-set results feed a long-running iteration.

3. What is it?

SkillOpt can be understood as an "agent skill training framework", but "training" here means training a natural-language skill document, not model parameters.

Traditional agent skills or prompts usually come from one of the following sources:

| Approach | Problem |
| --- | --- |
| Manual writing | Relies on expert experience, iterates slowly, and makes it hard to review failure samples systematically |
| One-shot generation by a strong model | The first draft is fast, but it does not necessarily improve on real tasks |
| Agent self-reflective edits | Tends to drift, is hard to control, and may make things worse |
| Manual A/B testing of prompts | Controllable but costly and hard to scale |

SkillOpt's idea is to treat the Markdown skill document as the "external trainable state" of a frozen agent and to train it with a discipline similar to a deep learning optimizer. It introduces mechanisms such as epochs, batch size, learning rate, a validation gate, slow updates, and a meta skill, so that natural-language skills can be iterated through feedback rather than relying on one-off prompt engineering.

4. What does it mainly do?

| Capability | Description | Pre-sales value |
| --- | --- | --- |
| Train Markdown skills | Extracts add/delete/replace edits from task trajectories and scoring results and updates the skill document | Captures expert experience and lessons from failures as reusable assets |
| Frozen target model | Optimizes only the external skill document without modifying the target LLM's weights | Fits an enterprise's existing models and agents; no fine-tuning of the base model is needed |
| Validation-gated updates | A candidate edit is accepted only when the hold-out validation score strictly improves | Reduces the risk of "the more you optimize, the worse it gets" and makes controllability easier to explain in pre-sales |
| Outputs `best_skill.md` | The final deployed artifact is an ordinary Markdown file | Easy to review, easy to version, and easy to plug into skill systems such as Codex and Claude Code |
| Multi-backend support | The release notes mention support for OpenAI, Azure OpenAI, Claude, Qwen, MiniMax, and others | Suits multi-model selection and evaluation of domestic or private deployment paths |
| Multi-benchmark support | Built-in SearchQA, DocVQA, ALFWorld, OfficeQA, SpreadsheetBench, LiveMath, and others | Usable for R&D evaluation across different agent task types |
| WebUI dashboard | Optional Gradio WebUI for monitoring training | Lets the R&D team observe experiments as they run |
| SkillOpt-Sleep preview | Reviews a coding agent's past sessions offline at night, mines recurring tasks, and generates skill suggestions | Can be framed as a proof of concept that "the more the agent is used, the more organizational experience it accumulates" |
| Early plug-in ecosystem | The repository contains plug-in directories for Claude Code, Codex, Copilot, Devin, and OpenClaw | Shows that the project is exploring integration with mainstream coding agents |

5. Technical Principle: Move Deep Learning Training Ideas to Text Space

The SkillOpt training loop can be explained with the following table:

| Deep learning concept | SkillOpt counterpart | Explanation |
| --- | --- | --- |
| Model parameters | Skill document | The Markdown skill document is the trainable state |
| Forward pass | Rollout | The target agent performs tasks with the current skill, producing trajectories and scores |
| Loss / error | Failed trajectories, low-scoring samples | The scorer judges which tasks were done poorly |
| Backward pass | Reflect | The optimizer model analyzes trajectories and proposes edit patches |
| Gradient | Add / delete / replace edits | Structured edits to the skill document |
| Learning rate | Budget for the number of edits accepted per step | Controls the size of each modification to avoid over-updating |
| Gradient clipping | Rank / clip edits | Keeps only the most relevant edits |
| Validation set | Selection split / validation gate | A candidate skill is accepted only if it improves on the validation set |
| Checkpoint | `best_skill.md` | Saves the skill document with the best validation performance |
| Momentum / memory | Slow update / meta skill | Summarizes long-term strategy and guards against forgetting at epoch boundaries |

The core process is as follows:

flowchart LR
  A["Prepare task dataset"] --> B["Current skill document"]
  B --> C["Target agent rollout"]
  C --> D["Trajectories and scores"]
  D --> E["Optimizer model reflects on failure/success patterns"]
  E --> F["Generate add/delete/replace edits"]
  F --> G["Merge, rank, and clip by learning rate"]
  G --> H["Candidate skill"]
  H --> I{"Does the validation-set gate improve?"}
  I -->|Accept| J["Update current skill / best_skill.md"]
  I -->|Reject| K["rejected-edit buffer"]
  J --> C
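
The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not SkillOpt's implementation: `train_skill`, `make_proposer`, and `toy_validate` are hypothetical names, the proposer stands in for rollout plus reflection, edits are limited to appended lines, and the learning rate is modeled as a cap on edits per step.

```python
from typing import Callable, List, Tuple

Skill = List[str]  # a skill document as a list of Markdown lines


def train_skill(
    skill: Skill,
    propose_edits: Callable[[Skill], List[str]],  # stands in for rollout + reflection
    validate: Callable[[Skill], float],           # stands in for validation-set scoring
    steps: int = 5,
    learning_rate: int = 1,                       # max edits accepted per step
) -> Tuple[Skill, float, List[str]]:
    """Gated loop: keep a candidate only if its validation score strictly improves."""
    best, best_score = list(skill), validate(skill)
    rejected: List[str] = []                      # rejected-edit buffer
    for _ in range(steps):
        edits = propose_edits(best)[:learning_rate]  # clip to the edit budget
        if not edits:
            break
        candidate = best + edits                  # this toy only supports "add" edits
        score = validate(candidate)
        if score > best_score:                    # validation gate
            best, best_score = candidate, score
        else:
            rejected.extend(edits)
    return best, best_score, rejected


# Toy stand-ins so the sketch runs without any model calls.
HELPFUL = {"- Quote the source passage.", "- Prefer exact entity names."}
PROPOSALS = [
    "- Quote the source passage.",
    "- Always answer in uppercase.",
    "- Prefer exact entity names.",
]


def make_proposer(pool: List[str]) -> Callable[[Skill], List[str]]:
    """Return a proposer that hands out one queued rule per call."""
    queue = list(pool)
    return lambda _skill: [queue.pop(0)] if queue else []


def toy_validate(skill: Skill) -> float:
    """Fraction of the helpful rules present in the skill."""
    return sum(line in HELPFUL for line in skill) / len(HELPFUL)


best, best_score, rejected = train_skill(
    ["# QA skill"], make_proposer(PROPOSALS), toy_validate
)
```

In this run the two helpful rules pass the gate, while the uppercase rule leaves the score unchanged and lands in the rejected buffer, which is the behavior the strict-improvement gate is meant to enforce.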

6. Applicable Scenarios

6.1 Agent Skill Capture and Continuous Optimization

This suits customers who already use agents or are preparing to build them, but who find that they have written many prompts, that experience is scattered across documents, group chats, and tickets, and that different teams keep hitting the same pitfalls. SkillOpt's selling point is that it turns failure trajectories into reviewable skill-document updates, building up reusable, organization-wide experience in operating agents.

6.2 Automatic Prompt/Skill Optimization for Evaluable Tasks

If the customer has a well-defined task set and scoring criteria, such as question-answering accuracy, spreadsheet-processing accuracy, code-task pass rate, or document-extraction accuracy, SkillOpt can serve as an "automated prompt optimizer" or "skill optimizer" for R&D experiments.
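
As an illustration of what a "scoring criterion" can look like in code, the sketch below computes exact-match accuracy for question answering after light normalization. The function names and the normalization rules are assumptions for this example; SkillOpt's built-in scorers may work differently.

```python
import string
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    cleaned = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


accuracy = exact_match_accuracy(
    ["Paris.", " 42 ", "a cat"],
    ["paris", "42", "dog"],
)
```

A deterministic scorer like this is cheap to run repeatedly, which matters because the same validation set is scored many times during optimization.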

6.3 Long-Term Memory and Nightly Review for Coding Agents

The SkillOpt-Sleep preview targets local coding agents such as Claude Code, Codex, and Copilot. The idea is to review past sessions at night, mine recurring tasks, replay them offline, and then consolidate the validated experience into long-term skills. This direction fits pre-sales conversations about how an enterprise coding assistant accumulates organizational habits, project conventions, and common fixes.
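
The "mine recurring tasks" step can be pictured with a simple frequency count over past prompts. The sketch below is only an illustration of the idea and makes no claim about how SkillOpt-Sleep actually works; `task_key` and `recurring_tasks` are hypothetical helpers, and the masking rules are deliberately crude.

```python
import re
from collections import Counter
from typing import List, Tuple


def task_key(prompt: str) -> str:
    """Reduce a prompt to a rough task signature by masking files and numbers."""
    text = prompt.lower()
    text = re.sub(r"\S+\.\w+", "<file>", text)  # mask file-like tokens such as src/app.py
    text = re.sub(r"\d+", "<n>", text)          # mask remaining numbers
    return " ".join(text.split())


def recurring_tasks(prompts: List[str], min_count: int = 2) -> List[Tuple[str, int]]:
    """Return task signatures seen at least min_count times, most frequent first."""
    counts = Counter(task_key(p) for p in prompts)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


history = [
    "Fix lint errors in src/app.py",
    "fix lint errors in tests/test_api.py",
    "Bump the package version",
    "Fix lint errors in utils/io.py",
]
candidates = recurring_tasks(history)
```

Signatures that recur are natural candidates for offline replay, since a skill learned for them is likely to be reused.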

6.4 Enterprise Agent Platform Capability Evaluation

SkillOpt ships with several built-in benchmarks and allows new ones to be added. For enterprises that need to build an agent platform or compare models and agent harnesses, it can form part of the R&D evaluation and optimization pipeline.

6.5 Productizing Expert Experience

In fields such as operations, finance, legal, data analysis, and office automation, experts know how to do things in ways that are less likely to go wrong. With SkillOpt, that experience can be written into an initial skill and then improved continuously through real tasks and validation sets.

7. Less Suitable Scenarios

| Scenario | Reason |
| --- | --- |
| Open-ended creative work with no evaluable signal | SkillOpt relies heavily on rollout scoring and the validation gate; without a scorer it is hard to tell whether an edit really helps |
| Enterprises that only want out-of-the-box SaaS | SkillOpt is an open-source research/development framework, not a complete commercial platform |
| Very few task samples, or tasks that cannot be reproduced | Without a train/validation/test split, the optimization becomes less reliable |
| Real-time online inference optimization | SkillOpt's strength is offline skill training; it adds no model calls at inference time and does not optimize live |
| Highly noisy scorer | If scoring is unstable, the optimizer learns the wrong signal |
| Need for parameter-level capability gains | It does not fine-tune model weights; it only optimizes the external skill document |
| Extreme sensitivity to LLM call costs | Training requires many rollout, reflection, and evaluation calls, so estimate the budget before the PoC |
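
A budget estimate before the PoC can start as back-of-the-envelope arithmetic. The sketch below is an assumption-laden model, not SkillOpt's actual call pattern: it assumes one rollout per training task per epoch, one reflection call per batch, and a full validation pass at every step. The `Budget` fields and all prices are illustrative placeholders.

```python
import math
from dataclasses import dataclass


@dataclass
class Budget:
    train_tasks: int
    val_tasks: int
    epochs: int
    batch_size: int
    cost_per_rollout: float     # target-model cost of one task attempt
    cost_per_reflection: float  # optimizer-model cost of one reflection call


def estimate_cost(b: Budget) -> float:
    """Rough training cost under the assumptions stated above."""
    steps = b.epochs * math.ceil(b.train_tasks / b.batch_size)
    train_rollouts = b.epochs * b.train_tasks
    gate_rollouts = steps * b.val_tasks  # assumed: every step re-scores the validation set
    reflections = steps                  # assumed: one reflection call per batch
    return (train_rollouts + gate_rollouts) * b.cost_per_rollout \
        + reflections * b.cost_per_reflection


example = Budget(
    train_tasks=100, val_tasks=50, epochs=3, batch_size=10,
    cost_per_rollout=0.02, cost_per_reflection=0.10,
)
total = estimate_cost(example)
```

With these illustrative numbers the estimate comes to roughly 39 currency units, most of it from validation-gate rollouts, which suggests that validation-set size and step count are the first levers to check when a budget is tight.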

8. How to Use

8.1 Installation

Official installation method:

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

From PyPI:

pip install skillopt

Optional dependencies:

pip install -e ".[webui]"
pip install -e ".[claude]"
pip install -e ".[qwen]"
pip install -e ".[alfworld]"

8.2 Configure Model Keys

The official documentation requires at least one model backend to be configured, such as Azure OpenAI, OpenAI, Anthropic Claude, or a local Qwen model.

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-key

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
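
Before launching a run, it can help to check which of these variables are actually set. The following sketch uses only the variable names listed above; the backend labels and the `configured_backends` helper are hypothetical and are not part of SkillOpt.

```python
import os
from typing import Dict, List, Mapping

# Environment variables from this section, grouped under illustrative backend labels.
BACKEND_KEYS: Dict[str, List[str]] = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def configured_backends(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return backends for which every required variable is set and non-empty."""
    return [
        name for name, keys in BACKEND_KEYS.items()
        if all(env.get(key) for key in keys)
    ]


found = configured_backends()
print("Configured backends:", ", ".join(found) if found else "none")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests without touching real credentials.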

8.3 Run the First Experiment

SearchQA is the officially recommended starting point because it is comparatively the fastest to run:

python scripts/train.py --config configs/searchqa/default.yaml

Training output is usually saved in:

outputs///
├── steps/
├── slow_update/
├── meta_skill/
├── skills/
├── best_skill.md
├── history.json
└── config.yaml

Evaluate the best skill:

python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/searchqa//skills/best_skill.md

8.4 Start the WebUI

pip install -e ".[webui]"
python -m skillopt_webui.app

The default port is `7860`.

8.5 Use Existing Checkpoint Skills

The repository's `ckpt/` directory provides several GPT-5.5-optimized skills from the paper, covering SearchQA, ALFWorld, DocVQA, OfficeQA, SpreadsheetBench, and LiveMath. They are not general-purpose tools; they are meant for reproducing experiments or as reference examples of portable skill artifacts.
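
Because a trained skill is just a Markdown file, using one can be as simple as reading it and adding it to the agent's instructions. The sketch below shows one generic way to do that; `build_system_prompt` and the `## Skill` heading are assumptions for illustration, and each agent harness has its own mechanism for loading skills.

```python
import tempfile
from pathlib import Path


def build_system_prompt(base_prompt: str, skill_path: Path) -> str:
    """Append the skill document to a base system prompt as static text."""
    skill_text = skill_path.read_text(encoding="utf-8").strip()
    return f"{base_prompt.strip()}\n\n## Skill\n\n{skill_text}\n"


# Demonstration with a throwaway file standing in for a real skill artifact.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_skill.md"
    path.write_text("- Quote the source passage.\n", encoding="utf-8")
    prompt = build_system_prompt("You are a QA agent.", path)
```

No optimizer or reflection call is involved at this point, which is why a deployed skill adds prompt text but no extra model calls.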

9. Architecture, Deployment, and Integration

9.1 Main Modules

| Module | Role |
| --- | --- |
| `skillopt/engine` | Main training loop |
| `skillopt/gradient` | Trajectory reflection and edit-patch aggregation |
| `skillopt/optimizer` | Edit selection, learning rate, slow update, meta skill, and so on |
| `skillopt/evaluation` | Validation gate |
| `skillopt/envs` | Built-in benchmark adapters |
| `skillopt/model` | Azure OpenAI, OpenAI, Claude, Qwen, MiniMax, Codex/Claude Code harnesses, and more |
| `skillopt_webui` | Gradio WebUI |
| `skillopt_sleep` | Deployment-time companion (preview) |
| `plugins/` | Integrations for Claude Code, Codex, Copilot, Devin, and OpenClaw |

9.2 Typical Enterprise Deployment Pattern

flowchart TD
  A["Enterprise task samples / historical agent sessions"] --> B["Build train / validation / test splits"]
  B --> C["Define scorer / acceptance rules"]
  C --> D["SkillOpt offline training"]
  D --> E["Produce best_skill.md"]
  E --> F["Manual review and version management"]
  F --> G["Publish to agent platform / coding agent / prompt gateway"]
  G --> H["Production use"]
  H --> I["Collect new trajectories and failure samples"]
  I --> D

In pre-sales, emphasize that a real SkillOpt deployment does not "get stronger automatically once installed": the task data, scorer, validation set, skill release process, and review mechanism all have to be designed together.
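
The first design step, building train/validation/test splits, can be sketched as a seeded shuffle so that the split is reproducible across runs. The `split_tasks` helper and its default fractions are assumptions for illustration, not SkillOpt defaults.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_tasks(
    tasks: Sequence[T],
    val_frac: float = 0.2,
    test_frac: float = 0.2,
    seed: int = 0,
) -> Dict[str, List[T]]:
    """Shuffle deterministically and cut into disjoint train/validation/test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and leave room for training")
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


splits = split_tasks([f"task-{i}" for i in range(50)])
```

Keeping the test split out of both training and gating is what lets it serve later as an honest acceptance measure.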

10. Pre-Sales Talking Points

Business-oriented

| Customer concern | Suggested wording |
| --- | --- |
| The agent keeps making the same mistakes | "SkillOpt's value is that it captures lessons from failures as reusable skills, instead of someone manually patching the prompt every time." |
| Organizational experience is hard to capture | "It produces Markdown skill documents that can be reviewed, versioned, reused, and added to the enterprise agent's skill library." |
| The customer does not want to train a model | "It does not change model weights; it only optimizes external skill documents, which works better with the enterprise's existing models and with closed-source models." |
| Worry that optimization makes things worse | "It adds a validation-set gate: a candidate skill is accepted only if it does better on the hold-out validation set." |
| How the coding agent comes to understand the project better over time | "The SkillOpt-Sleep direction is to review past sessions at night, mine recurring tasks, and generate validated long-term skills." |

Technology-oriented

| Technical question | Suggested answer |
| --- | --- |
| What inputs are required? | "A task dataset, an initial skill, an executable rollout, a scorer, and a model API." |
| What is the output? | "The output is `best_skill.md`, which is essentially a deployable natural-language skill." |
| Does it increase online inference cost? | "The training phase adds call costs. Once deployed, the skill is a static document and adds no extra inference-time model calls." |
| How do we extend it to our own tasks? | "Implement a dataloader, rollout/scorer, EnvAdapter, and YAML config, then register the benchmark." |
| How is quality controlled? | "Through train/validation/test splits, the validation gate, manual review, and versioned releases." |

11. PoC Recommendations

11.1 Suitable PoC Themes

| PoC theme | Why it fits |
| --- | --- |
| Optimizing Q&A skills for an enterprise knowledge base | Reference answers or manual labels make accuracy easy to evaluate |
| Spreadsheet / office automation | Outputs can be compared, which supports automated scoring |
| Capturing project conventions for a coding agent | Can be evaluated by test pass rate, lint, and a review checklist |
| Customer service / operations SOP agent | Historical tickets, standard procedures, and clear success criteria are available |
| Data analysis agent | Fixed analysis tasks and expected results suit skill optimization |

11.2 PoC Scope

| Item | Recommendation |
| --- | --- |
| Duration | 2-4 weeks is reasonable |
| Data | Prepare at least dozens to hundreds of scorable tasks, split into train/validation/test |
| Initial skill | Have a domain expert write a seed skill first; starting completely blank is not recommended |
| Model | Get the pipeline running with a strong, stable model first, then evaluate domestic or private models |
| Risky actions | Run only offline evaluation or sandboxed tasks; do not operate production systems directly |

11.3 Example Acceptance Metrics

| Metric | Suggested target |
| --- | --- |
| Held-out test improvement | A clear improvement over no skill or the manual seed skill |
| Validation-gate pass rate | Do not look only at training-set gains; check whether validation-set results are stable |
| Skill readability | Domain experts can read, review, and explain it |
| Additional online cost | No additional reflection calls after deployment |
| Transferability | The same skill still helps on adjacent tasks or adjacent models |

12. Risks and Considerations

| Risk | Description | Mitigation |
| --- | --- | --- |
| Project maturity | The PyPI version is `0.1.0`, the classifiers mark it as Alpha, and SkillOpt-Sleep is a preview | Run an R&D PoC first; do not promise a production-grade platform |
| Non-trivial cost | Training requires target-model rollouts and optimizer reflection calls | Estimate sample size, epochs, batch size, and model unit price up front |
| The scorer sets the ceiling | Wrong or noisy scoring sends optimization in the wrong direction | Build a reliable eval first, with manual spot checks where necessary |
| Data leakage | Training samples, trajectories, and failure cases may contain sensitive information | Use desensitized data, private models, or an enterprise gateway |
| Overfitting the validation set | If the validation set is too small or the task distribution is skewed, the skill may fit only local samples | Keep a strict test split and regularly replace or expand the test set |
| Skill document bloat | Continuously appended rules can become long, conflicting, or hard to maintain | Control the learning rate, review manually on a schedule, and rewrite when needed |
| Cannot replace model capability | When the model itself cannot reason or cannot call tools, the skill's gains are limited | Validate model selection and task design first |
| Enterprise integration complexity | Requires a data pipeline, execution sandbox, scorer, and version management | Plan it as a component of the R&D platform rather than selling it as a standalone tool |

13. Differences from Related Approaches

| Approach | Difference |
| --- | --- |
| Prompt engineering | Mostly driven by human experience; SkillOpt emphasizes systematic iteration based on trajectories and validation sets |
| Prompt optimizer | Some optimizers tune only a single prompt; SkillOpt is closer to a training loop, with epochs, learning rate, gate, and slow/meta updates |
| Fine-tuning / LoRA | Fine-tuning changes model weights; SkillOpt changes the external Markdown skill, which is lighter, reviewable, and simple to deploy |
| RAG | RAG addresses knowledge retrieval; SkillOpt addresses the strategy and operating skill of "how to do the task" |
| Agent memory | Memory usually stores facts or experiences; SkillOpt puts more emphasis on consolidating experience into validated task strategies |
| Self-reflecting agent | Plain self-reflection can be unstable; SkillOpt's validation gate and bounded edits make it more controllable |

14. My Pre-Sales Assessment

SkillOpt is a very good fit for projects on the theme of "building frontier agent capabilities". It speaks precisely to a common customer pain point: agents need not only stronger models but also a long-term capability to distill skills from failures in a form that can be validated, versioned, and transferred. For customers with an R&D team, an evaluation mindset, and plans for an agent platform, it can serve as a technical reference or PoC tool for a "continuous agent optimization system".

Pre-sales messaging should nevertheless stay measured: SkillOpt is currently closer to a research framework and developer tool than to an off-the-shelf enterprise platform. Real deployment presupposes that the customer has scorable tasks, datasets, a model budget, and an engineering team to integrate it. The recommended entry point is a narrow-scenario PoC, such as knowledge-base Q&A, spreadsheet processing, code-fix conventions, or ticket SOPs, rather than a promise that "every agent evolves automatically".

If the customer is building AI coding, an agent platform, an enterprise knowledge assistant, or an office-automation assistant, SkillOpt can be a strong highlight of the proposal: we can do more than write prompts; we can build a continuous-improvement loop of "experience collection → task replay → skill optimization → validation gating → manual review → release".

15. Common Customer Q&A

| Question | Answer |
| --- | --- |
| Is SkillOpt model fine-tuning? | No. It does not modify model weights; it optimizes the external Markdown skill document. |
| Can it directly make our agent stronger? | It needs a task set, a scorer, and a validation set. It provides an optimization framework, not a magic button that works without data or evaluation. |
| Will online inference be slower? | After deployment, only `best_skill.md` is added for the agent to use. In principle no extra reflection calls are added; the extra cost is in the training phase. |
| Can it be deployed privately? | The code is open source under the MIT license and supports multiple model backends. Private deployment still depends on the customer's model APIs, data security, runtime environment, and dependencies. |
| Which agents is it suitable for? | It best suits task-oriented agents whose tasks can be repeated and scored and whose error patterns can be summarized, such as Q&A, spreadsheets, code, SOPs, and office automation. |
| What if the scorer is inaccurate? | Then optimization may go in the wrong direction. The first step of a PoC should be building a reliable evaluation, not piling on training rounds. |
| Can SkillOpt-Sleep be used directly in production? | It is officially marked as a preview and suits demos and internal pilots. Production use requires a careful evaluation of stability, security, and interfaces. |