Microsoft SkillOpt - AI Navigation


SkillOpt is Microsoft's open-source "natural language skill optimizer". It does not train model weights. Instead, it treats the Markdown skill document used by the Agent as the trainable state and repeatedly runs a loop of rollout, reflection, editing, and validation gating to produce a deployable 'best_skill.md'. It is suited to capturing Agent capabilities as reusable assets, automating prompt/skill engineering, continuously improving evaluable tasks, and R&D PoCs. However, it is not an out-of-the-box enterprise Agent platform: it requires datasets, scorers, model APIs, a running-cost budget, and engineering integration.

1. Project Overview

Project	Information
GitHub	microsoft/SkillOpt
Project Page	https://microsoft.github.io/SkillOpt/
Paper	arXiv:2605.23904
PyPI	skillopt
Project Positioning	Optimizing Natural Language Agent Skills with Deep Learning Training Loops
Open Source License	MIT
Main Language	Python
Python Requirements	Python >= 3.10
Latest PyPI Version	'0.1.0' (checked 2026-06-27)
Latest GitHub Release	'v0.1.0', published 2026-06-02 (checked 2026-06-27)
GitHub Popularity	About 9.5k stars, 903 forks, 12 open issues (checked 2026-06-27)
Main Topics	'agent-skills', 'self-evolving-agents'
Core Output	A deployable Markdown skill document produced by training, 'best_skill.md'

2. Key Diagrams

Project home page

![[17-TEMPORARY ATTACHMENT/SkillOpt/skillopt-home.png]]

The project page positions SkillOpt as "text-space optimization for frozen agents": the target model is frozen, its weights are not changed, and only the natural language skill document is optimized.

Method overview

![[17-TEMPORARY ATTACHMENT/SkillOpt/teaser-1.png]]

This diagram works well for explaining the idea to business stakeholders: SkillOpt does not train a large model; it trains a "skill manual" that the Agent uses. The final deliverable is a short, portable, versionable skill document.

Training pipeline

![[17-TEMPORARY ATTACHMENT/SkillOpt/pipeline-1.png]]

This pipeline diagram is the most important one for pre-sales and solution discussions. It shows SkillOpt's basic closed loop: frozen Agent with the current skill document → rollout on the training set → the optimizer model proposes add/delete/replace edits based on the trajectories → edits are merged and clipped → a candidate skill is generated → validation set gate → accept or reject → output 'best_skill.md'.

Training Trends

![[17-TEMPORARY ATTACHMENT/SkillOpt/epoch-trends-1.png]]

The trend chart illustrates SkillOpt's "continuous optimization" idea: instead of writing the prompt once and for all, task performance, failed samples, and validation set performance all feed into long-term iteration.

3. What is it?

SkillOpt can be understood as an "agent skill training framework", but "training" here means training natural language skill documents, not model parameters.

Traditional Agent skills or prompts usually come from one of several sources:

Approach	Problem
Manual writing	Relies on expert experience, iterates slowly, and makes it hard to systematically learn from failed samples
One-shot generation by a strong model	The first draft is fast, but it does not necessarily get better on real tasks
Agent self-reflection edits	Prone to drift, hard to control, and may make things worse
Manual A/B testing of prompts	Controllable but costly and hard to scale

SkillOpt's idea is to treat a Markdown skill document as the "external trainable state" of a frozen agent and to train it with a discipline similar to that of a deep learning optimizer. It introduces mechanisms such as epochs, batch size, learning rate, a validation gate, slow updates, and a meta skill, so that natural language skills can be iterated on through feedback instead of relying on one-off prompt engineering.

4. What does it mainly do?

Capability	Description	Pre-Sales Value
Train Markdown skills	Extracts add/delete/replace edits from task trajectories and scoring results, and updates the skill document	Turns expert experience and lessons from failures into reusable assets
Frozen target model	Optimizes only the external skill document without modifying the target LLM's weights	Fits an enterprise's existing models/agents better; no need to fine-tune the base model
Validation-gated updates	Candidate edits are accepted only when the hold-out validation score strictly improves	Reduces the risk of "the more you optimize, the worse it gets" and makes the controllability story easier to tell in pre-sales
Outputs 'best_skill.md'	The final deployed artifact is an ordinary Markdown file	Easy to review, easy to version, easy to plug into skill systems such as Codex/Claude Code
Multi-backend support	The release notes mention support for OpenAI, Azure OpenAI, Claude, Qwen, MiniMax, etc.	Suitable for multi-model selection and for evaluating domestic/private deployment paths
Multi-benchmark support	Built-in SearchQA, DocVQA, ALFWorld, OfficeQA, SpreadsheetBench, LiveMath, etc.	Can be used for R&D evaluation across different agent task types
WebUI dashboard	Optional Gradio WebUI for monitoring training	Lets the R&D team observe experiments as they run
SkillOpt-Sleep preview	Reviews coding agent session history offline at night, mines recurring tasks, and generates skill suggestions	Can be packaged as a proof of concept for "the more agents are used, the more organizational experience accumulates"
Early plug-in ecosystem	The repository contains plug-in directories for Claude Code, Codex, Copilot, Devin, and OpenClaw	Shows that it is exploring integration with mainstream coding agents
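
To make the add/delete/replace edits in the table above concrete, here is a minimal Python sketch of applying such edits to a Markdown skill. The `Edit` class and `apply_edits` function are hypothetical names invented for illustration, not SkillOpt's actual API, and matching whole lines is an assumed simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One structured edit to a skill document (hypothetical schema)."""
    op: str            # "add", "delete", or "replace"
    target: str = ""   # existing line to delete or replace (unused for "add")
    text: str = ""     # new line for "add" or "replace"


def apply_edits(skill: str, edits: list[Edit]) -> str:
    """Apply edits in order, matching whole lines, and return the new skill text."""
    lines = skill.splitlines()
    for edit in edits:
        if edit.op == "add":
            lines.append(edit.text)
        elif edit.op == "delete":
            lines = [ln for ln in lines if ln != edit.target]
        elif edit.op == "replace":
            lines = [edit.text if ln == edit.target else ln for ln in lines]
        else:
            raise ValueError(f"unknown edit op: {edit.op}")
    return "\n".join(lines)


seed = "# Search skill\n- Answer briefly.\n- Guess if unsure."
patched = apply_edits(seed, [
    Edit("replace", target="- Guess if unsure.", text="- Say 'unknown' if unsure."),
    Edit("add", text="- Quote the supporting passage."),
])
print(patched)
```

Because the result is still plain Markdown, each accepted edit can be reviewed as an ordinary text diff.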

5. Technical Principle: Move Deep Learning Training Ideas to Text Space

SkillOpt's training loop can be explained with the following table:

Deep Learning Concept	SkillOpt Counterpart	Explanation
Model parameters	Skill document	The Markdown skill document is the trainable state
Forward pass	Rollout	The target agent performs tasks with the current skill, producing trajectories and scores
Loss / error	Failed trajectories, low-scoring samples	A scorer judges which tasks were not done well
Backward pass	Reflect	The optimizer model analyzes the trajectories and proposes edit patches
Gradient	Add / delete / replace edits	Structured edits to the skill document
Learning rate	Budget for the number of edits accepted per step	Controls the size of each modification to avoid excessive updates
Gradient clipping	Rank / clip edits	Keeps only the most relevant edits
Validation set	Selection split / validation gate	A candidate skill is accepted only if it improves on the validation set
Checkpoint	'best_skill.md'	Saves the skill document with the best validation performance
Momentum / memory	Slow update / meta skill	Summarizes long-term strategies and guards against forgetting at epoch boundaries
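
As an illustration of the "learning rate" and "gradient clipping" rows, the sketch below ranks proposed edits and keeps only as many as a per-step budget allows. The relevance scores and the `rank_and_clip` name are assumptions made for this example; this note does not describe how SkillOpt actually scores and selects edits.

```python
def rank_and_clip(edits, budget):
    """Keep the `budget` most relevant edits.

    `edits` is a list of (description, relevance) pairs; the relevance
    score is a hypothetical number, e.g. how many failed tasks an edit addresses.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(edits, key=lambda pair: pair[1], reverse=True)
    return ranked[:budget]


proposed = [
    ("add: cite the source passage", 0.2),
    ("replace: 'guess' -> 'say unknown'", 0.9),
    ("delete: outdated formatting rule", 0.5),
]
kept = rank_and_clip(proposed, budget=2)
print([description for description, _ in kept])
```

A smaller budget plays the role of a lower learning rate: fewer edits land per step, so the skill document changes more slowly.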

The core process is as follows:

flowchart LR
    A["Prepare task dataset"] --> B["Current skill document"]
    B --> C["Target Agent rollout"]
    C --> D["Trajectories and scoring results"]
    D --> E["Optimizer model reflects on failure/success patterns"]
    E --> F["Generate add/delete/replace edits"]
    F --> G["Merge, rank, and clip by learning rate"]
    G --> H["Candidate skill"]
    H --> I{"Does it improve at the validation set gate?"}
    I -->|Accept| J["Update current skill / best_skill.md"]
    I -->|Reject| K["rejected-edit buffer"]
    J --> C
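
The loop in the flowchart above can be sketched as a toy Python program. Everything here is a stand-in chosen for illustration: the skill is a set of rule names, a task succeeds when the skill contains the rule it needs, and `propose_edit` replaces the optimizer model's reflection with a trivial lookup. It is not SkillOpt's implementation; it only shows how a strict validation gate and a rejected-edit buffer interact.

```python
def score(skill, tasks):
    """Toy scorer: a task succeeds when the skill contains the rule it needs."""
    return sum(task in skill for task in tasks) / len(tasks)


def propose_edit(skill, train_tasks, rejected):
    """Stand-in for reflection: propose the rule needed by the first failed task."""
    for task in train_tasks:
        if task not in skill and task not in rejected:
            return task
    return None


def train(train_tasks, val_tasks, max_steps=10):
    skill = frozenset()
    best_val = score(skill, val_tasks)
    rejected = set()
    for _ in range(max_steps):
        edit = propose_edit(skill, train_tasks, rejected)
        if edit is None:              # nothing left to propose
            break
        candidate = skill | {edit}
        candidate_val = score(candidate, val_tasks)
        if candidate_val > best_val:  # gate: strict improvement only
            skill, best_val = candidate, candidate_val
        else:
            rejected.add(edit)        # remember it so it is not proposed again
    return skill, best_val, rejected


skill, best_val, rejected = train(
    train_tasks=["cite", "units", "typo_quirk"],
    val_tasks=["cite", "units", "dates"],
)
print(sorted(skill), round(best_val, 2), sorted(rejected))
```

In this run the "typo_quirk" rule helps only a training task, so the gate rejects it; that is the mechanism by which edits that overfit the training split are filtered out.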

6. Applicable Scenarios

6.1 Agent Skill Accumulation and Continuous Optimization

This fits customers who already use or are preparing to build Agents but find that they have "written a lot of prompts, experience is scattered across documents/group chats/tickets, and different teams keep hitting the same pitfalls". SkillOpt's selling point is that it turns failed trajectories into reviewable skill document updates, building up reusable agent operating experience for the organization.

6.2 Automatic Prompt/Skill Optimization for Evaluable Tasks

If the customer has a clear task set and scoring criteria, such as question-answering accuracy, spreadsheet processing accuracy, code task pass rate, or document extraction accuracy, SkillOpt can be used as an "automated prompt optimizer" or "skill optimizer" for R&D experiments.

6.3 Long-Term Memory / Nightly Review for Coding Agents

The SkillOpt-Sleep preview targets local coding agents such as Claude Code, Codex, and Copilot. The idea is to review historical sessions at night, mine recurring tasks, replay them offline, and then consolidate the verified experience into long-term skills. This direction suits pre-sales conversations about "how enterprise coding assistants accumulate organizational habits, project conventions, and common fixes".

6.4 Evaluating Enterprise Agent Platform Capabilities

SkillOpt ships with multiple built-in benchmarks and allows new benchmarks to be added. For enterprises that need to build agent platforms or compare models and agent harnesses, it can serve as part of the R&D evaluation/optimization pipeline.

6.5 Productizing Expert Experience

In operations and maintenance, finance, legal affairs, data analysis, Office automation, and similar fields, experts know "how to do things so they are less likely to go wrong". With SkillOpt, this experience can be written into an initial skill and then continuously improved through real tasks and validation sets.

7. Unsuitable Scenarios

Scenario	Reason
Open-ended creation without evaluable signals	SkillOpt relies heavily on rollout scoring and the validation gate; without a scorer, it is hard to tell whether an edit is actually good
Enterprises that only want out-of-the-box SaaS	SkillOpt is an open-source research/development framework, not a complete commercial platform
Very few task samples that cannot be reproduced	Without a train/validation/test split, the reliability of optimization is weakened
Real-time optimization during online inference	SkillOpt's strength is training skills offline; it adds no model calls at online inference time
High-noise scorer	If scoring is unstable, the optimizer learns the wrong signal
Needs capability gains at the model-parameter level	It does not fine-tune model weights
Extremely sensitive to LLM call costs	The training phase requires many rollout, reflection, and evaluation calls, so estimate the budget before the PoC
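
One way to check the "high-noise scorer" risk before a PoC is to score the same outputs several times and look at the spread. The sketch below is an assumed, generic check rather than a SkillOpt feature; `exact_match` and `noisy_judge` are made-up scorers, with the second simulating an unstable LLM-as-judge by adding random jitter.

```python
import random
import statistics


def scorer_noise(scorer, outputs, repeats=5):
    """Score each output `repeats` times; return the mean per-output standard deviation."""
    spreads = []
    for output in outputs:
        scores = [scorer(output) for _ in range(repeats)]
        spreads.append(statistics.pstdev(scores))
    return statistics.mean(spreads)


def exact_match(output):
    """Deterministic scorer: 1.0 when the answer equals the reference "42"."""
    return 1.0 if output == "42" else 0.0


rng = random.Random(0)


def noisy_judge(output):
    """Simulated unstable scorer: exact match plus random jitter."""
    return exact_match(output) + rng.uniform(-0.3, 0.3)


print("exact_match noise:", scorer_noise(exact_match, ["42", "41"]))
print("noisy_judge noise:", scorer_noise(noisy_judge, ["42", "41"]))
```

If the measured spread is comparable to the score gains you expect from a skill edit, the validation gate cannot reliably tell good edits from bad ones.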

8. How to Use

8.1 Installation

Official installation method:

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

Install from PyPI:

pip install skillopt

Optional dependencies:

pip install -e ".[webui]"
pip install -e ".[claude]"
pip install -e ".[qwen]"
pip install -e ".[alfworld]"

8.2 Configure Model Keys

The official documentation requires at least one model backend to be configured, such as Azure OpenAI, OpenAI, Anthropic Claude, or local Qwen.

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-key

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
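
Before launching a run, it can help to confirm that at least one backend's variables are actually set. The following is a small helper sketch that is not part of SkillOpt: the variable names come from the listing above, while the backend labels and the `configured_backends` function are hypothetical.

```python
import os

# Environment variables per backend, as listed above; the dictionary keys are labels only.
BACKEND_ENV_VARS = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def configured_backends(env=None):
    """Return the backend labels whose variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, keys in BACKEND_ENV_VARS.items()
        if all(env.get(key) for key in keys)
    ]


ready = configured_backends()
if ready:
    print("Configured backends:", ", ".join(ready))
else:
    print("No backend configured; set at least one group of variables.")
```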

8.3 Run the First Experiment

SearchQA is the officially recommended starting point because it is relatively the fastest:

python scripts/train.py --config configs/searchqa/default.yaml

Training output is usually saved in:

outputs///
├── steps/
├── slow_update/
├── meta_skill/
├── skills/
├── best_skill.md
├── history.json
└── config.yaml
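
To inspect a finished run from Python, a helper along these lines may be enough. It is a sketch under assumptions: `inspect_run` is a hypothetical name, only the file names 'best_skill.md' and 'history.json' come from the listing above, and because the schema of 'history.json' is not described in this note the code only parses it without interpreting any fields.

```python
import json
from pathlib import Path


def inspect_run(run_dir):
    """Return the best-skill files found under run_dir and the parsed history, if present."""
    run_dir = Path(run_dir)
    # Search recursively, since the skill file may sit in a subdirectory.
    skills = sorted(run_dir.rglob("best_skill.md"))
    history = None
    history_file = run_dir / "history.json"
    if history_file.is_file():
        history = json.loads(history_file.read_text(encoding="utf-8"))
    return skills, history
```

Call it with the path of one run directory, then print the returned skill paths and, if a history was found, its top-level type to see how it is organized.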

Evaluate the best skill:

python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/searchqa//skills/best_skill.md

8.4 Start the WebUI

pip install -e ".[webui]"
python -m skillopt_webui.app

The default port is '7860'.

8.5 Use Existing Checkpoint Skills

The repository's 'ckpt/' directory provides a number of GPT-5.5-optimized skills from the paper, covering SearchQA, ALFWorld, DocVQA, OfficeQA, SpreadsheetBench, and LiveMath. They are not general-purpose tools; they are meant for reproducing experiments or as reference artifacts of portable skills.

9. Architecture, Deployment, and Integration

9.1 Main Modules

Module	Role
'skillopt/engine'	Main training loop
'skillopt/gradient'	Trajectory reflection, edit patch aggregation
'skillopt/optimizer'	Edit selection, learning rate, slow update, meta skill, etc.
'skillopt/evaluation'	Validation gate
'skillopt/envs'	Built-in benchmark adapters
'skillopt/model'	Azure OpenAI, OpenAI, Claude, Qwen, MiniMax, Codex/Claude Code harnesses, and more
'skillopt_webui'	Gradio WebUI
'skillopt_sleep'	Deployment-time companion (preview)
'plugins/'	Integrations for Claude Code, Codex, Copilot, Devin, and OpenClaw

9.2 Typical Enterprise Deployment Pattern

flowchart TD
    A["Enterprise task samples / historical Agent sessions"] --> B["Build train / validation / test"]
    B --> C["Define scorer / acceptance rules"]
    C --> D["SkillOpt offline training"]
    D --> E["Produce best_skill.md"]
    E --> F["Manual review and version management"]
    F --> G["Publish to Agent platform / Coding Agent / Prompt Gateway"]
    G --> H["Online use"]
    H --> I["Collect new trajectories and failed samples"]
    I --> D

In pre-sales, emphasize that a real SkillOpt deployment does not "automatically get stronger after installation"; the task data, scorer, validation set, skill release process, and review mechanism must be designed together.
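
The "publish" step can be as simple as loading the reviewed skill file and attaching it to the agent's system prompt. The sketch below is one assumed way to do that and is not tied to any particular agent platform: `load_skill` and `build_system_prompt` are hypothetical helpers, and the content hash is just a convenient version tag for release tracking.

```python
import hashlib
from pathlib import Path


def load_skill(skill_path):
    """Read the skill file; return its text and a short content hash usable as a version tag."""
    text = Path(skill_path).read_text(encoding="utf-8").strip()
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, version


def build_system_prompt(base_prompt, skill_text):
    """Append the static skill document to the agent's base system prompt."""
    return f"{base_prompt.strip()}\n\n{skill_text}\n"
```

Because the skill is plain text loaded once at startup, serving it adds no extra model calls; only the prompt gets longer.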

10. Pre-Sales Talking Points

For business audiences

Customer Concern	Suggested Wording
The Agent keeps making the same mistakes	"SkillOpt's value is that it turns failures into reusable skills, instead of someone manually patching the prompt every time."
Organizational experience is hard to capture	"It produces Markdown skill documents that can be reviewed, versioned, reused, and added to the enterprise Agent's skill library."
We don't want to train a model	"It doesn't change model weights; it only optimizes external skill documents, which works better with the enterprise's existing models and with closed-source models."
Worried that optimization makes things worse	"It introduces a validation set gate: a candidate skill is accepted only if it performs better on the hold-out validation set."
How does a Coding Agent come to understand the project better over time	"The SkillOpt-Sleep direction is to review historical sessions at night, mine recurring tasks, and generate verified long-term skills."

For technical audiences

Technical Question	Suggested Answer
What inputs are required	"A task dataset, an initial skill, an executable rollout, a scorer, and a model API."
What is the output	"The output is 'best_skill.md', which is essentially a deployable natural language skill."
Does it increase online inference cost	"The training phase increases call costs. After deployment, the skill is a static document and adds no extra model calls at inference time."
How to extend it to your own tasks	"Implement a dataloader, rollout/scorer, EnvAdapter, and YAML config, and then register the benchmark."
How to control quality	"Use a train/validation/test split, the validation gate, manual review, and versioned releases."

11. PoC Recommendations

11.1 Suitable PoC Topics

PoC Topic	Why It Fits
Optimizing Q&A skills for an enterprise knowledge base	Reference answers or manual labels exist, so accuracy is easy to evaluate
Spreadsheet / Office automation	Outputs are comparable, which suits automated scoring
Capturing project conventions for a Coding Agent	Can be evaluated by test pass rate, lint, and review checklists
Customer service / operations SOP Agent	Historical tickets, standard processes, and clear success criteria exist
Data analysis Agent	Fixed analysis tasks with expected results make it well suited to being captured as a skill

11.2 PoC Scope

Item	Recommendation
Duration	2-4 weeks is reasonable
Data	Prepare at least dozens to hundreds of scorable tasks, split into train/validation/test
Initial skill	Have a domain expert write a seed skill first; starting completely blank is not recommended
Model	Get the pipeline running with a strong, stable model first, then evaluate domestic/private models
Risky actions	Run only offline evaluation or sandboxed tasks; do not operate on production systems directly
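
For the data row above, a reproducible train/validation/test split is worth fixing early so that later runs stay comparable. A minimal sketch follows, assuming a flat list of task identifiers; the 60/20/20 ratio and the `split_tasks` name are illustrative choices, not a SkillOpt requirement.

```python
import random


def split_tasks(tasks, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle deterministically, then cut into train/validation/test lists."""
    if val_frac + test_frac >= 1.0:
        raise ValueError("val_frac + test_frac must leave room for training data")
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


tasks = [f"task-{i}" for i in range(100)]
train_set, val_set, test_set = split_tasks(tasks)
print(len(train_set), len(val_set), len(test_set))
```

Keeping the seed in version control alongside the task list means the held-out test split never silently changes between experiments.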

11.3 Example Acceptance Metrics

Metric	Suggested Target
Held-out test improvement	A clear improvement over no skill or over the manual seed skill
Validation gate pass rate	Don't look only at training set gains; check whether the validation set is stable
Skill readability	Domain experts can read, review, and explain it
Additional online cost	No additional reflection calls after deployment
Transferability	The same skill still helps on adjacent tasks or adjacent models
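
The held-out improvement metric can be computed from per-task scores on the test split, once with the baseline (no skill or seed skill) and once with the optimized skill. The sketch below assumes both score lists are aligned by task; `held_out_lift` and the sample numbers are made up for illustration.

```python
from statistics import mean


def held_out_lift(baseline_scores, skill_scores):
    """Compare per-task scores on the same held-out tasks, in the same order."""
    if len(baseline_scores) != len(skill_scores):
        raise ValueError("score lists must cover the same tasks")
    deltas = [s - b for b, s in zip(baseline_scores, skill_scores)]
    return {
        "baseline": mean(baseline_scores),
        "with_skill": mean(skill_scores),
        "lift": mean(deltas),
        "wins": sum(d > 0 for d in deltas),    # tasks that improved
        "losses": sum(d < 0 for d in deltas),  # tasks that regressed
    }


report = held_out_lift(baseline_scores=[0, 1, 0, 1], skill_scores=[1, 1, 0, 1])
print(report)
```

Reporting wins and losses next to the average makes regressions visible even when the overall lift is positive.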

12. Risks and Considerations

Risk	Description	Mitigation
Project maturity	The PyPI version is '0.1.0', the classifiers mark it as Alpha, and SkillOpt-Sleep is a preview	Start with an R&D PoC; do not promise a production-grade platform up front
Cost cannot be ignored	The training phase requires target model rollouts and optimizer reflection calls	Estimate sample size, epochs, batch size, and model unit price first
The scorer sets the ceiling	Wrong or noisy scoring sends optimization in the wrong direction	Build a reliable eval first, and spot-check manually where necessary
Data leakage	Training samples, trajectories, and failure cases may contain sensitive information	Use desensitized data, private models, or an enterprise gateway
Overfitting the validation set	If the validation set is too small or the task distribution is biased, the skill may only fit local samples	Strictly hold out the test split, and regularly replace/expand the test set
Skill document bloat	Continuously appended rules can become long, conflicting, or hard to maintain	Control the learning rate, review manually on a regular basis, and rewrite when needed
Cannot replace model capability	When the model itself cannot reason or cannot call tools, the gain from skills is limited	Validate model selection and task design first
Enterprise integration complexity	Requires a data pipeline, execution sandbox, scorer, and version management	Plan it as a component of the R&D platform rather than selling it as a point tool
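
For the cost row, a back-of-the-envelope call count can be derived from sample size, epochs, and batch size. The model below is a deliberately rough assumption, not SkillOpt's real accounting: it counts one rollout per training task per epoch, a fixed number of reflection calls per step, and one full validation pass per step. Real agent rollouts often take several model calls each, so treat the result as a lower bound.

```python
import math


def estimate_training_calls(train_size, val_size, epochs, batch_size,
                            reflect_calls_per_step=1):
    """Rough call count under the simplifying assumptions stated above."""
    steps = epochs * math.ceil(train_size / batch_size)
    rollout_calls = epochs * train_size           # one rollout per task per epoch
    reflect_calls = steps * reflect_calls_per_step
    gate_calls = steps * val_size                 # one validation pass per step
    return {
        "steps": steps,
        "rollout": rollout_calls,
        "reflect": reflect_calls,
        "gate": gate_calls,
        "total": rollout_calls + reflect_calls + gate_calls,
    }


calls = estimate_training_calls(train_size=100, val_size=30, epochs=3, batch_size=10)
price_per_call = 0.02  # assumed average price per call; replace with your model's rate
print(calls, round(calls["total"] * price_per_call, 2))
```

Under these assumptions the validation gate dominates the total, which is one reason to keep the validation split only as large as it needs to be.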

13. Differences from Related Approaches

Approach	Difference
Prompt engineering	Prompt engineering is mostly driven by human experience; SkillOpt emphasizes systematic iteration based on trajectories and validation sets
Prompt optimizer	Some optimizers only optimize a single prompt; SkillOpt is more like a training loop, with epochs, learning rate, gate, and slow/meta updates
Fine-tuning / LoRA	Fine-tuning changes model weights; SkillOpt changes an external Markdown skill, which is lighter, reviewable, and simple to deploy
RAG	RAG addresses knowledge retrieval; SkillOpt addresses the strategies and operating skills for "how to do the task"
Agent Memory	Memory usually stores facts or experiences; SkillOpt puts more emphasis on consolidating experience into verified task strategies
Self-reflecting Agent	Ordinary self-reflection can be unstable; SkillOpt's validation gate and bounded edits make it more controllable

14. My Pre-Sales Assessment

SkillOpt is a very good fit for the "frontier agent capability building" topic. It speaks precisely to a common customer pain point: Agents need not only stronger models but also a long-term capability to distill skills from failures in a way that is verifiable, versionable, and portable. For customers with an R&D team, evaluation awareness, and Agent platform plans, it can serve as a technical reference or PoC tool for an "Agent continuous optimization system".

But the pre-sales message should stay measured: SkillOpt is currently more of a research framework and developer tool than an off-the-shelf enterprise platform. Real adoption presupposes that the customer has scorable tasks, datasets, a model budget, and an engineering team to handle integration. The recommended entry point is a narrow-scenario PoC, such as knowledge base Q&A, spreadsheet processing, code fix conventions, or ticket SOPs, rather than promising "automatic evolution of all Agents".

If the customer is building AI coding, an Agent platform, an enterprise knowledge assistant, or an office automation assistant, SkillOpt can be a strong highlight of the proposal: we do not just write prompts, we can also build a continuous improvement loop of "experience collection → task replay → skill optimization → validation gating → manual review → release".

15. Common Customer Q & A


Question	Answer
Is SkillOpt a form of model fine-tuning?	No. It does not modify model weights; it optimizes the external Markdown skill document.
Can it directly make our agent stronger?	It needs a task set, a scorer, and a validation set. It provides an optimization framework, not a magic button that works without data or evaluation.
Will online inference be slower?	After deployment, only 'best_skill.md' is added for the agent to use. In principle, no additional reflection calls are introduced. The extra cost is in the training phase.
Can it be deployed privately?	The code is open source under MIT and supports multiple model backends. However, private deployment depends on the customer's model API, data security, runtime environment, and dependencies.
Which agents is it suitable for?	It suits task-based agents whose tasks are repeatable and scorable and whose error patterns can be summarized, such as Q&A, spreadsheets, code, SOPs, and Office automation.
What if the scorer is inaccurate?	Then optimization may go in the wrong direction. The first step of a PoC should be building a reliable evaluation, not piling on training rounds.
Can SkillOpt-Sleep be used directly in production?	It is officially marked as a preview and is suitable for demos and internal pilots. Production use requires a careful evaluation of the stability and security of the interface.

16. References

- GitHub: microsoft/SkillOpt
- Project Page: SkillOpt
- Paper: SkillOpt: Executive Strategy for Self-Evolving Agent Skills
- PyPI: skillopt
- Release v0.1.0
- Official Document: Installation
- Official Document: Training Loop
- Official Document: First Experiment
- Official Document: Skill Document
- Official Document: SkillOpt-Sleep