1. Project Overview
| Project | Information |
|---|---|
| GitHub | microsoft/SkillOpt |
| Project Page | https://microsoft.github.io/SkillOpt/ |
| Paper | arXiv:2605.23904 |
| PyPI | skillopt |
| Project Positioning | Optimizing Natural Language Agent Skills with Deep Learning Training Loops |
| Open Source License | MIT |
| Main Language | Python |
| Python Requirements | Python >= 3.10 |
| Latest PyPI Version | '0.1.0', check date: 2026-06-27 |
| Latest GitHub Release | 'v0.1.0', published on 2026-06-02, check date: 2026-06-27 |
| GitHub Popularity | About 9.5k stars, 903 forks, 12 open issues, check date: 2026-06-27 |
| Main Topics | 'agent-skills', 'self-evolving-agents' |
| Core Product | A Markdown skill document, 'best_skill.md', that can be deployed after training |
2. Key Diagrams
Project home page
![[17-TEMPORARY ATTACHMENT/SkillOpt/skillopt-home.png]]
The project page positions SkillOpt as "text-space optimization for frozen agents": the target model is frozen, its weights are left unchanged, and only the natural-language skill document is optimized.
Method overview
![[17-TEMPORARY ATTACHMENT/SkillOpt/teaser-1.png]]
This diagram is useful for explaining the idea to business stakeholders: SkillOpt does not train a large model; it trains a "skill manual" for the agent to use. The final deliverable is a short, portable, versionable skill document.
Training pipeline
![[17-TEMPORARY ATTACHMENT/SkillOpt/pipeline-1.png]]
This pipeline diagram is the most important one for pre-sales and solution discussions. It shows SkillOpt's basic closed loop: frozen agent plus current skill document → rollout on the training set → the optimizer model proposes add/delete/replace edits based on the trajectories → edits are merged and trimmed → a candidate skill is generated → validation-set gate → accept or reject → output 'best_skill.md'.
Training Trends
![[17-TEMPORARY ATTACHMENT/SkillOpt/epoch-trends-1.png]]
The trend chart illustrates SkillOpt's "continuous optimization" idea: rather than writing the prompt once and for all, task performance, failure samples, and validation-set results all feed into long-term iteration.
3. What is it?
SkillOpt can be understood as an "agent skill training framework", but "training" here means training natural-language skill documents, not model parameters.
Traditional agent skills or prompts usually come from the following sources, each with its own problem:
| Approach | Problem |
|---|---|
| Manual writing | Relies on expert experience, iterates slowly, and makes it hard to replay failed samples systematically |
| One-shot generation by a strong model | The first draft is fast, but it does not necessarily get better on real tasks |
| Agent self-reflection edits | Prone to drift, hard to control, and may make the skill worse |
| Manual A/B testing of prompts | Controllable but costly and hard to scale |
SkillOpt's idea is to treat a Markdown skill document as the frozen agent's "external trainable state" and to train it with a discipline similar to a deep learning optimizer. It introduces mechanisms such as epochs, batch size, learning rate, a validation gate, slow updates, and a meta skill, so that natural-language skills can be iterated through feedback instead of relying on one-off prompt engineering.
4. What does it mainly do?
| Capability | Description | Pre-Sales Value |
|---|---|---|
| Train Markdown skills | Extracts add/delete/replace edits from task trajectories and scoring results, then updates the skill document | Turns expert experience and failure lessons into reusable assets |
| Frozen target model | Optimizes only the external skill document; the target LLM's weights are not modified | Fits an enterprise's existing models and agents; no fine-tuning of the base model is needed |
| Validation-gated updates | A candidate edit is accepted only when the held-out validation score strictly improves | Reduces the risk of "the more you optimize, the worse it gets" and makes controllability easier to explain in pre-sales |
| Outputs 'best_skill.md' | The final deployable artifact is an ordinary Markdown file | Easy to review, version, and plug into skill systems such as Codex and Claude Code |
| Multi-backend support | The release notes mention support for OpenAI, Azure OpenAI, Claude, Qwen, MiniMax, and others | Suits multi-model selection and evaluation of domestic or private-deployment paths |
| Multi-benchmark support | Built-in SearchQA, DocVQA, ALFWorld, OfficeQA, SpreadsheetBench, LiveMath, and others | Can be used for R&D evaluation across different agent task types |
| WebUI dashboard | Optional Gradio WebUI for monitoring training | Lets the R&D team observe experiments as they run |
| SkillOpt-Sleep preview | Reviews coding-agent session history offline at night, mines recurring tasks, and generates skill suggestions | Can be packaged as a proof of concept for "the more the agent is used, the more organizational experience accumulates" |
| Early plug-in ecosystem | The repository contains plug-in directories for Claude Code, Codex, Copilot, Devin, and OpenClaw | Shows that the project is exploring integration with mainstream coding agents |
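To make the add/delete/replace idea concrete, here is a minimal sketch. It is not SkillOpt's actual implementation: it assumes a skill is simply a list of Markdown lines, and the `Edit` record, the `apply_edits` function, and the `max_edits` budget (standing in for the "learning rate") are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    """Hypothetical edit record; not SkillOpt's real data model."""
    op: str                        # "add", "delete", or "replace"
    target: Optional[str] = None   # existing line to delete or replace
    text: Optional[str] = None     # new line for add or replace
    score: float = 0.0             # assumed relevance score from the optimizer


def apply_edits(skill_lines: List[str], edits: List[Edit], max_edits: int) -> List[str]:
    """Apply at most `max_edits` of the highest-scoring edits to a copy of the skill."""
    lines = list(skill_lines)
    # Rank edits and keep only the budgeted number (the "learning rate" analogue).
    chosen = sorted(edits, key=lambda e: e.score, reverse=True)[:max_edits]
    for edit in chosen:
        if edit.op == "add" and edit.text is not None:
            lines.append(edit.text)
        elif edit.op == "delete" and edit.target in lines:
            lines.remove(edit.target)
        elif edit.op == "replace" and edit.target in lines and edit.text is not None:
            lines[lines.index(edit.target)] = edit.text
        # Edits that do not match the current skill are skipped.
    return lines
```

A small budget keeps each step's change easy to review; a larger one moves faster but makes it harder to tell which edit helped.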
5. Technical Principle: Move Deep Learning Training Ideas to Text Space
The SkillOpt training loop can be explained with the following table:
| Deep Learning Concept | SkillOpt Counterpart | Explanation |
|---|---|---|
| Model parameters | Skill document | The Markdown skill document is the trainable state |
| Forward pass | Rollout | The target agent performs tasks with the current skill, producing trajectories and scores |
| Loss / error | Failed trajectories, low-scoring samples | A scorer judges which tasks were done poorly |
| Backward pass | Reflect | The optimizer model analyzes the trajectories and proposes edit patches |
| Gradient | Add / delete / replace edits | Structured edits to the skill document |
| Learning rate | Budget for the number of edits accepted per step | Controls the size of each modification to avoid over-updating |
| Gradient clipping | Rank / clip edits | Keeps only the most relevant edits |
| Validation set | Selection split / validation gate | A candidate skill is accepted only if it improves on the validation set |
| Checkpoint | 'best_skill.md' | Saves the skill document with the best validation performance |
| Momentum / memory | Slow update / meta skill | Summarizes long-term strategy and guards against forgetting at epoch boundaries |
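The core process is as follows: the frozen agent runs rollouts with the current skill → the scorer scores the trajectories → the optimizer model reflects and proposes add/delete/replace edits → the edits are ranked, clipped, and merged into a candidate skill → the validation gate accepts or rejects the candidate → the best skill is saved as 'best_skill.md'.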
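As a minimal sketch of this loop, assuming the rollout, reflection, and editing steps are folded into one caller-supplied function, the gated update can be written as follows. The names `train_skill`, `propose`, and `validate` are hypothetical and do not come from the SkillOpt codebase; the only behavior taken from the description above is that a candidate is kept only when the validation score strictly improves.

```python
from typing import Callable, List, Tuple

Skill = str


def train_skill(
    seed_skill: Skill,
    train_batches: List[list],
    propose: Callable[[Skill, list], Skill],
    validate: Callable[[Skill], float],
    epochs: int = 1,
) -> Tuple[Skill, float]:
    """Gated loop: keep a candidate only if its validation score strictly improves."""
    best_skill = seed_skill
    best_score = validate(seed_skill)
    for _ in range(epochs):
        for batch in train_batches:
            # Rollout, reflection, and editing are all inside `propose` (assumption).
            candidate = propose(best_skill, batch)
            score = validate(candidate)
            if score > best_score:  # validation gate: strict improvement only
                best_skill, best_score = candidate, score
            # Otherwise the candidate is rejected and the previous skill is kept.
    return best_skill, best_score
```

Because a rejected candidate leaves `best_skill` unchanged, the returned skill never scores worse on the validation set than the seed skill did.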
6. Applicable Scenarios
6.1 Agent Skill Capture and Continuous Optimization
This suits customers who already use or plan to build agents but find that they have "written many prompts, scattered their experience across documents, group chats, and tickets, and watched different teams hit the same pitfalls again and again". SkillOpt's selling point is that it turns failure trajectories into reviewable skill-document updates, building up reusable agent operating experience for the organization.
6.2 Automatic Prompt/Skill Optimization for Evaluable Tasks
If the customer has a clear task set and scoring criteria, such as Q&A accuracy, spreadsheet-processing accuracy, code-task pass rate, or document-extraction accuracy, SkillOpt can be used in R&D experiments as an "automated prompt optimizer" or "skill optimizer".
6.3 Long-Term Memory and Nightly Review for Coding Agents
The SkillOpt-Sleep preview targets local coding agents such as Claude Code, Codex, and Copilot. The idea is to review historical sessions at night, mine recurring tasks, replay them offline, and then consolidate the validated experience into long-term skills. This direction works well in pre-sales conversations about "how an enterprise coding assistant accumulates organizational habits, project conventions, and common fix patterns".
6.4 Enterprise Agent Platform Capability Evaluation
SkillOpt ships with multiple built-in benchmarks and allows new ones to be added. For enterprises that need to build an agent platform and compare models and agent harnesses, it can serve as part of the R&D evaluation and optimization pipeline.
6.5 Productizing Expert Experience
In fields such as operations and maintenance, finance, legal, data analysis, and Office automation, experts know "how to do the job without getting it wrong". With SkillOpt, this experience can be written into an initial skill and then improved continuously through real tasks and validation sets.
7. Unsuitable Scenarios
| Scenario | Reason |
|---|---|
| Open-ended creative work without evaluable signals | SkillOpt relies heavily on rollout scoring and the validation gate; without a scorer, it is hard to tell whether an edit actually helps |
| Enterprises that only want out-of-the-box SaaS | SkillOpt is an open-source research and development framework, not a complete commercial platform |
| Very few task samples, or tasks that cannot be reproduced | Without a train/validation/test split, the optimization becomes less reliable |
| Real-time online inference optimization | SkillOpt's strength is training skills offline; it adds no model calls at inference time, so it is not an online optimizer |
| High-noise scorer | If scoring is unstable, the optimizer learns the wrong signal |
| Needs capability gains at the model-parameter level | It does not fine-tune model weights |
| Extremely sensitive to LLM call costs | The training phase requires many rollout, reflection, and evaluation calls; estimate the budget before the PoC |
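The "high-noise scorer" row can be checked before any training run. The sketch below is one simple way to do it, not something SkillOpt provides: it scores the same outputs several times and reports the average spread. The function name `scorer_spread` and the `scorer` callable are hypothetical, and what counts as "too noisy" is left to the team to decide.

```python
import statistics
from typing import Callable, Sequence


def scorer_spread(
    scorer: Callable[[str], float],
    outputs: Sequence[str],
    repeats: int = 5,
) -> float:
    """Return the mean per-output standard deviation across repeated scoring.

    Assumes `outputs` is non-empty and `repeats` is at least 1.
    """
    spreads = []
    for output in outputs:
        # Score the same output several times; a stable scorer gives identical values.
        scores = [scorer(output) for _ in range(repeats)]
        spreads.append(statistics.pstdev(scores))
    return statistics.mean(spreads)
```

A deterministic scorer, such as exact match against a reference answer, yields a spread of zero. An LLM-based judge usually does not, and a spread comparable to the improvement being measured suggests the validation gate would be reacting to noise.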
8. How to Use
8.1 Installation
Official installation method:
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .
Install from PyPI:
pip install skillopt
Optional dependencies:
pip install -e ".[webui]"
pip install -e ".[claude]"
pip install -e ".[qwen]"
pip install -e ".[alfworld]"
8.2 Configure Model Keys
The official documentation requires at least one model backend to be configured, such as Azure OpenAI, OpenAI, Anthropic Claude, or local Qwen.
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-key
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
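Before starting a run, it can help to confirm which backends actually have credentials set. The following is a small sketch using only the environment variable names shown above; the grouping of variables per backend and the names `BACKEND_VARS` and `configured_backends` are assumptions for illustration and are not part of SkillOpt.

```python
import os
from typing import Dict, List, Mapping

# Variable names come from the configuration example; the grouping is an assumption.
BACKEND_VARS: Dict[str, List[str]] = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def configured_backends(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the backends whose required variables are all set and non-empty."""
    return [
        name
        for name, required in BACKEND_VARS.items()
        if all(env.get(var, "").strip() for var in required)
    ]
```

Calling `configured_backends()` with no arguments reads the current process environment; passing a plain dictionary makes the check easy to test.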
8.3 Run the First Experiment
The official documentation recommends SearchQA for a first run because it is the fastest:
python scripts/train.py --config configs/searchqa/default.yaml
Training output is usually saved in:
outputs///
├── steps/
├── slow_update/
├── meta_skill/
├── skills/
├── best_skill.md
├── history.json
└── config.yaml
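When several runs have accumulated, a short helper can locate the most recent 'best_skill.md'. This is a sketch under assumptions: the function name `latest_best_skill` is hypothetical, and it searches recursively because the exact directory layout of a run may differ from the tree shown here.

```python
from pathlib import Path
from typing import Optional


def latest_best_skill(outputs_root: str = "outputs") -> Optional[Path]:
    """Return the most recently modified best_skill.md under outputs_root, if any."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    # Search recursively so the helper does not depend on one exact layout.
    candidates = list(root.rglob("best_skill.md"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```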
Evaluate the best skill:
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill outputs/searchqa//skills/best_skill.md
8.4 Start the WebUI
pip install -e ".[webui]"
python -m skillopt_webui.app
The default port is '7860'.
8.5 Use Existing Checkpoint Skills
The repository's 'ckpt/' directory provides several GPT-5.5-optimized skills from the paper, covering SearchQA, ALFWorld, DocVQA, OfficeQA, SpreadsheetBench, and LiveMath. They are not general-purpose tools; they are intended for reproducing the experiments or as reference examples of portable skills.
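Because a trained or checkpoint skill is a plain Markdown file, the simplest way to use one is to add its text to the agent's instructions. The sketch below shows that idea only; `build_system_prompt` and the "## Skill" heading are hypothetical, and how a particular agent or skill system actually loads skills is outside what this note covers.

```python
from pathlib import Path


def build_system_prompt(base_prompt: str, skill_path: str) -> str:
    """Append the Markdown skill to a base system prompt as static text."""
    skill_text = Path(skill_path).read_text(encoding="utf-8").strip()
    # The "## Skill" heading is an illustrative convention, not a SkillOpt requirement.
    return f"{base_prompt.strip()}\n\n## Skill\n\n{skill_text}\n"
```

The skill is read once and inserted as static text, so this step adds no model calls of its own.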
9. Architecture, Deployment, and Integration
9.1 Main Modules
| Module | Role |
|---|---|
| 'skillopt/engine' | Main training process |
| 'skillopt/gradient' | Trajectory reflection and edit-patch aggregation |
| 'skillopt/optimizer' | Edit selection, learning rate, slow update, meta skill, etc. |
| 'skillopt/evaluation' | Validation gate |
| 'skillopt/envs' | Built-in benchmark adapters |
| 'skillopt/model' | Azure OpenAI, OpenAI, Claude, Qwen, MiniMax, Codex/Claude Code harness, and more |
| 'skillopt_webui' | Gradio WebUI |
| 'skillopt_sleep' | Deployment-time companion capability (preview) |
| 'plugins/' | Integrations for Claude Code, Codex, Copilot, Devin, and OpenClaw |
9.2 Typical Enterprise Deployment Pattern
In pre-sales, emphasize that a real SkillOpt deployment does not "automatically get stronger after installation"; the task data, scorer, validation set, skill release process, and review mechanism all have to be designed together.
10. Pre-Sales Talking Points
For business audiences
| Customer Concern | Suggested Wording |
|---|---|
| The agent keeps making the same mistakes | "SkillOpt's value is that it distills failure experience into reusable skills, instead of having someone manually rework the prompt every time." |
| Organizational experience is hard to capture | "It produces Markdown skill documents that can be reviewed, versioned, reused, and added to the enterprise agent's skill library." |
| Does not want to train a model | "It does not change model weights; it only optimizes external skill documents, which makes it friendlier to the enterprise's existing models and to closed-source models." |
| Worried that optimization will make things worse | "It introduces a validation-set gate: a candidate skill is accepted only if it performs better on the held-out validation set." |
| How can a coding agent understand the project better over time? | "The SkillOpt-Sleep direction is to review historical sessions at night, mine recurring tasks, and generate validated long-term skills." |
For technical audiences
| Technical Question | Suggested Answer |
|---|---|
| What inputs are required? | "A task dataset, an initial skill, an executable rollout, a scorer, and a model API." |
| What is the output? | "The output is 'best_skill.md', which is essentially a deployable natural-language skill." |
| Does it increase online inference cost? | "The training phase adds call costs. After deployment, the skill is a static document and adds no extra inference-time model calls." |
| How do we extend it to our own tasks? | "Implement a dataloader, rollout/scorer, EnvAdapter, and YAML config, then register the benchmark." |
| How is quality controlled? | "Through the train/validation/test split, the validation gate, manual review, and versioned releases." |
11. PoC Recommendations
11.1 Suitable PoC Themes
| PoC Theme | Why it is suitable |
|---|---|
| Q&A skill optimization for an enterprise knowledge base | Reference answers or manual labels exist, so accuracy is easy to evaluate |
| Spreadsheet / Office automation | Outputs are comparable, which allows automated scoring |
| Capturing project conventions for a coding agent | Can be evaluated with test pass rate, lint, and a review checklist |
| Customer service / operations SOP agent | Historical tickets, standard processes, and clear success criteria are available |
| Data analysis agent | Fixed analysis tasks and expected results make it well suited to skill optimization |
11.2 PoC Scope
| Item | Recommendation |
|---|---|
| Duration | 2-4 weeks is reasonable |
| Data | Prepare at least dozens to hundreds of scorable tasks, split into train/validation/test |
| Initial skill | Have a domain expert write a seed skill first; starting from a completely blank skill is not recommended |
| Model | Get the loop running with a strong, stable model first, then evaluate domestic or privately deployed models |
| Risky actions | Run only offline evaluation or sandboxed tasks; do not operate production systems directly |
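The data row calls for a train/validation/test split. A reproducible way to produce one is sketched below; the 60/20/20 default ratio, the seed, and the function name `split_tasks` are assumptions for illustration rather than SkillOpt defaults.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_tasks(
    tasks: Sequence[T],
    train: float = 0.6,
    validation: float = 0.2,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed and split into train, validation, and test lists."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains is held out for the final test
    return train_set, val_set, test_set
```

Fixing the seed means the same tasks land in the same split on every run, so the test set stays untouched across experiments.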
11.3 Example Acceptance Metrics
| Metric | Suggested Target |
|---|---|
| Held-out test improvement | A clear improvement over no skill or over the manual seed skill |
| Validation-gate pass rate | Do not look only at training-set improvement; check whether validation-set results are stable |
| Skill readability | Domain experts can read, review, and explain the skill |
| Additional online cost | No additional reflection calls after deployment |
| Transferability | The same skill still helps on adjacent tasks or adjacent models |
12. Risks and Considerations
| Risk | Description | Response |
|---|---|---|
| Project maturity | The PyPI version is '0.1.0', the classifiers mark it as Alpha, and SkillOpt-Sleep is a preview | Start with an R&D PoC; do not promise a production-grade platform |
| Cost cannot be ignored | The training phase requires target-model rollouts and optimizer reflection calls | Estimate sample size, epochs, batch size, and model unit price first |
| The scorer sets the ceiling | Wrong or noisy scoring pushes the optimization in the wrong direction | Build a reliable eval first, and spot-check manually where necessary |
| Data leakage | Training samples, trajectories, and failure cases may contain sensitive information | Use desensitized data, private models, or an enterprise gateway |
| Overfitting to the validation set | If the validation set is too small or the task distribution is biased, the skill may fit only local samples | Strictly hold out the test split, and regularly replace or expand the test set |
| Skill document bloat | Continuously appended rules can become long, conflicting, or hard to maintain | Control the learning rate, and review and rewrite manually on a regular basis |
| Cannot replace model capability | When the model itself cannot reason about the task or cannot call tools, the gain from a skill is limited | Validate model selection and task design first |
| Enterprise integration complexity | Requires a data pipeline, execution sandbox, scorer, and version management | Plan it as a component of the R&D platform rather than selling it as a point tool |
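For the cost row, a back-of-the-envelope estimate from sample size, epochs, batch size, and unit price can be sketched as follows. The call-count model is an assumption made for illustration (one rollout per training task per epoch, a fixed number of reflection calls per step, and one full validation pass per candidate); SkillOpt's real call pattern may differ, and the function names are hypothetical.

```python
import math
from typing import Dict


def estimate_training_calls(
    train_tasks: int,
    validation_tasks: int,
    epochs: int,
    batch_size: int,
    reflections_per_step: int = 1,
) -> Dict[str, int]:
    """Rough call counts under the assumptions stated in the comments."""
    steps = math.ceil(train_tasks / batch_size) * epochs
    rollout_calls = train_tasks * epochs             # one rollout per training task per epoch
    reflection_calls = steps * reflections_per_step  # optimizer calls per step (assumed)
    validation_calls = steps * validation_tasks      # full validation pass per candidate (assumed)
    return {
        "steps": steps,
        "rollout_calls": rollout_calls,
        "reflection_calls": reflection_calls,
        "validation_calls": validation_calls,
        "total_calls": rollout_calls + reflection_calls + validation_calls,
    }


def estimate_cost(total_calls: int, avg_price_per_call: float) -> float:
    """Multiply the call count by an average per-call price supplied by the user."""
    return total_calls * avg_price_per_call
```

Under these assumptions, validation calls grow with the number of steps times the validation-set size, so they can outweigh rollout calls in the budget.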
13. Differences from Related Approaches
| Approach | Difference |
|---|---|
| Prompt engineering | Prompt engineering is driven mostly by human experience; SkillOpt emphasizes systematic iteration based on trajectories and validation sets |
| Prompt optimizer | Some optimizers tune only a single prompt; SkillOpt is more like a training loop, with epochs, learning rate, gate, and slow/meta updates |
| Fine-tuning / LoRA | Fine-tuning changes model weights; SkillOpt changes an external Markdown skill, which is lighter, reviewable, and simple to deploy |
| RAG | RAG solves knowledge retrieval; SkillOpt addresses the strategies and operating skills for "how to do the task" |
| Agent memory | Memory usually stores facts or experiences; SkillOpt puts more emphasis on consolidating experience into validated task strategies |
| Self-reflecting agent | Ordinary self-reflection can be unstable; SkillOpt's validation gate and bounded edits make it more controllable |
14. My Pre-Sales Assessment
SkillOpt is a very good fit for projects on the theme of "frontier agent capability building". It speaks precisely to a common customer pain point: agents need more than a stronger model; they also need a long-term capability to distill skills from failures in a way that is verifiable, versionable, and portable. For customers with an R&D team, an evaluation mindset, and agent-platform plans, it can serve as a technical reference or PoC tool for an "agent continuous-optimization system".
But the pre-sales message should stay measured: SkillOpt is currently closer to a research framework and developer tool than to an off-the-shelf enterprise platform. Real deployment presupposes that the customer has scorable tasks, datasets, a model budget, and an engineering team to do the integration. The most advisable entry point is a narrow-scenario PoC, such as knowledge-base Q&A, spreadsheet processing, code-fix conventions, or ticket SOPs, rather than a promise that "all agents will evolve automatically".
If the customer is building AI coding, an agent platform, an enterprise knowledge assistant, or an office-automation assistant, SkillOpt can be a strong highlight of the proposal: we can do more than write prompts; we can build a continuous-improvement closed loop of "experience collection → task replay → skill optimization → validation gating → manual review → release".
15. Common Customer Q & A
| Question | Answer |
|---|---|
| Is SkillOpt model fine-tuning? | No. It does not modify model weights; it optimizes an external Markdown skill document. |
| Can it directly make our agent stronger? | It needs a task set, a scorer, and a validation set. It provides an optimization framework, not a magic button that works without data or evaluation. |
| Will online inference be slower? | After deployment, only 'best_skill.md' is added for the agent to use. In principle, no additional reflection calls are added. The extra cost is in the training phase. |
| Can it be deployed privately? | The code is open source under the MIT license and supports multiple model backends. Private deployment still depends on the customer's model API, data security, runtime environment, and dependencies. |
| Which agents is it suitable for? | It best suits task-based agents whose tasks are repeatable and scorable and whose error patterns can be summarized, such as Q&A, spreadsheets, code, SOPs, and Office automation. |
| What if the scorer is inaccurate? | Then the optimization may go in the wrong direction. The first step of a PoC should be to build a reliable evaluation, not to pile on training rounds. |
| Can SkillOpt-Sleep be used directly in production? | It is officially marked as a preview and is suitable for demos and internal pilots. Production use requires a careful evaluation of stability, security, and the interfaces involved. |