Probing and Correcting Model Tool Capabilities: A Runtime Protocol Layer for Cross-Model Adaptation in Claude Code

Abstract

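Coding agents such as Claude Code depend heavily on tool use: reading files, searching, running commands, applying patch edits, and verifying in a browser all require the model to issue a tool call, which the host executes before replaying the result to the model. If any link in the tool protocol is mismatched, the user does not see a model that is "slightly worse". They see tools that never run, misaligned parameters, API 400 errors, dropped parallel calls, broken reasoning state, or an agent that keeps trying to repair itself on top of a corrupted history.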

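So when Claude Code is adapted to non-Anthropic models such as DeepSeek and Qwen, or extended to further model backends, it cannot statically assume that they all support the same function calling as Anthropic Claude. A safer approach treats a model's tool capabilities as runtime facts rather than as common knowledge recorded in a configuration table. At startup, on model switch, when the chat template changes, and after a provider upgrade, the system should run capability probing. The probes cover schema adherence, single tool calls, parallel tool calls, tool_result replay, thinking state, XML/JSON formats, and error repair, and the results determine which adaptation path is selected.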

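This article discusses a capability probing and correction design for a Claude Code compatibility layer. The goal is not to prove that a given model "supports" or "does not support" tool calling. It is to build a protocol layer that can be re-tested, rolled out gradually, and rolled back, so that differences between models surface in real agent workflows and are absorbed systematically.

Why a static compatibility table is not enough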

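Anthropic's tool use protocol has a clear state machine: the application declares tools in the request, the model returns tool_use, the host executes the tool, and the host then places a tool_result whose tool_use_id matches tool_use.id in the next user message. The official documentation also stresses that the tool_result must immediately follow the tool_use from the previous turn and must come first in the user message's content array. Parallel tool calls have their own semantics and can be turned off when necessary with disable_parallel_tool_use=true.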

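This protocol is the native path for Claude, but crossing to other models exposes three kinds of breakpoints.

The first is a difference in message structure. OpenAI-compatible APIs typically use tools, tool_calls, tool_call_id, and role=tool, whereas Anthropic uses tool_use and tool_result inside content blocks. Community reports on Qwen3-Coder also show an XML-style format with <tool_call> and <function=...>, which needs a dedicated parser to extract tool calls from text.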

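The second is a difference in hidden state. The official documentation for DeepSeek thinking mode requires that, when a conversation continues after a tool call, the reasoning_content of prior assistant turns be sent back. Several community issues report the same symptom: the agent or SDK did not pass the reasoning content through, and the API returned 400. A compatibility layer therefore cannot convert only the visible tool_calls; it must also maintain model-specific reasoning state.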

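The third is that "appears to support" does not mean "usable in an agent". A model may produce a function name and JSON arguments in a simple example yet fail under streaming output, nested schemas, parallel calls, tool error repair, tool result replay, or multi-turn context trimming. The community reports on Qwen3-Coder concentrate on exactly these problems: the model generated a seemingly correct tool call that the framework never executed, or the output was not in OpenAI-compatible format and needed an XML tool call adapter.

A line such as supportsToolCall: true in the model registry can therefore serve only as a hint, not as a basis for execution. What the agent needs to know is a finer-grained capability vector.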

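Capability dimensions

An actionable capability profile should cover at least the following dimensions.

Dimension	Question to probe	Typical correction
Tool definition schema	Does the model follow `name`, `description`, and `input_schema` (or OpenAI `parameters`)? Does it respect required, enum, nested objects, and arrays?	Downgrade the schema, split complex tools, add argument validation and retries
Single tool call	Can the model emit one parseable tool call for a clearly specified task?	Enable a tool prompt template, switch parsers, fall back to plain-text command confirmation
Parallel tool calls	Does the model emit several tool calls at once? Are the ids and arguments of each call complete?	Run in parallel if supported; serialize if not; set `disable_parallel_tool_use` when necessary
Tool result replay	Can the model consume the previous tool result and carry on with the task? Are ids paired correctly?	Keep `tool_use`/`tool_result` adjacent; repair role and content block
thinking / reasoning state	Does the model require `reasoning_content`, thinking blocks, or other hidden reasoning state to be sent back?	Store the real reasoning content and re-inject it per the model's protocol; never fake it with an empty string
JSON format	Are JSON arguments strictly valid? Do they contain explanatory text, trailing commas, comments, or Markdown fences?	Tolerant parsing, local repair, and a request for the model to reissue the tool call on failure
XML format	Does the model use tags such as `<tool_call>` and `<function=...>` instead of structured API fields?	Enable an XML parser; map the parsed XML to the unified ToolCall IR
Streaming deltas	Are tool call fields chunked stably under streaming? Are arguments or end markers dropped?	Accumulate deltas before parsing; disable streaming tool calls if they cannot be made stable
Error repair	After a tool returns `is_error` or schema validation fails, can the model correct the arguments and retry?	Generate structured error feedback automatically; cap retries; fall back to user confirmation on failure
Context trimming	Does the history still satisfy the protocol after cleanup?	Check closure before trimming; keep the reasoning the model requires and the most recent tool chain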

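The key point is that a capability is not a boolean. The Qwen path, for example, may not be "no tool calling" but rather "unstable support for OpenAI-compatible structured tool calls, with an XML parser working better under some chat templates". Likewise, DeepSeek thinking mode is not "unable to make tool calls" but "reasoning state must be retained as protocol state after a tool call". The profile must be able to express capabilities of this kind.

A unified intermediate representation

Business logic in the compatibility layer should not depend directly on any provider's raw message format. It is better to introduce a unified ToolCall IR.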

type ToolCallIR = {
  id: string;
  name: string;
  input: unknown;
  format: "anthropic" | "openai" | "xml" | "json-text";
  raw: unknown;
};

type ToolResultIR = {
  toolCallId: string;
  content: unknown;
  isError: boolean;
};

type CapabilityProfile = {
  model: string;
  provider: string;
  endpoint: string;
  chatTemplateHash?: string;
  schemaLevel: "strict-json-schema" | "simple-object" | "string-only";
  toolCallFormat: "anthropic" | "openai" | "xml" | "json-text" | "none";
  supportsParallelToolUse: boolean;
  requiresReasoningReplay: boolean;
  supportsStreamingToolUse: boolean;
  errorRepair: "reliable" | "limited" | "none";
  verifiedAt: string;
  probeVersion: string;
};

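The business layer handles only the IR: Anthropic tool_use, OpenAI tool_calls, Qwen XML text, and JSON text produced by the model are all parsed into ToolCallIR first. After the tool runs, ToolResultIR is encoded back into the matching protocol according to the target model's profile. Probing and correction thus stay concentrated in the adapter layer and do not leak into the agent's tool execution logic.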
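The normalization step can be sketched roughly as follows. This is a minimal illustration under stated assumptions: AnthropicToolUse and OpenAIToolCall are trimmed-down versions of the real provider payloads, and the function names fromAnthropic, fromOpenAI, and encodeResult are hypothetical helpers, not part of any SDK.

```typescript
type ToolCallIR = {
  id: string;
  name: string;
  input: unknown;
  format: "anthropic" | "openai" | "xml" | "json-text";
  raw: unknown;
};
type ToolResultIR = { toolCallId: string; content: unknown; isError: boolean };

// Simplified provider shapes (assumption: real payloads carry more fields).
type AnthropicToolUse = { type: "tool_use"; id: string; name: string; input: unknown };
type OpenAIToolCall = {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments arrive as a JSON string
};

function fromAnthropic(block: AnthropicToolUse): ToolCallIR {
  return { id: block.id, name: block.name, input: block.input, format: "anthropic", raw: block };
}

function fromOpenAI(call: OpenAIToolCall): ToolCallIR {
  // JSON.parse throws on malformed arguments; the caller should treat that as a parse failure.
  const input: unknown = JSON.parse(call.function.arguments);
  return { id: call.id, name: call.function.name, input, format: "openai", raw: call };
}

// Encode a result back into the message shape of the target protocol.
function encodeResult(result: ToolResultIR, format: "anthropic" | "openai"): unknown {
  const text = typeof result.content === "string" ? result.content : JSON.stringify(result.content);
  if (format === "anthropic") {
    return {
      role: "user",
      content: [
        { type: "tool_result", tool_use_id: result.toolCallId, content: text, is_error: result.isError },
      ],
    };
  }
  // OpenAI-compatible tool messages have no is_error field, so the error travels in the content text.
  return { role: "tool", tool_call_id: result.toolCallId, content: text };
}
```

Both decoders produce the same IR for the same logical call, so the executor never needs to know which provider emitted it.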

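The probing protocol

The probing protocol should use side-effect-free tools so that no real command runs in the user's project. Three fixed probe tools are recommended: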

[
  {
    "name": "probe_echo",
    "description": "Return the input payload unchanged.",
    "input_schema": {
      "type": "object",
      "properties": {
        "message": { "type": "string" },
        "tag": { "type": "string", "enum": ["alpha", "beta"] }
      },
      "required": ["message", "tag"]
    }
  },
  {
    "name": "probe_add",
    "description": "Add two integers.",
    "input_schema": {
      "type": "object",
      "properties": {
        "a": { "type": "integer" },
        "b": { "type": "integer" }
      },
      "required": ["a", "b"]
    }
  },
  {
    "name": "probe_fail_once",
    "description": "Returns a structured error on the first call; used for repair probing.",
    "input_schema": {
      "type": "object",
      "properties": {
        "path": { "type": "string" }
      },
      "required": ["path"]
    }
  }
]

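A full probing run can be split into seven steps.

1. Basic format probe: ask the model to call probe_echo with a fixed string and an enum value. Observe whether it outputs an Anthropic block, OpenAI tool_calls, XML tags, JSON text, or a natural-language explanation.
2. Schema adherence probe: test enum, required, integer types, and nested objects separately. The compatibility layer checks not only whether a tool was called but also validates the arguments locally against the JSON Schema.
3. Single-tool closed-loop probe: execute probe_add, replay the result to the model, and ask it to answer based on the tool result. Verify that the model actually consumed the tool_result rather than computing the answer on its own or ignoring the result.
4. Parallel call probe: ask for two independent tools to be called at once, for example one echo and one add. If the model emits only one call or merges both calls into a single argument set, mark parallel calls as unsupported and split later plans into serial steps.
5. Error repair probe: have probe_fail_once return a structured error, for example {"error":"path must be absolute"}, and observe whether the model corrects the argument and calls again or gets stuck in explanatory text.
6. Thinking state probe: for models such as DeepSeek in thinking mode, continue the request after a tool call and verify whether reasoning_content must be sent back. If omitting it yields a 400, the profile must record requiresReasoningReplay=true.
7. Streaming probe: repeat the basic format and single-tool closed-loop probes in streaming mode and confirm that deltas merge reliably into a complete tool call. If they do not, disable only streaming tool calls; ordinary streaming need not be disabled.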
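The local validation in step 2 could look roughly like the sketch below. It is deliberately not a full JSON Schema validator: it covers only the flat string, integer, enum, and required checks that the three probe tools use, and the names ObjectSchema and validateArgs are hypothetical. A production layer would more likely delegate to an established JSON Schema library.

```typescript
type PropSchema = { type: "string" | "integer"; enum?: string[] };
type ObjectSchema = {
  type: "object";
  properties: Record<string, PropSchema>;
  required: string[];
};

// Returns a list of human-readable violations; an empty list means the arguments passed.
function validateArgs(schema: ObjectSchema, input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["input must be an object"];
  }
  const obj = input as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in obj)) errors.push(`missing required field: ${key}`);
  }
  for (const key of Object.keys(schema.properties)) {
    if (!(key in obj)) continue;
    const prop = schema.properties[key];
    const value = obj[key];
    if (prop.type === "string" && typeof value !== "string") {
      errors.push(`${key} must be a string`);
    }
    if (prop.type === "integer" && !(typeof value === "number" && Math.floor(value) === value)) {
      errors.push(`${key} must be an integer`);
    }
    if (prop.enum && !(typeof value === "string" && prop.enum.indexOf(value) !== -1)) {
      errors.push(`${key} must be one of: ${prop.enum.join(", ")}`);
    }
  }
  return errors;
}

// The input_schema of probe_echo from the probe tool list.
const echoSchema: ObjectSchema = {
  type: "object",
  properties: {
    message: { type: "string" },
    tag: { type: "string", enum: ["alpha", "beta"] },
  },
  required: ["message", "tag"],
};
```

A call such as validateArgs(echoSchema, { message: "hi", tag: "gamma" }) reports the enum violation, which is the kind of evidence the schema probe records.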

Probe results should record evidence, not just pass/fail.

{
  "model": "qwen3-coder-local",
  "probeVersion": "tool-probe-v1",
  "results": {
    "singleTool": {
      "status": "pass",
      "format": "xml",
      "parser": "qwen-tool-call-xml"
    },
    "parallelTool": {
      "status": "fail",
      "reason": "model emitted one combined textual block"
    },
    "schema": {
      "status": "partial",
      "failures": ["enum not respected under nested object"]
    }
  }
}

This evidence is used both to select the adapter at runtime and for later regression testing.
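One way to turn such evidence into profile fields is sketched below. The mapping rules are assumptions made for illustration (for instance, that a "partial" schema result maps to "simple-object"), and deriveSettings is a hypothetical name.

```typescript
type ProbeStatus = "pass" | "partial" | "fail";
type WireFormat = "anthropic" | "openai" | "xml" | "json-text";

type ProbeEvidence = {
  singleTool: { status: ProbeStatus; format?: WireFormat; parser?: string };
  parallelTool: { status: ProbeStatus; reason?: string };
  schema: { status: ProbeStatus; failures?: string[] };
};

type DerivedSettings = {
  toolCallFormat: WireFormat | "none";
  supportsParallelToolUse: boolean;
  schemaLevel: "strict-json-schema" | "simple-object" | "string-only";
};

function deriveSettings(evidence: ProbeEvidence): DerivedSettings {
  // Without a passing single-tool probe there is no format to trust.
  const toolCallFormat: DerivedSettings["toolCallFormat"] =
    evidence.singleTool.status === "pass" && evidence.singleTool.format
      ? evidence.singleTool.format
      : "none";

  // Assumed mapping: pass -> full schema, partial -> simplified objects, fail -> strings only.
  let schemaLevel: DerivedSettings["schemaLevel"] = "string-only";
  if (evidence.schema.status === "pass") schemaLevel = "strict-json-schema";
  else if (evidence.schema.status === "partial") schemaLevel = "simple-object";

  return {
    toolCallFormat,
    // Only a clean pass enables parallel execution; partial results are treated as unsupported.
    supportsParallelToolUse: evidence.parallelTool.status === "pass",
    schemaLevel,
  };
}
```

Applied to the evidence record above, this yields an XML tool call format, no parallel tool use, and the simple-object schema level.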

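Automatic correction strategies

After probing, the compatibility layer should automatically select the minimum necessary correction rather than forcing every model into the same protocol.

Protocol encoding correction: if the model profile is the native Anthropic path, keep the content blocks, tool_use.id, tool_result.tool_use_id, and is_error. If it is an OpenAI-compatible path, convert to tool_calls and role=tool. If it is the Qwen XML path, enable the XML parser, map the function name and arguments inside <tool_call> to the IR, and replay the tool result as text or in a template that the model consumes more easily.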
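A rough sketch of the XML path is shown below. The text above only establishes the <tool_call> and <function=...> tags; the <parameter=...> tag layout used here is an assumption about the template, and parseXmlToolCalls is a hypothetical helper. Because the XML carries no call id, the sketch synthesizes one.

```typescript
type ToolCallIR = {
  id: string;
  name: string;
  input: unknown;
  format: "anthropic" | "openai" | "xml" | "json-text";
  raw: unknown;
};

// Extracts every complete <tool_call> block from free-form model text.
function parseXmlToolCalls(text: string): ToolCallIR[] {
  const calls: ToolCallIR[] = [];
  const callRe = /<tool_call>\s*<function=([^>\s]+)>([\s\S]*?)<\/function>\s*<\/tool_call>/g;
  let call: RegExpExecArray | null;
  while ((call = callRe.exec(text)) !== null) {
    const input: Record<string, string> = {};
    // Assumed parameter layout: <parameter=name>value</parameter>
    const paramRe = /<parameter=([^>\s]+)>([\s\S]*?)<\/parameter>/g;
    let param: RegExpExecArray | null;
    while ((param = paramRe.exec(call[2])) !== null) {
      // Values stay strings here; coercion to integers etc. belongs to schema validation.
      input[param[1]] = param[2].trim();
    }
    calls.push({
      id: `xml_${calls.length}`, // synthesized id, since the XML has none
      name: call[1],
      input,
      format: "xml",
      raw: call[0],
    });
  }
  return calls;
}
```

Unterminated blocks do not match the pattern and are therefore dropped rather than executed with partial arguments.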

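Schema downgrade correction: when probing shows that the model cannot reliably follow complex JSON Schema, do not expose the full Claude Code tool schema as-is. Split complex tools into several simple ones; reduce oneOf, anyOf, deep nesting, and broad unions; validate high-risk arguments locally; and on failure return an explicit error so that the model reissues the call.

Parallel correction: when the profile marks parallel tool calls as unsupported, the planner layer should serialize steps that could otherwise run in parallel. On the Anthropic path, disable_parallel_tool_use=true can be set in tool_choice; for other models, the system prompt and the executor layer jointly enforce "at most one tool call per turn". This costs speed but prevents two tool calls from being concatenated into unparseable text.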
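On the executor side, the "at most one tool call per turn" rule might be enforced along these lines. This is only a sketch: splitForExecution is a hypothetical helper, and how deferred calls are reported back to the model (for example as a retryable error asking it to reissue them one at a time) is left open.

```typescript
type PendingCall = { id: string; name: string; input: unknown };

type ExecutionPlan = {
  runNow: PendingCall[];   // calls the executor may run this turn
  deferred: PendingCall[]; // calls the model must reissue in a later turn
};

function splitForExecution(calls: PendingCall[], supportsParallel: boolean): ExecutionPlan {
  if (supportsParallel || calls.length <= 1) {
    return { runNow: calls, deferred: [] };
  }
  // Profile says parallel calls are unreliable: run only the first call.
  return { runNow: calls.slice(0, 1), deferred: calls.slice(1) };
}
```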

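Reasoning replay correction: the key to DeepSeek thinking mode is to store the real reasoning_content and send it back in full on subsequent tool call turns. Community discussions around Vercel AI, OpenCode, Hermes Agent, OpenClaw, and Cursor all point to the same class of problem: once the compatibility layer loses the reasoning content of prior assistant turns, the API rejects the request. Empty reasoning, summarized reasoning, and regenerated reasoning are not acceptable substitutes; the original field returned by the provider must be stored as protocol state.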
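A pre-flight check before sending the next request could be sketched as follows. It assumes an OpenAI-compatible message shape with an extra reasoning_content field, and it assumes the replay rule applies to assistant turns that carry tool_calls; findMissingReasoning is a hypothetical name. The check only detects gaps and never fills them in.

```typescript
type ChatMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: string | null;
  reasoning_content?: string; // stored verbatim from the provider response
  tool_calls?: { id: string; type: "function"; function: { name: string; arguments: string } }[];
  tool_call_id?: string;
};

// Returns the indexes of assistant tool-call turns whose reasoning was lost.
function findMissingReasoning(history: ChatMessage[]): number[] {
  const missing: number[] = [];
  history.forEach((msg, index) => {
    const hasToolCalls =
      msg.role === "assistant" && msg.tool_calls !== undefined && msg.tool_calls.length > 0;
    // An empty string counts as missing: it would be a faked reasoning state.
    const hasReasoning =
      typeof msg.reasoning_content === "string" && msg.reasoning_content.length > 0;
    if (hasToolCalls && !hasReasoning) missing.push(index);
  });
  return missing;
}
```

If the returned list is non-empty for a profile with requiresReasoningReplay=true, the layer can fail locally with a clear diagnostic instead of waiting for the provider's 400.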

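Tool result position correction: the Anthropic documentation requires tool_result to immediately follow tool_use and to come first in the content array of the next user message. When the compatibility layer cleans up history, splits it for caching, or merges messages, it must first check that tool chains are closed. It must not insert plain text ahead of a tool result that is still pending, and it must not keep a tool_use while deleting its tool_result.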
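The closure check could be sketched like this, using a simplified Anthropic-style history in which every message content is an array of blocks. The Block and Msg types are trimmed-down assumptions and findClosureErrors is a hypothetical helper meant to run before and after any history rewrite.

```typescript
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

type Msg = { role: "user" | "assistant"; content: Block[] };

function findClosureErrors(history: Msg[]): string[] {
  const errors: string[] = [];
  for (let i = 0; i < history.length; i++) {
    const msg = history[i];
    if (msg.role !== "assistant") continue;

    const pending: string[] = [];
    for (const block of msg.content) {
      if (block.type === "tool_use") pending.push(block.id);
    }
    if (pending.length === 0) continue;

    const next = history[i + 1];
    if (!next || next.role !== "user") {
      errors.push(`message ${i}: tool_use is not followed by a user message`);
      continue;
    }
    // Only the leading run of tool_result blocks counts; results after text are misplaced.
    const answered: string[] = [];
    for (const block of next.content) {
      if (block.type !== "tool_result") break;
      answered.push(block.tool_use_id);
    }
    for (const id of pending) {
      if (answered.indexOf(id) === -1) {
        errors.push(`message ${i}: tool_use ${id} has no leading tool_result`);
      }
    }
  }
  return errors;
}
```

A trimming routine can refuse any cut that makes this function return a non-empty list.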

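Parsing correction: both the JSON text path and the XML path need tolerant parsers, but tolerance cannot be widened without limit. The recommended order is: strict parsing; recoverable syntax repair; local schema validation; and, if that still fails, returning a machine-readable error as a tool error or assistant feedback so that the model reissues the tool call. Never execute a real tool with half-parsed arguments.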
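The first two stages of that order, strict parsing followed by bounded repair, might look like the sketch below. The repair set (Markdown fences and trailing commas) is a deliberately small assumption, the regex-based comma fix can also alter string values that happen to contain ",}" and is shown only for illustration, and parseToolArguments is a hypothetical name. Schema validation would run on the returned value afterwards.

```typescript
type ParseOutcome =
  | { ok: true; value: unknown; repaired: boolean }
  | { ok: false; error: string };

function parseToolArguments(text: string): ParseOutcome {
  // Stage 1: strict parsing.
  try {
    return { ok: true, value: JSON.parse(text), repaired: false };
  } catch {
    // fall through to the repair stage
  }

  // Stage 2: recoverable syntax repair, limited to a fixed set of fixes.
  const candidate = text
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "") // leading Markdown fence
    .replace(/\s*`{3}$/, "")           // trailing Markdown fence
    .replace(/,\s*([}\]])/g, "$1");    // trailing commas before } or ]

  try {
    return { ok: true, value: JSON.parse(candidate), repaired: true };
  } catch (err) {
    // Give up: the caller returns this as a machine-readable error, never executes the tool.
    return { ok: false, error: `unparseable tool arguments: ${String(err)}` };
  }
}
```

Recording the repaired flag lets telemetry distinguish models that need repair often from those that emit strict JSON.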

Error repair correction: when tool execution fails, the error feedback should be structured rather than a long stack trace. For example:

{
  "is_error": true,
  "error_code": "SCHEMA_VALIDATION_FAILED",
  "message": "input.path must be an absolute path",
  "retryable": true
}

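This makes it possible to probe and exploit the model's self-repair ability. If a model repeatedly fails to repair, downgrade the capability to limited or none and switch to user confirmation or host-side completion.

Staged rollout and regression

Model capabilities drift. A provider upgrading the model, an endpoint switch, a modified chat template, or a local inference framework updating its parser can each change tool calling behavior. Capability probing therefore cannot run only once at first integration.

Binding the profile to four dimensions is recommended: provider, model, endpoint, and chatTemplateHash. A change in any of them triggers a new probe. For cloud-hosted models, also set a TTL, for example a lightweight probe refreshed automatically every 24 hours or every week.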
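The invalidation rule can be sketched as a small pure function. The field names follow the CapabilityProfile type; needsProbe and sameKey are hypothetical helpers, and the TTL value is whatever the deployment chooses.

```typescript
type ProfileKey = {
  provider: string;
  model: string;
  endpoint: string;
  chatTemplateHash?: string;
};

type CachedProfile = ProfileKey & { verifiedAt: string }; // ISO 8601 timestamp

function sameKey(a: ProfileKey, b: ProfileKey): boolean {
  return (
    a.provider === b.provider &&
    a.model === b.model &&
    a.endpoint === b.endpoint &&
    a.chatTemplateHash === b.chatTemplateHash
  );
}

// True when there is no cached profile, any key dimension changed, or the TTL expired.
function needsProbe(
  cached: CachedProfile | undefined,
  current: ProfileKey,
  now: Date,
  ttlMs: number,
): boolean {
  if (!cached || !sameKey(cached, current)) return true;
  const verified = Date.parse(cached.verifiedAt);
  // An unparseable timestamp is treated as stale rather than trusted.
  return isNaN(verified) || now.getTime() - verified > ttlMs;
}
```

Passing the current time in as a parameter keeps the function deterministic, which makes it easy to cover in the offline regression suite.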

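Staged rollout can use three layers.

1. Offline regression: in CI, run fixed probe transcripts against the known model adapters and verify that the four encoding paths (Anthropic, DeepSeek, Qwen XML, and OpenAI-compatible) are not broken.
2. Shadow probing: when the user switches models, run side-effect-free probes in the background first. A probe failure does not affect the current session, but the model must not be marked as an agent that can execute tools.
3. Low-traffic enablement: a new profile first allows only low-risk tools, such as read-only search and read-only file access. Writing files, executing commands, browser automation, and other high-risk tools open up only after the profile has passed telemetry from real tasks.

Regression metrics should cover both protocol correctness and user experience.

Metric	Description
tool call parse success rate	Whether model output can be parsed into ToolCallIR
schema validation failure rate	Whether arguments pass local schema validation
tool result closure error	Whether unclosed or mismatched `tool_use`/`tool_result` pairs occur
reasoning replay error	Whether state replay failures of the DeepSeek thinking mode kind occur
parallel downgrade rate	How many tasks were serialized because parallel calls are unsupported
repair success rate	Whether the model corrects itself automatically after a tool error
unsafe execution blocked	How many unreliable executions the parser or schema check blocked

When a profile's key metrics regress, the system should fall back automatically to a more conservative mode: disable parallel calls, disable streaming tool calls, enable strict schemas, enforce strong local validation, and restrict tools to a read-only allowlist. If necessary, mark the model as toolCallFormat=none so that it only plans and explains and does not drive tools directly.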
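A fallback rule of this kind might be sketched as below. The two thresholds are illustrative placeholders rather than recommended values, the metric names are shortened versions of those in the table, and degrade is a hypothetical helper.

```typescript
type RuntimeMode = {
  parallelToolUse: boolean;
  streamingToolUse: boolean;
  readOnlyToolsOnly: boolean;
  toolCallFormat: "anthropic" | "openai" | "xml" | "json-text" | "none";
};

type WindowMetrics = {
  parseSuccessRate: number; // 0..1 over the observation window
  closureErrors: number;    // count of tool_use/tool_result closure errors
};

// Illustrative thresholds only (assumption).
const MIN_PARSE_RATE = 0.95;
const DISABLE_TOOLS_BELOW = 0.5;

function degrade(mode: RuntimeMode, metrics: WindowMetrics): RuntimeMode {
  if (metrics.parseSuccessRate < DISABLE_TOOLS_BELOW) {
    // Last resort: the model only plans and explains.
    return {
      parallelToolUse: false,
      streamingToolUse: false,
      readOnlyToolsOnly: true,
      toolCallFormat: "none",
    };
  }
  if (metrics.parseSuccessRate < MIN_PARSE_RATE || metrics.closureErrors > 0) {
    // Conservative mode: keep the format, drop the risky features.
    return { ...mode, parallelToolUse: false, streamingToolUse: false, readOnlyToolsOnly: true };
  }
  return mode;
}
```

The function returns a new mode instead of mutating the old one, so the previous mode can be kept for rollback once metrics recover.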

A state machine for the compatibility layer

The runtime flow can be condensed into a state machine.

Model selected
  -> Load cached capability profile
  -> If missing/stale: run probes
  -> Choose adapter and parser
  -> Encode tools for target model
  -> Receive model output
  -> Parse into ToolCallIR
  -> Validate schema and safety policy
  -> Execute tool or return structured error
  -> Encode ToolResultIR back to model protocol
  -> Preserve required reasoning/tool state
  -> Record telemetry for regression

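The benefit of this state machine is clear boundaries. The model only generates intent; the parser normalizes the different formats; the validator decides whether execution is allowed; the executor handles only trusted IR; and the encoder replays the result according to the profile. A failure at any step can be downgraded instead of spreading across the whole agent.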
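The middle of that flow (parse, validate, execute or return a structured error) can be sketched with injected stages. Everything here is a simplified assumption: the Stages interface and runTurn are hypothetical, tools run synchronously for brevity, and TOOL_EXECUTION_FAILED is an invented error code alongside the SCHEMA_VALIDATION_FAILED code used earlier.

```typescript
type ToolCall = { id: string; name: string; input: unknown };
type ToolResult = { toolCallId: string; content: unknown; isError: boolean };

type Stages = {
  parse: (modelOutput: string) => ToolCall[]; // format-specific parser chosen by the profile
  validate: (call: ToolCall) => string[];     // schema and safety violations, empty if allowed
  execute: (call: ToolCall) => unknown;       // runs the real tool, may throw
};

function runTurn(modelOutput: string, stages: Stages): ToolResult[] {
  return stages.parse(modelOutput).map((call): ToolResult => {
    const violations = stages.validate(call);
    if (violations.length > 0) {
      // Blocked before execution: the model gets structured, retryable feedback.
      return {
        toolCallId: call.id,
        isError: true,
        content: {
          error_code: "SCHEMA_VALIDATION_FAILED",
          message: violations.join("; "),
          retryable: true,
        },
      };
    }
    try {
      return { toolCallId: call.id, isError: false, content: stages.execute(call) };
    } catch (err) {
      // A tool failure stays local to this call instead of aborting the turn.
      return {
        toolCallId: call.id,
        isError: true,
        content: { error_code: "TOOL_EXECUTION_FAILED", message: String(err), retryable: false },
      };
    }
  });
}
```

Each returned ToolResult is then handed to the encoder for the target protocol, so a failure in one stage surfaces as an ordinary error result rather than as a broken history.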

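Conclusion

The hard part of cross-model adaptation for Claude Code is not renaming a field from tool_use to tool_calls; it is maintaining tool calling as a protocol state machine. Anthropic's strict constraints, reasoning replay in DeepSeek thinking mode, and the community reports on Qwen3-Coder's XML tool calls all show that the single label "function calling" covers different message structures, hidden state, parsing formats, and error recovery abilities.

A reliable compatibility layer should probe at runtime rather than assume statically, record capability evidence rather than write boolean switches, and correct automatically with staged regression rather than wait for a user task to fail before someone debugs it by hand. Only then can models already covered by evidence, such as DeepSeek and Qwen, and other backends that will each need their own profile in the future, be integrated robustly into a high-tool-density agent like Claude Code instead of stopping at demo-level function calling.

References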

• Anthropic Tool use overview
• Anthropic Define tools
• Anthropic Handle tool calls
• Anthropic Parallel tool use
• Anthropic Tool Runner SDK
• DeepSeek Thinking Mode
• DeepSeek Tool Calls
• DeepSeek Reasoning Model
• Vercel AI issue: DeepSeek thinking mode tool calling
• OpenCode issue: DeepSeek V4 thinking mode tool call
• NousResearch Hermes Agent issue: DeepSeek V4 thinking mode tool call
• OpenClaw issue: reasoning_content replay
• DeepSeek Cursor Proxy
• Cursor Forum discussion: DeepSeek reasoning_content passthrough
• Qwen Function Calling
• Alibaba Cloud Model Studio: Qwen Function calling
• Qwen3-Coder issue: XML tool call format
• Qwen Code issue: generated tool call but tool not executed
• llama.cpp issue: Qwen3-Coder XML tool call parser
• Unsloth Qwen3-Coder GGUF discussion
• Qwen HF discussion: tool calling chat template
• LM Studio issue: Qwen3-Coder streaming tool return
• Roo Code issue: Qwen3-Coder tool call failure
• Anthropic Engineering: Advanced tool use
• Tool Calling is Linearly Readable and Steerable in Language Models
• From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents