sutong 750f981c7e feat: init media-center skill

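Media center: collects resource links from multiple channels, saves them to Quark Netdisk, and organizes them into an archive.
- sources/tencent-doc: Tencent Docs reader
- sources/search: netdisk search
- storage/quark: Quark Netdisk operations
- ref/: archived reference copies of the source skills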

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-16 18:28:23 +08:00

5.1 KiB

name	description
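tx-doc-large-reader	End-to-end workflow and pitfalls for reading very large Tencent Docs documents (legacy doc type). When get_content times out or the document exceeds 100,000 characters, use doc.resolve_document_structure as the alternative for extracting the full text.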

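Reading Very Large Tencent Docs Documents

Background

The Tencent Docs MCP exposes get_content as a general-purpose read interface, but for very large documents (verified in practice on one with 850,000 characters and 28,000 paragraphs) it has two problems:

- The get_content backend hits a TCP timeout (fixed at 5 seconds) and returns code:101, tcp client transport ReadFrame, i/o timeout
- Different document types need different read tools:
  - smartcanvas type → smartcanvas.read (supports pagination)
  - legacy tencentdoc type → doc.resolve_document_structure (returns the full document structure)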

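Quick Type Check

When get_content times out, first determine the document type:

# Probe the document type with smartcanvas.read (it does not fail with a network error even when the document is a tencentdoc)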
mcporter call tencent-docs smartcanvas.read file_id=<FILE_ID> size=10

# Example error response (the document is a tencentdoc):
# type:business, code:400008, msg:file is tencentdoc, not smartcanvas

# Example success response (the document is a smartcanvas):
# JSON content is returned normally
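
The probe result can also be classified in a script. The sketch below is a minimal illustration, assuming the probe output is available as a string in roughly the shape shown above; the function name classify_probe_output and the "unknown" fallback are hypothetical, not part of the MCP.

```python
import re

# Business error code reported when smartcanvas.read is pointed at a tencentdoc
# file (taken from the sample error response above).
TENCENTDOC_ERROR_CODE = "400008"

# Accepts both the plain-text form (code:400008) and a JSON-style form ("code": 400008).
# The JSON-style form is an assumption; only the plain-text form appears in the source.
_CODE_PATTERN = re.compile(r'code"?\s*:\s*"?' + TENCENTDOC_ERROR_CODE + r"\b")


def classify_probe_output(output: str) -> str:
    """Return "tencentdoc", "smartcanvas", or "unknown" for a probe result."""
    text = output.strip()
    if _CODE_PATTERN.search(text):
        return "tencentdoc"
    if text.startswith("{") or text.startswith("["):
        # The probe returned JSON content, so the read succeeded.
        return "smartcanvas"
    # Timeouts, empty output, and anything unrecognized need a manual look.
    return "unknown"


print(classify_probe_output(
    "type:business, code:400008, msg:file is tencentdoc, not smartcanvas"
))
```

Anything classified as "unknown" (for example another i/o timeout) should be retried or inspected by hand rather than guessed at.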

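Reading tencentdoc Documents

Step 1: Get the document structure

mcporter call tencent-docs doc.resolve_document_structure file_id=<FILE_ID> > doc_structure.json

- Returns the complete JSON structure, including a nodes array
- Each node contains: text_preview (text preview), heading_level (heading level), start_index/end_index, and type (paragraph type)
- The larger the document, the larger the response (850,000 characters → 8 MB JSON)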
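
To get a feel for the node fields before extracting the full text, the headings alone can be turned into an outline. This is a minimal sketch: the sample nodes are hypothetical values shaped like the fields listed above (the type strings in particular are made up), and build_outline is an illustrative helper, not part of the tool.

```python
from typing import Any


def build_outline(nodes: list[dict[str, Any]]) -> list[str]:
    """Return one indented line per heading node, skipping body paragraphs."""
    outline = []
    for node in nodes:
        level = node.get("heading_level") or 0   # missing or null counts as body text
        preview = node.get("text_preview") or ""
        if level > 0 and preview:
            outline.append("  " * (level - 1) + preview)
    return outline


# Hypothetical sample data; real responses will carry different values.
sample_nodes = [
    {"text_preview": "Chapter 1", "heading_level": 1,
     "start_index": 0, "end_index": 9, "type": "heading"},
    {"text_preview": "Some body text", "heading_level": 0,
     "start_index": 10, "end_index": 24, "type": "paragraph"},
    {"text_preview": "Section 1.1", "heading_level": 2,
     "start_index": 25, "end_index": 36, "type": "heading"},
]

print("\n".join(build_outline(sample_nodes)))
```

On a real document, pass the nodes array loaded from the saved JSON instead of sample_nodes.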

Step 2: Extract the text content

Use Python to pull every text_preview out of the JSON:

python -X utf8 -c "
import json

# Load the JSON saved in the previous step
with open('doc_structure.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

nodes = data.get('nodes', [])
texts = []
for n in nodes:
    preview = n.get('text_preview', '')
    hl = n.get('heading_level', 0)
    if preview:
        if hl > 0:
            texts.append('#' * hl + ' ' + preview)
        else:
            texts.append(preview)

full_text = '\n'.join(texts)
with open('doc_content.txt', 'w', encoding='utf-8') as f:
    f.write(full_text)

print(f'Total paragraphs: {len(texts)}')
print(f'Total characters: {len(full_text)}')
"

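Step 3: Inspect the content

# View the beginning of the file
head -200 doc_content.txt

# Or use the Read tool

Reading smartcanvas Documents

The smartcanvas type supports paginated reads, which suits very large documents: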

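# First read (set the number of blocks per page)
mcporter call tencent-docs smartcanvas.read file_id=<FILE_ID> size=50

# Fetch the next page (using the next_token returned by the previous page)
mcporter call tencent-docs smartcanvas.read file_id=<FILE_ID> next_token=<NEXT_TOKEN> size=50

Parameters:

- size: number of blocks returned per page (20-50 recommended)
- next_token: pagination cursor; omit it on the first call, then take it from the previous result
- page_id: optional; specifies a page ID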
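
The pagination loop can be sketched as follows. The fetch step is injected as a callable, because how the mcporter output is captured and parsed is not specified here. The blocks field name is an assumption and should be adjusted to the real response shape; only next_token comes from the description above.

```python
from typing import Any, Callable, Optional


def read_all_pages(
    fetch_page: Callable[[Optional[str]], dict[str, Any]],
    items_key: str = "blocks",   # assumed field name, not confirmed by the source
    max_pages: int = 10_000,     # safety cap so a bad cursor cannot loop forever
) -> list[Any]:
    """Call fetch_page(next_token) until the response carries no next_token."""
    items: list[Any] = []
    token: Optional[str] = None  # the first call passes no cursor
    for _ in range(max_pages):
        page = fetch_page(token)
        items.extend(page.get(items_key, []))
        token = page.get("next_token")
        if not token:
            break
    return items


# Usage with a fake three-page source standing in for smartcanvas.read.
fake_pages = {
    None: {"blocks": ["a", "b"], "next_token": "t1"},
    "t1": {"blocks": ["c"], "next_token": "t2"},
    "t2": {"blocks": ["d"]},
}
print(read_all_pages(lambda token: fake_pages[token]))
```

In real use, fetch_page would run the smartcanvas.read command with the given cursor and return the parsed JSON.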

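Notes

1. Encoding

On Windows with a Chinese locale, Python's default file encoding is GBK, so writing a file that contains emoji raises:

UnicodeEncodeError: 'gbk' codec can't encode character

Fix: run Python with the -X utf8 option, or pass encoding='utf-8' explicitly.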
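
The failure and the fix can be reproduced on any platform by naming the codecs explicitly. This is a small sketch; the temporary file path is only for illustration.

```python
import os
import tempfile

text = "done \U0001F600"  # contains an emoji, which GBK cannot represent

# Reproduce the failure without a Windows machine by encoding with GBK directly.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# The fix: always name the encoding when opening the file.
path = os.path.join(tempfile.mkdtemp(), "doc_content.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(gbk_ok, round_trip == text)
```

Passing encoding='utf-8' in the script keeps it correct even when it is launched without -X utf8.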

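2. Choosing a tool

Document type	Read tool	Characteristics
tencentdoc	`doc.resolve_document_structure`	Returns the whole structure in one response; large payload
smartcanvas	`smartcanvas.read`	Supports pagination; recommended
Any type	`get_content`	General-purpose interface; may time out on large documents

3. Determining the document type

- get_content does not distinguish between types, but may time out on large documents
- smartcanvas.read returns an explicit business error code, 400008, for a tencentdoc
- doc.* tools likewise return a type error for a smartcanvas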

4. Link text

Links inside text_preview appear in the following formats:

[普通链接: https://xxx]
[腾讯文档链接: https://docs.qq.com/...]
HYPERLINK "url"

No extra processing is needed; keep them exactly as they are.
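
If a later step does need the bare URLs (for example, to collect resource links), the three formats can be matched with regular expressions. This is a sketch under the assumption that real output matches the samples above exactly; extract_urls is a hypothetical helper.

```python
import re

# [普通链接: URL] and [腾讯文档链接: URL]; both ASCII and full-width colons are accepted.
BRACKET_LINK = re.compile(r"\[(?:普通链接|腾讯文档链接)[:：]\s*([^\]\s]+)\]")
# HYPERLINK "URL"
FIELD_LINK = re.compile(r'HYPERLINK\s+"([^"]+)"')


def extract_urls(text: str) -> list[str]:
    """Return every linked URL in text, in order of appearance."""
    found = [(m.start(), m.group(1)) for m in BRACKET_LINK.finditer(text)]
    found += [(m.start(), m.group(1)) for m in FIELD_LINK.finditer(text)]
    return [url for _, url in sorted(found)]


sample = (
    '[普通链接: https://example.com/a] '
    'HYPERLINK "https://example.com/b" '
    '[腾讯文档链接: https://docs.qq.com/doc/x]'
)
print(extract_urls(sample))
```

The extracted text file itself can still keep the links in their original form; this only reads them out.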

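5. Resource usage

- 850,000-character document → 8 MB JSON response → about 850 KB of plain text after extraction
- Clean up the intermediate JSON file soon after extraction
- doc.resolve_document_structure can take a while on very large documents (3-5 seconds measured), but it returns normally without timing out

6. Save locally

Write extracted content to a file immediately, so an unstable MCP connection does not cause data loss.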
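
One way to do this when content arrives in batches (for example, page by page) is to append each batch as soon as it is extracted. This is a minimal sketch; append_texts and the temporary path are illustrative names only.

```python
import os
import tempfile


def append_texts(path: str, texts: list[str]) -> None:
    """Append one batch of extracted lines to path and push it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        for line in texts:
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to persist the data before moving on


path = os.path.join(tempfile.mkdtemp(), "doc_content.txt")
append_texts(path, ["# Title", "first page"])   # written right after the first batch
append_texts(path, ["second page"])             # a later failure keeps the lines above

with open(path, "r", encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

If the connection drops partway through, everything already appended stays on disk.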

Complete Example (tencentdoc)

# 1. Fetch the document structure and save it
mcporter call tencent-docs doc.resolve_document_structure file_id=DR2xUcFdrSVhJTkZu > doc_structure.json

# 2. Extract the text
python -X utf8 -c "
import json
with open('doc_structure.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
texts = []
for n in data.get('nodes', []):
    p = n.get('text_preview', '')
    hl = n.get('heading_level', 0)
    if p:
        texts.append(('#' * hl + ' ' + p) if hl > 0 else p)
with open('doc_content.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(texts))
print(f'Done: {len(texts)} paragraphs')
"

# 3. Remove the intermediate file (optional)
rm doc_structure.json
