feat: init media-center skill

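Resource center: collects resource links from multiple channels, transfers them to Quark Netdisk, and organizes and archives them.

- sources/tencent-doc: Tencent Docs reading
- sources/search: netdisk search
- storage/quark: Quark Netdisk operations
- ref/: archived reference copies of the source skills

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>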
2026-05-16 18:28:23 +08:00
commit 750f981c7e
37 changed files with 7847 additions and 0 deletions
@@ -0,0 +1,171 @@
+---
+name: tx-doc-large-reader
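+description: End-to-end procedure and pitfalls for reading very large Tencent Docs documents (legacy doc type). When get_content times out or the document exceeds 100,000 characters, use doc.resolve_document_structure as the alternative for extracting the full text.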
+---
+
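+# Reading Very Large Tencent Docs Documents
+
+## Background
+
+The Tencent Docs MCP provides `get_content` as its general-purpose read interface, but for **very large documents** (verified in practice on one with 850,000 characters / 28,000 paragraphs) there are two problems:
+
+1. **`get_content` hits a backend TCP timeout** (fixed at 5 seconds) and returns `code:101, tcp client transport ReadFrame, i/o timeout`
+2. **Different document types need different read tools**:
+   - smartcanvas type → `smartcanvas.read` (supports pagination)
+   - legacy tencentdoc type → `doc.resolve_document_structure` (returns the full document structure)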
+
+## Quick Check
+
+When `get_content` times out, first determine the document type:
+
+```bash
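+# Probe the document type with smartcanvas.read (even for a tencentdoc document this does not produce a network error)
+mcporter call tencent-docs smartcanvas.read file_id=<FILE_ID> size=10
+
+# Example error response (the document is a tencentdoc):
+# type:business, code:400008, msg:file is tencentdoc, not smartcanvas
+
+# Example successful response (the document is a smartcanvas):
+# JSON content is returned normally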
+```
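
The probe above can be scripted. The following is a minimal sketch, assuming the output of the `smartcanvas.read` call has been captured as a string and that errors look like the example shown; `classify_doc_type` is a hypothetical helper, not part of mcporter or the Tencent Docs MCP.

```python
def classify_doc_type(output: str) -> str:
    """Guess the document type from the text printed by a smartcanvas.read probe."""
    text = output.strip()
    # Business error 400008 is what the example above shows for tencentdoc files.
    if "code:400008" in text or "file is tencentdoc" in text:
        return "tencentdoc"
    # Assumption: a successful probe prints a JSON object or array.
    if text.startswith(("{", "[")):
        return "smartcanvas"
    # Anything else (for example the i/o timeout message) is inconclusive.
    return "unknown"


print(classify_doc_type("type:business, code:400008, msg:file is tencentdoc, not smartcanvas"))
```

Returning `unknown` rather than guessing keeps a timeout or other transport error from being mistaken for a type signal.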
+
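+## Reading a tencentdoc Document
+
+### Step 1: Fetch the document structure
+
+```bash
+mcporter call tencent-docs doc.resolve_document_structure file_id=<FILE_ID> > doc_structure.json
+```
+
+- Returns the complete JSON structure, including a `nodes` array (saved here to `doc_structure.json` for the next step)
+- Each node contains `text_preview` (text preview), `heading_level` (heading level), `start_index`/`end_index`, and `type` (paragraph type)
+- The larger the document, the larger the response (850,000 characters → 8 MB JSON)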
+
+### Step 2: Extract the text
+
+Use Python to pull every `text_preview` out of the JSON:
+
+```bash
+python -X utf8 -c "
+import json
+
+# Load the JSON saved in the previous step
+with open('doc_structure.json', 'r', encoding='utf-8') as f:
+    data = json.load(f)
+
+nodes = data.get('nodes', [])
+texts = []
+for n in nodes:
+    preview = n.get('text_preview', '')
+    hl = n.get('heading_level', 0)
+    if preview:
+        if hl > 0:
+            texts.append('#' * hl + ' ' + preview)
+        else:
+            texts.append(preview)
+
+full_text = '\n'.join(texts)
+with open('doc_content.txt', 'w', encoding='utf-8') as f:
+    f.write(full_text)
+
+print(f'Total paragraphs: {len(texts)}')
+print(f'Total characters: {len(full_text)}')
+"
+```
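
For a document this large it can help to view only the headings before reading the full text. The sketch below builds an indented outline from the same `nodes` array; it relies only on the `text_preview` and `heading_level` fields described above, and `build_outline` plus the sample data are hypothetical names used for illustration.

```python
def build_outline(nodes: list[dict]) -> list[str]:
    """Return heading previews only, indented two spaces per heading level."""
    outline = []
    for node in nodes:
        # Treat a missing or null heading_level as body text (level 0).
        level = node.get("heading_level") or 0
        preview = node.get("text_preview", "")
        if level > 0 and preview:
            outline.append("  " * (level - 1) + preview)
    return outline


# Made-up sample shaped like the nodes described above.
sample_nodes = [
    {"text_preview": "Chapter 1", "heading_level": 1},
    {"text_preview": "Body text", "heading_level": 0},
    {"text_preview": "Section 1.1", "heading_level": 2},
]
print("\n".join(build_outline(sample_nodes)))
```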
+
+### Step 3: Inspect the content
+
+```bash
+# View the beginning
+head -200 doc_content.txt
+
+# Or use the Read tool
+```
+
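+## Reading a smartcanvas Document
+
+The smartcanvas type supports paginated reads, which suits very large documents:
+
+```bash
+# First read (set the number of blocks per page)
+mcporter call tencent-docs smartcanvas.read file_id=<FILE_ID> size=50
+
+# Fetch the next page (using the next_token returned by the previous page)
+mcporter call tencent-docs smartcanvas.read file_id=<FILE_ID> next_token=<NEXT_TOKEN> size=50
+```
+
+Parameters:
+- `size`: number of blocks returned per page (20-50 recommended)
+- `next_token`: pagination cursor; omit it on the first call, then take it from the previous result
+- `page_id`: optional; specifies the page ID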
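
The cursor loop described above can be sketched in Python. This is an illustration under stated assumptions: `fetch_page` is a hypothetical callable that wraps one `smartcanvas.read` call and returns the parsed response, and the `blocks` key is an assumed field name (only `next_token` is named in this guide).

```python
from typing import Callable, Iterator, Optional


def iter_blocks(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield blocks page by page, following next_token until it is absent."""
    token = None  # The first call passes no cursor.
    for _ in range(max_pages):  # Hard cap guards against a cursor that never ends.
        page = fetch_page(token)
        yield from page.get("blocks", [])  # "blocks" is an assumed key.
        token = page.get("next_token")
        if not token:
            break


def fake_fetch(token: Optional[str]) -> dict:
    """Stand-in for the real call: two canned pages keyed by cursor."""
    pages = {
        None: {"blocks": [{"id": 1}, {"id": 2}], "next_token": "t1"},
        "t1": {"blocks": [{"id": 3}]},
    }
    return pages[token]


print([block["id"] for block in iter_blocks(fake_fetch)])
```

Passing the fetcher in as a callable keeps the loop independent of how the MCP call is actually made.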
+
+## Notes
+
+### 1. Encoding
+
+On Windows with a Chinese locale, Python's default encoding is GBK, so writing a file that contains emoji fails with:
+```
+UnicodeEncodeError: 'gbk' codec can't encode character
+```
+
+**Fix**: run with `python -X utf8` or pass `encoding='utf-8'` explicitly
+
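+### 2. Choosing a tool
+
+| Document type | Read tool | Notes |
+|---------|---------|------|
+| tencentdoc | `doc.resolve_document_structure` | Returns the whole structure in one response; large payload |
+| smartcanvas | `smartcanvas.read` | Supports pagination; recommended |
+| Any type | `get_content` | General-purpose interface; may time out on large documents |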
+
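+### 3. Determining the document type
+
+- `get_content` does not distinguish between types, but may time out on large documents
+- `smartcanvas.read` returns the explicit business error code `400008` for a tencentdoc
+- `doc.*` tools likewise return a type error for a smartcanvas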
+
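+### 4. Link text
+
+Links in `text_preview` appear in the following formats (the labels 普通链接 and 腾讯文档链接 mean "plain link" and "Tencent Docs link"):
+```
+[普通链接: https://xxx]
+[腾讯文档链接: https://docs.qq.com/...]
+HYPERLINK "url"
+```
+No extra processing is needed; keep them as they are.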
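
If the URLs are needed on their own, for example to collect resource links from the extracted text, the three formats listed above can be matched with a regular expression. This is a sketch that assumes the formats appear exactly as listed; `extract_links` is a hypothetical helper and the sample string is made up.

```python
import re

# One alternative per format: the two bracketed labels, then HYPERLINK "url".
LINK_PATTERN = re.compile(
    r'\[(?:普通链接|腾讯文档链接)[:：]\s*(?P<bracket>[^\]\s]+)\]'
    r'|HYPERLINK\s+"(?P<quoted>[^"]+)"'
)


def extract_links(text: str) -> list[str]:
    """Return every URL found in any of the three link formats, in order."""
    return [
        match.group("bracket") or match.group("quoted")
        for match in LINK_PATTERN.finditer(text)
    ]


sample = (
    '[普通链接: https://example.com/a] '
    '[腾讯文档链接: https://docs.qq.com/doc/x] '
    'HYPERLINK "https://example.org"'
)
print(extract_links(sample))
```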
+
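+### 5. Resource usage
+
+- 850,000-character document → 8 MB JSON response → about 850 KB of plain text after extraction
+- Clean up the intermediate JSON file promptly after extraction
+- `doc.resolve_document_structure` can take a while on very large documents (3-5 seconds measured), but it returns normally without timing out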
+
+### 6. Saving locally
+
+Write extracted content to a file right away so that an unstable MCP connection cannot cause data loss.
+
+## Complete Example (tencentdoc)
+
+```bash
+# 1. Fetch the document structure and save it
+mcporter call tencent-docs doc.resolve_document_structure file_id=DR2xUcFdrSVhJTkZu > doc_structure.json
+
+# 2. Extract the text
+python -X utf8 -c "
+import json
+with open('doc_structure.json', 'r', encoding='utf-8') as f:
+    data = json.load(f)
+texts = []
+for n in data.get('nodes', []):
+    p = n.get('text_preview', '')
+    hl = n.get('heading_level', 0)
+    if p:
+        texts.append(('#' * hl + ' ' + p) if hl > 0 else p)
+with open('doc_content.txt', 'w', encoding='utf-8') as f:
+    f.write('\n'.join(texts))
+print(f'Done: {len(texts)} paragraphs')
+"
+
+# 3. Clean up the intermediate file (optional)
+rm doc_structure.json
+```