We can run a local knowledge base on the web too, RAG (tsundere) - Part 2 - data preprocessing and match-result optimization

This topic was created 107 days ago; some of the information in it may have since evolved or changed.

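As everyone knows, the most basic RAG pipeline is:

data processing → vectorization → storage → text matching → result optimization → final matched results

Of these, data preprocessing and match-result optimization are especially important.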

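1. Data preprocessing: text chunking

When processing a long article, you usually need to split the whole text into smaller chunks, then vectorize and store each chunk separately.

Why chunking matters

If the text is chunked badly, retrieval may fail to hit the content that is actually relevant, so a sensible chunking strategy is essential.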
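
To make the boundary problem concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the splitter used later in this post; `chunkWithOverlap` and its parameters are hypothetical names made up for illustration. Because neighbouring chunks share some characters, text near a cut also shows up at the start of the next chunk, which lowers the chance that a relevant passage gets split in half.

```typescript
// Hypothetical helper: fixed-size chunks where neighbours share `overlap` characters.
// Note: sizes are counted in UTF-16 code units, which is fine for a sketch.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
        throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
    }
    const step = chunkSize - overlap; // how far the window moves each time
    const chunks: string[] = [];
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break; // this chunk already reached the end
    }
    return chunks;
}

console.log(chunkWithOverlap("abcdefghij", 4, 2));
// ["abcd", "cdef", "efgh", "ghij"]
```

A bigger overlap costs more storage and more vectors to compare, so it is a trade-off rather than a free win.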

How to split?

See here:

👉 Text Splitters Overview - LangChain

Article-style data: Text-structured splitting is recommended
Structured markup such as HTML: Document-structured splitting is recommended

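2. Compensating for RAG's matching weaknesses: big and small chunks + keyword index

Even after the text is split into chunks, the matches can still be inaccurate, because RAG has an inherent limitation: if the question is not related to any chunk, matching will naturally be poor.

Optimization: follow the Danswer architecture

Put simply, split the text into chunks of different sizes and pair them with a keyword index

Big chunks: improve relevance at the semantic level
Small chunks + keyword index: improve the hit rate on details

Small chunks provide more detail, but they can also bring in noise.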
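
One way to picture the big/small split is sketched below. This is an assumption about how the two levels could be linked, not Danswer's actual implementation: every mini chunk remembers the big chunk it was cut from, so a detailed hit on a mini chunk can be traced back to its wider context. `Chunk`, `splitEvery` and `buildTwoLevelChunks` are hypothetical names.

```typescript
interface Chunk {
    id: number;
    kind: "big" | "mini";
    parentId: number | null; // for a mini chunk: id of the big chunk it came from
    text: string;
}

// Hypothetical helper: cut text into consecutive pieces of at most `size` characters.
function splitEvery(text: string, size: number): string[] {
    const parts: string[] = [];
    for (let i = 0; i < text.length; i += size) {
        parts.push(text.slice(i, i + size));
    }
    return parts;
}

// Build big chunks first, then cut each big chunk into mini chunks that point back to it.
function buildTwoLevelChunks(text: string, bigSize: number, miniSize: number): Chunk[] {
    const chunks: Chunk[] = [];
    let nextId = 1;
    for (const bigText of splitEvery(text, bigSize)) {
        const big: Chunk = { id: nextId++, kind: "big", parentId: null, text: bigText };
        chunks.push(big);
        for (const miniText of splitEvery(bigText, miniSize)) {
            chunks.push({ id: nextId++, kind: "mini", parentId: big.id, text: miniText });
        }
    }
    return chunks;
}

console.log(buildTwoLevelChunks("abcdefgh", 4, 2));
```

Both levels would then be vectorized and stored, and the mini chunks are the natural candidates to feed into the keyword index.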

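3. Optimizing the match results

Matching along several dimensions can produce a large number of candidate texts, so they need a further round of ranking:

Weighted ranking that combines the vector-match and keyword-match scores
A lightweight rerank model
The result is a reasonable final set of matched texts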
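
The weighted ranking in the first item could look roughly like this. It is a generic sketch, not the exact scoring used in the code further down: each score list is divided by its own maximum so both land in [0, 1], then the two are combined with weights. `Hit`, `normalizeByMax`, `fuse` and the 0.7 / 0.3 weights are assumptions for illustration, and scores are assumed to be non-negative.

```typescript
interface Hit {
    id: number;
    score: number;
}

// Divide every score by the largest one so the best hit gets 1.
function normalizeByMax(hits: Hit[]): Hit[] {
    const max = Math.max(0, ...hits.map(h => h.score));
    if (max === 0) return hits.map(h => ({ id: h.id, score: 0 }));
    return hits.map(h => ({ id: h.id, score: h.score / max }));
}

// Weighted sum of the normalized scores; an id missing from one list contributes 0 there.
function fuse(vectorHits: Hit[], keywordHits: Hit[], vectorWeight: number, keywordWeight: number): Hit[] {
    const combined = new Map<number, number>();
    const accumulate = (hits: Hit[], weight: number) => {
        for (const h of normalizeByMax(hits)) {
            combined.set(h.id, (combined.get(h.id) ?? 0) + h.score * weight);
        }
    };
    accumulate(vectorHits, vectorWeight);
    accumulate(keywordHits, keywordWeight);

    const merged: Hit[] = [];
    combined.forEach((score, id) => merged.push({ id, score }));
    return merged.sort((a, b) => b.score - a.score);
}

const ranked = fuse(
    [{ id: 1, score: 0.9 }, { id: 2, score: 0.6 }], // vector similarities
    [{ id: 2, score: 8 }, { id: 3, score: 4 }],     // keyword scores on a different scale
    0.7,
    0.3
);
console.log(ranked.map(h => h.id)); // [2, 1, 3]
```

Chunk 2 wins here because it shows up in both lists, even though chunk 1 has the higher vector similarity.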

4. Putting it on the web (browser only)

Chunking the text

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters"

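// The SPLITTER_* values are config constants defined elsewhere in the project (not shown here).
// Two splitters share the same separators: one yields big chunks, the other mini chunks.
const getBaseTextRecursiveSplitter = () => {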
    const bigSplitter = new RecursiveCharacterTextSplitter({
        chunkSize: SPLITTER_BIG_CHUNK_SIZE,
        chunkOverlap: SPLITTER_BIG_CHUNK_OVERLAP,
        separators: SPLITTER_SEPARATORS
    });

    const miniSplitter = new RecursiveCharacterTextSplitter({
        chunkSize: SPLITTER_MINI_CHUNK_SIZE,
        chunkOverlap: SPLITTER_MINI_CHUNK_OVERLAP,
        separators: SPLITTER_SEPARATORS
    });

    return {
        bigSplitter,
        miniSplitter
    }
}

Keyword index

Uses lunr.js + jieba (which handles Chinese better)
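
Why a segmenter matters: as far as I know, lunr's default tokenizer splits on whitespace and hyphens, so a Chinese sentence without spaces ends up as one giant token. The sketch below does not call lunr or jieba at all; it swaps in overlapping two-character tokens (bigrams) as a crude stand-in for real word segmentation and builds a tiny inverted index on top. All function names are hypothetical.

```typescript
// Crude stand-in for a real segmenter such as jieba: overlapping two-character tokens.
function bigrams(text: string): string[] {
    const chars = Array.from(text.replace(/\s+/g, ""));
    if (chars.length < 2) return chars;
    const tokens: string[] = [];
    for (let i = 0; i < chars.length - 1; i++) {
        tokens.push(chars[i] + chars[i + 1]);
    }
    return tokens;
}

// Inverted index: token -> ids of the chunks that contain it.
function buildIndex(chunks: { id: number; text: string }[]): Map<string, Set<number>> {
    const index = new Map<string, Set<number>>();
    for (const chunk of chunks) {
        for (const token of bigrams(chunk.text)) {
            if (!index.has(token)) index.set(token, new Set<number>());
            index.get(token)!.add(chunk.id);
        }
    }
    return index;
}

// Score = number of distinct query tokens found in the chunk.
function search(index: Map<string, Set<number>>, query: string): { id: number; score: number }[] {
    const counts = new Map<number, number>();
    new Set(bigrams(query)).forEach(token => {
        const ids = index.get(token);
        if (!ids) return;
        ids.forEach(id => counts.set(id, (counts.get(id) ?? 0) + 1));
    });
    const results: { id: number; score: number }[] = [];
    counts.forEach((score, id) => results.push({ id, score }));
    return results.sort((a, b) => b.score - a.score);
}

const index = buildIndex([
    { id: 1, text: "本地知识库检索" },
    { id: 2, text: "向量数据库存储" },
]);
console.log(search(index, "知识库")); // [{ id: 1, score: 2 }]
```

A real segmenter produces actual words instead of blind character pairs, which keeps the index smaller and the matches less noisy.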

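Optimizing the match results

To avoid performance problems on the client, no rerank model is used; results are only ranked through weighting and normalization, with a few tweaks to the scoring logic

Reference code: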

let [lshRes, fullIndexResFromDB] = await Promise.all([
    searchLshIndex(),
    searchFullTextIndex(),
]) as [Search.LshItemRes[], lunr.Index.Result[]]

// Sort the vector matches and keep the top slice
const sortedLshRes = lshRes.sort((a, b) => b.similarity - a.similarity)
                           .slice(0, config.SEARCH_RESULT_HEADER_SLICE_SIZE)

// Sort the full-text matches and keep the top slice
const sortedFullIndexResFromDB = fullIndexResFromDB.sort((a, b) => b.score - a.score)
                                                   .slice(0, config.SEARCH_RESULT_HEADER_SLICE_SIZE)

// Re-score and normalize: re-index the top full-text hits, search again, then divide by the max score
await FullTextIndex.loadJieBa()
const fullIndexFromDBTextChunkRes = await store.getBatch({
    storeName: constant.TEXT_CHUNK_STORE_NAME,
    keys: sortedFullIndexResFromDB.map((item) => Number(item.ref))
})

FullTextIndex.add([{ field: 'text' }], fullIndexFromDBTextChunkRes.map(item => ({
    id: item.id,
    text: item.text
})))

let newFullIndexRes = FullTextIndex.search(question)
newFullIndexRes = newFullIndexRes.sort((a, b) => b.score - a.score)
const maxScore = newFullIndexRes[0]?.score || 1
const reRankFullIndexRes = newFullIndexRes.map(item => ({
    ...item,
    score: item.score / maxScore
}))

// Merge the vector and keyword match results
let mixIndexSearchedRes: { id: number, score: number }[] = []
const vectorWeight = config.SEARCHED_VECTOR_WEIGHT
const fullTextWeight = config.SEARCHED_FULL_TEXT_WEIGHT

sortedLshRes.forEach(lshItem => {
    const match = reRankFullIndexRes.find(item => Number(item.ref) === lshItem.id)
    if (match) {
        mixIndexSearchedRes.push({
            id: lshItem.id,
            score: lshItem.similarity * vectorWeight + match.score * fullTextWeight
        })
    } else {
        mixIndexSearchedRes.push({
            id: lshItem.id,
            score: lshItem.similarity
        })
    }
})

// Append keyword-only hits as tail results, scaled down by a similarity taken from the tail of the vector results
const lshTailStartIndex = Math.floor(vectorWeight * sortedLshRes.length)
const lshTailMaxScore = sortedLshRes.slice(lshTailStartIndex)?.[0]?.similarity || 1

reRankFullIndexRes.forEach(item => {
    if (!mixIndexSearchedRes.find(i => i.id === Number(item.ref))) {
        mixIndexSearchedRes.push({
            id: Number(item.ref),
            score: item.score * lshTailMaxScore
        })
    }
})

mixIndexSearchedRes = mixIndexSearchedRes
    .sort((a, b) => b.score - a.score)
    .filter(item => item.score > config.SEARCH_SCORE_THRESHOLD)

What, you think this whole setup is unreliable? Look below!!!