Tokenizer

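Definition

A tokenizer is a tool or program that splits a piece of text into smaller units called tokens, such as words, subwords, characters, or symbols. It is commonly used in natural language processing, search, and compilers.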
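
As a rough illustration of the definition, the following minimal sketch splits text into word and punctuation tokens with a regular expression. The function name `simple_tokenize` and the pattern are assumptions made for this example, not the behavior of any particular library; real tokenizers handle many more cases (contractions, URLs, scripts without spaces).

```python
import re

# Illustrative pattern (an assumption for this sketch):
#   \w+      one or more word characters (letters, digits, underscore)
#   [^\w\s]  any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("The tokenizer splits the sentence into words."))
# ['The', 'tokenizer', 'splits', 'the', 'sentence', 'into', 'words', '.']
```

Whitespace is discarded and each punctuation mark becomes its own token, which is one common convention among several.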

Pronunciation (IPA)

/ˈtoʊkənaɪzər/

Examples

The tokenizer splits the sentence into words.

In modern NLP pipelines, the tokenizer often converts text into subword tokens so rare words can still be represented reliably.
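
The subword idea in the second example can be sketched with greedy longest-match splitting, similar in spirit to how WordPiece-style tokenizers segment a word at inference time. Everything here is an assumption for illustration: the function name `subword_tokenize`, the `##` continuation marker, the `[UNK]` fallback, and the toy vocabulary. Production tokenizers learn their vocabulary from data.

```python
def subword_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Mark pieces that continue a word (an assumed convention).
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary piece matches here: give up on the whole word.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this example.
TOY_VOCAB = {"token", "##izer", "##ize", "##s"}

print(subword_tokenize("tokenizer", TOY_VOCAB))   # ['token', '##izer']
print(subword_tokenize("tokenizes", TOY_VOCAB))   # ['token', '##ize', '##s']
```

Because "tokenizes" is built from pieces already in the vocabulary, it can be represented even though the full word never appears there.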

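Etymology

"Tokenizer" is formed from token (a mark or sign; also a coin substitute) + the verb-forming suffix -ize (to make into, to subject to a process) + the agent suffix -er (a person or tool that performs the action). Literally, it means "a tool or program that performs tokenization." In computational linguistics, "token" commonly refers to one of the smallest units of text that can be operated on, so "tokenizer" refers specifically to the component that carries out the splitting and tokenization step.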

Related Words

Subword

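Literary Works

Speech and Language Processing (Dan Jurafsky, James H. Martin): the terms "tokenizer" and "tokenization" appear frequently in the chapters on word segmentation, tokenization, and text preprocessing.
Natural Language Processing with Python (Steven Bird, Ewan Klein, Edward Loper): uses "tokenizer" and related concepts when discussing text processing and tokenization in practice.
Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze): discusses tokenization and tokenizers in the context of search and index construction.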