Text Extraction

Definition

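text extraction: the process of extracting editable, searchable text from images, PDFs, web pages, or other unstructured or semi-structured data (common in OCR, information extraction, and data-processing contexts). In some contexts the term also refers more broadly to extracting key information from large volumes of text.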
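As an illustration of the web-page case in the definition above, here is a minimal sketch that uses only Python's standard-library `html.parser`. The names `VisibleTextParser` and `extract_text` are hypothetical, chosen for this example, and the decision to skip `<script>` and `<style>` contents is an assumption about what counts as readable text, not a standard.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: these hold no readable text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the readable text of an HTML string, chunks joined by spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because chunks are joined with single spaces, text split by inline tags may gain a space before punctuation. Extracting text from images or scanned PDFs would need an OCR engine instead, which this sketch does not cover.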

Pronunciation

/ˈtɛkst ɪkˈstrækʃən/

Examples

Text extraction helps turn scanned documents into editable files.

In the pipeline, text extraction is typically followed by cleaning, entity recognition, and indexing for search.
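The pipeline described in the example sentence (extraction, then cleaning, entity recognition, and indexing) could be sketched roughly as follows. All function names here are hypothetical. The "entity recognition" step is a deliberately naive stand-in that matches runs of capitalized words; a real system would use a trained model. The input is assumed to be text that has already been extracted, keyed by document id.

```python
import re
from collections import defaultdict


def clean(text: str) -> str:
    """Collapse runs of whitespace left over from extraction."""
    return re.sub(r"\s+", " ", text).strip()


def find_entities(text: str) -> list[str]:
    """Toy stand-in for entity recognition: runs of capitalized words.

    This also matches ordinary sentence-initial words, so it only
    illustrates where the step sits in the pipeline.
    """
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: lowercase token -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return dict(index)


def run_pipeline(extracted: dict[str, str]):
    """Run cleaning, entity recognition, and indexing on extracted text."""
    cleaned = {doc_id: clean(text) for doc_id, text in extracted.items()}
    entities = {doc_id: find_entities(text) for doc_id, text in cleaned.items()}
    index = build_index(cleaned)
    return cleaned, entities, index
```

Looking up a lowercase word in the returned index gives the set of document ids that contain it, which is the basic operation behind search.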

Etymology

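text comes from Latin textus ("something woven; a structure woven from words"), later extended to mean "a piece of writing, a text". extraction comes from Latin extrahere (ex- "out" + trahere "to pull, to drag"), whose original sense is "to draw out, to extract". Combined, the two give the sense of "pulling the text out of its carrier"; today the term is used mostly in computing and data processing.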

Related Words

Literary Works

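Speech and Language Processing (Dan Jurafsky, James H. Martin): the chapters on information extraction, text processing, and NLP pipelines frequently discuss extraction tasks and methods.
Foundations of Statistical Natural Language Processing (Christopher D. Manning, Hinrich Schütze): a classic text that covers extracting information, features, and statistical patterns from text.
Handbook of Natural Language Processing (Nitin Indurkhya, Fred J. Damerau, eds.): its survey chapters on information extraction and text analysis often use terms and usages close to text extraction.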