Word-segmentation search in PostgreSQL

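Some fuzzy-search scenarios take a whole sentence as the search input. The sentence then has to be split into words, and the fuzzy match is run against those words. A more mature solution here is Elasticsearch, which segments text with a Chinese word-segmentation plugin and can also return a similarity score with each hit. Elasticsearch is resource-hungry, though, so this post implements a simple version of the feature in PostgreSQL instead.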

Document search and matching in PostgreSQL

For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document — the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the tsvector as being the document, but of course it is only a compact representation of the full document.

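The search flow is to first preprocess the original content into a vector form, and then match the query against the vectorized documents.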

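Document: the data after it has been converted to a tsvector
Query: the search input also has to be normalized, into a tsquery
Match: the @@ operator
Match details: conditions such as "contains all of" (AND) or "contains any of" (OR) are expressed with the operators used inside the tsquery; see the documentation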
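
As a minimal sketch of how these pieces fit together, the statements below use PostgreSQL's built-in english configuration, so they do not depend on zhparser. The sample sentence is made up for illustration.

```sql
-- Document side: to_tsvector normalizes the text into lexemes.
-- Query side: to_tsquery parses the operators & (AND), | (OR) and ! (NOT).
SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'cat & mat')  AS both_words;    -- true

SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'cat & dog')  AS both_words;    -- false

SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'cat | dog')  AS either_word;   -- true

SELECT to_tsvector('english', 'a fat cat sat on a mat')
       @@ to_tsquery('english', 'cat & !dog') AS cat_not_dog;   -- true
```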

Prerequisites

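PostgreSQL: the database
zhparser extension for PostgreSQL: a Chinese word-segmentation extension; it depends on the SCWS segmentation library
pg_trgm extension for PostgreSQL: a text-similarity extension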

This post uses the Docker image zhparser/zhparser:alpine-16 directly, which already bundles zhparser and pg_trgm. To install from packages instead, see the zhparser GitHub repository.

Starting the database & loading the extensions

Start the database service with docker-compose.yml

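version: '3.0'
services:
  db:
    image: zhparser/zhparser:alpine-16
    restart: always
    ports:
      - "5432:5432"
    volumes:
      # Persist the database files on the host
      - ./pg_data:/var/lib/postgresql/data
    environment:
      # Example password only; change it for real use
      POSTGRES_PASSWORD: "123456"
      # Must be an absolute path inside the container
      PGDATA: /var/lib/postgresql/data/pgdata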

Connect to the database and load the zhparser and pg_trgm extensions

CREATE EXTENSION zhparser;
CREATE EXTENSION pg_trgm;
-- List the extensions that are already loaded
SELECT * FROM pg_extension;
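
The post loads pg_trgm but does not show it in use. As a rough sketch, the extension's similarity() function and % operator compare two strings by their shared trigrams. The sample strings are made up, and how well trigram matching works for Chinese text depends on the database's locale and encoding, so treat this as something to verify in your own setup.

```sql
-- similarity() returns a score between 0 (no shared trigrams) and 1 (identical).
SELECT similarity('postgres', 'postgresql') AS score;

-- The % operator is true when the score exceeds the current threshold.
SHOW pg_trgm.similarity_threshold;
SELECT 'postgres' % 'postgresql' AS is_similar;
```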

Using the extensions

The usual workflow for zhparser is as follows

Create a text search configuration with zhparser as its parser

CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);

The boolean SCWS options below default to false (shown as f). A few examples:

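Ignore punctuation and other special symbols: zhparser.punctuation_ignore = f
Load the whole dictionary into memory: zhparser.dict_in_memory = f
Compound short words: zhparser.multi_short = f
Load local Chinese dictionaries, which take priority over the default one: zhparser.extra_dicts = 'dict_extra.txt,mydict.xdb'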
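
As a sketch of turning these options on, the statements below assume that punctuation_ignore and multi_short can be changed per session with SET, which is how the zhparser README appears to describe them. The same README indicates that dict_in_memory and extra_dicts need to be set in the server configuration before a backend starts, so they are left out here. Check the README of the version you install before relying on either assumption.

```sql
-- Enable two of the options for the current session only.
SET zhparser.punctuation_ignore = on;
SET zhparser.multi_short = on;

-- Confirm the current values.
SHOW zhparser.punctuation_ignore;
SHOW zhparser.multi_short;
```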

Add token mappings to the text search configuration

ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;

List the token types

select ts_token_type('zhparser');
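
To see how a sentence is actually segmented, PostgreSQL's built-in ts_parse and ts_debug functions can be pointed at the parser and the testzhcfg configuration created above. This is a small sketch using the same sample sentence as this post's test query. The exact tokens depend on the SCWS dictionary and the options in effect, so no output is shown.

```sql
-- Raw tokens produced by the zhparser parser, with their token type ids.
SELECT * FROM ts_parse('zhparser', '大家想一起吃大红薯吗');

-- Token type, dictionaries consulted, and resulting lexemes for each token.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('testzhcfg', '大家想一起吃大红薯吗');
```

Tokens whose type is not covered by the ADD MAPPING statement are dropped from the tsvector, which is worth checking here when an expected word fails to match.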

Trying it out

select to_tsvector('testzhcfg', '大家想一起吃大红薯吗') as ts_v,
plainto_tsquery('testzhcfg', '吃红薯') as ts_q,
to_tsvector('testzhcfg', '大家想一起吃大红薯吗') @@ plainto_tsquery('testzhcfg', '吃红薯') as check_status;
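
In practice the documents usually live in a table. The following is a minimal sketch under stated assumptions: the articles table, its columns and the index name are hypothetical, and testzhcfg is the configuration created earlier. It adds a GIN expression index and orders hits with the built-in ts_rank function.

```sql
-- Hypothetical table; the name and columns are for illustration only.
CREATE TABLE articles (
    id   serial PRIMARY KEY,
    body text NOT NULL
);

-- Expression index so that @@ searches do not have to scan every row.
CREATE INDEX articles_body_fts_idx
    ON articles USING gin (to_tsvector('testzhcfg', body));

INSERT INTO articles (body) VALUES
    ('大家想一起吃大红薯吗'),
    ('今天天气不错');

-- The WHERE expression must match the indexed expression to use the index.
SELECT id, body,
       ts_rank(to_tsvector('testzhcfg', body), q) AS rank
FROM articles, plainto_tsquery('testzhcfg', '吃红薯') AS q
WHERE to_tsvector('testzhcfg', body) @@ q
ORDER BY rank DESC;
```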

Notes

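plainto_tsquery: joins the parsed terms with the & operator
phraseto_tsquery: joins the parsed terms with the <-> operator
websearch_to_tsquery: more flexible; it can parse operators written in the search input. See the PostgreSQL documentation for details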
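
The difference between the three functions is easiest to see side by side. The sketch below uses the built-in english configuration and inputs taken from the PostgreSQL documentation, so it runs without zhparser.

```sql
-- Terms are ANDed together; stop words such as "The" are dropped.
SELECT plainto_tsquery('english', 'The Fat Rats');
-- 'fat' & 'rat'

-- Terms must appear next to each other, in this order.
SELECT phraseto_tsquery('english', 'The Fat Rats');
-- 'fat' <-> 'rat'

-- Quoted phrases and the word "or" in the input are interpreted.
SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
-- 'sad' <-> 'cat' | 'fat' <-> 'rat'
```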
