vtnews 2020-07-29
The tokenizer that ships with Elasticsearch is designed for English. Applied to Chinese text, it splits the input into individual characters rather than words, which gives poor results, so an index that contains Chinese needs a Chinese analyzer plugin (here, IK).
docker pull elasticsearch:7.8.0
# Create a docker directory under the Linux root directory and enter it
mkdir /docker
cd /docker

# Download the IK plugin archive (if wget is missing, first run: yum install -y wget, then retry the download)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.8.0/elasticsearch-analysis-ik-7.8.0.zip

# Optional: if wget is too slow, download the file in a browser and upload it to the Linux host instead
# (if rz is missing, first run: yum install -y lrzsz, then run the upload command and pick elasticsearch-analysis-ik-7.8.0.zip)
rz

# Unzip (if unzip is missing, first run: yum install -y unzip, then retry the unzip command)
unzip elasticsearch-analysis-ik-7.8.0.zip -d elasticsearch-analysis-ik
Note: the Elasticsearch image version must exactly match the IK analyzer version (I tried the elasticsearch:7.8.1 image with the elasticsearch-analysis-ik-7.8.0 plugin, and the resulting image was unusable).
vi DockerFile
FROM elasticsearch:7.8.0
ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
docker build -f DockerFile -t elasticsearch-ik:7.8.0 .
The image builds successfully:
[ elasticsearch-ik]# docker build -f DockerFile -t elasticsearch-ik:7.8.0 .
Sending build context to Docker daemon  14.39MB
Step 1/2 : FROM elasticsearch:7.8.0
 ---> 121454ddad72
Step 2/2 : ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
 ---> Using cache
 ---> 2af03d5426d3
Successfully built 2af03d5426d3
Successfully tagged elasticsearch-ik:7.8.0
docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch_test elasticsearch-ik:7.8.0
curl localhost:9200
Output like the following means the container started successfully:
[ docker]# curl localhost:9200
{
  "name" : "9f832bbeb44a",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "8GAjHyQEToO6PMl8dDoemQ",
  "version" : {
    "number" : "7.8.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date" : "2020-06-14T19:35:50.234439Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Postman is used for the requests below.
Request URL: http://192.168.0.199:9200/_analyze
Request method: POST
Request body format:
{
  "analyzer": "chinese",
  "text": "今天是个好日子"
}
Parameter description:
analyzer: one of chinese | ik_max_word | ik_smart. chinese is a built-in Elasticsearch option (as the first result below shows, it splits Chinese into single characters, like the default analyzer); ik_max_word (finest-grained segmentation) and ik_smart (coarsest segmentation) come from the IK plugin.
text: the content to tokenize.
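Outside Postman, the same request can be built in a few lines of Python. The sketch below is just an illustration: it uses the host and port from the URL above, and actually sending the request assumes the container from the previous section is running.

```python
import json
import urllib.request

# Host/port taken from the Postman example above; adjust for your environment.
ES_ANALYZE_URL = "http://192.168.0.199:9200/_analyze"

def build_request(analyzer: str, text: str) -> urllib.request.Request:
    """Build the POST /_analyze request with a JSON body, mirroring the Postman setup."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        ES_ANALYZE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires a reachable Elasticsearch at ES_ANALYZE_URL):
#   with urllib.request.urlopen(build_request("ik_smart", "今天是个好日子")) as resp:
#       print(json.load(resp))
```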
{
  "analyzer": "chinese",
  "text": "今天是个好日子"
}
Result:
{
  "tokens": [
    { "token": "今", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "天", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "个", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "好", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "日", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "子", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }
  ]
}
{
  "analyzer": "ik_smart",
  "text": "今天是个好日子"
}
Result:
{
  "tokens": [
    { "token": "今天是", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "个", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 1 },
    { "token": "好日子", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 2 }
  ]
}
{
  "analyzer": "ik_max_word",
  "text": "今天是个好日子"
}
Result:
{
  "tokens": [
    { "token": "今天是", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "今天", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "是", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 2 },
    { "token": "个", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 },
    { "token": "好日子", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "日子", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 5 }
  ]
}
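The three analyzers are easier to compare if you strip the responses down to just the token strings. A small helper for that, exercised on the ik_smart result from this section (abbreviated to the fields it uses):

```python
def token_strings(analyze_response: dict) -> list[str]:
    """Extract just the token text from an _analyze response body."""
    return [t["token"] for t in analyze_response["tokens"]]

# The ik_smart result from above, reduced to the fields used here.
ik_smart_result = {
    "tokens": [
        {"token": "今天是", "type": "CN_WORD", "position": 0},
        {"token": "个", "type": "CN_CHAR", "position": 1},
        {"token": "好日子", "type": "CN_WORD", "position": 2},
    ]
}

print(token_strings(ik_smart_result))  # ['今天是', '个', '好日子']
```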
The other part of the pipeline first runs clustering and classification, and stores the aggregated categories in a clustering index on the ES cluster. The aggregation results from the data-processing layer are written to a designated ES index, and the data belonging to each aggregated topic is stored under a field of the corresponding document.
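As a rough illustration of that layout only: the index name, field names, and values below are hypothetical assumptions, not taken from the text. One aggregated topic becomes one document, with the topic's member data stored under a single field:

```python
import json

# Hypothetical document for the clustering index; every name here is an
# illustrative assumption, not part of the original setup.
topic_doc = {
    "topic_id": "t-001",             # hypothetical id of the aggregated topic
    "topic_label": "example topic",  # hypothetical label produced by clustering
    "members": [                     # per-topic data kept under one field of the document
        {"doc_id": "d-1"},
        {"doc_id": "d-2"},
    ],
}

print(json.dumps(topic_doc, ensure_ascii=False))
```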