Building an ElasticSearch Image with the IK Chinese Analyzer Plugin Using a DockerFile

vtnews 2020-07-29

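Why install the IK Chinese analyzer?

The analyzers that ship with ES are designed for English. Applied to Chinese text, they split it into single characters rather than words, which works poorly for search, so an index that contains Chinese text needs a Chinese analyzer plugin.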

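I. Environment and File Preparation

Environment:
  • VMWare version: 15.5.5
  • Operating system: CentOS7
  • Docker version: 19.03.12
Files:
  • Pull the ElasticSearch image, version 7.8.0
    docker pull elasticsearch:7.8.0
  • Download the Chinese analyzer plugin, version 7.8.0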
# Create a docker folder under the Linux root directory and enter it
mkdir /docker
cd /docker
# Download the IK plugin (if wget is not found, run `yum install -y wget` first, then run the download command)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.8.0/elasticsearch-analysis-ik-7.8.0.zip
# Optional: if wget is too slow, download the file in a browser and upload it to the Linux host (if rz is not found, run `yum install -y lrzsz` first, then run the upload command and select elasticsearch-analysis-ik-7.8.0.zip)
rz
# Unzip (if unzip is not found, run `yum install -y unzip` first, then run the unzip command)
unzip elasticsearch-analysis-ik-7.8.0.zip -d elasticsearch-analysis-ik

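Note: the ElasticSearch image version must match the IK analyzer plugin version. (I tried the elasticsearch:7.8.1 image with the elasticsearch-analysis-ik-7.8.0 plugin, and the resulting image was unusable.)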
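The version rule in the note can be checked before building. The following is a minimal Python sketch, assuming the image tag and plugin file name follow the formats used in this post; the function names are hypothetical and are not part of Docker, ElasticSearch, or the IK plugin.

```python
import re


def image_version(image_tag: str) -> str:
    """Return the version part of a tag such as 'elasticsearch:7.8.0'."""
    _name, separator, version = image_tag.rpartition(":")
    if not separator or not version:
        raise ValueError(f"no version in image tag: {image_tag!r}")
    return version


def plugin_version(zip_name: str) -> str:
    """Return the version embedded in an IK release file name.

    Assumes the naming used in this post:
    elasticsearch-analysis-ik-<major>.<minor>.<patch>.zip
    """
    match = re.fullmatch(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip", zip_name)
    if match is None:
        raise ValueError(f"unexpected plugin file name: {zip_name!r}")
    return match.group(1)


def versions_match(image_tag: str, zip_name: str) -> bool:
    """True when the image and the plugin carry exactly the same version."""
    return image_version(image_tag) == plugin_version(zip_name)
```

With the values from this post, `versions_match("elasticsearch:7.8.0", "elasticsearch-analysis-ik-7.8.0.zip")` is true, while the 7.8.1 image paired with the 7.8.0 plugin is rejected.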

II. Build the Image and Start the Container

1. Create the DockerFile: in the docker folder, run vi DockerFile and enter the following content
FROM elasticsearch:7.8.0
ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
2. Build the image: from the docker folder, run docker build -f DockerFile -t elasticsearch-ik:7.8.0 .

Output of a successful build:

[ elasticsearch-ik]# docker build -f DockerFile -t elasticsearch-ik:7.8.0 .
Sending build context to Docker daemon  14.39MB
Step 1/2 : FROM elasticsearch:7.8.0
 ---> 121454ddad72
Step 2/2 : ADD elasticsearch-analysis-ik /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik
 ---> Using cache
 ---> 2af03d5426d3
Successfully built 2af03d5426d3
Successfully tagged elasticsearch-ik:7.8.0
3. Create and start the container

docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch_test elasticsearch-ik:7.8.0

4. Verify that ElasticSearch started: curl localhost:9200

A response like the following means it started successfully:

[ docker]# curl localhost:9200
{
  "name" : "9f832bbeb44a",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "8GAjHyQEToO6PMl8dDoemQ",
  "version" : {
    "number" : "7.8.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date" : "2020-06-14T19:35:50.234439Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
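The container usually needs a few seconds after docker run before it answers on port 9200, so the check may have to be retried. A minimal Python sketch of the same verification, using only the standard library; the helper names and the retry settings are assumptions made for illustration.

```python
import json
import time
import urllib.request


def fetch_root_info(base_url: str, timeout: float = 5.0) -> dict:
    """GET the root endpoint (what `curl localhost:9200` does) and parse the JSON."""
    with urllib.request.urlopen(base_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def reported_version(root_info: dict) -> str:
    """Extract version.number from the root endpoint response."""
    return root_info["version"]["number"]


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0) -> dict:
    """Call probe() until it returns instead of raising OSError.

    Connection failures from urllib (URLError, refused or reset connections)
    are subclasses of OSError, so they are retried here.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return probe()
        except OSError as error:
            last_error = error
            time.sleep(delay)
    raise TimeoutError(f"not ready after {attempts} attempts") from last_error
```

On the Docker host this could be called as `wait_until_ready(lambda: fetch_root_info("http://localhost:9200"))`, after which `reported_version(...)` should return "7.8.0" for the image built above.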

III. Test the Analyzer

Postman is used here.
Request URL: http://192.168.0.199:9200/_analyze
Request method: POST
Request body format:

{
    "analyzer": "chinese",
    "text": "今天是个好日子"
}

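Parameters:
analyzer: one of chinese, ik_max_word, or ik_smart. chinese is an analyzer built into ES; ik_max_word (finest-grained segmentation) and ik_smart (coarsest segmentation, fewest tokens) are provided by the IK Chinese analyzer plugin.
text: the content to analyze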
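For readers without Postman, the same request can be sent from a script. This is a minimal Python sketch using only the standard library; the function names are hypothetical, and the host address is the one used in this post, so adjust it to your own environment.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build the POST request for the _analyze endpoint with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url: str, analyzer: str, text: str, timeout: float = 5.0) -> dict:
    """Send the request and return the parsed response (a dict with a 'tokens' list)."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `analyze("http://192.168.0.199:9200", "ik_smart", "今天是个好日子")` sends the same body as the ik_smart test in this section.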

1. Test with the built-in analyzer
{
    "analyzer": "chinese",
    "text": "今天是个好日子"
}

Result:

{
    "tokens": [
        {
            "token": "今",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "天",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "个",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "好",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "日",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "子",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        }
    ]
}
2. Test with the IK analyzer ik_smart
{
    "analyzer": "ik_smart",
    "text": "今天是个好日子"
}

Result:

{
    "tokens": [
        {
            "token": "今天是",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "个",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "好日子",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}
3. Test with the IK analyzer ik_max_word
{
    "analyzer": "ik_max_word",
    "text": "今天是个好日子"
}

Result:

{
    "tokens": [
        {
            "token": "今天是",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "个",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "好日子",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "日子",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 5
        }
    ]
}
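To see at a glance what ik_max_word adds over ik_smart, the token lists of the two responses can be compared. A minimal Python sketch; the helper names are hypothetical, and the sample data is the two IK results above reduced to their "token" fields.

```python
def token_texts(analyze_response: dict) -> list:
    """Return the token strings of an _analyze response, in order."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def extra_tokens(fine: dict, coarse: dict) -> list:
    """Tokens that appear in the fine-grained result but not in the coarse one."""
    coarse_tokens = set(token_texts(coarse))
    return [token for token in token_texts(fine) if token not in coarse_tokens]


# The ik_smart and ik_max_word results shown above, keeping only "token".
IK_SMART = {"tokens": [{"token": "今天是"}, {"token": "个"}, {"token": "好日子"}]}
IK_MAX_WORD = {
    "tokens": [
        {"token": "今天是"},
        {"token": "今天"},
        {"token": "是"},
        {"token": "个"},
        {"token": "好日子"},
        {"token": "日子"},
    ]
}

print(extra_tokens(IK_MAX_WORD, IK_SMART))  # prints ['今天', '是', '日子']
```

The extra tokens are the shorter words nested inside the longer ones, which is what "finest-grained segmentation" means in practice.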
