xiaoxiaokeke 2019-03-25
Machine Translation
Two schools of thought:
Babel Fish: the world's first web translation tool, launched in 1997 by the AltaVista search engine.
A brief introduction to Seq2Seq modeling: it is used for NLP tasks such as text summarization, speech recognition, and DNA sequence modeling.
A typical seq2seq model has two main components (sketched below):
a) an encoder
b) a decoder
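For intuition, here is a minimal sketch of the encoder-decoder idea in Keras. It is illustrative only: the vocabulary sizes and unit count are hypothetical, the inputs are one-hot encoded for simplicity, and the actual model used in this post is defined in step 3 below.

from keras.models import Model
from keras.layers import Input, LSTM, Dense

src_vocab, tgt_vocab, units = 1000, 1000, 64  # hypothetical sizes

# encoder: reads the (one-hot encoded) source sequence and compresses it into its final states
enc_inputs = Input(shape=(None, src_vocab))
_, state_h, state_c = LSTM(units, return_state=True)(enc_inputs)

# decoder: generates the target sequence, initialized with the encoder states
dec_inputs = Input(shape=(None, tgt_vocab))
dec_seq = LSTM(units, return_sequences=True)(dec_inputs, initial_state=[state_h, state_c])
dec_outputs = Dense(tgt_vocab, activation='softmax')(dec_seq)

model = Model([enc_inputs, dec_inputs], dec_outputs)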
Implementation in Python using Keras:
We will use the Chinese-English sentence pair data from http://www.manythings.org/anki/. The folder is named cmn-eng and contains cmn.txt.
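If you have not downloaded the file yet, a minimal sketch of the download step follows, assuming the archive is available at http://www.manythings.org/anki/cmn-eng.zip (verify the exact link on the site):

import urllib.request
import zipfile

# download and unpack the Chinese-English sentence pairs
urllib.request.urlretrieve('http://www.manythings.org/anki/cmn-eng.zip', 'cmn-eng.zip')
with zipfile.ZipFile('cmn-eng.zip') as zf:
    zf.extract('cmn.txt')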
1. Import the required libraries:
# libraries for data handling, modeling and plotting
import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint
from keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_colwidth', 200)
2. Read the data into our IDE:
# function to read a raw text file
def read_text(filename):
    # open the file
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    file.close()
    return text

# split the text into sentence pairs (one pair per line, tab-separated)
def to_lines(text):
    sents = text.strip().split('\n')
    sents = [i.split('\t') for i in sents]
    return sents

data = read_text("cmn.txt")
cmn_eng = to_lines(data)
cmn_eng = array(cmn_eng)
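As a quick sanity check (illustrative; the exact shape depends on the file version, which may carry an extra attribution column), inspect the array:

# one row per sentence pair: English in column 0, Chinese in column 1
print(cmn_eng.shape)
print(cmn_eng[:3])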
We use only 50,000 sentence pairs to keep the model's training time manageable.
cmn_eng = cmn_eng[:50000,:]
Text preprocessing:
Most real-world text data is unstructured, so we clean it up before modeling.
a) Text cleaning
# remove punctuation from both columns
cmn_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in cmn_eng[:,0]]
cmn_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in cmn_eng[:,1]]
cmn_eng
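Note that string.punctuation covers ASCII punctuation only, so full-width Chinese marks survive the step above. A hedged extension for the Chinese column (the character set below is an illustrative, non-exhaustive assumption):

# strip common full-width Chinese punctuation from the Chinese column
cn_punct = ',。!?、;:「」『』《》()'
cmn_eng[:,1] = [s.translate(str.maketrans('', '', cn_punct)) for s in cmn_eng[:,1]]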
b) Text-to-sequence conversion:
We capture the lengths of all the sentences in two separate lists, one for English and one for Chinese. (Note that Chinese is not whitespace-delimited, so split() treats an unsegmented Chinese sentence as a single token; character-level or word-segmented counting would be more accurate.)
# empty lists
eng_l = []
cmn_l = []

# populate the lists with sentence lengths
for i in cmn_eng[:,0]:
    eng_l.append(len(i.split()))
for i in cmn_eng[:,1]:
    cmn_l.append(len(i.split()))

length_df = pd.DataFrame({'eng':eng_l, 'cmn':cmn_l})
length_df.hist(bins=30)
plt.show()
Next, we vectorize the text data with the Keras Tokenizer() class and, guided by the length histograms above, pad every sequence to a fixed length of 8 tokens.
# function to build a tokenizer
def tokenization(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# prepare the English tokenizer
eng_tokenizer = tokenization(cmn_eng[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = 8
print('English Vocabulary Size: %d' % eng_vocab_size)

# prepare the Chinese tokenizer
cmn_tokenizer = tokenization(cmn_eng[:, 1])
cmn_vocab_size = len(cmn_tokenizer.word_index) + 1
cmn_length = 8
print('Chinese Vocabulary Size: %d' % cmn_vocab_size)

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer-encode the sequences
    seq = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return seq
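To see what encode_sequences produces, here is an illustrative call on a made-up sentence (the exact ids depend on the fitted tokenizer):

# integer-encode and zero-pad a sample English sentence to length 8
sample = encode_sequences(eng_tokenizer, eng_length, ['tom is a good boy'])
print(sample)  # e.g. a 1 x 8 array of word ids followed by 0s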
3. Model building:
We now split the data into training and test sets for model training and evaluation.
from sklearn.model_selection import train_test_split

# split the data into train and test sets
train, test = train_test_split(cmn_eng, test_size=0.2, random_state=12)

# prepare training data
trainX = encode_sequences(cmn_tokenizer, cmn_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])

# prepare validation data
testX = encode_sequences(cmn_tokenizer, cmn_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
Now we define our Seq2Seq model architecture. The encoder LSTM compresses the input sequence into a fixed-length vector, RepeatVector copies that vector once per output timestep, and the decoder LSTM unrolls it into the target sequence:
# build the NMT model
def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    return model

# model instantiation
model = define_model(cmn_vocab_size, eng_vocab_size, cmn_length, eng_length, 512)
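It can help to inspect the resulting layer stack; model.summary() prints each layer's output shape and parameter count:

# print the architecture: Embedding -> LSTM -> RepeatVector -> LSTM -> Dense
model.summary()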
We compile the model with the RMSprop optimizer:
rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')
sparse_categorical_crossentropy is used as the loss function because it accepts the target sequence as integer ids directly, instead of requiring one-hot encoded targets, which saves memory with a large vocabulary.
4. Train the model
We train for 30 epochs with a batch size of 512 and a validation split of 20%: 80% of the data is used to train the model and the rest to evaluate it. We also use the ModelCheckpoint() callback to save only the model with the lowest validation loss.
filename = 'model.h1.24_jan_19'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# train the model; trainY is reshaped to (samples, timesteps, 1) for the sparse loss
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30, batch_size=512,
                    validation_split=0.2, callbacks=[checkpoint], verbose=1)
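Once training finishes, it is worth plotting the training and validation loss curves from the returned History object (a standard Keras pattern; 'loss' and 'val_loss' are the keys recorded by fit() by default):

# visualize training vs. validation loss over the 30 epochs
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()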
5. Load the saved model and make predictions on the unseen data, testX.
model = load_model('model.h1.24_jan_19')
preds = model.predict_classes(testX.reshape((testX.shape[0], testX.shape[1])))
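preds holds integer word ids, so we still need to map them back to English words. A minimal sketch follows; get_word is a helper introduced here for illustration, not part of the code above:

# reverse-lookup a word id in the tokenizer's vocabulary
def get_word(n, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == n:
            return word
    return None

# convert each predicted sequence into a space-joined English sentence
preds_text = []
for seq in preds:
    words = [get_word(idx, eng_tokenizer) for idx in seq]
    preds_text.append(' '.join(w for w in words if w is not None))

print(preds_text[:5])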