ming00 2020-04-09
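Bayesian (naive Bayes) algorithms are mainly used to predict the class of a data item.
The following is a spam classification example.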
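To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing in plain Python. The toy messages and the helper names `fit`, `log_posteriors`, and `predict` are assumptions made up for illustration; they are not taken from sms_spam.txt or from any library.

```python
import math
from collections import Counter

# Toy training data, made up for illustration (not from sms_spam.txt)
train = [
    ("ham", "call me later"),
    ("ham", "see you at lunch"),
    ("spam", "win cash now"),
    ("spam", "call now to win"),
]

def fit(samples, alpha=1.0):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in samples:
        class_counts[label] += 1
        words = text.split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, alpha

def log_posteriors(model, text):
    """Return log P(class) + sum of log P(word | class) for each class."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Laplace (add-alpha) smoothing keeps unseen word/class pairs non-zero
            p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores

def predict(model, text):
    """Pick the class with the highest log posterior."""
    scores = log_posteriors(model, text)
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "win cash"))       # spam
print(predict(model, "see you later"))  # ham
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.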
Data
type,text
ham,00 00 00 are 0089 0089 having a good week. Just checking in
ham,K..give back my thanks.
ham,Am also doing in cbe only. But have to pay.
spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
ham,Aiya we discuss later lar... Pick u up at 4 is it?
ham,Are you this much buzy
ham,Please ask mummy to call father
spam,Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
ham,"fyi I'm at usf now, swing by the room whenever"
ham,"Sure thing big man. i have hockey elections at 6, shouldn't go on longer than an hour though"
ham,I anything lor...
ham,"By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
ham,"Hmm well, night night "
ham,K I'll be sure to get up before noon and see what's what
ham,Ha ha cool cool chikku chikku:-):-DB-)
ham,Darren was saying dat if u meeting da ge den we dun meet 4 dinner. Cos later u leave xy will feel awkward. Den u meet him 4 lunch lor.
ham,He dint tell anything. He is angry on me that why you told to abi.
ham,Up to u... u wan come then come lor... But i din c any stripes skirt...
spam,"U can WIN £100 of Music Gift Vouchers every week starting NOW Txt the word DRAW to 87066 TsCs www.ldew.com SkillGame,1Winaweek, age16.150ppermessSubscription"
ham,2mro i am not coming to gym machan. Goodnight.
ham,ARR birthday today:) i wish him to get more oscar.
ham,Reading gud habit.. Nan bari hudgi yorge pataistha ertini kano:-)
ham,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones"
ham,"Could you not read me, my Love ? I answered you"
ham,So what did the bank say about the money?
ham,Well if I'm that desperate I'll just call armand again
ham,"Fuuuuck I need to stop sleepin, sup"
ham,So how's the weather over there?
ham,Ok thanx...
ham,Ok.ok ok..then..whats ur todays plan
ham,1Apple/Day=No Doctor. 1Tulsi Leaf/Day=No Cancer. 1Lemon/Day=No Fat. 1Cup Milk/day=No Bone Problms 3 Litres Watr/Day=No Diseases Snd ths 2 Whom U Care..:-)
ham,"Sorry, I'll call later"
ham,Will do. Was exhausted on train this morning. Too much wine and pie. You sleep well too
spam,U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
ham,Ron say fri leh. N he said ding tai feng cant make reservations. But he said wait lor.
ham,"Call me when you/carlos is/are here, my phone's vibrate is acting up and I might not hear texts"
ham,Oh k :)why you got job then whats up?
spam,"SPJanuary Male Sale! Hot Gay chat now cheaper, call 08709222922. National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call 08712460324 (10p/min)"
ham,Yeah you should. I think you can use your gt atm now to register. Not sure but if there's anyway i can help let me know. But when you do be sure you are ready.
ham,Nationwide auto centre (or something like that) on Newport road. I liked them there
ham,He is there. You call and meet him
ham,Yeah sure I'll leave in a min
spam,URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09061790121 from land line. Claim 3030. Valid 12hrs only 150ppm
ham,"Mah b, I'll pick it up tomorrow"
ham,Then she dun believe wat?
ham,I've sent u my part..
Python implementation
# Naive Bayes spam classifier with scikit-learn
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

if __name__ == '__main__':
    corpus = []
    labels = []
    corpus_test = []
    labels_test = []

    # Read the file. csv.reader keeps quoted text fields that contain commas intact;
    # a plain line.split(",") would cut the text at its first comma.
    with open("../../sms_spam.txt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        # Skip the header row (type,text)
        next(reader)
        # count is the row number in the file; the first data row is row 2
        for count, row in enumerate(reader, start=2):
            # Skip blank or malformed rows so corpus and labels stay aligned
            if len(row) < 2 or row[0] not in ("ham", "spam"):
                continue
            # Feature: the message text
            sentence = row[1]
            # Target: convert the string label to 0 (ham) or 1 (spam)
            label = 0 if row[0] == "ham" else 1
            # Build the training set
            corpus.append(sentence)
            labels.append(label)
            # Build the test set from the rows after row 5550.
            # Note that these rows stay in the training set too, so the
            # predictions below are not a held-out evaluation.
            if count > 5550:
                corpus_test.append(sentence)
                labels_test.append(label)

    # Build the training features.
    # CountVectorizer tokenizes each document into words and converts it
    # into a sparse vector of word counts.
    vectorizer = CountVectorizer()
    fea_train = vectorizer.fit_transform(corpus)
    # The sorted vocabulary forms the feature dimensions
    # (get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names())
    print(vectorizer.get_feature_names_out())
    # Word counts of each message along those feature dimensions
    print(fea_train.toarray())

    # Build the test features with the vocabulary already fitted on the training set.
    # Words that appear only in the test set are not counted.
    fea_test = vectorizer.transform(corpus_test)
    print(fea_test.toarray())

    # Create the naive Bayes model and fit it on the training data.
    # alpha=1 is Laplace smoothing: 1 is added to every word count.
    clf = MultinomialNB(alpha=1)
    clf.fit(fea_train, labels)

    # Run the test data through the model to get the predictions
    pred = clf.predict(fea_test)
    for p in pred:
        if p == 0:
            print("normal mail")
        else:
            print("spam mail")
    # Print each prediction next to the true label
    for i in range(len(pred)):
        print(pred[i], "\t", labels_test[i])
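The script above prints each prediction next to its true label, which is hard to read for more than a few rows. A minimal sketch of how those two lists could be summarized follows; the helper name `evaluate` and the sample values are assumptions for illustration, not output from the actual model.

```python
def evaluate(pred, labels_test):
    """Summarize predictions, where 0 = ham and 1 = spam."""
    tp = tn = fp = fn = 0
    for p, y in zip(pred, labels_test):
        if p == 1 and y == 1:
            tp += 1  # spam correctly flagged
        elif p == 0 and y == 0:
            tn += 1  # ham correctly passed
        elif p == 1 and y == 0:
            fp += 1  # ham wrongly flagged as spam
        else:
            fn += 1  # spam that slipped through
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
    }

# Made-up example values standing in for pred and labels_test
sample_pred = [0, 0, 1, 1, 0, 1]
sample_labels = [0, 0, 1, 0, 1, 1]
result = evaluate(sample_pred, sample_labels)
print(result["confusion"])  # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```

Because ham is far more common than spam in this kind of data, precision and recall on the spam class are usually more informative than accuracy alone.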
Spark implementation
package com.sunbin

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object Naive_bayes {

  def main(args: Array[String]): Unit = {
    // 1. Build the Spark context
    val conf = new SparkConf().setMaster("local[2]").setAppName("bayes")
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)

    val data_path1 = "sms_spam.txt"
    val lines = sc.textFile(data_path1, 2)
    val tf = new HashingTF(numFeatures = 100000)

    // Build the data set
    val parsedData = lines
      // Keep only labeled rows; this also drops the header row (type,text)
      .filter(line => line.startsWith("ham,") || line.startsWith("spam,"))
      .map { line =>
        // Split on the first comma only, because the text itself may contain commas
        val parts = line.split(",", 2)
        // Hash the words of the text into a term-frequency vector
        val features = tf.transform(parts(1).split(" "))
        // Label: 0.0 = ham, 1.0 = spam
        if (parts(0) == "ham") LabeledPoint(0.0, features)
        else LabeledPoint(1.0, features)
      }
    parsedData.cache()

    // Split the data into a training set (90%) and a test set (10%)
    val splits = parsedData.randomSplit(Array(0.9, 0.1), seed = 1L)
    val train = splits(0)
    val test = splits(1)

    // Train the model; lambda = 1.0 is the smoothing parameter
    val model = NaiveBayes.train(train, lambda = 1.0)

    // Predict the test data and pair each prediction with the true label
    val predictionAndLabel = test.map { p =>
      val prediction = model.predict(p.features)
      println(s"$prediction ${p.label}")
      (prediction, p.label)
    }
    // count() is an action, so it triggers the map above
    predictionAndLabel.count()
  }
}
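The Spark version relies on HashingTF to turn each message into a fixed-size feature vector without building a vocabulary first. The idea behind this hashing trick can be sketched in plain Python as below. The function name `hashing_tf` is hypothetical, and `zlib.crc32` is only a stand-in hash: Spark's HashingTF uses its own hash function, so the indices will not match Spark's.

```python
import zlib

def hashing_tf(tokens, num_features=100000):
    """Map tokens to a sparse {index: count} term-frequency vector."""
    vector = {}
    for token in tokens:
        # Stand-in hash (assumption); Spark's HashingTF hashes differently
        index = zlib.crc32(token.encode("utf-8")) % num_features
        # Tokens that hash to the same index share one slot (a collision)
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector

features = hashing_tf("win cash now win".split(" "))
# Every token adds 1.0 to exactly one slot, so the counts sum to the token count
print(sum(features.values()))  # 4.0
```

A larger `num_features` lowers the chance that two different words collide in one slot, at the cost of a wider (but still sparse) vector.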