Winifred 2015-07-29
Classification(2) NLP and Classifier Implementation
1. Generate the Feature Map
NLP - Natural Language Processing
remove the noise, remove the HTML tags, remove the stop words (for example, "of", "a" in English, 的, 啊 in Chinese)
stem (for example, change "stopped" to "stop")
NLP for Chinese
https://github.com/xpqiu/fnlp/
NLP for English
Stanford
http://nlp.stanford.edu/software/index.shtml
http://nlp.stanford.edu/software/corenlp.shtml
http://nlp.stanford.edu/software/segmenter.shtml
http://nlp.stanford.edu/software/tagger.shtml
http://nlp.stanford.edu/software/CRF-NER.shtml
http://nlp.stanford.edu/software/lex-parser.shtml
http://nlp.stanford.edu/software/classifier.shtml
Apache OpenNLP
http://opennlp.apache.org/
Remove Stop Words
One source for stop words
https://raw.githubusercontent.com/muhammad-ahsan/WebSentiment/master/mit-Stopwords.txt
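The cleanup steps listed under NLP (strip HTML tags, drop stop words) can be sketched in plain Scala roughly as follows. This is only a minimal sketch: the object name TextCleanup, the tiny inline stop-word set, and the regex-based tag stripping are assumptions for illustration. A real run would load a full stop-word list, and Chinese text would first need word segmentation by a tool such as FNLP.

```scala
object TextCleanup {
  // Tiny illustrative stop-word set (assumption); a real list would be loaded from a file.
  val stopWords: Set[String] = Set("of", "a", "the", "and", "in", "to", "is")

  // Replace anything that looks like an HTML tag with a space (naive, regex-based).
  def stripHtml(text: String): String = text.replaceAll("<[^>]*>", " ")

  // Lowercase, split on anything that is not a letter or digit,
  // then drop empty tokens and stop words.
  def clean(text: String): List[String] =
    stripHtml(text)
      .toLowerCase
      .split("[^\\p{L}\\p{N}]+")
      .filter(_.nonEmpty)
      .filterNot(stopWords.contains)
      .toList
}
```

For example, TextCleanup.clean("<p>The price of a <b>book</b></p>") keeps only "price" and "book".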
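Porter Stemmer
strips common suffixes to reduce a word to its stem, for example 'stopped' -> 'stop'. Mapping an irregular form such as 'ate' to 'eat' is lemmatization, which a suffix-stripping stemmer does not do.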
coalesce function in Spark
decrease the number of partitions in the RDD to numPartitions.
TF-IDF
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
Term Frequency - Inverse Document Frequency
Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t, d) is the number of times that term t appears in document d.
The document frequency DF(t, D) is the number of documents that contain term t.
Inverse document frequency is a numerical measure of how much information a term provides:
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
|D| is the total number of documents in the corpus.
DF = String -> Int (term -> number of documents containing it)
IDF = String -> Double (term -> log value)
IDFs with Index = String -> (Double, Index)
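The definitions above can be sketched without Spark as plain Scala collections. This is a minimal sketch under stated assumptions: the object name TfIdfSketch and its method names are hypothetical, a document is assumed to be an already-cleaned sequence of terms, and the index is assigned in alphabetical order purely for illustration.

```scala
object TfIdfSketch {
  // TF(t, d): number of times each term appears in one document.
  def termFrequency(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => (term, occurrences.size) }

  // DF(t, D): number of documents that contain each term.
  def documentFrequency(corpus: Seq[Seq[String]]): Map[String, Int] =
    corpus.flatMap(_.distinct).groupBy(identity).map { case (term, docs) => (term, docs.size) }

  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), using the natural logarithm.
  def inverseDocumentFrequency(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size
    documentFrequency(corpus).map { case (term, df) =>
      (term, math.log((n + 1).toDouble / (df + 1)))
    }
  }

  // Pair each term's IDF with a feature index (alphabetical order, an assumption).
  def idfsWithIndex(corpus: Seq[Seq[String]]): Map[String, (Double, Int)] =
    inverseDocumentFrequency(corpus).toSeq.sortBy(_._1).zipWithIndex.map {
      case ((term, idf), index) => (term, (idf, index))
    }.toMap
}
```

With this smoothing, a term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the TF-IDF weight.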
2. Generate Training Data
It seems that Zeppelin can load a jar from a remote repository:
z.load("com.amazonaws:aws-java-sdk:1.10.4.1")
Amazon S3 Operation
import java.io.File
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import com.amazonaws.services.s3.transfer.TransferManager
import com.amazonaws.services.s3.transfer.Upload

/**
 * Upload a file to S3
 */
def uploadToS3(client: AmazonS3Client, bucket: String, key: String, file: File): Unit = {
  // Reuse the caller's client; the no-argument constructor would ignore it.
  val tm = new TransferManager(client)
  val upload = tm.upload(bucket, key, file)
  // Block until the transfer has finished.
  upload.waitForCompletion()
}

/**
 * Read a file's contents from S3
 */
def readFileContentsFromS3(client: AmazonS3Client, bucket: String, key: String): String = {
  val getObjectRequest = new GetObjectRequest(bucket, key)
  val responseHeaders = new ResponseHeaderOverrides()
  responseHeaders.setCacheControl("No-cache")
  getObjectRequest.setResponseHeaders(responseHeaders)
  val objectStream = client.getObject(getObjectRequest).getObjectContent()
  // Close the stream once the contents have been read.
  try scala.io.Source.fromInputStream(objectStream).getLines().mkString("\n")
  finally objectStream.close()
}
FeatureMap and Job
FeatureMap will read the feature files.
Job will parse the raw data from XML into objects, then get the features.
BinaryFeatureExtractor
Local Vector
Vectors.sparse(size, sortedElems)
Calculate and upload the binary label to S3
TFFeatureExtractor
TFIDFFeatureExtractor
TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
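The three extractors differ only in the value stored per feature index. The sketch below shows one way they might produce the sorted (index, value) pairs that these notes pass to Vectors.sparse(size, sortedElems). It is an assumption-laden illustration, not the actual BinaryFeatureExtractor, TFFeatureExtractor, or TFIDFFeatureExtractor code: the object name FeatureExtractors, the method names, and the feature-map shape (term -> (IDF, index)) are hypothetical.

```scala
object FeatureExtractors {
  // Sparse elements: (feature index, value) pairs sorted by index.
  type SparseElems = List[(Int, Double)]

  // Count occurrences of terms that exist in the feature map; unknown terms are ignored.
  private def counts(doc: Seq[String], features: Map[String, (Double, Int)]): Map[String, Int] =
    doc.filter(features.contains).groupBy(identity).map { case (t, occ) => (t, occ.size) }

  // Binary: 1.0 if the term occurs in the document at all.
  def binary(doc: Seq[String], features: Map[String, (Double, Int)]): SparseElems =
    counts(doc, features).toList.map { case (t, _) => (features(t)._2, 1.0) }.sortBy(_._1)

  // TF: the raw term count TF(t, d).
  def tf(doc: Seq[String], features: Map[String, (Double, Int)]): SparseElems =
    counts(doc, features).toList.map { case (t, n) => (features(t)._2, n.toDouble) }.sortBy(_._1)

  // TF-IDF: TF(t, d) * IDF(t, D).
  def tfIdf(doc: Seq[String], features: Map[String, (Double, Int)]): SparseElems =
    counts(doc, features).toList.map { case (t, n) => (features(t)._2, n * features(t)._1) }.sortBy(_._1)
}
```

Terms missing from the feature map are simply skipped here; whether the original extractors do the same is not stated in these notes.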
3. Classifier
Uniform Folding Mechanism
Validation code block
val msg = (positive, negative) match {
  case _ if folds <= 0 =>
    s"Invalid number of folds ($folds); Must be a positive integer."
  case _ if negative.isEmpty || positive.isEmpty =>
    "Insufficient number of samples " +
      s"(# positive: ${positive.size}, # negative: ${negative.size})!"
  case _ if positive.size < folds =>
    s"Insufficient number of positive samples (${positive.size}); " +
      s"Must be >= number of folds ($folds)!"
  case _ if negative.size < folds =>
    s"Insufficient number of negative samples (${negative.size}); " +
      s"Must be >= number of folds ($folds)!"
  case _ =>
    ""
}
isNullOrEmpty(msg) match {
  case false =>
    logger.error("Fold validation failed!")
    Some(new RuntimeException(msg))
  case true =>
    logger.info("Fold validation succeeded!")
    None
}
Merge the data and format them.
KFoldCrossValidator
Generate the TrainableSVM -> TrainedSVM
Validate -> ModelMetrics
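The folding and merging steps can be sketched as below. This is a hedged illustration, not the actual KFoldCrossValidator: the object name UniformFolding and both method names are hypothetical, each sample is assumed to already carry its own label, and a round-robin split is used so that every fold gets at least one positive and one negative sample once the validation above has passed.

```scala
import scala.util.Random

object UniformFolding {
  // Split items into exactly `folds` groups whose sizes differ by at most one.
  def split[A](items: Seq[A], folds: Int): List[List[A]] = {
    require(folds > 0 && items.size >= folds, "need at least one sample per fold")
    (0 until folds).toList.map { i =>
      items.zipWithIndex.collect { case (x, j) if j % folds == i => x }.toList
    }
  }

  // Shuffle each class separately, split it into folds, and merge fold i of the
  // positives with fold i of the negatives. Returns one (training, validation)
  // pair per fold, where training is every other fold combined.
  def kFold[A](positive: Seq[A], negative: Seq[A], folds: Int, rng: Random): List[(List[A], List[A])] = {
    val pos = split(rng.shuffle(positive), folds)
    val neg = split(rng.shuffle(negative), folds)
    val merged = pos.zip(neg).map { case (p, n) => p ++ n }
    merged.indices.toList.map { i =>
      val validation = merged(i)
      val training = merged.indices.toList.filter(_ != i).flatMap(j => merged(j))
      (training, validation)
    }
  }
}
```

Each (training, validation) pair would then go through the train and validate steps, with the per-fold metrics aggregated afterwards. Passing a seeded Random keeps the folds reproducible between runs.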
Scala Tips:
1. String Tail and Init
scala> val s = "123456"
s: String = 123456
scala> val s1 = s.tail
s1: String = 23456
scala> val s2 = s.init
s2: String = 12345
2. Tuple2
scala> val stuff = (42, "fish")
stuff: (Int, String) = (42,fish)
scala> stuff.getClass
res2: Class[_ <: (Int, String)] = class scala.Tuple2
scala> stuff._1
res3: Int = 42
scala> stuff._2
res4: String = fish
3. Scala Shuffle
scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res8: List[Int] = List(7, 1, 3, 9, 5, 8, 2, 6, 4)
scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res9: List[Int] = List(5, 1, 2, 6, 9, 4, 8, 7, 3)
4. Scala Grouped
scala> List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13).grouped(4).toList
res11: List[List[Int]] = List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10, 11, 12), List(13))
5. Scala List Zip
scala> List(1, 2, 3).zip(List("one", "two", "three"))
res12: List[(Int, String)] = List((1,one), (2,two), (3,three))
scala> List(1, 2, 3).zip(List("one", "two", "three", "four"))
res13: List[(Int, String)] = List((1,one), (2,two), (3,three))
6. List Operation
scala> val s1 = List(1, 2, 3, 4, 5, 6, 7).splitAt(3)
s1: (List[Int], List[Int]) = (List(1, 2, 3),List(4, 5, 6, 7))
scala> val t1 = s1._1.last
t1: Int = 3
scala> val t2 = s1._1.init
t2: List[Int] = List(1, 2)
scala> val t3 = s1._2
t3: List[Int] = List(4, 5, 6, 7)
References:
http://www.fnlp.org/archives/4231
example
http://www.cnblogs.com/linlu1142/p/3292982.html