Classification (2): NLP and Classifier Implementation

Winifred 2015-07-29

1. Generate the Feature Map

NLP - Natural Language Processing

Remove the noise, remove the HTML tags, and remove the stop words (for example, "of" and "a" in English; 的 and 啊 in Chinese).

Stem (for example, change "stopped" to "stop").
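
The cleaning steps above can be sketched in plain Scala. This is only a minimal illustration, not the pipeline used in this post: the object name TextCleaner, the tiny stop-word set, and the naiveStem helper are all assumptions made for the example. The stemmer here strips a few common suffixes and is far simpler than a real Porter stemmer. Chinese text would also need a proper word segmenter (such as FNLP) before stop-word removal, which this sketch does not attempt.

```scala
object TextCleaner {
  // Tiny illustrative stop-word list (an assumption); a real list is much longer.
  val stopWords: Set[String] = Set("of", "a", "an", "the", "and", "in", "的", "啊")

  private val vowels: Set[Char] = Set('a', 'e', 'i', 'o', 'u')

  // Replace anything that looks like an HTML tag with a space.
  def stripHtml(text: String): String = text.replaceAll("<[^>]*>", " ")

  // Lowercase and split on any run of non-letter characters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty).toList

  // Naive suffix stripping, NOT the Porter algorithm.
  def naiveStem(word: String): String = {
    val stripped = List("ing", "ed", "s")
      .find(s => word.endsWith(s) && word.length - s.length >= 3)
      .map(s => word.dropRight(s.length))
      .getOrElse(word)
    val n = stripped.length
    // Collapse a doubled final consonant left behind by the suffix: "stopp" -> "stop".
    if (stripped != word && n >= 2 && stripped(n - 1) == stripped(n - 2) && !vowels(stripped(n - 1)))
      stripped.dropRight(1)
    else stripped
  }

  // Full pipeline: strip tags, tokenize, drop stop words, stem.
  def clean(text: String): List[String] =
    tokenize(stripHtml(text)).filterNot(stopWords).map(naiveStem)
}
```

For example, TextCleaner.clean("<p>The dog stopped of a sudden</p>") yields List(dog, stop, sudden).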

NLP for Chinese

https://github.com/xpqiu/fnlp/

NLP for English

Stanford

http://nlp.stanford.edu/software/index.shtml

http://nlp.stanford.edu/software/corenlp.shtml

http://nlp.stanford.edu/software/segmenter.shtml

http://nlp.stanford.edu/software/tagger.shtml

http://nlp.stanford.edu/software/CRF-NER.shtml

http://nlp.stanford.edu/software/lex-parser.shtml

http://nlp.stanford.edu/software/classifier.shtml

Apache OpenNLP

http://opennlp.apache.org/

Remove Stop Words

One source for stop words

https://raw.githubusercontent.com/muhammad-ahsan/WebSentiment/master/mit-Stopwords.txt

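Porter Stemmer

Strips common suffixes so that related word forms share one stem, for example "stopped" -> "stop". Irregular forms such as "ate" -> "eat" are not handled by suffix stripping; they need a lemmatizer.

coalesce function in Spark

Decreases the number of partitions in the RDD to numPartitions.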

TF-IDF

http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

Term Frequency - Inverse Document Frequency

Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t, d) is the number of times that term t appears in document d.

The document frequency DF(t, D) is the number of documents that contain term t.

Inverse document frequency is a numerical measure of how much information a term provides:

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))

|D| is the total number of documents in the corpus.

DF: String -> Int (term -> number of documents containing it)

IDF: String -> Double (term -> log value)

IDFs with index: String -> (Double, Index)
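
The three maps above can be computed for a small in-memory corpus as follows. This is a sketch in plain Scala rather than the Spark MLlib implementation; the object name TfIdf and the choice to assign indexes in alphabetical term order are assumptions for the example.

```scala
object TfIdf {
  // TF(t, d): how many times each term appears in one document.
  def tf(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occ) => (term, occ.size) }

  // DF(t, D): number of documents that contain each term.
  def df(corpus: Seq[Seq[String]]): Map[String, Int] =
    corpus.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => (term, occ.size) }

  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
  def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size
    df(corpus).map { case (term, d) => (term, math.log((n + 1.0) / (d + 1.0))) }
  }

  // Attach a stable feature index to every term (alphabetical order, an assumption).
  def idfsWithIndex(corpus: Seq[Seq[String]]): Map[String, (Double, Int)] =
    idf(corpus).toSeq.sortBy(_._1).zipWithIndex
      .map { case ((term, weight), index) => (term, (weight, index)) }
      .toMap
}
```

With the corpus Seq(Seq("spark", "is", "fast"), Seq("spark", "is", "fun"), Seq("scala", "is", "fun")), the term "is" appears in all three documents, so its IDF is log(4 / 4) = 0, while "fast" appears in one document and gets log(4 / 2) = log 2.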

2. Generate Training Data

It seems to me that Zeppelin can load a jar from a remote repository:

z.load("com.amazonaws:aws-java-sdk:1.10.4.1")

Amazon S3 Operations

import java.io.File
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import com.amazonaws.services.s3.transfer.TransferManager
import com.amazonaws.services.s3.transfer.Upload

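/**
 * Upload a file to S3
 */
def uploadToS3(client: AmazonS3Client, bucket: String, key: String, file: File): Unit = {
  // Reuse the caller's client so its credentials and configuration apply
  val tm = new TransferManager(client)
  try {
    val upload = tm.upload(bucket, key, file)
    upload.waitForCompletion()
  } finally {
    // false: stop the transfer threads but leave the shared client open
    tm.shutdownNow(false)
  }
}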

/**
 * Read a file's contents from S3
 */
def readFileContentsFromS3(client: AmazonS3Client, bucket: String, key: String): String = {
  val getObjectRequest = new GetObjectRequest(bucket, key)
  val responseHeaders = new ResponseHeaderOverrides()
  responseHeaders.setCacheControl("No-cache")
  getObjectRequest.setResponseHeaders(responseHeaders)
  val objectStream = client.getObject(getObjectRequest).getObjectContent()
  // Close the stream after reading so the HTTP connection is released
  try scala.io.Source.fromInputStream(objectStream).getLines().mkString("\n")
  finally objectStream.close()
}

FeatureMap and Job

FeatureMap will read the features files.

Job will parse the raw data from XML into objects. Get features.

BinaryFeatureExtractor

Local Vector

Vectors.sparse(size, sortedElems)

Calculate and upload the binary labels to S3

TFFeatureExtractor

TFIDFFeatureExtractor

TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
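
To make the formula concrete, here is a small sketch that turns one tokenized document into the sorted (index, weight) pairs a sparse vector needs. It is plain Scala and not the TFIDFFeatureExtractor from this project; the object name TfIdfFeatures, the sparseElems helper, and the rule that unknown terms are skipped are assumptions for illustration.

```scala
object TfIdfFeatures {
  // Build (feature index, TF * IDF) pairs for one document, sorted by index.
  // idfsWithIndex maps term -> (IDF value, feature index).
  def sparseElems(doc: Seq[String],
                  idfsWithIndex: Map[String, (Double, Int)]): Seq[(Int, Double)] = {
    // TF(t, d): raw count of each term in this document.
    val tf = doc.groupBy(identity).map { case (term, occ) => (term, occ.size) }
    tf.toSeq
      .flatMap { case (term, count) =>
        // Terms that are not in the feature map are ignored (an assumption).
        idfsWithIndex.get(term).map { case (idf, index) => (index, count * idf) }
      }
      .sortBy(_._1)
  }
}
```

The result has the shape that Spark MLlib's Vectors.sparse(size, elems) accepts, where size would be the number of entries in the feature map.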

3. Classifier

Uniform Folding Mechanism

Validation code block

val msg = (positive, negative) match {
  case _ if folds <= 0 =>
    s"Invalid number of folds ($folds); Must be a positive integer."
  case _ if negative.isEmpty || positive.isEmpty =>
    "Insufficient number of samples " +
      s"(# positive: ${positive.size}, # negative: ${negative.size})!"
  case _ if positive.size < folds =>
    s"Insufficient number of positive samples (${positive.size}); " +
      s"Must be >= number of folds ($folds)!"
  case _ if negative.size < folds =>
    s"Insufficient number of negative samples (${negative.size}); " +
      s"Must be >= number of folds ($folds)!"
  case _ =>
    ""
}

isNullOrEmpty(msg) match {
  case false =>
    logger.error("Fold validation failed!")
    Some(new RuntimeException(msg))
  case true =>
    logger.info("Fold validation succeeded!")
    None
}

Merge the data and format it.

KFoldCrossValidator

Generate the TrainableSVM -> TrainedSVM

Validate -> ModelMetrics
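
The folding and cross-validation flow can be sketched without Spark as below. This is an illustration of the general idea only, not the KFoldCrossValidator from this project: the object name KFold and all three helper names are assumptions. Positive and negative samples are shuffled and dealt into folds separately, so every fold keeps both classes, and each fold then takes one turn as the validation set.

```scala
import scala.util.Random

object KFold {
  // Shuffle, then deal samples round-robin so fold sizes differ by at most one.
  def split[A](samples: Seq[A], folds: Int, rng: Random): Vector[Vector[A]] = {
    val shuffled = rng.shuffle(samples.toVector)
    Vector.tabulate(folds) { f =>
      shuffled.zipWithIndex.collect { case (s, i) if i % folds == f => s }
    }
  }

  // Fold i = i-th share of the positives plus i-th share of the negatives.
  def uniformFolds[A](positive: Seq[A], negative: Seq[A],
                      folds: Int, rng: Random): Vector[Vector[A]] = {
    require(folds > 0 && positive.size >= folds && negative.size >= folds,
      "each class needs at least one sample per fold")
    split(positive, folds, rng).zip(split(negative, folds, rng))
      .map { case (p, n) => p ++ n }
  }

  // For every fold: (training data from the other folds, held-out validation fold).
  def trainValidationPairs[A](fs: Vector[Vector[A]]): Vector[(Vector[A], Vector[A])] =
    fs.indices.toVector.map { i =>
      val training = fs.zipWithIndex.collect { case (f, j) if j != i => f }.flatten
      (training, fs(i))
    }
}
```

Each (training, validation) pair would then be used to train one model and compute its metrics, and the metrics averaged over the folds.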

Scala Tips:

1. String tail and init

scala> val s = "123456"
s: String = 123456

scala> val s1 = s.tail
s1: String = 23456

scala> val s2 = s.init
s2: String = 12345

2. Tuple2

scala> val stuff = (42, "fish")
stuff: (Int, String) = (42,fish)

scala> stuff.getClass
res2: Class[_ <: (Int, String)] = class scala.Tuple2

scala> stuff._1
res3: Int = 42

scala> stuff._2
res4: String = fish

3. Scala shuffle

scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res8: List[Int] = List(7, 1, 3, 9, 5, 8, 2, 6, 4)

scala> util.Random.shuffle(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
res9: List[Int] = List(5, 1, 2, 6, 9, 4, 8, 7, 3)

4. Scala grouped

scala> List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13).grouped(4).toList
res11: List[List[Int]] = List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10, 11, 12), List(13))

5. Scala List zip

scala> List(1, 2, 3).zip(List("one", "two", "three"))
res12: List[(Int, String)] = List((1,one), (2,two), (3,three))

scala> List(1, 2, 3).zip(List("one", "two", "three", "four"))
res13: List[(Int, String)] = List((1,one), (2,two), (3,three))

6. List operations

scala> val s1 = List(1, 2, 3, 4, 5, 6, 7).splitAt(3)
s1: (List[Int], List[Int]) = (List(1, 2, 3),List(4, 5, 6, 7))

scala> val t1 = s1._1.last
t1: Int = 3

scala> val t2 = s1._1.init
t2: List[Int] = List(1, 2)

scala> val t2 = s1._2
t2: List[Int] = List(4, 5, 6, 7)

References:

http://www.fnlp.org/archives/4231

example

http://www.cnblogs.com/linlu1142/p/3292982.html
