Mahout Installation and Configuration

hanwentan 2011-07-27

http://log.medcl.net/item/2011/02/mahout_install/

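Apache Mahout is a machine learning framework built on Hadoop that supports processing of large-scale data sets. The latest version at the time of writing is 0.4.

Introduction to Apache Mahout: http://www.ibm.com/developerworks/cn/java/j-mahout/

Building a social recommendation engine with Apache Mahout: http://www.ibm.com/developerworks/cn/java/j-lo-mahout/

Taste: http://taste.sourceforge.net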

Mahout currently has:

Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High performance java collections (previously colt collections)
A vibrant community
and many more cool stuff to come by this summer thanks to Google summer of code

Installing Mahout (CentOS)

cd /usr/local

sudo mkdir mahout

sudo svn co http://svn.apache.org/repos/asf/mahout/trunk mahout

Install Maven 3

cd /tmp

sudo wget http://apache.etoak.com//maven/binaries/apache-maven-3.0.2-bin.tar.gz

tar vxzf apache-maven-3.0.2-bin.tar.gz

sudo mv apache-maven-3.0.2 /usr/local/maven

vi ~/.bashrc

Add the following two lines:

export M3_HOME=/usr/local/maven

export PATH=${M3_HOME}/bin:${PATH}

Run . ~/.bashrc to apply the settings (or log out and log back in).

Check the Maven version to confirm that the installation succeeded:

mvn -version
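If mvn -version fails, the usual cause is that the PATH change has not taken effect in the current shell. The following is a minimal sketch of a check for that, assuming a POSIX shell; check_tool is a hypothetical helper name used only for illustration and is not part of Maven.

```shell
# check_tool NAME: report whether a command is reachable via PATH.
# (check_tool is a hypothetical helper, not a Maven or Mahout command.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Usage:
#   check_tool mvn && mvn -version
```

If the helper reports that mvn is not found, re-check the two export lines in ~/.bashrc and source the file again.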

Build and install Mahout

cd /usr/local/mahout

sudo mvn install

If the build reports "JAVA_HOME is not set" and you are running it with sudo, check the Java settings for root:

vi /etc/profile

export JAVA_HOME=/usr/local/jdk1.6/

export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin

Run . /etc/profile, then run mvn clean install -DskipTests=true (this skips the tests for a faster build).
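The "JAVA_HOME is not set" error tends to show up under sudo because sudo commonly runs the command with a reset environment, so a JAVA_HOME exported in your own shell may not be visible to the build. A minimal sketch for sanity-checking a candidate JDK directory follows; java_home_ok is a hypothetical helper name, and the check (an executable bin/java under the directory) is an assumption about what a usable JDK home looks like.

```shell
# java_home_ok DIR: succeed if DIR is non-empty and contains an
# executable bin/java. (java_home_ok is a hypothetical helper name.)
java_home_ok() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Usage:
#   java_home_ok "$JAVA_HOME" || echo "JAVA_HOME is missing or not a JDK"
# To see the value that a command started through sudo would get:
#   sudo sh -c 'echo "$JAVA_HOME"'
```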

Preparing the data

cd /tmp

wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data

hadoop fs -mkdir testdata

hadoop fs -put synthetic_control.data testdata

hadoop fs -lsr testdata

If it reports that the HADOOP_HOME environment variable is not set,

sudo vi /etc/profile and add:

export HADOOP_HOME=/usr/lib/hadoop-0.20/

Run the clustering algorithms on the Hadoop cluster:

cd /usr/local/mahout

bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job

bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

If the jobs succeed, the results should be visible under /user/dev/output in HDFS.
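The five job commands differ only in the algorithm package name, so they can be generated from a list. A minimal sketch, assuming a POSIX shell; list_jobs is a hypothetical helper that only prints the command lines and does not run anything.

```shell
# list_jobs: print the bin/mahout command line for each synthetic-control
# example job. The class-name pattern is copied from the commands above;
# list_jobs itself is a hypothetical helper name.
list_jobs() {
    for algo in canopy kmeans fuzzykmeans dirichlet meanshift; do
        echo "bin/mahout org.apache.mahout.clustering.syntheticcontrol.${algo}.Job"
    done
}

# Usage (dry run, prints five command lines):
#   list_jobs
```

Printing first makes it easy to review the commands before running them one by one from /usr/local/mahout.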

GroupLens Data Sets

http://www.grouplens.org/node/12 hosts several data sets, including the MovieLens Data Sets, Wikilens Data Set, Book-Crossing Data Set, Jester Joke Data Set, and EachMovie Data Set.

Download the 1M ratings data:

mkdir 1m_rating

wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz

tar vxzf million-ml-data.tar__0.gz

rm million-ml-data.tar__0.gz

Copy the data into the directory of the grouplens example code; we will first try out Mahout locally:

cp *.dat /usr/local/mahout/examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens

cd /usr/local/mahout/examples/

Run:

mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner"

If you would rather not copy the files as above, specify the location of the input file instead:

mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i input_file"

Upload to HDFS:

hadoop fs -copyFromLocal 1m_rating/  mahout_input/1mrating

Additional notes

Mahout SVN address: https://svn.apache.org/repos/asf/mahout/trunk

https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

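Convert Lucene index data into text vectors. The command below specifies the index directory ~/index, the field name Name, the dictionary output file ~/dict.txt, and the final output file output.txt, and limits the maximum number of vectors to 50:

$ /usr/local/mahout/bin/mahout lucene.vector --dir ~/index --field Name --dictOut ~/dict.txt --output output.txt --max 50 --norm 2

Take a look at the contents of the dict file:

$ head dict.txt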

10225
#term     doc freq  idx
Michale   67        0
medcl     1         1
jack      3         2
lopoo     2         3
003       2         4

As the data above shows, dict.txt contains the index information for the Name field we specified.
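To see which terms dominate the index, the dictionary can be sorted by document frequency. This is a minimal sketch, assuming the layout is a count line, then a header line, then whitespace-separated records of term, document frequency and index; top_terms is a hypothetical helper name.

```shell
# top_terms FILE: print "docfreq term" pairs, highest frequency first.
# Assumes line 1 is the term count, line 2 is the "#term doc freq idx"
# header, and every later line has exactly three fields.
# (top_terms is a hypothetical helper name.)
top_terms() {
    awk 'NR > 2 && NF == 3 { print $2, $1 }' "$1" | sort -k1,1nr
}

# Usage:
#   top_terms ~/dict.txt | head
```

Lines that do not have exactly three fields are skipped, so a term containing whitespace would be silently dropped by this sketch.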

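Using taste-web to quickly build a movie recommender system based on the GroupLens data set

$ cd taste-web/

Copy the grouplens recommender jar into the lib directory of taste-web. If the jar does not exist yet, change to its directory and run mvn install.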

$ cp examples/target/grouplens.jar taste-web/lib/

$ vi recommender.properties

Uncomment the following line to configure the GroupLens recommender:

recommender.class=org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommender
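The same edit can be previewed without opening an editor. A minimal sketch, assuming the line is commented out with a leading "#" (optionally followed by spaces); uncomment_key is a hypothetical helper name, and it prints the result to standard output instead of modifying the file.

```shell
# uncomment_key FILE KEY: print FILE with the leading "#" removed from
# any line of the form "#KEY=..." or "# KEY=...". The file is not changed.
# (uncomment_key is a hypothetical helper name.)
uncomment_key() {
    sed "s|^#[[:space:]]*\($2=\)|\1|" "$1"
}

# Usage:
#   uncomment_key recommender.properties recommender.class
```

Because KEY is used as a regular expression, the dot in recommender.class matches any character; that is harmless here but worth knowing if the file has similar-looking keys.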

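Start Jetty:

mvn jetty:run-war

If everything is working, you should see a web service on port 8080 at http://platformb:8080/RecommenderService.jws

Open the following URL to view the recommendations: http://platformb:8080/RecommenderServlet?userID=1

See screenshots 1 and 2. The first column of the result is the recommendation score and the second is the movie ID. With just a few simple steps we have a working recommender, which is pretty impressive.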
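The servlet can also be queried from the command line. A minimal sketch, assuming the response is plain text with one "score movie-id" pair per line as described above; recommend_url and movie_ids are hypothetical helper names, and the host name platformb is the one used in this post.

```shell
# recommend_url HOST USER_ID: build the RecommenderServlet query URL.
# (recommend_url is a hypothetical helper name.)
recommend_url() {
    echo "http://${1}:8080/RecommenderServlet?userID=${2}"
}

# movie_ids: read "score id" lines on standard input and print the ids.
# (movie_ids is a hypothetical helper; the two-column layout is assumed.)
movie_ids() {
    awk 'NF >= 2 { print $2 }'
}

# Usage (needs the Jetty server from the previous step to be running):
#   curl -s "$(recommend_url platformb 1)" | movie_ids
```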
