hanwentan 2011-07-27
http://log.medcl.net/item/2011/02/mahout_install/
Apache Mahout is a machine learning framework built on Hadoop to support processing of large-scale data sets; the latest release at the time of writing is 0.4.
An introduction to Apache Mahout: http://www.ibm.com/developerworks/cn/java/j-mahout/
Building a social recommendation engine with Apache Mahout: http://www.ibm.com/developerworks/cn/java/j-lo-mahout/
Taste: http://taste.sourceforge.net
Main features:
- Collaborative filtering: user- and item-based recommenders
- K-Means and Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel frequent pattern mining
- Complementary Naive Bayes classifier
- Random-forest decision-tree-based classifier
- High-performance Java collections (previously Colt collections)
- A vibrant community, and many more cool features to come this summer thanks to Google Summer of Code
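Most of these features come down to a few core computations. For instance, Taste's user-based recommenders rest on a similarity measure between users, commonly the Pearson correlation over co-rated items. Below is a small self-contained sketch of that measure; the class and method names are illustrative, not Mahout's actual API:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the Pearson correlation used by user-based
 *  collaborative filtering; not Mahout/Taste code. */
public class PearsonSketch {

    /** Pearson correlation computed over the items both users rated. */
    public static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0, sumA2 = 0, sumB2 = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) continue;          // only co-rated items count
            double ra = e.getValue();
            sumA += ra; sumB += rb;
            sumA2 += ra * ra; sumB2 += rb * rb;
            sumAB += ra * rb;
            n++;
        }
        if (n == 0) return 0;
        double num = sumAB - sumA * sumB / n;
        double den = Math.sqrt((sumA2 - sumA * sumA / n) * (sumB2 - sumB * sumB / n));
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        // itemId -> rating; u2's ratings move in lockstep with u1's
        Map<Integer, Double> u1 = new HashMap<>();
        Map<Integer, Double> u2 = new HashMap<>();
        u1.put(1, 5.0); u1.put(2, 3.0); u1.put(3, 4.0);
        u2.put(1, 4.0); u2.put(2, 2.0); u2.put(3, 3.0);
        System.out.println(pearson(u1, u2)); // prints 1.0 (perfectly correlated)
    }
}
```

Users whose ratings rise and fall together score near 1.0 and become each other's "neighbors" when estimating unseen preferences.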
Installing Mahout (CentOS)
cd /usr/local
sudo mkdir mahout
sudo svn co http://svn.apache.org/repos/asf/mahout/trunk mahout

Installing Maven 3
cd /tmp
sudo wget http://apache.etoak.com//maven/binaries/apache-maven-3.0.2-bin.tar.gz
tar vxzf apache-maven-3.0.2-bin.tar.gz
sudo mv apache-maven-3.0.2 /usr/local/maven
vi ~/.bashrc
Add the following two lines:
export M3_HOME=/usr/local/maven
export PATH=${M3_HOME}/bin:${PATH}

Run . ~/.bashrc to apply the settings (or log out and log back in).
Check the Maven version to confirm the installation succeeded:
mvn -version

Installing Mahout
cd /usr/local/mahout
sudo mvn install

If this fails with "JAVA_HOME is not set" and you are running under sudo, check root's Java settings:
vi /etc/profile
export JAVA_HOME=/usr/local/jdk1.6/
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

Run . /etc/profile, then run mvn clean install -DskipTests=true (skip the tests for a faster build).

Preparing the data
cd /tmp
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
hadoop fs -mkdir testdata
hadoop fs -put synthetic_control.data testdata
hadoop fs -lsr testdata

If this fails because the HADOOP_HOME environment variable is not set:
sudo vi /etc/profile, and add:
export HADOOP_HOME=/usr/lib/hadoop-0.20/

Running the clustering algorithms on the Hadoop cluster

cd /usr/local/mahout
bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

If a job succeeds, you should see its output under /user/dev/output on HDFS.
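The kmeans job above runs K-Means over the synthetic control data at Hadoop scale. To see the algorithm itself in miniature, here is a self-contained one-dimensional K-Means sketch; it is not Mahout code, and the points and initial centroids are made up for illustration:

```java
import java.util.Arrays;

/** Minimal 1-D K-Means sketch: alternate assignment and centroid-update
 *  steps for a fixed number of iterations. Not Mahout code. */
public class KMeansSketch {

    /** Returns the centroids after iterating assignment/update steps. */
    public static double[] kmeans(double[] points, double[] centroids, int iters) {
        double[] c = centroids.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {               // assignment step
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                sum[best] += p;
                count[best]++;
            }
            for (int j = 0; j < c.length; j++)      // update step
                if (count[j] > 0) c[j] = sum[j] / count[j];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] pts = {1, 2, 3, 10, 11, 12};       // two obvious clumps
        double[] result = kmeans(pts, new double[]{0, 5}, 10);
        System.out.println(Arrays.toString(result)); // prints [2.0, 11.0]
    }
}
```

Mahout's value is running exactly this kind of iteration as MapReduce jobs over data far too large for one machine.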
GroupLens Data Sets
http://www.grouplens.org/node/12, which includes the MovieLens Data Sets, Wikilens Data Set, Book-Crossing Data Set, Jester Joke Data Set, and EachMovie Data Set.

Download the 1M ratings data:
mkdir 1m_rating
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
tar vxzf million-ml-data.tar__0.gz
rm million-ml-data.tar__0.gz

Copy the data into the grouplens example's source directory so we can first try out Mahout's power locally:
cp *.dat /usr/local/mahout/examples/src/main/java/org/apache/mahout/cf/taste/example/grouplens
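The *.dat files extracted above use the MovieLens 1M layout: '::'-separated fields of UserID::MovieID::Rating::Timestamp. A tiny parser sketch (the class name and sample line are just for illustration):

```java
/** Sketch of parsing one MovieLens-style ratings line of the form
 *  UserID::MovieID::Rating::Timestamp. Illustrative, not Mahout code. */
public class RatingLine {
    public final long userId, movieId, timestamp;
    public final double rating;

    public RatingLine(long u, long m, double r, long t) {
        userId = u; movieId = m; rating = r; timestamp = t;
    }

    /** Split on the '::' delimiter and convert each field. */
    public static RatingLine parse(String line) {
        String[] f = line.split("::");
        return new RatingLine(Long.parseLong(f[0]), Long.parseLong(f[1]),
                              Double.parseDouble(f[2]), Long.parseLong(f[3]));
    }

    public static void main(String[] args) {
        RatingLine r = RatingLine.parse("1::1193::5::978300760");
        System.out.println(r.userId + " rated " + r.movieId + " as " + r.rating);
    }
}
```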
cd /usr/local/mahout/examples/
Run:
mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner"
If you would rather skip the file-copy step above, just point the runner at the input file instead:
mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i <input file>"
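What the evaluator runner reports is essentially how far the recommender's predicted ratings fall from held-out actual ratings. A sketch of that kind of average-absolute-difference metric follows; this is not Mahout's evaluator code, and the numbers are invented:

```java
/** Illustrative average-absolute-difference metric: lower is better,
 *  0 means every prediction matched the held-out rating exactly. */
public class MaeSketch {

    public static double meanAbsoluteError(double[] predicted, double[] actual) {
        double sum = 0;
        for (int i = 0; i < predicted.length; i++)
            sum += Math.abs(predicted[i] - actual[i]);
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        double[] pred = {4.0, 3.5, 2.0};   // recommender's guesses
        double[] act  = {5.0, 3.0, 2.0};   // held-out real ratings
        System.out.println(meanAbsoluteError(pred, act)); // prints 0.5
    }
}
```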
Upload to HDFS:
hadoop fs -copyFromLocal 1m_rating/ mahout_input/1mrating

Additional notes
Mahout SVN address: https://svn.apache.org/repos/asf/mahout/trunk
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
Convert Lucene index data into text vectors. Here we specify the index directory ~/index, the field name Name, the dictionary output file ~/dict.txt, and the final output file output.txt, limiting the maximum number of vectors to 50:

$ /usr/local/mahout/bin/mahout lucene.vector --dir ~/index --field Name --dictOut ~/dict.txt --output output.txt --max 50 --norm 2
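The --norm 2 option requests L2 (Euclidean) normalization of each output vector, i.e. scaling it to unit length so that vector comparisons are not dominated by document length. A quick sketch of the operation (illustrative, not Mahout's implementation):

```java
import java.util.Arrays;

/** Sketch of L2 normalization: divide every component by the vector's
 *  Euclidean length, producing a unit-length vector. */
public class L2Norm {

    public static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++)
            out[i] = norm == 0 ? 0 : v[i] / norm;  // guard the zero vector
        return out;
    }

    public static void main(String[] args) {
        // (3, 4) has length 5, so it normalizes to (0.6, 0.8)
        System.out.println(Arrays.toString(normalize(new double[]{3, 4})));
    }
}
```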
Take a look at the contents of dict.txt:

$ head dict.txt
10225
#term docfreq idx
Michale 67 0
medcl 1 1
jack 3 2
lopoo 2 3
003 2 4

As the data above shows, dict.txt holds the index information for the Name field we specified.
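Assuming the layout shown above (a count line, a header line, then whitespace-separated term/docfreq/idx rows), the dictionary can be read into a term-to-index map like this; the reader class and sample rows are illustrative, not Mahout code:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of reading a dict.txt-style dictionary into a term -> index
 *  map, assuming the count/header/rows layout shown in the sample. */
public class DictReader {

    public static Map<String, Integer> termIndex(String[] lines) {
        Map<String, Integer> idx = new HashMap<>();
        for (int i = 2; i < lines.length; i++) {   // skip count + header lines
            String[] f = lines[i].trim().split("\\s+");
            idx.put(f[0], Integer.parseInt(f[2])); // term -> idx column
        }
        return idx;
    }

    public static void main(String[] args) {
        String[] sample = {"10225", "#term docfreq idx", "medcl 1 1", "jack 3 2"};
        System.out.println(termIndex(sample));
    }
}
```

Such a map lets you translate the numeric dimensions in output.txt back into the original field terms.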
Using taste-web to quickly build a movie recommender on the GroupLens data set
$ cd taste-web/
Copy the grouplens recommender jar into taste-web's lib directory (if the jar does not exist yet, go to its directory and run mvn install first):

$ cp examples/target/grouplens.jar taste-web/lib/
$ vi recommender.properties

Uncomment the line that configures the grouplens recommender, so it reads:

recommender.class=org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommender

Start jetty:

mvn jetty:run-war

If everything works, visit port 8080 and you should see the web service at http://platformb:8080/RecommenderService.jws
Request the following URL to view the recommendations: http://platformb:8080/RecommenderServlet?userID=1

See screenshots 1 and 2: in each result row, the first column is the recommendation score and the second is the movie id. A recommender built in just a few simple steps; pretty powerful, isn't it?
The mighty configuration files