
leshem 2010-06-05

Nutch vs Lucene


Nutch is an application: a complete search engine built on top of Lucene.

Nutch vs GRUB


Nutch is open source. You can use it to build a search engine for your own intranet or for the entire web. It is free (as in freedom) and free (as in price).

Nutch vs Larbin




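Early versions of Nutch did not support Chinese search, but the latest version (0.5, released 2004-Aug-04) is much improved. Compared with the earlier 0.4 release it contains more than 20 improvements, and its architecture is more extensible. In my tests, version 0.5 also handles Chinese search well.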


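Prerequisites (Linux is used as the example here; for Windows, see the manual):

  • Java 1.4.x. My system already has Java because Oracle 10g is installed on it. Set the NUTCH_JAVA_HOME environment variable:
    [root@fc3 ~]# export NUTCH_JAVA_HOME=/u01/app/oracle/product/10.1.0/db_1/jdk/jre
  • Tomcat 4.x. Download it from here.
  • Enough disk space. I set aside 4 GB.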
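The prerequisites above can be checked before installing anything. The following is only a minimal sketch: the function names and MIN_FREE_KB are my own, and it assumes the Nutch launcher runs $NUTCH_JAVA_HOME/bin/java, which this article does not confirm.

```shell
#!/bin/sh
# Sketch: verify the prerequisites listed above.
# Assumptions (not from the article): function names, MIN_FREE_KB, and that
# the launcher expects an executable at $NUTCH_JAVA_HOME/bin/java.

MIN_FREE_KB=$((4 * 1024 * 1024))   # 4 GB, the space reserved in this article

# Succeeds if NUTCH_JAVA_HOME points at a directory with an executable bin/java.
check_java_home() {
    [ -n "${NUTCH_JAVA_HOME:-}" ] && [ -x "$NUTCH_JAVA_HOME/bin/java" ]
}

# Prints the free space, in KB, of the filesystem that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Checks Java and disk space for directory $1 (default: current directory).
check_prereqs() {
    dir=${1:-.}
    if ! check_java_home; then
        echo "NUTCH_JAVA_HOME is unset or has no bin/java" >&2
        return 1
    fi
    if [ "$(free_kb "$dir")" -lt "$MIN_FREE_KB" ]; then
        echo "less than 4 GB free in $dir" >&2
        return 1
    fi
    echo "prerequisites look OK"
}
```

Calling check_prereqs /u01 would test the filesystem that holds /u01. Tomcat is not checked here because its install location varies.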


[root@fc3 ~]# wget


[root@fc3 ~]# tar -zxvf nutch-0.5.tar.gz
[root@fc3 ~]# mv nutch-0.5 nutch

Test the nutch command:

[root@fc3 nutch]# bin/nutch 
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  admin             database administration, including creation
  inject            inject new urls into the database
  generate          generate new segments to fetch
  fetchlist         print the fetchlist of a segment
  fetch             fetch a segment's pages
  dump              dump a segment's pages
  index             run the indexer on a segment's fetcher output
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  updatedb          update database from a segment's fetcher output
  mergesegs         merge multiple segments into a single segment
  readdb            examine arbitrary fields of the database
  analyze           adjust database link-analysis scoring
  server            run a search server
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
[root@fc3 nutch]#

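The Nutch crawler can be run in two ways:

  • Intranet crawling. This targets a small number of sites and uses the crawl command.
  • Whole-web crawling. This uses the lower-level inject, generate, fetch, and updatedb commands, which give you finer control.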
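To make the second mode more concrete, here is a rough sketch of the generate / fetch / updatedb cycle. The argument forms are an assumption based on the Nutch 0.x tutorial and are not shown in this article, so check each command's help output first. NUTCH, DB, SEGMENTS, and the function names are illustrative.

```shell
#!/bin/sh
# Sketch of repeated generate / fetch / updatedb rounds for a whole-web crawl.
# Assumptions (not from the article): the argument order of each command,
# and the variable and function names used here.

NUTCH=${NUTCH:-bin/nutch}
DB=${DB:-db}
SEGMENTS=${SEGMENTS:-segments}

# Prints the newest segment directory. Segment names are timestamps
# (e.g. 20050102200336), so the last name in sorted order is the newest.
latest_segment() {
    ls -d "$SEGMENTS"/* 2>/dev/null | sort | tail -n 1
}

# Runs $1 rounds; each round fetches the segment that generate just created.
crawl_rounds() {
    rounds=$1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        $NUTCH generate "$DB" "$SEGMENTS" || return 1
        seg=$(latest_segment)
        [ -n "$seg" ] || return 1
        $NUTCH fetch "$seg" || return 1
        $NUTCH updatedb "$DB" "$seg" || return 1
        i=$((i + 1))
    done
}
```

The database must already have been created and seeded with the admin and inject commands; running either one without parameters prints its usage. Setting NUTCH="echo bin/nutch" turns the loop into a dry run, provided a segment directory already exists.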


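In the nutch directory, create a file named urls that holds the top-level URL of the site, with the following content:

Then edit conf/crawl-urlfilter.txt to set the filter rules. I changed only MY.DOMAIN.NAME: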

# accept hosts in MY.DOMAIN.NAME
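To get a feel for what an "accept hosts in a domain" rule lets through, the check can be imitated with grep. This is a sketch under assumptions: the pattern shape (a regular expression anchored at the start of the URL that admits the domain and its subdomains) is my guess at such a rule, grep -E stands in for Nutch's own regular-expression matching, and example.com is a placeholder.

```shell
#!/bin/sh
# Sketch: test whether a URL would pass an "accept hosts in DOMAIN" rule.
# Assumptions (not from the article): the pattern shape, the function names,
# and the use of grep -E in place of Nutch's own matcher.

# Builds the accept pattern for domain $1, escaping its dots.
accept_pattern() {
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf '^http://([a-z0-9-]*\\.)*%s/' "$escaped"
}

# Succeeds when URL $2 matches the accept pattern for domain $1.
url_accepted() {
    printf '%s\n' "$2" | grep -Eq "$(accept_pattern "$1")"
}
```

For example, url_accepted example.com http://www.example.com/index.html succeeds, while the same call with http://www.example.org/ fails.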


[root@fc3 nutch]# bin/nutch crawl urls -dir crawl.demo -depth 2 -threads 4 >& crawl.log

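The depth parameter sets how deep the crawl follows links; since this is only a test, I chose a depth of 2. The threads parameter sets the number of concurrent fetcher threads, set to 4 here.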
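If you plan to try several depth and threads values, a small helper can assemble the command and reject obviously bad input. This is a sketch only: the function names and argument order are mine, while the flags (-dir, -depth, -threads) are the ones used in the command above.

```shell
#!/bin/sh
# Sketch: build the crawl command line from its parameters.
# Assumptions (not from the article): function names and argument order.

# Succeeds when $1 is a whole number greater than zero.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Prints the crawl command for: URL file, output directory, depth, threads.
crawl_command() {
    url_file=$1; out_dir=$2; depth=$3; threads=$4
    is_positive_int "$depth"   || { echo "bad depth: $depth" >&2; return 1; }
    is_positive_int "$threads" || { echo "bad threads: $threads" >&2; return 1; }
    echo "bin/nutch crawl $url_file -dir $out_dir -depth $depth -threads $threads"
}
```

The printed command can be run as $(crawl_command urls crawl.demo 2 4) > crawl.log 2>&1. The `>&` form used above is a bash and csh shorthand; `> crawl.log 2>&1` is the portable spelling.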

While the command runs, you can follow what Nutch is doing in crawl.log:

050102 200336 loading file:/u01/nutch/conf/nutch-site.xml
050102 200336 crawl started in: crawl.demo 
050102 200336 rootUrlFile = urls 
050102 200336 threads = 4
050102 200336 depth = 2
050102 200336 Created webdb at crawl.demo/db
050102 200336 Starting URL processing
050102 200336 Using URL filter:
050102 200337 Plugins: looking in: /u01/nutch/plugins                  
050102 200337 parsing: /u01/nutch/plugins/parse-html/plugin.xml        
050102 200337 parsing: /u01/nutch/plugins/parse-pdf/plugin.xml         
050102 200337 parsing: /u01/nutch/plugins/parse-ext/plugin.xml         
050102 200337 parsing: /u01/nutch/plugins/parse-msword/plugin.xml      
050102 200337 parsing: /u01/nutch/plugins/query-site/plugin.xml        
050102 200337 parsing: /u01/nutch/plugins/protocol-http/plugin.xml     
050102 200337 parsing: /u01/nutch/plugins/creativecommons/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/language-identifier/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/query-basic/plugin.xml       
050102 200337 logging at INFO                                          
050102 200337 fetching                        
050102 200337 = null                                   
050102 200337 http.proxy.port = 8080                                   
050102 200337 http.timeout = 10000                                     
050102 200337 http.content.limit = 65536                               
050102 200337 http.agent = NutchCVS/0.05 (Nutch;; n[email protected])
050102 200337 fetcher.server.delay = 1000                              
050102 200337 http.max.delays = 100                                    
050102 200338 setting encoding to GB18030    
050102 200338 CC: found in rdf of http:
050102 200338 CC: found text in               
050102 200338 status: 1 pages, 0 errors, 12445 bytes, 1067 ms          
050102 200338 status: 0.9372071 pages/s, 91.12142 kb/s, 12445.0 bytes/page
050102 200339 Updating crawl.demo/db                                   
050102 200339 Updating for crawl.demo/segments/20050102200336          
050102 200339 Finishing update                                         
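For a long crawl it is easier to pull the counters out of crawl.log than to read the whole file. A minimal sketch, assuming status lines always have the shape seen in the log above (date, time, "status:", then pages, errors, bytes, ms); the function name is my own.

```shell
#!/bin/sh
# Sketch: print the most recent page/error/byte counters from crawl.log.
# Assumes status lines look like:
#   050102 200338 status: 1 pages, 0 errors, 12445 bytes, 1067 ms
# The pages/s rate lines do not match the pattern and are skipped.

crawl_summary() {
    awk '/status: [0-9]+ pages, / { p = $4; e = $6; b = $8; seen = 1 }
         END { if (seen) printf "pages=%s errors=%s bytes=%s\n", p, e, b }' "$@"
}
```

Run it as crawl_summary crawl.log, or pipe log lines into it. It prints nothing if no status line has been written yet.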
Next, configure Tomcat (mine is installed in /opt/Tomcat):

[root@fc3 nutch]# rm -rf /opt/Tomcat/webapps/ROOT*
[root@fc3 nutch]# cp nutch*.war /opt/Tomcat/webapps/ROOT.war
[root@fc3 nutch]# cd /opt/Tomcat/webapps/
[root@fc3 webapps]# jar xvf ROOT.war
[root@fc3 webapps]# ../bin/ start

Open http://localhost:8080 in a browser to see the result (when browsing from another machine, replace localhost with the server's IP address).
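On a server without a browser, the same check can be made from the command line. This is a sketch that assumes curl is installed; the function names are mine, and 192.0.2.10 in the note below is a placeholder address.

```shell
#!/bin/sh
# Sketch: build the search page URL and check that the server answers.
# Assumptions (not from the article): curl is available; function names.

# Prints the URL for host $1 (default localhost) and port $2 (default 8080).
search_url() {
    host=${1:-localhost}
    port=${2:-8080}
    echo "http://$host:$port"
}

# Succeeds if the server at URL $1 answers with HTTP status 200.
server_up() {
    [ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" = "200" ]
}
```

For a remote machine, server_up "$(search_url 192.0.2.10)" checks the page at that address instead of localhost.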
