momomoniqwer 2011-09-26
I. Preparation
For this study I used Struts2 + Hibernate3.2 + Spring2.5 + Compass2.2.0. The image below shows the jar packages used:

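The jars circled in the image are the main ones for this study. The jars in the green box are analyzers (tokenizers), used mainly to experiment with word-segmentation results; picking any one of them is enough.
II. What is Compass
Compass is a Java search framework. It wraps Lucene, adds features that Lucene itself lacks (for example, real-time index updates), supports mapping various kinds of data (Java objects, XML, JSON) to the index, and supports various data sources (JDBC, Hibernate, iBatis).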

Diagram:
III. Integration with Spring and Hibernate
This part is explained mainly through code.
1. Database script (Oracle)
-- Create table ARTICLE
create table ARTICLE (
  ID NUMBER,            -- ID, primary key
  TITLE VARCHAR2(100),  -- title
  CONTENT CLOB,         -- article content
  PUBDATE DATE          -- publication date
)
2. Configure Compass OSEM and the Hibernate mapping
package com.compass.example.model;

import java.io.Serializable;
import java.util.Date;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.Table;
import javax.persistence.Temporal;
import javax.persistence.TemporalType;

import org.compass.annotations.Index;
import org.compass.annotations.Searchable;
import org.compass.annotations.SearchableId;
import org.compass.annotations.SearchableProperty;
import org.compass.annotations.Store;
import org.hibernate.annotations.GenericGenerator;

@Searchable(alias = "article")
@Entity
@Table(name = "ARTICLE", schema = "SCOTT")
public class Article implements Serializable {

    private static final long serialVersionUID = 1L;

    private Long id;
    private String title;
    private Date pubdate = new Date();
    private String content;

    @SearchableId(name = "id", store = Store.NO, index = Index.NOT_ANALYZED)
    @Id
    @GeneratedValue(generator = "paymentableGenerator")
    @GenericGenerator(name = "paymentableGenerator", strategy = "increment")
    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    @SearchableProperty(name = "title", store = Store.YES, index = Index.ANALYZED)
    @Column(name = "TITLE")
    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    @SearchableProperty(name = "pubdate", store = Store.NO, index = Index.UN_TOKENIZED)
    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "PUBDATE")
    public Date getPubdate() {
        return pubdate;
    }

    public void setPubdate(Date pubdate) {
        this.pubdate = pubdate;
    }

    @SearchableProperty(name = "content", index = Index.TOKENIZED, store = Store.YES,
            converter = "htmlPropertyConverter")
    @Lob
    @Column(name = "CONTENT")
    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}

Notes:
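@Searchable(alias = "article") marks the class as a searchable entity with the alias article.
@SearchableId is the identifier of the searchable entity. It is much like the ID concept in Hibernate and is used to tell apart the entities stored in the index files.
@SearchableProperty(index = Index.NOT_ANALYZED, store = Store.NO) means the property is written to the index without being analyzed (tokenized), and its original value is not stored in the index. The @SearchableProperty on getContent() also sets converter = "htmlPropertyConverter", which strips the HTML tags from the article to get plain text before it is indexed. The converter is described in detail later.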
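To make the difference between an analyzed and a not-analyzed property more concrete, here is a minimal plain-Java sketch. It does not use Compass or Lucene: the class IndexModeSketch and its terms method are hypothetical names made up for this illustration, and a lowercase whitespace split stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: this is not the Compass or Lucene API.
final class IndexModeSketch {

    // Returns the terms that would be written to the index for one field value.
    // analyzed = true  -> the value is split into several lowercase terms
    // analyzed = false -> the whole value is kept as one single term
    static List<String> terms(String value, boolean analyzed) {
        List<String> result = new ArrayList<String>();
        if (value == null || value.trim().isEmpty()) {
            return result;
        }
        if (!analyzed) {
            result.add(value);
            return result;
        }
        // Stand-in for an analyzer: lowercase, then split on whitespace.
        String[] tokens = value.trim().toLowerCase(Locale.ROOT).split("\\s+");
        result.addAll(Arrays.asList(tokens));
        return result;
    }

    public static void main(String[] args) {
        // A title-like value, analyzed: three separate terms.
        System.out.println(terms("Compass Search Framework", true));
        // An id- or date-like value, not analyzed: one term.
        System.out.println(terms("2011-09-26", false));
    }
}
```

In this simplified model an analyzed title can be found by any of its words, while a not-analyzed value only matches as a whole, which is why it suits ids and sort keys.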
3. Create the Compass search class
import java.util.ArrayList;
import java.util.List;

import javax.annotation.Resource;

import org.compass.core.Compass;
import org.compass.core.CompassHighlighter;
import org.compass.core.CompassHighlighter.TextTokenizer;
import org.compass.core.CompassHits;
import org.compass.core.CompassQuery;
import org.compass.core.CompassQuery.SortPropertyType;
import org.compass.core.CompassQueryBuilder;
import org.compass.core.CompassSession;
import org.compass.core.CompassTemplate;
import org.springframework.stereotype.Component;

import com.compass.example.dao.SearchArticleDao;
import com.compass.example.model.Article;

@Component("SearchArticleDao")
public class SearchArticleDaoCompass implements SearchArticleDao {

    @Resource
    private CompassTemplate compassTemplate;

    @Override
    public List<Article> searchWithList(final String queryString) {
        Compass compass = compassTemplate.getCompass();
        CompassSession session = compass.openSession();
        List<Article> articles = new ArrayList<Article>();
        try {
            CompassQueryBuilder builder = session.queryBuilder();
            CompassQuery compassQuery = builder.queryString(queryString).toQuery()
                    .addSort("article.id", SortPropertyType.STRING);
            CompassHits compassHits = compassQuery.hits();
            for (int i = 0; i < compassHits.length(); i++) {
                Article article = (Article) compassHits.data(i);
                CompassHighlighter highlighter = compassHits.highlighter(i);
                // Replace the title with its highlighted version if the keyword occurs in it.
                String title = highlighter.fragment("title");
                if (title != null) {
                    article.setTitle(title);
                }
                // Replace the content with the best-matching highlighted fragment.
                String content = highlighter.setTextTokenizer(TextTokenizer.AUTO).fragment("content");
                if (content != null) {
                    article.setContent(content);
                }
                articles.add(article);
            }
        } finally {
            // Close the session so it is not leaked on every search.
            session.close();
        }
        return articles;
    }
}

The index is queried with the parameter passed in, queryString, which holds the search keywords.
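String title = compassHits.highlighter(i).fragment("title"); checks whether the title property contains the search keyword and, if it does, highlights it. In effect it wraps the keyword in an HTML tag that sets the color; as the configuration file shows, the highlight color can be set freely.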
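The idea of wrapping the keyword in an HTML tag can be sketched in plain Java as follows. This is only an illustration of the principle, not how Compass implements it: HighlightSketch and highlight are hypothetical names, the pre and post markers are assumptions, and unlike fragment(), which returns null when nothing matches (hence the null checks in the DAO above), this sketch returns the text unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is not the Compass highlighter API.
final class HighlightSketch {

    // Wraps every case-insensitive occurrence of keyword in the pre/post markers.
    static String highlight(String text, String keyword, String pre, String post) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return text;
        }
        // Pattern.quote treats the keyword as literal text, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Keep the matched text as written and surround it with the markers.
            String replacement = pre + matcher.group() + post;
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Compass wraps Lucene", "lucene",
                "<font color=\"red\">", "</font>"));
    }
}
```

Passing the markers as parameters mirrors the point made above that the highlight color is a matter of configuration rather than code.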
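String content = compassHits.highlighter(i)
        .setTextTokenizer(CompassHighlighter.TextTokenizer.AUTO)
        .fragment("content");
This code does the same thing as the title code, plus one important extra: it automatically picks the part of the body that best matches the keyword. An article often runs to several thousand characters and we only want to show the excerpt that contains the keyword, so this feature is very handy.
4. Build the index (rebuilt at server startup or on a schedule)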
import org.compass.gps.CompassGps;
import org.springframework.beans.factory.InitializingBean;

/**
 * Builder that rebuilds the index either on a Quartz schedule or automatically
 * when the Spring ApplicationContext starts. After startup it waits a few
 * seconds and then calls compassGps.index() on a new thread.
 * When buildIndex is true, the index is rebuilt every time the web application
 * starts; set buildIndex to false to disable this (the field below defaults to false).
 * You can also skip this builder and call compassGps.index() from your own code.
 *
 * @author YinGuojun
 */
public class CompassIndexBuilder implements InitializingBean {

    /* Whether to build the index; set to false to disable this builder. */
    private boolean buildIndex = false;

    /* Delay before the indexing thread starts, in seconds. */
    private int lazyTime = 10;

    /* Compass GPS wrapper. */
    private CompassGps compassGps;

    private Thread indexThread = new Thread() {
        @Override
        public void run() {
            try {
                System.out.println("lazyTime: " + lazyTime);
                Thread.sleep(lazyTime * 1000L);
                System.out.println("begin compass index ...");
                long beginTime = System.currentTimeMillis();
                // Rebuild the index.
                // If the index files defined for the Compass entities already exist,
                // a temporary index is built first and replaces them once indexing completes.
                compassGps.index();
                long costTime = System.currentTimeMillis() - beginTime;
                System.out.println("compass index finished.");
                System.out.println("took " + costTime + " milliseconds");
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    };

    /**
     * Implements the <code>InitializingBean</code> interface; starts the indexing
     * thread once dependency injection is complete.
     *
     * @see org.springframework.beans.factory.InitializingBean#afterPropertiesSet()
     */
    @Override
    public void afterPropertiesSet() throws Exception {
        if (buildIndex) {
            indexThread.setDaemon(true);
            indexThread.setName("Compass Indexer");
            indexThread.start();
        }
    }

    public boolean isBuildIndex() {
        return buildIndex;
    }

    public void setBuildIndex(boolean buildIndex) {
        this.buildIndex = buildIndex;
    }

    public int getLazyTime() {
        return lazyTime;
    }

    public void setLazyTime(int lazyTime) {
        this.lazyTime = lazyTime;
    }

    public CompassGps getCompassGps() {
        return compassGps;
    }

    public void setCompassGps(CompassGps compassGps) {
        this.compassGps = compassGps;
    }
}

5. Converter
import org.apache.log4j.Logger;
import org.compass.core.Property;
import org.compass.core.converter.ConversionException;
import org.compass.core.converter.basic.AbstractBasicConverter;
import org.compass.core.mapping.ResourcePropertyMapping;
import org.compass.core.marshall.MarshallingContext;

import com.compass.example.utils.StringUtil;

public class HtmlPropertyConverter extends AbstractBasicConverter<String> {

    private static final Logger logger = Logger.getLogger(HtmlPropertyConverter.class);

    public HtmlPropertyConverter() {
        super();
        // called when the application server starts
        logger.info("----------HtmlPropertyConverter Initializing ...");
    }

    /**
     * Called when searching.
     */
    @Override
    protected String doFromString(String str,
            ResourcePropertyMapping resourcePropertyMapping,
            MarshallingContext context) throws ConversionException {
        logger.info("----------calling doFromString...");
        return str;
    }

    /**
     * Called when the index is created; the HTML tags are stripped from the text here.
     */
    @Override
    protected Property createProperty(String value,
            ResourcePropertyMapping resourcePropertyMapping,
            MarshallingContext context) {
        logger.info("----------calling createProperty...");
        // strip the HTML tags
        value = StringUtil.removeHTML(value);
        return super.createProperty(value, resourcePropertyMapping, context);
    }
}

package com.compass.example.utils;

public class StringUtil {

    /**
     * Remove occurrences of html, defined as any text
     * between the characters "<" and ">". Optionally
     * replace HTML tags with a space.
     *
     * @param str      the text to clean; null yields an empty string
     * @param addSpace whether to replace each tag with a space
     * @return the text without HTML tags, trimmed
     */
    public static String removeHTML(String str, boolean addSpace) {
        if (str == null) {
            return "";
        }
        StringBuilder ret = new StringBuilder(str.length());
        int start = 0;
        int beginTag = str.indexOf("<");
        int endTag = 0;
        if (beginTag == -1) {
            return str;
        }
        while (beginTag >= start) {
            if (beginTag > 0) {
                ret.append(str.substring(start, beginTag));
                // replace each tag with a space (looks better)
                if (addSpace) {
                    ret.append(" ");
                }
            }
            endTag = str.indexOf(">", beginTag);
            if (endTag > -1) {
                // if endTag found move "cursor" forward
                start = endTag + 1;
                beginTag = str.indexOf("<", start);
            } else {
                // if no endTag found, get rest of str and break
                ret.append(str.substring(beginTag));
                break;
            }
        }
        // append everything after the last endTag
        if (endTag > -1 && endTag + 1 < str.length()) {
            ret.append(str.substring(endTag + 1));
        }
        return ret.toString().trim();
    }

    /**
     * Remove occurrences of html, defined as any text
     * between the characters "<" and ">".
     * Replace any HTML tags with a space.
     *
     * @param str the text to clean
     * @return the text without HTML tags, trimmed
     */
    public static String removeHTML(String str) {
        return removeHTML(str, true);
    }
}
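For comparison, roughly the same effect can be sketched with a regular expression. This is a different technique from the character scan in StringUtil, shown only to illustrate what the converter does to the text before indexing. HtmlStripSketch and stripTags are hypothetical names, and unlike removeHTML this version also collapses runs of whitespace.

```java
// Hypothetical illustration only: a regex-based alternative, not the StringUtil code.
final class HtmlStripSketch {

    // Replaces each "<...>" run with one space, then collapses repeated whitespace.
    static String stripTags(String html) {
        if (html == null) {
            return "";
        }
        String withoutTags = html.replaceAll("<[^>]*>", " ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints: Hello Compass !
        System.out.println(stripTags("<p>Hello <b>Compass</b>!</p>"));
    }
}
```

Both approaches treat anything between "<" and ">" as a tag, so neither is a real HTML parser; text such as "a < b" in an article would need extra care.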
6. Configuration file
7. Results (English)

Chinese

IV. Summary of problems
1. Exception: there are more terms than documents in field "XXX", but it's impossible to sort on tokenized fields.
The description of Sort in the Lucene API says:
The fields used to determine sort order must be carefully chosen. Documents must contain a single term in such a field, and the value of the term should indicate the document's relative position in a given sort order. The field must be indexed, but should not be tokenized, and does not need to be stored (unless you happen to want it back with the rest of your document data). In other words:
document.add (new Field ("byNumber", Integer.toString(x), Field.Store.NO, Field.Index.NOT_ANALYZED));
Pay special attention to the emphasized part of this description (marked in red in the original post): the field used for sorting must be indexed and should not be tokenized, which corresponds to index = Index.NOT_ANALYZED, store = Store.NO in the @SearchableProperty annotation. I do not fully understand the remark in parentheses; if anyone knows, I would appreciate a hint. Thanks in advance.
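The wording of the exception can be understood with a small simulation. The sketch below is a simplified model in plain Java, not Lucene's actual code: SortFieldSketch and its methods are hypothetical names, and a whitespace split stands in for the analyzer. It counts the distinct terms a field produces across all documents and compares that count with the number of documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: a simplified model, not Lucene's implementation.
final class SortFieldSketch {

    // Counts the distinct terms one field produces across all documents.
    static int distinctTerms(List<String> fieldValues, boolean tokenized) {
        Set<String> terms = new HashSet<String>();
        for (String value : fieldValues) {
            if (tokenized) {
                // Stand-in for an analyzer: lowercase, then split on whitespace.
                terms.addAll(Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\s+")));
            } else {
                terms.add(value);
            }
        }
        return terms.size();
    }

    // More distinct terms than documents means at least one document has
    // several terms in this field, so it has no single value to sort by.
    static boolean passesTermCountCheck(List<String> fieldValues, boolean tokenized) {
        return distinctTerms(fieldValues, tokenized) <= fieldValues.size();
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Compass in action", "Lucene in depth");
        System.out.println(passesTermCountCheck(titles, true));   // tokenized field
        System.out.println(passesTermCountCheck(titles, false));  // not tokenized
    }
}
```

With two documents the tokenized titles yield five distinct terms, so the check fails, while the untokenized titles yield exactly one term per document. Passing the check does not by itself guarantee a sensible sort order; it only rules out the obvious case the exception message describes.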
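2. Exception: java.lang.RuntimeException: field "XXX" does not appear to be indexed
After indexing several tables, adding a sort condition to a search without specifying SortPropertyType throws the exception above when sorting on a field that has no converter specified. When only a single table is indexed, the problem does not occur.
V. This post summarizes material I found online during this study. My thanks to the bloggers whose posts I drew on! The article still has many shortcomings; corrections are welcome and much appreciated.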
VI. Other resources
Introduction to Compass
http://www.yeeach.com/2008/03/23/compass-%E5%85%A5%E9%97%A8%E6%8C%87%E5%8D%97/
A solution for highlighting
http://jdkcn.com/entry/the-better-revolution-about-the-compass-lucene-highlight.html (this site is open source, which is helpful for learning)
Source: http://blog.csdn.net/ygj26/article/details/5552059