Lucene 5 Code Example
Indexing
Analyzer
- WhitespaceAnalyzer: Splits tokens on whitespace
- SimpleAnalyzer: Splits tokens on non-letters, and then lowercases
- StopAnalyzer: Same as SimpleAnalyzer, but also removes stop words
- StandardAnalyzer: The most sophisticated of the four; recognizes certain token types, lowercases, and removes stop words
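To make the differences in the list above concrete, here is a minimal plain-Java sketch that imitates the first three analyzers on one sample string. It is an approximation for illustration only and does not call Lucene; the class name AnalyzerSketch, its method names, and the tiny stop-word list are assumptions made for this example. StandardAnalyzer is left out because its tokenization rules are too involved to imitate in a few lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximation of three analyzers; NOT Lucene's implementation.
class AnalyzerSketch {
    // Hypothetical, tiny stop-word list used only for this illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and"));

    // WhitespaceAnalyzer-like: split on runs of whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // StopAnalyzer-like: SimpleAnalyzer output minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox, v2 is FAST";
        System.out.println(whitespace(text)); // [The, Quick-Brown, fox,, v2, is, FAST]
        System.out.println(simple(text));     // [the, quick, brown, fox, v, is, fast]
        System.out.println(stop(text));       // [quick, brown, fox, v, fast]
    }
}
```

Note how "v2" becomes "v" once tokens are split on non-letters, which is one reason the choice of analyzer matters for identifiers and version strings.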
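import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExamples {
    public static void main(String[] args) {
        indexDirectory();
        search("java"); // search() is listed in the Search section
    }

    // Indexes every file in the sample directory (expected to contain .txt files)
    private static void indexDirectory() {
        Path path = Paths.get("C:/Users/Tuna/Desktop/lucene-5.1.0/indexes");
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        // An IndexWriter is costly to create; a real application creates it once and reuses it.
        // try-with-resources closes the writer first, then the directory, even on failure.
        try (Directory directory = FSDirectory.open(path);
             IndexWriter indexWriter = new IndexWriter(directory, config)) {
            indexWriter.deleteAll(); // start from an empty index on every run
            File[] files = new File("C:/Users/Tuna/Desktop/sample").listFiles();
            if (files == null) {
                return; // not a directory, or it could not be read
            }
            for (File file : files) {
                if (!file.isFile()) {
                    continue; // skip subdirectories
                }
                System.out.println("indexed " + file.getCanonicalPath());
                Document doc = new Document();
                doc.add(new TextField("path", file.getName(), Store.YES));
                // Read the whole file (platform default charset) into one string
                StringBuilder contents = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(file)))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        contents.append(line).append("\n");
                    }
                }
                doc.add(new TextField("contents", contents.toString(), Store.YES));
                indexWriter.addDocument(doc);
            }
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
        }
    }
}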
- The index directory (path in the code above) is where the index is created.
- Directory is a flat list of files used for storing the index. It can be a RAMDirectory, an FSDirectory or a database-backed directory.
- FSDirectory is the Directory implementation that stores the index as files in the file system.
- IndexWriterConfig.OpenMode determines how the writer opens the index: CREATE, CREATE_OR_APPEND or APPEND. CREATE creates a new index if none exists and overwrites an existing one. APPEND opens an existing index and fails if none exists.
- In CREATE_OR_APPEND mode an existing index is not overwritten; a new one is created only if none exists. Note that the addBookToIndex method below does not close the writer, because creating an IndexWriter is a costly operation. We should therefore not create a writer every time we write a document to the index. Only one IndexWriter can be open on an index directory at a time, and IndexWriter is thread-safe, so instead of pooling writers, keep a single long-lived IndexWriter and share it between threads.
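import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// BookVO and luceneUtil are application classes that are not shown here
public void addBookToIndex(BookVO bookVO) throws Exception {
    Document document = new Document();
    // StringField indexes the whole value as a single token (exact match only)
    document.add(new StringField("title", bookVO.getBook_name(), Field.Store.YES));
    document.add(new StringField("author", bookVO.getBook_author(), Field.Store.YES));
    document.add(new StringField("category", bookVO.getCategory(), Field.Store.YES));
    document.add(new IntField("numpage", bookVO.getNumpages(), Field.Store.YES));
    document.add(new FloatField("price", bookVO.getPrice(), Field.Store.YES));
    // Reuse the shared writer; it is intentionally not closed here
    IndexWriter writer = this.luceneUtil.getIndexWriter();
    writer.addDocument(document);
    writer.commit(); // commits are expensive; batch several documents per commit when bulk loading
}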
Search
Query types supported by Lucene
- TermQuery
- BooleanQuery
- WildcardQuery
- PhraseQuery
- PrefixQuery
- MultiPhraseQuery
- FuzzyQuery
- RegexpQuery
- TermRangeQuery
- NumericRangeQuery
- ConstantScoreQuery
- DisjunctionMaxQuery
- MatchAllDocsQuery
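// Imports used by search() in addition to those needed for indexing
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Searches the "contents" field of the indexed .txt files; this method belongs inside LuceneExamples
private static void search(String text) {
    Path path = Paths.get("C:/Users/Tuna/Desktop/lucene-5.1.0/indexes");
    // Opening an IndexReader is costly; a real application keeps the reader and searcher open and reuses them
    try (Directory directory = FSDirectory.open(path);
         IndexReader indexReader = DirectoryReader.open(directory)) {
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Use the same analyzer as at index time so query terms match the indexed terms
        QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
        Query query = queryParser.parse(text);
        TopDocs topDocs = indexSearcher.search(query, 10); // top 10 hits by score
        System.out.println("totalHits " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println("path " + document.get("path"));
            System.out.println("content " + document.get("contents"));
        }
    } catch (Exception e) {
        // TODO: handle exception
        e.printStackTrace();
    }
}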
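- Use the same Analyzer at search time (in the QueryParser) as the IndexWriter used at index time, so that query text is tokenized the same way as the indexed text.
- Opening the IndexReader behind an IndexSearcher is a costly operation. IndexSearcher is thread-safe, so rather than creating one per query, create it once and share it between threads; Lucene's SearcherManager class exists for this purpose and also handles reopening after index changes.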
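The first point above can be illustrated without Lucene. The following toy inverted index is a sketch under stated assumptions: the class AnalyzerMismatchSketch and the two "analyzers" (simple functions that split on whitespace, with or without lowercasing) are hypothetical names for this example and are far simpler than real Lucene analyzers. It shows a query missing a document only because the query text was analyzed differently from the indexed text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Toy inverted index; NOT Lucene. Shows why index-time and query-time analysis must agree.
class AnalyzerMismatchSketch {
    // Two stand-in "analyzers": both split on whitespace, only one lowercases.
    static final Function<String, String[]> LOWERCASING =
            s -> s.toLowerCase(Locale.ROOT).split("\\s+");
    static final Function<String, String[]> CASE_PRESERVING =
            s -> s.split("\\s+");

    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Function<String, String[]> indexAnalyzer;

    AnalyzerMismatchSketch(Function<String, String[]> indexAnalyzer) {
        this.indexAnalyzer = indexAnalyzer;
    }

    // Index time: analyze the text and record the document id under each term.
    void add(int docId, String text) {
        for (String term : indexAnalyzer.apply(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time: analyze the query and return documents containing any resulting term.
    Set<Integer> search(String queryText, Function<String, String[]> queryAnalyzer) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryAnalyzer.apply(queryText)) {
            hits.addAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        AnalyzerMismatchSketch index = new AnalyzerMismatchSketch(LOWERCASING);
        index.add(1, "Java Search Engine");
        System.out.println(index.search("Java", LOWERCASING));     // [1]
        System.out.println(index.search("Java", CASE_PRESERVING)); // []
    }
}
```

The second search finds nothing because the index only contains the term "java", while the case-preserving analysis looks up "Java".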
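// Fragments showing several ways to build and run queries; indexSearcher is an open IndexSearcher
final int MAX_RESULTS = 10000;
//======= prepare query ==============
// Note: in Lucene 5, sorting on a field requires doc values for it (e.g. a SortedDocValuesField)
Sort sort = new Sort(new SortField("author", SortField.Type.STRING));
Query termQuery = new TermQuery(new Term("completeText", "intelligence"));
PrefixQuery prefixQuery = new PrefixQuery(new Term("completeText", ""));
// full text
String field = "title";
// Lucene 5 removed the Version arguments from these constructors
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser(field, analyzer);
String value = "luc*"; // placeholder wildcard pattern (not in the original)
BooleanQuery query = new BooleanQuery();
query.add(new WildcardQuery(new Term(field, value)), BooleanClause.Occur.SHOULD);
//======= perform search ==============
// search with sort and max
final TopDocs sortedDocs = indexSearcher.search(query, MAX_RESULTS, sort);
// search with max
final TopDocs topDocs = indexSearcher.search(query, MAX_RESULTS);
// use a Collector if max is not clear
// (HitCollector was removed in Lucene 3.0; SimpleCollector is the Lucene 5 equivalent)
Term t = new Term("completeText", "intelligence"); // placeholder term (not in the original)
final List<Integer> docs = new ArrayList<Integer>();
indexSearcher.search(new TermQuery(t), new SimpleCollector() {
    private int docBase;
    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        docBase = context.docBase; // ids passed to collect() are relative to the current segment
    }
    @Override
    public void collect(int doc) {
        docs.add(docBase + doc);
    }
    @Override
    public boolean needsScores() {
        return false; // only document ids are collected
    }
});
//======= read result =================
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = indexSearcher.doc(scoreDoc.doc);
    // access document fields
    System.out.println(doc.get("title"));
    // "FILE" is the field that recorded the original file indexed
    File f = new File(doc.get("FILE"));
}
// from the collector
for (Integer docid : docs) {
    Document doc = indexSearcher.doc(docid);
    ...
}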
- In addition to the traditional tf-idf vector space model, Lucene 5.0 provides:
  - BM25 (BM25Similarity)
  - DFR, divergence from randomness (DFRSimilarity)
  - IB, information-based similarity (IBSimilarity)
indexSearcher.setSimilarity(new BM25Similarity());
BM25Similarity custom = new BM25Similarity(1.2f, 0.75f); // k1, b (the constructor takes floats)
indexSearcher.setSimilarity(custom);
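To show what the k1 and b parameters control, here is a minimal plain-Java sketch of the textbook BM25 formula for a single query term. It is not Lucene's BM25Similarity code: the class Bm25Sketch and its parameter names are assumptions for this example, and Lucene additionally stores document lengths in a compressed, lossy form, so real scores will differ slightly.

```java
// Plain-Java sketch of the BM25 formula for one query term; NOT Lucene's BM25Similarity code.
class Bm25Sketch {
    // idf = ln(1 + (N - n + 0.5) / (n + 0.5)), with N = number of documents
    // and n = number of documents containing the term.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // score = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
    // k1 controls how quickly repeated occurrences saturate;
    // b controls how strongly long documents are penalized (0 = not at all).
    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return idf(docCount, docFreq) * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;
        // Same term frequency, same collection statistics, different document lengths.
        double shortDoc = score(2, 50, 100, 1000, 10, k1, b);
        double longDoc = score(2, 200, 100, 1000, 10, k1, b);
        System.out.println("short doc: " + shortDoc);
        System.out.println("long doc:  " + longDoc);
        System.out.println("short doc ranks higher: " + (shortDoc > longDoc));
    }
}
```

With b set to 0 the two documents would score the same, and a larger k1 lets additional occurrences of the term keep raising the score for longer before it levels off.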
Multi-valued Field
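// Adding the same field name more than once makes the field multi-valued
Document doc = new Document();
doc.add(new TextField("author", "chris manning", Field.Store.YES));
doc.add(new TextField("author", "prabhakar raghavan", Field.Store.YES));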
Reference