Lucene 5 Code Example

Indexing

Analyzer

  • WhitespaceAnalyzer: Splits tokens on whitespace
  • SimpleAnalyzer: Splits tokens on non-letters, and then lowercases
  • StopAnalyzer: Same as SimpleAnalyzer, but also removes stop words
  • StandardAnalyzer: The most sophisticated of the built-in analyzers; it recognizes certain token types, lowercases, and removes stop words (see the tokenization sketch after this list)
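
To see what a given analyzer actually produces, you can run some text through its token stream. A minimal sketch (the field name and sample text are arbitrary):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Analyzer analyzer = new StandardAnalyzer();
try (TokenStream ts = analyzer.tokenStream("contents", "The Quick brown FOX jumped!")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();                                  // must be called before incrementToken()
    while (ts.incrementToken()) {
        System.out.println(term.toString());     // StandardAnalyzer emits: quick, brown, fox, jumped
    }
    ts.end();
}

The full indexing example below uses StandardAnalyzer.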
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExamples {

    public static void main(String[] args) {
        indexDirectory();
        search("java");
    }   

    private static void indexDirectory() {      
         //Apache Lucene: index a directory of .txt files
         try {
             //open the index directory via an NIO Path (Lucene 5's FSDirectory.open no longer takes a File)
             //create and cache the IndexWriter, as it is costly to create
             Path path = Paths.get("C:/Users/Tuna/Desktop/lucene-5.1.0/indexes");
             Directory directory = FSDirectory.open(path);
             IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());  
             config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); 
             IndexWriter indexWriter = new IndexWriter(directory, config);

             indexWriter.deleteAll();
             File f = new File("C:/Users/Tuna/Desktop/sample"); // directory containing the .txt files to index
             for (File file : f.listFiles()) {
                    System.out.println("indexed " + file.getCanonicalPath());               
                    Document doc = new Document();
                    doc.add(new TextField("path", file.getName(), Store.YES));
                    FileInputStream is = new FileInputStream(file);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
                    StringBuffer stringBuffer = new StringBuffer();
                    String line = null;
                    while((line = reader.readLine())!=null){
                      stringBuffer.append(line).append("\n");
                    }
                    reader.close();
                    doc.add(new TextField("contents", stringBuffer.toString(), Store.YES));
                    indexWriter.addDocument(doc);           
             }               
             indexWriter.close();           
             directory.close();
        } catch (Exception e) {
            e.printStackTrace();
        }                   
    }

 }
  • The index directory (here C:/Users/Tuna/Desktop/lucene-5.1.0/indexes) is where Lucene writes its index files.
  • Directory is a flat list of files used for storing the index. It can be a RAMDirectory, an FSDirectory, or a database-backed directory.
  • FSDirectory implements Directory and stores the index as files on the file system.
  • IndexWriterConfig.OpenMode opens the writer in CREATE, CREATE_OR_APPEND, or APPEND mode. CREATE creates a new index if none exists, or overwrites an existing one.
  • In CREATE_OR_APPEND mode an existing index is kept and appended to rather than overwritten. Note that the addBookToIndex method below does not close the writer, because creating an IndexWriter is a costly operation. Instead of creating a writer every time a document is written, create it once (or keep a small pool shared across threads), write through it, and reuse it; a minimal helper sketch follows.
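A minimal sketch of such a helper is below. The class name LuceneUtil and its getIndexWriter method are assumptions chosen to match the luceneUtil field used in the next snippet; the path and analyzer are just examples.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneUtil {
    private IndexWriter indexWriter;   // created once and reused across indexing calls

    public synchronized IndexWriter getIndexWriter() throws Exception {
        if (indexWriter == null) {
            Directory directory = FSDirectory.open(Paths.get("C:/Users/Tuna/Desktop/lucene-5.1.0/indexes"));
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            indexWriter = new IndexWriter(directory, config);
        }
        return indexWriter;
    }

    public synchronized void close() throws Exception {
        if (indexWriter != null) {
            indexWriter.close();       // call once at application shutdown
        }
    }
}

Because IndexWriter is thread-safe, a single cached instance shared by all indexing threads is usually enough in place of a full pool.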
public void addBookToIndex(BookVO bookVO) throws Exception {
     Document document = new Document();
     // StringField is indexed as a single, un-analyzed token (good for exact matches and sorting)
     document.add(new StringField("title", bookVO.getBook_name(), Field.Store.YES));
     document.add(new StringField("author", bookVO.getBook_author(), Field.Store.YES));
     document.add(new StringField("category", bookVO.getCategory(), Field.Store.YES));
     // IntField/FloatField index the value so it can be used in numeric range queries
     document.add(new IntField("numpage", bookVO.getNumpages(), Field.Store.YES));
     document.add(new FloatField("price", bookVO.getPrice(), Field.Store.YES));
     // reuse the cached writer; commit() makes the new document visible to newly opened readers
     IndexWriter writer = this.luceneUtil.getIndexWriter();
     writer.addDocument(document);
     writer.commit();
 }


Query types supported by Lucene (construction examples for a few of them follow the list)

  • TermQuery
  • BooleanQuery
  • WildcardQuery
  • PhraseQuery
  • PrefixQuery
  • MultiPhraseQuery
  • FuzzyQuery
  • RegexpQuery
  • TermRangeQuery
  • NumericRangeQuery
  • ConstantScoreQuery
  • DisjunctionMaxQuery
  • MatchAllDocsQuery
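
A short sketch of how a few of these are built programmatically; the field names and values are only illustrative, reusing the fields from the examples above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// exact term
Query term = new TermQuery(new Term("contents", "lucene"));

// terms in order: "apache" followed by "lucene"
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "apache"));
phrase.add(new Term("contents", "lucene"));

// terms within an edit distance of 2
Query fuzzy = new FuzzyQuery(new Term("contents", "lucen"), 2);

// terms matching a pattern
Query wildcard = new WildcardQuery(new Term("contents", "luc*ne"));
Query regexp = new RegexpQuery(new Term("contents", "luc.*"));

// numeric range over an IntField such as "numpage" above
Query range = NumericRangeQuery.newIntRange("numpage", 100, 500, true, true);

// combine clauses
BooleanQuery bool = new BooleanQuery();
bool.add(term, BooleanClause.Occur.MUST);      // required clause
bool.add(range, BooleanClause.Occur.SHOULD);   // optional clause, boosts the score when it matches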
private static void search(String text) {   
   //Apache Lucene searching text inside .txt files
   try {  
        //you should cache and create a pool for IndexSearcher as it is costly to create
        Path path = Paths.get("C:/Users/Tuna/Desktop/lucene-5.1.0/indexes");
        Directory directory = FSDirectory.open(path);       
        IndexReader indexReader =  DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        QueryParser queryParser = new QueryParser("contents",  new StandardAnalyzer());  
        Query query = queryParser.parse(text);
        TopDocs topDocs = indexSearcher.search(query,10);
        System.out.println("totalHits " + topDocs.totalHits);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {           
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println("path " + document.get("path"));
            System.out.println("content " + document.get("contents"));
        }
        indexReader.close();
        directory.close();
    } catch (Exception e) {
         e.printStackTrace();
    }
}
  • The analyzer passed to the QueryParser at search time should be the same one used by the IndexWriter at index time; otherwise query terms may not match the indexed terms.
  • Creating an IndexSearcher is also a costly operation, so it makes sense to cache it (or keep a pool) and reuse it, just like the IndexWriter; a SearcherManager sketch follows.
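Rather than hand-rolling a searcher pool, Lucene provides SearcherManager, which caches an IndexSearcher and refreshes it when the index changes. A minimal sketch using the same index directory as above:

import java.nio.file.Paths;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory directory = FSDirectory.open(Paths.get("C:/Users/Tuna/Desktop/lucene-5.1.0/indexes"));
SearcherManager searcherManager = new SearcherManager(directory, null); // null = default SearcherFactory

// per search request
IndexSearcher searcher = searcherManager.acquire();
try {
    // ... run queries with this searcher ...
} finally {
    searcherManager.release(searcher);   // always release; never close the searcher yourself
}

// after the index has been updated, pick up the changes
searcherManager.maybeRefresh();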
final int MAX_RESULTS = 10000;

//======= prepare query ==============
Sort sort = new Sort(new SortField("author", SortField.Type.STRING));
Query termQuery = new TermQuery(new Term("completeText","intelligence"));
PrefixQuery prefixQuery = new PrefixQuery(new Term("completeText",""));

//full text
String field = "title";
String value = "intel*";                                // example wildcard pattern for the query below
Analyzer analyzer = new StandardAnalyzer();             // Lucene 5: StandardAnalyzer no longer takes a Version
QueryParser parser = new QueryParser(field, analyzer);  // Lucene 5: QueryParser no longer takes a Version
BooleanQuery query = new BooleanQuery();
query.add(new WildcardQuery(new Term(field, value)), BooleanClause.Occur.SHOULD);


//======= perform search ==============      
// search with sort and max
final TopDocs topDocs = this.indexSearcher.search(query, MAX_RESULTS, sort);

// search with max
final TopDocs topDocs = this.indexSearcher.search( query, MAX_RESULTS );

// use a Collector when the number of hits is not known up front
// (the old HitCollector API is gone in Lucene 5; SimpleCollector replaces it)
final ArrayList<Integer> docs = new ArrayList<Integer>();
searcher.search(new TermQuery(t), new SimpleCollector() {   // t: a previously built Term
    private int docBase;

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        this.docBase = context.docBase;      // collect() receives segment-relative doc ids
    }

    @Override
    public void collect(int doc) {
        docs.add(docBase + doc);             // store the global doc id
    }

    @Override
    public boolean needsScores() {
        return false;                        // we only need doc ids, not scores
    }
});

//======= read result =================
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    //access stored document fields
    System.out.println(doc.get("title"));
    // "FILE" is the field that recorded the original file indexed
    File f = new File(doc.get("FILE"));
}

// from hit collector
for(Integer docid : docs) {
    Document doc = searcher.doc(docid);
    ...
}
  • As well as the traditional tf-idf vector space model (DefaultSimilarity), Lucene 5 also ships with (see the configuration sketches below):
    • BM25 (BM25Similarity)
    • DFR (divergence from randomness, DFRSimilarity)
    • IB (information-based similarity, IBSimilarity)
indexSearcher.setSimilarity(new BM25Similarity());       // defaults: k1 = 1.2, b = 0.75
BM25Similarity custom = new BM25Similarity(1.2f, 0.75f); // k1, b (float arguments)
indexSearcher.setSimilarity(custom);
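
The DFR and IB similarities are assembled from component classes; the combinations below are just one possible choice among several:

import org.apache.lucene.search.similarities.*;

// divergence from randomness: basic model + after-effect + normalization
Similarity dfr = new DFRSimilarity(new BasicModelG(), new AfterEffectB(), new NormalizationH2());
indexSearcher.setSimilarity(dfr);

// information-based: distribution + lambda + normalization
Similarity ib = new IBSimilarity(new DistributionLL(), new LambdaDF(), new NormalizationH1());
indexSearcher.setSimilarity(ib);

For consistent scoring, the same Similarity should also be set on the IndexWriterConfig (via setSimilarity) before indexing.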

Multi-valued Field

Document doc = new Document();
// adding two fields with the same name makes "author" a multi-valued field
doc.add(new TextField("author", "chris manning", Field.Store.YES));
doc.add(new TextField("author", "prabhakar raghavan", Field.Store.YES));
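
At search time all values of a multi-valued stored field come back together; Document.getValues returns them as an array (indexSearcher and scoreDoc are as in the search example above):

Document hit = indexSearcher.doc(scoreDoc.doc);
String[] authors = hit.getValues("author");   // {"chris manning", "prabhakar raghavan"}
for (String author : authors) {
    System.out.println(author);
}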
