java - Add BM25 scoring in Lucene


I am a newcomer to Lucene. I am using Lucene in Java with lucene-3.6.0.jar, and I followed the tutorial at http://www.tutorialspoint.com/lucene/. My base code is as follows:

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;

    // Indexer, Searcher, LuceneConstants and TextFileFilter are the helper
    // classes from the tutorial, not part of Lucene itself.
    public class LuceneTester {
        String indexDir = "data/indexdir";
        String dataDir = "data/datadir";
        Indexer indexer;
        Searcher searcher;

        public static void test() {
            LuceneTester tester;
            try {
                tester = new LuceneTester();
                tester.createIndex();
                tester.search("malformed");
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }

        // Index every text file under dataDir and report how long it took.
        private void createIndex() throws IOException {
            indexer = new Indexer(indexDir);
            int numIndexed;
            long startTime = System.currentTimeMillis();
            numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
            long endTime = System.currentTimeMillis();
            indexer.close();
            System.out.println(numIndexed + " file indexed, time taken: "
                    + (endTime - startTime) + " ms");
        }

        // Run a fuzzy query on the contents field, sorted by relevance.
        private void search(String searchQuery) throws IOException, ParseException {
            searcher = new Searcher(indexDir);
            long startTime = System.currentTimeMillis();
            Term term = new Term(LuceneConstants.CONTENTS, searchQuery);
            Query query = new FuzzyQuery(term);
            System.out.println("query: " + query.toString());
            TopDocs hits = searcher.search(query, Sort.RELEVANCE);
            long endTime = System.currentTimeMillis();
            System.out.println(hits.totalHits + " documents found. time :"
                    + (endTime - startTime));
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = searcher.getDocument(scoreDoc);
                System.out.println("file: " + doc.get(LuceneConstants.FILE_PATH));
            }
            searcher.close();
        }
    }

Now, instead of the default scoring technique, I want to use BM25 similarity. How can I do it?

Lucene versions before Lucene 4.0 didn't have the information needed for BM25, i.e. document-level IDF and average field lengths, so implementing BM25 in such an old version isn't directly possible (you would have to store the needed information externally and/or approximate it; see http://www.slideshare.net/yuvalf/bm25-scoring-for-lucene-from-academia-to-industry for an idea).
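To see why those statistics matter, here is a minimal plain-Java sketch of the usual BM25 term score. It is not Lucene's code and uses no Lucene classes: the class name Bm25Sketch, its methods, and the sample documents are made up for illustration, and k1 = 1.2 and b = 0.75 are assumed because they are the commonly used defaults. Note that both the document frequency and the average document length have to be computed over the whole collection.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class Bm25Sketch {
    static final double K1 = 1.2;   // term-frequency saturation (assumed default)
    static final double B = 0.75;   // strength of length normalization (assumed default)

    // IDF from collection-wide statistics: total document count and the
    // number of documents that contain the term.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Average document length in tokens, again a collection-wide statistic.
    static double averageLength(List<List<String>> docs) {
        long total = 0;
        for (List<String> d : docs) {
            total += d.size();
        }
        return docs.isEmpty() ? 0.0 : (double) total / docs.size();
    }

    // BM25 score of a single term for one document of the collection.
    static double score(String term, List<String> doc, List<List<String>> docs) {
        long tf = doc.stream().filter(term::equals).count();
        if (tf == 0) {
            return 0.0;
        }
        long df = docs.stream().filter(d -> d.contains(term)).count();
        double avgdl = averageLength(docs);
        double norm = K1 * (1.0 - B + B * doc.size() / avgdl);
        return idf(docs.size(), df) * tf * (K1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("malformed", "input", "file"),
                Arrays.asList("valid", "input"),
                Arrays.asList("malformed", "header", "in", "a", "much", "longer", "file"));
        for (List<String> doc : docs) {
            System.out.println(doc + " -> " + score("malformed", doc, docs));
        }
    }
}
```

With the same term frequency, the shorter document gets the higher score, which is exactly the part that needs the average length. An index that doesn't keep that statistic can only approximate it.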

From 4.0 on, Lucene includes an (at first experimental) implementation of BM25: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

As femtoRgon suggested, I recommend using Lucene 6 or newer, which gives you BM25 out of the box. If that isn't possible, you can at least use Lucene 4+, where you can change the default similarity to BM25:

    IndexSearcher searcher = ...
    searcher.setSimilarity(new BM25Similarity());
