java - Add BM25 scoring in Lucene
I am a newcomer to Lucene. I am using Lucene in Java with lucene-3.6.0.jar, and I followed the tutorial at http://www.tutorialspoint.com/lucene/. My base code is as follows:
    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;

    // Indexer, Searcher, TextFileFilter and LuceneConstants are the helper
    // classes from the tutorial.
    public class LuceneTester {
        String indexDir = "data/indexdir";
        String dataDir = "data/datadir";
        Indexer indexer;
        Searcher searcher;

        public static void test() {
            LuceneTester tester;
            try {
                tester = new LuceneTester();
                tester.createIndex();
                tester.search("malformed");
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }

        // Index every text file in dataDir and report how long it took.
        private void createIndex() throws IOException {
            indexer = new Indexer(indexDir);
            int numIndexed;
            long startTime = System.currentTimeMillis();
            numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
            long endTime = System.currentTimeMillis();
            indexer.close();
            System.out.println(numIndexed + " file indexed, time taken: "
                    + (endTime - startTime) + " ms");
        }

        // Run a fuzzy query on the contents field and print the matching file paths.
        private void search(String searchQuery) throws IOException, ParseException {
            searcher = new Searcher(indexDir);
            long startTime = System.currentTimeMillis();
            Term term = new Term(LuceneConstants.CONTENTS, searchQuery);
            Query query = new FuzzyQuery(term);
            System.out.println("query: " + query.toString());
            TopDocs hits = searcher.search(query, Sort.RELEVANCE);
            long endTime = System.currentTimeMillis();
            System.out.println(hits.totalHits + " documents found. time :"
                    + (endTime - startTime));
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = searcher.getDocument(scoreDoc);
                System.out.println("file: " + doc.get(LuceneConstants.FILE_PATH));
            }
            searcher.close();
        }
    }
Now, instead of the default scoring technique, I want to use BM25 similarity. How can I do it?
Lucene versions before 4.0 didn't have the information needed for BM25, i.e. document-level IDF and average field lengths, so implementing BM25 in such an old version isn't directly possible (you could store the needed information externally and/or approximate it; see http://www.slideshare.net/yuvalf/bm25-scoring-for-lucene-from-academia-to-industry for the idea).
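To make it concrete why those statistics matter, here is a minimal, self-contained sketch of the textbook BM25 formula for a single query term. It is not Lucene code and not how Lucene implements it internally; the class name `Bm25Sketch`, its helper methods, and the tiny sample corpus are all made up for illustration. The parameters k1 = 1.2 and b = 0.75 are the commonly used defaults, and the IDF variant with the `1 +` inside the logarithm is one common choice, assumed here.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of Lucene.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (assumed default)
    static final double B = 0.75;  // strength of length normalization (assumed default)

    // IDF needs corpus-wide counts: total documents and documents containing the term.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // The per-term score needs the document length AND the average length in the corpus.
    static double termScore(double tf, double idf, double docLen, double avgDocLen) {
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf * tf * (K1 + 1.0) / (tf + norm);
    }

    static long termFrequency(String[] doc, String term) {
        return Arrays.stream(doc).filter(term::equals).count();
    }

    static long documentFrequency(List<String[]> docs, String term) {
        return docs.stream().filter(d -> termFrequency(d, term) > 0).count();
    }

    static double averageLength(List<String[]> docs) {
        return docs.stream().mapToInt(d -> d.length).average().orElse(0.0);
    }

    // Score one document of the corpus against a single query term.
    static double score(List<String[]> docs, String[] doc, String term) {
        double termIdf = idf(docs.size(), documentFrequency(docs, term));
        return termScore(termFrequency(doc, term), termIdf, doc.length, averageLength(docs));
    }

    // A made-up three-document corpus, already tokenized on whitespace.
    static List<String[]> sampleDocs() {
        return Arrays.asList(
                "malformed input was rejected".split(" "),
                "the parser accepted the input".split(" "),
                "malformed malformed header".split(" "));
    }

    public static void main(String[] args) {
        List<String[]> docs = sampleDocs();
        for (String[] doc : docs) {
            System.out.println(String.join(" ", doc) + " -> " + score(docs, doc, "malformed"));
        }
    }
}
```

Note that `score` cannot be computed from one document alone: it needs the document count, the document frequency of the term, and the average document length, which is exactly the kind of index-wide information referred to above.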
From 4.0 on, Lucene includes an (at first experimental) implementation of BM25: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
As femtoRgon suggested, using Lucene 6 or newer would give you BM25 out of the box. If that isn't an option, you can at least use Lucene 4+, where you can change the default similarity to BM25:
    IndexSearcher searcher = ...
    searcher.setSimilarity(new BM25Similarity());