Apache Lucene indexing and searching

6/13/2023

It seems your question is more about index merging than about indexing itself.

The indexing process is quite simple if you ignore low-level details. Lucene forms what is called an "inverted index" from documents. So if a document with the text "To be or not to be" and id=1 comes in, the inverted index would look like:

to → 1
be → 1
or → 1
not → 1

That is basically it: an index from each word to the list of documents containing that word. Each line of this index (word) is called a posting list. This index is then persisted on long-term storage.

In reality, of course, things are more complicated:

- Lucene may skip some words, depending on the particular Analyzer given.
- Words can be preprocessed using a stemming algorithm to reduce inflection in the language.
- A posting list can contain not only the identifiers of the documents, but also the offset of the given word inside the document (potentially several instances) and some other additional information.

There are many more complications, which are not so important for a basic understanding.

It's important to understand, though, that a Lucene index is append-only. At some point in time the application decides to commit (publish) all the changes in the index. Lucene finishes all service operations on the index and closes it, so it is available for searching. This index (or index part) is called a segment. When Lucene executes a search for a query, it searches all available segments.

So the question arises: how can we change an already indexed document? New documents, or new versions of already indexed documents, are indexed in new segments, and old versions are invalidated in previous segments using a so-called kill list. The kill list is the only part of a committed index that can change.

In a nutshell, when Lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.

Terms are generated using an analyzer which stems each word to its root form. The most popular stemming algorithm for the English language is the Porter stemming algorithm. When a query is issued, it is processed through the same analyzer that was used to build the index and then used to look up the matching term(s) in the index. That provides a list of documents that match the query.
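To make the ideas above concrete, here is a minimal sketch in plain Python. It is a toy model, not Lucene's API or on-disk format: the names `analyze`, `Segment`, `ToyIndex` and the stop-word list are all hypothetical, and the analyzer only lowercases and tokenizes (no stemming). It shows an inverted index per segment, append-only commits, and a kill list that hides old versions of a document.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the"}


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class Segment:
    """An inverted index that never changes after creation, plus a kill list."""

    def __init__(self, docs):
        postings = {}
        for doc_id, text in docs.items():
            for term in set(analyze(text)):
                postings.setdefault(term, []).append(doc_id)
        # term -> sorted list of document ids (the posting list)
        self.postings = {term: sorted(ids) for term, ids in postings.items()}
        self.doc_ids = set(docs)
        self.killed = set()  # the only part that is modified after commit

    def live_postings(self, term):
        """Posting list for a term, without documents on the kill list."""
        return [d for d in self.postings.get(term, []) if d not in self.killed]


class ToyIndex:
    def __init__(self):
        self.segments = []  # committed, searchable segments
        self.pending = {}   # buffered documents, not yet searchable

    def add(self, doc_id, text):
        self.pending[doc_id] = text

    def commit(self):
        if not self.pending:
            return
        # Old versions in earlier segments are invalidated, not rewritten.
        for seg in self.segments:
            seg.killed |= seg.doc_ids & set(self.pending)
        self.segments.append(Segment(self.pending))
        self.pending = {}

    def search(self, query):
        """Return ids of live documents that contain every query term."""
        terms = analyze(query)  # same analyzer as at indexing time
        hits = set()
        for seg in self.segments:
            if terms:
                hits |= set.intersection(*(set(seg.live_postings(t)) for t in terms))
        return sorted(hits)
```

Adding a new version of document 1 and committing creates a second segment; the first segment's postings stay untouched and only its kill list grows, so a search no longer returns the old version.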

In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms into RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). And the Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List: essentially, to make O(log n) look-ups possible (again, much like B-Trees).

So once the inverted (term) index, which is based on a Skip-List data structure, is built from the documents, the index is stored on disk. Lucene then loads (as already said: possibly only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologick.

Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap, and gives a basic idea of how FSTs work (i.e., how the FST compacts the byte sequences to make the memory use of this mapping grow sub-linearly). And he points to the paper that describes the particular FST algorithm Lucene uses, too.

For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation: essentially, Skip-Lists do not so much make concurrent updates of the index "more amenable" (because you can decide not to re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List); rather, Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees. (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and Skip-Lists, if either are "done right.")

Edit 12/2014: Updated to an archived version due to the original being deleted; a more recent alternative also exists. There's an even more recent version, but it seems to have less information in it than the older one.
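As a rough illustration of how skip pointers let a posting list be traversed "from one hit to another", the following Python sketch intersects two sorted posting lists (an AND query). It is a deliberate simplification and not Lucene's implementation: it uses a single level of evenly spaced skips rather than a multi-level Skip-List, and the names `SkipPostingList`, `advance` and `intersect` are hypothetical.

```python
import math


class SkipPostingList:
    """Sorted document ids with single-level, evenly spaced skip pointers."""

    def __init__(self, doc_ids):
        self.ids = sorted(doc_ids)
        # Assumed spacing: roughly sqrt(n) entries per skip.
        self.step = max(1, int(math.sqrt(len(self.ids))))

    def advance(self, pos, target):
        """Return the first position >= pos whose doc id is >= target."""
        ids, step = self.ids, self.step
        # Jump ahead while the entry one skip away is still below the target.
        while pos + step < len(ids) and ids[pos + step] < target:
            pos += step
        # Finish with a short linear scan inside the current block.
        while pos < len(ids) and ids[pos] < target:
            pos += 1
        return pos


def intersect(a, b):
    """Doc ids present in both posting lists, found by leapfrogging."""
    i = j = 0
    out = []
    while i < len(a.ids) and j < len(b.ids):
        x, y = a.ids[i], b.ids[j]
        if x == y:
            out.append(x)
            i += 1
            j += 1
        elif x < y:
            i = a.advance(i, y)  # skip a forward to the first id >= y
        else:
            j = b.advance(j, x)  # skip b forward to the first id >= x
    return out
```

With more skip levels, each level skipping further than the one below, the same `advance` step can be done in O(log n) instead of the roughly O(sqrt n) of this single-level version.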
