Product Relevancy Score

Indexing Data & Scoring

Hawksearch uses the Lucene query framework to retrieve search results. Below is a brief description of how queries are generated and which factors feed into the Lucene score.

Hawksearch ingests data feed files and runs an import process to translate the data into a format that the indexing service can consume. To retrieve results, Lucene queries are run against the generated index. During query generation, Hawksearch analyzes which fields are available in the index and builds the query based on the settings described below.
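
As a rough illustration of that query generation step, the sketch below builds a query string in Lucene's classic query parser syntax (`field:term`, `field:term^boost`, `field:term*`, `field:"a phrase"`) from per-field settings. The `FieldSetting` class, its attribute names, and `build_query` are hypothetical and exist only for this example; they are not Hawksearch's actual API, and the real generated query may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class FieldSetting:
    """Hypothetical per-field settings (illustrative names, not Hawksearch's API)."""
    name: str
    queryable: bool = True   # stands in for the "Is Queryable" flag
    boost: float = 1.0       # field boost; 1.0 means no boost
    wildcard: bool = False   # whether prefix (partial) matching is enabled


def build_query(keyword: str, fields: list[FieldSetting]) -> str:
    """Build a Lucene classic-syntax query string from the field settings.

    Assumption: special characters in the keyword are not escaped here;
    a real implementation would need to handle them.
    """
    terms = keyword.lower().split()
    phrase = " ".join(terms)
    clauses = []
    for field in fields:
        if not field.queryable:
            continue  # non-queryable fields get no clause of their own
        boost = f"^{field.boost:g}" if field.boost != 1.0 else ""
        for term in terms:
            clauses.append(f"{field.name}:{term}{boost}")
            if field.wildcard:
                clauses.append(f"{field.name}:{term}*")  # prefix clause
        if len(terms) > 1:
            # Extra clause that rewards the exact phrase
            clauses.append(f'{field.name}:"{phrase}"{boost}')
    return " ".join(clauses)


fields = [
    FieldSetting("content"),  # the content field is always queryable
    FieldSetting("title", boost=20, wildcard=True),
    FieldSetting("description", queryable=False),
]
print(build_query("blue furniture", fields))
# content:blue content:furniture content:"blue furniture" title:blue^20 title:blue* title:furniture^20 title:furniture* title:"blue furniture"^20
```

In this sketch the "description" field contributes no clause of its own because it is not queryable, while "title" contributes boosted term, prefix, and phrase clauses.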

The following factors are used in the query generation process and affect score calculation and relevancy:

  1. Tokenization – Fields are stored in the Lucene index either “AS-IS” or stemmed. Tokenization with stemming allows variations of a word to be searchable. For example, a search for “read” also considers matching words such as “reading” and “reader”, and the matching records are returned in the result set as well.
  2. Query this Field – The “Is Queryable” flag in the fields section controls whether a field is used as a separate clause in the Lucene query in addition to being part of the content field. The content field is always queryable and is stemmed and stored in the index.
  3. Wildcarding – Wildcard or prefix queries can be enabled on certain fields, such as “title” or “SKU”, to allow partial matches. If any fields have these settings enabled, the appropriate term clauses are added to the Lucene query.
  4. Field Boost – If boosts are added to specific fields, they are accounted for when generating the query and calculating the score. For example, if “title” has a boost of 20 and “description” has none, a match against the “title” field returns a higher score than a match against the “description” field.
  5. Phrasing – If the keyword contains a phrase, both matches against the exact phrase and appearances of the individual words in the phrase are taken into account when determining the score of a record. For example, when a user searches for “blue furniture,” the score differs depending on whether the words “blue furniture” appear together in that order or appear separately on the record, such as “blue” in the color field and “furniture” in the title. Exact phrase matches receive a higher score.
  6. Term Frequency – The more times a term appears within a record’s searchable fields, the higher the score.
  7. Rare Appearance Rate – Rarer terms that match within a record’s searchable fields have a greater influence on the score.
  8. Number of Query Terms – A record that matches more of the query terms within its searchable fields scores higher than records that match fewer query terms.
  9. Numerical Sorting – Lucene sorts lexicographically (colloquially, alphabetically), so values are compared character by character as strings rather than by numeric magnitude; for example, “10” sorts before “9”. Under this ordering, negative and NULL values are compared as equal.
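
To make the interaction of field boost, term frequency, rare appearance rate, and number of matched query terms more concrete, here is a minimal toy scorer. It is loosely modeled on Lucene's classic TF-IDF similarity (square-root term frequency, logarithmic inverse document frequency, and a coordination factor), but it is a simplification written for this example and not the formula Hawksearch or Lucene actually executes; recent Lucene versions default to BM25 instead. The function name `toy_score`, the sample records, and the boost values are all assumptions, and phrase matching is left out.

```python
import math
from collections import Counter


def toy_score(query_terms, record, all_records, boosts=None):
    """Toy relevancy score for one record ({field: text}) against a query."""
    if not query_terms:
        return 0.0
    boosts = boosts or {}
    total = 0.0
    matched = 0
    for term in query_terms:
        # Rare appearance rate: terms found in fewer records get a larger idf
        doc_freq = sum(
            1 for r in all_records
            if any(term in text.lower().split() for text in r.values())
        )
        idf = 1.0 + math.log(len(all_records) / (doc_freq + 1))
        term_hit = False
        for field, text in record.items():
            # Term frequency: how often the term occurs in this field
            tf = Counter(text.lower().split())[term]
            if tf:
                term_hit = True
                # Field boost multiplies the contribution of a match
                total += math.sqrt(tf) * idf * boosts.get(field, 1.0)
        matched += term_hit
    # Number of query terms: fraction of query terms the record matched
    coord = matched / len(query_terms)
    return coord * total


records = [
    {"title": "blue furniture", "description": "a blue sofa"},
    {"title": "red chair", "description": "blue cushions included"},
    {"title": "green table", "description": "solid oak furniture"},
]
query = ["blue", "furniture"]
for rec in records:
    print(rec["title"], toy_score(query, rec, records, boosts={"title": 20}))
```

With these sample records, the first record ranks well above the other two: it matches both query terms, and its matches fall in the boosted "title" field, whereas the others match a single term in the unboosted "description" field.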