How does search on AMO work?
This document describes how search is currently implemented in addons-server. We will use it to plan future improvements, so please note that we are aware the process described below is not perfect and has bugs here and there.
Please see https://github.com/orgs/mozilla/projects/17#card-10287357 for more planning.
Our search contains Add-ons (addons index) and Add-on compatibility report documents (
In addition to that we store the following data:
- Add-on Versions (Indexer / Serializer)
- Files for each Add-on Version
- Compatibility information for each Add-on Version
As well as
- Previews (image links)
- Translations (Translations mapping generation)
And various other add-on related properties. See the Add-on Indexer / Serializer for more details.
The legacy search and the newer API-based search use similar filtering and scoring code. For legacy reasons they’re not identical. We should focus on our API-based search, though, since the legacy search will be removed once support for Thunderbird and SeaMonkey is moved to a new platform.
The legacy search uses ElasticSearch to query the data and then requests the actual model objects from the database. The newer API-based search only hits ElasticSearch and uses the data stored directly in ES, which is much more efficient.
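The difference between the two paths can be pictured with a toy model. This is only a sketch: the dictionaries below stand in for the ElasticSearch index and the database, and every name in it (ES_INDEX, DATABASE, legacy_search, api_search) is hypothetical rather than taken from addons-server.

```python
# Toy stand-ins for the search index and the database (hypothetical data).
ES_INDEX = {
    1: {"id": 1, "name": "Tab Manager"},
    2: {"id": 2, "name": "Dark Theme"},
    3: {"id": 3, "name": "Tab Counter"},
}
DATABASE = {pk: dict(doc) for pk, doc in ES_INDEX.items()}
db_lookups = []  # records every primary key fetched from the database


def es_search(term):
    """Return the documents stored in the index whose name contains term."""
    term = term.lower()
    return [doc for doc in ES_INDEX.values() if term in doc["name"].lower()]


def db_get(pk):
    """Fetch one 'model object' and record the lookup."""
    db_lookups.append(pk)
    return DATABASE[pk]


def legacy_search(term):
    """Query the index for matches, then load each object from the database."""
    return [db_get(doc["id"]) for doc in es_search(term)]


def api_search(term):
    """Serve results straight from the documents stored in the index."""
    return es_search(term)


api_results = api_search("tab")
lookups_after_api = len(db_lookups)
legacy_results = legacy_search("tab")
lookups_after_legacy = len(db_lookups)
print(lookups_after_api, lookups_after_legacy)
```

In this model the API-style path never touches the database, while the legacy-style path adds one lookup per hit on top of the index query.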
Flow of a search query through AMO
Let’s assume we search on addons-frontend (not legacy): the search query hits the API and is handled by AddonSearchView, which queries ElasticSearch directly and doesn’t involve the database at all.
There are a few filters described in the /api/v4/addons/search/ docs, but most of them are not very relevant for raw search queries. Examples are filters by guid, platform, category, add-on type or appversion (application version compatibility).
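As a rough illustration, a filtered search request could be assembled as follows. The host and the query parameter names used here (q, type, appversion) are assumptions; only the /api/v4/addons/search/ path comes from the text, and the API docs remain the authority on which filters exist.

```python
from urllib.parse import urlencode

# Assumed host; only the path is taken from the text above.
BASE_URL = "https://addons.mozilla.org/api/v4/addons/search/"


def build_search_url(query, **filters):
    """Build a search URL from a raw query plus optional filters.

    Filter names are passed through unchanged, so they are only as valid
    as what the API docs list (assumed here, not verified).
    """
    params = {"q": query}
    # Drop filters that were not set so the URL stays minimal.
    params.update({k: v for k, v in filters.items() if v is not None})
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("tab manager", type="extension", appversion=None)
print(url)
```

Unset filters are omitted instead of being sent as empty values, which keeps the raw search query as the main signal.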
Much more relevant for raw add-on searches (and this is primarily used when you use the search on the frontend) is
It composes various rules to define a more or less usable ranking:
These are the ones using the strongest boosts, so they are only applied to a specific set of fields like the name, the slug and authors.
Applied rules (merged via
boost=100) - our attempt to implement exact matches
- Prefer phrase matches that allow swapped terms (
- If a query is < 20 characters long, try to do a fuzzy match on the search query (
- Then text matches, using the standard text analyzer (
- Then text matches, using a language specific analyzer (
- Then look for the query as a prefix (
- If we have a matching analyzer, add a query to
Rules 4, 5 and 6 are added for
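The rules above could be expressed as ElasticSearch query DSL roughly like this. It is a sketch under several assumptions: the clauses are combined in a bool query's should list, the field is called name, and every boost except the exact-match value of 100 is a placeholder, not the value addons-server uses. The final rule about a matching analyzer is left out.

```python
# Only the boost of 100 for the exact-match rule comes from the text;
# every other number here is a hypothetical placeholder.
EXACT_BOOST = 100
PLACEHOLDER_BOOSTS = {"phrase": 5, "fuzzy": 4, "text": 3, "language": 2, "prefix": 1}
FUZZY_MAX_LENGTH = 20  # fuzzy matching only applies to queries shorter than this


def build_primary_query(query, field="name", language_analyzer=None):
    """Combine the strong-boost rules for one short field into a bool query."""
    b = PLACEHOLDER_BOOSTS
    should = [
        # Exact-match attempt on the raw term.
        {"term": {field: {"value": query, "boost": EXACT_BOOST}}},
        # Phrase match; a slop of 2 is what Elasticsearch documents as
        # enough to accept two transposed terms (value assumed here).
        {"match_phrase": {field: {"query": query, "slop": 2, "boost": b["phrase"]}}},
    ]
    if len(query) < FUZZY_MAX_LENGTH:
        # Fuzzy match, only for short queries.
        should.append(
            {"match": {field: {"query": query, "fuzziness": "AUTO", "boost": b["fuzzy"]}}}
        )
    # Text match with the standard analyzer.
    should.append(
        {"match": {field: {"query": query, "analyzer": "standard", "boost": b["text"]}}}
    )
    if language_analyzer:
        # Text match with a language-specific analyzer, e.g. "english".
        should.append(
            {"match": {field: {"query": query, "analyzer": language_analyzer,
                               "boost": b["language"]}}}
        )
    # Prefix match on the whole query.
    should.append({"prefix": {field: {"value": query, "boost": b["prefix"]}}})
    return {"bool": {"should": should}}


short_query = build_primary_query("tab manager", language_analyzer="english")
long_query = build_primary_query("a very long add-on search query")
```

Because the clauses sit under should, a document does not need to satisfy all of them; each matching clause adds its boosted score to the ranking.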
These are the ones using the weakest boosts, they are applied to fields containing more text like description, summary and tags.
Applied rules (merged via
- Look for phrase matches inside the summary (
- Look for phrase matches inside the summary using language specific analyzer (
- Look for phrase matches inside the description (
- Look for phrase matches inside the description using language specific analyzer (
- Look for matches inside tags (
- Append a separate match query for every word to boost tag matches (
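A minimal sketch of how these weaker rules might be built as query DSL clauses is shown below. The field names (summary, description, tags) are assumed to match the index, and the single boost value is a placeholder chosen for illustration, not the value used by addons-server.

```python
WEAK_BOOST = 0.5  # hypothetical placeholder for the weaker boosts


def build_secondary_clauses(query, language_analyzer=None):
    """Return clauses for the longer text fields and for tags."""
    clauses = []
    for field in ("summary", "description"):
        # Phrase match with the field's default analyzer.
        clauses.append({"match_phrase": {field: {"query": query, "boost": WEAK_BOOST}}})
        if language_analyzer:
            # Phrase match with a language-specific analyzer, e.g. "english".
            clauses.append(
                {"match_phrase": {field: {"query": query, "analyzer": language_analyzer,
                                          "boost": WEAK_BOOST}}}
            )
    # Match the whole query against tags.
    clauses.append({"match": {"tags": {"query": query, "boost": WEAK_BOOST}}})
    # One extra match clause per word, so each matching tag adds to the score.
    for word in query.split():
        clauses.append({"match": {"tags": {"query": word, "boost": WEAK_BOOST}}})
    return clauses


plain = build_secondary_clauses("dark mode theme")
localized = build_secondary_clauses("dark mode theme", language_analyzer="english")
```

A three-word query therefore produces four tag clauses: one for the whole query and one per word.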