[[phrase-matching]]
=== Phrase Matching

In the same way that the `match` query is the go-to query for standard full-text search, the `match_phrase` query((("proximity matching", "phrase matching")))((("phrase matching")))((("match_phrase query"))) is the one you should reach for when you want to find words that are near each other:

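[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
--------------------------------------------------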

// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

Like the `match` query, the `match_phrase` query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain _all_ of the search terms, in the same _positions_ relative to each other. A query for the phrase `quick fox` would not match any of our documents, because no document contains the word `quick` immediately followed by `fox`.
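The difference between "contains all the terms" and "contains the phrase" can be sketched in a few lines of JavaScript. This is only an illustration: `analyze`, `containsAllTerms`, and `matchesPhrase` are hypothetical helper names, the sample title is made up, and the tokenizer is a crude stand-in for a real analyzer.

```js
// Simplified stand-in for an analyzer: lowercase, then split on
// anything that is not a letter or digit. (Assumption: a real analyzer
// does considerably more than this.)
function analyze(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// True if every query term occurs somewhere in the document, in any order.
function containsAllTerms(docText, queryText) {
  const docTerms = new Set(analyze(docText));
  return analyze(queryText).every((term) => docTerms.has(term));
}

// True if the query terms occur consecutively and in order in the document.
function matchesPhrase(docText, queryText) {
  const doc = analyze(docText);
  const query = analyze(queryText);
  if (query.length === 0) return false;
  for (let start = 0; start + query.length <= doc.length; start++) {
    if (query.every((term, i) => doc[start + i] === term)) return true;
  }
  return false;
}

const title = "The quick brown fox"; // hypothetical sample document

console.log(containsAllTerms(title, "quick fox"));    // true
console.log(matchesPhrase(title, "quick fox"));       // false
console.log(matchesPhrase(title, "quick brown fox")); // true
```

Both terms of `quick fox` are present in the sample title, but `brown` sits between them, so the phrase check fails.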

[TIP]
==================================================

The `match_phrase` query can also be written as a `match` query with type `phrase`:

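[source,js]
--------------------------------------------------
"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
--------------------------------------------------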

// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

==================================================

==== Term Positions

When a string is analyzed, the analyzer returns not((("phrase matching", "term positions")))((("match_phrase query", "position of terms")))((("position-aware matching"))) only a list of terms, but also the _position_, or order, of each term in the original string:

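[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
--------------------------------------------------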

// SENSE: 120_Proximity_Matching/05_Term_positions.json

This returns the following:

[role="pagebreak-before"]
[source,js]
--------------------------------------------------
{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 <1>
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 <1>
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 <1>
      }
   ]
}
--------------------------------------------------

<1> The position of each term in the original string.
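To make the shape of that response concrete, here is a minimal JavaScript sketch of a tokenizer that records offsets and positions. It is not how the standard analyzer is implemented: `analyzeWithPositions` is a hypothetical function, it handles only ASCII letters and digits, and it omits the `type` field.

```js
// Hypothetical, simplified analyzer: finds runs of ASCII letters/digits,
// lowercases them, and records where each one sits in the input.
function analyzeWithPositions(text) {
  const tokens = [];
  const wordPattern = /[A-Za-z0-9]+/g;
  let position = 1; // 1-based, mirroring the response shown above
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    tokens.push({
      token: match[0].toLowerCase(),
      start_offset: match.index,                 // index of first character
      end_offset: match.index + match[0].length, // index just past the last
      position: position++,                      // order within the string
    });
  }
  return tokens;
}

console.log(analyzeWithPositions("Quick brown fox"));
```

For the input `Quick brown fox`, the sketch yields the same tokens, offsets, and positions as the response above.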

Positions can be stored in the inverted index, and position-aware queries like the `match_phrase` query can use them to match only documents that contain all the words in exactly the order specified, with no words in between.
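As a rough mental model, a positional inverted index maps each term to the documents that contain it and, within each document, to the positions where it occurs. The JavaScript sketch below assumes that layout for illustration only. `buildPositionalIndex` and the two sample documents are hypothetical, and the real on-disk structures are far more compact.

```js
// Build term -> Map(docId -> [positions]) from a { docId: text } object.
// Positions are 1-based, in order of appearance within each document.
function buildPositionalIndex(docs) {
  const index = new Map();
  for (const [docId, text] of Object.entries(docs)) {
    // Crude tokenizer standing in for a real analyzer (assumption).
    const terms = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
    terms.forEach((term, i) => {
      if (!index.has(term)) index.set(term, new Map());
      const postings = index.get(term);
      if (!postings.has(docId)) postings.set(docId, []);
      postings.get(docId).push(i + 1);
    });
  }
  return index;
}

const positionalIndex = buildPositionalIndex({
  1: "The quick brown fox",
  2: "Brown fox, quick as ever",
});

// Where does "quick" occur in each document?
console.log(positionalIndex.get("quick"));
```

With positions stored per document, a query can ask not just whether `quick` and `fox` both occur, but how far apart they are.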

==== What Is a Phrase

For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase ``quick brown fox,'' the following must be true:

* `quick`, `brown`, and `fox` must all appear in the field.
* The position of `brown` must be `1` greater than the position of `quick`.
* The position of `fox` must be `2` greater than the position of `quick`.

If any of these conditions is not met, the document is not considered a match.
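These conditions generalize to any phrase: the term at index `i` in the phrase must appear at the position of the first term plus `i`. A minimal JavaScript sketch of that check follows, assuming each document field is represented as a `Map` from term to a list of positions. `phraseMatches` and both sample fields are hypothetical.

```js
// positions: Map(term -> [positions]) for one field of one document.
// phraseTerms: the analyzed phrase, in order.
function phraseMatches(positions, phraseTerms) {
  if (phraseTerms.length === 0) return false;
  const [first, ...rest] = phraseTerms;
  const starts = positions.get(first) ?? [];
  // For each occurrence of the first term, every later term must sit
  // at exactly start + 1, start + 2, and so on.
  return starts.some((start) =>
    rest.every((term, i) =>
      (positions.get(term) ?? []).includes(start + i + 1)
    )
  );
}

// "The quick brown fox"
const adjacent = new Map([
  ["the", [1]], ["quick", [2]], ["brown", [3]], ["fox", [4]],
]);
// All three terms present, but not at consecutive positions.
const scattered = new Map([
  ["quick", [1]], ["fox", [2, 5]], ["brown", [4]],
]);

const phrase = ["quick", "brown", "fox"];
console.log(phraseMatches(adjacent, phrase));  // true
console.log(phraseMatches(scattered, phrase)); // false
```

The second field satisfies the first condition (all terms appear) but fails the positional ones, so it is rejected.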

[TIP]
==================================================

Internally, the `match_phrase` query uses the low-level `span` query family to do position-aware matching. ((("match_phrase query", "use of span queries for position-aware matching")))((("span queries")))Span queries are term-level queries, so they have no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use `span` queries directly, as the `match_phrase` query is usually good enough. However, certain specialized fields, like patent searches, use these low-level queries to perform very specific, carefully constructed positional searches.

==================================================