ELK Enhanced Search Operations | Part6

In case you are landing here directly, it’s recommended to read through this documentation first.

The following topics are covered in this blog :-

  • Boolean query example with ES.
  • Pagination with ES.
  • Sorting of results on fields.
  • Applying Filters on fields.
  • Query & Filter Examples
  • Fuzzy Search.
  • Prefix search.
  • Wildcard search.
  • Auto-suggestion.

Question:- Fetch those records from ES whose movie name is “star wars” and which were released after 1980.
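A minimal sketch of how such a boolean query could be built. The field names title and year, and the choice of match_phrase for the movie name, are assumptions made for illustration; the snippet only builds the request and does not contact a cluster.

```python
import json

def star_wars_after_1980(index="movies"):
    """Return (path, body) for a bool query: phrase match on title plus a year range."""
    body = {
        "query": {
            "bool": {
                # must: the title has to contain the phrase "star wars"
                "must": {"match_phrase": {"title": "star wars"}},
                # filter: keep only documents whose year is strictly greater than 1980
                "filter": {"range": {"year": {"gt": 1980}}},
            }
        }
    }
    return f"/{index}/_search", body

path, body = star_wars_after_1980()
print("GET", path)
print(json.dumps(body, indent=2))
```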

And the response thus obtained is :-

*** ***** ***** ELASTIC-SEARCH PAGINATION ***** ***********

Question:- What are paginated results ?

Question:- Fetch only the first N records from ES that have “Sci-Fi” in the genre attribute ?

Answer:- Here, we are supplying the 2 pagination parameters, from and size :-
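As a rough sketch, the request body for the first N hits could be assembled like this. The genre field name comes from the question above; the helper name first_n_by_genre is purely illustrative.

```python
import json

def first_n_by_genre(genre, n):
    """Build a search body returning at most n hits for the given genre."""
    return {
        "from": 0,  # offset of the first hit to return (0 = start of the result set)
        "size": n,  # maximum number of hits to return
        "query": {"match": {"genre": genre}},
    }

first_page = first_n_by_genre("Sci-Fi", 2)
print(json.dumps(first_page))
```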

And the results thus obtained are :-

Note that, although there are 4 matching results in total, we fetch only the first 2, i.e. we have asked for the results in a paginated manner.

Question:- For an Index holding a huge amount of data, can ES give us back results 10,373 to 10,383 ? Is that an efficient query ?

Answer:- This query is not cheap. ES still has to retrieve, collect and sort everything to figure out which results are numbers 10,373 to 10,383, which means working out the 10,372 results before them as well. So you can really start to kill performance when you paginate deep into your results.

Question:- What is the best practice for retrieving results from ES ?

Answer:-

  • You should enforce an upper bound on how many results you’ll return to your users, otherwise someone can abuse deep pagination and bring your system to its knees.
  • Even websites like Google have upper bounds on how many results they return, for this reason.
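One way to enforce such a bound on the application side is sketched below. The cap of 1000 and the helper name page_window are hypothetical choices; ES itself also limits from + size through the index.max_result_window setting, which defaults to 10,000.

```python
MAX_RESULTS = 1000  # hypothetical application-level cap on how deep users may page

def page_window(page, page_size, max_results=MAX_RESULTS):
    """Translate a 1-based page number into from/size, refusing overly deep pages."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    if start + page_size > max_results:
        raise ValueError("requested page is beyond the allowed result window")
    return {"from": start, "size": page_size}

print(page_window(1, 2))
print(page_window(2, 2))
```

With a page size of 2, page 2 maps to from=2, size=2, while a request reaching result 10,383 is rejected before it ever hits the cluster.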

Question:- Demonstrate how to fetch Page #2 of the results from ES for documents that have “Sci-Fi” in the genre attribute ?

Answer:- Here, we are again supplying the 2 pagination parameters, from and size :-

And the results thus obtained are :-

Note that, although there are 4 matching results in total, we fetch only the 2 results on the second page, i.e. we have asked for the results in a paginated manner.

*** ***** ***** ELASTIC-SEARCH SORTING ***** ***********

Question:- How does sorting work with ES ?

Answer:- Normally, it’s pretty simple if you’re dealing with a field that is numeric or otherwise straightforward to sort. All you have to do is append sort=<field> to the end of your search URI. Example :- If I want to sort the documents in the movies index by release date, here is how we would do it :-
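A small sketch of building such a URI. The host http://localhost:9200 and the field name year (standing in for the release date) are assumptions, not values confirmed by this post.

```python
from urllib.parse import urlencode

def sorted_search_url(index, field, host="http://localhost:9200"):
    """Build a search URL that sorts hits by the given field via the URI."""
    return f"{host}/{index}/_search?{urlencode({'sort': field})}"

url = sorted_search_url("movies", "year")
print(url)
```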

You’ll get back your results sorted by the release date, just as planned.

Question:- Can documents be sorted on the basis of a field whose type is text ?

Answer:- No. By default, we can’t sort on analysed fields, i.e. fields of type text. Attempting it would cause an “IllegalArgumentException”.

Question:- Why is sorting challenging with text fields ?

Answer:- Sorting becomes tricky when you’re dealing with analyzed text fields. Following points for reference :-

  • A text field, like the “title” in our movie dataset, is analysed for full-text search so that you can do partial matches and fuzzy queries.
  • You can’t use it for sorting documents, because the inverted index contains just the individual terms of that title.
  • We can do partial matching, but the entire string as a whole is not stored in the index, so we can’t sort by the movie title itself.
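The points above can be illustrated with a toy inverted index. Lowercasing and splitting on whitespace is only a crude stand-in for a real ES analyzer, and the two titles are sample data.

```python
from collections import defaultdict

def build_inverted_index(titles):
    """Map each term to the set of document ids whose title contains it."""
    index = defaultdict(set)
    for doc_id, title in titles.items():
        # crude stand-in for analysis: lowercase, then split on whitespace
        for term in title.lower().split():
            index[term].add(doc_id)
    return dict(index)

titles = {1: "Star Wars", 2: "Star Trek Beyond"}
inverted = build_inverted_index(titles)

print(sorted(inverted))           # only individual terms are stored
print("star wars" in inverted)    # the full title is not a key in the index
```

Because only the individual terms survive, there is no single value per document left to sort by.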

Question:- Is there some solution to this problem, i.e. say we want to sort on a text field ?

Answer:- The work-around is to set up a sub-field that is not analyzed. So if you know that you need to do full-text search on a field, but you also want to sort on the entirety of that field, you can keep two copies of it around. Example :-

  • Here, the title field itself remains a text type. That means that it is analysed for full-text search.
  • Along with that, we’re also creating a sub-field within it called raw, mapped as a keyword type, which, as you may recall, is not analyzed; it just stores a straight copy of the title in the raw field.

Question:- How to handle Production-Indexes that don’t have these raw fields ?

Answer:- Unfortunately, you can’t change the mapping of an existing field on an existing index. In real-world production environments, you’d create a new index with the desired mapping, transfer all data from the older index to the newer one, switch over to the new Index and then delete the old one.
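The post does not say which mechanism is used for the transfer. One option is the _reindex API, and the sketch below only lists the requests such a migration might involve. The index names are hypothetical, and a real switch-over would typically also update an alias or the application configuration before the old index is deleted.

```python
def reindex_plan(old_index, new_index, mappings):
    """List the (method, path, body) requests for a create / copy / delete migration."""
    return [
        # 1. create the new index with the desired mappings
        ("PUT", f"/{new_index}", {"mappings": mappings}),
        # 2. copy every document from the old index into the new one
        ("POST", "/_reindex", {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # 3. once traffic has switched over, drop the old index
        ("DELETE", f"/{old_index}", None),
    ]

plan = reindex_plan("movies", "movies_v2", {"properties": {"title": {"type": "text"}}})
for method, path, _ in plan:
    print(method, path)
```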

Question:- Can you demonstrate the aforesaid solution in practice ?

Part1:- Let’s first delete our existing Index. Note that, in a production environment, we would always create the new Index first and then switch over to it.

Part2:- Let’s now define the mappings for the field “title”. Here, we are also defining the additional field “raw” with type as “keyword”.
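A sketch of what such a mapping body could look like, written as a Python dict. It assumes the typeless mapping format of recent ES versions; older versions nest the properties under a mapping type name.

```python
import json

title_mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",  # analysed, used for full-text search
                "fields": {
                    # un-analysed copy of the whole title, usable for sorting
                    "raw": {"type": "keyword"}
                },
            }
        }
    }
}

print(json.dumps(title_mapping, indent=2))
```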

Part3:- Let’s now do the bulk indexing using a file :-

Part4:- Let’s now inspect the mappings, including those guessed dynamically by ES :-

Part5:- Let’s now perform sorting-operation on the “title.raw” field.
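A minimal sketch of a request body that sorts on the keyword sub-field. The match_all query and the ascending order are illustrative choices.

```python
def sort_by_raw_title(order="asc"):
    """Build a search body that returns all documents sorted by the un-analysed title."""
    return {
        "query": {"match_all": {}},
        # sort on the keyword sub-field, not on the analysed text field
        "sort": [{"title.raw": {"order": order}}],
    }

sorted_body = sort_by_raw_title()
print(sorted_body)
```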

*** ***** ***** ELASTIC-SEARCH FILTERs ***** ***********

Question:- Let’s perform a more complex query with ES ?

Here, we are searching the movies index with a boolean query that combines the following three things (recall that a bool query combines other queries) :-

  • We have a must clause, which means that the document must match the genre “Sci-Fi” AND
  • It must not match the title term trek AND
  • It must also pass a range filter on the year, between 2010 and 2015.

So breaking this down, what we’re doing is looking for science fiction movies that do not have the term trek in the title, that were released in the year 2010 through 2015.
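The three clauses could be expressed roughly as follows. The field names genre, title and year follow the description above, and the matches helper is only a plain-Python restatement of the same logic, run on illustrative sample documents.

```python
def scifi_no_trek_2010_2015():
    """Build the bool query: must be Sci-Fi, must not mention trek, year in 2010..2015."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"genre": "Sci-Fi"}},
                "must_not": {"match": {"title": "trek"}},
                "filter": {"range": {"year": {"gte": 2010, "lte": 2015}}},
            }
        }
    }

def matches(doc):
    """Restate the three clauses for a single document held as a Python dict."""
    is_scifi = "sci-fi" in [g.lower() for g in doc["genre"]]
    has_trek = "trek" in doc["title"].lower().split()
    in_range = 2010 <= doc["year"] <= 2015
    return is_scifi and not has_trek and in_range

sample_docs = [
    {"title": "Interstellar", "genre": ["Sci-Fi"], "year": 2014},
    {"title": "Star Trek Beyond", "genre": ["Sci-Fi"], "year": 2016},
]
print([d["title"] for d in sample_docs if matches(d)])
```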

And the results thus obtained are :-

Question:- Let’s perform the query asked below with ES ?

And the results thus obtained are :-

*** ***** ***** ELASTIC-SEARCH Fuzzy Search ***** ***********

Question:- What’s Fuzziness in ElasticSearch ?

Answer:- Fuzzy search is about dealing with typos and misspellings. You may have noticed that most search engines can deal with a certain level of typos or misspellings on the part of the user. The good news is that ElasticSearch supports it too.

Question:- How does Fuzziness in ElasticSearch work ?

Answer:- The basic concept behind fuzzy matches is the Levenshtein edit distance, which is a fancy name for a pretty simple concept.

Question:- What does Levenshtein edit distance mean ?

Answer:- It allows us to quantify common misspellings and typos. There are three different classes of these: substitutions, insertions, and deletions. So let’s look at three examples here.

  • For substitution of characters → This catches cases where someone typed the wrong character by mistake. For example, if someone misspelled interstellar as intersteller, with an ‘e’ instead of an ‘a’, that would still match if we were willing to tolerate a Levenshtein edit distance of one, because one character was substituted for what it really should have been.
  • For insertion of characters → This catches an extra character that shouldn’t have been there. If I went from interstellar to insterstellar, i.e. put in an extra ‘s’, that would still match if I were willing to tolerate a Levenshtein edit distance of one, because one extra character was inserted.
  • For deletion of characters → Deletions work the same way. If I misspelled interstellar with one ‘l’ instead of two, that would match as well, because that too has a Levenshtein edit distance of one.

So basically every mistake, whether a substitution, an insertion or a deletion, counts as one unit toward the Levenshtein edit distance, and you can specify how much of a tolerance you’re willing to have.
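For reference, the plain Levenshtein distance can be computed with a short dynamic-programming routine like the one below. This is a textbook sketch, not ES’s internal implementation (by default ES fuzziness also counts a transposition of two adjacent characters as a single edit).

```python
def levenshtein(a, b):
    """Return the minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if they are equal)
            ))
        prev = curr
    return prev[-1]

for typo in ["intersteller", "insterstellar", "interstelar"]:
    print(typo, levenshtein("interstellar", typo))
```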

Question:- What’s the recommended value that we should set for the Levenshtein edit distance ?

Answer:- AUTO. Yes, the answer is that we should leave it to ES to decide how much fuzziness should be tolerated. Here are ES’s criteria for AUTO :-

  • If the input string has a length of up to 2, then no mistake is tolerated.
  • If the input string has a length of 3 to 5, then a mistake of up to 1 character is tolerated.
  • If the input string is longer than 5, then mistakes of up to 2 characters are tolerated.
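The AUTO rule above can be restated as a tiny function. The name auto_fuzziness is illustrative and the function mirrors only the default length thresholds.

```python
def auto_fuzziness(term):
    """Return the number of edits tolerated for a term under the default AUTO thresholds."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1  # 3 to 5 characters: one edit allowed
    return 2      # longer than 5 characters: two edits allowed

for term in ["to", "wars", "interstellar"]:
    print(term, auto_fuzziness(term))
```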

Question:- Let’s demonstrate the Fuzzy-Search in action ?

Answer:-

Part #1.) Let’s first perform a normal search on the “title” attribute :- Note that we have deliberately specified the wrong spelling while querying. The correct spelling is : “Interstellar”. The spelling that we specified in the search query is : “Intersteller”. As expected, we got ZERO results.

Results thus obtained are :-

Part #2.) Let’s now perform a fuzzy search on the “title” attribute :- Note that we have deliberately specified a fuzziness value of 1, which means we’re willing to tolerate a Levenshtein edit distance of one: up to one substitution, insertion or deletion, and we’ll still be able to match. Query looks like :-
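A rough sketch of such a fuzzy query body follows. Since fuzzy is a term-level query and the search value is not analysed, the value is written in lowercase here, on the assumption that the title field is indexed with an analyzer that lowercases terms.

```python
def fuzzy_title_query(term, fuzziness):
    """Build a fuzzy query on the title field with the given edit-distance tolerance."""
    return {
        "query": {
            "fuzzy": {
                "title": {"value": term, "fuzziness": fuzziness}
            }
        }
    }

fuzzy_body = fuzzy_title_query("intersteller", 1)
print(fuzzy_body)
```

The same helper covers the later parts by changing the arguments, for example fuzzy_title_query("intursteller", 2).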

Results thus obtained are :-

Part #3.) Let’s further perform a fuzzy search on the “title” attribute, but this time with a search term that contains 2 mistakes. We have two substitutions instead of one, so with a fuzziness of one this should not match, because the Levenshtein distance is actually two. The correct spelling is : “Interstellar”. The spelling that we specified in the search query is : “Intursteller”. As expected, we should get ZERO results. Query for the same looks like :-

Results thus obtained are :-

Part #4.) Let’s now set the value of fuzziness to 2. The correct spelling is : “Interstellar”. The spelling that we specified in the search query is : “Intursteller”. As expected, we should now get some results. Query for the same looks like :-

Results thus obtained are :-

Part #5.) Let’s try searching for wars with a ‘z’ (i.e. “warz”), with a fuzziness of one. The substitution of ‘z’ for ‘s’ should still be okay, and sure enough, it came back with the Star Wars results, or really any movie whose title contains a term within one edit (substitution, insertion or deletion) of “warz”. Query for the same looks like :-

Results thus obtained are :-

Part #6.) Let’s try searching for wars with a ‘z’ (i.e. we supply the search term “warz”) with a fuzziness of two, so that any movie in Elastic whose title contains a term within two edits of the search term is reported. Query for the same looks like :-

Results thus obtained are :-

*** ***** ***** ELASTIC-SEARCH Prefix Match ***** ***********

Question:- What’s Prefix-Match in ElasticSearch ?

Answer:- A prefix query is exactly what it sounds like: you provide a prefix, and any document with a term that starts with that prefix comes back as a search result. ElasticSearch is surprisingly efficient at doing this. It comes down to how the inverted index is stored and sorted, so a prefix query can return results pretty quickly.
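As an analogy for why sorted terms help, the sketch below finds all terms with a given prefix in a sorted list using binary search. It is only an illustration; ES’s actual term dictionary is considerably more sophisticated, and the years are sample values.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    """Return all terms starting with prefix, using the sort order to skip the rest."""
    start = bisect_left(sorted_terms, prefix)  # first term that is >= prefix
    result = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        result.append(term)
    return result

years = sorted(["2014", "1959", "2010", "1977", "2015"])
print(terms_with_prefix(years, "201"))
```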

Note that the field must be indexed as a string type such as “text”, rather than as a number, in order for us to perform a prefix query on it in ElasticSearch.

Part #1.) Let’s first see, the mappings of the Index :-

Part #2.) Let’s perform the prefix based search onto our Index :-
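A minimal sketch of such a prefix query body, assuming the year field is mapped as a string and that the prefix being searched is 201.

```python
def prefix_query(field, prefix):
    """Build a prefix query matching documents whose field has a term starting with prefix."""
    return {"query": {"prefix": {field: prefix}}}

prefix_body = prefix_query("year", "201")
print(prefix_body)
```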

And the results for the afore-fired query are as follows :-

Note that in all of the results obtained above, the year has the prefix 201. Movies with a year such as 1959 are not returned.

Question:- What’s a Wildcard based query in ElasticSearch ?

Answer:- We can use the wildcard syntax, with the asterisk (*) character representing the wildcard. There’s also a regexp query, where you can specify a full regular expression for a given field.

Query.) Let’s perform the wildcard based search onto our Index :-
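A sketch of what the wildcard query body might look like. The pattern 19* is inferred from the note about the results and, like the year field name, is an assumption.

```python
def wildcard_query(field, pattern):
    """Build a wildcard query; * in the pattern stands for any sequence of characters."""
    return {"query": {"wildcard": {field: pattern}}}

wildcard_body = wildcard_query("year", "19*")
print(wildcard_body)
```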

And the results for the afore-fired query are as follows :-

Note that we received only ONE movie in the response, the one whose year matches the pattern 19*.

*** ***** ***** ELASTIC-SEARCH AutoSuggestion Match ***** ***********

Question:- What’s query-time search-as-you-type in ElasticSearch ?

Answer:- For ‘query-time search-as-you-type’, you don’t have to index your data in any particular way; it just uses the same prefix search capability that we talked about earlier.

  • In this example, let’s imagine that the user typed in the term Star Trek. You can use a specialised query called match_phrase_prefix. It is just like the prefix query that we looked at before, but it works at the phrase level. So, by typing in Star Trek, it will search for any titles containing the phrase Star Trek, with the last word treated as a prefix.
  • You can also specify a slop value with that query. If you want more flexibility in the ordering of the words in that phrase, a slop lets you get back results for people who searched for Trek star, or titles that don’t match the phrase exactly and have other words in between the terms.

Question:- Can you demonstrate an example for match phrase prefix based search ?

Example #1.) Say we want to search for all those movies whose title contains the term “star” and also contains some term that begins with “fo”.

Part #1.) Let’s first demonstrate the action of following simple search query :-

And the result obtained is :-

Clearly we obtained all the results that contain the word “star” in their titles, but we need a more specific result-set, which also contains some word that begins with “fo”.

Part #2.) Let’s now see in action below prefix search query :-

And the result obtained is :-

Clearly we obtained ZERO results, because the prefix query is not analysed: it looks for a single indexed term that starts with “star fo”, and no such term exists.

Part #3.) Let’s now see in action Match Phrase based search query :-

And the result obtained is :-

Clearly we obtained ZERO results, because the match_phrase query looks for the exact phrase “star fo” in the title field, with “fo” as a complete term.

Part #4.) Let’s now see in action Match Phrase Prefix based search query :-

And the result obtained is :-

Part #5.) Let’s finally see in action Match Phrase Prefix based search query along with Slop based search query :-

And the result obtained is :-

Clearly we obtained ONE result, because the match_phrase_prefix query with a slop value matches the term “star” together with any term that has the prefix “fo”, without requiring them to be adjacent. It gives us back any document that has the term star and another term with the prefix fo within 5 positions of each other.
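The match_phrase_prefix queries from Parts #4 and #5 might look roughly like the bodies built below. The slop of 5 is taken from the explanation above, and the helper name is illustrative.

```python
def phrase_prefix_query(text, slop=None):
    """Build a match_phrase_prefix query on title; the last word is treated as a prefix."""
    params = {"query": text}
    if slop is not None:
        # slop: how many positions the terms may be moved apart and still match
        params["slop"] = slop
    return {"query": {"match_phrase_prefix": {"title": params}}}

without_slop = phrase_prefix_query("star fo")
with_slop = phrase_prefix_query("star fo", slop=5)
print(without_slop)
print(with_slop)
```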

Note that the afore-mentioned query is a little expensive from a resource-utilisation perspective; we will look at improving it in the next blog.

That’s all in this section. If you liked reading this blog, kindly press the clap button to show your appreciation. We will see you in the next part of the series.

Software Engineer for Big Data distributed systems