ELK Concurrency, Analysers and Data-Modelling | Part3

If you have landed here directly, it’s advisable to check out this page first.

This post covers the following topics:

  • Optimistic Concurrency control.
  • Retry On Conflicts.
  • Simple Query based search.
  • Text & English Analyser.
  • Data-Modelling with ElasticSearch.

Question: Suppose we use ElasticSearch as the data store for the page-view count of a particular page, and two concurrent users both try to increment that count. The current count is 10, and both users try to update the counter to 11. How do we make sure that the stored value ends up as 12 and not 11?

Answer: The solution to this problem is optimistic concurrency control.

  • Every document carries a sequence number and a primary term, which identifies the primary shard that owns that sequence. Taken together, the sequence number and the primary term give a unique chronological record of the document.
  • Here, two different clients retrieve the current view count for a given page document from ElasticSearch, and both get the number 10 back. Along with the value, ElasticSearch also returns the sequence number of that document.
  • So each client now knows that the view count of 10 is tied explicitly to a given sequence number of that document, and that sequence number is in turn tied to a primary term. Let’s say, for the sake of argument, that the sequence number is 9.
  • When a client writes a new value for the page count, it can state that the write is based on what it saw at sequence number 9 of primary term 1.
  • In other words, an update can name the exact sequence number and primary term it expects to replace. If two clients try to update the same sequence number, only one of them succeeds. Let’s say the first one successfully writes the count of 11 against sequence number 9.
  • The other client gets a conflict and has to start over. It re-reads the current view count for the page, gets back sequence number 10 of the document (which now contains 11), increments that to 12 and writes again, hopefully successfully this time.
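The read, conditional write, conflict and re-read cycle above can be sketched without a running cluster. What follows is only an illustrative in-memory model: `TinyStore`, `VersionConflict` and `increment_views` are hypothetical names invented for this sketch, not part of any ElasticSearch client, and the real server does far more than this.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Stands in for ElasticSearch's version_conflict_engine_exception."""


@dataclass
class StoredDoc:
    source: dict
    seq_no: int
    primary_term: int


class TinyStore:
    """Hypothetical in-memory stand-in for one index on one primary shard."""

    def __init__(self):
        self._docs = {}
        self._next_seq_no = 0
        self._primary_term = 1

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None:
            # A conditional write only succeeds against the exact state it was based on.
            if current is None or (current.seq_no, current.primary_term) != (if_seq_no, if_primary_term):
                raise VersionConflict(doc_id)
        self._docs[doc_id] = StoredDoc(dict(source), self._next_seq_no, self._primary_term)
        self._next_seq_no += 1

    def get(self, doc_id):
        doc = self._docs[doc_id]
        return dict(doc.source), doc.seq_no, doc.primary_term


def increment_views(store, doc_id, max_attempts=5):
    """Re-read and retry until the conditional write goes through."""
    for _ in range(max_attempts):
        source, seq_no, term = store.get(doc_id)
        source["views"] += 1
        try:
            store.index(doc_id, source, if_seq_no=seq_no, if_primary_term=term)
            return source["views"]
        except VersionConflict:
            continue
    raise VersionConflict(doc_id)


store = TinyStore()
store.index("page-1", {"views": 10})

# Both clients read the same state: views=10 at the same sequence number.
a_source, a_seq, a_term = store.get("page-1")
b_source, b_seq, b_term = store.get("page-1")

# Client A writes 11 first and succeeds.
store.index("page-1", {"views": a_source["views"] + 1}, if_seq_no=a_seq, if_primary_term=a_term)

# Client B's write is based on a stale sequence number, so it is rejected.
try:
    store.index("page-1", {"views": b_source["views"] + 1}, if_seq_no=b_seq, if_primary_term=b_term)
    b_conflicted = False
except VersionConflict:
    b_conflicted = True

# Client B starts over: re-read, increment, write.
final_views = increment_views(store, "page-1")
print(b_conflicted, final_views)
```

The point of the sketch is that the second writer never silently overwrites the first; it is forced to re-read, which is why the counter ends at 12 rather than 11.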

You don’t necessarily have to do all of this by hand. The update request accepts a parameter called retry_on_conflict, which retries automatically when such a conflict occurs.

Question: In what scenario should we use this retry feature?

Answer: When many web servers or clients talk to ElasticSearch at once and try to update the same document at the same time. Passing an explicit sequence number with your updates ensures that the clients do not overwrite each other, and retry_on_conflict saves you from handling the resulting conflicts yourself.

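Question: Which two parameters are we talking about?

Answer: The _seq_no and _primary_term values that ElasticSearch returns with the document. In our example:

  • _seq_no is 12.
  • _primary_term is 1.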

Question: How do we perform a full update of a document using the _seq_no and _primary_term?

Answer: Pass both values on the PUT request as the if_seq_no and if_primary_term query parameters. Note that after the document update, the _seq_no has changed to 14.
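As a rough sketch of what such a conditional full update looks like on the wire, the snippet below builds the request with Python’s standard library but does not send it. The helper name `conditional_put`, the document id `example-id` and the document body are placeholders chosen for illustration; only the `if_seq_no` and `if_primary_term` query parameters and the values 12 and 1 come from the discussion above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def conditional_put(base_url, index, doc_id, source, seq_no, primary_term):
    """Build (but do not send) a full-document update guarded by seq_no/primary_term."""
    query = urlencode({"if_seq_no": seq_no, "if_primary_term": primary_term})
    return Request(
        url=f"{base_url}/{index}/_doc/{doc_id}?{query}",
        data=json.dumps(source).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical document id and body; 12 and 1 are the values quoted above.
request = conditional_put(
    "http://localhost:9200", "movies", "example-id",
    {"title": "Example title", "year": 2014, "genre": ["Sci-Fi"]},
    seq_no=12, primary_term=1,
)
print(request.get_method(), request.full_url)
# To actually send it against a running cluster: urllib.request.urlopen(request)
```

If the stored document is no longer at that sequence number and primary term, the server rejects the write instead of applying it.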

Question: What happens if we try to update the same document with the same _seq_no and _primary_term again?

Answer: The operation does not succeed. ElasticSearch rejects it with a “version_conflict_engine_exception”, because the document has already moved on to a newer sequence number.

Question: Do we always have to perform this retry manually, or does ElasticSearch provide an automated way as well?

Answer: Yes, there is an automated way: the retry_on_conflict parameter, which makes ElasticSearch retry the update automatically whenever a conflict is detected. In our example there is no concurrent writer, so the update succeeds on the first attempt.

Question: Can we also use the retry_on_conflict feature for full updates?

Answer: No. This option is not available for the PUT command, i.e. a full update of the document. It only works with a partial update.
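For comparison, here is a minimal sketch of a partial update that asks the server to retry on conflicts. As before, the request is only built, not sent; the helper name `partial_update`, the document id and the field being changed are assumptions made for this example, while `_update` and `retry_on_conflict` are the endpoint and parameter discussed above.

```python
import json
from urllib.request import Request


def partial_update(base_url, index, doc_id, partial_doc, retries):
    """Build (but do not send) a partial update that retries on version conflicts."""
    return Request(
        url=f"{base_url}/{index}/_update/{doc_id}?retry_on_conflict={retries}",
        data=json.dumps({"doc": partial_doc}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical id and field value, used only to show the shape of the request.
update_request = partial_update(
    "http://localhost:9200", "movies", "example-id", {"title": "Example title"}, retries=5
)
print(update_request.get_method(), update_request.full_url)
print(update_request.data.decode("utf-8"))
```

Because the server re-reads the document before each retry, the client does not need to supply a sequence number here.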

Question: Next, let’s perform a simple search for titles matching “Star Trek”.

Answer: This is the query we issue to ES:

curl --location --request GET 'localhost:9200/movies/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"title" : "Star Trek"
}
}
}'

These are the results of the search:

{
"took": 676,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2.129195,
"hits": [
{
"_index": "movies",
"_type": "_doc",
"_id": "135569",
"_score": 2.129195,
"_source": {
"id": "135569",
"title": "Star Trek Beyond",
"year": 2016,
"genre": [
"Action",
"Adventure",
"Sci-Fi"
]
}
},
{
"_index": "movies",
"_type": "_doc",
"_id": "122886",
"_score": 0.5935682,
"_source": {
"id": "122886",
"title": "Star Wars: Episode VII - The Force Awakens",
"year": 2015,
"genre": [
"Action",
"Adventure",
"Fantasy",
"Sci-Fi",
"IMAX"
]
}
}
]
}
}

So the query for Star Trek returned both the Star Trek and the Star Wars movies. The reason is that with an analysed text field, like our titles here, partial matches can come back as well.

Conclusion: By searching for Star Trek, we got Star Trek Beyond as the top hit, but also Star Wars, because it partially matches the search terms. Its score is considerably lower (0.59 against 2.13), which is at least the right ordering.

Question: How does this search work? Why did the movie with “Star Wars” in its title appear in a search for “Star Trek”?

Answer: The search text “Star Trek” was broken up into two separate search terms, “star” and “trek”. Looking those terms up in the inverted index of the title field leads back to both Star Wars and Star Trek, because both titles contain at least the term “star”.
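A toy inverted index makes this behaviour easy to see. The sketch below is a deliberately crude approximation: the `analyse` function only lowercases and splits on non-alphanumeric characters, and the score is a plain count of matching terms, whereas ElasticSearch uses real analysers and relevance scoring. The two titles and ids are the ones from the result above.

```python
import re
from collections import defaultdict


def analyse(text):
    """Rough stand-in for an analyser: lowercase, then split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


titles = {
    "135569": "Star Trek Beyond",
    "122886": "Star Wars: Episode VII - The Force Awakens",
}

# term -> set of document ids whose title contains that term
inverted_index = defaultdict(set)
for doc_id, title in titles.items():
    for term in analyse(title):
        inverted_index[term].add(doc_id)


def match(query):
    """Count how many query terms each document contains (a crude score)."""
    hits = defaultdict(int)
    for term in set(analyse(query)):
        for doc_id in inverted_index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


print(match("Star Trek"))
# → [('135569', 2), ('122886', 1)]
```

Star Trek Beyond matches both terms and the Star Wars title matches only “star”, which mirrors the ordering of the real result.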

Question: How can we delete the entire index?

Answer: By sending an HTTP DELETE request to the index itself, i.e. DELETE localhost:9200/movies.

Question: Let’s assume that we have to enable exact search on the attribute “genre” in the movies index. How can this be done?

Answer: By defining the mapping of the attributes like this:

  • The attribute “genre” gets the type keyword, which means that only exact matches are done on that field. No analyser is run on the genre field at all. To get search results back on the genre, the search term has to be an exact match, case-sensitive, the whole works.
  • The attribute “title” gets the type text, so an analyser is applied to it. That allows partial matches, normalisation of lowercase and uppercase, and similar conveniences.
  • On the field “title”, we can also name the specific analyser that should run on that text field. The “english” analyser applies stop words and stemming that are specific to the English language.
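The mapping described in the bullets can be written out as a request body. The sketch below only builds and prints that body; it covers just the two fields discussed above, and any other fields of the real index are left out. The body would be sent with a PUT to the index, for example localhost:9200/movies, when the index is created.

```python
import json

# Mapping body for the two fields discussed above; other fields are omitted.
movies_mapping = {
    "mappings": {
        "properties": {
            # keyword: stored as-is, matched exactly and case-sensitively
            "genre": {"type": "keyword"},
            # text: analysed, here with the built-in English analyser
            "title": {"type": "text", "analyzer": "english"},
        }
    }
}

print(json.dumps(movies_mapping, indent=2))
```

Note that the mapping key is spelled `analyzer`, with a “z”, regardless of the spelling used in prose.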

Question: Let’s re-ingest our bulk data into the same index.

Recall that we have the following 5 records, which we ingest through the Bulk API. 4 of the 5 records have the exact genre value “Sci-Fi”. Note the case as well.

Question: Now, let’s perform a simple search on the genre attribute for “sci”.

Answer: This is the query we issue to ES:

curl --location --request GET 'localhost:9200/movies/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"genre" : "sci"
}
}
}'

Here are the results:

This time we got nothing back, which is exactly what we wanted: “genre” is now a keyword field, so only a pure exact match works on it.

Question: Next, let’s perform another search on the genre attribute, this time for “sci-fi”.

Answer: This is the query we issue to ES:

curl --location --request GET 'localhost:9200/movies/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"genre" : "sci-fi"
}
}
}'

Again we got zero results, because no document in our index contains the value “sci-fi” in the attribute genre. The stored value is “Sci-Fi”, and the match has to be exact, including case.

Conclusions on Search :-

  • If you want a field to be analysed, make it a text field. That allows partial matching and makes the search a little more forgiving.
  • If you want exact matches for search terms, make the field a keyword field instead of a text field.
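The difference between the two field types can be imitated in a few lines. This is only a simplified model, assuming that a keyword field compares whole values verbatim and that a text field compares lowercased tokens; the function names `keyword_match` and `text_match` are made up for the sketch.

```python
import re


def tokens(text):
    """Lowercase and split on non-alphanumerics, as a crude analyser would."""
    return {token for token in re.split(r"[^a-z0-9]+", text.lower()) if token}


def keyword_match(stored_values, query):
    """Keyword field: the query must equal a stored value exactly."""
    return query in stored_values


def text_match(stored_text, query):
    """Text field: any shared token counts as a (partial) match."""
    return bool(tokens(stored_text) & tokens(query))


genre = ["Action", "Adventure", "Sci-Fi"]
print(keyword_match(genre, "sci"))      # False: not the whole value
print(keyword_match(genre, "sci-fi"))   # False: case differs
print(keyword_match(genre, "Sci-Fi"))   # True: exact match
print(text_match("Star Trek Beyond", "star wars"))  # True: shares the token "star"
```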

Question: What are the various approaches to data modelling with ElasticSearch?

Answer: There are two ways of storing the data: the normalised, RDBMS style and the de-normalised, NoSQL style. Storage is cheap these days, so duplicating data in order to simplify queries is often an acceptable trade-off.

  • Normalised way of storing the data: Movie-Id and Movie-Title are stored in one index, while Movie-Id, User-Id and Rating are stored in a second index, i.e. two different indices in total.
  • De-normalised way of storing the data: all fields, i.e. Movie-Id, Movie-Title, User-Id and Rating, are stored in one single index.
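The trade-off between the two layouts can be illustrated with plain Python data. In this sketch the dictionaries and lists merely stand in for indices, and the user ids and rating values are invented for the example; only the movie id and title are taken from the search result earlier in the post.

```python
# Normalised: titles live in one "index", ratings in another.
movies_index = {"135569": {"title": "Star Trek Beyond"}}
ratings_index = [
    {"movie_id": "135569", "user_id": "u1", "rating": 4.0},  # made-up ratings
    {"movie_id": "135569", "user_id": "u2", "rating": 3.5},
]


def ratings_with_titles_normalised():
    """Two lookups per result: read the rating, then fetch the title by movie_id."""
    return [
        {**rating, "title": movies_index[rating["movie_id"]]["title"]}
        for rating in ratings_index
    ]


# De-normalised: the title is copied into every rating document.
ratings_denormalised = [
    {"movie_id": "135569", "title": "Star Trek Beyond", "user_id": "u1", "rating": 4.0},
    {"movie_id": "135569", "title": "Star Trek Beyond", "user_id": "u2", "rating": 3.5},
]

# Both layouts answer the same question; they differ in lookups versus duplication.
print(ratings_with_titles_normalised() == ratings_denormalised)
```

The normalised layout stores each title once but needs a second lookup per rating, while the de-normalised layout answers in a single read at the cost of repeating the title, and of rewriting every copy if a title ever changes.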

Question: How can we set up a parent-child relationship in ElasticSearch?

Answer: To demonstrate parent-child mapping in ElasticSearch, we use an example that associates films with the franchise they came from.
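A parent-child relationship of this kind is declared with a join field in the index mapping. The sketch below builds such a mapping body for the “series” index used in the rest of this post, assuming the field is called film_to_franchise and the relation is franchise (parent) to film (child), as in the documents that follow. It only prints the body; sending it with a PUT to localhost:9200/series is left out.

```python
import json

series_mapping = {
    "mappings": {
        "properties": {
            "film_to_franchise": {
                "type": "join",
                # parent relation name -> child relation name
                "relations": {"franchise": "film"},
            }
        }
    }
}

print(json.dumps(series_mapping, indent=2))
```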

Question: Let’s insert the first franchise row into our newly created index.

Answer: We insert the data into the “series” index. The film_to_franchise field carries the relation name “franchise”, which marks this document as a franchise, and the franchise title is Star Wars.

{ "create" : { "_index" : "series", "_id" : "1", "routing" : 1} }
{ "id": "1", "film_to_franchise": {"name": "franchise"}, "title" : "Star Wars" }

This assigns the title Star Wars to a new franchise in our series index.

Question: Let’s insert the first film row into our newly created index.

Answer: We now assign films to that franchise. This time the relation name is “film”, and note that the parent is given as “1”. The title of this film is Star Wars: Episode IV - A New Hope.

{ "create" : { "_index" : "series", "_id" : "260", "routing" : 1} }
{ "id": "260", "film_to_franchise": {"name": "film", "parent": "1"}, "title" : "Star Wars: Episode IV - A New Hope", "year":"1977", "genre":["Action", "Adventure", "Sci-Fi"] }

This means that this particular film is a child of the parent with id 1, which is Star Wars. That is how we link films back to their franchise parent. The routing value of 1 keeps the child on the same shard as its parent, which the join relationship requires.
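Once parents and children are linked, they can be queried through each other. As a hedged sketch, the bodies below use the has_parent and has_child queries against the “series” index: the first asks for all films whose franchise matches “Star Wars”, the second for the franchise that contains a film matching “A New Hope”. They are built as Python dictionaries and printed as JSON; each would be sent as the body of a search request to localhost:9200/series/_search.

```python
import json

# All films belonging to the franchise titled "Star Wars".
films_of_franchise = {
    "query": {
        "has_parent": {
            "parent_type": "franchise",
            "query": {"match": {"title": "Star Wars"}},
        }
    }
}

# The franchise that has a child film titled "A New Hope".
franchise_of_film = {
    "query": {
        "has_child": {
            "type": "film",
            "query": {"match": {"title": "A New Hope"}},
        }
    }
}

print(json.dumps(films_of_franchise))
print(json.dumps(franchise_of_film))
```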
