ELK stack working and sneak-peek into internals | Part-2

If you are landing here directly, we suggest that you first read through Part-1 of this series for the fundamentals.

In this blog, we shall look at the following concepts :-

  • Fundamentals of Data-Ingestion into Elastic-Search.

Fundamentals of Data Ingestion into Elastic-Search :- Indexing data into Elastic-Search is not instantaneous. The main reason is that Elastic-Search has to pass the text of every document through a process called Analysis, which prepares the data before it can actually be indexed. Analysis takes raw text as input and converts it into the terms that are stored in the Inverted-Index, which is what makes the data searchable. This preparation itself takes time. Below is the series of steps involved in the process of ‘Analysis’ :-

  • Say, we supply a document with intent to index it into elastic-search.

Once a document has passed through Analysis, Elastic-Search adds its terms to the Inverted-Index and places the document into an in-memory buffer. When this buffer gets full (or at the next periodic refresh), its contents are written out as a new Segment. A segment is itself an immutable Inverted-Index: once written, it is never modified. The segment contains the final processed data, which is searchable.

Let’s see a worked example of applying ‘Analysis’ to the 2 documents below. Please note that an Analyser is always applied to a particular field of a document, say ‘Field1’. Here, we are assuming that :-

  • In Document-1, there is a ‘Field1’ and it has the value:- “The thin lifeguard was swimming in the lake”.

Every time a document is indexed, the Analyser is involved: it tokenises the text, filters the tokens and then puts the resulting terms into the Index.

Now we can formalise the major steps performed by the ‘Analyser’, both during Indexing and during Querying :-

  • Removal of stop-words.

Let’s see what the inverted-index looks like after both of these documents have passed through the Analyser. Note that stop-words have been removed, all words have been lower-cased and stemmed, and synonyms have been collapsed into a single term as well. E.g. the words ‘swimming’ and ‘Swimmers’ are both stemmed to the word ‘swim’.
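The steps above can be sketched in a few lines of Python. This is only a toy model, not how Elastic-Search is implemented: the stop-word list and the suffix-stripping rules are assumptions made for illustration, the synonym step is left out, and the text of Document-2 is hypothetical (only Document-1 is quoted from the example above).

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "was", "in", "a", "an", "is", "of"}  # tiny illustrative list
SUFFIXES = ("ers", "ing", "er", "s")  # toy stemming rules, not a real stemmer


def stem(word):
    """Strip one known suffix, then collapse a doubled final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]  # "swimm" -> "swim"
    return word


def analyze(text):
    """Tokenise, lower-case, drop stop-words and stem; returns (position, term) pairs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(pos, stem(tok)) for pos, tok in enumerate(tokens) if tok not in STOP_WORDS]


def build_inverted_index(docs):
    """Map every term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in analyze(text):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


docs = {
    1: "The thin lifeguard was swimming in the lake",  # Document-1 from the text
    2: "Swimmers love the lake",                       # hypothetical Document-2
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

The same `analyze` function can be applied to a search string, which mirrors the idea that a query goes through the same steps as the indexed text: `analyze("swimming")` yields the single term `swim`.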

The Analyser also plays a key role at query time: all the aforesaid steps are applied to the search query as well. Let’s take an example from the stemming perspective: if a search query contains the word ‘swimming’, then the query itself goes through stemming and the word is reduced to ‘swim’. Let’s take another example from the stop-word removal perspective: say a search query arrives for “the thin”; it is first tokenised and then the stop-words are removed from it, as below :-

The Inverted-Index also stores the position of every word inside a document. Elastic-Search also has some pre-built Analysers. Please note that the default Standard Analyser only tokenises the text on word boundaries and lower-cases the tokens; stop-word removal and stemming, as detailed above, are done by language analysers such as the English Analyser. Other analysers available are the Whitespace Analyser, the Simple Analyser, etc.

  • Let’s see a worked example of how a document gets analysed by the ‘Standard Analyser’ :-

The ‘Standard Analyser’ removes punctuation, but doesn’t remove the digits in the text :-

  • Let’s see a worked example of how a document gets analysed by the ‘Whitespace Analyser’ :-

The ‘Whitespace Analyser’ removes neither the punctuation attached to a word nor the digits in the text; it only splits the text on whitespace :-

  • Let’s see a worked example of how a document gets analysed by the ‘Simple Analyser’. The ‘Simple Analyser’ gets rid of both punctuation and digits. See in the example below how different terms were tokenised with the Simple Analyser :-
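To compare the three analysers side by side, here is a rough Python approximation. These regular expressions are assumptions that only mimic the behaviour described above on simple English text; the real analysers follow Unicode segmentation rules and will differ on edge cases. The sample sentence is made up for illustration.

```python
import re


def whitespace_analyzer(text):
    # Split on whitespace only: case, punctuation and digits are all kept.
    return text.split()


def simple_analyzer(text):
    # Keep runs of letters only and lower-case them: punctuation and digits are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def standard_analyzer_approx(text):
    # Keep runs of letters and digits and lower-case them: punctuation is dropped.
    return re.findall(r"[^\W_]+", text.lower())


sample = "The 2 QUICK Brown-Foxes jumped!"
for analyzer in (standard_analyzer_approx, whitespace_analyzer, simple_analyzer):
    print(analyzer.__name__, analyzer(sample))
```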

Let’s now define the Index ourselves :-

The default no. of shards created by Elastic-Search for an Index was 5 before version 7.0 and is 1 from version 7.0 onwards; with the older default, a logical index is inherently stored divided into 5 shards. The picture below shows how we created an Index: we specified the no. of shards, the no. of replicas and the types of the fields in the document :-
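A request body of this shape can be sketched as a Python dict. The index name `customers`, the field names and the shard/replica counts here are assumptions chosen for illustration, and the layout follows the type-less mapping format of Elastic-Search 7.x (older versions nest the properties under a type name).

```python
import json


def build_index_body(shards, replicas, properties):
    """Assemble the JSON body for creating an index with explicit settings and mappings."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {"properties": properties},
    }


body = build_index_body(
    shards=2,
    replicas=1,
    properties={
        "name": {"type": "text"},       # hypothetical fields
        "age": {"type": "integer"},
        "online": {"type": "boolean"},
    },
)
print("PUT /customers")  # hypothetical index name
print(json.dumps(body, indent=2))
```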

Let’s now fetch the information of this Index from Elastic-Search :-

Now, if we try to index a document which has some additional field (one that was not defined above at index-creation time), Elastic-Search still indexes this document and automatically guesses the type of the newly added field. Let’s see the example below, where we add a document of type ‘customer’ with a new field ‘address’ :-

Here, Elastic-Search automatically guesses the type of the field ‘address’, and the new schema of the index therefore looks like below. This is how Elastic-Search behaves dynamically :-

However, we are also free to enforce a strict structure at Index-creation time.

  • Say that someone tries to index a document which has an additional field (other than the fields pre-specified at Index-creation time) and we want that field to be ignored; then we set the ‘dynamic’ mapping property to ‘false’. Let’s see the example below :-
  • Say that someone tries to index a document which has an additional field (other than the fields pre-specified at Index-creation time) and we strictly don’t want such a document to be ingested at all; then we set the ‘dynamic’ mapping property to ‘strict’. Let’s see the example below :-
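The three behaviours (dynamic, ignore, strict) can be modelled with a small Python simulation. This is a simplified sketch and not Elastic-Search code: the `index_document` helper, the sample document and the type-guessing rules are assumptions made for illustration (for instance, a real dynamic mapping of a string also adds a `keyword` sub-field).

```python
def guess_type(value):
    """Very rough stand-in for dynamic type detection."""
    if isinstance(value, bool):  # checked first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "text"


def index_document(mapping, doc):
    """Return the fields that become searchable; may extend the mapping when dynamic."""
    dynamic = mapping.get("dynamic", True)
    properties = mapping["properties"]
    indexed = []
    for field, value in doc.items():
        if field in properties:
            indexed.append(field)
        elif dynamic == "strict":
            raise ValueError(f"strict mapping: unknown field '{field}', document rejected")
        elif dynamic is False:
            continue  # the field is accepted with the document but not indexed
        else:
            properties[field] = {"type": guess_type(value)}
            indexed.append(field)
    return indexed


doc = {"name": "Asha", "address": "12 Lake Road"}  # hypothetical customer document
for dynamic in (True, False, "strict"):
    mapping = {"dynamic": dynamic, "properties": {"name": {"type": "text"}}}
    try:
        print(dynamic, "->", index_document(mapping, doc))
    except ValueError as err:
        print(dynamic, "->", err)
```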

Let’s now see how to query/search Elastic-Search :-

The main power of Elastic-Search is that it searches very quickly. Whenever we query Elastic-Search for a particular keyword, a relevancy score is returned with each matching document. This score represents how relevant the document is to the query. E.g. a document which contains more occurrences of the search keyword is, other things being equal, more relevant and gets a higher relevancy score. The more relevant documents (i.e. documents having a high relevancy score) appear at the beginning of the search results. Elastic-Search provides a ‘_search’ endpoint. In its response, the ‘hits’ array contains the data.

The ‘_search’ endpoint is quite a powerful way to query Elastic-Search. We can specify a term query like below, and it fetches all those documents which contain the searched value :-

We demonstrated all the aforesaid ways of querying Elastic-Search using Kibana DevTools. We can also query Elastic-Search directly using curl commands :-

Now, let’s see some examples to understand the query DSL syntax of Elastic-Search. To facilitate these queries, we have pre-populated Elastic-Search with 10 documents in an Index called ‘courses’.

Example1:- We can get all the documents present in any Index by calling ‘_search’ and specifying ‘match_all’. Notice that all of the documents being returned are present inside the ‘hits’ key and that every document has a relevancy-score of 1, as all documents are equally relevant.

Example2:- Let’s say we only want to get those courses which contain the keyword ‘computer’. Notice below that the value of “hits.total.value” is 2, which means that there are 2 matching documents.

Example3:- Now, say we modify the ‘name’ field of the document with id 4 and fire the same search query again; we then get a different relevancy-score for each document, as shown below. Note that the document whose ‘name’ field contains more than 1 occurrence of the term ‘computer’ gets the higher relevancy score.

Example4:- Now, say we want to fetch those documents which have the keyword ‘computer’ in the field ‘name’ AND the keyword ‘C8’ in the field ‘room’; below shall be the query :-
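A minimal sketch of such a request body, written as a Python dict so that it can be printed as JSON. It assumes that ‘match’ queries are used for both conditions, which the text does not spell out.

```python
import json

# Both conditions sit inside "must", so a document has to satisfy both (logical AND).
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"name": "computer"}},
                {"match": {"room": "C8"}},
            ]
        }
    }
}
print("GET /courses/_search")
print(json.dumps(query, indent=2))
```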

Please note that the ‘must’ clause can only be used inside the ‘bool’ clause.

Example5:- Now, say we want to fetch those documents which have the keyword ‘accounting’ in the field ‘name’ AND the keyword ‘e7’ in the field ‘room’, AND in which the keyword ‘bill’ must NOT be present in the nested field ‘professor.name’; then below shall be the query. So, we use the ‘must_not’ clause here.
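One way to write this down as a Python dict is shown here. As an assumption, each condition is expressed as a ‘match’ query; the field names are the ones named in the example.

```python
import json

query = {
    "query": {
        "bool": {
            # Both of these have to match.
            "must": [
                {"match": {"name": "accounting"}},
                {"match": {"room": "e7"}},
            ],
            # Any document matching this is excluded from the results.
            "must_not": [
                {"match": {"professor.name": "bill"}},
            ],
        }
    }
}
print(json.dumps(query, indent=2))
```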

Example6:- Now, say we want to fetch those documents which have the keyword ‘accounting’ in the field ‘name’ OR in the field ‘professor.department’; then below shall be the query as shown. Please note that documents which contain this term in both the ‘name’ and ‘professor.department’ fields may get a higher score, depending on the ‘type’ of the query. So, we use the ‘multi_match’ query here.
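As a sketch, the body of such a ‘multi_match’ search could look like the Python dict below; only the two field names from the example are used and every other option is left at its default.

```python
import json

query = {
    "query": {
        "multi_match": {
            "query": "accounting",
            # The term may match in either of these fields.
            "fields": ["name", "professor.department"],
        }
    }
}
print(json.dumps(query, indent=2))
```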

Example7:- Now, say we want to fetch those documents which contain the text ‘from the business school taken by final’ in the field ‘course_description’; then below shall be the query as shown. Please note that all the words in the searched phrase are full tokens and none is partial. So, we use the ‘match_phrase’ query here.

Example8:- Now, say we want to fetch those documents which contain the text ‘from the business school taken by fin’ in the field ‘course_description’; then below shall be the query as shown. Please note that the phrase is incomplete here: the word ‘fin’ is not present in the field as a full token, only as the beginning of one. So, we use the ‘match_phrase_prefix’ query here.
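The two phrase queries differ only in the query name, which the small helper below makes explicit. The helper `phrase_query` is hypothetical and exists only for this illustration.

```python
import json


def phrase_query(field, text, prefix=False):
    """Build a match_phrase body, or match_phrase_prefix when the last word is partial."""
    kind = "match_phrase_prefix" if prefix else "match_phrase"
    return {"query": {kind: {field: text}}}


full = phrase_query("course_description", "from the business school taken by final")
partial = phrase_query("course_description", "from the business school taken by fin", prefix=True)
print(json.dumps(full, indent=2))
print(json.dumps(partial, indent=2))
```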

Example9:- Now, say we want to fetch those documents whose field ‘students_enrolled’ contains a value greater-than/equal-to 10 and lesser-than/equal-to 20. So, we use the ‘range’ query here.

Example 10:- Now, say we want to fetch those documents whose field ‘name’ contains the value ‘accounting’, whose field ‘room’ must not contain the value ‘e7’ and whose field ‘students_enrolled’ should contain a value greater-than/equal-to 10 and lesser-than/equal-to 20. So, we use a combination of the ‘bool’ and ‘range’ queries here. Please note that Elastic-Search also fetches a record for which the value of ‘students_enrolled’ is more than 20, but that document has a lower relevancy score. It was fetched at all because we had the range query inside the ‘should’ clause. The interpretation of the ‘should’ clause is : “It’s nice to have” !!

Example 11:- Now, say we want to fetch those documents whose field ‘name’ contains the value ‘accounting’, whose field ‘room’ must not contain the value ‘e7’ and whose field ‘students_enrolled’ must contain a value greater-than/equal-to 10 and lesser-than/equal-to 20. So, again, we use a combination of the ‘bool’ and ‘range’ queries here. Pay attention to the change in clauses as compared to the aforesaid scenario: the range query now sits inside ‘must’.
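The difference between Example 10 and Example 11 comes down to which clause holds the range query. The sketch below builds both variants from one hypothetical helper, `courses_query`; the use of ‘match’ for the text conditions is an assumption.

```python
import json


def courses_query(range_clause):
    """Put the students_enrolled range under 'should' (optional) or 'must' (required)."""
    bool_query = {
        "must": [{"match": {"name": "accounting"}}],
        "must_not": [{"match": {"room": "e7"}}],
    }
    range_query = {"range": {"students_enrolled": {"gte": 10, "lte": 20}}}
    bool_query.setdefault(range_clause, []).append(range_query)
    return {"query": {"bool": bool_query}}


nice_to_have = courses_query("should")  # Example 10: range only influences the score
required = courses_query("must")        # Example 11: range has to match
print(json.dumps(nice_to_have, indent=2))
print(json.dumps(required, indent=2))
```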

Example 12:- Now, say we want to fetch those documents whose field ‘students_enrolled’ has a value between 10 and 17 AND whose field ‘name’ OR ‘course_description’ contains the value ‘market’. Also, we want to give double weightage to those documents which have the value ‘market’ present in the field ‘course_description’. We shall use Field-Boosting here :- Please note that as we increase the weightage, the relevancy-score of those documents increases as well.
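Field-Boosting is written with the `field^weight` notation inside the list of fields. A sketch of the body follows, assuming that both conditions go into the ‘must’ clause (the range could equally be placed under ‘should’ or ‘filter’).

```python
import json

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "market",
                        # "^2" doubles the weight of matches in course_description.
                        "fields": ["name", "course_description^2"],
                    }
                },
                {"range": {"students_enrolled": {"gte": 10, "lte": 17}}},
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```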

Now, let’s see some examples to understand the filter syntax of the Elastic-Search DSL. Following are the major distinctions b/w queries & filters :-

  • A filter doesn’t compute a relevancy score for the documents returned. It just picks those documents which match the search criteria and returns them.

Please note that any portion of an Elastic-Search query which sits in the query context (i.e. outside the filter context) still produces relevancy-scores for the documents in the output. Another important thing to note here is: whatever can be expressed as a query can also be placed inside the filter context.

Example13:- Let’s say we want to filter for those course documents which contain the keyword ‘accounting’ in the field ‘name’.

Example14:- Let’s say we want to filter for those course documents which contain the keyword ‘accounting’ in the field ‘name’ and have the value ‘bill’ in the field ‘professor.name’.

Example15:- Let’s say we want to filter for those course documents which contain the keyword ‘accounting’ in the field ‘name’ AND have the value ‘bill’ in the field ‘professor.name’ AND have a ‘students_enrolled’ count greater-than/equal-to 10 and lesser-than/equal-to 20. Pay attention to the range query used inside the filter context :-
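In the request body, the filter context is the ‘filter’ clause of a ‘bool’ query. The sketch below expresses Example15 this way, assuming ‘match’ queries for the two text conditions.

```python
import json

query = {
    "query": {
        "bool": {
            # Everything under "filter" only includes or excludes documents;
            # it does not contribute to the relevancy score.
            "filter": [
                {"match": {"name": "accounting"}},
                {"match": {"professor.name": "bill"}},
                {"range": {"students_enrolled": {"gte": 10, "lte": 20}}},
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```

Dropping the last one or two entries of the list gives the bodies for Example14 and Example13 respectively.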

Example16:- Let’s now see the usage of the BULK API provided by Elastic-Search :- Here, we index many documents at once into Elastic-Search.
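The ‘_bulk’ endpoint expects newline-delimited JSON: one action line followed by one document line, with a newline at the very end. A sketch of building such a body in Python is given here; the index name `vehicles` and the sample documents are assumptions made for illustration.

```python
import json


def build_bulk_body(index_name, docs):
    """Return an NDJSON bulk body with one 'index' action per document."""
    lines = []
    for doc_id, doc in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body has to end with a newline


vehicles = {  # hypothetical documents
    "1": {"make": "Toyota", "color": "red", "price": 18000},
    "2": {"make": "Honda", "color": "blue", "price": 16500},
}
body = build_bulk_body("vehicles", vehicles)
print(body, end="")
```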

Example17:- Now, when we invoke the search API with a GET call, it does not fetch all the matching documents; by default it fetches only 10 documents from Elastic-Search. To specify the count of documents to be fetched, we can use the “size” parameter :-

Example18:- Say we want to fetch the first 5 records (from 0, size 5), sorted on the basis of price in decreasing order; below shall be the query for the same. Please note that Elastic-Search first matches all the records (as evident from the value of the field ‘hits.total.value’), then sorts all of those records and only then applies the pagination (returning records 0 to 5) over them.
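The order of operations described above (sort everything first, paginate afterwards) can be illustrated with a request body plus a tiny local simulation. The price values are made up, and `sort_then_paginate` is a hypothetical helper that only imitates the observable behaviour, not the internals of Elastic-Search.

```python
search_body = {
    "from": 0,
    "size": 5,
    "sort": [{"price": {"order": "desc"}}],
    "query": {"match_all": {}},
}


def sort_then_paginate(docs, sort_field, offset, size):
    """Sort ALL matching documents first, then cut out the requested page."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=True)
    return ordered[offset : offset + size]


prices = [12000, 30000, 18000, 25000, 9000, 22000, 15000]  # hypothetical data
docs = [{"id": i, "price": p} for i, p in enumerate(prices)]
page = sort_then_paginate(docs, "price", search_body["from"], search_body["size"])
print([d["price"] for d in page])
```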

Example19:- Let’s now see Aggregations over the data in Elastic-Search. The simplest aggregation is COUNT. Let’s count the number of vehicles whose value for the field ‘make’ is ‘Toyota’.

Example20:- Let’s say we want to get the count of vehicles colour-wise; we can apply our custom aggregation as shown below. Please note that using the “keyword” sub-field of the field is crucial here: the aggregation needs the exact, un-analysed value of the field, whereas a ‘text’ field is analysed into tokens and cannot be aggregated on by default.

Example21:- Let’s say we want to get the count of vehicles make-wise; we can apply our custom aggregation. Also, say we want to know the max & min price for each vehicle make; we can use the approach below. Please note that the aggregation runs over the scope of the specified query, i.e. “all the records” here. Another point to note is that the response by default also shows the matching records under the key ‘hits’. In case we don’t want the records to be fetched, we can specify “size” as 0.
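A sketch of such an aggregation body as a Python dict follows. The aggregation names (`by_make`, `max_price`, `min_price`) are arbitrary labels chosen for this illustration, and `make.keyword` assumes that ‘make’ was mapped dynamically as text with a keyword sub-field.

```python
import json

body = {
    "size": 0,                     # do not return the matching records under "hits"
    "query": {"match_all": {}},    # the scope over which the aggregation runs
    "aggs": {
        "by_make": {
            "terms": {"field": "make.keyword"},  # one bucket (with a count) per make
            "aggs": {
                "max_price": {"max": {"field": "price"}},
                "min_price": {"min": {"field": "price"}},
            },
        }
    },
}
print(json.dumps(body, indent=2))
```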

Example22:- There is a short-cut for what we did above as well. To get the statistics directly, we can use the ‘stats’ aggregation as shown below :-
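The ‘stats’ aggregation returns count, min, max, avg and sum in one go. The sketch below shows a request body using it, together with a local Python function that computes the same five numbers so the result shape is easy to picture. The aggregation names and the sample prices are assumptions.

```python
body = {
    "size": 0,
    "aggs": {
        "by_make": {
            "terms": {"field": "make.keyword"},
            "aggs": {"price_stats": {"stats": {"field": "price"}}},
        }
    },
}


def stats(values):
    """Compute locally the five numbers that a 'stats' aggregation reports."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


print(stats([10, 20, 30]))  # hypothetical prices within one bucket
```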

Introducing LOGSTASH :- Logstash is a powerful open-source data processing pipeline. It’s used to ingest data from a multitude of sources simultaneously.

There are 3 stages through which data passes in a Logstash pipeline: inputs, filters and outputs.

Starting Logstash locally :-

We assume that Elastic-Search and Kibana are both already running in separate terminals. With the ‘-e’ option, we start Logstash with its configuration provided on the fly, on the command line.

aditya-MAC:bin aditya$ ./logstash -e 'input { stdin {}} output { stdout{}}'
Sending Logstash logs to /Users/aditya/Documents/LEARNING/ELK-Stack/MY-WORKS/logstash-7.8.1/logs which is now configured via log4j2.properties
The stdin plugin is now waiting for input:
[2020-10-08T07:21:11,382][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2020-10-08T07:21:11,756][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}

Let’s see Logstash in action :- We send some data on standard input and receive the same on standard output, with a timestamp (and a few other fields) added automatically.

Charitableness should not be forgotten. Its a duty of mankind to mankind.

{
    "message" => "Charitableness should not be forgotten. Its a duty of mankind to mankind.",
    "host" => "aditya-MAC",
    "@version" => "1",
    "@timestamp" => 2020-10-08T01:55:19.368Z
}
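To make the shape of this event concrete, here is a small Python imitation of what the stdin-to-stdout pipeline does to one line of input. It is not Logstash code: `make_event` is a hypothetical helper that merely wraps the message with the same four fields seen in the output above.

```python
import socket
from datetime import datetime, timezone


def make_event(message, host=None):
    """Wrap a raw input line into an event with host, version and UTC timestamp."""
    now = datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return {
        "message": message,
        "host": host or socket.gethostname(),
        "@version": "1",
        "@timestamp": timestamp,
    }


event = make_event(
    "Charitableness should not be forgotten. Its a duty of mankind to mankind.",
    host="aditya-MAC",
)
for key, value in event.items():
    print(f"{key} => {value}")
```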

That’s all in this section. If you liked reading this blog, kindly press the clap button to indicate your appreciation. We will see you in the next part of this series.


Software Engineer for Big Data distributed systems
