The ELK stack in action and a sneak peek into its internals

Fundamentals of Elasticsearch :-

Elasticsearch is a document-oriented search engine: data is stored, searched and deleted in the form of documents. Its purpose, as the name suggests, is to facilitate searching, and it does so at lightning speed. Just like Google searches and finds the relevant matching pages, Elasticsearch retrieves the documents that match the keywords in the search query. It supports :-

  • Analysis of documents.
  • Indexing of documents.
  • Deletion of documents.
  • Retrieval of documents.
  • Search of documents.

The way it works is similar to the way an index works in a book.

With the help of an index, we can quickly find all the page numbers where a particular keyword occurs, and the reader can directly flip to the concerned pages. The index can be considered the search engine for that book. This technique allows for very efficient searching, which is exactly what a search engine like Elasticsearch builds upon. Elasticsearch maintains an inverted index, where every word (across all the sentences or documents) makes its way into the inverted index: for every term, it maintains a map of the documents in which that term occurs. The casing of words also matters; tokens are normalised (for example, lowercased) before they are stored and sorted in the inverted index. Elasticsearch can index billions of documents, as it is extremely scalable.

Whenever a search operation is performed on Elasticsearch, it pays attention to the relevancy of the documents before fetching them. Relevancy scores are assigned to documents depending upon how well they match the search terms. Let's try to set up Elasticsearch on our machine.

Step.1) First step is to launch Elasticsearch in a terminal window.

aditya-MAC:elasticsearch-7.8.1 aditya$ pwd

/Users/aditya/Documents/LEARNING/ELK-Stack/MY-WORKS/elasticsearch-7.8.1

aditya-MAC:bin aditya$ ./elasticsearch

[2020-10-02T17:12:36,962][INFO ][o.e.n.Node ] [aditya-MAC] version[7.8.1], pid[12593], build[default/tar/b5ca9c58fb664ca8bf9e4057fc229b3396bf3a89/2020-07-21T16:40:44.668009Z], OS[Mac OS X/10.13.6/x86_64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/14.0.1/14.0.1+7]

[2020-10-02T17:12:36,987][INFO ][o.e.n.Node ] [aditya-MAC] JVM home [/Users/aditya/Documents/LEARNING/ELK-Stack/MY-WORKS/elasticsearch-7.8.1/jdk.app/Contents/Home]

Step.2) Check if Elastic is running fine :-

aditya-MAC:elasticsearch-7.8.1 aditya$ curl http://localhost:9200

{
  "name": "aditya-MAC",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "cyzSSpYfSJa2ZE9NMihwYw",
  "version": {
    "number": "7.8.1",
    "build_flavor": "default",
    "build_type": "tar",
    "build_hash": "b5ca9c58fb664ca8bf9e4057fc229b3396bf3a89",
    "build_date": "2020-07-21T16:40:44.668009Z",
    "build_snapshot": false,
    "lucene_version": "8.5.1",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}

Step.3) Start the Kibana UI on top of Elastic :-

aditya-MAC:kibana-7.8.1-darwin-x86_64 b0218162$ cd bin/

aditya-MAC:bin aditya$ ./kibana

log [12:01:18.474] [warning][plugins-discovery] Expect plugin "id" in camelCase, but found: apm_oss

log [12:01:18.508] [warning][plugins-discovery] Expect plugin "id" in camelCase, but found: triggers_actions_ui

log [12:01:29.828] [info][plugins-service] Plugin "visTypeXy" is disabled.

log [12:01:29.828] [info][plugins-service] Plugin "endpoint" is disabled.

log [12:01:29.829] [info][plugins-service] Plugin "ingestManager" is disabled.

log [12:01:29.829] [info][plugins-service] Plugin "lists" is disabled.

log [12:01:31.751] [warning][config][deprecation] Setting [elasticsearch.username] to "elastic" is deprecated. You should use the "kibana_system" user instead.

log [12:01:31.751] [warning][config][deprecation] Config key [monitoring.cluster_alerts.email_notifications.email_address] will be required for email notifications to work in 8.0.

log [12:01:31.751] [warning][config][deprecation] Setting [monitoring.username] to "elastic" is deprecated. You should use the "kibana_system" user instead.

Step.4) Check if Kibana is running fine :-

Kibana is accessible at :- http://localhost:5601/app/kibana#/home. Note that Kibana can only run if Elasticsearch is up and running.

Step.5) Check the health of Elasticsearch through the Kibana UI :-

There is an option to monitor the health of the Elasticsearch cluster through the Kibana dashboard. For example, we can monitor the JVM heap usage, because Elasticsearch ultimately runs on top of the JVM.

Let's now see what a document looks like in Elasticsearch :-

In contrast to an RDBMS, where data is stored in the form of rows, a document in Elasticsearch is a JSON object. Documents in Elasticsearch are immutable in nature: whereas a particular column of a row can be modified in place in an RDBMS, the same is not possible in Elasticsearch; an update always replaces the whole document.

Elasticsearch is built on top of Lucene. Its underlying principle is the inverted index, which maps words to the actual document locations where they occur. Elasticsearch also supports full-text search.

Interacting with Elasticsearch using out-of-the-box API endpoints :- The process of ingesting data into Elasticsearch is known as indexing; to index a document into Elasticsearch means to insert a document into it. Indexing is the slowest of the supported operations, because each document has to be analysed first. Let's see how we can perform these operations using the RESTful APIs exposed by Elasticsearch :-

The 'type' is a kind of sub-division of an index. For example, for a 'vehicles' index, the sub-divisions could be trucks, motor-cycles, cars, etc. This 'type' concept is being dropped from Elasticsearch 7 onwards, i.e. we are no longer allowed to have multiple sub-types for a given index, and the type no longer appears as part of the URL. Let's see a real example of indexing a document into Elasticsearch :-

If we don't supply an id while creating a document, Elasticsearch automatically generates one for us, but it is highly recommended to supply an 'id' while indexing a document. We send data to Elasticsearch using JSON syntax and we get data back from Elasticsearch in JSON format; this is how we communicate with Elasticsearch. Here's how the output looks :-
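Since the original screenshot is not reproduced here, below is a minimal sketch of such an indexing call against the 'vehicles' index used later in this article; the document id 123 and the field values are purely illustrative :-

PUT /vehicles/_doc/123
{
  "make": "Maruti",
  "Color": "Red",
  "HP": 110,
  "milage": 40000,
  "price": 8500.00
}

The response carries the meta-fields discussed below, roughly of the form :-

{
  "_index": "vehicles",
  "_type": "_doc",
  "_id": "123",
  "_version": 1,
  "result": "created",
  ...
}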

The '_version' field represents the version of this document. If we update the document, we will see the version incremented. Also, if we re-index a new document on the same _id, it will overwrite the earlier document and increment the _version. Any field starting with an underscore is meta-information maintained by Elasticsearch itself. Also, notice that "result" here is shown as 'created'. Let's update this document a couple of times, to see whether the _version increases or not :-

Notice here that "result" is now shown as 'updated' and the "_version" value has been incremented. Whenever we change even a single field of any document in Elasticsearch, it deletes the existing document completely and then indexes the document afresh; in other words, the document is replaced entirely whenever any particular field is updated. Now, let's try to fetch this document as shown below. In this case, it gives us the output and indicates the value of 'found' as true. The actual document is returned in the field called '_source'.

Let's see what happens in case we try to do a GET on a document which doesn't exist :-

In the above case, it gives us an output indicating the value of 'found' as false. Let's now try to get only the actual document that we indexed, and NOT the full envelope with meta-fields.

Here, we have used '_source' in the URL itself, and Elasticsearch doesn't return the meta-fields in the response JSON. Now, if we only want to check whether a particular document really exists in a particular index or not, we use the HEAD verb :-
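As a sketch, using the illustrative document id from earlier :-

GET /vehicles/_source/123

HEAD /vehicles/_doc/123

HEAD returns HTTP 200 if the document exists and 404 otherwise.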

Elasticsearch also provides a dedicated API to update a document; under the hood the update still re-indexes the whole document, much like a plain POST. Let's modify the value of a particular field in a given document :-
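A minimal sketch of such a partial update (the field and value are illustrative) :-

POST /vehicles/_update/123
{
  "doc": {
    "price": 7900.00
  }
}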

Now, let's verify whether the document really got modified or not :- indeed it did.

Now, let's see how to delete this document from Elasticsearch.
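The delete call is a single request; for our illustrative document it would look like :-

DELETE /vehicles/_doc/123

The response indicates "result": "deleted".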

Now, if we try to do a GET on this document again, we will not get it anymore. Although no document is returned through the endpoint, the document still exists on disk: Elasticsearch has only marked it as deleted, and documents marked as deleted are purged later when segments are merged. This happens under the hood, so deleted documents are not removed immediately and the disk space might not be freed up immediately.

Elasticsearch is also capable of automatically deducing the data types of the various fields of a document. For the document that we indexed above, let's see what data types Elasticsearch has deduced. For example, the data type of the field 'price' has rightly been deduced as 'float', and the data type of the field 'Color' has been deduced as 'text'.
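The deduced schema can be inspected through the mapping API; a trimmed sketch of what it might return for our illustrative index :-

GET /vehicles/_mapping

{
  "vehicles": {
    "mappings": {
      "properties": {
        "Color": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "price": { "type": "float" },
        ...
      }
    }
  }
}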

Now, if new documents are added to the existing index which have additional fields, Elasticsearch automatically adjusts its structure, i.e. its mappings. For example, say we add a new document (as demonstrated below) which has an additional field 'driver'; the mappings of the index automatically re-adjust and now look like :-

POST /vehicles/_doc/212
{
  "make": "Maruti",
  "Color": "Silver",
  "HP": 110,
  "milage": 51000,
  "price": 8396.23,
  "driver": "Havloc"
}

Next, we can only have a single 'type' in a particular index. Therefore, if we try to create another 'type' within the same index, Elasticsearch throws an exception :-

However, we can create a document of an entirely different structure in the same existing index, as demonstrated below (pay attention to the difference in the endpoints used in the above and below cases).

Recommended Elastic-Search storage mechanism :-

Elasticsearch is a highly scalable data store, capable of storing billions of documents. Under the hood, an index is physically stored in the form of shards; a particular index can be split into multiple shards. So, if we define an index with, say, 2 shards, then its storage is divided into 2 parts, i.e. all the documents actually get stored across those 2 shards. These 2 shards are also called primary shards, because they are the actual data stores.

Now, we can have a set-up where each of these shards is hosted on a different instance (note that both of these machines are part of the same Elasticsearch cluster). For example, we can very well have Shard-0 hosted on Node1 (i.e. Instance-1) and Shard-1 hosted on Node2 (i.e. Instance-2).

Now, from the point of view of fault-tolerance, we should also have replica shards, which are exact copies of the primary shards.

  • Say we only have 1 machine in the cluster; then both shards live on the same node, and if this machine goes down, we are in trouble.
  • Similarly, if we don't have any replica shards and the two primary shards are deployed on different machines, then if any one node goes down, we end up losing half of our data.

Next, we are free to send our request to any node of our Elasticsearch cluster. Every node in the cluster is fully capable of serving any sort of request, as every node knows where every document in the cluster lives and can forward the request to the right node in case it doesn't hold the required data itself. For example, say a client asks Node2 to index a document with id 2; Node2 may redirect the request to Node1 (which owns the corresponding primary shard), and the data is then replicated to its replica shard. The same is demonstrated in the diagram below. Also, please understand that an index in Elasticsearch is merely a logical representation of how the data is organised across the shards. To ingest data (i.e. to decide the shard where a document will go), Elasticsearch inherently uses hashing on the document id.

For serving GET requests, Elasticsearch uses a round-robin approach for load-balancing. Say a request comes in for id 2: it can either be served from Node1 hosting the primary shard or from Node2 hosting the replica shard. Round-robin means dividing the traffic across the set of nodes in turn, rather than always hitting a particular node.

A shard is basically a Lucene index and is a physical container of data. Each node generally holds one or more shards, and each shard holds some part of the data. Lucene was started around 1999; Elasticsearch made Lucene distributed, and Lucene also provides the basis for all the complex search queries that Elasticsearch can serve. Each shard can contain multiple segments, and a segment is itself an inverted index. Below is how multiple segments inside a shard look :-

Say we have 2 shards and the data is divided between them; then the sum total of the documents across both shards (on the cluster of 2 nodes) forms an index in Elasticsearch. See below :-

Recommended shard size :- The shard is the unit at which Elasticsearch distributes data around the cluster, and using shards is what brings horizontal scalability to the table. Let's take a very simple example: say we have an index which is going to hold somewhat less than 10 TB of data, and we have 200 servers, each with a 60 GB disk. We can then specify 200 shards for this index; each shard would store around 50 GB of data and would live on a separate machine, and that is how we can store a very large index. We are at liberty to define a different number of shards for each index in Elasticsearch.

Generally, for production-grade systems, we should avoid having very large shards, as this can negatively affect the cluster's ability to recover from failure. There is no fixed limit on how large shards can be, but a shard size of 40 to 50 GB is often quoted (and recommended by the Elastic team) as a limit that has been seen to work for a variety of use-cases.

In Elasticsearch, each query is executed in a single thread per shard. When we run a query, Elasticsearch must run it against each shard and then compile the individual shard results together to come up with the final result to send back. The benefit of sharding is that the index can be distributed across the nodes of a cluster for higher availability; in other words, it's a trade-off. Multiple shards can be processed in parallel, as can multiple queries and aggregations against the same shard. Therefore, having multiple shards usually improves performance, but if the shard size is too small, the per-shard overhead can instead increase latency.

Recommended number of shards per node :- The number of shards that a particular node can hold is proportional to the amount of heap it has available, but there is no fixed limit enforced by Elasticsearch. A good rule of thumb is to keep the number of shards per node below 20 per GB of configured heap. For example, a node with 30 GB of heap should ideally hold at most 600 shards, and the further below this limit we stay, the better the performance. This eventually helps the cluster stay in good health.

Please note that both attributes, i.e. the size of each shard and the number of shards of an index, directly impact the speed at which Elasticsearch can move shards around in case of a node failure.

Types of nodes in Elasticsearch :- Following are the main types of nodes in Elasticsearch :-

  • Master Node → This is the supervisor of all other nodes in the same cluster. It is responsible for actions like creating and deleting indices, tracking which nodes are part of the cluster and allocating shards to nodes.
  • Master-Eligible Node → There is a property called "node.master" in the elasticsearch.yml file. If this property is set to true (the default), the node is eligible to become the master node. For example, in a multi-node cluster with 1 master node, if the server acting as master fails, the master-eligible nodes compete through a master-election process and a new master is elected.
  • Data Node → This node holds the data and performs operations such as CRUD, search and aggregations. To make a node a data node, the property "node.data" in elasticsearch.yml should be set to true (the default).
  • Ingest Node → This node is used to pre-process documents before they are actually indexed into Elasticsearch. To make a node an ingest node, the property "node.ingest" in elasticsearch.yml should be set to true (the default).
  • Coordinating-only Node (historically also called a Tribe node) → This node is used purely for coordination purposes, i.e. routing client requests and merging search results.

For a production-grade systems, there should be dedicated master, data and ingest nodes.
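As a sketch, using the 7.x boolean role settings mentioned above, a dedicated master-eligible node and a dedicated data node could be configured in elasticsearch.yml roughly like this (a node with all three set to false acts as a coordinating-only node) :-

# elasticsearch.yml of a dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# elasticsearch.yml of a dedicated data node
node.master: false
node.data: true
node.ingest: false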

Elasticsearch architecture :- As we demonstrated earlier, a shard is a self-contained Lucene index of its own and may live on any node of the cluster. If we have a cluster of machines, we can spread the shards across those machines. Every document is hashed to a particular shard based on a mathematical formula; in other words, every shard owns some set of documents. We can also specify resiliency against failure using the replication factor, so that for every primary shard there are replica shards.

As a general practice for production-grade systems, it's advisable to have an odd number of nodes for resiliency. Write requests are routed to the primary shards, and those writes are then automatically replicated to the replica shards. Read requests can be routed to either the primary or the replica shards; the read load can thus be spread across multiple shards, which is how read capacity is scaled. Write capacity is bottlenecked by the number of primary shards.

Please note that the number of primary shards of an index can't be changed after the index has been created. However, the number of replica shards can very well be changed to enhance read throughput, and most applications of Elasticsearch are read-heavy anyway. If we really do need more primary shards, we can re-index our data into a fresh new index created with a higher number of primary shards. In the example below, we specify 2 primary shards and 1 replica, so we end up with a total of 4 shards for the index 'customers'.
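Since the original picture is not reproduced here, a minimal sketch of creating such an index :-

PUT /customers
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}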

Below is an example of a production-grade index which is divided into 12 primary shards. For each primary shard there also exists a replica shard; if a primary shard lives on IP1, its replica would generally live on a different IP. Also, please note that if any query comes to this index, Elasticsearch queries all 12 shards and then combines the results before returning them to the client.

Note on the recommendation: in the above case, our entire customer data is somewhere around 500 GB, and therefore we have kept each shard's size in the range of 40-50 GB.

Fundamentals of data ingestion into Elasticsearch :- Indexing data into Elasticsearch is relatively slow, because Elasticsearch has to pass the data through a process called Analysis. In this step it prepares the data before it can actually be indexed: it takes raw text as input and converts it into an inverted index, so that the data becomes searchable. This preparation is what makes indexing time-consuming. Below is the series of steps involved in this process of Analysis :-

  • Say we supply a document with the intent to index it into Elasticsearch.
  • Elasticsearch first breaks the document into words, i.e. tokenises it. Each token is also called a term.
  • It gets rid of extraneous suffixes and prefixes: stop-words are removed, white-spaces are eliminated and punctuation is stripped.
  • Elasticsearch then lower-cases all the terms.
  • Elasticsearch then performs stemming, i.e. it analyses the root of a particular word and trims the word down to it. E.g. for the two words 'swimming' and 'swimmers', the stemmed root is 'swim'.
  • It then does synonym matching. E.g. the words 'thin' and 'skinny' mean almost the same thing.

Once a particular document has passed through the process of Analysis, Elasticsearch feeds the inverted index and places the document into an in-memory buffer. Once this buffer gets full, its contents are committed to a segment. Once a particular segment is full, it becomes an immutable inverted index; this segment then contains the final processed data, which is searchable.

Let's see a real example of applying the process of Analysis to the 2 documents below. Please note that an analyser is always applied to a particular field of a document, say 'Field1'. Here, we are assuming that :-

  • In Document-1, there is a ‘Field1’ and it has the value:- “The thin lifeguard was swimming in the lake”.
  • In Document-2, there is a ‘Field1’ and it has the value:- “Swimmers race with the skinny lifeguard in lake”.

For every document being indexed, the analyser is involved: it performs tokenisation, filtering and then writes the resulting terms into the index.

Now, we can formalise the major steps performed by the analyser during indexing (and during querying as well) :-

  • Removal of stop-words.
  • Lowercasing.
  • Stemming.
  • Synonym-match.

Let's see how the inverted index looks after passing both of these documents through the analyser (a sketch follows below). Pay attention that the stop-words have been removed, all words have been lower-cased and stemmed, and synonyms have been merged as well. E.g. the words 'swimming' and 'Swimmers' are both stemmed to the term 'swim'.
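A rough sketch of the resulting inverted index, assuming the analyser maps 'skinny' onto 'thin' as a synonym and stems 'swimming'/'Swimmers' to 'swim' :-

Term       → Documents
thin       → Document-1, Document-2
lifeguard  → Document-1, Document-2
swim       → Document-1, Document-2
lake       → Document-1, Document-2
race       → Document-2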

The analyser also plays a key role at query time, and all the aforesaid steps are applied during querying as well. From the stemming perspective: if a search query contains the word 'swimming', the query also goes through the process of stemming and the word is trimmed to 'swim'. From the stop-word-removal perspective: say a search query comes in for "the thin"; it is first tokenised and then the stop-words are removed from the query, as shown below :-

The inverted index also stores the position of every word inside a document. Elasticsearch ships with some pre-built analysers; the one we detailed above is the Standard Analyser. Other analysers available are the Whitespace Analyser, the Simple Analyser, the English Analyser, etc.

  • Let's see a real example of how a particular document gets analysed by the 'Standard Analyser' :-

The 'Standard Analyser' removes the punctuation, but doesn't remove the digits in the text :-
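Elasticsearch exposes the _analyze API for exactly this kind of experiment; the sample sentence below is made up for illustration :-

POST /_analyze
{
  "analyzer": "standard",
  "text": "Swimmers race, 2 laps!"
}

The resulting tokens are [swimmers, race, 2, laps]: lower-cased, punctuation stripped, digits retained.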

  • Let's see a real example of how a particular document gets analysed by the 'Whitespace Analyser' :-

The 'Whitespace Analyser' removes neither the punctuation attached to a word nor the digits in the text (and it keeps the original casing) :-
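With the same made-up sentence :-

POST /_analyze
{
  "analyzer": "whitespace",
  "text": "Swimmers race, 2 laps!"
}

The resulting tokens are [Swimmers, race,, 2, laps!]: the text is only split on whitespace.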

  • Let's see a real example of how a particular document gets analysed by the 'Simple Analyser'. The 'simple' analyser gets rid of both punctuation and digits in a word. See in the example below how the different terms are tokenised with the Simple Analyser :-
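Again with the same made-up sentence :-

POST /_analyze
{
  "analyzer": "simple",
  "text": "Swimmers race, 2 laps!"
}

The resulting tokens are [swimmers, race, laps]: the text is split on any non-letter character and lower-cased, so the digit disappears.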

Let's now define the index ourselves :-

Prior to Elasticsearch 7.0, the default number of primary shards for an index was 5, i.e. a logical index was inherently stored divided across 5 shards; from version 7.0 onwards the default is 1 primary shard (with 1 replica). The example below shows how we create an index ourselves, specifying the number of shards, the number of replicas and the types of the fields of its documents :-
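Since the original picture is not reproduced here, a minimal sketch of such an index definition, extending the earlier 'customers' sketch with explicit mappings; the field names are illustrative :-

PUT /customers
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "name":   { "type": "text" },
      "age":    { "type": "integer" },
      "gender": { "type": "keyword" }
    }
  }
}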

Let's now try to get the information about this index from Elasticsearch :-

Now, if we try to index a document which has some additional field not defined at index-creation time, Elasticsearch would still index the document and would automatically guess the type of the newly added field. In the example below, we add a 'customer' document with a new field 'address' :-

Here, Elasticsearch automatically guesses the type of the field 'address', and the schema of the index now looks like the one below. This is how Elasticsearch behaves dynamically :-

However, we also have the liberty to enforce a strict structure at index-creation time.

  • Say we want any additional field (other than the fields pre-specified at index-creation time) in an incoming document to simply be ignored; then we set the mapping property 'dynamic' to 'false'. See the sketch after this list.
  • Say we want a document carrying any such additional field to be rejected outright, i.e. not ingested at all; then we set the mapping property 'dynamic' to 'strict'. See the sketch after this list.
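A minimal sketch of both variants (the index names are illustrative) :-

PUT /customers_lenient
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": { "type": "text" }
    }
  }
}

PUT /customers_strict
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "text" }
    }
  }
}

With 'dynamic: false', unknown fields are kept in _source but not indexed; with 'strict', the indexing request fails with an exception.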

Let's now see how to query/search Elasticsearch :-

The main power of Elasticsearch is its ability to search very quickly. Whenever we query Elasticsearch for a particular keyword, a relevancy score is returned along with each document; this score represents how relevant the document is. For example, a document which contains more occurrences of the searched keyword is more relevant and gets a higher relevancy score, and the most relevant documents appear at the beginning of the search results. Elasticsearch provides a '_search' endpoint for this; in its response, the 'hits' array contains the data.

The '_search' endpoint is quite a powerful way to query Elasticsearch. We can specify a term query like the one below, and it fetches all the documents which contain the searched value :-
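A minimal sketch of such a term query against the 'vehicles' index used earlier (the '.keyword' sub-field is used so that the exact, non-analysed value is matched) :-

GET /vehicles/_search
{
  "query": {
    "term": {
      "make.keyword": "Toyota"
    }
  }
}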

We demonstrated all the aforesaid ways of querying Elasticsearch using the Kibana DevTools. We can also query Elasticsearch directly using curl commands :-
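For instance, the same kind of search can be issued from the shell (a sketch; the field and value are illustrative) :-

curl -X GET "http://localhost:9200/vehicles/_search?q=make:Toyota&pretty"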

Now, let's see some examples to understand the query-DSL syntax of Elasticsearch. To facilitate these queries, we have pre-populated Elasticsearch with 10 documents in an index called 'courses'.

Example1:- We can get all the documents present in an index by calling '_search' with 'match_all'. Notice that all the returned documents are present inside the 'hits' key, and every document has a relevancy score of 1, as all documents are equally relevant.
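A sketch of this query :-

GET /courses/_search
{
  "query": {
    "match_all": {}
  }
}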

Example2:- Let's say we only want to get those courses which contain the keyword 'computer' in their 'name'. Notice below that we get the value of "hits.total.value" as 2, which means there are 2 matching documents.
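A sketch of this match query :-

GET /courses/_search
{
  "query": {
    "match": { "name": "computer" }
  }
}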

Example3:- Now, say we modify the 'name' field of document-id-4 and fire the same search query again; we then get different relevancy scores for each document, as shown below. Pay attention that the document whose 'name' field contains more than 1 occurrence of the term 'computer' gets the higher relevancy score.

Example4:- Now, say we want to fetch those documents which have the keyword 'computer' in the field 'name' AND the keyword 'C8' in the field 'room'; below is the query :-
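A sketch of this bool/must query :-

GET /courses/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "computer" } },
        { "match": { "room": "C8" } }
      ]
    }
  }
}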

Please note that the 'must' clause can only be used inside a 'bool' clause.

Example5:- Now, say we want to fetch those documents which have the keyword 'accounting' in the field 'name' AND the keyword 'e7' in the field 'room', AND in which the keyword 'bill' must NOT be present in the nested field 'professor.name'; below is the query. So, we combine the 'must' and 'must_not' clauses here.

Example6:- Now, say we want to fetch those documents which have the keyword 'accounting' in the field 'name' OR in the field 'professor.department'; below is the query. Please note that documents which contain this term in both 'name' and 'professor.department' get a higher score. So, we use the 'multi_match' query here.
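A sketch of this multi_match query :-

GET /courses/_search
{
  "query": {
    "multi_match": {
      "query": "accounting",
      "fields": ["name", "professor.department"]
    }
  }
}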

Example7:- Now, say we want to fetch those documents which contain the text 'from the business school taken by final' in the field 'course_description'; below is the query. Please note that all the words in the searched phrase are full tokens, with no partial words. So, we use the 'match_phrase' query here.
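A sketch of this match_phrase query :-

GET /courses/_search
{
  "query": {
    "match_phrase": {
      "course_description": "from the business school taken by final"
    }
  }
}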

Example8:- Now, say we want to fetch those documents which contain the text 'from the business school taken by fin' in the field 'course_description'; below is the query. Please note that the phrase is incomplete here and the word 'fin' is not present in the field as a full token. So, we use the 'match_phrase_prefix' query here.

Example9:- Now, say we want to fetch those documents whose field 'students_enrolled' contains a value greater than or equal to 10 and less than or equal to 20. So, we use the 'range' query here.
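A sketch of this range query :-

GET /courses/_search
{
  "query": {
    "range": {
      "students_enrolled": { "gte": 10, "lte": 20 }
    }
  }
}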

Example 10:- Now, say we want to fetch those documents whose field 'name' contains the value 'accounting', whose field 'room' must NOT contain the value 'e7' and whose field 'students_enrolled' should contain a value between 10 and 20 (inclusive). So, we use a combination of 'bool + range' here. Please note that Elasticsearch also fetches records for which the value of 'students_enrolled' is more than 20, but those documents get a much lower relevancy score. They are fetched at all only because we placed the range query inside the 'should' clause; the interpretation of 'should' is: "it's nice to have".

Example 11:- Now, say we want to fetch those documents whose field 'name' contains the value 'accounting', whose field 'room' must NOT contain the value 'e7' and whose field 'students_enrolled' MUST contain a value between 10 and 20 (inclusive). So, again we use a combination of 'bool + range', but pay attention to the change in the placement of the clauses compared to the previous scenario.

Example 12:- Now, say we want to fetch those documents whose field 'students_enrolled' should have a value between 10 and 17 AND for which the field 'name' OR 'course_description' contains the value 'market'. Also, we want to give double weightage to documents which have the value 'market' in the field 'course_description'. We use field boosting here; note that as we increase the weightage, the relevancy score of those documents increases accordingly.
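A sketch of how such a boosted query could look (the exact clause placement is illustrative; '^2' doubles the weight of matches on 'course_description') :-

GET /courses/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "market",
            "fields": ["name", "course_description^2"]
          }
        }
      ],
      "should": [
        { "range": { "students_enrolled": { "gte": 10, "lte": 17 } } }
      ]
    }
  }
}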

Now, let's see some examples to understand the filtering syntax of the DSL. Following are the major distinctions between queries and filters :-

  • A filtering query doesn't return a relevancy score for the documents it returns; it just filters down to the documents which match the search criteria and returns them.
  • A filtering query is faster than plain query matching, as there is no extra computation of relevancy scores involved.

Please note that the parts of an Elasticsearch query which sit inside the query context (but outside the filter context) do produce relevancy scores for the documents in the output. Another important thing to note: whatever can be done using a query can equally well be put inside the filter context.

Example13:- Let's say we want to filter for those course documents which contain the keyword 'accounting' in the field 'name'.

Example14:- Let's say we want to filter for those course documents which contain the keyword 'accounting' in the field 'name' and have the value 'bill' for the field 'professor.name'.

Example15:- Let's say we want to filter for those course documents which contain the keyword 'accounting' in the field 'name' AND have the value 'bill' for the field 'professor.name' AND have the value of the 'students_enrolled' field between 10 and 20 (inclusive). Pay attention to the range query used inside the filter context :-
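A sketch of this filter-context query :-

GET /courses/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "name": "accounting" } },
        { "match": { "professor.name": "bill" } },
        { "range": { "students_enrolled": { "gte": 10, "lte": 20 } } }
      ]
    }
  }
}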

Example16:- Let's now see the usage of the Bulk API provided by Elasticsearch. Here, we indexed many documents at once into Elasticsearch.
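A minimal sketch of such a bulk request (the documents are made up; note the newline-delimited action/document pairs) :-

POST /vehicles/_bulk
{ "index": { "_id": "1" } }
{ "make": "Toyota", "Color": "Silver", "price": 12000 }
{ "index": { "_id": "2" } }
{ "make": "Honda", "Color": "Black", "price": 9500 }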

Example17:- Now, when we invoke the search (GET) call, it does not fetch all the documents; by default it fetches only 10 documents from Elasticsearch. In order to specify the count of documents to be fetched, we can specify the "size" parameter :-
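A sketch :-

GET /vehicles/_search
{
  "size": 20,
  "query": { "match_all": {} }
}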

Example18:- Say we want to fetch records 0 to 5, sorted by price in decreasing order; below is the query for the same. Please note that first all matching records are identified by Elasticsearch (as evident from the value of the field 'hits.total.value'), then the sorting is applied over all of those records, and then the pagination (fetching records 0 to 5) is applied on top.
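A sketch of pagination plus sorting :-

GET /vehicles/_search
{
  "from": 0,
  "size": 5,
  "query": { "match_all": {} },
  "sort": [
    { "price": { "order": "desc" } }
  ]
}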

Example19:- Let's now look at aggregations over the data in Elasticsearch. The simplest aggregation is a COUNT. Let's count the number of vehicles whose value for the field 'make' is 'Toyota'.

Example20:- Let's say we want to get the aggregate count of vehicles per colour; we can apply a terms aggregation as shown below. Please note that using the '.keyword' sub-field is crucial here, because the plain 'text' field is analysed and cannot be used directly for a terms aggregation, whereas the '.keyword' sub-field (which Elasticsearch creates automatically for dynamically mapped text fields) holds the exact, non-analysed value.
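A sketch of this terms aggregation :-

GET /vehicles/_search
{
  "aggs": {
    "count_by_colour": {
      "terms": { "field": "Color.keyword" }
    }
  }
}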

Example21:- Let's say we want to get the aggregate count of vehicles per make, and, for each make, the maximum and minimum price; we can use the approach below with a custom aggregation. Please note that the aggregation runs over the scope of the specified query, i.e. "all the records" here. Another point to note is that an aggregation query by default also returns the matching records under the 'hits' key; in case we don't want the records themselves to be fetched, we can specify "size" as 0.
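A sketch of a terms aggregation with min/max sub-aggregations :-

GET /vehicles/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "by_make": {
      "terms": { "field": "make.keyword" },
      "aggs": {
        "max_price": { "max": { "field": "price" } },
        "min_price": { "min": { "field": "price" } }
      }
    }
  }
}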

Example22:- There also exists a shortcut for what we did above: to get these statistics directly, we can use the 'stats' aggregation as shown below :-
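A sketch; 'stats' returns count, min, max, avg and sum in one go :-

GET /vehicles/_search
{
  "size": 0,
  "aggs": {
    "price_stats": {
      "stats": { "field": "price" }
    }
  }
}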

Introducing Logstash :- Logstash is a powerful open-source data-processing pipeline. It is used to ingest data from a multitude of sources simultaneously.

There are 3 stages through which data can pass in a pipeline: input, filter and output.
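A minimal sketch of a pipeline configuration file showing the three stages (the file path, grok pattern and index name are illustrative) :-

input {
  file {
    path => "/var/log/sample-app.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "app-logs"
  }
}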

Starting LogStash locally :-

This assumes that Elasticsearch and Kibana are both already running in separate terminals. With the '-e' option, we start Logstash with the pipeline configuration provided on the fly, directly on the command line.

aditya-MAC:bin aditya$ ./logstash -e 'input { stdin {}} output { stdout{}}'

Sending Logstash logs to /Users/aditya/Documents/LEARNING/ELK-Stack/MY-WORKS/logstash-7.8.1/logs which is now configured via log4j2.properties

The stdin plugin is now waiting for input:

[2020-10-08T07:21:11,382][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}

[2020-10-08T07:21:11,756][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}

Let's see Logstash in action :- we type some data on standard input and receive the same on the output, with a timestamp added automatically.

Charitableness should not be forgotten. Its a duty of mankind to mankind.

{
      "message" => "Charitableness should not be forgotten. Its a duty of mankind to mankind.",
         "host" => "aditya-MAC",
     "@version" => "1",
   "@timestamp" => 2020-10-08T01:55:19.368Z
}
