In this blog, we are going to study about the following concepts :-
- Fundamentals of working with Elastic-Search.
- Launch ES-Instance & Kibana-Instance on our Local-Machine.
- Comparative study of ES and RDBMS.
- Versioning of documents with ES.
- Ingesting, Updating, Deleting and Querying to a document in ES.
- ES-Storage mechanism.
- Achieving Fault-Tolerance with Replica-Shards in ES.
- Achieving Parallelisation with Replica-Shards in ES.
- What is a Shard and Index in ES.
- What is an Index in ES.
- Recommended Shard-size for our ES-Cluster.
- Recommended Shards-count for ES-Cluster.
- Various types of Nodes in ES.
- Recommended number of nodes for our ES-Cluster.
- Demonstrate any production-grade ES-Index.
Fundamentals of Elastic-Search:-
ElasticSearch is a Document-oriented search-engine, which means we can search & delete the documents into the elastic-search. The purpose of the elastic-search is to facilitate searching, as pointed right in its name. Its very powerful and does the searching in lightning fast speed. For e.g. Just like Google, it searches and finds the relevant matching documents. Similarly, Elastic-Search retrieves those documents which matches to the keywords in the search query. It supports :-
- Analyze the Documents.
- Indexing of Documents.
- Deletion of Documents.
- Retrieval of Documents.
- Search of Documents.
The way, it works is similar to the way the Index works in the Book.
With the help of Index, we can quickly find all those page-numbers where a particular keyword occurs and hence the user can directly flip to the concerned pages. The Index can be considered as the Search-Engine for that book. This technique allows for very efficient searching capabilities, which we can leverage to build a search-engine like the Elastic-Search. In case of Elastic-search, it maintains the Inverted index, where every word (belonging to all the sentences or documents) can make its way to Inverted-index. For e.g. for every term, it would maintain the map of its corresponding occurrence into the respective documents. The english casing of the words also plays important role, based upon which the different tokens gets sorted. Elastic-search can index billions of documents, as it is very very scalable.
Whenever, search-operation is being performed over the Elastic-search, it pays more attention on the relevancy of the documents, before fetching them. Relevancy-scores are assigned to the documents, depending upon the match between the actual documents and terms. Lets try to setup the elastic-search at our machine.
Step.1) First step is to launch the Elastic into a terminal window.
aditya-MAC:kibana-7.8.1-darwin-x86_64 aditya$ pwd
aditya-MAC:bin b0218162$ ./elasticsearch
[2020–10–02T17:12:36,962][INFO ][o.e.n.Node ] [aditya-MAC] version[7.8.1], pid, build[default/tar/b5ca9c58fb664ca8bf9e4057fc229b3396bf3a89/2020–07–21T16:40:44.668009Z], OS[Mac OS X/10.13.6/x86_64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/14.0.1/14.0.1+7]
[2020–10–02T17:12:36,987][INFO ][o.e.n.Node ] [aditya-MAC] JVM home [/Users/aditya/Documents/LEARNING/ELK-Stack/MY-WORKS/elasticsearch-7.8.1/jdk.app/Contents/Home]
Step.2) Check if Elastic is running fine :-
aditya-MAC:elasticsearch-7.8.1 aditya$ curl http://localhost:9200
“tagline”: “You Know, for Search”
Step.3) Start the Kibana UI on top of Elastic :-
aditya-MAC:kibana-7.8.1-darwin-x86_64 b0218162$ cd bin/
aditya-MAC:bin aditya$ ./kibana
log [12:01:18.474] [warning][plugins-discovery] Expect plugin “id” in camelCase, but found: apm_oss
log [12:01:18.508] [warning][plugins-discovery] Expect plugin “id” in camelCase, but found: triggers_actions_ui
log [12:01:29.828] [info][plugins-service] Plugin “visTypeXy” is disabled.
log [12:01:29.828] [info][plugins-service] Plugin “endpoint” is disabled.
log [12:01:29.829] [info][plugins-service] Plugin “ingestManager” is disabled.
log [12:01:29.829] [info][plugins-service] Plugin “lists” is disabled.
log [12:01:31.751] [warning][config][deprecation] Setting [elasticsearch.username] to “elastic” is deprecated. You should use the “kibana_system” user instead.
log [12:01:31.751] [warning][config][deprecation] Config key [monitoring.cluster_alerts.email_notifications.email_address] will be required for email notifications to work in 8.0.”
log [12:01:31.751] [warning][config][deprecation] Setting [monitoring.username] to “elastic” is deprecated. You should use the “kibana_system” user instead.
Step.4) Check if Kibana is running fine :-
KIBANA accessible at :- http://localhost:5601/app/kibana#/home. Further Kibana can only run, if and only if, Elastic is up and running.
Step.5 ) Check health of Elastic through Kibana UI :-
There is an option to monitor the health of Kibana through Kibana dashboard. We can monitor the JVM heap usage, because Elastic is ultimately running on the top of JVM only.
Lets now see, how any elastic document looks like in Elastic :-
In contrast to RDBMS where the data is being stored in form of rows, the document in elastic is a JSON object. The documents in the Elastic are Immutable in nature. In contrast to RDBMS, where a particular column can be modified, same is not possible in the Elastic-search.
The Elastic-Search is based upon the Lucene. Its underlying principle is based upon Inverted-Index. This maps the words to the actual document-locations, where they occur. Elastic-Search also allows the concept of Full-text-search.
Interacting with Elastic-Search using out-of-the-box API endpoints :- The process of ingesting the data into the Elastic-Search is known as Indexing. To index a document into the elastic-search, it means, to insert a document into the elastic-search. The process of indexing is slowest. Lets see, how can we do this operation using the RESTful APIs being exposed by Elastic-Search :-
The ‘type’ is kind of a sub-division of an Index. For e.g. for a vehicle, sub-divisions can be like : trucks, motor-cycles, cars, etc. This ‘type’ field is getting dropped from Elastic-7 version onwards i.e. we won’t be allowed to have multiple sub-types for a given index going forward from elastic-version-7 onwards and we won’t be even having the type as part of the url. Lets see an real-time example of Indexing an document into the Elastic-search :-
If we don’t supply the id while creating a document, elastic-search would automatically create the id for us, but it’s highly recommended to supply an ‘id’ while indexing a document. So, we send the data to elastic-search using JSON syntax and we get the data from elastic-search in JSON format. This is how, we communicate with elastic-search. Here’s how the output looks like :-
The ‘_version’ represents the version-id of this document. If we try to update this document, we would see the updated version. Also, if we go on to reindex the new document on the same _id, then its going to overwrite the earlier document and increment the _version for the same. Any field starting with an underscore, is being something meta-information, which is being maintained by elastic itself. Also, notice that “result” here is being shown as ‘created’. Lets update this document couple of times, to see whether the _version increases or not :-
Notice here that “result” here is being shown as ‘updated’. Also, the “_version” value is being incremented. So, whenever we change even a particular field in any of the document in elastic, it would delete the existing document completely and then index the document in a fresh manner. The important point here to note here is that, the document in elastic is changed completely at all, whenever any particular field is being updated. Now, lets try to fetch this document as shown below. Here, In this case, it gives us the output and indicate value of ‘found’ being true. The main document is being returned into the field called ‘_source’.
Lets, see what happens, in case we try to do GET on a document, which doesn’t exists :-
In above case it shall give us the output and indicate value of ‘found’ being false. Lets try to get now, only the actual document that we indexed and NOT the full document.
Here, we have used the ‘_source’ in the URL itself and it doesn’t returns us the meta-fields in the response JSON. Now, if we only want to check, whether a particular document really exists in the particular index or not, we use HEAD command :-
Elastic-Search also provides us an API to update a document, but it works exactly the same way of doing a POST operation. Lets modify the value of a particular field in a given document :-
Now, lets see whether document really got modified or not :- Indeed yes.
Now, lets see the way to delete this document from elastic-search.
Now, if we try to do GET over this document again, we might not be able to get this document anymore. Although, NO document is being returned to us through the end-point, but the document still exists somewhere and elastic-search has just marked that document as deleted and later all those types of documents(i.e. marked as deleted) shall purge those documents. This process happens under the hoods. Therefore, those documents shall not be deleted immediately, hence the disk-space might not be freed-up immediately.
Elastic-search is also capable of automatically deducing the data-type of the various fields of the document. For the document, that we above indexed, lets see what data-type, elastic have deduced. For example, data-type of the field ‘price’ has been rightly deduced as ‘float’ and data-type of the field ‘Color’ has been deduced as ‘Text’.
Now, say, if new documents are added(to the existing index), which have additional fields, then elastic-search would automatically adjust its structure i.e. mappings. For example, say we added a new document(as demonstrated below), which have additional field ‘driver’, then mappings of Elastic-Search would automatically re-adjust and its mappings now looks like :-
Next, we can only have a single ‘type’ in the particular index. Therefore, if we try to create another ‘type’ within same index, elastic-search shall throw an exception :-
However, we can create a document of entirely different format in the same existing index as demonstrated below :- (Pay attention to difference in endpoints being used for above and below case).
Recommended Elastic-Search storage mechanism :-
Elastic-search is a very very scalable data-store for storing millions & billions of documents. Under the hoods, any Index is physically stored in form of Shards. A particular index could be very well split into multiple Shards. So, if we define an index into say, 2 shards, then storage shall be divided into 2 storage-parts i.e. Shards i.e. All the documents, actually gets stored into 2 shards. These 2 Shards are also called as primary shards, because these are actually our data-stores.
Now, we can have a set-up, where each of these shards are being powered on 2 different instances (Note that, both of these machines are the parts of same elastic cluster). So, for example, we can very well have Shard-0 being hosted on Node1(i.e. Instance-1) and Shard-1 being hosted on Node2(i.e. Instance-2).
Question:- Replica-Shards helps in achieving the Fault-Tolerance. Explain the concept ?
Answer → This is very important and critical function played by Replica-Shards :-
- We must also have replica-shards (which shall be exact copy of the primary-shards) in order to make our system Fault-Tolerant.
- For example, say the disk of Node-1 fails due to some reason, in that case, our client’s requests can still be served through the Node-2, since we still have the replica-shard (R0) available for primary-shard (P0).
- Therefore, one of the critical function of Replica-Shard is to make sure that, our system can tolerate the fault/system-failure upto a reasonable extent.
- Say, we only have 1 machine as part of the cluster, then we would surely have both the shards being present on the same node and in case this machine goes down, we shall be in the problem and our System is no-more Fault-Tolerant.
- Similarly, if we don’t have any replica-shards, and both of the primary shards are being deployed at different machines and in case any one of the node goes down, we shall end-up loosing half of our data.
Next, we are free to send our request to any node of our Elastic-cluster. Every node on the cluster is fully capable of serving any sort of the request, as every node in the cluster knows about every document in the cluster and thus, it can forward the request directly to that node, in case it doesn’t have the required data. For e.g. Say, Client requests to index the document(with id as 2) to Node2, then it may redirect the request to Node1 and then data shall be replicated to ReplicaNode1. The same is demonstrated in below diagram. Also, Please be understood that, Index in Elastic-search is merely an logical representation of how the data is being organised across the shards ? To ingest the data (i.e. in order to decide the shard, where the data would go), Elastic-search inherently uses the Hashing.
For serving the GET request, Elastic uses the Round-Robinin approach based upon the load-balancing. Say a request came for id as 2, then it can either be served from node-1 hosting the primary-shard or it can be served from node-2 hosting the replica-shard. Round-Robin means to divide the traffic over set of nodes, in a particular fashion rather than just banging on a particular node.
Question:- Replica-Shards helps in achieving the Parallelization. Explain the concept ?
Answer → This is yet another role played by Replica-Shards :-
- Imagine that we have 1 Primary-Shard (P0) and it’s 2-Replica-Shards (R0 and R1). Now, the requests shall be served by all 3 shards i.e. primary as well as both the replicas.
- If we have lot of traffic i.e. concurrency of user-query is too huge, then having more than one replicas, certainly proves to be helpful.
Note that, it completely depends upon the benchmarking of the system in order to arrive at the rationalised number of replica-shards. It depends upon lot of factors including :-
- System’s BAU traffic on ES.
- Volume of the data being housed into the ES.
- Related benchmarking that you need to perform.
Question :- What is a Shard ? Explain about it ?
Answer → A Shard is basically a LUCENE index. Lucene was started around 1999. Elastic-Search made Lucene, distributed and also it provides the basis for all complex search-queries that elastic-search can serve.
- A Shard is a physical container of the data. Each node contains a shard generally and on this shard, some part of the data is present.
- Now, each shard can have multiple segments in it. A Segment can also be called as Inverted-Index. Below is how, a multiple segments inside a shard, looks like :-
Question :- Demonstrate the creation of an ES-Index on a single/local machine ?
Answer → See below :-
- We have defined an ES Index with 1 primary-shard and 1 replica-shard.
- Screenshot below showcases that, health of our ES-Index is YELLOW.
Question :- Demonstrate the meaning of YELLOW status for ES-Index ?
Answer → See below, one of the replica-shard for our ES-Index is in UNASSIGNED state, because there is no other node being available into the cluster yet and therefore ES can’t map the replica node to the same node.
Question :- Explain ES Architecture with reference to Shards ?
Answer → This Shard is a self-contained Lucene index of its own. Shard may live on any node on a cluster.
- If we have cluster of machines, then we can spread these shards on these machines. Every document is hashed to a particular shard based upon some mathematical formula.
- In other words, every shard owns some set of documents. We can also specify the resiliency against failure, using the replication-factor. So, for every primary-shard, there would be replica-shards.
Question :- What is Index in ElasticSearch ?
Answer → Say, we have 2 shards and data is being divided into these 2 shards, then sum total of both of these documents from cluster of 2 nodes, forms an Index in Elastic-Search. See below :-
Question:- Explain what is the Recommended shard-size ?
Answer → The shard is the unit at which Elasticsearch distributes data around the cluster. Using shards, also brings in the horizontal-scalability to the table.
- Lets take a very simple example to demonstrate the example of shard :- Say we have an Index of documents, which is going to contain the data somewhere less than 10 TBs and we have 200 servers, each having hard-disk of 60 GB.
- In this case, since we have got 200 servers, we can specify 200 primary-shards for this Index, where each primary-shard would live on a unique server. Here, each shard would store data somewhere around (10 TB/ 200) i.e. 50 GBs and would live on separate machine. Note that, each of the server/node have got close to 60 GBs of hard-disk.
Thats how, we shall be storing a very-large Index. We have liberty to define the different no. of shards for different indexes in Elastic-Search.
Question:- For production-grade systems, what’s the recommended shard-size ?
Answer → Generally for production grade systems, we should avoid having very large shards as this can negatively affect the cluster’s ability to recover from failure. There is no fixed limit on how large shards can be, but a shard size of 40 to 50GB is often quoted(& recommended from Elastic Team side) as a limit that has been seen to work for a variety of use-cases.
Question:- How does ES-Query works, in case data is being housed amongst large-number of shards like 200, as asked above ?
Answer → In Elasticsearch, each query is executed in a single thread per shard.
- When we run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back.
- Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard. Therefore, having multiple shards would improvise the performance usually, but if the shard-size is too small, it can also cause overhead in processing & hence increase latency.
Question:- What’s the recommended no of shards, that we should house on every node in our ES-Cluster ?
Answer → The number of shards that we can hold on a particular node will be proportional to the amount of heap we have available, but there is no fixed limit enforced by Elastic-Search.
- A good rule-of-thumb is to ensure, we keep the number of shards per node below 20 per GB heap it has configured.
- Let’s take an example: We have a node with a 30GB of heap-memory. At this node, ideally we can have a maximum of 600 shards, but the further below this limit, performance shall be better. This will eventually help the cluster to stay in good health state.
Please note that, both attributes i.e. “each shard size” and “no of shards” for any ES-Index, directly impacts the speed at which ES can move shards in case of a node-failure.
Question:- What are various types of Nodes in ElasticSearch ?
Answer → There are 4 types of nodes in ElasticSearch :-
- Master Node → These are the supervisor nodes for all other nodes in the same cluster. This node is responsible for actions like Creating & Deleting an Index, Tracking which nodes are part of the cluster and allocating the shards to other nodes.
- Master-Eligible Node → There is a property called “node.master” in elastic.yml file. If this property is set to be true (by default, it is set to be true), then this node is eligible to become a master node. Lets take an example: We have multi-node cluster with 1 master-node. In case, the server which is master node fails, the nodes which are eligible for becoming the new master, competes through a process called as Master-Election-Process and new master is being elected.
- Data Node → This node holds the data and performs the operations such as CRUD, Search and Aggregations. To make a node as Data-Node, the property called “node.data” in elastic.yml file should be set to true (by default, it is set to be true).
- Ingest Node → This node is used to pre-process the document, before the document is actually indexed into the Elastic-Search. To make a node as Ingest-Node, the property called “node.ingest” in elastic.yml file should be set to true (by default, it is set to be true).
- Tribe Node → This node is used for coordination purpose.
For a production-grade systems, there should be dedicated master, data and ingest nodes.
Question:- In production grade systems, what’s the recommended number of nodes ?
Answer → As a general practice for production grade systems, its advisable to have Odd number of nodes for resiliency.
Question:- Where does ES routes the writes & reads ?
Answer → Here is how the ES handles the Writes & Reads :-
- Write-requests would be routed to the primary shards and then those writes would be automatically replicated to replica shards. Write-capacity would be bottlenecked by the number of primary shards.
- Read requests can be routed to the primary or replica shards. The read-load can be bifurcated to multiple shards and hence the Read-capacity can be scaled.
Question:- What’s the important aspect while defining the number of primary-shards for any given ES-Index ?
Answer → Please note the following important aspects :-
- The number of primary-shards for any index can’t be changed, after the index has been created.
- However, the number of replica-shards can very well be changed for enhancing read-throughput.
Generally, most of the applications of the ES are read-heavy and based upon the increasing need of our business, we can certainly increase the count of replica-nodes i.e. Replication-Factor as well.
Question:- How do we enhance the Write-Capacity for a given ES-Cluster ?
Answer → If we really have the need to add more primary-shards, we can definitely re-index our data into a fresh new Index with more number of primary-shards.
Question:- Demonstrate the count of shards for any given ES-Index ?
Answer → In below example picture, we are specifying the 2 primary shards and 1 replica, therefore, we shall end-up with total of 4 shards for this Index ‘customers’.
Question:- Demonstrate any ES-Index for a production-grade system ?
Answer → Below is an example of production grade Index :-
Aspect #1.) Data-Size and expected growth → Our entire customer’s data is somewhere close to 95 GB.
- Since the growth of business is very much expected in today’s Internet businesses, therefore data usually keeps on growing on a daily basis.
- We assume that, in next 2–3 years, the overall size of the customer’s data with our company shall be somewhere 4 to 5 TBs.
Aspect #2.) Defining Count of Primary-Shards → We have defined around 10 Primary-Shards for this much of Data.
- Therefore each shard’s size comes out to be around ~8 GBs each. Note that, we can’t increase the number of primary-shards, once an Index is created and the recommended size of each shard is 40 to 50 GBs, therefore we have enough of room for these 11 primary-shards to grow for the next 2–3 years.
- Recall that, we have assumption that in next 2–3 years, the overall size of the customer’s data with out company shall be somewhere 4 to 5 TBs.
Aspect #3.) Defining Count of Secondary-Shards →
- We are proceeding ahead with replication factor of 1, which means that we would have ONE copy of each of our primary-shard. Therefore we now have :- 11 Primary Shards count and 11 Secondary-Shards count.
- Note that, For each primary-shard, there do exists a replica-shard as well. If the primary-shard lives on IP1, its replica-shard would generally be living on some different IP than IP1.
Aspect #3.) Overall sizing of our ES-Cluster →
- The overall current size of our customer_index, is 95 * 2 == ~200 GBs.
- We have got 5 different machines into our ES-Cluster, which houses 11 primary-shards as well as 11 replica-shards.