ELK Dealing with Mappings | Part4

In case you are landing here directly, it’s recommended to read through this documentation first.

Following are the topics that we shall touch upon in this blog :-

  • Handling Nested Fields with ES.
  • Problem of Mapping-Explosion.
  • Flattened Data Type.
  • Partial update of a document by its DocId using the POST verb.
  • Keyword based search query on flattened data fields.
  • Explicit & Dynamic Mappings of the fields in ES.
  • Mapping Parsing Exceptions & resolution.

Question: Let’s proceed to understand the concept of nested fields in ElasticSearch.

Answer:- So far in this blog series, we have only handled short documents with just a few fields. However, if we need to handle documents with many inner fields, ElasticSearch’s performance can start to suffer.

Question: Why does the performance of ElasticSearch suffer in the case of nested fields ?

Answer:- This is because, with dynamic mappings, each subfield gets mapped to an individual field by default. This usually leads to the problem of Mapping-Explosion. ElasticSearch offers the flattened data type to avoid mapping every subfield as an individual field; instead, the whole object is mapped as one flat field containing the original data.

Question: Show a sample log-line to be ingested into the “sys_log” index ?

Question: Show another sample log-line to be ingested into the “sys_log” index ?

Question: Let’s go ahead and ingest the log-line into the “sys_log” index ?
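
The ingestion request itself is not reproduced in this post. A minimal sketch, assuming a single local node on port 9200, the same syslog line that is indexed again later in this post, and an auto-generated document ID:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"

# Assumption: the same log line that is indexed again further below.
LOG_LINE='{
  "message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
  "fileset": { "name": "syslog" },
  "process": { "name": "org.gnome.Shell.desktop", "pid": 3383 },
  "@timestamp": "2020-03-09T18:00:54.000+05:30",
  "host": { "hostname": "bionic", "name": "bionic" }
}'

# POST without an ID lets ElasticSearch generate one. With default settings,
# the index is created on the first write and its mappings are deduced dynamically.
curl --location --request POST "$ES_URL/sys_log/_doc" \
  --header 'Content-Type: application/json' \
  --data-raw "$LOG_LINE" \
  || echo "Request failed: is ElasticSearch reachable at $ES_URL?"
```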

Now, though we did not define the field types and mappings explicitly, ElasticSearch is smart enough to deduce the mappings automatically :-

curl --location --request GET 'http://127.0.0.1:9200/sys_log/_mapping'

Here is the response from above query :-

{
"sys_log": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"properties": {
"hostname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}

Question:- What’s wrong with documents that contain many fields ?

Answer:- Well, the answer is that they can cause your ElasticSearch cluster to go down.

  • Each field has an associated mapping type in its index. These types can be specified by the user, or ElasticSearch can assign them to the fields automatically.
  • ElasticSearch holds the mapping information of every index in its cluster state.
  • Besides the index mappings, the cluster state also includes information such as the node details.

Question:- How can we retrieve the current cluster state ?

Answer:- We can retrieve the current cluster state by using the cluster state API. Though the full response is very big, we can restrict it to the data of our choice.
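
A minimal sketch of such a request, assuming a local node on port 9200 and that we are only interested in the metadata of the sys_log index. The cluster state API accepts a metric and an index name in its path, which keeps the response small:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="sys_log"

# /_cluster/state/<metric>/<index> limits the response to one metric
# (here "metadata", which contains the mappings) and to one index.
STATE_URL="$ES_URL/_cluster/state/metadata/$INDEX"

curl --location --request GET "$STATE_URL?pretty" \
  || echo "Request failed: is ElasticSearch reachable at $ES_URL?"
```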

Question:- How is ES usually set up for logging scenarios ?

  • ElasticSearch is typically set up as a cluster. An ElasticSearch cluster is a collection of ElasticSearch nodes.
  • The presence of multiple nodes allows ElasticSearch to perform better indexing and searching operations.

Here, the cluster state is passed between the nodes so that the cluster runs smoothly. There is a master node that sends the latest cluster state to all the other nodes. Upon receiving the cluster state, every other node sends an acknowledgement back to the master node.

Question:- What’s the impact of creating or updating a mapping on the cluster state ?

Answer:-

  • For each new field added to a document, a new mapping is created by ElasticSearch.
  • For each mapping update in the index, the cluster state also changes.
  • After each cluster state change, the other nodes need to be synced.

Question:- Why is the frequent addition of new fields not recommended ?

Answer:- Frequently adding new fields to an index not only causes the cluster state to grow, but also triggers cluster state updates across all nodes, which can result in delays if pushed far enough.

Question:- Now, you may ask, what’s the importance of the updated cluster state and why does it have to be propagated to all other nodes at all ?

Answer:- It’s because :-

  • Without the updated cluster state, nodes aren’t able to perform basic operations like indexing and searching.
  • A large, frequently changing cluster state can cause memory issues within the nodes, result in poor performance and possibly lead to the cluster itself going down.

Question:- What is this situation called ? Is it a problem ?

Answer:- Yes, this is a problem. When an ElasticSearch cluster crashes because of too many fields in a mapping, we call this a mapping explosion. The risk is especially high in environments that handle heavy loads or don’t have enough hardware to power them.

Question:- What’s the solution to this challenge of Mapping-Explosion ?

Answer:- In order to help prevent mapping explosions, ElasticSearch introduced the flattened data-type.

  • Essentially, this data type maps an entire object, along with its inner fields, into a single field.
  • In other words, if a field contains inner fields, the flattened data type maps the parent field as a single field of type flattened, and the inner fields don’t appear in the mappings at all, thereby reducing the total number of mapped fields.

Question:- How do we define the type of a particular field in ES as flattened ?

Answer:- First, we define the mappings for our new index, with the type of the host field set to flattened, and then we index our first document into it.
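
The index-creation request is not reproduced in this post. A minimal sketch, assuming the index name used by the requests that follow, a local node on port 9200, and a mapping that declares only the host field explicitly:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"

# Only "host" is mapped explicitly; every other field is left to dynamic mapping.
MAPPING='{
  "mappings": {
    "properties": {
      "host": { "type": "flattened" }
    }
  }
}'

# Create the index with this mapping before any document is indexed into it.
curl --location --request PUT "$ES_URL/sys_log_flatened" \
  --header 'Content-Type: application/json' \
  --data-raw "$MAPPING" \
  || echo "Request failed: is ElasticSearch reachable at $ES_URL?"
```

The mapping has to exist before the first document arrives, because otherwise dynamic mapping would already have mapped host as a regular object with individual subfields.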

curl --location --request PUT 'localhost:9200/sys_log_flatened/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic"
}

}'

Question:- Which data-types shall we see, if we now inspect the mappings of this Index in ES ?

Answer:- Note that the data-type of the field ‘host’ is flattened, whereas the data-type of the other field ‘process’ is still auto-deduced by ES, because we did not define any mapping for it.

curl --location --request GET 'http://localhost:9200/sys_log_flatened/_mappings'

Question:- For the first document 📃 that we ingested above, let’s partially update it, i.e. let’s add some more inner fields under the “host” field. Recall that the data-type we defined for the “host” field was flattened.

curl --location --request POST 'localhost:9200/sys_log_flatened/_update/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"doc" : {
"host" : {
"osVersion": "Bionic Beaver",
"osArchitecture":"x86_64"
}
}
}'

Question:- For the document 📃 that we just updated above, let’s check whether it has really been modified ?
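
The request for this check is not reproduced in this post. A minimal sketch, assuming a local node on port 9200 and the document ID 1 used above:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"
DOC_URL="$ES_URL/sys_log_flatened/_doc/1"

# Fetch document 1 by its ID. After the partial update, the keys osVersion and
# osArchitecture should appear under "host" in _source, next to hostname and name.
curl --location --request GET "$DOC_URL?pretty" \
  || echo "Request failed: is ElasticSearch reachable at $ES_URL?"
```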

Question:- Let’s inspect the mappings of this Index once again. Have we solved the problem of Mapping-Explosion ?

Answer:-

  • As expected, the mappings remain as they were, because of the flattened data-type of the ‘host’ field.
  • As intended, the newly added inner fields have not been added to the mappings of this Index. This is an important feature for many real-world scenarios, as it reduces the size of the mapping significantly and therefore mitigates the risk of a mapping explosion. So far, we have seen why and how the flattened data type is used.
{
"sys_log_flatened": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"type": "flattened"
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}

Question:- Well, the problem of Mapping-Explosion is fixed. Are there any side-effects too ?

Answer:- Yes, there are certain limitations to be aware of when using the flattened data type :-

  • The inner fields of a flattened object are treated as keywords in ElasticSearch.
  • This means that no analyzers or tokenizers are applied to the flattened fields, which results in a more limited search capability.

Question:- Let’s review which documents are present in our new index so far ?

curl --location --request GET 'localhost:9200/sys_log_flatened/_search'

Question:- Can you demonstrate the behaviour of keyword based search, with some search query on outer fields ?

Answer:- One fundamental thing to note here is that, when we search on the outer field, the values of all its inner fields are searched.

curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host" : "Bionic Beaver"
}
}
}'

Here are the results thus obtained :-

{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.39556286,
"hits": [
{
"_index": "sys_log_flatened",
"_type": "_doc",
"_id": "1",
"_score": 0.39556286,
"_source": {
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic",
"osVersion": "Bionic Beaver",
"osArchitecture": "x86_64"
}
}
}

]
}
}
  • For the match query on “Bionic Beaver”, a matching document is found, because the inner field osVersion of one of our documents holds exactly this text, as the response above shows.
  • This is because the search text we provided is an exact match of the stored value, including the casing in “Bionic Beaver”.

Question:- Can you demonstrate the behaviour of keyword based search, with some search query on inner fields ?

Answer:- ES does support searching on the inner fields of the flattened data-type, but note that it is going to be a keyword-based search.

curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host.osVersion" : "Bionic Beaver"
}
}
}'

Question:- Can you demonstrate the behaviour of performing keyword based search on partial matching data, with some search query on inner fields ?

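Answer:- The partial match doesn’t return any results in this case, because the fields are not analysed: the stored value “Bionic Beaver” is kept as a single keyword, and “Beaver” alone is not equal to it. This is one consideration to keep in mind when choosing the flattened data type.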

curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host.osVersion" : "Beaver"
}
}
}'

And the response thus obtained is :-

{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}

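Here’s a quick summary of the different scenarios of searching with the match query, as seen above :-

  • Match on “host” with “Bionic Beaver”: the document is found, because the values of all inner fields are searched.
  • Match on “host.osVersion” with “Bionic Beaver”: the document is found, because the text equals the stored value exactly.
  • Match on “host.osVersion” with “Beaver”: no document is found, because partial matches don’t work on fields that are not analysed.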

*********************Switching gears to Mappings******************

Question:- What are the possible ways of defining the mapping for a particular Elastic Index ?

Answer:- The mapping defines what we can index, via the individual fields and their data types, and also how the indexing happens, via the related parameters. There are generally two ways in which a mapping can be defined :-

  • Explicit mapping, where you define the fields and the types you want to store, along with any additional parameters.
  • Dynamic mapping, where ElasticSearch automatically attempts to determine the appropriate data type and updates the mapping accordingly.

Question:- What are the possible issues that can occur with mappings ?

Answer:- There are generally two potential issues that we may run into with mappings :-

  • If we go with explicit mapping and the fields of the documents we index don’t match it, we’ll get an exception once the mismatch goes beyond a certain safety zone.
  • If we go with the default dynamic mapping and new documents keep bringing in many more fields, we land in a situation of Mapping-Explosion, which can take our cluster down.

Question:- Let’s go ahead and create a new Index with some mappings ?

curl --location --request PUT 'localhost:9200/microservice-logs' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"properties": {
"timestamp" : {"type" : "date"},
"service" : {"type" : "keyword"},
"host_ip" : {"type" : "ip"},
"port" : {"type" : "integer"},
"message" : {"type" : "text"}
}
}
}'

Question:- Now, let’s ingest a correct document to the afore-created Index ?

curl --location --request PUT 'localhost:9200/microservice-logs/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.210",
"port": 8089,
"message": "Started App!"
}'

Question:- Now, let’s ingest an incorrect document into the afore-created Index ‘microservice-logs’ ?

Answer :-

  • Note that we are indexing the document without specifying an ID of our own, and therefore ES auto-assigns a random ID to the newly ingested document.
  • Also observe that we are ingesting the document with the value of the field “port” as a String and not an Integer.
curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}'

You might ask why this did not throw an exception. The reason is the safety zone: the value of the port attribute that we indexed is a String, but “8089” can still be converted into an integer, so it worked.

Question:- Now, let’s have a look at all the documents present in the Index ‘microservice-logs’ ?

Answer:- We can observe the following documents present in ElasticSearch :-

{
"took": 231,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}
},
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "PzT70X0BhTuHVfBk00NJ",
"_score": 1.0,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}
}
]
}
}

Question:- Now, let’s ingest yet another incorrect document into the afore-created Index ‘microservice-logs’ ?

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}'

Observe above that we supplied the value of the port field as “NONE”. This is obviously outside the safety zone, because NONE is not a number at all, and hence the result is a mapper_parsing_exception.

Question:- Now, let’s instruct ElasticSearch to always accept the documents, come what may, and never throw such exceptions ?

Answer:- For this purpose, we first have to close the Index; only then can we modify its settings.

Note: Now, we go ahead and modify the settings, enabling the ignore_malformed property for the Index.
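
The requests for these two steps are not reproduced in this post. A minimal sketch, assuming a local node on port 9200 and the index-level setting index.mapping.ignore_malformed:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="microservice-logs"

# Step 1: close the index, since this setting is changed on a closed index here.
curl --location --request POST "$ES_URL/$INDEX/_close" \
  || echo "Close request failed: is ElasticSearch reachable at $ES_URL?"

# Step 2: enable ignore_malformed for the whole index.
SETTINGS='{
  "index.mapping.ignore_malformed": true
}'

curl --location --request PUT "$ES_URL/$INDEX/_settings" \
  --header 'Content-Type: application/json' \
  --data-raw "$SETTINGS" \
  || echo "Settings request failed: is ElasticSearch reachable at $ES_URL?"
```

While the index is closed it rejects indexing and search requests, so it has to be opened again before the next document is ingested.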

Question:- How do we see, what are the current settings in place for the particular Index ?

curl --location --request GET 'http://localhost:9200/microservice-logs/_settings'

Question:- Let’s now open our Index, so that we can perform certain operations with it ?

curl --location --request POST 'http://localhost:9200/microservice-logs/_open'

Question:- Now, let’s again ingest the same incorrect document into the afore-created Index ‘microservice-logs’ and observe the behaviour ?

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}'

Observe above that we again supplied the value of the port field as “NONE”. This time, the document gets indexed happily.

Question:- Now, let’s see whether this particular document got ingested into our ElasticSearch or not ?

Answer:- Of course this document got ingested, but observe that the field “port” has been marked as ignored. This is only a partial solution, because this setting has its limits, and they are quite considerable.

curl --location --request GET 'localhost:9200/microservice-logs/_doc/QzQV0n0BhTuHVfBkVkOa'

Question:- So, what are the limitations of the property “ignore_malformed”, and can you kindly demonstrate them ?

Answer:- Yes. The property ignore_malformed can’t handle JSON objects in the input. Recall that the field ‘message’ only allows the text type; if we try to supply a JSON object to this field, it will surely cause an issue.
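
The request for this demonstration is not reproduced in this post. A sketch, assuming a local node on port 9200 and a hypothetical document whose message field carries a JSON object instead of text (the field values are made up for illustration). A request like this is expected to be rejected with a mapper_parsing_exception, even though ignore_malformed is enabled:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical document: "message" is mapped as text, but here it holds an object.
BAD_DOC='{
  "timestamp": "2020-04-11T12:34:56.789Z",
  "service": "ABC",
  "host_ip": "10.0.2.15",
  "port": 12345,
  "message": { "data": { "received": "here" } }
}'

curl --location --request POST "$ES_URL/microservice-logs/_doc" \
  --header 'Content-Type: application/json' \
  --data-raw "$BAD_DOC" \
  || echo "Request failed: is ElasticSearch reachable at $ES_URL?"
```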

Question:- Can you demonstrate how ElasticSearch auto-determines the type of new fields ingested into it ?

Answer:- Yes. In the document below, we ingest an extra field ‘payload’, which is a JSON object. Note that this field was never defined in our original mappings.

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "ABC",
"host_ip": "10.0.2.15",
"port": 12345,
"message": "Application Boot complete...",
"payload": {
"data": {
"received": "Application fetching params."
}
}

}'

Question:- Amazing. Now, what type has ES auto-deduced from the data that we ingested above ?

Answer:- Here are the latest mappings that we have for our Index right now :-

curl --location --request GET 'http://localhost:9200/microservice-logs/_mappings'

Question:- So far so good, but now we get yet another log-line, for which the structure of the payload is different again. So, what would happen ?

Answer:- In the document below, we again ingest the extra field ‘payload’ as a JSON object, but this time its structure is different.

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Customer MS",
"host_ip": "10.0.2.15",
"port": 12345,
"message": "Received...",
"payload": {
"data": {
"received": {
"even": "more"
}
}
}
}'

As we can reasonably guess, this leads to an exception. The reason is that, from the previous document that we ingested into our Index, ES auto-deduced the type of the field “payload.data.received” as text, whereas now it receives an object :-

{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "failed to parse field [payload.data.received] of type [text] in document with id 'RjQ00n0BhTuHVfBk0EPM'. Preview of field's value: '{even=more}'"
}
],
"type": "mapper_parsing_exception",
"reason": "failed to parse field [payload.data.received] of type [text] in document with id 'RjQ00n0BhTuHVfBk0EPM'. Preview of field's value: '{even=more}'",
"caused_by": {
"type": "illegal_state_exception",
"reason": "Can't get text on a START_OBJECT at 9:25"
}
},
"status": 400
}

Question:- So now, what can we do to solve this problem ?

Answer:- We have the following ways to address this problem :-

  • First, the engineers on the team need to be aware of these mapping mechanics. You can also establish shared guidelines for the log fields.
  • Secondly, you may consider what’s called a dead letter queue pattern, which stores the failed documents in a separate queue. This needs to be handled at the application level.
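
A minimal sketch of the dead letter queue idea, assuming a local node on port 9200 and using a plain file as the queue. The function name index_or_park and the file name dead_letter.ndjson are hypothetical, and a real application would more likely use a message queue or a separate index:

```shell
# Assumption: single local node, as in the other requests of this post.
ES_URL="${ES_URL:-http://localhost:9200}"
# Hypothetical dead letter "queue": one rejected JSON document per line.
DEAD_LETTER_FILE="${DEAD_LETTER_FILE:-dead_letter.ndjson}"

# index_or_park DOC: try to index one single-line JSON document. If the response
# status is not 2xx, append the document to the dead letter file for later inspection.
index_or_park() {
  doc="$1"
  # --write-out prints only the HTTP status code; the response body is discarded.
  # If curl itself fails (e.g. connection refused), treat the document as not indexed.
  status=$(curl --silent --output /dev/null --write-out '%{http_code}' \
    --request POST "$ES_URL/microservice-logs/_doc" \
    --header 'Content-Type: application/json' \
    --data-raw "$doc") || status="000"
  case "$status" in
    2??) echo "indexed (HTTP $status)" ;;
    *)   printf '%s\n' "$doc" >> "$DEAD_LETTER_FILE"
         echo "parked in $DEAD_LETTER_FILE (HTTP $status)" ;;
  esac
}

# The document whose payload structure conflicts with the existing mapping.
index_or_park '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "Customer MS", "host_ip": "10.0.2.15", "port": 12345, "message": "Received...", "payload": {"data": {"received": {"even": "more"}}}}'
```

The parked documents can later be fixed and re-indexed, instead of being silently lost at ingestion time.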

That’s all in this section. We will see you in the next part of this series.
