ELK Dealing with Mappings | Part 4

In case you are landing here directly, it’s recommended to read through this documentation first.

Following are the topics we shall be touching upon in this blog :-

  • Handling Nested Fields with ES.

Question: Let’s proceed to understand the concept of Nested Fields in ElasticSearch

Answer:- So far in this blog series, we have only handled short documents with a few associated fields. However, if we need to handle documents with many inner fields, ElasticSearch’s performance can start to suffer.

Question: Why does the performance of ElasticSearch suffer in the case of nested fields ?

Answer:- This is because, with dynamic mappings, each subfield gets mapped as an individual field by default. This usually leads to the problem of Mapping-Explosion. ElasticSearch offers the flattened data type to avoid mapping every subfield as an individual field, and to instead store the original data as one flattened field.
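To get a feel for why deep documents inflate the mappings, here is a minimal sketch (plain Python, no ElasticSearch required; `count_leaf_fields` is a hypothetical helper, not an ES API) that counts how many individual field mappings dynamic mapping would create for one of our log documents:

```python
# Sketch: with dynamic mappings, every leaf subfield of a nested
# document becomes its own entry in the index mappings.
def count_leaf_fields(doc, prefix=""):
    """Count the leaf fields that dynamic mapping would map individually."""
    count = 0
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            count += count_leaf_fields(value, prefix=path + ".")
        else:
            count += 1  # one mapping entry per leaf field
    return count

log_line = {
    "@timestamp": "2020-03-09T18:00:54.000+05:30",
    "message": "FontService request did not receive a response.",
    "fileset": {"name": "syslog"},
    "process": {"name": "org.gnome.Shell.desktop", "pid": 3383},
    "host": {"hostname": "bionic", "name": "bionic"},
}
print(count_leaf_fields(log_line))  # 7 leaves -> 7 mapping entries
```

Even this small document already produces seven mapping entries; documents with deeply nested payloads multiply this quickly.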

Question: Show some sample log-line to be ingested into a “sys_log” index ?

Question: Show another sample log-line to be ingested into a “sys_log” index ?

Question: Let’s go ahead and ingest the log-line into the “sys_log” index ?

Now, though we didn’t define the field types & mappings explicitly, Elastic is smart enough to auto-predict the mappings :-

curl --location --request GET 'http://127.0.0.1:9200/sys_log/_mapping'

Here is the response from above query :-

{
"sys_log": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"properties": {
"hostname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}

Question:- What’s wrong with documents that contain many fields ?

Answer:- Well, the answer is that they can cause your ElasticSearch cluster to go down.

  • Each field has an associated mapping type in its index. These types can be specified by the user, OR ElasticSearch can automatically assign them to the field.

Question:- How can we retrieve the current cluster state ?

Answer:- Now, we can retrieve the current cluster state in our example by using the Cluster State API (GET /_cluster/state). Though the response is very big, we can filter it down to the data of our choice :-

Question:- How is ES usually set up for logging scenarios ?

  • Now, ElasticSearch is typically set up as a cluster. An ElasticSearch cluster is a collection of ElasticSearch nodes.

Here, the cluster state is passed between the nodes so that operations run smoothly within the cluster. There will be a master node that sends the latest cluster state to all the other nodes. Upon receiving the cluster state, all other nodes send an acknowledgement signal back to the master node.

Question:- What’s the impact of any new mapping create/update on the Elastic-Cluster-State ?

Answer:-

  • For each new field added to a document, a new mapping is created by ElasticSearch.

Question:- Why is the frequent addition of new fields not recommended ?

Answer:- Frequently adding new fields to an index not only causes the cluster state to grow, but also triggers cluster state updates across all nodes, which can result in delays if pushed far enough.

Question:- Now, you may ask, what’s the importance of updated cluster state and why at all, is it required to be propagated to all other nodes ?

Answer:- It’s because :-

  • Without the updated cluster state, nodes aren’t able to perform basic operations like indexing and searching.

Question:- What’s this situation called ? Is it a problem ?

Answer:- Yes, this is a problem. When an ElasticSearch cluster crashes because of too many fields in a mapping, we call this a Mapping-Explosion. This is especially true in environments that handle heavy loads or don’t have enough hardware to power them.

Question:- What’s the solution to this challenge of Mapping-Explosion ?

Answer:- In order to help prevent mapping explosions, ElasticSearch introduced the flattened data-type.

  • Essentially, what this data type does is map the entire object, along with its inner fields, into a single field.
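Conceptually, a flattened field keeps one mapping entry and stores the object’s leaf values as keyword terms. A rough sketch of the idea (plain Python; `flatten` is a hypothetical helper, not part of ES):

```python
# Sketch: a flattened field keeps ONE mapping entry and collapses the
# object's leaves into keyword-like terms under that single field.
def flatten(obj, prefix=""):
    """Collapse a nested object into {dotted.path: value} leaf pairs."""
    leaves = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            leaves.update(flatten(value, prefix=path + "."))
        else:
            leaves[path] = str(value)  # leaf values behave like keywords
    return leaves

host = {"hostname": "bionic", "name": "bionic"}
print(flatten(host))  # {'hostname': 'bionic', 'name': 'bionic'}
```

However deep the object grows, the index mappings still carry only the one flattened entry for it.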

Question:- How do we define the type of any particular field in ES as flattened ?

Answer:- First, we define the mappings for our new index “sys_log_flatened”, where the “host” attribute has the type flattened (i.e. "mappings": {"properties": {"host": {"type": "flattened"}}}). Then, we ingest a document into it :-

curl --location --request PUT 'localhost:9200/sys_log_flatened/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic"
}

}'

Question:- What shall be the data-types, if we now investigate the mappings of this Index in ES ?

Answer:- Note that the data-type of the field ‘host’ in ES is flattened, whereas the data-type of the other field ‘process’ is still auto-deduced by ES, because we didn’t define any mapping for the same.

curl --location --request GET 'http://localhost:9200/sys_log_flatened/_mappings'

{
"sys_log_flatened": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"type": "flattened"
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}

Question:- For the first document 📃 that we ingested above, let’s partially update it, i.e. we go ahead and add some more inner fields under the “host” field. Recall that the data-type we defined for the “host” field was flattened.

curl --location --request POST 'localhost:9200/sys_log_flatened/_doc/1/_update' \
--header 'Content-Type: application/json' \
--data-raw '{
"doc" : {
"host" : {
"osVersion": "Bionic Beaver",
"osArchitecture":"x86_64"
}
}
}'

Question:- For the document 📃 that we just updated above, let’s observe whether the document has really been modified ?

Question:- Let’s again investigate the mappings of this Index in ES ? Have we solved the problem of Mapping-Explosion ?

Answer:-

  • As expected, the mappings remain as they were, because of the flattened data-type of the ‘host’ field.
{
"sys_log_flatened": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"type": "flattened"
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}
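The behaviour above can be mimicked in a few lines (a plain-Python sketch of the bookkeeping, not ES internals; `ingest` is a hypothetical helper): with “host” mapped as flattened, documents adding new inner fields under it leave the mappings untouched, whereas a brand-new top-level field would still add an entry:

```python
# Sketch: a flattened field never grows the mappings, no matter how
# many inner fields later documents add beneath it.
mappings = {"host": {"type": "flattened"}}

def ingest(doc, mappings):
    """Mimic dynamic mapping: unseen top-level fields get a new entry."""
    for field in doc:
        mappings.setdefault(field, {"type": "auto-deduced"})
    return mappings

ingest({"host": {"hostname": "bionic", "name": "bionic"}}, mappings)
ingest({"host": {"osVersion": "Bionic Beaver", "osArchitecture": "x86_64"}}, mappings)
print(mappings)  # {'host': {'type': 'flattened'}} -- unchanged

# By contrast, a genuinely new top-level field WOULD add an entry:
grown = ingest({"payload": {"data": 1}}, dict(mappings))
print(len(grown))  # 2
```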

Question:- Well, one problem of Mapping-Explosion is fixed, but are there any side-effects too ?

Answer:- Well yes, there are certain limitations to be aware of when using the flattened data type :-

  • The inner fields of a flattened object will be treated as keywords in ElasticSearch.

Question:- Let’s review so far, what all documents are there in our new index?

curl --location --request GET 'localhost:9200/sys_log_flatened/_search'

{
"took": 19,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "sys_log_flatened",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"
@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic",
"osVersion": "Bionic Beaver",
"osArchitecture": "x86_64"
}
}
}

]
}
}

Question:- Can you demonstrate the behaviour of keyword based search, with some search query on outer fields ?

Answer:- One fundamental thing to note here is that, even if we search on the outer field, all inner fields shall be searched.

curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host" : "Bionic Beaver"
}
}
}'

Here are the results thus obtained :-

{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.39556286,
"hits": [
{
"_index": "sys_log_flatened",
"_type": "_doc",
"_id": "1",
"_score": 0.39556286,
"_source": {
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"
@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic",
"osVersion": "Bionic Beaver",
"osArchitecture": "x86_64"
}
}
}

]
}
}
  • For the match query for “Bionic Beaver”, a matching document is found, as the field osVersion in one of our documents has the matching text, as shown in the response above.

Question:- Can you demonstrate the behaviour of keyword based search, with some search query on inner fields ?

Answer:- ES does support searching on inner fields of the flattened data-type, but note that it’s going to be a keyword-based search.

curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host.osVersion" : "Bionic Beaver"
}
}
}'

Question:- Can you demonstrate the behaviour of performing keyword based search on partial matching data, with some search query on inner fields ?

Answer:- The partial match doesn’t return any results in this case, because the fields are not analysed. This is one consideration to keep in mind when choosing the flattened data type.

curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host.osVersion" : "Beaver"
}
}
}'

And the response thus obtained is :-

{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}

Here’s a quick summary of the different scenarios of searching with the match query.
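That summary can be condensed into a tiny model of keyword matching on flattened fields (plain Python sketch; exact string equality stands in for ES keyword semantics, and `keyword_match` is a hypothetical helper):

```python
# Sketch: flattened leaves are keyword-matched -- the query term must
# equal a stored leaf value exactly (no analysis/tokenisation).
host = {"hostname": "bionic", "name": "bionic",
        "osVersion": "Bionic Beaver", "osArchitecture": "x86_64"}

def keyword_match(obj, field, term):
    """field == 'host' searches every leaf; 'host.x' searches one leaf."""
    if "." in field:
        leaf = obj.get(field.split(".", 1)[1])
        return leaf == term
    return term in obj.values()

print(keyword_match(host, "host", "Bionic Beaver"))           # True: outer field, all leaves
print(keyword_match(host, "host.osVersion", "Bionic Beaver"))  # True: inner field, exact term
print(keyword_match(host, "host.osVersion", "Beaver"))         # False: partial term, not analysed
```

These three cases correspond exactly to the three match queries we ran above against "sys_log_flatened".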

*********************Switching gears to Mappings******************

Question:- What are the possible ways of defining mapping for the particular Elastic Index ?

Answer:- The mapping process defines what we can index, via individual fields and their data types, and also how the indexing happens, via related parameters. There are generally two ways mapping can happen :-

  • An explicit mapping process, where you define what fields and their types you want to store, along with any additional parameters.

  • A dynamic mapping process, where ElasticSearch automatically deduces the fields and their types from the documents being indexed.

Question:- What are the possible issues, that can happen with Mappings ?

Answer:- There are generally two potential issues that we may end up with, regarding mappings :-

  • If we go with Explicit-Mapping, and the fields (in the documents we index) don’t match, we’ll get an exception beyond a certain safety zone.

  • If we go with Dynamic-Mapping, new fields keep getting added to the mappings automatically, which can eventually lead to a Mapping-Explosion.

Question:- Let’s go ahead and create a new Index with some mappings ?

curl --location --request PUT 'localhost:9200/microservice-logs' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"properties": {
"timestamp" : {"type" : "date"},
"service" : {"type" : "keyword"},
"host_ip" : {"type" : "ip"},
"port" : {"type" : "integer"},
"message" : {"type" : "text"}
}
}
}'

Question:- Now, let’s ingest a correct document to the afore-created Index ?

curl --location --request PUT 'localhost:9200/microservice-logs/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.210",
"port": 8089,
"message": "Started App!"
}'

Question:- Now, let’s ingest an InCorrect document to the afore-created Index ‘microservice-logs’ ?

Answer :-

  • Note here that we are indexing the document without specifying the ID on our own, and therefore ES would auto-assign a random ID to this newly ingested document :-

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}'

You might question, why did it not throw an exception ? The reason is the safety zone. Yes, the value for the port attribute that we indexed is of type String, but it still worked, because ES coerced the numeric string into the mapped integer type.
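The “safety zone” here is ES’s coercion: a value that can be converted into the mapped type is quietly converted, and anything else is rejected. A rough Python analogue of that check (the `coerce_port` helper is hypothetical, not ES code):

```python
# Sketch: coercion accepts values convertible to the mapped type and
# rejects the rest with a parsing error.
def coerce_port(value):
    """Mimic coercion into an 'integer'-mapped field."""
    try:
        return int(value)  # "8089" -> 8089: inside the safety zone
    except (TypeError, ValueError):
        raise ValueError(f"mapper_parsing_exception: {value!r} is not a number")

print(coerce_port("8089"))  # 8089, indexed happily
try:
    coerce_port("NONE")     # out of the safety zone
except ValueError as err:
    print(err)
```

This is exactly why the String "8089" sails through, while "NONE" (which we try next) cannot be saved by coercion.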

Question:- Now, let’s have a look at all the documents present in the Index ‘microservice-logs’ ?

Answer:- We can observe the documents now present in ElasticSearch :-

{
"took": 231,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}
},
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "PzT70X0BhTuHVfBk00NJ",
"_score": 1.0,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}
}
]
}
}

Question:- Now, let’s ingest yet another InCorrect document to the afore-created Index ‘microservice-logs’ ?

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}'

Observe above that we had proposed the value of the port field as “NONE”. Now, this is obviously outside the safety zone, because NONE is not a number at all, and hence the result is that we run into a mapper_parsing_exception :-

Question:- Now, let’s give an instruction to ElasticSearch that, come what may, do always accept the documents and never throw the exceptions ?

Answer:- For this purpose, we would first have to close this Index (POST /microservice-logs/_close), and only then could we modify its settings, since "index.mapping.ignore_malformed" is a static index setting (PUT /microservice-logs/_settings with "index.mapping.ignore_malformed": true).

Note: Now, we go ahead and modify the settings :-

Question:- How do we see, what are the current settings in place for the particular Index ?

curl --location --request GET 'http://localhost:9200/microservice-logs/_settings'

{
"microservice-logs": {
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"verified_before_close": "true",
"mapping": {
"ignore_malformed": "true"
},
"number_of_shards": "1",
"provided_name": "microservice-logs",
"creation_date": "1639904977547",
"number_of_replicas": "1",
"uuid": "_B9SI_XzTridCePPScpCAw",
"version": {
"created": "7160199"
}
}
}
}
}

Question:- Let’s now open our Index, so that we can perform certain operations with it ?

curl --location --request POST 'http://localhost:9200/microservice-logs/_open'

Question:- Now, again let’s ingest yet another InCorrect document to the afore-created Index ‘microservice-logs’ and observe the behaviour ?

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}'

Observe above that we had again proposed the value of the port field as “NONE”. Now, this obviously gets Indexed happily :-

Question:- Now, let’s see, whether this particular document got Ingested or Not, into our ElasticSearch ?

Answer:- Well, of course this document got ingested, but observe that the field “port” has been marked as ignored. This is only a partial solution, because this setting has its limits, and they are quite considerable.

curl --location --request GET 'localhost:9200/microservice-logs/_doc/QzQV0n0BhTuHVfBkVkOa'

{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "QzQV0n0BhTuHVfBkVkOa",
"_version": 1,
"_seq_no": 6,
"_primary_term": 3,
"_ignored": [
"port"
],
"found": true,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}
}
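With ignore_malformed enabled, the same coercion failure is swallowed: the bad field is skipped and recorded under “_ignored”, while the document as a whole is accepted. A rough sketch of that behaviour (plain Python; `index_doc` is a hypothetical helper, not ES code):

```python
# Sketch: ignore_malformed drops unparseable scalar fields into
# "_ignored" instead of rejecting the whole document.
def index_doc(doc, int_fields=("port",)):
    """Index a doc, ignoring malformed values in integer-mapped fields."""
    stored, ignored = {}, []
    for field, value in doc.items():
        if field in int_fields:
            try:
                stored[field] = int(value)
                continue
            except (TypeError, ValueError):
                ignored.append(field)  # kept in _source, but not indexed
        stored[field] = value
    return {"_ignored": ignored, "_source": stored}

result = index_doc({"service": "XYZ", "port": "NONE", "message": "I am not well!"})
print(result["_ignored"])  # ['port']
```

The document survives, but the port value is unsearchable, which is exactly the trade-off the response above shows.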

Question:- So, What are the limitations of property “ignore_malformed” and can you kindly demonstrate the same ?

Answer:- Yes, the property ignore_malformed can’t handle JSON objects on the input. Recall that originally, the field ‘message’ allows only the text type; if we try to supply a JSON object to this field, it would surely cause an issue.

Question:- Can you demonstrate, how the ElasticSearch auto-determine the type of the new fields thus ingested into it?

Answer:- Yes. In the below request, we are trying to ingest a document which has an extra field ‘payload’, which is a JSON object. Note that, in our original mappings, this was something we never defined.

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "ABC",
"host_ip": "10.0.2.15",
"port": 12345,
"message": "Application Boot complete...",
"payload": {
"data": {
"received": "Application fetching params."
}
}

}'

Question:- Amazing. Now, what’s the type that ES has auto-deduced from the above data that we ingested into it ?

Answer:- Here are the latest mappings we now have for our Index in ElasticSearch :-

curl --location --request GET 'http://localhost:9200/microservice-logs/_mappings'

{
"microservice-logs": {
"mappings": {
"properties": {
"host_ip": {
"type": "ip"
},
"message": {
"type": "text"
},
"port": {
"type": "integer"
},
"service": {
"type": "keyword"
},
"timestamp": {
"type": "date"
},
"payload": {
"properties": {
"data": {
"properties": {
"received": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
}

Question:- So far so good, but now, even further, we got yet another log-line, for which the payload’s structure is different again. So, what would happen ?

Answer:- In the below request, we are trying to ingest a document whose extra field ‘payload’ is again a JSON object, but now the structure of the ‘payload’ is different.

curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Customer MS",
"host_ip": "10.0.2.15",
"port": 12345,
"message": "Received...",
"payload": {
"data": {
"received": {
"even": "more"
}
}
}
}'

As we can rationally guess, it leads to an exception for us, and the reason is that, with the earlier document we ingested into our Index, ES auto-deduced the type of the field “payload.data.received” as text :-

{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "failed to parse field [payload.data.received] of type [text] in document with id 'RjQ00n0BhTuHVfBk0EPM'. Preview of field's value: '{even=more}'"
}
],
"type": "mapper_parsing_exception",
"reason": "failed to parse field [payload.data.received] of type [text] in document with id 'RjQ00n0BhTuHVfBk0EPM'. Preview of field's value: '{even=more}'",
"caused_by": {
"type": "illegal_state_exception",
"reason": "Can't get text on a START_OBJECT at 9:25"
}
},
"status": 400
}
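Dynamic mapping is effectively first-writer-wins: the earlier payload fixed the type of “payload.data.received” as text, so a later object value can no longer be parsed. A minimal sketch of that behaviour (plain Python; `index_field` is a hypothetical helper, not ES internals):

```python
# Sketch: once dynamic mapping deduces a type for a path, later
# documents must conform, or indexing fails.
deduced = {}

def index_field(path, value):
    """Deduce a type on first sight; raise on a conflicting later value."""
    kind = "object" if isinstance(value, dict) else "text"
    first = deduced.setdefault(path, kind)
    if first != kind:
        raise TypeError(f"failed to parse field [{path}] of type [{first}]")

index_field("payload.data.received", "Application fetching params.")  # deduced: text
try:
    index_field("payload.data.received", {"even": "more"})  # object vs text
except TypeError as err:
    print(err)  # failed to parse field [payload.data.received] of type [text]
```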

Question:- So now, what can we do, to solve this problem ?

Answer:- We have following ways to address this problem :-

  • Well, engineers on the team need to be aware of these mapping mechanics. You can also establish shared guidelines for the log fields.

That’s all in this section. If you liked reading this blog, kindly do press the clap button to indicate your appreciation. We shall see you in the next part of the series.
