ELK: Dealing with Mappings | Part 4

  • Handling Nested Fields with ES.
  • Problem of Mapping-Explosion.
  • Flattened Data Type.
  • Partial update by document ID using the POST verb.
  • Keyword based search query on flattened data fields.
  • Explicit & Dynamic Mappings of the fields in ES.
  • Mapping Parsing Exceptions & resolution.
curl --location --request GET 'http://127.0.0.1:9200/sys_log/_mapping'
{
"sys_log": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"properties": {
"hostname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}
  • Each field has an associated mapping type in its index. These types can be specified by the user, or Elasticsearch can assign them to the field automatically.
  • Elasticsearch holds the mapping information of every index in its cluster state.
  • The cluster state includes information such as index mappings and node details.
  • Elasticsearch is typically set up as a cluster, which is a collection of Elasticsearch nodes.
  • The presence of multiple nodes allows Elasticsearch to perform better indexing and searching operations.
  • For each new field added to a document, Elasticsearch creates a new mapping.
  • Each mapping update in the index also changes the cluster state.
  • After each cluster state change, the other nodes need to be synced.
  • Without the updated cluster state, nodes cannot perform basic operations like indexing and searching.
  • When documents keep introducing new fields, this constant growth and re-syncing (a mapping explosion) can cause memory issues within the nodes, degrade performance, and possibly bring the cluster down.
  • The flattened data type addresses this by mapping an entire object, along with its inner fields, into a single field.
  • In other words, if a field contains inner fields, the flattened data type maps the parent field as a single type named flattened, and the inner fields do not appear in the mapping at all, which reduces the total number of mapped fields.
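To make the difference concrete, here is a minimal offline sketch in Python that counts how many leaf fields a document would add to a mapping, with and without a flattened parent. The function `count_mapped_fields` is a hypothetical helper, not an Elasticsearch API, and it is a simplification: it counts leaf fields only and ignores multi-fields such as the `keyword` sub-fields.

```python
def count_mapped_fields(doc, flattened=frozenset(), prefix=""):
    """Count the leaf fields a mapping would need for `doc`.

    Any dotted path listed in `flattened` counts as a single field,
    no matter how many inner keys it holds (toy model, not ES code).
    """
    total = 0
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if path in flattened:
            total += 1  # the whole object is one `flattened` field
        elif isinstance(value, dict):
            total += count_mapped_fields(value, flattened, prefix=f"{path}.")
        else:
            total += 1
    return total


doc = {
    "message": "FontService unique font name matching request did not receive a response.",
    "fileset": {"name": "syslog"},
    "process": {"name": "org.gnome.Shell.desktop", "pid": 3383},
    "@timestamp": "2020-03-09T18:00:54.000+05:30",
    "host": {"hostname": "bionic", "name": "bionic"},
}

dynamic_count = count_mapped_fields(doc)
flattened_count = count_mapped_fields(doc, flattened={"host"})
print(dynamic_count, flattened_count)
```

Under this model, every new key added beneath `host` raises the first count but leaves the second one unchanged.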
curl --location --request PUT 'localhost:9200/sys_log_flatened/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic"
}

}'
curl --location --request GET 'http://localhost:9200/sys_log_flatened/_mappings'
{
"sys_log_flatened": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"type": "flattened"
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}
curl --location --request POST 'localhost:9200/sys_log_flatened/_update/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"doc" : {
"host" : {
"osVersion": "Bionic Beaver",
"osArchitecture":"x86_64"
}
}
}'
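The partial update above sends only the new inner fields under "doc", and they are merged into the stored document rather than replacing it. A rough Python model of that merge, assuming nested objects are merged key by key and any other value overwrites the old one, could look like this. The helper `apply_partial_update` is hypothetical and only illustrates the behaviour.

```python
import copy


def apply_partial_update(source, partial):
    """Return a new document with `partial` merged into `source`.

    Nested objects are merged key by key; any other value replaces
    the existing one. Neither input is modified.
    """
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


existing = {
    "fileset": {"name": "syslog"},
    "host": {"hostname": "bionic", "name": "bionic"},
}
partial = {"host": {"osVersion": "Bionic Beaver", "osArchitecture": "x86_64"}}

updated = apply_partial_update(existing, partial)
print(updated["host"])
```

The existing `hostname` and `name` keys survive, and the two new keys are added alongside them.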
  • As expected, the mapping remains unchanged, because the 'host' field uses the flattened data type.
  • As intended, the newly added inner fields have not been added to the mapping for this index. This is an important feature in many real-world scenarios: it keeps the mapping significantly smaller and therefore mitigates the risk of a mapping explosion. So far we have seen why and how the flattened data type is used.
{
"sys_log_flatened": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"fileset": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"host": {
"type": "flattened"
},

"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"process": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"pid": {
"type": "long"
}
}
}
}
}
}
}
  • The inner fields of a flattened object are treated as keywords in Elasticsearch.
  • This means that no analyzers or tokenizers are applied to the flattened fields, which results in a more limited search capability.
curl --location --request GET 'localhost:9200/sys_log_flatened/_search'
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "sys_log_flatened",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic",
"osVersion": "Bionic Beaver",
"osArchitecture": "x86_64"
}
}
}

]
}
}
curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host" : "Bionic Beaver"
}
}
}'
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.39556286,
"hits": [
{
"_index": "sys_log_flatened",
"_type": "_doc",
"_id": "1",
"_score": 0.39556286,
"_source": {
"message": "[5592:1:0309/123054.737712:ERROR:child_process_sandbox_support_impl_linux.cc(79)] FontService unique font name matching request did not receive a response.",
"fileset": {
"name": "syslog"
},
"process": {
"name": "org.gnome.Shell.desktop",
"pid": 3383
},
"@timestamp": "2020-03-09T18:00:54.000+05:30",
"host": {
"hostname": "bionic",
"name": "bionic",
"osVersion": "Bionic Beaver",
"osArchitecture": "x86_64"
}
}
}

]
}
}
  • The match query for "Bionic Beaver" finds a matching document, because the osVersion field in one of our documents holds exactly that text, as the response above shows.
  • This works because the search query we provided was an exact match, including the casing, for "Bionic Beaver".
curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host.osVersion" : "Bionic Beaver"
}
}
}'
curl --location --request GET 'localhost:9200/sys_log_flatened/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query" : {
"match" : {
"host.osVersion" : "Beaver"
}
}
}'
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
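The zero-hit response for "Beaver" follows from the keyword treatment of flattened inner fields: the stored value is compared as a whole, not term by term. The Python sketch below contrasts the two behaviours. Both functions are toy models and not Elasticsearch code; in particular, `analyzed_match` only approximates an analyzed text field by lowercasing and splitting on whitespace.

```python
def keyword_match(stored, query):
    """Keyword-style comparison: the whole value must be identical."""
    return stored == query


def analyzed_match(stored, query):
    """Rough stand-in for a match query on an analyzed text field:
    lowercase both sides, split on whitespace, and accept any shared term."""
    stored_terms = set(stored.lower().split())
    query_terms = set(query.lower().split())
    return bool(stored_terms & query_terms)


os_version = "Bionic Beaver"

print(keyword_match(os_version, "Bionic Beaver"))  # whole value, same casing
print(keyword_match(os_version, "Beaver"))         # partial value
print(analyzed_match(os_version, "beaver"))        # what a text field would allow
```

Only the first and third calls report a match, which mirrors why the partial query on host.osVersion returns no hits.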
  • Explicit mapping: you define which fields to store and their types, along with any additional parameters.
  • Dynamic mapping: Elasticsearch automatically attempts to determine the appropriate data type and updates the mapping accordingly.
  • With explicit mapping, if the fields in the documents we index do not match the mapping, we get an exception once the mismatch goes beyond a certain safety zone.
  • With the default dynamic mapping, if new documents bring in many more fields, we can land in a mapping explosion, which can take the cluster down.
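As an illustration of dynamic mapping, the following Python sketch guesses a field type from a JSON value in roughly the way the default rules do: whole numbers become long, floating-point numbers become float, and strings become text unless they look like a date. The function `infer_dynamic_type` and the regular expression are assumptions made for this example; real date detection uses configurable date formats, and arrays and nulls are not handled here.

```python
import re

# Crude stand-in for date detection: ISO-8601 style strings only.
_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T[\d:.]+(Z|[+-]\d{2}:\d{2})?)?$")


def infer_dynamic_type(value):
    """Guess the mapping type dynamic mapping would pick for a JSON value."""
    if isinstance(value, bool):  # check before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "date" if _ISO_DATE.match(value) else "text"
    raise TypeError(f"value of type {type(value).__name__} is not handled in this sketch")


sample = {
    "@timestamp": "2020-03-09T18:00:54.000+05:30",
    "pid": 3383,
    "name": "org.gnome.Shell.desktop",
    "host": {"hostname": "bionic"},
}
inferred = {field: infer_dynamic_type(value) for field, value in sample.items()}
print(inferred)
```

The guesses for `@timestamp`, `pid`, and `name` line up with the date, long, and text types seen in the sys_log mapping.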
curl --location --request PUT 'localhost:9200/microservice-logs' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"properties": {
"timestamp" : {"type" : "date"},
"service" : {"type" : "keyword"},
"host_ip" : {"type" : "ip"},
"port" : {"type" : "integer"},
"message" : {"type" : "text"}
}
}
}'
curl --location --request PUT 'localhost:9200/microservice-logs/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.210",
"port": 8089,
"message": "Started App!"
}'
  • Note that the following request indexes the document without specifying an ID ourselves, so Elasticsearch auto-assigns a random ID to the newly ingested document.
  • Also observe that we are ingesting the document with the value of the "port" field as a string rather than an integer.
curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}'
{
"took": 231,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}
},
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "PzT70X0BhTuHVfBk00NJ",
"_score": 1.0,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Document-Microservice",
"host_ip": "10.14.196.209",
"port": "8089",
"message": "Started App on another IP."
}
}
]
}
}
curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}'
curl --location --request GET 'http://localhost:9200/microservice-logs/_settings'
{
"microservice-logs": {
"settings": {
"index": {
"routing": {
"allocation": {
"include": {
"_tier_preference": "data_content"
}
}
},
"verified_before_close": "true",
"mapping": {
"ignore_malformed": "true"
},
"number_of_shards": "1",
"provided_name": "microservice-logs",
"creation_date": "1639904977547",
"number_of_replicas": "1",
"uuid": "_B9SI_XzTridCePPScpCAw",
"version": {
"created": "7160199"
}
}
}
}
}
curl --location --request POST 'http://localhost:9200/microservice-logs/_open'
curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}'
curl --location --request GET 'localhost:9200/microservice-logs/_doc/QzQV0n0BhTuHVfBkVkOa'
{
"_index": "microservice-logs",
"_type": "_doc",
"_id": "QzQV0n0BhTuHVfBkVkOa",
"_version": 1,
"_seq_no": 6,
"_primary_term": 3,
"_ignored": [
"port"
],
"found": true,
"_source": {
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "XYZ",
"host_ip": "10.0.2.15",
"port": "NONE",
"message": "I am not well!"
}
}
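The "_ignored" entry in the response above shows what the ignore_malformed setting does: the document is stored, but the unparseable "port" value is left out of the index for that field. A small Python model of an integer field, covering both the numeric string "8089" and the malformed "NONE", might look like the sketch below. The function `index_integer_field` is hypothetical and only mimics the observable behaviour.

```python
def index_integer_field(name, value, ignore_malformed=False):
    """Return (indexed_value, ignored_fields) for an integer-mapped field.

    Numeric strings are coerced to integers. A value that cannot be
    parsed is rejected, unless `ignore_malformed` is set, in which case
    the field is skipped and its name is reported as ignored.
    """
    try:
        return int(value), []
    except (TypeError, ValueError):
        if ignore_malformed:
            return None, [name]
        raise ValueError(f"malformed value {value!r} for integer field {name!r}")


print(index_integer_field("port", "8089"))
print(index_integer_field("port", "NONE", ignore_malformed=True))
```

Without `ignore_malformed=True`, the second call raises an error instead of returning, which corresponds to the document being rejected.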
curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "ABC",
"host_ip": "10.0.2.15",
"port": 12345,
"message": "Application Boot complete...",
"payload": {
"data": {
"received": "Application fetching params."
}
}

}'
curl --location --request GET 'http://localhost:9200/microservice-logs/_mappings'
{
"microservice-logs": {
"mappings": {
"properties": {
"host_ip": {
"type": "ip"
},
"message": {
"type": "text"
},
"port": {
"type": "integer"
},
"service": {
"type": "keyword"
},
"timestamp": {
"type": "date"
},
"payload": {
"properties": {
"data": {
"properties": {
"received": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
}
curl --location --request POST 'localhost:9200/microservice-logs/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
"timestamp": "2020-04-11T12:34:56.789Z",
"service": "Customer MS",
"host_ip": "10.0.2.15",
"port": 12345,
"message": "Received...",
"payload": {
"data": {
"received": {
"even": "more"
}
}
}
}'
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "failed to parse field [payload.data.received] of type [text] in document with id 'RjQ00n0BhTuHVfBk0EPM'. Preview of field's value: '{even=more}'"
}
],
"type": "mapper_parsing_exception",
"reason": "failed to parse field [payload.data.received] of type [text] in document with id 'RjQ00n0BhTuHVfBk0EPM'. Preview of field's value: '{even=more}'",
"caused_by": {
"type": "illegal_state_exception",
"reason": "Can't get text on a START_OBJECT at 9:25"
}
},
"status": 400
}
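The exception above occurs because "payload.data.received" was first mapped as text and the new document supplies an object at the same path. One way to catch this before indexing is to compare a document against the current mapping. The Python sketch below does that for object-versus-leaf conflicts only; `find_mapping_conflicts` is a hypothetical helper, and arrays are skipped for brevity.

```python
def find_mapping_conflicts(properties, doc, prefix=""):
    """List dotted paths where `doc` holds an object but the mapping
    expects a leaf type, or a scalar where the mapping expects an object."""
    conflicts = []
    for key, value in doc.items():
        field = properties.get(key)
        if field is None or isinstance(value, list):
            continue  # unmapped fields and arrays are not inspected here
        path = f"{prefix}{key}"
        has_children = "properties" in field
        accepts_object = has_children or field.get("type") in {"object", "nested", "flattened"}
        if isinstance(value, dict):
            if not accepts_object:
                conflicts.append(path)
            elif has_children:
                conflicts += find_mapping_conflicts(field["properties"], value, f"{path}.")
        elif has_children:
            conflicts.append(path)
    return conflicts


mapping = {
    "port": {"type": "integer"},
    "payload": {"properties": {"data": {"properties": {"received": {"type": "text"}}}}},
}
good_doc = {"port": 12345, "payload": {"data": {"received": "Application fetching params."}}}
bad_doc = {"port": 12345, "payload": {"data": {"received": {"even": "more"}}}}

print(find_mapping_conflicts(mapping, good_doc))
print(find_mapping_conflicts(mapping, bad_doc))
```

The second call reports the same field path that appears in the mapper_parsing_exception.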
  • First, engineers on the team need to be aware of these mapping mechanics. You can also establish shared guidelines for the log fields.
  • Second, you may consider what is called a dead letter queue pattern, which stores the failed documents in a separate queue. This needs to be handled at the application level.
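A minimal application-level sketch of the dead letter queue idea, in Python, is shown below. It assumes the indexing call is passed in as a function that raises on failure; `index_with_dead_letter_queue` and `fake_index` are hypothetical names, and a plain list stands in for a real queue.

```python
def index_with_dead_letter_queue(documents, index_fn):
    """Try to index each document; collect failures instead of stopping.

    `index_fn` is any callable that raises when a document is rejected.
    Returns (indexed, dead_letters).
    """
    indexed, dead_letters = [], []
    for doc in documents:
        try:
            index_fn(doc)
            indexed.append(doc)
        except Exception as exc:  # broad on purpose: this is only a sketch
            dead_letters.append({"document": doc, "error": str(exc)})
    return indexed, dead_letters


def fake_index(doc):
    """Hypothetical stand-in for a real indexing call."""
    if isinstance(doc["payload"]["data"]["received"], dict):
        raise ValueError("object found where a text value is mapped")


docs = [
    {"service": "ABC", "payload": {"data": {"received": "Application fetching params."}}},
    {"service": "Customer MS", "payload": {"data": {"received": {"even": "more"}}}},
]
indexed, dead_letters = index_with_dead_letter_queue(docs, fake_index)
print(len(indexed), len(dead_letters))
```

The failed documents stay available in `dead_letters`, together with the reason, so they can be inspected, fixed, and retried later.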


Software Engineer for Big Data distributed systems
