September 28th, 2017

In OpenShift Aggregated Logging (https://github.com/openshift/origin-aggregated-logging), the Fluentd pipeline tries very hard to ensure that the data is correct, because the output section depends on clean data to construct the index names for Elasticsearch. If the fields and values are not correct, the index name construction fails with an unhelpful error like this:

2017-09-28 13:22:22 -0400 [warn]: temporarily failed to flush the buffer. next_retry=2017-09-28 13:22:23 -0400 error_class="NoMethodError"
error="undefined method `[]' for nil:NilClass" plugin_id="object:1c0bd1c"
2017-09-28 13:22:22 -0400 [warn]: /opt/app-root/src/gems/fluent-plugin-elasticsearch- `eval'
2017-09-28 13:22:22 -0400 [warn]: /opt/app-root/src/gems/fluent-plugin-elasticsearch- `eval'

The error gives no context: not which field is missing, not which tag matched, not even which output plugin failed (the operations output or the applications output). You do get the plugin_id, which can be used to look up the actual plugin information if Fluentd monitoring is enabled.
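As a rough sketch of that plugin_id lookup: Fluentd's monitor_agent input plugin, when enabled, serves plugin information as JSON at /api/plugins.json (port 24220 by default). Whether it is enabled and reachable in your logging-fluentd pod is an assumption here, and the function names `fetch_plugins` and `find_plugin` are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_plugins(base_url="http://localhost:24220"):
    """Fetch the plugin list from monitor_agent.

    Assumption: monitor_agent is enabled and listening at base_url.
    """
    with urlopen(base_url + "/api/plugins.json") as resp:
        return json.load(resp)


def find_plugin(plugins_doc, plugin_id):
    """Return the plugin entry whose plugin_id matches, or None."""
    for plugin in plugins_doc.get("plugins", []):
        if plugin.get("plugin_id") == plugin_id:
            return plugin
    return None
```

With monitoring enabled, something like `find_plugin(fetch_plugins(), "object:1c0bd1c")` should return the entry for the failing plugin, including its type and configuration.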
One solution is to just edit the logging-fluentd ConfigMap, and add a stdout filter in the right place:
## matches
          <filter **>
            @type stdout
          </filter>
          @include configs.d/openshift/output-pre-*.conf

This dumps the time, tag, and record of every event just before the outputs. The problem is that it causes a feedback loop, since Fluentd reads from its own pod log. The solution is to also throw away the Fluentd pod logs:
## filters
          @include configs.d/openshift/filter-pre-*.conf
          @include configs.d/openshift/filter-retag-journal.conf
          <match kubernetes.journal.container.fluentd kubernetes.var.log.containers.fluentd**>
            @type null
          </match>

This match must come after filter-retag-journal.conf, which identifies and tags Fluentd pod log records. Then restart Fluentd (oc delete pod $fluentd_pod, oc label node, etc.). The Fluentd pod log will now contain data like this:
2017-09-28 13:44:47 -0400 output_tag: {"type":"response","@timestamp":"2017-09-28T17:44:19.524989+00:00","pid":8,"method":"head","statusCode":200,
"message":"HEAD / 200 2ms - 9.0B",

Now, if you see a record that is missing @timestamp, or a record from a pod that is missing kubernetes.namespace_name or kubernetes.namespace_id, you know that the exception is caused by one of these missing fields.
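Scanning the pod log by eye gets tedious, so here is a minimal sketch of a script that flags suspect records. It assumes each stdout filter line has the form "time tag: {json}" on a single line, and it treats a record as coming from a pod if it has a `kubernetes` key. Both are assumptions, and the names `missing_fields` and `scan` are only illustrative.

```python
import json
import re

# Assumed line shape: "<date> <time> <tz> <tag>: <json record>"
LINE_RE = re.compile(
    r"^(?P<time>\S+ \S+ \S+) (?P<tag>\S+): (?P<record>\{.*\})\s*$"
)


def missing_fields(record):
    """List the fields needed for index name construction that are absent."""
    missing = []
    if "@timestamp" not in record:
        missing.append("@timestamp")
    k8s = record.get("kubernetes")
    if isinstance(k8s, dict):  # assumption: this marks a pod record
        for name in ("namespace_name", "namespace_id"):
            if not k8s.get(name):
                missing.append("kubernetes." + name)
    return missing


def scan(lines):
    """Return (time, tag, missing_fields) for each suspect record line."""
    problems = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a stdout filter record line
        try:
            record = json.loads(m.group("record"))
        except json.JSONDecodeError:
            continue  # truncated or non-JSON payload
        missing = missing_fields(record)
        if missing:
            problems.append((m.group("time"), m.group("tag"), missing))
    return problems
```

To use it, save the output of oc logs $fluentd_pod to a file and pass the open file object to `scan`; anything it returns is a candidate for the record that triggered the NoMethodError.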