Saturday, May 23, 2026

Understanding fapolicyd-1.5 Metrics

In the last article, I went over the bigger picture of the fapolicyd 1.5 work. From it, we learned that status, metrics, and timing are now separate reports because they answer different questions.

This article is about metrics.

The important distinction is this: "fapolicyd-cli --check-status" asks whether the daemon is healthy and configured the way you expect. "fapolicyd-cli --check-metrics" asks what the daemon has been doing. Metrics are where you look for rule hits, cache behavior, default-allow decisions, queue pressure, and which rule attributes are causing work.

Start with the command:

# fapolicyd-cli --check-metrics
Last metrics reset: never
Ruleset generation: 1

Decision outcomes:
Allowed accesses: 42171
Denied accesses: 3
Allowed by rule: 42171
Allowed by fallthrough: 0

Inter-thread queue & defer activity:
Inter-thread max queue depth: 6
Subject deferred events: 0
Subject defer max depth: 0
Subject defer fallbacks: 0

Subject cache effectiveness:
Subject hits: 41632
Subject misses: 692
Subject collisions: 28
Subject evictions: 150 (0%)
Early subject cache evictions: 0
Subject BUILDING tracer evictions: 0
Subject BUILDING stale evictions: 0

Object cache effectiveness:
Object hits: 34527
Object misses: 11421
Object collisions: 3774
Object evictions: 3774 (10%)

Rule hit counts:
Hits/rule:   1      0 allow perm=any uid=0 : dir=/var/tmp/
Hits/rule:   2   7531 allow perm=any uid=0 trust=1 : all
Hits/rule:   3      0 allow perm=open exe=/usr/bin/rpm : all
Hits/rule:   4      0 allow perm=open exe=/usr/bin/python3.13 comm=dnf : all
Hits/rule:   5      0 deny_audit perm=any all : ftype=application/x-bad-elf
Hits/rule:   6   7634 allow perm=open all : ftype=application/x-sharedlib trust=1
Hits/rule:   7      0 deny_audit perm=open all : ftype=application/x-sharedlib
Hits/rule:   8    578 allow perm=execute all : trust=1
Hits/rule:   9     12 allow perm=any gid=wheel : ftype=%languages dir=/home
Hits/rule:  10      0 allow perm=any gid=wheel : ftype=%languages dir=/usr/share/git-core/templates/
Hits/rule:  11    158 allow perm=open all : ftype=%languages trust=1
Hits/rule:  12      0 deny_audit perm=any all : ftype=%languages
Hits/rule:  13     20 allow perm=any all : ftype=text/x-shellscript
Hits/rule:  14      3 deny_audit perm=execute all : all
Hits/rule:  15  26238 allow perm=open all : all

Subject attribute lookups:
Subject attr: auid requests=3 lookups=3
Subject attr: uid requests=84351 lookups=692
Subject attr: sessionid requests=0 lookups=0
Subject attr: pid requests=3 lookups=0
Subject attr: ppid requests=3 lookups=0
Subject attr: trust requests=7531 lookups=565
Subject attr: gid requests=52850 lookups=0
Subject attr: comm requests=0 lookups=0
Subject attr: exe requests=68692 lookups=1216
Subject attr: dir requests=0 lookups=0
Subject attr: ftype requests=0 lookups=0

Object attribute lookups:
Object attr: path requests=12476 lookups=9452
Object attr: dir requests=7853 lookups=2453
Object attr: device requests=0 lookups=0
Object attr: ftype requests=217466 lookups=9420
Object attr: trust requests=8376 lookups=1437
Object attr: filehash requests=0 lookups=0


The first two lines matter more than they might appear:
Last metrics reset: never
Ruleset generation: 1

If the last reset is never, then the counters have been growing since the daemon started. That is the default behavior, and it is useful for a quick look at the lifetime of the daemon. If you want a smaller window, configure reset_strategy=manual in fapolicyd.conf and use "fapolicyd-cli --reset-metrics". The reset-metrics report snapshots the counters, resets them, and displays what they were at the moment of the reset. The output has the same layout as the check-metrics report, but the next report generated starts fresh.

The Ruleset generation tells you which active policy the counters apply to. Rule hit counters naturally reset when new rules are successfully loaded. That is important because rule numbers and rule text can change when policy changes. You do not want rule hit counts from yesterday's policy mixed with today's policy.

Metrics are easiest to interpret when the counter window and ruleset generation are explicit.
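If you save reports for later comparison, it helps to tag each one with its window and generation. Here is a small sketch of that idea. The function name metrics_window is my own invention for illustration, and I am assuming the two header lines keep the "label: value" shape shown above even when the reset field holds a timestamp instead of "never".

```shell
# Hypothetical helper (not part of fapolicyd): reads a metrics report on
# stdin and prints the counter window and ruleset generation on one line.
metrics_window() {
    awk '
        # Strip everything up to the first ": " and keep the rest as the value.
        /^Last metrics reset:/ { sub(/^[^:]*: */, ""); reset = $0 }
        /^Ruleset generation:/ { sub(/^[^:]*: */, ""); gen = $0 }
        END { printf "reset=%s generation=%s\n", reset, gen }
    '
}
```

You would use it as "fapolicyd-cli --check-metrics | metrics_window" and store the result next to the saved report, so two reports from different generations are never compared by accident.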

The first real section is decision outcomes. The old way to look at fapolicyd activity was allowed versus denied. That is still useful, but it is not enough. An allow can happen because a rule matched. It can also happen because the rules had no opinion and the historical default allow behavior was used. Note: the shipped rules end with:

deny_audit perm=execute all : all
allow perm=open all : all

The intention is to block any unknown execution and allow opens of any documents. This only works if the rules before these two lines accurately detect files written in interpreted languages.

To better understand what kind of access decisions are made, the metrics report now separates:
Allowed accesses
Denied accesses
Allowed by rule
Allowed by fallthrough

If Allowed by fallthrough is zero, then every allow in the window came from a rule. If it is non-zero, the report prints more detail: open versus execute, trusted versus untrusted or unknown trust, and broad file type buckets such as executable, programmatic, shared library, unknown ftype, and other ftype.

The default allow indicates that policy is missing a decision rule. It is not the same as a policy rule intentionally approving the access. If a system has a large number of fallthrough execute decisions, you would want to know why. Maybe that is expected for a permissive policy. Maybe it shows a missing terminal deny. The point is that it is now visible.
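To put a number on how much of the allow traffic is fallthrough, you can divide one counter by the other. This is only a sketch: fallthrough_share is a made-up helper name, and it reads just the two summary lines, not the extra detail lines that appear when fallthrough is non-zero (their exact layout is not shown here, so I do not try to parse them).

```shell
# Hypothetical helper: reads a metrics report on stdin and prints what
# share of allowed accesses came from fallthrough rather than a rule.
fallthrough_share() {
    awk -F': ' '
        /^Allowed accesses:/       { allowed = $2 + 0 }
        /^Allowed by fallthrough:/ { fall = $2 + 0 }
        END {
            if (allowed == 0) { print "no allowed accesses in this window"; exit }
            printf "fallthrough=%d allowed=%d share=%.2f%%\n", fall, allowed, 100 * fall / allowed
        }
    '
}
```

Run it as "fapolicyd-cli --check-metrics | fallthrough_share". With the report at the top of this article the share is zero, which is what you would expect from a ruleset that ends in catch-all rules.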

Rule hit counts are next. They look like this:
Hits/rule:   1      42 allow perm=execute all : trust=1
Hits/rule:   2       3 deny_audit perm=execute all : trust=0

The exact rules will depend on your policy. A rule hit is counted when the rule actually makes the allow or deny decision. Merely iterating past a rule is not a hit. This makes the table useful for answering practical questions. Which rules are carrying the load? Which rules never fire? Did the rule I just added actually match the program I was testing?
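The "which rules never fire" question is easy to answer mechanically. The sketch below assumes the "Hits/rule: number hits rule-text" layout shown in this article. The name unused_rules is hypothetical.

```shell
# Hypothetical helper: reads a metrics report on stdin and lists every
# rule whose hit counter is zero in the current window.
unused_rules() {
    awk '$1 == "Hits/rule:" && $3 == 0 {
        num = $2
        # Drop the label, rule number, and hit count; keep the rule text as printed.
        sub(/^Hits\/rule:[ \t]+[0-9]+[ \t]+[0-9]+[ \t]+/, "")
        printf "rule %s never fired: %s\n", num, $0
    }'
}
```

Run it as "fapolicyd-cli --check-metrics | unused_rules". Keep in mind that a zero only means the rule made no decision in this window. A deny rule that guards against something rare is supposed to sit at zero most of the time.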

Next come queue and defer activity. The most important queue metric is:
Inter-thread max queue depth

This is the high-water mark for fapolicyd's internal event queue in the current metrics window. It is not the kernel fanotify queue. It is the userspace queue between event intake and decision processing. If this number grows near your configured q_size, the daemon is not keeping up with the event stream.
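A quick way to see how close the high-water mark is to the limit is to express it as a percentage of q_size. In this sketch the helper name queue_pressure is hypothetical, and you pass in the q_size value yourself; I am not assuming anything about how you read it out of fapolicyd.conf.

```shell
# Hypothetical helper: reads a metrics report on stdin and compares the
# queue high-water mark with the q_size value given as the first argument.
queue_pressure() {
    awk -F': ' -v qsize="${1:-0}" '
        /^Inter-thread max queue depth:/ {
            depth = $2 + 0
            if (qsize + 0 <= 0) { print "usage: queue_pressure <q_size>"; exit 1 }
            printf "max_depth=%d q_size=%d used=%.1f%%\n", depth, qsize, 100 * depth / qsize
        }
    '
}
```

For example, "fapolicyd-cli --check-metrics | queue_pressure 800", where 800 is only a placeholder for whatever q_size is set to on your system.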

Subject deferral is new in this release. The daemon can defer an incoming event when processing it would evict another process that is still building startup pattern state in the same subject cache slot. In a normal busy system, some deferred events may be fine. What I would watch closely is:
Subject deferred events
Subject defer max depth
Subject defer fallbacks

Subject defer fallbacks means the fixed defer array filled and the daemon had to use the historical eviction behavior. If this keeps climbing during normal workloads, look at subject cache sizing and the workload shape. A very wide fork/exec storm can create lots of subject cache collisions.

The cache sections are next. fapolicyd has subject and object caches because recomputing process and file attributes on every event would be expensive. The basic shape is:
Subject hits
Subject misses
Subject collisions
Subject evictions

Object hits
Object misses
Object collisions
Object evictions

Hits are good. Misses mean fapolicyd had to populate the information, which can be costly. Collisions mean a populated cache slot did not match the current process or file identity. Evictions mean something was removed to make room. A small number of evictions is normal. A high eviction rate compared to hits suggests that the cache may be too small for the workload.
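If you want a single number per cache, a hit ratio is the usual choice. The sketch below computes hits / (hits + misses) for each cache. That formula is my own choice for illustration, not something the report prints, and cache_hit_ratio is a hypothetical name.

```shell
# Hypothetical helper: reads a metrics report on stdin and prints a hit
# ratio, hits / (hits + misses), for the subject and object caches.
cache_hit_ratio() {
    awk -F': ' '
        /^(Subject|Object) (hits|misses):/ {
            split($1, w, " ")            # w[1] = Subject or Object, w[2] = hits or misses
            n[w[1], w[2]] = $2 + 0
        }
        END {
            split("Subject Object", kinds, " ")
            for (i = 1; i <= 2; i++) {
                k = kinds[i]
                total = n[k, "hits"] + n[k, "misses"]
                if (total > 0)
                    printf "%s cache hit ratio: %.1f%%\n", k, 100 * n[k, "hits"] / total
            }
        }
    '
}
```

Run it as "fapolicyd-cli --check-metrics | cache_hit_ratio". A ratio that drops while evictions climb is the pattern that points at cache sizing.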

The subject cache also reports early subject cache evictions, tracer evictions, and stale BUILDING evictions. These are more health-oriented. An early eviction means a subject was evicted before startup state was complete. A tracer eviction means the occupant was being traced and could hold the slot indefinitely. A stale eviction means startup state stayed incomplete past the bounded stale window.

The last section is easy to skip, but it is one of the most useful additions: attribute lookup metrics. They look like this:
Subject attr: auid requests=1000 lookups=12
Object attr: ftype requests=1000 lookups=200

requests means policy evaluation or syslog formatting asked for the attribute. lookups means the attribute was not already present in the event cache and fapolicyd had to compute or fetch it.

That distinction matters. If requests are high but lookups are low, the cache is doing its job. If lookups are high, then that attribute is causing real work. This is also a way to see whether logging is making the daemon do extra lookups. For example, a very detailed syslog_format can request attributes that the policy did not need for the decision.
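To find the attributes that cause the most real work, sort the attribute lines by lookups. This sketch assumes the "Subject attr: name requests=N lookups=M" layout shown above. The helper name attr_lookup_cost is hypothetical.

```shell
# Hypothetical helper: reads a metrics report on stdin and prints, for
# every attribute that was requested at least once, the lookup count,
# the lookups-to-requests percentage, and the attribute name.
# Output is sorted with the most expensive attribute first.
attr_lookup_cost() {
    awk '/^(Subject|Object) attr: / {
        split($4, r, "=")                # r[2] = requests
        split($5, l, "=")                # l[2] = lookups
        if (r[2] + 0 > 0)
            printf "%d %.1f%% %s %s\n", l[2] + 0, 100 * l[2] / r[2], $1, $3
    }' | sort -rn
}
```

Run it as "fapolicyd-cli --check-metrics | attr_lookup_cost". Attributes near the top with a high percentage are the ones the cache is not absorbing.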

The metrics report groups activity by decision outcome, queue/defer pressure, cache behavior, and attribute lookup cost.


Once you have metrics, the next question is how to graph them. fapolicyd does not need to know about your monitoring stack. The report is simple "name: value" text. That means you can parse the fields you care about and ship them to whatever you already use.

If you use StatsD, you can turn selected fields into gauge updates. This is deliberately small and crude, but it shows the idea:

fapolicyd-cli --check-metrics |
awk -F': ' '
# Keep only the selected report lines; $1 is the label, $2 is the value.
/^(Allowed accesses|Denied accesses|Allowed by rule|Allowed by fallthrough|Inter-thread max queue depth|Subject deferred events|Subject defer fallbacks|Early subject cache evictions|Object evictions):/ {
    name=$1
    gsub(/ /, "_", name)
    # "$2 + 0" keeps the leading number and drops a trailing percentage such as " (10%)".
    printf "fapolicyd.%s:%d|g\n", tolower(name), $2 + 0
}' | nc -u -w1 127.0.0.1 8125



If you use Prometheus and Grafana, remember that Grafana is the visualization layer. Something else has to collect the numbers. Prometheus supports a simple text exposition format. You could write a tiny exporter or generate a node-exporter textfile collector file from selected fapolicyd metrics.

For Prometheus, be careful with labels. It is tempting to turn every rule hit line into a metric with the whole rule text as a label. That can create high cardinality and make your monitoring system unhappy. I would start with stable low-cardinality metrics: allow/deny counts, fallthrough counts, queue depth, cache hits and evictions, subject deferral, kernel overflow, and reply errors. Rule hits are better for troubleshooting reports unless you have a controlled set of rule labels.

Here is a minimal Prometheus-style example for a few fields:

fapolicyd-cli --check-metrics |
awk -F': ' '
/^Allowed accesses:/ { print "fapolicyd_allowed_accesses " $2 }
/^Denied accesses:/ { print "fapolicyd_denied_accesses " $2 }
/^Allowed by fallthrough:/ { print "fapolicyd_allowed_by_fallthrough " $2 }
/^Inter-thread max queue depth:/ { print "fapolicyd_inter_thread_max_queue_depth " $2 }
/^Subject defer fallbacks:/ { print "fapolicyd_subject_defer_fallbacks " $2 }
/^Early subject cache evictions:/ { print "fapolicyd_early_subject_cache_evictions " $2 }
'


This is not meant to be a finished exporter. It is a starting point. A real exporter should sanitize names, handle percentages, keep type/help metadata, and decide whether a field is a counter or gauge.
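As one small step in that direction, here is a sketch that sanitizes the names, strips the percentage suffix, and emits a TYPE line for each metric. The function name to_prom and the fapolicyd_ prefix are my own choices. I declare everything as a gauge only because the counters can be reset, and that is an assumption you may want to revisit.

```shell
# Hypothetical helper: reads a metrics report on stdin and prints a few
# fields in Prometheus text exposition format with TYPE metadata.
to_prom() {
    awk -F': ' '
        /^(Allowed accesses|Denied accesses|Allowed by fallthrough|Inter-thread max queue depth|Subject evictions|Object evictions):/ {
            name = tolower($1)
            gsub(/[^a-z0-9]+/, "_", name)     # spaces and hyphens become underscores
            printf "# TYPE fapolicyd_%s gauge\n", name
            # "$2 + 0" keeps the leading number and drops a trailing " (10%)".
            printf "fapolicyd_%s %d\n", name, $2 + 0
        }
    '
}
```

For the node-exporter textfile collector, you would redirect "fapolicyd-cli --check-metrics | to_prom" into a temporary file in the directory the collector watches and then rename it to a name ending in .prom, so the collector never reads a half-written file.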

What should you graph first? I would start with these:
Allowed accesses
Denied accesses
Allowed by fallthrough
Inter-thread max queue depth
Subject deferred events
Subject defer fallbacks
Early subject cache evictions
Subject evictions
Object evictions
Reply errors
Kernel queue overflow

Then add attribute lookups for the fields that matter to your policy. If Object attr: ftype lookups are high, file type detection is important for your workload. If Object attr: trust lookups are high, trust database lookups are part of the cost. If subject proc attributes are hot, fork/exec-heavy workloads may be asking for process details often.

Metrics do not tell you everything. They tell you where to look. If cache evictions are high, look at cache sizing. If fallthrough is high, look at rule coverage. If queue depth is high, look at workload bursts and decision cost. If you need to know where the time is going inside decisions, that is when the timing report comes in. That is the next article.
