Friday, July 3, 2026

fapolicyd 1.6: Generation-Aware State and Worker-Ready Internals

The fapolicyd 1.5 release was about making important failure modes visible and making rule reloads transactional. The 1.6 release builds on that by making more live daemon state explicit, reportable, and easier to hand off safely.

The short version is:

Give live daemon state a generation identity.
Report that identity in status and metrics.
Preserve the last good trust database if a rebuild fails.
Move mutable decision state toward explicit ownership.


That sounds like a lot of internal engineering work, and it is. But it improves normal operations in visible ways. Reports can now say which generation of config, rules, trust content, and LMDB storage is active. A failed trust database rebuild should not destroy the last good trust database. A maintenance command should not silently replace the physical LMDB storage without a controlled handoff.

fapolicyd 1.6 makes more live state generation-aware: the daemon can report which generation is active, and decision code pins the generation it needs while it evaluates an access request.

This release also prepares the daemon for a future with multiple decision threads. It does not turn on a worker pool yet. Instead, it removes several shared mutable assumptions that would make a worker pool unsafe. That is the real importance of the 1.6 release: it is preparation for the 2.0 release, which will bring multiple decision worker threads.

 

fapolicyd 1.6 reports and manages live state as generations, giving decisions a coherent view of the state they use.

 

Generational configuration

Configuration reloads are easy to underestimate.

Some settings can only be set at startup. Queue size, cache size, daemon uid/gid, fanotify mark configuration, and the trust database map sizing policy are tied to long-lived runtime objects. They still require a restart.

Other settings affect decisions while the daemon is running. In 1.6, the decision-used settings are published as an immutable configuration generation. That currently includes fields such as permissive mode and integrity mode.

The important behavior is not just that the values can be reloaded. The important behavior is that a decision does not mix values from different reloads. When a permission event starts, fapolicyd pins the active decision configuration generation. That same generation is used while the event is constructed, evaluated, logged, audited, and answered.

This is the kind of operational property administrators expect from enterprise software. Reloads should be observable. They should be atomic from the point of view of work already in progress. They should leave enough evidence in status and metrics reports to explain which version of live state was active.

Run this on a test system with fapolicyd running:

printf 'Before reload:\n' && \
fapolicyd-cli --check-status | grep -E '^(Config generation|Ruleset generation|Trust database generation|LMDB environment generation):' && \
printf '\nMetrics context:\n' && \
fapolicyd-cli --check-metrics | grep -E '^(Last metrics reset|Config generation|Ruleset generation|Trust database generation|LMDB environment generation|Trust database entries|Trust DB lookups|Trust DB reader slots full):' && \
printf '\nAfter reload:\n' && \
sudo kill -HUP "$(cat /run/fapolicyd.pid)" && \
fapolicyd-cli --check-status | grep -E '^(Config generation|Ruleset generation|Trust database generation|LMDB environment generation):'

Before reload:
Config generation: 1 (effective since 2026-07-03 14:09:18 -0400)
Ruleset generation: 1 (effective since 2026-07-03 14:09:18 -0400)
Trust database generation: 1 (effective since 2026-07-03 14:09:24 -0400)
LMDB environment generation: 1 (effective since 2026-07-03 14:09:23 -0400)

Metrics context:
Last metrics reset: never
Config generation: 1 (effective since 2026-07-03 14:09:18 -0400)
Ruleset generation: 1 (effective since 2026-07-03 14:09:18 -0400)
Trust database generation: 1 (effective since 2026-07-03 14:09:24 -0400)
LMDB environment generation: 1 (effective since 2026-07-03 14:09:23 -0400)
Trust database entries: 213870
Trust DB lookups: 407
Trust DB reader slots full: 0

After reload:
Config generation: 2 (effective since 2026-07-03 14:09:54 -0400)
Ruleset generation: 1 (effective since 2026-07-03 14:09:18 -0400)
Trust database generation: 1 (effective since 2026-07-03 14:09:24 -0400)
LMDB environment generation: 1 (effective since 2026-07-03 14:09:23 -0400)


The exact numbers are not important. The point is that the status report now names the generations that make up the daemon's current decision state, and the metrics report carries the same context for the current counter window.
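One way to use these headers in a script is to pull out just the generation number and compare it before and after a reload. The following is a minimal sketch that assumes the header layout shown in the sample output above. The function name generation_of is illustrative and is not part of fapolicyd-cli.

```shell
# Print the generation number that follows a named header in a status or
# metrics report read from stdin. Assumes lines of the form
# "<label>: <number> (effective since ...)", as in the sample output.
generation_of() {
    awk -F': ' -v label="$1" '
        $1 == label { split($2, parts, " "); print parts[1]; exit }
    '
}

# Example use against a live daemon (not run here):
#   before=$(fapolicyd-cli --check-status | generation_of "Config generation")
#   sudo kill -HUP "$(cat /run/fapolicyd.pid)"
#   after=$(fapolicyd-cli --check-status | generation_of "Config generation")
#   [ "$after" -gt "$before" ] && echo "config generation advanced"
```

If the label is not present in the report, the function prints nothing, so a caller should treat an empty result as "unknown" rather than as generation zero.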

 

Rulesets are pinned during decisions


Rule reloads were already transactional before 1.6: a candidate ruleset is parsed and validated before it becomes the published policy, and a failed reload leaves the previous ruleset in place. In 1.6, that ruleset generation is reported alongside the active config generation, trust database generation, and LMDB environment generation.

The useful 1.6 point is not that older code used partial rules. It is that the daemon now exposes the active state boundaries more clearly, and those boundaries are the shape needed before multiple decision workers can evaluate events concurrently.

This matters even before fapolicyd becomes multi-threaded. A report can now say which ruleset generation was active. Metrics can be interpreted against a specific daemon state. Operators can reload policy and know whether the daemon accepted the new generation or kept the previous one.

It also matters for future concurrency work. Multiple decision threads should not need to stop just because a reload is preparing a new ruleset. They should be able to keep using the generation they started with while the daemon prepares and publishes the next one.


Trust reloads preserve the last good database

The largest operational change in 1.6 is the trust database model.

Trust data is one of fapolicyd's major policy inputs. Package backends and local trust files describe files that are expected to be present and trusted. The daemon stores that data in LMDB and consults it while making decisions.

Before 1.6, trust database reloads were serialized with decision processing, so live decisions were not meant to read a half-built database. The limitation was operational: rebuild work could stop normal decision processing, and failure handling around the active trust database was harder to reason about.

The 1.6 model is safer:

build a candidate trust database generation
validate it
publish metadata that makes it active
let old readers drain
reclaim retired generations later

If the candidate cannot be built, the active trust database remains active. If a decision is already reading an older generation, that reader can finish before the retired generation is reclaimed. If reloads arrive during an active rebuild, the daemon coalesces or queues them so that a configuration-changing reload is not silently lost.
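The sequence above follows a general build, validate, then publish pattern. The sketch below illustrates that pattern with plain directories and an ACTIVE marker file. It is an analogy only and does not reflect how fapolicyd lays out LMDB generations. The names publish_candidate, gen.N, and ACTIVE are hypothetical, and reader draining and reclamation are left out.

```shell
# Build generation $2 under store directory $1 using builder command $3.
# The builder receives the candidate directory as its only argument and is
# expected to write a non-empty "data" file there (a stand-in for validation).
publish_candidate() {
    store=$1 next=$2 build_cmd=$3
    candidate="$store/gen.$next"
    mkdir -p "$candidate" || return 1

    # Build and validate. On failure, discard the candidate and leave the
    # ACTIVE marker untouched so the last good generation stays in use.
    if ! "$build_cmd" "$candidate" || [ ! -s "$candidate/data" ]; then
        rm -rf "$candidate"
        return 1
    fi

    # Publish by writing the marker to a temporary name and renaming it.
    # A rename within one filesystem replaces the old marker in one step.
    printf '%s\n' "$next" > "$store/ACTIVE.tmp" &&
        mv -f "$store/ACTIVE.tmp" "$store/ACTIVE"
}
```

The design point is that nothing a reader depends on is modified until the candidate is known to be good, which is why a failed build cannot damage the active generation.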

A future detailed trust database sizing article will cover this in depth, including manual sizing, automatic sizing, high-water pages, and compaction. The main point for a release overview is that trust database publication is now generational, and failed reloads preserve the last good trust state.


Why the trust database file may need compaction

Most administrators do not need to know LMDB internals. The useful question is simpler:

Is the trust database content current, and is the storage file still a reasonable size?

Those are related, but they are not the same thing.

The trust database generation tells you which trusted-file list is active. When packages change or an administrator updates trust files, fapolicyd builds and publishes a new trust database generation. That answers the content question: "which trust data are new decisions using?"

The storage file is the file that holds that list on disk. After package updates, repeated trust reloads, or test rebuilds, the current trusted-file list may be normal size while the storage file still reflects earlier growth. That does not mean the trust data has problems. It means the database file may have holes that can be repacked to save space.

That is what the status report is trying to show with active pages and allocated high-water pages. Active pages describe the trust data currently in use. Allocated high-water pages describe how large the database file has grown internally. If the high-water number is much larger than the active size, the status report may recommend compaction.

Compaction means "build a clean replacement storage file from the current trust sources and swap it in safely." It is similar in spirit to vacuuming or repacking a database.

In fapolicyd 1.6, compaction is controlled. The daemon builds a clean replacement database file in a temporary directory from the current trust sources, validates it, briefly stops new trust database readers, swaps in the replacement, and reopens the database. If that fails, the old database file is preserved or restored.

The status and metrics reports call the storage-file generation the LMDB environment generation. You do not need that term for daily administration. The practical meaning is: this number changes when fapolicyd replaces the database storage file, such as after controlled compaction. The trust database generation changes when the trusted-file contents change.

In 1.6, db_max_size = auto is the default setting. Auto sizing now accounts for reload headroom, not only the final active database size. Manual numeric values are still supported, but the status report can now tell the administrator when a manual value is too small for safe reloads. When in doubt, use auto.

The future dedicated trust database sizing article will cover the commands to investigate and compact this state, so we will not duplicate that full walkthrough here.


Reports show the active generations

The status and metrics reports now identify more than one kind of generation.

fapolicyd-cli --check-status is still the place to ask whether the daemon is healthy and configured as expected. It now reports the active config generation, ruleset generation, trust database generation, LMDB environment generation, and trust database entry count.

fapolicyd-cli --check-metrics is still the place to ask what happened during the current counter window. It includes the same generation headers so the counter window has context.

That matters during investigations. If a denial happened after a rule reload, you want to know which ruleset generation was active. If package updates triggered trust reloads, you want to know which trust database generation is currently used by new decisions. If a compaction ran, you want to know the physical LMDB environment generation changed.

The reader-slot counter is worth keeping. Trust lookups now use private LMDB read transactions and cursors instead of one global read cursor. That is one of the changes needed before future decision workers can look up trust concurrently.


Worker-ready decision state

The daemon still runs with one decision thread in 1.6. But the hot path is now much closer to the shape needed for more than one.

Earlier code kept several decision-path objects in shared global or file-static state: subject and object caches, decision counters, syslog formatting buffers, deferral state, libmagic handles, device cache state, and LMDB read state.

That works only as long as one decision thread owns the world.

The 1.6 release introduces an internal decision_context object. Today there is one active context, so behavior stays compatible. The important change is ownership. Mutable decision state now has a clearer owner, which makes it possible for a later worker design to give each worker its own context instead of having all workers contend on hidden shared state.

 

fapolicyd 1.6 moves mutable decision state behind a context so future workers can own their state.

 

This does not mean all concurrency work is finished. It means the most obvious shared-state blockers are being removed before the worker pool exists.

 

Better policy diagnostics

The policy linter also became more useful.

The linter added in the previous release warned about policy rules that can accidentally let executable or programmatic content reach the default-allow path. In 1.6, those warnings became more actionable. The linter can now report rule numbers and source file and line locations when a specific rule is involved. It also warns when an old fapolicyd.rules file shadows compiled.rules during linting.

You can demonstrate this without changing the installed policy. Create a small temporary rules file that intentionally lacks a terminal execute deny:

tmp_rules="$(mktemp)" && \
printf 'allow perm=execute all : all\n' > "$tmp_rules" && \
fapolicyd-cli --check-rules "$tmp_rules" --lint; \
rc=$?; rm -f "$tmp_rules"; echo "exit status: $rc"

Rules file is valid (1 rules)
Policy lint warning: executable events can fall through; no terminal broad execute deny found after rule 1 at /tmp/tmp.SbkGW2TCQF:1
Policy lint hint: add a final "deny_audit perm=execute all : all" rule
Policy lint warning: %languages is not defined in /tmp/tmp.SbkGW2TCQF; programmatic ftype coverage cannot be checked
exit status: 1

The exact warning text may differ as the linter evolves, but the point of the demo is that the warning names the policy issue and points at the rule context that caused it.

The release also starts warning about dir=untrusted. That compatibility macro is deprecated and should be replaced with explicit object trust rules. Existing policies still parse, but administrators get a warning so they can plan the migration before the compatibility path is removed in a later major release.

For a safe local check, use a temporary rule:

tmp_rules="$(mktemp)" && \
printf 'allow perm=any dir=untrusted : path=/tmp/payload\n' > "$tmp_rules" && \
fapolicyd-cli --check-rules "$tmp_rules"; \
rc=$?; rm -f "$tmp_rules"; echo "exit status: $rc"

07/03/26 14:23:38 [ WARNING ]: rules: line:1: subject dir=untrusted is deprecated and will be removed in a future release
Rules file is valid (1 rules)
exit status: 0

 

Smaller operational improvements

Several smaller changes round out the release.

fapolicyd-cli --timer-stop now asks before replacing an existing timing report. Timing data can be expensive to collect because it is usually captured around a specific workload. Accidentally overwriting the report with a "timing not armed" result is frustrating, so the CLI now checks before it removes an existing report.

fapolicyd-cli --dump-db handles an empty active trust database correctly in the generation-based layout. fapolicyd-cli --check-trustdb now treats an empty database as success instead of a database walk failure. --check-path now reports trust database initialization failures instead of continuing with partial database state.

 

ignore_mounts got a broader risk report

The ignored-mount checker was expanded in 1.6. It no longer looks only for files matching the %languages macro. It now reports risk categories such as executable regular files, ELF/shared objects, archives and JARs, bytecode caches, plugin/runtime directories, and language files.

That will be a future article because ignored mounts are easy to misuse. The overview version is simple: ignore_mounts is still for data-only mounts, and the checker now gives administrators better evidence before they decide a mount is safe to ignore.

 

The practical takeaway

fapolicyd 1.6 is not a release where the most important feature is one new command. The important change is the publication model.

Config, rules, trust contents, and physical LMDB storage now have clearer generation boundaries. Reports expose those boundaries. Decisions pin the state they started with. Failed trust reloads preserve the last good database. Mutable decision state is moving behind an ownership boundary that can support future workers.

The result is more explicit state identity, clearer failure handling for trust database rebuilds and compaction, better context in reports, and a cleaner path toward future multi-threaded decision handling.

The upstream project is here:
https://github.com/linux-application-whitelisting/fapolicyd

If you are new to fapolicyd itself, the Red Hat documentation on blocking and allowing applications with fapolicyd is a useful starting point.



Tuesday, June 9, 2026

Introducing netcap --advanced

netcap has traditionally answered a narrow but useful question: which network-facing processes are running with capabilities? That is still a good starting point. Network daemons are usually the first place to look when reviewing local privilege because they are reachable by something outside the process itself.

The problem is that a useful security posture review needs a little more context. It is not enough to know that a process has a socket and some capabilities. An administrator also needs to know where the socket is bound, which interface makes it reachable, which systemd unit owns it, whether the process is still root, whether the bounding set was trimmed, whether ambient capabilities are present, and whether basic exploit-resistance settings are enabled.

That is what netcap --advanced is for.

It is not a scanner in the network sense. It does not send packets. It reads the local system's own view of sockets and processes and turns that into an exposure and privilege report for the current network namespace.

netcap advanced joins local socket reachability with process privilege and hardening state.

What advanced mode changes

The historical mode is filtered. It looks for applications that use tcp, udp, raw, or packet sockets and also have capabilities. That is useful when the main question is "what network-facing process has privilege?"

Advanced mode changes the question. It inventories reachable binds and listeners regardless of whether the owning process currently has capabilities. Then it adds posture information for the owning process.

That distinction matters. A daemon with caps: (none) can still be important if it listens on every interface, runs as uid 0, has no seccomp filter, and is managed by a service unit that could be hardened. Conversely, a daemon with a small capability set may be acceptable if it binds only to loopback, runs as a dedicated user, has NoNewPrivileges=yes, and has a trimmed bounding set.

Advanced mode also covers protocol families that are easy to miss in ordinary checks. TCP listeners are included. UDP and UDPLITE bound sockets are included. RAW and PACKET sockets are included because they imply packet craft or packet capture behavior. SCTP and DCCP listeners are collected through NETLINK_SOCK_DIAG. VSOCK listeners are collected when the build and kernel support it. Bluetooth RFCOMM and HCI sockets are reported when the Bluetooth headers and procfs data are available.

This is why the output is organized as exposure planes:

  • INET (external)
  • INET (loopback)
  • LINK-LAYER
  • BLUETOOTH
  • VSOCK

The report is local to the current network namespace. If you run it in a container namespace, you get that namespace's view. If you run it on the host, you get the host namespace's view.

For full results, run it as root. The code needs to read other processes' /proc/<pid>/fd entries so it can map socket inodes back to process owners. It also needs enough network privilege to query sock_diag for some protocols. Without those permissions, the output is best effort and can be incomplete.

How the report is built

The design is a "join" across local kernel interfaces. First, netcap --advanced snapshots interface addresses with getifaddrs(). This gives it the names and addresses that will be used later when it decides whether a bind belongs under INET (external) or INET (loopback).

Second, it walks /proc. For each process, it reads enough metadata to build the process node: command name, executable path, real uid, cgroup-derived systemd unit name, capability sets through libcap-ng, ambient capability state, bounding set state, NoNewPrivs, seccomp mode, and the current LSM label when it can be read.

While it is walking /proc/<pid>/fd, it also builds a socket inode ownership map. Procfs exposes socket file descriptors as symlinks such as socket:[123456]. The listener tables later report socket inodes. The inode map is what lets netcap say "this listening SCTP socket belongs to this process and this unit."
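The inode ownership map can be approximated from a shell by reading the same procfs symlinks. This is a minimal sketch, not netcap's implementation. The function names socket_inode and socket_owner_map are made up for illustration, and the walk only sees file descriptors the calling user is allowed to read.

```shell
# Extract the inode number from a procfs link target such as
# "socket:[123456]". Fails for targets that are not sockets.
socket_inode() {
    case $1 in
        'socket:['*']')
            inode=${1#'socket:['}
            printf '%s\n' "${inode%']'}"
            ;;
        *)
            return 1
            ;;
    esac
}

# Print "inode pid" pairs for every readable socket file descriptor.
socket_owner_map() {
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        inode=$(socket_inode "$target") || continue
        pid=${fd#/proc/}
        printf '%s %s\n' "$inode" "${pid%%/*}"
    done
}
```

Running socket_owner_map as an unprivileged user typically lists only that user's own processes, which is the same reason the report is incomplete without root.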

Third, it reads protocol-specific socket sources. The ordinary internet socket tables come from /proc/net/tcp, /proc/net/tcp6, /proc/net/udp, /proc/net/udp6, and the related raw and udplite tables. Packet sockets come from /proc/net/packet. SCTP and DCCP listener data comes from NETLINK_SOCK_DIAG. VSOCK data comes from sock_diag when possible and falls back to /proc/net/vsock when it must. Bluetooth RFCOMM and HCI data comes from /proc/net/rfcomm, /proc/net/hci, and adapter information under /sys/class/bluetooth.

Fourth, it projects endpoints into the tree. A listener bound to 127.0.0.1 goes under loopback. A listener bound to a specific external address goes under the interface that owns that address. A wildcard bind, such as 0.0.0.0 or ::, is expanded onto the non-loopback interfaces so the tree shows where that wildcard is reachable.

That last step is one of the most useful parts of the design. A process did not literally call bind() once per interface. But from an administrator's point of view, a wildcard listener is reachable through every non-loopback address in the namespace. The output chooses the administrative view over the raw syscall view.
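The projection rule can be sketched in a few lines of shell. This is a simplified illustration under stated assumptions, not netcap's code: the function name project_bind and the "ifname=address" argument format are hypothetical, loopback is detected by address only, and address-family matching for wildcards is ignored.

```shell
# Succeed when the address is a loopback address (127.0.0.0/8 or ::1).
is_loopback() {
    case $1 in
        127.*|::1) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the interfaces under which a bind address would be shown.
# $1 = bind address; remaining arguments = "ifname=address" pairs.
project_bind() {
    bind=$1
    shift
    for pair in "$@"; do
        ifname=${pair%%=*}
        addr=${pair#*=}
        case $bind in
            0.0.0.0|::)
                # Wildcard: reachable through every non-loopback address.
                is_loopback "$addr" || printf '%s\n' "$ifname"
                ;;
            *)
                # Specific bind: only the interface that owns the address.
                if [ "$bind" = "$addr" ]; then
                    printf '%s\n' "$ifname"
                fi
                ;;
        esac
    done
}
```

For example, project_bind 0.0.0.0 lo=127.0.0.1 enp5s0=192.0.2.10 prints only enp5s0, while project_bind 127.0.0.1 with the same pairs prints only lo.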

Reading the tree

This is the basic hierarchy:

plane -> interface -> protocol -> bind -> port -> process -> caps/defenses/flags

For VSOCK, there is no network interface, so the VSOCK endpoints are rendered directly under the VSOCK plane.

The process line gives identity:

python3 (pid=1234 uid=0 exe=/usr/bin/python3.14 unit=netcap-demo.service)

The comm value is the kernel process name and can be truncated. The exe field is the full executable path when procfs allows it to be read. The unit field is extracted from the cgroup hierarchy and is limited to service or scope names that are useful for remediation.

The caps line shows the permitted capabilities as libcap-ng sees them. The best value is:

caps: (none)

caps: (full) means the process has full root-like capability privilege. Individual names such as net_admin, net_raw, sys_admin, setuid, or sys_ptrace need review in a network-facing process. The Linux capabilities(7) manual is the reference for what each capability allows.

Two annotations deserve special attention:

  • [ambient-present]
  • [open-ended-bounding]

Ambient capabilities are inherited across execve() and can surprise people because child programs keep the privilege unless the service takes care to clear it. An open-ended bounding set is not active privilege by itself, but it means the ceiling was not trimmed. A child process may still have a path to regain capabilities through file capabilities or other transitions.

The defenses section summarizes process hardening:

defenses
  runs_as_nonroot: no
  no_new_privs: no
  seccomp: disabled
  lsm: system_u:system_r:unconfined_service_t:s0

runs_as_nonroot is derived from the real uid. no_new_privs and seccomp come from /proc/<pid>/status, which is documented in proc_pid_status(5). The kernel's no_new_privs documentation explains why it is useful: it prevents a task from gaining new privilege through later exec transitions.
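The two status-derived fields can be read directly for a spot check. The sketch below assumes the NoNewPrivs and Seccomp fields described in proc_pid_status(5). The helper names status_field and seccomp_name are illustrative, and the labels other than "disabled" are my own choice rather than netcap's output wording.

```shell
# Print the value of one "Key:<tab>value" field from status text on stdin.
status_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Map the numeric Seccomp mode to a short name.
# 0 = disabled, 1 = strict, 2 = filter, per proc_pid_status(5).
seccomp_name() {
    case $1 in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Example use for a given pid (not run here):
#   status_field NoNewPrivs < /proc/1234/status
#   seccomp_name "$(status_field Seccomp < /proc/1234/status)"
```

A NoNewPrivs value of 1 corresponds to no_new_privs: yes in the report, and 0 corresponds to no.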

The flags section is where the report adds interpretation:

  • wildcard-bind
  • privileged-caps
  • reuseport
  • hypervisor-plane
  • ssh-on-vsock-port-22
  • proximity-plane

wildcard-bind means the daemon is bound to all addresses for that address family. privileged-caps means the process has one of the capability classes that is especially interesting for attack-surface review. reuseport means SO_REUSEPORT was detected on the socket. hypervisor-plane means VSOCK. ssh-on-vsock-port-22 means a VSOCK listener is using port 22. That does not prove it is SSH, but it is worth investigating because port 22 has a very specific operational meaning. proximity-plane means Bluetooth reachability.

First commands

Start by finding the interface names in the namespace you are reviewing.

$ netcap --advanced --list-interfaces
enp5s0
lo

Then collect the full tree without color so the output can be pasted into a ticket or report.

$ netcap --advanced --no-color

If one interface is the one that matters, filter the report.

$ netcap --advanced --interface enp5s0 --no-color
├─ INET (external)
│  └─ enp5s0
│     ├─ raw6
│     │  └─ *
│     │     └─ 58
│     │        └─ NetworkManager (pid=1999 uid=0 
│     │           exe=/usr/bin/NetworkManager unit=NetworkManager.service)
│     │           ├─ caps: dac_override, kill, setgid, setuid, 
│     │           │  net_bind_service, net_admin, net_raw, sys_module, 
│     │           │  sys_chroot, audit_write [open-ended-bounding]
│     │           ├─ defenses
│     │           │  ├─ runs_as_nonroot: no
│     │           │  ├─ no_new_privs: no
│     │           │  ├─ seccomp: disabled
│     │           │  └─ lsm: system_u:system_r:NetworkManager_t:s0
│     │           └─ flags
│     │              ├─ wildcard-bind
│     │              └─ privileged-caps
│     ├─ udp
│     │  └─ *
│     │     ├─ 5353
│     │     │  └─ avahi-daemon (pid=1735 uid=70 
│     │     │     exe=/usr/bin/avahi-daemon unit=avahi-daemon.service)
│     │     │     ├─ caps: (none)
│     │     │     ├─ defenses
│     │     │     │  ├─ runs_as_nonroot: yes
│     │     │     │  ├─ no_new_privs: no
│     │     │     │  ├─ seccomp: disabled
│     │     │     │  └─ lsm: system_u:system_r:avahi_t:s0
│     │     │     └─ flags
│     │     │        ├─ wildcard-bind
│     │     │        └─ reuseport
│     │     ├─ 50795
│     │     │  └─ firefox (pid=6947 uid=4325 
│     │     │     exe=/usr/lib64/firefox/firefox)
│     │     │     ├─ caps: (none)
│     │     │     ├─ defenses
│     │     │     │  ├─ runs_as_nonroot: yes
│     │     │     │  ├─ no_new_privs: no
│     │     │     │  ├─ seccomp: disabled
│     │     │     │  └─ lsm: 
│     │     │     │     unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.
│     │     │     │     c1023
│     │     │     └─ flags
│     │     │        └─ wildcard-bind
│     │     └─ 53013
│     │        └─ firefox (pid=6947 uid=4325 
│     │           exe=/usr/lib64/firefox/firefox)
│     │           ├─ caps: (none)
│     │           ├─ defenses
│     │           │  ├─ runs_as_nonroot: yes
│     │           │  ├─ no_new_privs: no
│     │           │  ├─ seccomp: disabled
│     │           │  └─ lsm: 
│     │           │     unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1
│     │           │     023
│     │           └─ flags
│     │              └─ wildcard-bind
│     └─ udp6
│        └─ *
│           └─ 5353
│              └─ avahi-daemon (pid=1735 uid=70 exe=/usr/bin/avahi-daemon 
│                 unit=avahi-daemon.service)
│                 ├─ caps: (none)
│                 ├─ defenses
│                 │  ├─ runs_as_nonroot: yes
│                 │  ├─ no_new_privs: no
│                 │  ├─ seccomp: disabled
│                 │  └─ lsm: system_u:system_r:avahi_t:s0
│                 └─ flags
│                    ├─ wildcard-bind
│                    └─ reuseport
└─ LINK-LAYER
   └─ enp5s0
      └─ packet
         └─ *
            └─ 2054
               └─ NetworkManager (pid=1999 uid=0 
                  exe=/usr/bin/NetworkManager unit=NetworkManager.service)
                  ├─ caps: dac_override, kill, setgid, setuid, 
                  │  net_bind_service, net_admin, net_raw, sys_module, 
                  │  sys_chroot, audit_write [open-ended-bounding]
                  ├─ defenses
                  │  ├─ runs_as_nonroot: no
                  │  ├─ no_new_privs: no
                  │  ├─ seccomp: disabled
                  │  └─ lsm: system_u:system_r:NetworkManager_t:s0
                  └─ flags
                     └─ privileged-caps

For automation, use JSON.

$ netcap --advanced --json

The tree is better for human triage. JSON is better for a nightly report, configuration drift check, or a small script that alerts on flags such as wildcard-bind, privileged-caps, hypervisor-plane, or proximity-plane.
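The JSON layout is not shown in this article, so the following sketch works on the --no-color tree text instead. It counts how often a given flag appears, relying only on the fact that each flag is the last word on its own line in the sample output above. The function name count_flag is illustrative.

```shell
# Count lines in a netcap --advanced --no-color report (read from stdin)
# whose last whitespace-separated word is exactly the given flag name.
count_flag() {
    awk -v flag="$1" '$NF == flag { n++ } END { print n + 0 }'
}

# Example use (not run here):
#   netcap --advanced --no-color | count_flag wildcard-bind
```

A nightly job could record these counts and alert when one rises. A parser built on the JSON output would be more robust than matching tree text once the schema is known.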

Limits

netcap --advanced is a local snapshot. A process can start or exit while the report is being built. Some procfs entries can be hidden by permissions or mount options. SCTP, DCCP, and VSOCK discovery depends on kernel support and, for the sock_diag path, sufficient privilege. Bluetooth adapter mapping can be heuristic when the kernel does not expose enough adapter information for a socket.

The report also does not tell you if a service is supposed to be there. It tells you what is there, where it is reachable, who it runs as, what capabilities it has, and what hardening is visible. The administrator still has to decide whether that service belongs on the machine.

The upstream project is here: https://github.com/stevegrubb/libcap-ng.
For background on Linux capabilities, start with capabilities(7).
For VSOCK-specific background, see vsock(7).

Wednesday, May 27, 2026

Stress Testing fapolicyd-1.5

The timing report explains where fapolicyd spent time during a bounded window. That is useful only if the workload is repeatable. This article covers the new stress harness and how to use it with status, metrics, and timing.

The stress helper is not installed by make install. It is a development, QE, sizing, and regression tool. It generates high-rate fanotify decision traffic against a running fapolicyd daemon by creating process trees and running specific workloads.

It can exercise process startup tracking, subject cache collisions, object cache churn, interpreter handling, no-shebang script handling, file opens, execs, and large file reads.

The stress harness creates controlled pressure so fapolicyd's behavior can be measured.
 

Building the Harness

Build it explicitly:

./configure --enable-stress
make -j32

The binary is:

src/tests/stress/fapolicyd-stress

--enable-stress defaults to off. A normal build does not enter this directory and does not build the helper.

If the daemon is enforcing policy, remember that this locally built helper is not automatically trusted. The interpreter workloads also use scripts from the source tree. Add the helper and scripts to the file trust database before running tests against an enforcing daemon:

stress_dir="$PWD/src/tests/stress"
sudo fapolicyd-cli --file add "$stress_dir/fapolicyd-stress"
sudo fapolicyd-cli --file add "$stress_dir/scripts"
sudo fapolicyd-cli --update

If you rebuild the helper, update the trust entry because its size or hash may change.

Workloads

The harness can run several workloads:

fork-exec     tight fork/exec loops
exec-open     opens configured command paths and executes them
interpreter   runs a script directly and through a shell
noshebang     exercises programmatic content without a #! line
hash          creates and reads a large file
churn         creates many small files to churn the object cache
all           runs every workload

fork-exec is the default and the best starting point for subject-cache and startup-state pressure. It repeatedly forks a child that execs a configured command. This creates a lot of short-lived process activity, which is exactly the kind of workload that can expose startup tracking and subject cache collision problems.

exec-open adds file open traffic around the exec stream. This is useful when you want both execution and object-open activity.

interpreter and noshebang are for programmatic content paths. The first runs a script directly and through a shell. The second attempts a direct exec of a file without a `#!` line and then runs it through the selected shell.

hash creates a large generated file and reads it. This is useful when integrity mode or policy makes hashing visible in the timing report.

churn creates many distinct small files and opens them in rotation. This is for object cache churn.

all is broad coverage. It is not where I would start if I were trying to understand one bottleneck.

Process Shape

The basic process controls are roots, fanout, and depth. The estimated leaf process count is:

roots * fanout ^ depth
More leaves means more concurrent process pressure. Wide process trees are the main way to create subject-cache collisions. That is what you need if you are testing startup-state behavior and subject deferral.

A representative command is:
src/tests/stress/fapolicyd-stress --workload fork-exec --roots 32 \
  --fanout 8 --depth 1 --iterations 0 --seconds 60 --timing
That creates 256 leaf processes (32 * 8^1). Each leaf runs the selected workload until the time limit expires. The --timing option tells the harness to start and stop the daemon's timer around the workload and generate a timing report automatically, so you do not need to drive fapolicyd-cli by hand.
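The leaf estimate is easy to check before launching a run. The following Python sketch mirrors the formula above; the function name is illustrative and is not part of the harness.

```python
def estimated_leaves(roots: int, fanout: int, depth: int) -> int:
    """Estimated leaf process count: roots * fanout ^ depth."""
    if roots < 0 or fanout < 0 or depth < 0:
        raise ValueError("roots, fanout, and depth must be non-negative")
    return roots * fanout ** depth


# The representative command: --roots 32 --fanout 8 --depth 1
print(estimated_leaves(32, 8, 1))  # 256
```

With fanout 1 and depth 0, the estimate collapses to the number of roots, which matches the smoke test output below (roots: 2, estimated leaf processes: 2).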

Smoke Test

A short local smoke test without daemon reports looks like this:

./fapolicyd-stress --no-status --workload fork-exec --roots 2 --iterations 10
fapolicyd stress harness
workload: fork-exec
roots: 2
fanout: 1
depth: 0
estimated leaf processes: 2
iterations per leaf: 10
seconds: 0
workdir: /tmp/fapolicyd-stress.ShyQtW

Workload summary:
wall_seconds: 0.040
operations: 20
errors: 0
throughput_ops_per_sec: 495.1

That proves the helper can run. It does not tell you much about the daemon. For daemon observations, let the harness collect status and metrics before and after the workload.

Timed Pressure Test

For a timed fork/exec pressure test against a configured daemon, run:

./fapolicyd-stress --workload fork-exec --roots 32 --seconds 30 --timing
fapolicyd stress harness
workload: fork-exec
roots: 32
fanout: 1
depth: 0
estimated leaf processes: 32
iterations per leaf: 100
seconds: 30
workdir: /tmp/fapolicyd-stress.uxEEQg

Workload summary:
wall_seconds: 1.801
operations: 3200
errors: 0
throughput_ops_per_sec: 1776.3

Daemon status deltas:
Inter-thread max queue depth: before=4 after=95
Subject deferred events: before=0 after=0
Subject defer max depth: before=0 after=0
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=0 after=0 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=0 after=0 delta=0
Subject collisions: before=0 after=16 delta=16
Subject evictions: before=11 after=30 delta=19
Object collisions: before=70 after=74 delta=4
Object evictions: before=70 after=74 delta=4
Allowed accesses: before=3447 after=60541 delta=57094
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 56953
Max queue depth during timing: 95
Timed throughput: 31537.4 decisions/sec
Active decision rate: 32206.7 decisions/sec
Decision latency: avg=31.0 us max=3.31 ms p95_bucket=<=100us


With --timing, the harness verifies that the daemon has timing_collection=manual, starts timing, runs the workload, stops timing, and parses a short summary from the timing report. It also captures status and metrics before and after the workload when it can.

This gives three views:
harness output     what workload was generated
metrics deltas     what counters moved
timing report      where decision time went
The output contains two different throughput figures. throughput_ops_per_sec is the harness's local operation rate. It is not the same as daemon decision throughput, because one local operation can generate zero, one, or multiple fanotify permission events. The timing report's decision count and throughput are the daemon-side numbers.
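To make the gap between the two numbers concrete, you can divide the daemon-side decision count by the harness operation count. This is a small Python sketch using the figures from the timed pressure test above; the helper name is made up for illustration.

```python
def decisions_per_operation(decisions: int, operations: int) -> float:
    """Average number of daemon decisions generated per harness operation."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    return decisions / operations


# Timed pressure test above: 56953 decisions for 3200 harness operations.
print(f"{decisions_per_operation(56953, 3200):.1f}")  # 17.8
```

In that run, each fork-exec operation produced roughly 18 permission decisions on average, which is why the daemon-side rate is so much higher than throughput_ops_per_sec.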

Subject Deferral Testing

For subject deferral testing, use the early eviction preset:

./fapolicyd-stress --preset early-evict --timing
fapolicyd stress harness
workload: fork-exec
roots: 32
fanout: 8
depth: 1
estimated leaf processes: 256
iterations per leaf: 0
seconds: 60
workdir: /tmp/fapolicyd-stress.jID2g5

Workload summary:
wall_seconds: 60.150
operations: 36666
errors: 0
throughput_ops_per_sec: 609.6

Daemon status deltas:
Inter-thread max queue depth: before=95 after=435
Subject deferred events: before=0 after=1
Subject defer max depth: before=0 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=0 after=3 delta=3
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=0 after=3 delta=3
Subject collisions: before=89 after=36881 delta=36792
Subject evictions: before=113 after=36905 delta=36792
Object collisions: before=207 after=216 delta=9
Object evictions: before=207 after=216 delta=9
Allowed accesses: before=63638 after=672670 delta=609032
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 608940
Max queue depth during timing: 435
Timed throughput: 10122.9 decisions/sec
Active decision rate: 10159.5 decisions/sec
Decision latency: avg=98.4 us max=19.9 ms p95_bucket=<=500us


There is also an ld-so-regression preset intended for comparing builds with and without subject deferral:

./fapolicyd-stress --preset ld-so-regression --timing
fapolicyd stress harness
workload: fork-exec
roots: 32
fanout: 8
depth: 1
estimated leaf processes: 256
iterations per leaf: 0
seconds: 60
workdir: /tmp/fapolicyd-stress.On1G0Q

Workload summary:
wall_seconds: 60.145
operations: 32529
errors: 0
throughput_ops_per_sec: 540.8

Daemon status deltas:
Inter-thread max queue depth: before=435 after=453
Subject deferred events: before=1 after=2
Subject defer max depth: before=1 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=36901 after=70158 delta=33257
Subject evictions: before=36925 after=70182 delta=33257
Object collisions: before=295 after=298 delta=3
Object evictions: before=295 after=298 delta=3
Allowed accesses: before=673924 after=1215792 delta=541868
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 541778
Max queue depth during timing: 453
Timed throughput: 9007.2 decisions/sec
Active decision rate: 9041.6 decisions/sec
Decision latency: avg=111 us max=1.63 ms p95_bucket=<=500us


The strongest evidence for the early-eviction problem is:
  • Subject collisions increased
  • Early subject cache evictions increased
  • The workload was wide enough to create many concurrent process startups

After deferral is working, a healthy run should show fewer early subject cache evictions and fewer unexpected denials under the same daemon configuration. Subject deferred events and Subject defer max depth may increase. That is fine. That means events were parked instead of evicting a BUILDING subject. But if Subject defer fallbacks keeps rising, the fixed defer array filled and the daemon fell back to the older eviction path.
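The comparison described above can be sketched in Python. This assumes you have already collected the counter deltas from two runs of the same preset into dictionaries keyed by the names shown under "Daemon status deltas". The function is hypothetical, and it uses Denied accesses as a stand-in for unexpected denials.

```python
def compare_deferral_runs(baseline: dict, candidate: dict) -> list:
    """Compare counter deltas from a run without and a run with deferral."""
    findings = []
    for name in ("Early subject cache evictions", "Denied accesses"):
        if candidate[name] < baseline[name]:
            findings.append(f"improved: {name} {baseline[name]} -> {candidate[name]}")
        elif candidate[name] > baseline[name]:
            findings.append(f"regressed: {name} {baseline[name]} -> {candidate[name]}")
    # Parked events are fine; fallbacks mean the fixed defer array filled.
    if candidate["Subject defer fallbacks"] > 0:
        findings.append("warning: defer array filled, fell back to eviction path")
    return findings
```

Subject deferred events and Subject defer max depth are deliberately not treated as problems here, because an increase in those counters is the expected effect of deferral.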

The early-evict preset is meant to prove whether subject deferral reduces premature BUILDING evictions.

Cache and Hash Workloads

For object cache pressure, run the churn workload and watch object misses, collisions, and evictions:

./fapolicyd-stress --workload churn --roots 8 --fanout 4 --depth 1 --seconds 60 --timing
fapolicyd stress harness
workload: churn
roots: 8
fanout: 4
depth: 1
estimated leaf processes: 32
iterations per leaf: 100
seconds: 60
workdir: /tmp/fapolicyd-stress.SqRIGt

Workload summary:
wall_seconds: 0.101
operations: 3200
errors: 0
throughput_ops_per_sec: 31531.9

Daemon status deltas:
Inter-thread max queue depth: before=453 after=453
Subject deferred events: before=2 after=2
Subject defer max depth: before=1 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=70180 after=70217 delta=37
Subject evictions: before=70207 after=70244 delta=37
Object collisions: before=724 after=743 delta=19
Object evictions: before=724 after=743 delta=19
Allowed accesses: before=1218087 after=1221409 delta=3322
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 3228
Max queue depth during timing: 32
Timed throughput: 30256.9 decisions/sec
Active decision rate: 34689.6 decisions/sec
Decision latency: avg=28.8 us max=4.32 ms p95_bucket=<=50us

For integrity cost, use the hash workload and look for hash_sha or hash_ima timing:

./fapolicyd-stress --workload hash --roots 8 --seconds 60 --timing
fapolicyd stress harness
workload: hash
roots: 8
fanout: 1
depth: 0
estimated leaf processes: 8
iterations per leaf: 100
seconds: 60
workdir: /tmp/fapolicyd-stress.1NYq01

Workload summary:
wall_seconds: 1.603
operations: 800
errors: 0
throughput_ops_per_sec: 498.9

Daemon status deltas:
Inter-thread max queue depth: before=453 after=453
Subject deferred events: before=2 after=2
Subject defer max depth: before=1 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=70230 after=70242 delta=12
Subject evictions: before=70257 after=70269 delta=12
Object collisions: before=755 after=757 delta=2
Object evictions: before=755 after=757 delta=2
Allowed accesses: before=1221624 after=1222544 delta=920
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 830
Max queue depth during timing: 8
Timed throughput: 515.9 decisions/sec
Active decision rate: 100841.2 decisions/sec
Decision latency: avg=9.92 us max=138 us p95_bucket=<=50us


For broad coverage, run all workloads:

./fapolicyd-stress --workload all --roots 8 --fanout 4 --depth 1 --seconds 60 --timing
fapolicyd stress harness
workload: all
roots: 8
fanout: 4
depth: 1
estimated leaf processes: 32
iterations per leaf: 100
seconds: 60
workdir: /tmp/fapolicyd-stress.2rexsD

Workload summary:
wall_seconds: 16.156
operations: 76800
errors: 0
throughput_ops_per_sec: 4753.7

Daemon status deltas:
Inter-thread max queue depth: before=453 after=453
Subject deferred events: before=2 after=35
Subject defer max depth: before=1 after=3
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=70253 after=115846 delta=45593
Subject evictions: before=70280 after=122273 delta=51993
Object collisions: before=1416 after=1471 delta=55
Object evictions: before=1416 after=1471 delta=55
Allowed accesses: before=1224770 after=1751558 delta=526788
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 526698
Max queue depth during timing: 129
Timed throughput: 32591.0 decisions/sec
Active decision rate: 33450.5 decisions/sec
Decision latency: avg=29.9 us max=12.5 ms p95_bucket=<=100us


Use all after you understand the individual workload costs. If you start with everything, the output may show several things moving at once and you will not know which workload caused which effect.

Reading the Results

After a stress run, collect both reports:

fapolicyd-cli --check-status
fapolicyd-cli --check-metrics
For queue pressure, compare Inter-thread max queue depth with the configured q_size, and then look at the timing report's Queueing section. If both queue depth and queue wait are high, requests are backing up before evaluation.

For subject-cache pressure, look at:

  • Subject collisions
  • Subject evictions
  • Early subject cache evictions
  • Subject deferred events
  • Subject defer max depth
  • Subject defer fallbacks
  • Subject BUILDING tracer evictions
  • Subject BUILDING stale evictions

For object-cache pressure, look at:

  • Object misses
  • Object collisions
  • Object evictions

For policy decisions, look at:

  • Allowed accesses
  • Denied accesses
  • Allowed by rule
  • Allowed by fallthrough

For daemon health, any non-zero value in these fields deserves attention:

  • Kernel queue overflows
  • Reply errors
  • Subject defer fallbacks
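If you save the harness output, the health check can be automated. The Python sketch below parses the before/after lines in the format shown in the "Daemon status deltas" blocks earlier and flags the three health fields. The regular expression and function names are assumptions written for this article, not something shipped with fapolicyd.

```python
import re

# Matches lines such as "Reply errors: before=0 after=0 delta=0".
# The " ns" unit and the delta field are optional, as in the sample output.
LINE_RE = re.compile(
    r"^(?P<name>[^:]+):\s+before=(?P<before>\d+)(?: ns)?"
    r"\s+after=(?P<after>\d+)(?: ns)?(?:\s+delta=-?\d+)?\s*$"
)

HEALTH_FIELDS = ("Kernel queue overflows", "Reply errors", "Subject defer fallbacks")


def parse_status_deltas(text: str) -> dict:
    """Return {counter name: (before, after)} for every matching line."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counters[match["name"]] = (int(match["before"]), int(match["after"]))
    return counters


def health_findings(counters: dict) -> list:
    """Names of health fields whose 'after' value is non-zero."""
    return [n for n in HEALTH_FIELDS if counters.get(n, (0, 0))[1] != 0]


sample = """\
Inter-thread max queue depth: before=4 after=95
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0
"""
print(health_findings(parse_status_deltas(sample)))  # []
```

An empty list means none of the three fields moved off zero, which is the case for every run shown in this article.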

The stress harness does not replace real production observation. It gives you a controlled way to move specific parts of the daemon. That is useful for regression testing, sizing experiments, and proving whether a change affected the counter or timing gate you intended to move.

The pattern is:

  • Pick one workload.
  • Run a bounded test.
  • Read status for health.
  • Read metrics for counters.
  • Read timing for latency.
  • Change one variable.
  • Run it again.

That is how the stress harness fits with the rest of the reporting work. Status tells you whether the daemon stayed healthy. Metrics tell you what moved. Timing tells you where time went. The stress tool gives you a repeatable way to make those questions concrete.

Sunday, May 24, 2026

Timing fapolicyd Decisions

The previous article looked at fapolicyd metrics. Metrics tell us what happened in a counter window. This article is about timing. Timing answers a different question: where did the daemon spend time while making decisions?

This is not something that should always be enabled. fapolicyd sits in the path of file opens and execs. Calling clock_gettime() around hot-path operations on every permission event has a cost. That is why timing collection is off by default and why the new model is manual. You arm timing, run a workload, stop timing, and read the report.

Before looking at the command line, it helps to understand what a decision is. The timing report is much easier to read if you know the path an access request takes through the daemon.

Anatomy of a Decision

At a high level, a fapolicyd decision has three phases:

event_build
evaluation
response
event_build turns the raw fanotify event into the internal data fapolicyd can reason about. evaluation walks the rules and computes the policy result. response records the outcome, prepares any logging or audit data, and writes the final decision back to the kernel.

How a fapolicyd access decision moves through the daemon.


The timing report uses names that look like this:

phase:operation:child
For example, evaluation:mime_detection:libmagic_fallback means libmagic fallback happened while the daemon was in the evaluation phase. response:trust_db_lookup:read means trust database read work was triggered while building the response.

The stages are not a strict accounting tree. Some operations are nested. Some are lazy and only happen if a rule or log format asks for the data. Child rows do not have to add up exactly to parent rows. Treat them as measured gates in the decision path, not as a perfect flame graph.
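If you post-process a timing report, the phase:operation:child naming is simple to split. This is a minimal Python sketch; the Stage type is hypothetical, and missing components are returned as None because some rows have only two parts.

```python
from typing import NamedTuple, Optional


class Stage(NamedTuple):
    phase: str
    operation: Optional[str]
    child: Optional[str]


def parse_stage(name: str) -> Stage:
    """Split 'phase:operation:child' into its parts, padding with None."""
    parts = name.split(":", 2)
    parts += [None] * (3 - len(parts))
    return Stage(*parts)


print(parse_stage("evaluation:mime_detection:libmagic_fallback").phase)  # evaluation
print(parse_stage("response:fanotify_write").child)                      # None
```

Because child rows do not have to add up to their parents, a tool built on this should group rows for display and not try to reconcile the totals.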

Event Build

The first phase is event_build. The kernel gives fapolicyd a fanotify event. That event has a permission file descriptor, a process id, and a mask describing the requested access. fapolicyd has to turn that into an internal event with enough subject and object identity to evaluate policy.

The main timed gates are:

event_build:cache_flush
event_build:proc_fingerprint
event_build:fd_stat
event_build:cache_flush is measured when the daemon has to flush object cache state on the decision path. Cache flushes are not the normal cost of every decision. If this shows up prominently, something caused cached object state to be invalidated during the timing window.

event_build:proc_fingerprint reads enough process information to identify the subject and detect stale cache entries. This is the first process identity gate. On fork/exec-heavy systems, this can be a visible cost because every short-lived command creates more process identity work.

event_build:fd_stat stats the fanotify file descriptor to identify the object being accessed. This gives fapolicyd stable file identity data. If this is expensive, the cost is in object fingerprinting before policy evaluation even starts.

The output of event_build is not the whole answer. It is the starting point. Some subject and object attributes are still lazy. They are computed only if a rule, syslog format, debug output, or audit response needs them.

Evaluation

The second phase is evaluation. This is where fapolicyd evaluates the active policy rules against the event.

The top-level evaluation gates are:

evaluation:lock_wait
evaluation:fd_path_resolution
evaluation:proc_detail_lookup
evaluation:lock_wait is time spent waiting for the rule lock before policy evaluation can proceed. In the current single-decision-thread design, this is usually not expected to dominate. It becomes more interesting as the code moves toward more read-side concurrency. If it is high now, rule reload or other policy maintenance activity may have overlapped the timing run.

evaluation:fd_path_resolution is path resolution for the object file descriptor. Not every rule needs a path, but rules that match on path or dir can force this work. A workload that opens many distinct files can make this more visible.

evaluation:proc_detail_lookup is the on-demand process detail gate. The initial fingerprint is collected in event_build, but rules may ask for more: auid, session id, executable path, process status fields, or other procfs details. If this row is high, the policy or output format is asking for process details that were not already cached.

MIME detection is a nested set of gates:
evaluation:mime_detection:fast_classification
evaluation:mime_detection:gather_elf
evaluation:mime_detection:libmagic_fallback
fapolicyd tries cheap classification first. The fast path recognizes common file types without using full libmagic. gather_elf is the ELF and script header scan used to identify executable and interpreter-related details. libmagic_fallback is the expensive fallback when the faster checks cannot classify the object.

If evaluation:mime_detection:libmagic_fallback is high, the daemon is often leaving the fast path. That can be caused by a workload full of files that are hard to classify cheaply, or by rules that ask for ftype on many objects.

Integrity and hashing gates are:

evaluation:hash_ima:total
evaluation:hash_sha:total
hash_ima is IMA digest collection from the security.ima extended attribute. hash_sha is SHA-style file digest collection. These are workload and policy dependent. If integrity mode or FILE_HASH rules force hashing of large files, these rows can dominate a timing run. This is one reason timing has to be a bounded diagnostic tool: hashing can be attacker-controlled cost if policy asks for it on large inputs.

Trust database lookup is also split:
evaluation:trust_db_lookup:lock_wait
evaluation:trust_db_lookup:read
lock_wait is time waiting for trust database update coordination before the read can proceed. read is the LMDB read-side work and related lookup logic. If read is high, trust lookup itself is material. If lock_wait is high, the decision path is waiting behind trust database maintenance or contention.

The evaluation phase decides whether a rule has an opinion. If a rule returns allow or deny, that is the decision source. If no rule has an opinion, fapolicyd preserves historical compatibility by allowing the access through the fallthrough path. The metrics report counts that separately as Allowed by fallthrough.
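The first-opinion-wins and fallthrough behavior can be illustrated with a short Python sketch. This is a conceptual model only: representing rules as Python callables that return "allow", "deny", or None is an assumption made for illustration, and it is not how the daemon stores or evaluates rules.

```python
def decide(rules, event):
    """Return (decision, source) for an event.

    The first rule that returns "allow" or "deny" is the decision source.
    If no rule has an opinion, the access is allowed by fallthrough.
    """
    for index, rule in enumerate(rules, start=1):
        opinion = rule(event)
        if opinion in ("allow", "deny"):
            return opinion, f"rule {index}"
    return "allow", "fallthrough"


# Two toy rules: the first has no opinion, the second denies paths under /tmp/.
rules = [
    lambda event: None,
    lambda event: "deny" if event["path"].startswith("/tmp/") else None,
]
print(decide(rules, {"path": "/tmp/payload"}))  # ('deny', 'rule 2')
print(decide(rules, {"path": "/usr/bin/id"}))   # ('allow', 'fallthrough')
```

The second result is the case that the metrics report counts as Allowed by fallthrough.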

Response

The third phase is response. This is the work after rule evaluation has produced an outcome. It includes decision bookkeeping, optional syslog or debug formatting, audit response metadata, and the write back to the kernel.

The response phase has some gates that look familiar:

response:mime_detection:fast_classification
response:mime_detection:gather_elf
response:mime_detection:libmagic_fallback
response:trust_db_lookup:lock_wait
response:trust_db_lookup:read
response:fanotify_write
This is the part that can be confusing at first. Why can MIME detection or trust database lookup appear in both evaluation and response?

The answer is lazy attributes. fapolicyd does not compute every possible attribute up front. That would waste time. Instead, attributes are looked up when something asks for them. During evaluation, policy rules may ask for ftype, trust, path, hash, or process details. During response, syslog, debug, or audit formatting may ask for some of the same attributes so the daemon can explain the decision.

So the same helper can be charged to different phases depending on who caused the lookup.

If evaluation:mime_detection:* is high, policy evaluation needed file type information. If response:mime_detection:* is high, reporting or audit output needed file type information after the decision had already been made. The same logic applies to trust database lookup. Evaluation-side trust lookup is policy cost. Response-side trust lookup is reporting or audit cost.

This distinction is important. If a timing report says response-side MIME detection dominates, making policy rules simpler may not help. The cost may be coming from debug output or a detailed syslog_format. If evaluation-side MIME detection dominates, then the policy is probably asking for file type data in the hot path.
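A toy model may help show why the same helper can be charged to either phase. The Python sketch below computes an attribute on first request and records which phase asked for it. The class and its caching behavior are assumptions for illustration; they are not the daemon's actual data structures.

```python
class LazyEvent:
    """Event whose attributes are computed on first request and then cached."""

    def __init__(self, loaders):
        self._loaders = loaders   # attribute name -> zero-argument function
        self._values = {}         # attributes computed so far
        self.charged_to = {}      # attribute name -> phase that caused the lookup

    def get(self, name, phase):
        if name not in self._values:
            self._values[name] = self._loaders[name]()
            self.charged_to[name] = phase
        return self._values[name]


# No rule asked for ftype, but the log format does, so response pays for it.
event = LazyEvent({"ftype": lambda: "text/x-shellscript"})
event.get("ftype", phase="response")
print(event.charged_to)  # {'ftype': 'response'}
```

Had a rule asked for ftype first, the same lookup would have been charged to evaluation, and the later response-side request would have been a cache hit in this model.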

response:fanotify_write is the final write of the permission decision back to the kernel. A high value here means the daemon did the policy work and then spent measurable time completing the kernel response. That is not rule cost. It is response-path cost.

Using the Timer

Now the report structure makes more sense. The timer creates a bounded window around a workload so the daemon can aggregate timing data for those gates.

Enable manual timing in fapolicyd.conf:

timing_collection=manual
After changing the configuration, reload or restart fapolicyd so the running daemon has the setting active. Then check status:

fapolicyd-cli --check-status | head

Look for:
Timing collection mode: manual
Timing collection armed: false
If the mode is still off, the daemon will ignore timing requests. This is intentional. A system should not accidentally start collecting timing data just because someone ran the CLI.

Manual timing creates a bounded measurement window around a workload.

The CLI has two names for the commands. The documentation uses --timing-start and --timing-stop. There are also --timer-start and --timer-stop aliases. I like the timer spelling because it reads naturally, so I will use that here.

To start timing, run the following command:

sudo fapolicyd-cli --timer-start

Now run something that causes fapolicyd decisions. For a tiny smoke test, this can be as boring as:

for i in $(seq 1 100); do /usr/bin/id >/dev/null; done
Then to stop timing, run the following command:

sudo fapolicyd-cli --timer-stop

The stop command prints the timing report. The daemon writes the report to:

/run/fapolicyd/fapolicyd.timing
The report is aggregate data. It does not store one record per decision. Each stage keeps a count, total time, maximum time, and latency buckets. That keeps memory bounded even if the timing window covers a large number of decisions. Here is a sample report:

Mode: manual
Timing run: 2026-05-24 14:54:07 -0400 to 2026-05-24 14:55:13 -0400
Duration: 0:01:06
Workers: 1
Max queue depth: 399
Decisions: 576,495
Throughput: 8717.3 decisions/sec (wall clock)
Active decision rate: 9596.4 decisions/sec

TL;DR:
  - Queueing pressure reached max depth 399 of 800 (49.9%), p95 wait <=100ms, max wait 160 ms.

Overall decision latency:
  avg 104 us, max 2.81 ms
  p50 bucket <=100us, p95 bucket <=500us, p99 bucket <=500us
  <=50us 1.4%, <=100us 72.4%, <=500us 100.0%, <=1ms 100.0%, >10ms 0.0%

Queueing:
  avg wait: 27.2 ms
  max wait: 160 ms
  p95 bucket: <=100ms
  total queued time: 15707 s
  max queue depth: 399

Decision phase timing:
Phase                 Calls  Calls/Dec      Total        Avg        Max   p95 bucket   Notes
event_build         576,495       1.00     3.17 s    5.50 us     516 us       <=50us   
evaluation          576,495       1.00     1.66 s    2.89 us    2.80 ms       <=50us   
response            576,495       1.00     55.0 s    95.5 us    1.44 ms      <=500us   

Lazy helper attribution:
  Helper timings are attributed to the active logical driver: evaluation or response.
  Combined totals are evaluation + response.

Lazy helper attribution by driver:
Helper                             Eval total  Response total     Combined Response %
mime_detection:total                  11.1 ms            0 ns      11.1 ms        0.0%
mime_detection:fast_classification     510 us            0 ns       510 us        0.0%
mime_detection:gather_elf              186 us            0 ns       186 us        0.0%
mime_detection:libmagic_fallback      10.4 ms            0 ns      10.4 ms        0.0%
trust_db_lookup:total                  120 ms            0 ns       120 ms        0.0%
trust_db_lookup:read                   109 ms            0 ns       109 ms        0.0%
trust_db_lookup:lock_wait             2.43 ms            0 ns      2.43 ms        0.0%

Combined lazy helper attribution:
Helper path                                       Calls  Calls/Dec      Total   Avg/call  Amort/Dec        Max   p95 bucket
mime_detection:total                                564       0.00    11.1 ms    19.6 us      19 ns    2.54 ms       <=10us
mime_detection:fast_classification                  564       0.00     510 us     903 ns       0 ns    11.3 us       <=10us
mime_detection:gather_elf                            32       0.00     186 us    5.82 us       0 ns    10.9 us       <=10us
mime_detection:libmagic_fallback                      9       0.00    10.4 ms    1.16 ms      18 ns    2.53 ms        <=5ms
trust_db_lookup:total                            66,016       0.11     120 ms    1.81 us     207 ns    2.80 ms        <=5us
trust_db_lookup:read                             66,016       0.11     109 ms    1.65 us     189 ns    2.80 ms        <=5us
trust_db_lookup:lock_wait                        66,016       0.11    2.43 ms      36 ns       4 ns    25.7 us        <=1us
hash_sha:total                                       22       0.00    10.6 ms     481 us      18 ns    2.79 ms        <=5ms
proc_detail_lookup                               99,060       0.17     1.29 s    13.0 us    2.24 us     415 us       <=50us

Derived observations:
  Queueing showed moderate bursts: max queue depth 399 of 800 (49.9%), p95 wait <=100ms, max wait 160 ms.
  libmagic fallback is the biggest MIME contributor: 1.6% of MIME calls, 94.0% of MIME time, 18 ns amortized per decision.
  hash_sha is rare but expensive: 0.0% of decisions, 481 us avg when called, 18 ns amortized per decision.

Detailed stage timing, sorted by total time:
Stage                                                 Calls      Calls/Dec      Total        Avg        Max   p95 bucket
time_in_queue:total                                 576,495           1.00    15707 s    27.2 ms     160 ms      <=100ms
decision:total                                      576,495           1.00     60.1 s     104 us    2.81 ms      <=500us
response:total                                      576,495           1.00     55.0 s    95.5 us    1.44 ms      <=500us
response:fanotify_write                             576,495           1.00     54.0 s    93.7 us    1.44 ms      <=500us
event_build:total                                   576,495           1.00     3.17 s    5.50 us     516 us       <=50us
event_build:proc_fingerprint                        576,495           1.00     1.92 s    3.32 us     250 us       <=10us
evaluation:total                                    576,495           1.00     1.66 s    2.89 us    2.80 ms       <=50us
evaluation:proc_detail_lookup                        99,060           0.17     1.29 s    13.0 us     415 us       <=50us
event_build:fd_stat                                 576,495           1.00     733 ms    1.27 us     512 us        <=5us
evaluation:trust_db_lookup:total                     66,016           0.11     120 ms    1.81 us    2.80 ms        <=5us
evaluation:trust_db_lookup:read                      66,016           0.11     109 ms    1.65 us    2.80 ms        <=5us
evaluation:lock_wait                                576,495           1.00    24.8 ms      43 ns    42.4 us        <=1us
response:audit_metadata:total                       576,495           1.00    12.5 ms      21 ns    46.9 us        <=1us
evaluation:mime_detection:total                         564           0.00    11.1 ms    19.6 us    2.54 ms       <=10us
evaluation:hash_sha:total                                22           0.00    10.6 ms     481 us    2.79 ms        <=5ms
evaluation:mime_detection:libmagic_fallback               9           0.00    10.4 ms    1.16 ms    2.53 ms        <=5ms
evaluation:fd_path_resolution                           588           0.00    6.73 ms    11.4 us     164 us       <=50us
evaluation:trust_db_lookup:lock_wait                 66,016           0.11    2.43 ms      36 ns    25.7 us        <=1us
evaluation:mime_detection:fast_classification           564           0.00     510 us     903 ns    11.3 us       <=10us
evaluation:mime_detection:gather_elf                     32           0.00     186 us    5.82 us    10.9 us       <=10us

Stage tail summary:
  time_in_queue:total: >10ms 564,343/97.9%, >25ms 235,276/40.8%, >50ms 39,149/6.8%, >100ms 2,230/0.4%

Not observed:
  event_build:cache_flush, response:mime_detection:total, response:mime_detection:fast_classification, response:mime_detection:gather_elf, response:mime_detection:libmagic_fallback, evaluation:hash_ima:total, response:trust_db_lookup:total, response:trust_db_lookup:lock_wait, response:trust_db_lookup:read, response:syslog_debug_format:total

Notes:
  Largest queued-time contributor: time_in_queue:total (15707 s)
  Largest helper contributor: proc_detail_lookup (1.29 s)
  Largest decision phase contributor: response (55.0 s)
  Slowest observed row by max: time_in_queue:total (160 ms)

 

Interpreting the Report

The full timing report can look intimidating, but the reading order is simple:

1. Run summary
2. TL;DR
3. Overall decision latency
4. Queueing
5. Decision phase timing
6. Lazy helper attribution
7. Detailed stage timing
The run summary tells you whether the run is meaningful. Decisions should be non-zero; if the count is zero, the workload did not generate timed daemon decisions. Max queue depth tells you whether requests backed up while timing was armed.

The report has two rates. Throughput is decisions per wall-clock second while timing was armed. Active decision rate is based on accumulated decision:total worker time. If wall-clock throughput is low but active decision rate is high, the workload may have been bursty or idle part of the time. If both are low, decision work is expensive.
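A small worked example may make the two rates easier to tell apart. The numbers below are invented, and the formula for active decision rate (decisions divided by accumulated decision:total time) is my reading of the description above, not something taken from the report code:

```shell
# decision_rates <decisions> <wall_seconds> <accumulated_decision_seconds>
# All three inputs are hypothetical values you would read off a timing report.
decision_rates() {
    awk -v n="$1" -v wall="$2" -v busy="$3" 'BEGIN {
        # Throughput: decisions per wall-clock second while timing was armed.
        printf "throughput: %.0f decisions/s\n", n / wall
        # Active decision rate: decisions per second of accumulated worker time.
        printf "active decision rate: %.0f decisions/s\n", n / busy
    }'
}

decision_rates 10000 20 2
# throughput: 500 decisions/s
# active decision rate: 5000 decisions/s
```

In this made-up run the worker was busy for only 2 of the 20 seconds, so the low throughput points at an idle or bursty workload rather than expensive decisions.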

The TL;DR section is not magic. It is a compact set of observations derived from the same data later in the report. It may point out helper dominance, response formatting cost, queue health, or that no dominant findings were observed.

Overall decision latency is the end-to-end decision worker latency after an event has been dequeued. It includes event build, policy evaluation, optional logging or audit preparation, response selection, and the fanotify response write. Queue wait is reported separately.

Before the next section, let's talk about the p95 notation. The p95 bucket is the latency bucket that contains the 95th percentile observation. In plain terms: about 95% of the measured calls completed at or below that bucket, and about 5% were slower. If the report says a stage has p95 <=500us, then nearly all calls to that stage were fast. If it says p95 >10ms, then the slow path was not just one rare outlier; it happened often enough to affect the top 5% of calls.

The p95 bucket matters because averages can hide tail latency, and maximums can overreact to one unusual event. The p95 bucket sits between them: it tells you whether slowness is recurring enough to matter.

This section has average latency, maximum latency, p95 bucket, and tail buckets. The p95 bucket is usually more useful than the maximum. A high maximum with a low p95 means rare outliers. A high p95 means the slow path is common.
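To make the p95 bucket concrete, here is a sketch that finds it from per-bucket counts. The bucket labels and counts are invented for illustration; they are not fapolicyd's internal bucket boundaries, and the input layout is an assumption rather than the report's format:

```shell
# Input: one "<bucket label> <count>" pair per line, ordered fastest to slowest.
p95_bucket() {
    awk '
    { label[NR] = $1; count[NR] = $2; total += $2 }
    END {
        if (total == 0) { print "no observations"; exit }
        for (i = 1; i <= NR; i++) {
            seen += count[i]
            # First bucket whose cumulative count covers 95% of all observations.
            if (seen * 100 >= total * 95) { print label[i]; exit }
        }
    }'
}

printf '%s\n' '<=10us 900' '<=100us 60' '<=1ms 30' '<=10ms 9' '>10ms 1' | p95_bucket
# <=100us
```

With these counts the maximum lands in the >10ms bucket, but the p95 bucket is still <=100us: a rare outlier, not a common slow path.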

Queueing tells you how long events waited in fapolicyd's internal userspace queue before the decision worker started processing them. If queue wait is high and max queue depth is close to q_size, the daemon is being fed events faster than it can answer them.

Decision phase timing is the first section I would use for diagnosis:

event_build:total
evaluation:total
response:total

If event build dominates, look at process and object identity work. If evaluation dominates, look at rule traversal and evaluation-side helpers. If response dominates, look at logging, audit metadata, response-side helpers, and the kernel write.

Lazy helper attribution explains whether expensive helper work was caused by evaluation or response. This is where the duplicate-looking MIME and trust DB rows become useful. The same helper has very different tuning implications depending on whether policy needed it or output formatting needed it.

The detailed stage table is the final view. It is sorted by total measured time. Use it to find the main cost centers:

High event_build:proc_fingerprint                -> process identity lookup cost
High event_build:fd_stat                         -> object fingerprinting cost
High evaluation:lock_wait                        -> rule lock wait
High evaluation:total                            -> rule traversal or matching cost
High evaluation:mime_detection:libmagic_fallback -> expensive file type fallback
High evaluation:hash_sha:total                   -> SHA hashing cost
High evaluation:hash_ima:total                   -> IMA digest cost
High evaluation:trust_db_lookup:read             -> trust database read cost
High response:mime_detection:*                   -> reporting needed ftype data
High response:trust_db_lookup:read               -> reporting or audit needed trust data
High response:fanotify_write                     -> kernel response write cost

Do not expect every row to appear in every report. If a stage is not observed, the workload did not trigger that measured operation during the timing window. That can be good news. For example, no libmagic fallback means fast MIME classification handled the files observed in the run.

The best way to use timing is to make one change at a time. Change the workload, a rule, an integrity setting, or output format, then run another bounded timing window. Compare the same sections. If the dominant cost moves, you learned something.

The next article will cover the stress harness. The timer tells us where the daemon spent time. The stress tool gives us repeatable workloads that can move specific counters and timing gates on purpose.

Saturday, May 23, 2026

Understanding fapolicyd-1.5 Metrics

In the last article, I went over the bigger picture of the fapolicyd 1.5 work. From it, we learned that status, metrics, and timing are now separate reports because they answer different questions.

This article is about metrics.

The important distinction is this: "fapolicyd-cli --check-status" asks whether the daemon is healthy and configured the way you expect. "fapolicyd-cli --check-metrics" asks what the daemon has been doing. Metrics are where you look for rule hits, cache behavior, default-allow decisions, queue pressure, and which rule attributes are causing work.

Start with the command:

# fapolicyd-cli --check-metrics
Last metrics reset: never
Ruleset generation: 1

Decision outcomes:
Allowed accesses: 42171
Denied accesses: 3
Allowed by rule: 42171
Allowed by fallthrough: 0

Inter-thread queue & defer activity:
Inter-thread max queue depth: 6
Subject deferred events: 0
Subject defer max depth: 0
Subject defer fallbacks: 0

Subject cache effectiveness:
Subject hits: 41632
Subject misses: 692
Subject collisions: 28
Subject evictions: 150 (0%)
Early subject cache evictions: 0
Subject BUILDING tracer evictions: 0
Subject BUILDING stale evictions: 0

Object cache effectiveness:
Object hits: 34527
Object misses: 11421
Object collisions: 3774
Object evictions: 3774 (10%)

Rule hit counts:
Hits/rule:   1      0 allow perm=any uid=0 : dir=/var/tmp/
Hits/rule:   2   7531 allow perm=any uid=0 trust=1 : all
Hits/rule:   3      0 allow perm=open exe=/usr/bin/rpm : all
Hits/rule:   4      0 allow perm=open exe=/usr/bin/python3.13 comm=dnf : all
Hits/rule:   5      0 deny_audit perm=any all : ftype=application/x-bad-elf
Hits/rule:   6   7634 allow perm=open all : ftype=application/x-sharedlib trust=1
Hits/rule:   7      0 deny_audit perm=open all : ftype=application/x-sharedlib
Hits/rule:   8    578 allow perm=execute all : trust=1
Hits/rule:   9     12 allow perm=any gid=wheel : ftype=%languages dir=/home
Hits/rule:  10      0 allow perm=any gid=wheel : ftype=%languages dir=/usr/share/git-core/templates/
Hits/rule:  11    158 allow perm=open all : ftype=%languages trust=1
Hits/rule:  12      0 deny_audit perm=any all : ftype=%languages
Hits/rule:  13     20 allow perm=any all : ftype=text/x-shellscript
Hits/rule:  14      3 deny_audit perm=execute all : all
Hits/rule:  15  26238 allow perm=open all : all

Subject attribute lookups:
Subject attr: auid requests=3 lookups=3
Subject attr: uid requests=84351 lookups=692
Subject attr: sessionid requests=0 lookups=0
Subject attr: pid requests=3 lookups=0
Subject attr: ppid requests=3 lookups=0
Subject attr: trust requests=7531 lookups=565
Subject attr: gid requests=52850 lookups=0
Subject attr: comm requests=0 lookups=0
Subject attr: exe requests=68692 lookups=1216
Subject attr: dir requests=0 lookups=0
Subject attr: ftype requests=0 lookups=0

Object attribute lookups:
Object attr: path requests=12476 lookups=9452
Object attr: dir requests=7853 lookups=2453
Object attr: device requests=0 lookups=0
Object attr: ftype requests=217466 lookups=9420
Object attr: trust requests=8376 lookups=1437
Object attr: filehash requests=0 lookups=0


The first two lines matter more than they might appear:
Last metrics reset: never
Ruleset generation: 1

If the last reset is never, then the counters have been growing since the daemon started. That is the default behavior, and it is useful for a quick look at the lifetime of the daemon. If you want a smaller window, configure reset_strategy=manual in fapolicyd.conf and use "fapolicyd-cli --reset-metrics". The reset-metrics report snapshots the counters, resets them, and displays what they were at reset. The output is exactly the same as the check-metrics report, but the next report starts fresh.

The Ruleset generation tells you which active policy the counters apply to. Rule hit counters naturally reset when new rules are successfully loaded. That is important because rule numbers and rule text can change when policy changes. You do not want rule hit counts from yesterday's policy mixed with today's policy.

Metrics are easiest to interpret when the counter window and ruleset generation are explicit.

The first real section is decision outcomes. The old way to look at fapolicyd activity was allowed versus denied. That is still useful, but it is not enough. An allow can happen because a rule matched. It can also happen because the rules had no opinion and the historical default allow behavior was used. Note: the shipped rules end with:

deny_audit perm=execute all : all
allow perm=open all : all

The intention is to block any unknown execution and allow opens of any document. This depends on the rules before these lines accurately detecting any interpreted languages.

To better understand what kind of access decisions are made, the metrics report now separates:
Allowed accesses
Denied accesses
Allowed by rule
Allowed by fallthrough

If Allowed by fallthrough is zero, then every allow in the window came from a rule. If it is non-zero, the report prints more detail: open versus execute, trusted versus untrusted or unknown trust, and broad file type buckets such as executable, programmatic, shared library, unknown ftype, and other ftype.

The default allow indicates that policy is missing a decision rule. It is not the same as a policy rule intentionally approving the access. If a system has a large number of fallthrough execute decisions, you would want to know why. Maybe that is expected for a permissive policy. Maybe it shows a missing terminal deny. The point is that it is now visible.

Rule hit counts are next. They look like this:
Hits/rule:   1      42 allow perm=execute all : trust=1
Hits/rule:   2       3 deny_audit perm=execute all : trust=0

The exact rules will depend on your policy. A rule hit is counted when the rule actually makes the allow or deny decision. Merely iterating past a rule is not a hit. This makes the table useful for answering practical questions. Which rules are carrying the load? Which rules never fire? Did the rule I just added actually match the program I was testing?
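The "which rules never fire?" question is easy to answer mechanically. Here is a minimal sketch that reads a metrics report on stdin and lists the rules with zero hits. It assumes the Hits/rule line layout shown in the sample report (rule number, hit count, rule text); the function name is my own:

```shell
# Print rules that made no allow/deny decision in the current metrics window.
unused_rules() {
    awk '$1 == "Hits/rule:" && $3 == 0 {
        rule = $0
        # Strip the "Hits/rule: <number> <hits>" prefix, keeping only the rule text.
        sub(/^Hits\/rule:[ \t]+[0-9]+[ \t]+[0-9]+[ \t]+/, "", rule)
        printf "rule %d never fired: %s\n", $2, rule
    }'
}
```

Typical use would be "fapolicyd-cli --check-metrics | unused_rules". A zero only means the rule did not decide anything in this window; a short window or a quiet workload can leave perfectly good rules at zero.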

Next come queue and defer activity. The most important queue metric is:
Inter-thread max queue depth

This is the high-water mark for fapolicyd's internal event queue in the current metrics window. It is not the kernel fanotify queue. It is the userspace queue between event intake and decision processing. If this number grows near your configured q_size, the daemon is not keeping up with the event stream.
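A quick way to turn that into a check is to compare the high-water mark against q_size. This is only a sketch: you pass in the q_size from your own fapolicyd.conf, and the 80% warning threshold is an arbitrary choice of mine, not a value the daemon uses:

```shell
# queue_headroom <q_size>   (reads a metrics report on stdin)
queue_headroom() {
    awk -F': ' -v q_size="$1" '
    /^Inter-thread max queue depth:/ && q_size > 0 {
        pct = 100 * $2 / q_size
        printf "max queue depth %d of %d (%.1f%%)\n", $2, q_size, pct
        # 80% is an illustrative threshold; pick one that suits your workload.
        if (pct >= 80) print "warning: queue high-water mark is close to q_size"
    }'
}
```

Typical use would be "fapolicyd-cli --check-metrics | queue_headroom 800", with 800 replaced by whatever q_size you have configured.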

Subject deferral is new in this release. The daemon can defer an incoming event when processing it would evict another process that is still building startup pattern state in the same subject cache slot. In a normal busy system, some deferred events may be fine. What I would watch closely is:
Subject deferred events
Subject defer max depth
Subject defer fallbacks

Subject defer fallbacks means the fixed defer array filled and the daemon had to use the historical eviction behavior. If this keeps climbing during normal workloads, look at subject cache sizing and the workload shape. A very wide fork/exec storm can create lots of subject cache collisions.

The cache sections are next. fapolicyd has subject and object caches because recomputing process and file attributes on every event would be expensive. The basic shape is:
Subject hits
Subject misses
Subject collisions
Subject evictions

Object hits
Object misses
Object collisions
Object evictions

Hits are good. Misses mean fapolicyd had to populate the information, which can be costly. Collisions mean a populated cache slot did not match the current process or file identity. Evictions mean something was removed to make room. A small number of evictions is normal. A high eviction rate compared to hits suggests that the cache may be too small for the workload.
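If you want one number per cache, a hit rate is easy to derive from the report. The sketch below computes hits / (hits + misses); that ratio is my own definition for illustration, not a figure the report prints:

```shell
# Reads a metrics report on stdin and prints a hit rate for each cache.
cache_hit_rates() {
    awk -F': ' '
    /^(Subject|Object) (hits|misses):/ { v[$1] = $2 + 0 }
    END {
        n = split("Subject Object", kinds, " ")
        for (i = 1; i <= n; i++) {
            k = kinds[i]
            total = v[k " hits"] + v[k " misses"]
            if (total > 0)
                printf "%s cache hit rate: %.1f%%\n", k, 100 * v[k " hits"] / total
        }
    }'
}
```

Fed the sample report from earlier in this article (for example with "fapolicyd-cli --check-metrics | cache_hit_rates"), this gives 98.4% for the subject cache and 75.1% for the object cache.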

The subject cache also reports early subject cache evictions, tracer evictions, and stale BUILDING evictions. These are more health-oriented. An early eviction means a subject was evicted before startup state was complete. A tracer eviction means the occupant was being traced and could hold the slot indefinitely. A stale eviction means startup state stayed incomplete past the bounded stale window.

The last section is easy to skip, but it is one of the most useful additions: attribute lookup metrics. They look like this:
Subject attr: auid requests=1000 lookups=12
Object attr: ftype requests=1000 lookups=200

requests means policy evaluation or syslog formatting asked for the attribute. lookups means the attribute was not already present in the event cache and fapolicyd had to compute or fetch it.

That distinction matters. If requests are high but lookups are low, the cache is doing its job. If lookups are high, then that attribute is causing real work. This is also a way to see whether logging is making the daemon do extra lookups. For example, a very detailed syslog_format can request attributes that the policy did not need for the decision.
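To see at a glance which attributes cause real work, you can compute the share of requests that needed a lookup. This sketch assumes the "Subject attr:" and "Object attr:" line layout shown in the sample report; the function name and output wording are mine:

```shell
# Reads a metrics report on stdin; skips attributes that were never requested.
attr_lookup_ratio() {
    awk '$2 == "attr:" {
        split($4, req, "=")      # requests=N
        split($5, look, "=")     # lookups=N
        if (req[2] + 0 > 0)
            printf "%s %s: %.1f%% of requests needed a lookup\n", $1, $3, 100 * look[2] / req[2]
    }'
}
```

On the sample report, Subject uid comes out at 0.8% (the cache is doing its job), while Object path comes out at 75.8% (most requests did real work).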

The metrics report groups activity by decision outcome, queue/defer pressure, cache behavior, and attribute lookup cost.


Once you have metrics, the next question is how to graph them. fapolicyd does not need to know about your monitoring stack. The report is simple name: value text. That means you can parse the fields you care about and ship them to whatever you already use.

If you use StatsD, you can turn selected fields into gauge updates. This is deliberately small and crude, but it shows the idea:

fapolicyd-cli --check-metrics |
awk -F': ' '
# Match only the selected report fields; $1 is the field name, $2 the value.
/^(Allowed accesses|Denied accesses|Allowed by rule|Allowed by fallthrough|Inter-thread max queue depth|Subject deferred events|Subject defer fallbacks|Early subject cache evictions|Object evictions):/ {
    name = $1
    gsub(/ /, "_", name)
    # %d keeps the leading number and drops suffixes such as " (10%)"
    printf "fapolicyd.%s:%d|g\n", tolower(name), $2
}' | nc -u -w0 127.0.0.1 8125



If you use Prometheus and Grafana, remember that Grafana is the visualization layer. Something else has to collect the numbers. Prometheus supports a simple text exposition format. You could write a tiny exporter or generate a node-exporter textfile collector file from selected fapolicyd metrics.

For Prometheus, be careful with labels. It is tempting to turn every rule hit line into a metric with the whole rule text as a label. That can create high cardinality and make your monitoring system unhappy. I would start with stable low-cardinality metrics: allow/deny counts, fallthrough counts, queue depth, cache hits and evictions, subject deferral, kernel overflow, and reply errors. Rule hits are better for troubleshooting reports unless you have a controlled set of rule labels.

Here is a minimal Prometheus-style example for a few fields:

fapolicyd-cli --check-metrics |
awk -F': ' '
# One output line per selected field: metric name, a space, then the value.
/^Allowed accesses:/ { print "fapolicyd_allowed_accesses " $2 }
/^Denied accesses:/ { print "fapolicyd_denied_accesses " $2 }
/^Allowed by fallthrough:/ { print "fapolicyd_allowed_by_fallthrough " $2 }
/^Inter-thread max queue depth:/ { print "fapolicyd_inter_thread_max_queue_depth " $2 }
/^Subject defer fallbacks:/ { print "fapolicyd_subject_defer_fallbacks " $2 }
/^Early subject cache evictions:/ { print "fapolicyd_early_subject_cache_evictions " $2 }
'


This is not meant to be a finished exporter. It is a starting point. A real exporter should sanitize names, handle percentages, keep type/help metadata, and decide whether a field is a counter or gauge.
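As a sketch of what the next step toward that could look like, the function below adds HELP/TYPE metadata and copes with values that carry a percentage suffix. The metric names, help strings, and the counter-versus-gauge choices are my assumptions, not an official fapolicyd exporter:

```shell
# Convert selected report fields (stdin) to Prometheus text exposition format.
metrics_to_prom() {
    awk -F': ' '
    # emit() prints HELP and TYPE metadata followed by the sample itself.
    function emit(name, type, help, value) {
        printf "# HELP %s %s\n", name, help
        printf "# TYPE %s %s\n", name, type
        printf "%s %d\n", name, value   # %d keeps the leading number, dropping any " (10%)" suffix
    }
    /^Allowed accesses:/ { emit("fapolicyd_allowed_accesses", "counter", "Allowed accesses since the last metrics reset", $2) }
    /^Denied accesses:/ { emit("fapolicyd_denied_accesses", "counter", "Denied accesses since the last metrics reset", $2) }
    /^Inter-thread max queue depth:/ { emit("fapolicyd_inter_thread_max_queue_depth", "gauge", "High-water mark of the internal event queue", $2) }
    /^Object evictions:/ { emit("fapolicyd_object_evictions", "counter", "Object cache evictions since the last metrics reset", $2) }
    '
}

# Example use with a textfile collector ($dir is a placeholder for your collector directory).
# Writing to a temporary file and renaming avoids the collector reading a half-written file:
#   fapolicyd-cli --check-metrics | metrics_to_prom > "$dir/fapolicyd.prom.tmp" &&
#       mv "$dir/fapolicyd.prom.tmp" "$dir/fapolicyd.prom"
```

Treating the access and eviction counts as counters fits values that only grow between resets; the max queue depth is a high-water mark, so a gauge seems the safer type for it.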

What should you graph first? I would start with these:
Allowed accesses
Denied accesses
Allowed by fallthrough
Inter-thread max queue depth
Subject deferred events
Subject defer fallbacks
Early subject cache evictions
Subject evictions
Object evictions
Reply errors
Kernel queue overflow

Then add attribute lookups for the fields that matter to your policy. If Object attr: ftype lookups are high, file type detection is important for your workload. If Object attr: trust lookups are high, trust database lookups are part of the cost. If subject proc attributes are hot, fork/exec-heavy workloads may be asking for process details often.

Metrics do not tell you everything. They tell you where to look. If cache evictions are high, look at cache sizing. If fallthrough is high, look at rule coverage. If queue depth is high, look at workload bursts and decision cost. If you need to know where the time is going inside decisions, that is when the timing report comes in. That is the next article.