Wednesday, May 27, 2026

Stress Testing fapolicyd-1.5

The timing report explains where fapolicyd spent time during a bounded window. That is useful only if the workload is repeatable. This article covers the new stress harness and how to use it with status, metrics, and timing.

The stress helper is not installed by make install. It is a development, QE, sizing, and regression tool. It generates high-rate fanotify decision traffic against a running fapolicyd daemon by creating process trees and running specific workloads.

It can exercise process startup tracking, subject cache collisions, object cache churn, interpreter handling, no-shebang script handling, file opens, execs, and large file reads.

The stress harness creates controlled pressure so fapolicyd's behavior can be measured.
 

Building the Harness

Build it explicitly:

./configure --enable-stress
make -j32
The binary is:
src/tests/stress/fapolicyd-stress
--enable-stress defaults to off. A normal build does not enter this directory and does not build the helper.

If the daemon is enforcing policy, remember that this locally built helper is not automatically trusted. The interpreter workloads also use scripts from the source tree. Add the helper and scripts to the file trust database before running tests against an enforcing daemon:
stress_dir="$PWD/src/tests/stress"
sudo fapolicyd-cli --file add "$stress_dir/fapolicyd-stress"
sudo fapolicyd-cli --file add "$stress_dir/scripts"
sudo fapolicyd-cli --update
If you rebuild the helper, update the trust entry because size or hash may change.

Workloads

The harness can run several workloads:

fork-exec     tight fork/exec loops
exec-open     opens configured command paths and executes them
interpreter   runs a script directly and through a shell
noshebang     exercises programmatic content without a #! line
hash          creates and reads a large file
churn         creates many small files to churn the object cache
all           runs every workload
fork-exec is the default and the best starting point for subject-cache and startup-state pressure. It repeatedly forks a child that execs a configured command. This creates a lot of short-lived process activity, which is exactly the kind of workload that can expose startup tracking and subject cache collision problems.

exec-open adds file open traffic around the exec stream. This is useful when you want both execution and object-open activity.

interpreter and noshebang are for programmatic content paths. The first runs a script directly and through a shell. The second attempts a direct exec of a file without a `#!` line and then runs it through the selected shell.

hash creates a large generated file and reads it. This is useful when integrity mode or policy makes hashing visible in the timing report.

churn creates many distinct small files and opens them in rotation. This is for object cache churn.

all is broad coverage. It is not where I would start if I were trying to understand one bottleneck.

Process Shape

The basic process controls are roots, fanout, and depth. The estimated leaf process count is:

roots * fanout ^ depth
More leaves means more concurrent process pressure. Wide process trees are the main way to create subject-cache collisions. That is what you need if you are testing startup-state behavior and subject deferral.

A representative command is:
src/tests/stress/fapolicyd-stress --workload fork-exec --roots 32 \
  --fanout 8 --depth 1 --iterations 0 --seconds 60 --timing
That creates 256 leaf processes. Each leaf runs the selected workload until the time limit expires. Note, the --timing parameter tells the stress harness to arm and stop the timer to automatically generate a timing report so that you do not need to mess with fapolicyd-cli.

Smoke Test

A short local smoke test without daemon reports looks like this:

./fapolicyd-stress --no-status --workload fork-exec --roots 2 --iterations 10
fapolicyd stress harness
workload: fork-exec
roots: 2
fanout: 1
depth: 0
estimated leaf processes: 2
iterations per leaf: 10
seconds: 0
workdir: /tmp/fapolicyd-stress.ShyQtW

Workload summary:
wall_seconds: 0.040
operations: 20
errors: 0
throughput_ops_per_sec: 495.1

That proves the helper can run. It does not tell you much about the daemon. For daemon observations, let the harness collect status and metrics before and after the workload.

Timed Pressure Test

For a timed fork/exec pressure test against a configured daemon, run:

 ./fapolicyd-stress --workload fork-exec --roots 32 --seconds 30 --timing
fapolicyd stress harness
workload: fork-exec
roots: 32
fanout: 1
depth: 0
estimated leaf processes: 32
iterations per leaf: 100
seconds: 30
workdir: /tmp/fapolicyd-stress.uxEEQg

Workload summary:
wall_seconds: 1.801
operations: 3200
errors: 0
throughput_ops_per_sec: 1776.3

Daemon status deltas:
Inter-thread max queue depth: before=4 after=95
Subject deferred events: before=0 after=0
Subject defer max depth: before=0 after=0
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=0 after=0 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=0 after=0 delta=0
Subject collisions: before=0 after=16 delta=16
Subject evictions: before=11 after=30 delta=19
Object collisions: before=70 after=74 delta=4
Object evictions: before=70 after=74 delta=4
Allowed accesses: before=3447 after=60541 delta=57094
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 56953
Max queue depth during timing: 95
Timed throughput: 31537.4 decisions/sec
Active decision rate: 32206.7 decisions/sec
Decision latency: avg=31.0 us max=3.31 ms p95_bucket=<=100us


With --timing, the harness verifies that the daemon has timing_collection=manual, starts timing, runs the workload, stops timing, and parses a short summary from the timing report. It also captures status and metrics before and after the workload when it can.

This gives three views:
harness output     what workload was generated
metrics deltas     what counters moved
timing report      where decision time went
The harness output has two different throughput ideas. throughput_ops_per_sec is the harness's local operation rate. It is not the same as daemon decision throughput. One local operation can generate zero, one, or multiple fanotify permission events. The timing report's decision count and throughput are the daemon-side numbers.

Subject Deferral Testing

For subject deferral testing, use the early eviction preset:

[run the following command: sudo src/tests/stress/fapolicyd-stress --preset early-evict --timing]

./fapolicyd-stress --preset early-evict --timing
fapolicyd stress harness
workload: fork-exec
roots: 32
fanout: 8
depth: 1
estimated leaf processes: 256
iterations per leaf: 0
seconds: 60
workdir: /tmp/fapolicyd-stress.jID2g5

Workload summary:
wall_seconds: 60.150
operations: 36666
errors: 0
throughput_ops_per_sec: 609.6

Daemon status deltas:
Inter-thread max queue depth: before=95 after=435
Subject deferred events: before=0 after=1
Subject defer max depth: before=0 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=0 after=3 delta=3
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=0 after=3 delta=3
Subject collisions: before=89 after=36881 delta=36792
Subject evictions: before=113 after=36905 delta=36792
Object collisions: before=207 after=216 delta=9
Object evictions: before=207 after=216 delta=9
Allowed accesses: before=63638 after=672670 delta=609032
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 608940
Max queue depth during timing: 435
Timed throughput: 10122.9 decisions/sec
Active decision rate: 10159.5 decisions/sec
Decision latency: avg=98.4 us max=19.9 ms p95_bucket=<=500us


There is also an ld-so-regression preset intended for comparing builds with and without subject deferral:

[run the following command: sudo src/tests/stress/fapolicyd-stress --preset ld-so-regression --timing]

./fapolicyd-stress --preset ld-so-regression --timing
fapolicyd stress harness
workload: fork-exec
roots: 32
fanout: 8
depth: 1
estimated leaf processes: 256
iterations per leaf: 0
seconds: 60
workdir: /tmp/fapolicyd-stress.On1G0Q

Workload summary:
wall_seconds: 60.145
operations: 32529
errors: 0
throughput_ops_per_sec: 540.8

Daemon status deltas:
Inter-thread max queue depth: before=435 after=453
Subject deferred events: before=1 after=2
Subject defer max depth: before=1 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=36901 after=70158 delta=33257
Subject evictions: before=36925 after=70182 delta=33257
Object collisions: before=295 after=298 delta=3
Object evictions: before=295 after=298 delta=3
Allowed accesses: before=673924 after=1215792 delta=541868
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 541778
Max queue depth during timing: 453
Timed throughput: 9007.2 decisions/sec
Active decision rate: 9041.6 decisions/sec
Decision latency: avg=111 us max=1.63 ms p95_bucket=<=500us


The strongest evidence for the early-eviction problem is:
  • Subject collisions increased
  • Early subject cache evictions increased
  • The workload was wide enough to create many concurrent process startups

After deferral is working, a healthy run should show fewer early subject cache evictions and fewer unexpected denials under the same daemon configuration. Subject deferred events and Subject defer max depth may increase. That is fine. That means events were parked instead of evicting a BUILDING subject. But if Subject defer fallbacks keeps rising, the fixed defer array filled and the daemon fell back to the older eviction path.

The early-evict preset is meant to prove whether subject deferral reduces premature BUILDING evictions.

Cache and Hash Workloads

For object cache pressure, run the churn workload and watch object misses, collisions, and evictions:

./fapolicyd-stress --workload churn --roots 8 --fanout 4 --depth 1 --seconds 60 --timing
fapolicyd stress harness
workload: churn
roots: 8
fanout: 4
depth: 1
estimated leaf processes: 32
iterations per leaf: 100
seconds: 60
workdir: /tmp/fapolicyd-stress.SqRIGt

Workload summary:
wall_seconds: 0.101
operations: 3200
errors: 0
throughput_ops_per_sec: 31531.9

Daemon status deltas:
Inter-thread max queue depth: before=453 after=453
Subject deferred events: before=2 after=2
Subject defer max depth: before=1 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=70180 after=70217 delta=37
Subject evictions: before=70207 after=70244 delta=37
Object collisions: before=724 after=743 delta=19
Object evictions: before=724 after=743 delta=19
Allowed accesses: before=1218087 after=1221409 delta=3322
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 3228
Max queue depth during timing: 32
Timed throughput: 30256.9 decisions/sec
Active decision rate: 34689.6 decisions/sec
Decision latency: avg=28.8 us max=4.32 ms p95_bucket=<=50us

Fo
r integrity cost, use the hash workload and look for hash_sha or hash_ima timing:

./fapolicyd-stress --workload hash --roots 8 --seconds 60 --timing
fapolicyd stress harness
workload: hash
roots: 8
fanout: 1
depth: 0
estimated leaf processes: 8
iterations per leaf: 100
seconds: 60
workdir: /tmp/fapolicyd-stress.1NYq01

Workload summary:
wall_seconds: 1.603
operations: 800
errors: 0
throughput_ops_per_sec: 498.9

Daemon status deltas:
Inter-thread max queue depth: before=453 after=453
Subject deferred events: before=2 after=2
Subject defer max depth: before=1 after=1
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=70230 after=70242 delta=12
Subject evictions: before=70257 after=70269 delta=12
Object collisions: before=755 after=757 delta=2
Object evictions: before=755 after=757 delta=2
Allowed accesses: before=1221624 after=1222544 delta=920
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 830
Max queue depth during timing: 8
Timed throughput: 515.9 decisions/sec
Active decision rate: 100841.2 decisions/sec
Decision latency: avg=9.92 us max=138 us p95_bucket=<=50us


For broad coverage, run all workloads:

./fapolicyd-stress --workload all --roots 8 --fanout 4 --depth 1 --seconds 60 --timing
fapolicyd stress harness
workload: all
roots: 8
fanout: 4
depth: 1
estimated leaf processes: 32
iterations per leaf: 100
seconds: 60
workdir: /tmp/fapolicyd-stress.2rexsD

Workload summary:
wall_seconds: 16.156
operations: 76800
errors: 0
throughput_ops_per_sec: 4753.7

Daemon status deltas:
Inter-thread max queue depth: before=453 after=453
Subject deferred events: before=2 after=35
Subject defer max depth: before=1 after=3
Subject defer fallbacks: before=0 after=0 delta=0
Subject defer oldest age: before=0 ns after=0 ns
Early subject cache evictions: before=3 after=3 delta=0
Subject BUILDING tracer evictions: before=0 after=0 delta=0
Subject BUILDING stale evictions: before=3 after=3 delta=0
Subject collisions: before=70253 after=115846 delta=45593
Subject evictions: before=70280 after=122273 delta=51993
Object collisions: before=1416 after=1471 delta=55
Object evictions: before=1416 after=1471 delta=55
Allowed accesses: before=1224770 after=1751558 delta=526788
Denied accesses: before=3 after=3 delta=0
Kernel queue overflows: before=0 after=0 delta=0
Reply errors: before=0 after=0 delta=0

Decision timing:
Full report: /run/fapolicyd/fapolicyd.timing
Decisions: 526698
Max queue depth during timing: 129
Timed throughput: 32591.0 decisions/sec
Active decision rate: 33450.5 decisions/sec
Decision latency: avg=29.9 us max=12.5 ms p95_bucket=<=100us


Use all after you understand the individual workload costs. If you start with everything, the output may show several things moving at once and you will not know which workload caused which effect.

Reading the Results

After a stress run, collect both reports:

fapolicyd-cli --check-status
fapolicyd-cli --check-metrics
For queue pressure, compare Inter-thread max queue depth with configured q_size, and then look at the timing report's Queueing section. If both queue depth and queue wait are high, requests are backing up before evaluation.

For subject-cache pressure, look at:

  • Subject collisions
  • Subject evictions
  • Early subject cache evictions
  • Subject deferred events
  • Subject defer max depth
  • Subject defer fallbacks
  • Subject BUILDING tracer evictions
  • Subject BUILDING stale evictions

For object-cache pressure, look at:

  • Object misses
  • Object collisions
  • Object evictions

For policy decisions, look at:

  • Allowed accesses
  • Denied accesses
  • Allowed by rule
  • Allowed by fallthrough

For daemon health, any non-zero value in these fields deserves attention:

  • Kernel queue overflow
  • Reply errors
  • Subject defer fallbacks

The stress harness does not replace real production observation. It gives you a controlled way to move specific parts of the daemon. That is useful for regression testing, sizing experiments, and proving whether a change affected the counter or timing gate you intended to move.

The pattern is:

  • Pick one workload.
  • Run a bounded test.
  • Read status for health.
  • Read metrics for counters.
  • Read timing for latency.
  • Change one variable.
  • Run it again.

That is how the stress harness fits with the rest of the reporting work. Status tells you whether the daemon stayed healthy. Metrics tell you what moved. Timing tells you where time went. The stress tool gives you a repeatable way to make those questions concrete.

Sunday, May 24, 2026

Timing fapolicyd Decisions

The previous article looked at fapolicyd metrics. Metrics tell us what happened in a counter window. This article is about timing. Timing answers a different question: where did the daemon spend time while making decisions?

This is not something that should always be enabled. fapolicyd sits in the path of file opens and execs. Calling clock_gettime() around hot-path operations on every permission event has a cost. That is why timing collection is off by default and why the new model is manual. You arm timing, run a workload, stop timing, and read the report.

Before looking at the command line, it helps to understand what a decision is. The timing report is much easier to read if you know the path an access request takes through the daemon.

Anatomy of a Decision

At a high level, a fapolicyd decision has three phases:

event_build
evaluation
response
event_build turns the raw fanotify event into the internal data fapolicyd can reason about. evaluation walks the rules and computes the policy result. response records the outcome, prepares any logging or audit data, and writes the final decision back to the kernel.

How a fapolicyd access decision moves through the daemon.


 The timing report uses names that look like this:

phase:operation:child
For example, evaluation:mime_detection:libmagic_fallback means libmagic fallback happened while the daemon was in the evaluation phase. response:trust_db_lookup:read means trust database read work was triggered while building the response.

The stages are not a strict accounting tree. Some operations are nested. Some are lazy and only happen if a rule or log format asks for the data. Child rows do not have to add up exactly to parent rows. Treat them as measured gates in the decision path, not as a perfect flame graph.

Event Build

The first phase is event_build. The kernel gives fapolicyd a fanotify event. That event has a permission file descriptor, a process id, and a mask describing the requested access. fapolicyd has to turn that into an internal event with enough subject and object identity to evaluate policy.

The main timed gates are:

event_build:cache_flush
event_build:proc_fingerprint
event_build:fd_stat
event_build:cache_flush is measured when the daemon has to flush object cache state on the decision path. Cache flushes are not the normal cost of every decision. If this shows up prominently, something caused cached object state to be invalidated during the timing window.

event_build:proc_fingerprint reads enough process information to identify the subject and detect stale cache entries. This is the first process identity gate. On fork/exec-heavy systems, this can be a visible cost because every short lived command creates more process identity work.

event_build:fd_stat stats the fanotify file descriptor to identify the object being accessed. This gives fapolicyd stable file identity data. If this is expensive, the cost is in object fingerprinting before policy evaluation even starts.

The output of event_build is not the whole answer. It is the starting point. Some subject and object attributes are still lazy. They are computed only if a rule, syslog format, debug output, or audit response needs them.

Evaluation

The second phase is evaluation. This is where fapolicyd evaluates the active policy rules against the event.

The top-level evaluation gates are:

evaluation:lock_wait
evaluation:fd_path_resolution
evaluation:proc_detail_lookup
evaluation:lock_wait is time spent waiting for the rule lock before policy evaluation can proceed. In the current single-decision-thread design, this is usually not expected to dominate. It becomes more interesting as the code moves toward more read-side concurrency. If it is high now, rule reload or other policy maintenance activity may have overlapped the timing run.

evaluation:fd_path_resolution is path resolution for the object file descriptor. Not every rule needs a path, but rules that match on path or dir can force this work. A workload that opens many distinct files can make this more visible.

evaluation:proc_detail_lookup is the on-demand process detail gate. The initial fingerprint is collected in event_build, but rules may ask for more: auid, session id, executable path, process status fields, or other procfs details. If this row is high, the policy or output format is asking for process details that were not already cached.

MIME detection is a nested set of gates:
evaluation:mime_detection:fast_classification
evaluation:mime_detection:gather_elf
evaluation:mime_detection:libmagic_fallback
fapolicyd tries cheap classification first. The fast path recognizes common file types without using full libmagic. gather_elf is the ELF and script header scan used to identify executable and interpreter-related details. libmagic_fallback is the expensive fallback when the faster checks cannot classify the object.

If evaluation:mime_detection:libmagic_fallback is high, the daemon is often leaving the fast path. That can be caused by a workload full of files that are hard to classify cheaply, or by rules that ask for ftype on many objects.

Integrity and hashing gates are:

evaluation:hash_ima:total
evaluation:hash_sha:total
hash_ima is IMA digest collection from the security.ima extended attribute. hash_sha is SHA-style file digest collection. These are workload and policy dependent. If integrity mode or FILE_HASH rules force hashing of large files, these rows can dominate a timing run. This is one reason timing has to be a bounded diagnostic tool: hashing can be attacker-controlled cost if policy asks for it on large inputs.

Trust database lookup is also split:
evaluation:trust_db_lookup:lock_wait
evaluation:trust_db_lookup:read
lock_wait is time waiting for trust database update coordination before the read can proceed. read is the LMDB read-side work and related lookup logic. If read is high, trust lookup itself is material. If lock_wait is high, the decision path is waiting behind trust database maintenance or contention.

The evaluation phase decides whether a rule has an opinion. If a rule returns allow or deny, that is the decision source. If no rule has an opinion, fapolicyd preserves historical compatibility by allowing the access through the fallthrough path. The metrics report counts that separately as Allowed by fallthrough.

Response

The third phase is response. This is the work after rule evaluation has produced an outcome. It includes decision bookkeeping, optional syslog or debug formatting, audit response metadata, and the write back to the kernel.

The response phase has some gates that look familiar:

response:mime_detection:fast_classification
response:mime_detection:gather_elf
response:mime_detection:libmagic_fallback
response:trust_db_lookup:lock_wait
response:trust_db_lookup:read
response:fanotify_write
This is the part that can be confusing at first. Why can MIME detection or trust database lookup appear in both evaluation and response?

The answer is lazy attributes. fapolicyd does not compute every possible attribute up front. That would waste time. Instead, attributes are looked up when something asks for them. During evaluation, policy rules may ask for ftype, trust, path, hash, or process details. During response, syslog, debug, or audit formatting may ask for some of the same attributes so the daemon can explain the decision.

So the same helper can be charged to different phases depending on who caused the lookup.

If evaluation:mime_detection:* is high, policy evaluation needed file type information. If response:mime_detection:* is high, reporting or audit output needed file type information after the decision had already been made. The same logic applies to trust database lookup. Evaluation-side trust lookup is policy cost. Response-side trust lookup is reporting or audit cost.

This distinction is important. If a timing report says response-side MIME detection dominates, making policy rules simpler may not help. The cost may be coming from debug output or a detailed syslog_format. If evaluation-side MIME detection dominates, then the policy is probably asking for file type data in the hot path.

response:fanotify_write is the final write of the permission decision back to the kernel. A high value here means the daemon did the policy work and then spent measurable time completing the kernel response. That is not rule cost. It is response-path cost.

Using the Timer

Now the report structure makes more sense. The timer creates a bounded window around a workload so the daemon can aggregate timing data for those gates.

Enable manual timing in fapolicyd.conf:

timing_collection=manual
After changing the configuration, reload or restart fapolicyd so the running daemon has the setting active. Then check status:

fapolicyd-cli --check-status | head

Look for:
Timing collection mode: manual
Timing collection armed: false
If the mode is still off, the daemon will ignore timing requests. This is intentional. A system should not accidentally start collecting timing data just because someone ran the CLI.

Manual timing creates a bounded measurement window around a workload.

The CLI has two names for the commands. The documentation uses --timing-start and --timing-stop. There are also --timer-start and --timer-stop aliases. I like the timer spelling because it reads naturally, so I will use that here.

To start timing, run the following command:

sudo fapolicyd-cli --timer-start

Now run something that causes fapolicyd decisions. For a tiny smoke test, this can be as boring as:

for i in $(seq 1 100); do /usr/bin/id >/dev/null; done
Then to stop timing, run the following command:

sudo fapolicyd-cli --timer-stop

The stop command prints the timing report. The daemon writes the report to:

/run/fapolicyd/fapolicyd.timing
The report is aggregate data. It does not store one record per decision. Each stage keeps a count, total time, maximum time, and latency buckets. That keeps memory bounded even if the timing window covers a large number of decisions. Here is a sample report:

Mode: manual
Timing run: 2026-05-24 14:54:07 -0400 to 2026-05-24 14:55:13 -0400
Duration: 0:01:06
Workers: 1
Max queue depth: 399
Decisions: 576,495
Throughput: 8717.3 decisions/sec (wall clock)
Active decision rate: 9596.4 decisions/sec

TL;DR:
  - Queueing pressure reached max depth 399 of 800 (49.9%), p95 wait <=100ms, max wait 160 ms.

Overall decision latency:
  avg 104 us, max 2.81 ms
  p50 bucket <=100us, p95 bucket <=500us, p99 bucket <=500us
  <=50us 1.4%, <=100us 72.4%, <=500us 100.0%, <=1ms 100.0%, >10ms 0.0%

Queueing:
  avg wait: 27.2 ms
  max wait: 160 ms
  p95 bucket: <=100ms
  total queued time: 15707 s
  max queue depth: 399

Decision phase timing:
Phase                 Calls  Calls/Dec      Total        Avg        Max   p95 bucket   Notes
event_build         576,495       1.00     3.17 s    5.50 us     516 us       <=50us   
evaluation          576,495       1.00     1.66 s    2.89 us    2.80 ms       <=50us   
response            576,495       1.00     55.0 s    95.5 us    1.44 ms      <=500us   

Lazy helper attribution:
  Helper timings are attributed to the active logical driver: evaluation or response.
  Combined totals are evaluation + response.

Lazy helper attribution by driver:
Helper                             Eval total  Response total     Combined Response %
mime_detection:total                  11.1 ms            0 ns      11.1 ms        0.0%
mime_detection:fast_classification     510 us            0 ns       510 us        0.0%
mime_detection:gather_elf              186 us            0 ns       186 us        0.0%
mime_detection:libmagic_fallback      10.4 ms            0 ns      10.4 ms        0.0%
trust_db_lookup:total                  120 ms            0 ns       120 ms        0.0%
trust_db_lookup:read                   109 ms            0 ns       109 ms        0.0%
trust_db_lookup:lock_wait             2.43 ms            0 ns      2.43 ms        0.0%

Combined lazy helper attribution:
Helper path                                       Calls  Calls/Dec      Total   Avg/call  Amort/Dec        Max   p95 bucket
mime_detection:total                                564       0.00    11.1 ms    19.6 us      19 ns    2.54 ms       <=10us
mime_detection:fast_classification                  564       0.00     510 us     903 ns       0 ns    11.3 us       <=10us
mime_detection:gather_elf                            32       0.00     186 us    5.82 us       0 ns    10.9 us       <=10us
mime_detection:libmagic_fallback                      9       0.00    10.4 ms    1.16 ms      18 ns    2.53 ms        <=5ms
trust_db_lookup:total                            66,016       0.11     120 ms    1.81 us     207 ns    2.80 ms        <=5us
trust_db_lookup:read                             66,016       0.11     109 ms    1.65 us     189 ns    2.80 ms        <=5us
trust_db_lookup:lock_wait                        66,016       0.11    2.43 ms      36 ns       4 ns    25.7 us        <=1us
hash_sha:total                                       22       0.00    10.6 ms     481 us      18 ns    2.79 ms        <=5ms
proc_detail_lookup                               99,060       0.17     1.29 s    13.0 us    2.24 us     415 us       <=50us

Derived observations:
  Queueing showed moderate bursts: max queue depth 399 of 800 (49.9%), p95 wait <=100ms, max wait 160 ms.
  libmagic fallback is the biggest MIME contributor: 1.6% of MIME calls, 94.0% of MIME time, 18 ns amortized per decision.
  hash_sha is rare but expensive: 0.0% of decisions, 481 us avg when called, 18 ns amortized per decision.

Detailed stage timing, sorted by total time:
Stage                                                 Calls      Calls/Dec      Total        Avg        Max   p95 bucket
time_in_queue:total                                 576,495           1.00    15707 s    27.2 ms     160 ms      <=100ms
decision:total                                      576,495           1.00     60.1 s     104 us    2.81 ms      <=500us
response:total                                      576,495           1.00     55.0 s    95.5 us    1.44 ms      <=500us
response:fanotify_write                             576,495           1.00     54.0 s    93.7 us    1.44 ms      <=500us
event_build:total                                   576,495           1.00     3.17 s    5.50 us     516 us       <=50us
event_build:proc_fingerprint                        576,495           1.00     1.92 s    3.32 us     250 us       <=10us
evaluation:total                                    576,495           1.00     1.66 s    2.89 us    2.80 ms       <=50us
evaluation:proc_detail_lookup                        99,060           0.17     1.29 s    13.0 us     415 us       <=50us
event_build:fd_stat                                 576,495           1.00     733 ms    1.27 us     512 us        <=5us
evaluation:trust_db_lookup:total                     66,016           0.11     120 ms    1.81 us    2.80 ms        <=5us
evaluation:trust_db_lookup:read                      66,016           0.11     109 ms    1.65 us    2.80 ms        <=5us
evaluation:lock_wait                                576,495           1.00    24.8 ms      43 ns    42.4 us        <=1us
response:audit_metadata:total                       576,495           1.00    12.5 ms      21 ns    46.9 us        <=1us
evaluation:mime_detection:total                         564           0.00    11.1 ms    19.6 us    2.54 ms       <=10us
evaluation:hash_sha:total                                22           0.00    10.6 ms     481 us    2.79 ms        <=5ms
evaluation:mime_detection:libmagic_fallback               9           0.00    10.4 ms    1.16 ms    2.53 ms        <=5ms
evaluation:fd_path_resolution                           588           0.00    6.73 ms    11.4 us     164 us       <=50us
evaluation:trust_db_lookup:lock_wait                 66,016           0.11    2.43 ms      36 ns    25.7 us        <=1us
evaluation:mime_detection:fast_classification           564           0.00     510 us     903 ns    11.3 us       <=10us
evaluation:mime_detection:gather_elf                     32           0.00     186 us    5.82 us    10.9 us       <=10us

Stage tail summary:
  time_in_queue:total: >10ms 564,343/97.9%, >25ms 235,276/40.8%, >50ms 39,149/6.8%, >100ms 2,230/0.4%

Not observed:
  event_build:cache_flush, response:mime_detection:total, response:mime_detection:fast_classification, response:mime_detection:gather_elf, response:mime_detection:libmagic_fallback, evaluation:hash_ima:total, response:trust_db_lookup:total, response:trust_db_lookup:lock_wait, response:trust_db_lookup:read, response:syslog_debug_format:total

Notes:
  Largest queued-time contributor: time_in_queue:total (15707 s)
  Largest helper contributor: proc_detail_lookup (1.29 s)
  Largest decision phase contributor: response (55.0 s)
  Slowest observed row by max: time_in_queue:total (160 ms)

 

Interpreting the Report

The full timing report can look intimidating, but the reading order is simple:

1. Run summary
2. TL;DR
3. Overall decision latency
4. Queueing
5. Decision phase timing
6. Lazy helper attribution
7. Detailed stage timing
The run summary tells you whether the run is meaningful. Decisions should be non-zero. If it is zero, the workload did not generate timed daemon decisions. Max queue depth tells you  whether requests backed up while timing was armed.

The report has two rates. Throughput is decisions per wall-clock second while timing was armed. Active decision rate is based on accumulated decision:total worker time. If wall-clock throughput is low but active decision rate is high, the workload may have been bursty or idle part of the time. If both are low, decision work is expensive.

The TL;DR section is not magic. It is a compact set of observations derived from the same data later in the report. It may point out helper dominance, response formatting cost, queue health, or that no dominant findings were observed.

Overall decision latency is the end-to-end decision worker latency after an event has been dequeued. It includes event build, policy evaluation, optional logging or audit preparation, response selection, and the fanotify response write. Queue wait is reported separately.

Before the next section, let's talk about the p95 notation. The p95 bucket is the latency bucket that contains the 95th percentile observation. In plain terms: about 95% of the measured calls completed at or below that bucket, and about 5% were slower. If the report says a stage has p95 <=500us, then nearly all calls to that stage were fast. If it says p95 >10ms, then the slow path was not just one rare outlier; it happened often enough to affect the top 5% of calls.

It is significant because averages can hide tail latency, and maximums can overreact to one unusual event. The p95 bucket sits between them. It tells you whether slowness is recurring enough to matter. 

This section has average latency, maximum latency, p95 bucket, and tail buckets. The p95 bucket is usually more useful than the maximum. A high maximum with a low p95 means rare outliers. A high p95 means the slow path is common.

Queueing tells you how long events waited in fapolicyd's internal userspace queue before the decision worker started processing them. If queue wait is high and max queue depth is close to q_size, the daemon is being fed events faster than it can answer them.

Decision phase timing is the first section I would use for diagnosis:

event_build:total
evaluation:total
response:total
If event build dominates, look at process and object identity work. If evaluation dominates, look at rule traversal and evaluation-side helpers. If response dominates, look at logging, audit metadata, response-side helpers, and the kernel write.

Lazy helper attribution explains whether expensive helper work was caused by evaluation or response. This is where the duplicate-looking MIME and trust DB rows become useful. The same helper has very different tuning implications depending on whether policy needed it or output formatting needed it.

The detailed stage table is the final view. It is sorted by total measured time. Use it to find the main cost centers:
High event_build:proc_fingerprint              -> process identity lookup cost
High event_build:fd_stat                       -> object fingerprinting cost
High evaluation:lock_wait                      -> rule lock wait
High evaluation:total                          -> rule traversal or matching cost
High evaluation:mime_detection:libmagic_fallback -> expensive file type fallback
High evaluation:hash_sha:total                 -> SHA hashing cost
High evaluation:hash_ima:total                 -> IMA digest cost
High evaluation:trust_db_lookup:read           -> trust database read cost
High response:mime_detection:*                 -> reporting needed ftype data
High response:trust_db_lookup:read             -> reporting or audit needed trust data
High response:fanotify_write                   -> kernel response write cost
Do not expect every row to appear in every report. If a stage is not observed, the workload did not trigger that measured operation during the timing window. That can be good news. For example, no libmagic fallback means fast MIME classification handled the files observed in the run.

The best way to use timing is to make one change at a time. Change the workload, a rule, an integrity setting, or output format, then run another bounded timing window. Compare the same sections. If the dominant cost moves, you learned something.

The next article will cover the stress harness. The timer tells us where the daemon spent time. The stress tool gives us repeatable workloads that can move specific counters and timing gates on purpose.

Saturday, May 23, 2026

Understanding fapolicyd-1.5 Metrics

In the last article, I went over the bigger picture of the fapolicyd 1.5 work. From it, we learned that status, metrics, and timing are now separate reports because they answer different questions.

This article is about metrics.

The important distinction is this: "fapolicyd-cli --check-status" asks whether the daemon is healthy and configured the way you expect. "fapolicyd-cli --check-metrics" asks what the daemon has been doing. Metrics are where you look for rule hits, cache behavior, default-allow decisions, queue pressure, and which rule attributes are causing work.

Start with the command:

# fapolicyd-cli --check-metrics
Last metrics reset: never
Ruleset generation: 1

Decision outcomes:
Allowed accesses: 42171
Denied accesses: 3
Allowed by rule: 42171
Allowed by fallthrough: 0

Inter-thread queue & defer activity:
Inter-thread max queue depth: 6
Subject deferred events: 0
Subject defer max depth: 0
Subject defer fallbacks: 0

Subject cache effectiveness:
Subject hits: 41632
Subject misses: 692
Subject collisions: 28
Subject evictions: 150 (0%)
Early subject cache evictions: 0
Subject BUILDING tracer evictions: 0
Subject BUILDING stale evictions: 0

Object cache effectiveness:
Object hits: 34527
Object misses: 11421
Object collisions: 3774
Object evictions: 3774 (10%)

Rule hit counts:
Hits/rule:   1      0 allow perm=any uid=0 : dir=/var/tmp/
Hits/rule:   2   7531 allow perm=any uid=0 trust=1 : all
Hits/rule:   3      0 allow perm=open exe=/usr/bin/rpm : all
Hits/rule:   4      0 allow perm=open exe=/usr/bin/python3.13 comm=dnf : all
Hits/rule:   5      0 deny_audit perm=any all : ftype=application/x-bad-elf
Hits/rule:   6   7634 allow perm=open all : ftype=application/x-sharedlib trust=1
Hits/rule:   7      0 deny_audit perm=open all : ftype=application/x-sharedlib
Hits/rule:   8    578 allow perm=execute all : trust=1
Hits/rule:   9     12 allow perm=any gid=wheel : ftype=%languages dir=/home
Hits/rule:  10      0 allow perm=any gid=wheel : ftype=%languages dir=/usr/share/git-core/templates/
Hits/rule:  11    158 allow perm=open all : ftype=%languages trust=1
Hits/rule:  12      0 deny_audit perm=any all : ftype=%languages
Hits/rule:  13     20 allow perm=any all : ftype=text/x-shellscript
Hits/rule:  14      3 deny_audit perm=execute all : all
Hits/rule:  15  26238 allow perm=open all : all

Subject attribute lookups:
Subject attr: auid requests=3 lookups=3
Subject attr: uid requests=84351 lookups=692
Subject attr: sessionid requests=0 lookups=0
Subject attr: pid requests=3 lookups=0
Subject attr: ppid requests=3 lookups=0
Subject attr: trust requests=7531 lookups=565
Subject attr: gid requests=52850 lookups=0
Subject attr: comm requests=0 lookups=0
Subject attr: exe requests=68692 lookups=1216
Subject attr: dir requests=0 lookups=0
Subject attr: ftype requests=0 lookups=0

Object attribute lookups:
Object attr: path requests=12476 lookups=9452
Object attr: dir requests=7853 lookups=2453
Object attr: device requests=0 lookups=0
Object attr: ftype requests=217466 lookups=9420
Object attr: trust requests=8376 lookups=1437
Object attr: filehash requests=0 lookups=0


The first two lines matter more than they might appear:
Last metrics reset: never
Ruleset generation: 3

If the last reset is never, then the counters have been growing since the daemon started. That is the default behavior and it is useful for a quick look at the lifetime of the daemon. If you want a smaller window, configure reset_strategy=manual in fapolicyd.conf and use "fapolicyd-cli --reset-metrics". The reset-metrics report snapshots the counters, resets them, and displays what they were at reset. It is the exact same output - however the next report generated starts fresh.

The Ruleset generation tells you which active policy the counters apply to. Rule hit counters naturally reset when new rules are successfully loaded. That is important because rule numbers and rule text can change when policy changes. You do not want rule hit counts from yesterday's policy mixed with today's policy.

Metrics are easiest to interpret when the counter window and ruleset generation are explicit.

The first real section is decision outcomes. The old way to look at fapolicyd activity was allowed versus denied. That is still useful, but it is not enough. An allow can happen because a rule matched. It can also happen because the rules had no opinion and the historical default allow behavior was used. Note: the shipped rules ends with 

deny_audit perm=execute all : all
allow perm=open all : all
The intention is to block any unknown execution and allow opens of any documents. This depends on accurate detection of any interpreted computer languages before these lines.

To better understand what kind of access decisions are made, the metrics report now separates:
Allowed accesses
Denied accesses
Allowed by rule
Allowed by fallthrough
If Allowed by fallthrough is zero, then every allow in the window came from a rule. If it is non-zero, the report prints more detail: open versus execute, trusted versus untrusted or unknown trust, and broad file type buckets such as executable, programmatic, shared library, unknown ftype, and other ftype.

The default allow indicates that policy is missing a decision rule. It is not the same as a policy rule intentionally approving the access. If a system has a large number of fallthrough execute decisions, you would want to know why. Maybe that is expected for a permissive policy. Maybe it shows a missing terminal deny. The point is that it is now visible.

Rule hit counts are next. They look like this:
Hits/rule:   1      42 allow perm=execute all : trust=1
Hits/rule:   2       3 deny_audit perm=execute all : trust=0
The exact rules will depend on your policy. A rule hit is counted when the rule actually makes the allow or deny decision. Merely iterating past a rule is not a hit. This makes the table useful for answering practical questions. Which rules are carrying the load? Which rules never fire? Did the rule I just added actually match the program I was testing?

Next come queue and defer activity. The most important queue metric is:
Inter-thread max queue depth
This is the high-water mark for fapolicyd's internal event queue in the current metrics window. It is not the kernel fanotify queue. It is the userspace queue between event intake and decision processing. If this number grows near your configured q_size, the daemon is not keeping up with the event stream.

Subject deferral is new in this release. The daemon can defer an incoming event when processing it would evict another process that is still building startup pattern state in the same subject cache slot. In a normal busy system, some deferred events may be fine. What I would watch closely is:
Subject deferred events
Subject defer max depth
Subject defer fallbacks
Subject defer fallbacks means the fixed defer array filled and the daemon had to use the historical eviction behavior. If this keeps climbing during normal workloads, look at subject cache sizing and the workload shape. A very wide fork/exec storm can create lots of subject cache collisions.

The cache sections are next. fapolicyd has subject and object caches because recomputing process and file attributes on every event would be expensive. The basic shape is:
Subject hits
Subject misses
Subject collisions
Subject evictions

Object hits
Object misses
Object collisions
Object evictions
Hits are good. Misses mean fapolicyd had to populate information - which can be costly. Collisions mean a populated cache slot did not match the current process or file identity. Evictions mean something was removed to make room. A small number of evictions is normal. A high eviction rate compared to hits suggests that the cache may be too small for the workload.

The subject cache also reports early subject cache evictions, tracer evictions, and stale BUILDING evictions. These are more health-oriented. An early eviction means a subject was evicted before startup state was complete. A tracer eviction means the occupant was being traced and could hold the slot indefinitely. A stale eviction means startup state stayed incomplete past the bounded stale window.

The last section is easy to skip, but it is one of the most useful additions: attribute lookup metrics. They look like this:
Subject attr: auid requests=1000 lookups=12
Object attr: ftype requests=1000 lookups=200

requests means policy evaluation or syslog formatting asked for the attribute. lookups means the attribute was not already present in the event cache and fapolicyd had to compute or fetch it.

That distinction matters. If requests are high but lookups are low, the cache is doing its job. If lookups are high, then that attribute is causing real work. This is also a way to see whether logging is making the daemon do extra lookups. For example, a very detailed syslog_format can request attributes that the policy did not need for the decision.

The metrics report groups activity by decision outcome, queue/defer pressure, cache behavior, and attribute lookup cost.


Once you have metrics, the next question is how to graph them. fapolicyd does not need to know about your monitoring stack. The report is simple name: value text. That means you can parse the fields you care about and ship them to whatever you already use.

If you use StatsD, you can turn selected fields into gauge updates. This is deliberately small and crude, but it shows the idea:

fapolicyd-cli --check-metrics |
awk -F': '
'/^(Allowed accesses|Denied accesses|Allowed by rule|Allowed by fallthrough|Inter-thread max queue depth|Subject deferred events|Subject defer fallbacks|Early subject cache evictions|Object evictions):/ {
    name=$1
    gsub(/ /, "_", name)
    printf "fapolicyd.%s:%s|g\n", tolower(name), $2
}' | nc -u -w0 127.0.0.1 8125 



If you use Prometheus and Grafana, remember that Grafana is the visualization layer. Something else has to collect the numbers. Prometheus supports a simple text exposition format. You could write a tiny exporter or generate a node-exporter textfile collector file from selected fapolicyd metrics.

For Prometheus, be careful with labels. It is tempting to turn every rule hit line into a metric with the whole rule text as a label. That can create high cardinality and make your monitoring system unhappy. I would start with stable low-cardinality metrics: allow/deny counts, fallthrough counts, queue depth, cache hits and evictions, subject deferral, kernel overflow, and reply errors. Rule hits are better for troubleshooting reports unless you have a controlled set of rule labels.

Here is a minimal Prometheus-style example for a few fields:

fapolicyd-cli --check-metrics |
awk -F': '
'/^Allowed accesses:/ { print "fapolicyd_allowed_accesses " $2 }
/^Denied accesses:/ { print "fapolicyd_denied_accesses " $2 }
/^Allowed by fallthrough:/ { print "fapolicyd_allowed_by_fallthrough " $2 }
/^Inter-thread max queue depth:/ { print "fapolicyd_inter_thread_max_queue_depth " $2 }
/^Subject defer fallbacks:/ { print "fapolicyd_subject_defer_fallbacks " $2 }
/^Early subject cache evictions:/ { print "fapolicyd_early_subject_cache_evictions " $2 }
'


This is not meant to be a finished exporter. It is a starting point. A real exporter should sanitize names, handle percentages, keep type/help metadata, and decide whether a field is a counter or gauge.

What should you graph first? I would start with these:
Allowed accesses
Denied accesses
Allowed by fallthrough
Inter-thread max queue depth
Subject deferred events
Subject defer fallbacks
Early subject cache evictions
Subject evictions
Object evictions
Reply errors
Kernel queue overflow
Then add attribute lookups for the fields that matter to your policy. If Object attr: ftype lookups are high, file type detection is important for your workload. If Object attr: trust lookups are high, trust database lookups are part of the cost. If subject proc attributes are hot, fork/exec-heavy
workloads may be asking for process details often.

Metrics do not tell you everything. They tell you where to look. If cache evictions are high, look at cache sizing. If fallthrough is high, look at rule coverage. If queue depth is high, look at workload bursts and decision cost. If you need to know where the time is going inside decisions, that is when the timing report comes in. That is the next article.

Friday, May 22, 2026

fapolicyd 1.5: Correctness First, Better Reports Second

   

The fapolicyd 1.5 release had a lot of internal work that is not very noticable. This release started with a simple objective:

  • No silent visibility loss.
  • No silent policy replacement failure.
  • No hidden default-allow behavior.

fapolicyd is an application allowlisting daemon. It receives file access events from the kernel through fanotify, evaluates policy, and replies to the kernel. When something goes wrong in that path, the daemon should not quietly lose sight of events, quietly replace policy with a bad ruleset, or quietly allow executable content because policy had no opinion.

That is the first narrative of this release: fail-safe and correctness work. The second narrative is reporting. Once the important failure modes are made explicit, they need to show up in the right place. That led to separating status, metrics, and timing into different reports.

 

fapolicyd 1.5 makes important failure modes explicit and then reports them in the right place


Fail-safe and correctness work

One of the most important changes is the transactional rule reload. Previously, a reload could destroy the active rules before the replacement ruleset had fully loaded. If parsing failed after the old policy was gone, the daemon could be left with partial or empty policy. Since fapolicyd historically allows access when no rule has an opinion, this could become a fail-open condition.

The new approach is to build and validate the candidate ruleset separately. It owns the rules, attribute sets, syslog fields, proc-status mask, rule count, and rule identity. Only after the candidate is complete does it replace the active ruleset. A failed reload preserves the previous published policy.

That also means reload failures become useful operational data. The daemon can count rule reload failures and report them as failure-action metrics. Today those counters are observe-mode diagnostics. Later, high-security deployments can use the same named failure classes to decide whether a stronger action is needed.

The parser side was tightened for the same reason. If a subject or object assignment fails while parsing a rule, that must be a parse failure. It cannot be ignored. A malformed multi-attribute rule should not accidentally become a broader rule because one invalid attribute was skipped while another valid attribute remained.

The fanotify path also got attention. FAN_Q_OVERFLOW is the kernel telling us that fanotify events were lost. This is not the same as a denied access. It is visibility loss. If the kernel queue overflows and the daemon does not make that visible, an administrator has no way to know that enforcement degraded.

So fapolicyd now detects kernel queue overflow, counts it, logs it with rate limiting, and exposes it through reporting. This counter is not just for this release. It is groundwork for later failure policy. If a future configuration wants to treat kernel event loss as a high-severity condition, the daemon has a specific event class to build on.

The daemon also now counts failed or short fanotify response writes. Every permission event needs exactly one response back to the kernel. If the response write fails, that is a correctness issue. Counting it gives us another named failure class that can be monitored now and acted on more strongly later.

The no-opinion allow path was another area that needed guardrails. Fapolicyd does not try to deny every ordinary document open. Compatibility requires a default allow when policy has no opinion. But executable and programmatic content should not accidentally fall through because of incomplete rules, parser mistakes, missing file type classification, or internal errors.

The release adds counters that distinguish "Allowed by rule" from "Allowed by fallthrough". That keeps the compatibility behavior, but it stops hiding it. If a system has default-allow execute activity, the administrator can now see it and investigate.

This is also where the new rule linter fits.


$ fapolicyd-cli --check-rules --lint fapolicyd/src/tests/fixtures/rules-valid.rules
file is valid (3 rules)
Policy lint warning: executable events can fall through; no terminal broad execute deny found
Policy lint warning: %languages is not defined; programmatic ftype coverage
cannot be checked
$ fapolicyd-cli --check-rules --lint --verbose Rules file is valid (15 rules) Policy lint found no warnings

The linter is intentionally modest. It is not a formal proof of the policy. It looks for policy shapes that are easy to get wrong: executable events that can fall through because there is no broad terminal execute deny, missing "%languages" coverage for programmatic opens, and broad open allows that can shadow programmatic-content denies.

That matters because fapolicyd policies are ordered. A rule that looks harmless in isolation can change meaning when it appears before another rule. The linter gives administrators a cheap check before loading policy. It also ties back to the default-allow metrics: if the linter warns about a default-allow gap, the metrics report can show whether the running system is actually using that gap.

If you use rules.d, fagenrules now validates the compiled rules file before installing it. That closes a practical hole. The assembled policy should be checked before it is written as the policy the daemon will later load.

Subject deferral

The subject cache work deserves a closer look because it fixes a subtle correctness problem.

fapolicyd tracks process startup state. During exec, it may see the executable, the runtime linker, libraries, interpreter activity, and other file opens that belong to the same subject. While that sequence is still being understood, the subject is considered to be in a BUILDING state.

The subject cache is indexed by pid % slots. More than one process can hash to the same slot over time. Under heavy fork/exec pressure, a new process can arrive at a slot that is already occupied by a different process that is still BUILDING. If fapolicyd evicts the BUILDING subject too early, it loses the startup context for the process already underway. When that original process comes back with another access, fapolicyd may see a later part of startup without the earlier context. That can lead to false pattern detection decisions such as the "ld_so" detection.

One idea is to collect ejected slots and try to keep them around until they complete. That sounds easy, but it is a losing proposition.

New processes keep coming and the runtime linker will pour many new accesses into the system as it assembles the app. There is no guarantee that the process we ejected will ever make another access. It might exit. It might segfault. It might be stopped by a tracer. If it does come back, there is no efficient way to find the saved subject once it is ejected to a new storage. That turns the storage into a second, unbounded, hard-to-reason-about subject store.

The design in 1.5 takes the opposite approach. Do not eject the old BUILDING subject just because a new subject arrived. Store the new access request in a bounded deferral array. This puts back pressure on the new access while the subject already underway finishes and gets out of the way. If a subject is deferred, the runtime linker does not add more accesses allowing other processes that are actionable to proceed.

 

Subject deferral sequence of events.


The array is fixed size. That is important. A fanotify permission event can hold a task waiting for a decision, and it owns a file descriptor until the daemon replies or closes it. Deferral is acceptable because it is bounded and observable. If the array fills, fapolicyd falls back to the historical eviction behavior and counts that fallback.

There are two corner cases: traced subjects and stale subjects.

If a process is being traced, it may be stopped indefinitely. We do not know whether it will continue. If a traced process holds a BUILDING slot and another request needs that slot, waiting forever would be a denial-of-service bug. A malicious actor could try to fill slots with stopped BUILDING subjects and deadlock the system. The daemon now detects traced BUILDING subjects and can eject them when their slot is needed.

There is also a timeout. Even without tracing, a BUILDING subject that has not made progress can become stale. As a last resort, stale BUILDING slots are garbage collected so the system does not wait forever on a subject that may never complete startup.

The relevant counters are:

  • Subject deferred events
  • Subject defer max depth
  • Subject defer fallbacks
  • Early subject cache evictions
  • Subject BUILDING tracer evictions
  • Subject BUILDING stale evictions
  • Subject defer oldest age

These counters tell us whether the accesses are flowing. A busy fork/exec workload may increase "Subject deferred events". That is expected. "Subject defer fallbacks" should not steadily climb during normal operation. Tracer and stale evictions should be rare and worth looking at.

The stress and timing article later in this series will show how to create subject-cache pressure deliberately and how to read these counters during a test.

Reporting philosophy

The second narrative is reporting. Before this work, the status report had become a mixed report. It included health, configuration, counters, cache statistics, and other runtime details. That makes it harder to know what a field means. Is it a current state value? A lifetime counter? A counter that should be reset? A health indicator?

The reports now have cleaner jobs.

"fapolicyd-cli --check-status" asks whether the daemon is healthy and configured as expected.


# fapolicyd-cli --check-status
Operating mode:
Permissive: false
Integrity: sha256
reset_strategy: manual
Timing collection mode: manual
Timing collection armed: false
Timing collection last start time: never
Timing collection last stop time: never
Ruleset generation: 1
Headline activity: Allowed accesses: 8762 Denied accesses: 0
Resource configuration: CPU cores: 32 q_size: 800 Subject defer array size: 256 Subject cache size: 4099 Object cache size: 16381 Trust database max pages: 18944
Resource utilization: Trust database pages in use: 14284 (75%) Subject slots in use: 207 (5%) Object slots in use: 3334 (20%) glibc arena (total memory) is: 25560 KiB, was: 1188 KiB glibc uordblks (in use memory) is: 8348 KiB, was: 960 KiB glibc fordblks (total free space) is: 17211 KiB, was: 227 KiB
Health indicators: Kernel queue overflow: 0 Filesystem errors: 0 Filesystem error last status: none Filesystem error last seen: never Reply errors: 0 Subject defer fallbacks: 0 Early subject cache evictions: 0 Subject BUILDING tracer evictions: 0 Subject BUILDING stale evictions: 0 Subject defer oldest age: 0ns Failure action queue_full (observe): 0 Failure action kernel_queue_overflow (observe): 0 Failure action worker_stall (observe): 0 Failure action rule_reload_failure (observe): 0 Failure action trust_reload_failure (observe): 0 Failure action response_write_failure (observe): 0 Failure action fanotify_filesystem_error (observe): 0
Watched mounts: watching mount: / watching mount: /dev/shm watching mount: /run watching mount: /run/credentials/systemd-journald.service watching mount: /tmp ...

Status includes operating mode, integrity mode, reset_strategy, timing collection state, ruleset generation, headline allow and deny activity, resource configuration, resource utilization, health indicators, failure action counters, and watched mounts.

The important thing is that status is where you look for current condition. Is the daemon permissive? What ruleset generation is loaded? Are there kernel queue overflows? Are reply errors non-zero? Is there an old deferred subject? Are the watched mounts what you expect?

"fapolicyd-cli --check-metrics" asks what happened in the current counter window. Metrics includes decision outcome counters, Allowed by rule, Allowed by fallthrough, queue and deferral activity, subject and object cache effectiveness, rule hit counts, and subject/object attribute lookup counters.

Some information appears in both reports on purpose. Allowed and denied accesses are useful headline activity in status, but they are also counters in metrics. Early subject evictions are health signals in status, but they also describe the current metrics window. The difference is the question being asked. Status asks whether something needs attention. Metrics asks what moved during the measured interval.

The old cache and counter information did not disappear. It moved to the metrics report where it belongs. If you previously looked in status for cache hits, misses, evictions, rule hits, or detailed runtime counters, look in "--check-metrics".

 Metrics can also be reset in controlled ways. The default reset_strategy is never, which preserves lifetime counters. manual allows a privileged "fapolicyd-cli --reset-metrics" request to snapshot and reset runtime counters.  auto is for interval reports that should each describe only the activity since the previous interval.

Timing is separate from both status and metrics. Timing collection is disabled by default because it adds measurement work to the decision path. When enabled with "timing_collection=manual", an administrator can start and stop a bounded timing window:

fapolicyd-cli --timer-start

fapolicyd-cli --timer-stop

The daemon then writes to "/run/fapolicyd/fapolicyd.timing". This report answers where time went during that specific run: queue wait, event build, rule evaluation, MIME detection, hashing, trust database lookup, logging, audit metadata, and the fanotify response write.

I will cover metrics in the next article and timing plus the stress harness in the one after that. Those deserve separate treatment. This article is only the overview: the release first made important correctness failures explicit, then gave administrators better places to look for the evidence.

The upstream project is here: https://github.com/linux-application-whitelisting/fapolicyd.

If you are new to fapolicyd itself, the Red Hat documentation on blocking and allowing applications with fapolicyd is a useful starting point.



Wednesday, February 4, 2026

Introducing cap-audit

Applications often run as root because figuring out which Linux capabilities they actually need is difficult. You might know your web server needs to bind to port 80, which requires CAP_NET_BIND_SERVICE. But what about that database daemon? Or that monitoring agent? Trial and error gets old fast, and running everything as root is a big risk.

The libcap-ng project now includes cap-audit, a tool that traces applications to determine exactly which capabilities they require. Unlike static analysis tools that guess based on which syscalls appear in the binary, cap-audit hooks into the kernel's actual capability checking functions. When the kernel asks "does this process have CAP_NET_RAW?", cap-audit records it. This is ground truth - not guesswork.

How It Works

Cap-audit uses eBPF to hook the kernel's capability checking functions - cap_capable(), ns_capable(), and their variants. These are the functions the kernel calls every time it needs to verify a capability. When you trace an application, cap-audit forks your target program, registers its PID with the eBPF program, and then watches for kernel events.

The eBPF program filters events by PID right at the kprobe entry point. This is critical. Without filtering, it would capture thousands of capability checks per second from every process on the system. By filtering for just the target application and its children, overhead drops to less than 1% for the traced app and effectively zero for everything else.

Each capability check generates an event that includes which capability was checked, whether it was granted or denied, which syscall triggered it, and a user-space stack trace. These events stream through a ring buffer to the userspace program, which aggregates them and generates a report showing exactly what your application needs.

 


Why System Context Matters

Cap-audit doesn't just tell you which capabilities were checked - it reads various sysctls to understand when those capabilities are actually required. For example, consider /proc/sys/kernel/yama/ptrace_scope. If it's set to 0, any process can ptrace any other process it could normally signal. But if it's 1 or higher, you need CAP_SYS_PTRACE. Same binary, different capability requirements depending on system configuration.

The tool gathers kernel.perf_event_paranoid, kernel.unprivileged_bpf_disabled, kernel.kptr_restrict, and several others. These aren't just informational - they directly affect which capabilities your application needs to run. A monitoring tool that reads /proc/kallsyms needs CAP_SYSLOG when kernel.kptr_restrict is 1, but not when it's 0. Cap-audit shows you both the capabilities that were actually checked and the system settings that made them necessary. 

This means the capability requirements cap-audit reports are specific to your kernel configuration. If you're deploying to containers or hardened systems with different sysctl values, you might need different capabilities. The report includes the system context so you can make informed decisions.

You Must Exercise All Code Paths 

Here's something important to understand: cap-audit traces what your application actually does, not what it could theoretically do. If your application can set the system clock which requires CAP_SYS_TIME, but you only trace it handling normal requests, you won't see that capability requirement.

Think of it like code coverage in testing. If you don't exercise a code path during tracing, its capability requirements won't appear in the report. For daemons, this means you need to trigger all administrative operations, error handling paths, and edge cases. For CLI tools, you need to use all major features and options.

Let's see it in action:

DEMO: Tracing a daemon
======================================================================
CAPABILITY ANALYSIS FOR: /usr/sbin/irqbalance (PID 8189)
======================================================================

SYSTEM CONTEXT:
----------------------------------------------------------------------
Kernel version: 6.18.7-100.fc42.x86_64
kernel.yama.ptrace_scope: 1
kernel.kptr_restrict: 1
kernel.dmesg_restrict: 1
kernel.modules_disabled: 0
kernel.perf_event_paranoid: 2
kernel.unprivileged_bpf_disabled: 2
net.core.bpf_jit_enable: 1
net.core.bpf_jit_harden: 1
net.core.bpf_jit_kallsyms: 1
vm.mmap_min_addr: 65536
fs.protected_hardlinks: 1
fs.protected_symlinks: 1
fs.suid_dumpable: 2

REQUIRED CAPABILITIES:
----------------------------------------------------------------------
setpcap (#8)
Checks: 43 granted, 0 denied
Reason: Used by prctl (syscall 157)

sys_admin (#21)
Checks: 34 granted, 0 denied
Reason: Used by clone (syscall 56)

CONDITIONAL CAPABILITIES:
----------------------------------------------------------------------
None

ATTEMPTED BUT DENIED:
----------------------------------------------------------------------
None

SUMMARY:
----------------------------------------------------------------------
Total capability checks: 77
Required capabilities: 2
Conditional capabilities: 0
Denied operations: 0

RECOMMENDATIONS:
----------------------------------------------------------------------
Programmatic solution (C with libcap-ng):
#include <cap-ng.h>
...
capng_clear(CAPNG_SELECT_BOTH);
capng_updatev(CAPNG_ADD, CAPNG_EFFECTIVE|CAPNG_PERMITTED, SETPCAP, SYS_ADMIN, -1);
if (capng_change_id(uid, gid, CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING))
perror("capng_change_id");

For systemd service:
[Service]
User=<non-root-user>
Group=<non-root-group>
AmbientCapabilities=setpcap sys_admin
CapabilityBoundingSet=setpcap sys_admin

For file capabilities (via filecap):
filecap /path/to/binary setpcap sys_admin

For Docker/Podman:
docker run --user $(id -u):$(id -g) \
--cap-drop=ALL \
--cap-add=setpcap \
--cap-add=sys_admin \
your-image:tag

For Kubernetes:
securityContext:
runAsUser: 1000
runAsGroup: 1000
capabilities:
drop:
- ALL
add:
- setpcap
- sys_admin

For applications that use file-based capabilities (like /usr/bin/arping with cap_net_raw+ep), cap-audit sees when those capabilities are actually exercised. A binary might have five capabilities set on the file, but only uses three during normal operation. Cap-audit shows you what's actually needed.

DEMO: File capabilities
======================================================================
CAPABILITY ANALYSIS FOR: /usr/bin/arping (PID 7940)
======================================================================

SYSTEM CONTEXT:
----------------------------------------------------------------------
Kernel version: 6.18.7-100.fc42.x86_64
kernel.yama.ptrace_scope: 1
kernel.kptr_restrict: 1
kernel.dmesg_restrict: 1
kernel.modules_disabled: 0
kernel.perf_event_paranoid: 2
kernel.unprivileged_bpf_disabled: 2
net.core.bpf_jit_enable: 1
net.core.bpf_jit_harden: 1
net.core.bpf_jit_kallsyms: 1
vm.mmap_min_addr: 65536
fs.protected_hardlinks: 1
fs.protected_symlinks: 1
fs.suid_dumpable: 2

REQUIRED CAPABILITIES:
----------------------------------------------------------------------
setpcap (#8)
Checks: 1 granted, 5 denied
Reason: Used by capset (syscall 126)

net_raw (#13)
Checks: 1 granted, 0 denied
Reason: Used by socket (syscall 41)

CONDITIONAL CAPABILITIES:
----------------------------------------------------------------------
None

ATTEMPTED BUT DENIED:
----------------------------------------------------------------------
setuid (#7)
Attempts: 1 (all denied)
Impact: Application may have reduced functionality

SUMMARY:
----------------------------------------------------------------------
Total capability checks: 8
Required capabilities: 2
Conditional capabilities: 0
Denied operations: 1

RECOMMENDATIONS:
----------------------------------------------------------------------
Programmatic solution (C with libcap-ng):
#include <cap-ng.h>
...
capng_clear(CAPNG_SELECT_BOTH);
capng_updatev(CAPNG_ADD, CAPNG_EFFECTIVE|CAPNG_PERMITTED, SETPCAP, NET_RAW, -1);
if (capng_change_id(uid, gid, CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING))
perror("capng_change_id");

For systemd service:
[Service]
User=<non-root-user>
Group=<non-root-group>
AmbientCapabilities=setpcap net_raw
CapabilityBoundingSet=setpcap net_raw

For file capabilities (via filecap):
filecap /path/to/binary setpcap net_raw

For Docker/Podman:
docker run --user $(id -u):$(id -g) \
--cap-drop=ALL \
--cap-add=setpcap \
--cap-add=net_raw \
your-image:tag

For Kubernetes:
securityContext:
runAsUser: 1000
runAsGroup: 1000
capabilities:
drop:
- ALL
add:
- setpcap
- net_raw

Reading the Report

The report breaks down into several sections. Required Capabilities shows capabilities that were successfully checked - these are capabilities your application actively used and will need to function. Each entry includes how many times it was checked and which syscall triggered it. "CAP_NET_BIND_SERVICE: Used by bind (syscall 49)" tells you exactly what's going on.

Conditional Capabilities shows requirements that depend on system configuration. You'll see entries like "CAP_SYS_PTRACE: Needed when kernel.yama.ptrace_scope > 0, Current value: 1 (capability needed)". This tells you the capability is required on your current system, but might not be on systems with different sysctl values.

Attempted But Denied shows capability checks that failed. These are interesting because they reveal functionality your application tried to use but couldn't. Sometimes this is fine - the application has a fallback path. Other times, it indicates reduced functionality. The report notes "Application may have reduced functionality" so you can investigate.

The recommendations section generates ready-to-use configurations for systemd, Docker, Kubernetes, and file-based capabilities. For systemd, you get AmbientCapabilities and CapabilityBoundingSet directives. For Docker, you get --cap-drop=ALL followed by specific --cap-add entries. For file capabilities, you get the filecap command with the exact capability set. These aren't just suggestions - they're the minimal set your application demonstrated it needs. It also produces a snipit of C or python code to show how to programmatically solve this in the program.

Ground Truth, Not Guesswork

The key insight is that cap-audit hooks the actual kernel capability checking functions. When mount() checks CAP_SYS_ADMIN, cap-audit sees it. When bind() checks CAP_NET_BIND_SERVICE, cap-audit sees it. There's no parsing of source code, no heuristics based on syscall names, no guessing. The kernel's security subsystem itself is telling you what capabilities are being checked.

This is why the tool requires CAP_BPF and CAP_PERFMON to run - it's instrumenting kernel internals. But once set up, it gives you authoritative answers about capability requirements. If cap-audit says your application needs three capabilities, those are the three it checked during your trace. If it says your application doesn't need elevated capabilities at all, you can confidently run it as an unprivileged user.

I should note that both of the example traces above show programs that are not doing capabilities right. There is a '-v' command line option for verbose output. It shows that even though arping is file system based capabilities, it is still calling capset when it shouldn't. Irqbalance, constantly tries setting capabilities when it should do it once and done. We'll dive into these in a future blog.

Run your applications through cap-audit during development or security audits. Exercise all functionality, check the system context, and use the generated configurations to properly scope your capabilities. File issues on github if you have a request or find something is off.