Sunday, May 24, 2026

Timing fapolicyd Decisions

The previous article looked at fapolicyd metrics. Metrics tell us what happened in a counter window. This article is about timing. Timing answers a different question: where did the daemon spend time while making decisions?

This is not something that should always be enabled. fapolicyd sits in the path of file opens and execs. Calling clock_gettime() around hot-path operations on every permission event has a cost. That is why timing collection is off by default and why the new model is manual. You arm timing, run a workload, stop timing, and read the report.

Before looking at the command line, it helps to understand what a decision is. The timing report is much easier to read if you know the path an access request takes through the daemon.

Anatomy of a Decision

At a high level, a fapolicyd decision has three phases:

event_build
evaluation
response
event_build turns the raw fanotify event into the internal data fapolicyd can reason about. evaluation walks the rules and computes the policy result. response records the outcome, prepares any logging or audit data, and writes the final decision back to the kernel.

How a fapolicyd access decision moves through the daemon.


 The timing report uses names that look like this:

phase:operation:child
For example, evaluation:mime_detection:libmagic_fallback means libmagic fallback happened while the daemon was in the evaluation phase. response:trust_db_lookup:read means trust database read work was triggered while building the response.

The stages are not a strict accounting tree. Some operations are nested. Some are lazy and only happen if a rule or log format asks for the data. Child rows do not have to add up exactly to parent rows. Treat them as measured gates in the decision path, not as a perfect flame graph.

Event Build

The first phase is event_build. The kernel gives fapolicyd a fanotify event. That event has a permission file descriptor, a process id, and a mask describing the requested access. fapolicyd has to turn that into an internal event with enough subject and object identity to evaluate policy.

The main timed gates are:

event_build:cache_flush
event_build:proc_fingerprint
event_build:fd_stat
event_build:cache_flush is measured when the daemon has to flush object cache state on the decision path. Cache flushes are not the normal cost of every decision. If this shows up prominently, something caused cached object state to be invalidated during the timing window.

event_build:proc_fingerprint reads enough process information to identify the subject and detect stale cache entries. This is the first process identity gate. On fork/exec-heavy systems, this can be a visible cost because every short lived command creates more process identity work.

event_build:fd_stat stats the fanotify file descriptor to identify the object being accessed. This gives fapolicyd stable file identity data. If this is expensive, the cost is in object fingerprinting before policy evaluation even starts.

The output of event_build is not the whole answer. It is the starting point. Some subject and object attributes are still lazy. They are computed only if a rule, syslog format, debug output, or audit response needs them.

Evaluation

The second phase is evaluation. This is where fapolicyd evaluates the active policy rules against the event.

The top-level evaluation gates are:

evaluation:lock_wait
evaluation:fd_path_resolution
evaluation:proc_detail_lookup
evaluation:lock_wait is time spent waiting for the rule lock before policy evaluation can proceed. In the current single-decision-thread design, this is usually not expected to dominate. It becomes more interesting as the code moves toward more read-side concurrency. If it is high now, rule reload or other policy maintenance activity may have overlapped the timing run.

evaluation:fd_path_resolution is path resolution for the object file descriptor. Not every rule needs a path, but rules that match on path or dir can force this work. A workload that opens many distinct files can make this more visible.

evaluation:proc_detail_lookup is the on-demand process detail gate. The initial fingerprint is collected in event_build, but rules may ask for more: auid, session id, executable path, process status fields, or other procfs details. If this row is high, the policy or output format is asking for process details that were not already cached.

MIME detection is a nested set of gates:
evaluation:mime_detection:fast_classification
evaluation:mime_detection:gather_elf
evaluation:mime_detection:libmagic_fallback
fapolicyd tries cheap classification first. The fast path recognizes common file types without using full libmagic. gather_elf is the ELF and script header scan used to identify executable and interpreter-related details. libmagic_fallback is the expensive fallback when the faster checks cannot classify the object.

If evaluation:mime_detection:libmagic_fallback is high, the daemon is often leaving the fast path. That can be caused by a workload full of files that are hard to classify cheaply, or by rules that ask for ftype on many objects.

Integrity and hashing gates are:

evaluation:hash_ima:total
evaluation:hash_sha:total
hash_ima is IMA digest collection from the security.ima extended attribute. hash_sha is SHA-style file digest collection. These are workload and policy dependent. If integrity mode or FILE_HASH rules force hashing of large files, these rows can dominate a timing run. This is one reason timing has to be a bounded diagnostic tool: hashing can be attacker-controlled cost if policy asks for it on large inputs.

Trust database lookup is also split:
evaluation:trust_db_lookup:lock_wait
evaluation:trust_db_lookup:read
lock_wait is time waiting for trust database update coordination before the read can proceed. read is the LMDB read-side work and related lookup logic. If read is high, trust lookup itself is material. If lock_wait is high, the decision path is waiting behind trust database maintenance or contention.

The evaluation phase decides whether a rule has an opinion. If a rule returns allow or deny, that is the decision source. If no rule has an opinion, fapolicyd preserves historical compatibility by allowing the access through the fallthrough path. The metrics report counts that separately as Allowed by fallthrough.

Response

The third phase is response. This is the work after rule evaluation has produced an outcome. It includes decision bookkeeping, optional syslog or debug formatting, audit response metadata, and the write back to the kernel.

The response phase has some gates that look familiar:

response:mime_detection:fast_classification
response:mime_detection:gather_elf
response:mime_detection:libmagic_fallback
response:trust_db_lookup:lock_wait
response:trust_db_lookup:read
response:fanotify_write
This is the part that can be confusing at first. Why can MIME detection or trust database lookup appear in both evaluation and response?

The answer is lazy attributes. fapolicyd does not compute every possible attribute up front. That would waste time. Instead, attributes are looked up when something asks for them. During evaluation, policy rules may ask for ftype, trust, path, hash, or process details. During response, syslog, debug, or audit formatting may ask for some of the same attributes so the daemon can explain the decision.

So the same helper can be charged to different phases depending on who caused the lookup.

If evaluation:mime_detection:* is high, policy evaluation needed file type information. If response:mime_detection:* is high, reporting or audit output needed file type information after the decision had already been made. The same logic applies to trust database lookup. Evaluation-side trust lookup is policy cost. Response-side trust lookup is reporting or audit cost.

This distinction is important. If a timing report says response-side MIME detection dominates, making policy rules simpler may not help. The cost may be coming from debug output or a detailed syslog_format. If evaluation-side MIME detection dominates, then the policy is probably asking for file type data in the hot path.

response:fanotify_write is the final write of the permission decision back to the kernel. A high value here means the daemon did the policy work and then spent measurable time completing the kernel response. That is not rule cost. It is response-path cost.

Using the Timer

Now the report structure makes more sense. The timer creates a bounded window around a workload so the daemon can aggregate timing data for those gates.

Enable manual timing in fapolicyd.conf:

timing_collection=manual
After changing the configuration, reload or restart fapolicyd so the running daemon has the setting active. Then check status:

fapolicyd-cli --check-status | head

Look for:
Timing collection mode: manual
Timing collection armed: false
If the mode is still off, the daemon will ignore timing requests. This is intentional. A system should not accidentally start collecting timing data just because someone ran the CLI.

Manual timing creates a bounded measurement window around a workload.

The CLI has two names for the commands. The documentation uses --timing-start and --timing-stop. There are also --timer-start and --timer-stop aliases. I like the timer spelling because it reads naturally, so I will use that here.

To start timing, run the following command:

sudo fapolicyd-cli --timer-start

Now run something that causes fapolicyd decisions. For a tiny smoke test, this can be as boring as:

for i in $(seq 1 100); do /usr/bin/id >/dev/null; done
Then to stop timing, run the following command:

sudo fapolicyd-cli --timer-stop

The stop command prints the timing report. The daemon writes the report to:

/run/fapolicyd/fapolicyd.timing
The report is aggregate data. It does not store one record per decision. Each stage keeps a count, total time, maximum time, and latency buckets. That keeps memory bounded even if the timing window covers a large number of decisions. Here is a sample report:

Mode: manual
Timing run: 2026-05-24 14:54:07 -0400 to 2026-05-24 14:55:13 -0400
Duration: 0:01:06
Workers: 1
Max queue depth: 399
Decisions: 576,495
Throughput: 8717.3 decisions/sec (wall clock)
Active decision rate: 9596.4 decisions/sec

TL;DR:
  - Queueing pressure reached max depth 399 of 800 (49.9%), p95 wait <=100ms, max wait 160 ms.

Overall decision latency:
  avg 104 us, max 2.81 ms
  p50 bucket <=100us, p95 bucket <=500us, p99 bucket <=500us
  <=50us 1.4%, <=100us 72.4%, <=500us 100.0%, <=1ms 100.0%, >10ms 0.0%

Queueing:
  avg wait: 27.2 ms
  max wait: 160 ms
  p95 bucket: <=100ms
  total queued time: 15707 s
  max queue depth: 399

Decision phase timing:
Phase                 Calls  Calls/Dec      Total        Avg        Max   p95 bucket   Notes
event_build         576,495       1.00     3.17 s    5.50 us     516 us       <=50us   
evaluation          576,495       1.00     1.66 s    2.89 us    2.80 ms       <=50us   
response            576,495       1.00     55.0 s    95.5 us    1.44 ms      <=500us   

Lazy helper attribution:
  Helper timings are attributed to the active logical driver: evaluation or response.
  Combined totals are evaluation + response.

Lazy helper attribution by driver:
Helper                             Eval total  Response total     Combined Response %
mime_detection:total                  11.1 ms            0 ns      11.1 ms        0.0%
mime_detection:fast_classification     510 us            0 ns       510 us        0.0%
mime_detection:gather_elf              186 us            0 ns       186 us        0.0%
mime_detection:libmagic_fallback      10.4 ms            0 ns      10.4 ms        0.0%
trust_db_lookup:total                  120 ms            0 ns       120 ms        0.0%
trust_db_lookup:read                   109 ms            0 ns       109 ms        0.0%
trust_db_lookup:lock_wait             2.43 ms            0 ns      2.43 ms        0.0%

Combined lazy helper attribution:
Helper path                                       Calls  Calls/Dec      Total   Avg/call  Amort/Dec        Max   p95 bucket
mime_detection:total                                564       0.00    11.1 ms    19.6 us      19 ns    2.54 ms       <=10us
mime_detection:fast_classification                  564       0.00     510 us     903 ns       0 ns    11.3 us       <=10us
mime_detection:gather_elf                            32       0.00     186 us    5.82 us       0 ns    10.9 us       <=10us
mime_detection:libmagic_fallback                      9       0.00    10.4 ms    1.16 ms      18 ns    2.53 ms        <=5ms
trust_db_lookup:total                            66,016       0.11     120 ms    1.81 us     207 ns    2.80 ms        <=5us
trust_db_lookup:read                             66,016       0.11     109 ms    1.65 us     189 ns    2.80 ms        <=5us
trust_db_lookup:lock_wait                        66,016       0.11    2.43 ms      36 ns       4 ns    25.7 us        <=1us
hash_sha:total                                       22       0.00    10.6 ms     481 us      18 ns    2.79 ms        <=5ms
proc_detail_lookup                               99,060       0.17     1.29 s    13.0 us    2.24 us     415 us       <=50us

Derived observations:
  Queueing showed moderate bursts: max queue depth 399 of 800 (49.9%), p95 wait <=100ms, max wait 160 ms.
  libmagic fallback is the biggest MIME contributor: 1.6% of MIME calls, 94.0% of MIME time, 18 ns amortized per decision.
  hash_sha is rare but expensive: 0.0% of decisions, 481 us avg when called, 18 ns amortized per decision.

Detailed stage timing, sorted by total time:
Stage                                                 Calls      Calls/Dec      Total        Avg        Max   p95 bucket
time_in_queue:total                                 576,495           1.00    15707 s    27.2 ms     160 ms      <=100ms
decision:total                                      576,495           1.00     60.1 s     104 us    2.81 ms      <=500us
response:total                                      576,495           1.00     55.0 s    95.5 us    1.44 ms      <=500us
response:fanotify_write                             576,495           1.00     54.0 s    93.7 us    1.44 ms      <=500us
event_build:total                                   576,495           1.00     3.17 s    5.50 us     516 us       <=50us
event_build:proc_fingerprint                        576,495           1.00     1.92 s    3.32 us     250 us       <=10us
evaluation:total                                    576,495           1.00     1.66 s    2.89 us    2.80 ms       <=50us
evaluation:proc_detail_lookup                        99,060           0.17     1.29 s    13.0 us     415 us       <=50us
event_build:fd_stat                                 576,495           1.00     733 ms    1.27 us     512 us        <=5us
evaluation:trust_db_lookup:total                     66,016           0.11     120 ms    1.81 us    2.80 ms        <=5us
evaluation:trust_db_lookup:read                      66,016           0.11     109 ms    1.65 us    2.80 ms        <=5us
evaluation:lock_wait                                576,495           1.00    24.8 ms      43 ns    42.4 us        <=1us
response:audit_metadata:total                       576,495           1.00    12.5 ms      21 ns    46.9 us        <=1us
evaluation:mime_detection:total                         564           0.00    11.1 ms    19.6 us    2.54 ms       <=10us
evaluation:hash_sha:total                                22           0.00    10.6 ms     481 us    2.79 ms        <=5ms
evaluation:mime_detection:libmagic_fallback               9           0.00    10.4 ms    1.16 ms    2.53 ms        <=5ms
evaluation:fd_path_resolution                           588           0.00    6.73 ms    11.4 us     164 us       <=50us
evaluation:trust_db_lookup:lock_wait                 66,016           0.11    2.43 ms      36 ns    25.7 us        <=1us
evaluation:mime_detection:fast_classification           564           0.00     510 us     903 ns    11.3 us       <=10us
evaluation:mime_detection:gather_elf                     32           0.00     186 us    5.82 us    10.9 us       <=10us

Stage tail summary:
  time_in_queue:total: >10ms 564,343/97.9%, >25ms 235,276/40.8%, >50ms 39,149/6.8%, >100ms 2,230/0.4%

Not observed:
  event_build:cache_flush, response:mime_detection:total, response:mime_detection:fast_classification, response:mime_detection:gather_elf, response:mime_detection:libmagic_fallback, evaluation:hash_ima:total, response:trust_db_lookup:total, response:trust_db_lookup:lock_wait, response:trust_db_lookup:read, response:syslog_debug_format:total

Notes:
  Largest queued-time contributor: time_in_queue:total (15707 s)
  Largest helper contributor: proc_detail_lookup (1.29 s)
  Largest decision phase contributor: response (55.0 s)
  Slowest observed row by max: time_in_queue:total (160 ms)

 

Interpreting the Report

The full timing report can look intimidating, but the reading order is simple:

1. Run summary
2. TL;DR
3. Overall decision latency
4. Queueing
5. Decision phase timing
6. Lazy helper attribution
7. Detailed stage timing
The run summary tells you whether the run is meaningful. Decisions should be non-zero. If it is zero, the workload did not generate timed daemon decisions. Max queue depth tells you  whether requests backed up while timing was armed.

The report has two rates. Throughput is decisions per wall-clock second while timing was armed. Active decision rate is based on accumulated decision:total worker time. If wall-clock throughput is low but active decision rate is high, the workload may have been bursty or idle part of the time. If both are low, decision work is expensive.

The TL;DR section is not magic. It is a compact set of observations derived from the same data later in the report. It may point out helper dominance, response formatting cost, queue health, or that no dominant findings were observed.

Overall decision latency is the end-to-end decision worker latency after an event has been dequeued. It includes event build, policy evaluation, optional logging or audit preparation, response selection, and the fanotify response write. Queue wait is reported separately.

Before the next section, let's talk about the p95 notation. The p95 bucket is the latency bucket that contains the 95th percentile observation. In plain terms: about 95% of the measured calls completed at or below that bucket, and about 5% were slower. If the report says a stage has p95 <=500us, then nearly all calls to that stage were fast. If it says p95 >10ms, then the slow path was not just one rare outlier; it happened often enough to affect the top 5% of calls.

It is significant because averages can hide tail latency, and maximums can overreact to one unusual event. The p95 bucket sits between them. It tells you whether slowness is recurring enough to matter. 

This section has average latency, maximum latency, p95 bucket, and tail buckets. The p95 bucket is usually more useful than the maximum. A high maximum with a low p95 means rare outliers. A high p95 means the slow path is common.

Queueing tells you how long events waited in fapolicyd's internal userspace queue before the decision worker started processing them. If queue wait is high and max queue depth is close to q_size, the daemon is being fed events faster than it can answer them.

Decision phase timing is the first section I would use for diagnosis:

event_build:total
evaluation:total
response:total
If event build dominates, look at process and object identity work. If evaluation dominates, look at rule traversal and evaluation-side helpers. If response dominates, look at logging, audit metadata, response-side helpers, and the kernel write.

Lazy helper attribution explains whether expensive helper work was caused by evaluation or response. This is where the duplicate-looking MIME and trust DB rows become useful. The same helper has very different tuning implications depending on whether policy needed it or output formatting needed it.

The detailed stage table is the final view. It is sorted by total measured time. Use it to find the main cost centers:
High event_build:proc_fingerprint              -> process identity lookup cost
High event_build:fd_stat                       -> object fingerprinting cost
High evaluation:lock_wait                      -> rule lock wait
High evaluation:total                          -> rule traversal or matching cost
High evaluation:mime_detection:libmagic_fallback -> expensive file type fallback
High evaluation:hash_sha:total                 -> SHA hashing cost
High evaluation:hash_ima:total                 -> IMA digest cost
High evaluation:trust_db_lookup:read           -> trust database read cost
High response:mime_detection:*                 -> reporting needed ftype data
High response:trust_db_lookup:read             -> reporting or audit needed trust data
High response:fanotify_write                   -> kernel response write cost
Do not expect every row to appear in every report. If a stage is not observed, the workload did not trigger that measured operation during the timing window. That can be good news. For example, no libmagic fallback means fast MIME classification handled the files observed in the run.

The best way to use timing is to make one change at a time. Change the workload, a rule, an integrity setting, or output format, then run another bounded timing window. Compare the same sections. If the dominant cost moves, you learned something.

The next article will cover the stress harness. The timer tells us where the daemon spent time. The stress tool gives us repeatable workloads that can move specific counters and timing gates on purpose.

Saturday, May 23, 2026

Understanding fapolicyd-1.5 Metrics

In the last article, I went over the bigger picture of the fapolicyd 1.5 work. From it, we learned that status, metrics, and timing are now separate reports because they answer different questions.

This article is about metrics.

The important distinction is this: "fapolicyd-cli --check-status" asks whether the daemon is healthy and configured the way you expect. "fapolicyd-cli --check-metrics" asks what the daemon has been doing. Metrics are where you look for rule hits, cache behavior, default-allow decisions, queue pressure, and which rule attributes are causing work.

Start with the command:

# fapolicyd-cli --check-metrics
Last metrics reset: never
Ruleset generation: 1

Decision outcomes:
Allowed accesses: 42171
Denied accesses: 3
Allowed by rule: 42171
Allowed by fallthrough: 0

Inter-thread queue & defer activity:
Inter-thread max queue depth: 6
Subject deferred events: 0
Subject defer max depth: 0
Subject defer fallbacks: 0

Subject cache effectiveness:
Subject hits: 41632
Subject misses: 692
Subject collisions: 28
Subject evictions: 150 (0%)
Early subject cache evictions: 0
Subject BUILDING tracer evictions: 0
Subject BUILDING stale evictions: 0

Object cache effectiveness:
Object hits: 34527
Object misses: 11421
Object collisions: 3774
Object evictions: 3774 (10%)

Rule hit counts:
Hits/rule:   1      0 allow perm=any uid=0 : dir=/var/tmp/
Hits/rule:   2   7531 allow perm=any uid=0 trust=1 : all
Hits/rule:   3      0 allow perm=open exe=/usr/bin/rpm : all
Hits/rule:   4      0 allow perm=open exe=/usr/bin/python3.13 comm=dnf : all
Hits/rule:   5      0 deny_audit perm=any all : ftype=application/x-bad-elf
Hits/rule:   6   7634 allow perm=open all : ftype=application/x-sharedlib trust=1
Hits/rule:   7      0 deny_audit perm=open all : ftype=application/x-sharedlib
Hits/rule:   8    578 allow perm=execute all : trust=1
Hits/rule:   9     12 allow perm=any gid=wheel : ftype=%languages dir=/home
Hits/rule:  10      0 allow perm=any gid=wheel : ftype=%languages dir=/usr/share/git-core/templates/
Hits/rule:  11    158 allow perm=open all : ftype=%languages trust=1
Hits/rule:  12      0 deny_audit perm=any all : ftype=%languages
Hits/rule:  13     20 allow perm=any all : ftype=text/x-shellscript
Hits/rule:  14      3 deny_audit perm=execute all : all
Hits/rule:  15  26238 allow perm=open all : all

Subject attribute lookups:
Subject attr: auid requests=3 lookups=3
Subject attr: uid requests=84351 lookups=692
Subject attr: sessionid requests=0 lookups=0
Subject attr: pid requests=3 lookups=0
Subject attr: ppid requests=3 lookups=0
Subject attr: trust requests=7531 lookups=565
Subject attr: gid requests=52850 lookups=0
Subject attr: comm requests=0 lookups=0
Subject attr: exe requests=68692 lookups=1216
Subject attr: dir requests=0 lookups=0
Subject attr: ftype requests=0 lookups=0

Object attribute lookups:
Object attr: path requests=12476 lookups=9452
Object attr: dir requests=7853 lookups=2453
Object attr: device requests=0 lookups=0
Object attr: ftype requests=217466 lookups=9420
Object attr: trust requests=8376 lookups=1437
Object attr: filehash requests=0 lookups=0


The first two lines matter more than they might appear:
Last metrics reset: never
Ruleset generation: 3

If the last reset is never, then the counters have been growing since the daemon started. That is the default behavior and it is useful for a quick look at the lifetime of the daemon. If you want a smaller window, configure reset_strategy=manual in fapolicyd.conf and use "fapolicyd-cli --reset-metrics". The reset-metrics report snapshots the counters, resets them, and displays what they were at reset. It is the exact same output - however the next report generated starts fresh.

The Ruleset generation tells you which active policy the counters apply to. Rule hit counters naturally reset when new rules are successfully loaded. That is important because rule numbers and rule text can change when policy changes. You do not want rule hit counts from yesterday's policy mixed with today's policy.

Metrics are easiest to interpret when the counter window and ruleset generation are explicit.

The first real section is decision outcomes. The old way to look at fapolicyd activity was allowed versus denied. That is still useful, but it is not enough. An allow can happen because a rule matched. It can also happen because the rules had no opinion and the historical default allow behavior was used. Note: the shipped rules ends with 

deny_audit perm=execute all : all
allow perm=open all : all
The intention is to block any unknown execution and allow opens of any documents. This depends on accurate detection of any interpreted computer languages before these lines.

To better understand what kind of access decisions are made, the metrics report now separates:
Allowed accesses
Denied accesses
Allowed by rule
Allowed by fallthrough
If Allowed by fallthrough is zero, then every allow in the window came from a rule. If it is non-zero, the report prints more detail: open versus execute, trusted versus untrusted or unknown trust, and broad file type buckets such as executable, programmatic, shared library, unknown ftype, and other ftype.

The default allow indicates that policy is missing a decision rule. It is not the same as a policy rule intentionally approving the access. If a system has a large number of fallthrough execute decisions, you would want to know why. Maybe that is expected for a permissive policy. Maybe it shows a missing terminal deny. The point is that it is now visible.

Rule hit counts are next. They look like this:
Hits/rule:   1      42 allow perm=execute all : trust=1
Hits/rule:   2       3 deny_audit perm=execute all : trust=0
The exact rules will depend on your policy. A rule hit is counted when the rule actually makes the allow or deny decision. Merely iterating past a rule is not a hit. This makes the table useful for answering practical questions. Which rules are carrying the load? Which rules never fire? Did the rule I just added actually match the program I was testing?

Next come queue and defer activity. The most important queue metric is:
Inter-thread max queue depth
This is the high-water mark for fapolicyd's internal event queue in the current metrics window. It is not the kernel fanotify queue. It is the userspace queue between event intake and decision processing. If this number grows near your configured q_size, the daemon is not keeping up with the event stream.

Subject deferral is new in this release. The daemon can defer an incoming event when processing it would evict another process that is still building startup pattern state in the same subject cache slot. In a normal busy system, some deferred events may be fine. What I would watch closely is:
Subject deferred events
Subject defer max depth
Subject defer fallbacks
Subject defer fallbacks means the fixed defer array filled and the daemon had to use the historical eviction behavior. If this keeps climbing during normal workloads, look at subject cache sizing and the workload shape. A very wide fork/exec storm can create lots of subject cache collisions.

The cache sections are next. fapolicyd has subject and object caches because recomputing process and file attributes on every event would be expensive. The basic shape is:
Subject hits
Subject misses
Subject collisions
Subject evictions

Object hits
Object misses
Object collisions
Object evictions
Hits are good. Misses mean fapolicyd had to populate information - which can be costly. Collisions mean a populated cache slot did not match the current process or file identity. Evictions mean something was removed to make room. A small number of evictions is normal. A high eviction rate compared to hits suggests that the cache may be too small for the workload.

The subject cache also reports early subject cache evictions, tracer evictions, and stale BUILDING evictions. These are more health-oriented. An early eviction means a subject was evicted before startup state was complete. A tracer eviction means the occupant was being traced and could hold the slot indefinitely. A stale eviction means startup state stayed incomplete past the bounded stale window.

The last section is easy to skip, but it is one of the most useful additions: attribute lookup metrics. They look like this:
Subject attr: auid requests=1000 lookups=12
Object attr: ftype requests=1000 lookups=200

requests means policy evaluation or syslog formatting asked for the attribute. lookups means the attribute was not already present in the event cache and fapolicyd had to compute or fetch it.

That distinction matters. If requests are high but lookups are low, the cache is doing its job. If lookups are high, then that attribute is causing real work. This is also a way to see whether logging is making the daemon do extra lookups. For example, a very detailed syslog_format can request attributes that the policy did not need for the decision.

The metrics report groups activity by decision outcome, queue/defer pressure, cache behavior, and attribute lookup cost.


Once you have metrics, the next question is how to graph them. fapolicyd does not need to know about your monitoring stack. The report is simple name: value text. That means you can parse the fields you care about and ship them to whatever you already use.

If you use StatsD, you can turn selected fields into gauge updates. This is deliberately small and crude, but it shows the idea:

fapolicyd-cli --check-metrics |
awk -F': '
'/^(Allowed accesses|Denied accesses|Allowed by rule|Allowed by fallthrough|Inter-thread max queue depth|Subject deferred events|Subject defer fallbacks|Early subject cache evictions|Object evictions):/ {
    name=$1
    gsub(/ /, "_", name)
    printf "fapolicyd.%s:%s|g\n", tolower(name), $2
}' | nc -u -w0 127.0.0.1 8125 



If you use Prometheus and Grafana, remember that Grafana is the visualization layer. Something else has to collect the numbers. Prometheus supports a simple text exposition format. You could write a tiny exporter or generate a node-exporter textfile collector file from selected fapolicyd metrics.

For Prometheus, be careful with labels. It is tempting to turn every rule hit line into a metric with the whole rule text as a label. That can create high cardinality and make your monitoring system unhappy. I would start with stable low-cardinality metrics: allow/deny counts, fallthrough counts, queue depth, cache hits and evictions, subject deferral, kernel overflow, and reply errors. Rule hits are better for troubleshooting reports unless you have a controlled set of rule labels.

Here is a minimal Prometheus-style example for a few fields:

fapolicyd-cli --check-metrics |
awk -F': '
'/^Allowed accesses:/ { print "fapolicyd_allowed_accesses " $2 }
/^Denied accesses:/ { print "fapolicyd_denied_accesses " $2 }
/^Allowed by fallthrough:/ { print "fapolicyd_allowed_by_fallthrough " $2 }
/^Inter-thread max queue depth:/ { print "fapolicyd_inter_thread_max_queue_depth " $2 }
/^Subject defer fallbacks:/ { print "fapolicyd_subject_defer_fallbacks " $2 }
/^Early subject cache evictions:/ { print "fapolicyd_early_subject_cache_evictions " $2 }
'


This is not meant to be a finished exporter. It is a starting point. A real exporter should sanitize names, handle percentages, keep type/help metadata, and decide whether a field is a counter or gauge.

What should you graph first? I would start with these:
Allowed accesses
Denied accesses
Allowed by fallthrough
Inter-thread max queue depth
Subject deferred events
Subject defer fallbacks
Early subject cache evictions
Subject evictions
Object evictions
Reply errors
Kernel queue overflow
Then add attribute lookups for the fields that matter to your policy. If Object attr: ftype lookups are high, file type detection is important for your workload. If Object attr: trust lookups are high, trust database lookups are part of the cost. If subject proc attributes are hot, fork/exec-heavy
workloads may be asking for process details often.

Metrics do not tell you everything. They tell you where to look. If cache evictions are high, look at cache sizing. If fallthrough is high, look at rule coverage. If queue depth is high, look at workload bursts and decision cost. If you need to know where the time is going inside decisions, that is when the timing report comes in. That is the next article.

Friday, May 22, 2026

fapolicyd 1.5: Correctness First, Better Reports Second

   

The fapolicyd 1.5 release had a lot of internal work that is not very noticable. This release started with a simple objective:

  • No silent visibility loss.
  • No silent policy replacement failure.
  • No hidden default-allow behavior.

fapolicyd is an application allowlisting daemon. It receives file access events from the kernel through fanotify, evaluates policy, and replies to the kernel. When something goes wrong in that path, the daemon should not quietly lose sight of events, quietly replace policy with a bad ruleset, or quietly allow executable content because policy had no opinion.

That is the first narrative of this release: fail-safe and correctness work. The second narrative is reporting. Once the important failure modes are made explicit, they need to show up in the right place. That led to separating status, metrics, and timing into different reports.

 

fapolicyd 1.5 makes important failure modes explicit and then reports them in the right place


Fail-safe and correctness work

One of the most important changes is the transactional rule reload. Previously, a reload could destroy the active rules before the replacement ruleset had fully loaded. If parsing failed after the old policy was gone, the daemon could be left with partial or empty policy. Since fapolicyd historically allows access when no rule has an opinion, this could become a fail-open condition.

The new approach is to build and validate the candidate ruleset separately. It owns the rules, attribute sets, syslog fields, proc-status mask, rule count, and rule identity. Only after the candidate is complete does it replace the active ruleset. A failed reload preserves the previous published policy.

That also means reload failures become useful operational data. The daemon can count rule reload failures and report them as failure-action metrics. Today those counters are observe-mode diagnostics. Later, high-security deployments can use the same named failure classes to decide whether a stronger action is needed.

The parser side was tightened for the same reason. If a subject or object assignment fails while parsing a rule, that must be a parse failure. It cannot be ignored. A malformed multi-attribute rule should not accidentally become a broader rule because one invalid attribute was skipped while another valid attribute remained.

The fanotify path also got attention. FAN_Q_OVERFLOW is the kernel telling us that fanotify events were lost. This is not the same as a denied access. It is visibility loss. If the kernel queue overflows and the daemon does not make that visible, an administrator has no way to know that enforcement degraded.

So fapolicyd now detects kernel queue overflow, counts it, logs it with rate limiting, and exposes it through reporting. This counter is not just for this release. It is groundwork for later failure policy. If a future configuration wants to treat kernel event loss as a high-severity condition, the daemon has a specific event class to build on.

The daemon also now counts failed or short fanotify response writes. Every permission event needs exactly one response back to the kernel. If the response write fails, that is a correctness issue. Counting it gives us another named failure class that can be monitored now and acted on more strongly later.

The no-opinion allow path was another area that needed guardrails. Fapolicyd does not try to deny every ordinary document open. Compatibility requires a default allow when policy has no opinion. But executable and programmatic content should not accidentally fall through because of incomplete rules, parser mistakes, missing file type classification, or internal errors.

The release adds counters that distinguish "Allowed by rule" from "Allowed by fallthrough". That keeps the compatibility behavior, but it stops hiding it. If a system has default-allow execute activity, the administrator can now see it and investigate.

This is also where the new rule linter fits.


$ fapolicyd-cli --check-rules --lint fapolicyd/src/tests/fixtures/rules-valid.rules
file is valid (3 rules)
Policy lint warning: executable events can fall through; no terminal broad execute deny found
Policy lint warning: %languages is not defined; programmatic ftype coverage
cannot be checked
$ fapolicyd-cli --check-rules --lint --verbose Rules file is valid (15 rules) Policy lint found no warnings

The linter is intentionally modest. It is not a formal proof of the policy. It looks for policy shapes that are easy to get wrong: executable events that can fall through because there is no broad terminal execute deny, missing "%languages" coverage for programmatic opens, and broad open allows that can shadow programmatic-content denies.

That matters because fapolicyd policies are ordered. A rule that looks harmless in isolation can change meaning when it appears before another rule. The linter gives administrators a cheap check before loading policy. It also ties back to the default-allow metrics: if the linter warns about a default-allow gap, the metrics report can show whether the running system is actually using that gap.

If you use rules.d, fagenrules now validates the compiled rules file before installing it. That closes a practical hole. The assembled policy should be checked before it is written as the policy the daemon will later load.

Subject deferral

The subject cache work deserves a closer look because it fixes a subtle correctness problem.

fapolicyd tracks process startup state. During exec, it may see the executable, the runtime linker, libraries, interpreter activity, and other file opens that belong to the same subject. While that sequence is still being understood, the subject is considered to be in a BUILDING state.

The subject cache is indexed by pid % slots. More than one process can hash to the same slot over time. Under heavy fork/exec pressure, a new process can arrive at a slot that is already occupied by a different process that is still BUILDING. If fapolicyd evicts the BUILDING subject too early, it loses the startup context for the process already underway. When that original process comes back with another access, fapolicyd may see a later part of startup without the earlier context. That can lead to false pattern detection decisions such as the "ld_so" detection.

One idea is to collect ejected slots and try to keep them around until they complete. That sounds easy, but it is a losing proposition.

New processes keep coming and the runtime linker will pour many new accesses into the system as it assembles the app. There is no guarantee that the process we ejected will ever make another access. It might exit. It might segfault. It might be stopped by a tracer. If it does come back, there is no efficient way to find the saved subject once it is ejected to a new storage. That turns the storage into a second, unbounded, hard-to-reason-about subject store.

The design in 1.5 takes the opposite approach. Do not eject the old BUILDING subject just because a new subject arrived. Store the new access request in a bounded deferral array. This puts back pressure on the new access while the subject already underway finishes and gets out of the way. If a subject is deferred, the runtime linker does not add more accesses allowing other processes that are actionable to proceed.

 

Subject deferral sequence of events.


The array is fixed size. That is important. A fanotify permission event can hold a task waiting for a decision, and it owns a file descriptor until the daemon replies or closes it. Deferral is acceptable because it is bounded and observable. If the array fills, fapolicyd falls back to the historical eviction behavior and counts that fallback.

There are two corner cases: traced subjects and stale subjects.

If a process is being traced, it may be stopped indefinitely. We do not know whether it will continue. If a traced process holds a BUILDING slot and another request needs that slot, waiting forever would be a denial-of-service bug. A malicious actor could try to fill slots with stopped BUILDING subjects and deadlock the system. The daemon now detects traced BUILDING subjects and can eject them when their slot is needed.

There is also a timeout. Even without tracing, a BUILDING subject that has not made progress can become stale. As a last resort, stale BUILDING slots are garbage collected so the system does not wait forever on a subject that may never complete startup.

The relevant counters are:

  • Subject deferred events
  • Subject defer max depth
  • Subject defer fallbacks
  • Early subject cache evictions
  • Subject BUILDING tracer evictions
  • Subject BUILDING stale evictions
  • Subject defer oldest age

These counters tell us whether the accesses are flowing. A busy fork/exec workload may increase "Subject deferred events". That is expected. "Subject defer fallbacks" should not steadily climb during normal operation. Tracer and stale evictions should be rare and worth looking at.

The stress and timing article later in this series will show how to create subject-cache pressure deliberately and how to read these counters during a test.

Reporting philosophy

The second narrative is reporting. Before this work, the status report had become a mixed report. It included health, configuration, counters, cache statistics, and other runtime details. That makes it harder to know what a field means. Is it a current state value? A lifetime counter? A counter that should be reset? A health indicator?

The reports now have cleaner jobs.

"fapolicyd-cli --check-status" asks whether the daemon is healthy and configured as expected.


# fapolicyd-cli --check-status
Operating mode:
Permissive: false
Integrity: sha256
reset_strategy: manual
Timing collection mode: manual
Timing collection armed: false
Timing collection last start time: never
Timing collection last stop time: never
Ruleset generation: 1
Headline activity: Allowed accesses: 8762 Denied accesses: 0
Resource configuration: CPU cores: 32 q_size: 800 Subject defer array size: 256 Subject cache size: 4099 Object cache size: 16381 Trust database max pages: 18944
Resource utilization: Trust database pages in use: 14284 (75%) Subject slots in use: 207 (5%) Object slots in use: 3334 (20%) glibc arena (total memory) is: 25560 KiB, was: 1188 KiB glibc uordblks (in use memory) is: 8348 KiB, was: 960 KiB glibc fordblks (total free space) is: 17211 KiB, was: 227 KiB
Health indicators: Kernel queue overflow: 0 Filesystem errors: 0 Filesystem error last status: none Filesystem error last seen: never Reply errors: 0 Subject defer fallbacks: 0 Early subject cache evictions: 0 Subject BUILDING tracer evictions: 0 Subject BUILDING stale evictions: 0 Subject defer oldest age: 0ns Failure action queue_full (observe): 0 Failure action kernel_queue_overflow (observe): 0 Failure action worker_stall (observe): 0 Failure action rule_reload_failure (observe): 0 Failure action trust_reload_failure (observe): 0 Failure action response_write_failure (observe): 0 Failure action fanotify_filesystem_error (observe): 0
Watched mounts: watching mount: / watching mount: /dev/shm watching mount: /run watching mount: /run/credentials/systemd-journald.service watching mount: /tmp ...

Status includes operating mode, integrity mode, reset_strategy, timing collection state, ruleset generation, headline allow and deny activity, resource configuration, resource utilization, health indicators, failure action counters, and watched mounts.

The important thing is that status is where you look for current condition. Is the daemon permissive? What ruleset generation is loaded? Are there kernel queue overflows? Are reply errors non-zero? Is there an old deferred subject? Are the watched mounts what you expect?

"fapolicyd-cli --check-metrics" asks what happened in the current counter window. Metrics includes decision outcome counters, Allowed by rule, Allowed by fallthrough, queue and deferral activity, subject and object cache effectiveness, rule hit counts, and subject/object attribute lookup counters.

Some information appears in both reports on purpose. Allowed and denied accesses are useful headline activity in status, but they are also counters in metrics. Early subject evictions are health signals in status, but they also describe the current metrics window. The difference is the question being asked. Status asks whether something needs attention. Metrics asks what moved during the measured interval.

The old cache and counter information did not disappear. It moved to the metrics report where it belongs. If you previously looked in status for cache hits, misses, evictions, rule hits, or detailed runtime counters, look in "--check-metrics".

 Metrics can also be reset in controlled ways. The default reset_strategy is never, which preserves lifetime counters. manual allows a privileged "fapolicyd-cli --reset-metrics" request to snapshot and reset runtime counters.  auto is for interval reports that should each describe only the activity since the previous interval.

Timing is separate from both status and metrics. Timing collection is disabled by default because it adds measurement work to the decision path. When enabled with "timing_collection=manual", an administrator can start and stop a bounded timing window:

fapolicyd-cli --timer-start

fapolicyd-cli --timer-stop

The daemon then writes to "/run/fapolicyd/fapolicyd.timing". This report answers where time went during that specific run: queue wait, event build, rule evaluation, MIME detection, hashing, trust database lookup, logging, audit metadata, and the fanotify response write.

I will cover metrics in the next article and timing plus the stress harness in the one after that. Those deserve separate treatment. This article is only the overview: the release first made important correctness failures explicit, then gave administrators better places to look for the evidence.

The upstream project is here: https://github.com/linux-application-whitelisting/fapolicyd.

If you are new to fapolicyd itself, the Red Hat documentation on blocking and allowing applications with fapolicyd is a useful starting point.



Wednesday, February 4, 2026

Introducing cap-audit

Applications often run as root because figuring out which Linux capabilities they actually need is difficult. You might know your web server needs to bind to port 80, which requires CAP_NET_BIND_SERVICE. But what about that database daemon? Or that monitoring agent? Trial and error gets old fast, and running everything as root is a big risk.

The libcap-ng project now includes cap-audit, a tool that traces applications to determine exactly which capabilities they require. Unlike static analysis tools that guess based on which syscalls appear in the binary, cap-audit hooks into the kernel's actual capability checking functions. When the kernel asks "does this process have CAP_NET_RAW?", cap-audit records it. This is ground truth - not guesswork.

How It Works

Cap-audit uses eBPF to hook the kernel's capability checking functions - cap_capable(), ns_capable(), and their variants. These are the functions the kernel calls every time it needs to verify a capability. When you trace an application, cap-audit forks your target program, registers its PID with the eBPF program, and then watches for kernel events.

The eBPF program filters events by PID right at the kprobe entry point. This is critical. Without filtering, it would capture thousands of capability checks per second from every process on the system. By filtering for just the target application and its children, overhead drops to less than 1% for the traced app and effectively zero for everything else.

Each capability check generates an event that includes which capability was checked, whether it was granted or denied, which syscall triggered it, and a user-space stack trace. These events stream through a ring buffer to the userspace program, which aggregates them and generates a report showing exactly what your application needs.

 


Why System Context Matters

Cap-audit doesn't just tell you which capabilities were checked - it reads various sysctls to understand when those capabilities are actually required. For example, consider /proc/sys/kernel/yama/ptrace_scope. If it's set to 0, any process can ptrace any other process it could normally signal. But if it's 1 or higher, you need CAP_SYS_PTRACE. Same binary, different capability requirements depending on system configuration.

The tool gathers kernel.perf_event_paranoid, kernel.unprivileged_bpf_disabled, kernel.kptr_restrict, and several others. These aren't just informational - they directly affect which capabilities your application needs to run. A monitoring tool that reads /proc/kallsyms needs CAP_SYSLOG when kernel.kptr_restrict is 1, but not when it's 0. Cap-audit shows you both the capabilities that were actually checked and the system settings that made them necessary. 

This means the capability requirements cap-audit reports are specific to your kernel configuration. If you're deploying to containers or hardened systems with different sysctl values, you might need different capabilities. The report includes the system context so you can make informed decisions.

You Must Exercise All Code Paths 

Here's something important to understand: cap-audit traces what your application actually does, not what it could theoretically do. If your application can set the system clock which requires CAP_SYS_TIME, but you only trace it handling normal requests, you won't see that capability requirement.

Think of it like code coverage in testing. If you don't exercise a code path during tracing, its capability requirements won't appear in the report. For daemons, this means you need to trigger all administrative operations, error handling paths, and edge cases. For CLI tools, you need to use all major features and options.

Let's see it in action:

DEMO: Tracing a daemon
======================================================================
CAPABILITY ANALYSIS FOR: /usr/sbin/irqbalance (PID 8189)
======================================================================

SYSTEM CONTEXT:
----------------------------------------------------------------------
Kernel version: 6.18.7-100.fc42.x86_64
kernel.yama.ptrace_scope: 1
kernel.kptr_restrict: 1
kernel.dmesg_restrict: 1
kernel.modules_disabled: 0
kernel.perf_event_paranoid: 2
kernel.unprivileged_bpf_disabled: 2
net.core.bpf_jit_enable: 1
net.core.bpf_jit_harden: 1
net.core.bpf_jit_kallsyms: 1
vm.mmap_min_addr: 65536
fs.protected_hardlinks: 1
fs.protected_symlinks: 1
fs.suid_dumpable: 2

REQUIRED CAPABILITIES:
----------------------------------------------------------------------
setpcap (#8)
Checks: 43 granted, 0 denied
Reason: Used by prctl (syscall 157)

sys_admin (#21)
Checks: 34 granted, 0 denied
Reason: Used by clone (syscall 56)

CONDITIONAL CAPABILITIES:
----------------------------------------------------------------------
None

ATTEMPTED BUT DENIED:
----------------------------------------------------------------------
None

SUMMARY:
----------------------------------------------------------------------
Total capability checks: 77
Required capabilities: 2
Conditional capabilities: 0
Denied operations: 0

RECOMMENDATIONS:
----------------------------------------------------------------------
Programmatic solution (C with libcap-ng):
#include <cap-ng.h>
...
capng_clear(CAPNG_SELECT_BOTH);
capng_updatev(CAPNG_ADD, CAPNG_EFFECTIVE|CAPNG_PERMITTED, SETPCAP, SYS_ADMIN, -1);
if (capng_change_id(uid, gid, CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING))
perror("capng_change_id");

For systemd service:
[Service]
User=<non-root-user>
Group=<non-root-group>
AmbientCapabilities=setpcap sys_admin
CapabilityBoundingSet=setpcap sys_admin

For file capabilities (via filecap):
filecap /path/to/binary setpcap sys_admin

For Docker/Podman:
docker run --user $(id -u):$(id -g) \
--cap-drop=ALL \
--cap-add=setpcap \
--cap-add=sys_admin \
your-image:tag

For Kubernetes:
securityContext:
runAsUser: 1000
runAsGroup: 1000
capabilities:
drop:
- ALL
add:
- setpcap
- sys_admin

For applications that use file-based capabilities (like /usr/bin/arping with cap_net_raw+ep), cap-audit sees when those capabilities are actually exercised. A binary might have five capabilities set on the file, but only uses three during normal operation. Cap-audit shows you what's actually needed.

DEMO: File capabilities
======================================================================
CAPABILITY ANALYSIS FOR: /usr/bin/arping (PID 7940)
======================================================================

SYSTEM CONTEXT:
----------------------------------------------------------------------
Kernel version: 6.18.7-100.fc42.x86_64
kernel.yama.ptrace_scope: 1
kernel.kptr_restrict: 1
kernel.dmesg_restrict: 1
kernel.modules_disabled: 0
kernel.perf_event_paranoid: 2
kernel.unprivileged_bpf_disabled: 2
net.core.bpf_jit_enable: 1
net.core.bpf_jit_harden: 1
net.core.bpf_jit_kallsyms: 1
vm.mmap_min_addr: 65536
fs.protected_hardlinks: 1
fs.protected_symlinks: 1
fs.suid_dumpable: 2

REQUIRED CAPABILITIES:
----------------------------------------------------------------------
setpcap (#8)
Checks: 1 granted, 5 denied
Reason: Used by capset (syscall 126)

net_raw (#13)
Checks: 1 granted, 0 denied
Reason: Used by socket (syscall 41)

CONDITIONAL CAPABILITIES:
----------------------------------------------------------------------
None

ATTEMPTED BUT DENIED:
----------------------------------------------------------------------
setuid (#7)
Attempts: 1 (all denied)
Impact: Application may have reduced functionality

SUMMARY:
----------------------------------------------------------------------
Total capability checks: 8
Required capabilities: 2
Conditional capabilities: 0
Denied operations: 1

RECOMMENDATIONS:
----------------------------------------------------------------------
Programmatic solution (C with libcap-ng):
#include <cap-ng.h>
...
capng_clear(CAPNG_SELECT_BOTH);
capng_updatev(CAPNG_ADD, CAPNG_EFFECTIVE|CAPNG_PERMITTED, SETPCAP, NET_RAW, -1);
if (capng_change_id(uid, gid, CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING))
perror("capng_change_id");

For systemd service:
[Service]
User=<non-root-user>
Group=<non-root-group>
AmbientCapabilities=setpcap net_raw
CapabilityBoundingSet=setpcap net_raw

For file capabilities (via filecap):
filecap /path/to/binary setpcap net_raw

For Docker/Podman:
docker run --user $(id -u):$(id -g) \
--cap-drop=ALL \
--cap-add=setpcap \
--cap-add=net_raw \
your-image:tag

For Kubernetes:
securityContext:
runAsUser: 1000
runAsGroup: 1000
capabilities:
drop:
- ALL
add:
- setpcap
- net_raw

Reading the Report

The report breaks down into several sections. Required Capabilities shows capabilities that were successfully checked - these are capabilities your application actively used and will need to function. Each entry includes how many times it was checked and which syscall triggered it. "CAP_NET_BIND_SERVICE: Used by bind (syscall 49)" tells you exactly what's going on.

Conditional Capabilities shows requirements that depend on system configuration. You'll see entries like "CAP_SYS_PTRACE: Needed when kernel.yama.ptrace_scope > 0, Current value: 1 (capability needed)". This tells you the capability is required on your current system, but might not be on systems with different sysctl values.

Attempted But Denied shows capability checks that failed. These are interesting because they reveal functionality your application tried to use but couldn't. Sometimes this is fine - the application has a fallback path. Other times, it indicates reduced functionality. The report notes "Application may have reduced functionality" so you can investigate.

The recommendations section generates ready-to-use configurations for systemd, Docker, Kubernetes, and file-based capabilities. For systemd, you get AmbientCapabilities and CapabilityBoundingSet directives. For Docker, you get --cap-drop=ALL followed by specific --cap-add entries. For file capabilities, you get the filecap command with the exact capability set. These aren't just suggestions - they're the minimal set your application demonstrated it needs. It also produces a snipit of C or python code to show how to programmatically solve this in the program.

Ground Truth, Not Guesswork

The key insight is that cap-audit hooks the actual kernel capability checking functions. When mount() checks CAP_SYS_ADMIN, cap-audit sees it. When bind() checks CAP_NET_BIND_SERVICE, cap-audit sees it. There's no parsing of source code, no heuristics based on syscall names, no guessing. The kernel's security subsystem itself is telling you what capabilities are being checked.

This is why the tool requires CAP_BPF and CAP_PERFMON to run - it's instrumenting kernel internals. But once set up, it gives you authoritative answers about capability requirements. If cap-audit says your application needs three capabilities, those are the three it checked during your trace. If it says your application doesn't need elevated capabilities at all, you can confidently run it as an unprivileged user.

I should note that both of the example traces above show programs that are not doing capabilities right. There is a '-v' command line option for verbose output. It shows that even though arping is file system based capabilities, it is still calling capset when it shouldn't. Irqbalance, constantly tries setting capabilities when it should do it once and done. We'll dive into these in a future blog.

Run your applications through cap-audit during development or security audits. Exercise all functionality, check the system context, and use the generated configurations to properly scope your capabilities. File issues on github if you have a request or find something is off. 

Thursday, September 16, 2021

Fuzzing annocheck with AFL++ Part 1

In the last article, we fuzzed annocheck with radamsa. This found several crashes. But annobin-9.94 has everything cleaned up that radamsa can find. We're not done fuzzing yet. We need to try a guided coverage fuzzer like AFL++ to see if there are more problems.

The strategy for using AFL++ is to build annocheck as a static application. We will start fuzzing it and make some adjustments based on how the initial runs go. Then we will let it run for several hours to see what it finds.

The first step is to download a copy of annobin-9.94 from here. Next install and build the source rpm. Then cd into the annobin-9.94 source directory. If you need to setup an rpm build environment, there are steps here

Next, we build annocheck as follows:

export PATH=/home/builder/working/BUILD/repos/git-repos/AFLplusplus:/home/builder/working/BUILD/repos/git-repos/llvm-project/build/bin:$PATH
export AFL_PATH=/home/builder/working/BUILD/repos/git-repos/AFLplusplus
CC=afl-clang-lto CXX=afl-clang-lto++ RANLIB=llvm-ranlib AR=llvm-ar LD=afl-clang-lto++ NM="llvm-nm -g" ./configure --without-gcc_plugin --without-tests --without-docs --enable-static --with-gcc-plugin-dir=`gcc -print-file-name=plugin`

The configure script errors out like this:

configure: creating ./config.lt
config.lt: creating libtool
checking for GCC plugin headers... no
configure: error: GCC plugin headers not found; consider installing GCC plugin development package

After much digging around, I decided that since we are making a static app and not worried about the gcc plugins, we'll just fix the tests to think everything is OK. Apply the following patch:

diff -urp annobin-9.94.orig/config/gcc-plugin.m4 annobin-9.94/config/gcc-plugin.m4
--- annobin-9.94.orig/config/gcc-plugin.m4      2021-08-31 10:02:12.000000000 -0400
+++ annobin-9.94/config/gcc-plugin.m4   2021-09-15 16:31:49.946843182 -0400
@@ -129,7 +129,7 @@ int main () {}
 [gcc_plugin_headers=yes],
 [gcc_plugin_headers=no])

-if test x"$gcc_plugin_headers" = xyes; then
+if test x"$gcc_plugin_headers" = xyes -o x"$static_plugin" = xyes; then
   AC_MSG_RESULT([yes])
 else
   AC_MSG_RESULT([no])

Since we modified the m4 scripts, we need to regenerate the configure script. I used the autogen.sh file from the audit-userspace project as a convenience. Download it and do the following:

./autogen.sh
CC=afl-clang-lto CXX=afl-clang-lto++ RANLIB=llvm-ranlib AR=llvm-ar LD=afl-clang-lto++ NM="llvm-nm -g" ./configure --without-gcc_plugin --without-tests --without-docs --enable-static --with-gcc-plugin-dir=`gcc -print-file-name=plugin`

Now configure finishes correctly. We are now ready to build it. The first step is to export the instrument variable and the run make:

export AFL_LLVM_INSTRUMENT=NATIVE
make

This fails as follows:

afl-cc ++3.14c by Michal Zalewski, Laszlo Szekeres, Marc Heuse - mode: LLVM-LTO-PCGUARD
error: unable to load plugin 'annobin': 'annobin: cannot open shared object file: No such file or directory'
make[1]: *** [Makefile:442: annocheck-annocheck.o] Error 1

After much digging around, I found that the cause of this failure is a plugin option in the annocheck_CFLAGS. Open annocheck/Makefile and go down to line 338 and remove -fplugin=annocheck. Now run make again. Now it compiles annocheck.

Time to setup for fuzzing. We will use the same program that we created to fuzz using radamsa. Go check the last article if you need the recipe. Do the following:

cd annocheck
mkdir in
cp ~/test/hello in
mkdir -p /tmp/proj/out
ln -s /tmp/proj/out out
# Then we switch to root and do some house keeping to let this
# run as fast as possible
su - root
service auditd stop
service systemd-journald stop
service abrtd stop
auditctl -D
auditctl -e 0
echo core >/proc/sys/kernel/core_pattern
cd /sys/devices/system/cpu
echo performance | tee cpu*/cpufreq/scaling_governor
exit

Now it's time to fuzz!

afl-fuzz -i in -o out ./annocheck @@

And we get the following:

There are plenty of guides that tell you what each of these items mean. Please search the internet and find one if you are curious. My main goal in this is to show you how to overcome many obstacles to get to the prize.

What I'd like to point out is 2 things. We are getting about 248 executions per second. That is kind of low. And we hit 2 crashes in 25 seconds. I hit Ctl-C to exit so that we can take an initial look at the crashes. The directory structure of the out directory looks like this:

$ tree out
out
└── default
    ├── cmdline
    ├── crashes
    │   ├── id:000000,sig:06,src:000000,time:17826,op:havoc,rep:4
    │   ├── id:000001,sig:11,src:000000,time:23980,op:havoc,rep:4
    │   └── README.txt
    ├── fuzz_bitmap
    ├── fuzzer_setup
    ├── fuzzer_stats

The actual test cases are in the crashes directory. So, what do we actually do with these? The answer is that we build another copy of annocheck using the address sanitizer and pass these to the sanitized annocheck to see what happens.

The last article explains how to build a sanitized copy of annocheck. Just open the tarball in a different location so that you don't overwrite the AFL++ build of annocheck and compile it. I built it in a dirrectory called annobin-9.94.test. You can do it anywhere, just correct the location.

Next, from the sanitized annocheck dir, we run:

$ ./annocheck /tmp/proj/out/default/crashes/id:000000,sig:06,src:000000,time:17826,op:havoc,rep:4
annocheck: Version 9.94.
Hardened: id:000000,sig:06,src:000000,time:17826,op:havoc,rep:4: FAIL: pie test because not built with '-Wl,-pie' (gcc/clang) or '-buildmode pie' (go)
Hardened: id:000000,sig:06,src:000000,time:17826,op:havoc,rep:4: FAIL: bind-now test because not linked with -Wl,-z,now
annocheck: annocheck.c:1250: find_symbol_in: Assertion `symndx == sym_hdr->sh_size / sym_hdr->sh_entsize' failed.
Aborted

Aborted? That stinks. Programs that call abort trick AFL++ into thinking this is a valid crash. This is because abort creates a core dump and that's exactly what AFL++ is looking for, The solution to this is that we need to override the abort call. The program depends on abort stopping execution. Removing it altogether means it will be running in code that the developer never intended. Since it's a false positive, we'll replace the call to abort with exit. So, let's go find it and recompile.

Grep'ing for abort doesn't find anything. But looking again at the error message says this occurs on line 1250 in annocheck.c. Opening that file finds an assert macro at that line. That means we need to override the assert macro with one that calls exit.

In annocheck/annocheck.h we find the include of assert.h around line 21. Comment that out using old style /*  */ syntax. Next, a little farther down after all the includes, put the new macro:

/* Added this so that we don't trap abort in AFL++ */
#define assert(e) ((void) ((e) ? 0 : __assert (#e, __FILE__, __LINE__)))
#define __assert(e, file, line) ((void)printf ("%s:%u: failed assertion `%s'\n", file, line, e), exit(1), 0)

Now, save, run make clean, and make. One other thing to note, annocheck makes temp directories to work with. When it aborts or crashes, it cannot clean those up. For now 

rm -rf annocheck.data.*

Back to fuzzing. We really do not need the previous runs. You can clear them out by 

rm -rf out/default

Actually, let's take a quick detour before going back to fuzzing. If you recall from above, I mentioned that the number of executions per second is kind of low. Let's fix that before we start fuzzing.

AFL++ has a mode of operation called deferred instrumentation mode. What this does is instead of starting an application from scratch, letting the runtime linker resolve all symbols, open and parse configuration files, and do general initialization...you can insert code that marks a spot where all of that has completed. AFL++ will fork new copies from that spot. The placement matters. You need to do it after general startup, and before it reads the file with the fuzzing data in it.

So, if we open annocheck.c and locate the main function at line 1943, let's scan down looking for the place to put it. It checks some arguments, checks elf version, loads rpm's configuration (that takes some time), and then processes the command line arguments. This looks like the spot. Copy and paste this at line 1967 just above processing command line args:

#ifdef __AFL_HAVE_MANUAL_CONTROL
  __AFL_INIT();
#endif

Let's make one more change. Annocheck creates a temporary directory while processing a file. Let's put that over in /tmp so as not to wear out a real hard drive. Look for a function named create_tmpdir. At line 1296 it copies a file name template into a buffer. Let's prepend "/tmp/" to that string. If you look a little further down, it calls concat and rewrites the tempdir variable with the current working directory as the path root. We don't want that. So, comment it out and the following line which outputs where the tempdir is. You have to use /*  */ commenting on this program.

Save and exit. Rerun make. In another terminal, run the following script. It will periodically clear out the annocheck.data temp directories that aren't removed on a crash.

 while  true ; do sleep 30 ; rm -rf /tmp/annocheck.data.* ; done

And in the other terminal, from the annocheck directory, run this:

afl-fuzz -i in -o out ./annocheck @@

Now let's look at the executions per second:

Now we are getting about 4.5 times the speed. Using deferred instrumentation mode makes a huge improvement. The reason we want it to run faster is that the more executions per second, the more test cases it can try in the same amount of time. On very mature programs, it can take days of fuzzing to find even one bug.

We'll stop here. You can let it run for a couple hours. You will probably have over a hundred "unique" crashes to choose from. In the next article, I'll go over how to sort through all that efficiently. If you do get a collection, remember to copy them to persistent disk storage. The /tmp dir goes away when you reboot the computer.

Wednesday, September 8, 2021

Fuzzing annocheck with Radamsa

Recently I heard that annocheck was crashing when scanning some files. That gave me an idea - fuzz it! I got the latest code in rawhide which at the time was 9.93, built it using rpmbuild, and cd'ed into it's source directory. Then:

make clean
CFLAGS="-fsanitize=address,undefined,null,return,object-size,nonnull-
attribute,returns-nonnull-attribute,bool,enum -ggdb -fno-sanitize-
recover=signed-integer-overflow" ./configure --without-tests --without-docs
make -j `nproc`

This sets it up for the address sanitizer so that we can spot fuzzing induced problems. Running make fails unexpectedly:

In function ‘follow_debuglink’,
    inlined from ‘annocheck_walk_dwarf’ at annocheck.c:1092:18:
annocheck.c:776:7: error: ‘%s’ directive argument is null [-Werror=format-
overflow=]
  776 |       einfo (VERBOSE2, "%s:  try: %s", data->filename, debugfile);      
\
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
annocheck.c:805:7: note: in expansion of macro ‘TRY_DEBUG’
  805 |       TRY_DEBUG ("%s", debug_file);
      |       ^~~~~~~~~
annocheck.c: In function ‘annocheck_walk_dwarf’:
annocheck.c:776:35: note: format string is defined here
  776 |       einfo (VERBOSE2, "%s:  try: %s", data->filename, debugfile);

This goes on an on. As I understand it, this is a bug in gcc and will be fixed in an upcoming release. However, what is stopping the build is the -Werror flag in the Makefile. So, you want to edit annocheck/Makefile and remove -Werror from the make file. Now running make will produce the binaries. Prior to working with annocheck, I wrote and compiled a little "Hello World" program in a test directory. In doing this, I left it unstripped. To prepare for fuzzing, I did this:

cd annocheck
mkdir in
cp ~/test/hello   in/test
mkdir -p /tmp/out
ln -s /tmp/out out

Then I used a script similar to the one discussed in the Fuzzing with Radamsa article from a couple days ago.

#!/bin/sh
LOG="in/test"
TLOG="out/test"

while true
do
        cat $LOG | radamsa > $TLOG
        ./annocheck $TLOG >/dev/null
        rc="$?"
        if [ "$rc" == "1" ] ; then
                exit 1
        fi
        rm -f $TLOG
        echo "==="
done

The basic idea is put a seed program into the "in" directory. Radamsa mutates it and writes it to the "out" directory, which is a symlink to /tmp. As explained in the previous article, you want to do fuzzing writes to a tmpfs file system so that you don't wear out real hardware. Running the script found this on the first test case:

hardened.c:1081:7: runtime error: null pointer passed as argument 1, which is
declared to never be null
AddressSanitizer:DEADLYSIGNAL
=================================================================
==48860==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc
0x7ffb239405bb bp 0x7fff6413a7d0 sp 0x7fff64139f60 T0)
==48860==The signal is caused by a READ memory access.
==48860==Hint: address points to the zero page.
    #0 0x7ffb239405bb in __interceptor_strcmp.part.0 (/lib64/libasan.so.
6+0x8d5bb)
    #1 0x42b9eb in interesting_sec /home/builder/working/BUILD/annobin-9.93/
annocheck/hardened.c:1081
    #2 0x42b9eb in interesting_sec /home/builder/working/BUILD/annobin-9.93/
annocheck/hardened.c:1074
    #3 0x40dd23 in run_checkers /home/builder/working/BUILD/annobin-9.93/
annocheck/annocheck.c:618
<snip>

I re-run the sanitizer it stops a lot on this error. But if you keep restarting it, eventually you may get this one:

=================================================================
==49841==ERROR: AddressSanitizer: heap-use-after-free on address
0x603000035d40 at pc 0x7fd46885170c bp 0x7ffcf350afc0 sp 0x7ffcf350a770
READ of size 1 at 0x603000035d40 thread T0
    #0 0x7fd46885170b in __interceptor_strcmp.part.0 (/lib64/libasan.so.
6+0x8d70b)
    #1 0x42637d in check_for_gaps /home/builder/working/BUILD/annobin-9.93/
annocheck/hardened.c:3709
    #2 0x42637d in finish /home/builder/working/BUILD/annobin-9.93/annocheck/
hardened.c:3889
    #3 0x42637d in finish /home/builder/working/BUILD/annobin-9.93/annocheck/
hardened.c:3844
    #4 0x40e3f6 in run_checkers /home/builder/working/BUILD/annobin-9.93/
annocheck/annocheck.c:691
    #5 0x40e3f6 in process_elf /home/builder/working/BUILD/annobin-9.93/
annocheck/annocheck.c:1517
    #6 0x40f690 in process_file /home/builder/working/BUILD/annobin-9.93/
annocheck/annocheck.c:1732
    #7 0x408880 in process_files /home/builder/working/BUILD/annobin-9.93/
annocheck/annocheck.c:1890
    #8 0x408880 in main /home/builder/working/BUILD/annobin-9.93/annocheck/
annocheck.c:1982
    #9 0x7fd467b36b74 in __libc_start_main (/lib64/libc.so.6+0x27b74)
    #10 0x409dad in _start (/home/builder/working/BUILD/annobin-9.93/
annocheck/annocheck+0x409dad)

0x603000035d40 is located 0 bytes inside of 25-byte region
[0x603000035d40,0x603000035d59)
freed by thread T0 here:
    #0 0x7fd468872647 in free (/lib64/libasan.so.6+0xae647)
    #1 0x412cfb in annocheck_get_symbol_name_and_type /home/builder/working/
BUILD/annobin-9.93/annocheck/annocheck.c:1463

previously allocated by thread T0 here:
    #0 0x7fd46881d967 in strdup (/lib64/libasan.so.6+0x59967)
    #1 0x412d25 in annocheck_get_symbol_name_and_type /home/builder/working/
BUILD/annobin-9.93/annocheck/annocheck.c:1466

SUMMARY: AddressSanitizer: heap-use-after-free (/lib64/libasan.so.6+0x8d70b)
in __interceptor_strcmp.part.0



This is a good one. Use after free can be exploitable under the right conditions. I reported these to the annobin developer. He replicated the results, fixed it up, and released 9.94 to Fedora the next day. I checked it and can confirm that Radamsa finds no other problems.

Note that if you actually wanted to fix the bug, the test case is in out/test. Just make the patch, recompile, and manually run the test case to confirm it's fixed. If you would like to preserve the test cases for later, remember that if you shut down the system they are gone unless you moved it from out to a more permanent location.

Radamsa is a good first fuzzer to reach for when starting to fuzz a new program. It's simple to setup and get running. And it finds all the low hanging fruit. But to go deeper, you need a guided fuzzer like AFL++. And that is exactly what we'll do in a future blog post.