Sunday, February 26, 2017

Audit log Normalization Part 2 - CSV format

As we discussed in the last blog, there is a new API in the auparse library that gives a unified view of the audit trail. In this post we will dig deeper into it to understand the details of what it provides.

Ausearch CSV format
Ausearch is the main tool for viewing audit logs. Two of its output formats use the normalizer interface to output the event: text and CSV. The text format was covered in the last blog post, so this time we'll look only at the CSV output.

Ausearch has a lot of options that allow it to pick out the events that you are interested in. But with the CSV mode, we don't have to be quite as picky. We can just grab it all and use big data techniques to subset the information in a variety of ways.

The following discussion applies to the audit-2.7.3 package. Previous and future formats may differ slightly in details and quantity of information. Let's get an event to look at:

$ ausearch --start today --just-one --format csv 2>/dev/null

The first line is a header that specifies the name of each column. This is useful when you import the data into a spreadsheet or an R program. In R it becomes the name of the field within the dataframe variable holding the audit trail. This will be important in the next couple of blogs. But for now, let's take a look at each field:

NODE - This is the name of the computer that the event comes from. For it to have any value, the admin would have to set the name_format and possibly name options in auditd.conf. Normally this is missing unless you do remote logging. So, don't worry if it's empty.

EVENT - This field corresponds to the type field in an event viewed with ausearch. If the event includes a syscall record, in almost all cases it will be called a SYSCALL event.

DATE - This is the date formatted as specified for the locale of your machine.

TIME - This is the time in hours, minutes, and seconds formatted as specified for the locale of your machine.

SERIAL_NUM - This is the serial number of the event. This is given to help locate the exact event if you needed to find it.

EVENT_KIND - This is metadata about the previously given EVENT field. This is useful in helping to subset and classify data. It currently has the following values:

  • unknown - This means the event can't be classified. You should never see this. If you do, please report it on the linux-audit mailing list or file a bug report.
  • user-space - This is a catch-all for user space originating events that are not classified another way.
  • system-services - These are system and service events. This includes system boot, shutdown, runlevel changes, service start and stop events.
  • configuration - This includes user space config changes such as setting the time. It also includes kernel changes such as loading netfilter rules and changes to the audit system.
  • TTY - This is kernel and user space TTY events.
  • user-account - This collects up all the events that relate to creating, modifying, and deleting users. It also includes events related to creating, modifying, or deleting groups.
  • user-login - This includes all the events related to authentication, authorization, assignment of credentials, login, establishing a session, ending a session, disposing of credentials, and logging out.
  • audit-daemon - These are events related specifically to the audit daemon itself.
  • mac-decision - This is a Mandatory Access Control policy decision. In terms of SELinux, it would be an AVC event.
  • anomaly - This is an anomalous event. This means the event is unusual and should be looked at carefully. This includes promiscuous mode changes for a network interface, program crashes, or programs dereferencing suspicious symlinks. In the future it will include events created by an IDS component as it identifies suspicious behavior.
  • integrity - This is integrity events coming from the IMA subsystem.
  • anomaly-response - This is for all events recorded by an IPS system that is responding to anomaly events.
  • mac - This is for any event related to the configuration of a Mandatory Access Control System.
  • crypto - This is for user space and kernel cryptography events.
  • virt - This is for any events related to the management of virtualization or containers.
  • audit-rule - This is for events that are directly related to the triggering of an audit rule. Normally this is syscall events.

SESSION - This is the session number of the user's login. When users log in, a unique session identifier is created and inherited by any program in that session. This is to allow tracking an action back to an exact login. Sometimes it can be "unset", which means it's not related to any login. This would indicate it's related to a daemon.

SUBJ_PRIME - This is the main way that the subject is identified. In most cases this is the interpreted value of the auid field. This was chosen over the numeric ID associated with an account because you may have several accounts with the same name but different account numbers across a data center.

SUBJ_SEC - This is the second way of identifying, or an alias to, the subject. Typically this is the interpreted uid field in an audit event. Normally the only time it's different from the prime/auid is when you su or sudo to a different account from your login.

ACTION - This is metadata about what the subject is doing to the object. This is helpful in subsetting or classifying the event. The action is determined from the event type, the syscall, or the op field in some events. If the normalizer can't determine the action from the syscall, it's simply "triggered-audit-rule". The list of actions is too big to list in this blog post.

RESULT - This is either success or failed.

OBJ_PRIME - This is the main way that an object is identified. It can be a file name, account, device, virtual machine, and more. Look at the OBJ_KIND description below for more ideas on what this could be.

OBJ_SEC - This is the secondary way to identify the object. For example, the path to a file is the main way to identify a file. The secondary identifier is the inode. In some cases this may be a terminal, vm, or remote port.

OBJ_KIND - This is metadata that lists what kind of object the event is about. This can be useful in subsetting or classifying the events to see what's happening to specific kinds of things. Current values are as follows (these are self-explanatory):
  • unknown
  • fifo
  • character-device
  • directory
  • block-device
  • file
  • file-system
  • symlink
  • socket
  • process
  • firewall
  • service
  • account
  • user-session
  • virtual-machine
  • printer
  • system
  • admin-defined-rule
  • audit-config
  • mac-config
  • memory

HOW - This is how the subject did something to the object. Typically this is the program used to do it. In some cases the program is an interpreter. In that case the normalizer lists the command being run.
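To make the column list above a little more concrete, here is a minimal Python sketch (Python rather than the R used later in this series, only so that it is self-contained) that loads CSV text and subsets it by column name. The column order is an assumption taken from the order the fields are described in this post, the rows are hand-written (the first one is modeled on the userdel event from the Part 1 post), and load_events, count_by, and acting_as_other are hypothetical helper names, not part of any audit tool.

```python
import csv
import io
from collections import Counter

# Illustrative data only: the column order is assumed from this post and the
# rows are hand-written, not real ausearch output.
SAMPLE_CSV = """\
NODE,EVENT,DATE,TIME,SERIAL_NUM,EVENT_KIND,SESSION,SUBJ_PRIME,SUBJ_SEC,ACTION,RESULT,OBJ_PRIME,OBJ_SEC,OBJ_KIND,HOW
,SYSCALL,02/15/2017,13:49:52,894,audit-rule,4,sgrubb,root,renamed,success,/etc/gshadow+,17040896,file,/usr/sbin/userdel
,SYSCALL,02/15/2017,13:50:10,901,audit-rule,4,sgrubb,sgrubb,opened-file,failed,/etc/shadow,1234,file,/usr/bin/cat
,SERVICE_START,02/15/2017,13:51:00,910,system-services,unset,system,root,started-service,success,sshd,,service,/usr/lib/systemd/systemd
"""


def load_events(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by(events, column):
    """Count how many events carry each value of the given column."""
    return Counter(event[column] for event in events)


def acting_as_other(events):
    """Events where the secondary subject differs from the primary one."""
    return [e for e in events if e["SUBJ_PRIME"] != e["SUBJ_SEC"]]


events = load_events(SAMPLE_CSV)
print(count_by(events, "EVENT_KIND"))
print([e["SERIAL_NUM"] for e in acting_as_other(events)])
```

With a real file, the same functions would be fed the contents of the CSV that ausearch wrote instead of SAMPLE_CSV.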

Extra Fields
The ausearch program has a few command-line switches that will cause even more fields to be emitted.

--extra-time - This causes ausearch to dump the following extra fields: YEAR,MONTH,DAY,WEEKDAY,HOUR,GMT_OFFSET

--extra-labels - This causes ausearch to add SUBJ_LABEL and OBJ_LABEL, which come from the Mandatory Access Control system.

--extra-keys - This causes ausearch to dump the KEY value which is only found in syscall events.

Malformed Events
If we create an audit.csv file as follows:

$ ausearch --start today --format csv > audit.csv

And open it with libreoffice, you will notice that some rows do not have complete information. The reason is that not all events have the proper and required fields. There are one or two in user space, but the majority come from the kernel. Events that are known to be malformed are:


The fix is to update these events so that they include all the required fields. Until then, the analysis will have faults and not work correctly. We can use techniques to pull these events out of the analysis. But in the long run, they should be fixed.
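One simple way to pull such events out of the analysis is to set aside any row that is missing a column the analysis relies on. The following is a minimal Python sketch under stated assumptions: the choice of SUBJ_PRIME, ACTION, RESULT, and OBJ_PRIME as the required columns is mine, not a list taken from the audit package, the sample rows are hand-written, and split_malformed is a hypothetical helper.

```python
import csv
import io

# Assumption: these are the columns the analysis cannot do without.
REQUIRED = ("SUBJ_PRIME", "ACTION", "RESULT", "OBJ_PRIME")

# Hand-written sample rows; the second one imitates a malformed event.
SAMPLE_CSV = """\
EVENT,SERIAL_NUM,SUBJ_PRIME,ACTION,RESULT,OBJ_PRIME
SYSCALL,894,sgrubb,renamed,success,/etc/gshadow+
EXAMPLE_EVENT,895,,,success,
"""


def split_malformed(rows, required=REQUIRED):
    """Return (complete, malformed) lists based on empty required columns."""
    complete, malformed = [], []
    for row in rows:
        missing = [name for name in required if not (row.get(name) or "").strip()]
        (malformed if missing else complete).append(row)
    return complete, malformed


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
complete, malformed = split_malformed(rows)
print(len(complete), "complete,", len(malformed), "malformed")
```

Keeping the malformed rows in their own list, rather than dropping them, makes it easy to see which event types still need fixing.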

This wraps up the overview of the fields that you will find in the CSV output of ausearch. We will next start into analytical programs since we have background material about what we will be looking at.

Wednesday, February 22, 2017

Audit log Normalization Part 1

In this posting I want to explain more about the audit log normalizer. It represents a fundamental shift in what information is available from the audit logs. If you take a look at a typical event as it comes from the kernel, you will see that the information is spread across many records and fields. For example, consider this event:

type=PROCTITLE msg=audit(1487184592.084:894): proctitle=7573657264656C007374657665
type=PATH msg=audit(1487184592.084:894): item=4 name="/etc/gshadow" inode=17040896 dev=08:32 mode=0100000 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:shadow_t:s0 nametype=CREATE
type=PATH msg=audit(1487184592.084:894): item=3 name="/etc/gshadow" inode=17040557 dev=08:32 mode=0100000 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:shadow_t:s0 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=2 name="/etc/gshadow+" inode=17040896 dev=08:32 mode=0100000 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:shadow_t:s0 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=1 name="/etc/" inode=17039361 dev=08:32 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 nametype=PARENT
type=PATH msg=audit(1487184592.084:894): item=0 name="/etc/" inode=17039361 dev=08:32 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 nametype=PARENT
type=CWD msg=audit(1487184592.084:894): cwd="/root"
type=SYSCALL msg=audit(1487184592.084:894): arch=c000003e syscall=82 success=yes exit=0 a0=7fff7a36b340 a1=562ee0c337c0 a2=7fff7a36b2b0 a3=20 items=5 ppid=2350 pid=6755 auid=4325 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4 comm="userdel" exe="/usr/sbin/userdel" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="identity"

How do you know what it means?

The Requirements
All audit events have a basic requirement stated in Common Criteria and restated in other security guides such as PCI-DSS. The events have to capture:

* Who did something (the subject)
* What did they do
* To what was it done (the object)
* What was the result
* When did it occur

The Implementation
The Linux kernel has hardwired events and configurable events. The configurable events are based on syscall filtering. You write a rule that says whenever this action occurs, let me know about it and label it with a key to remind me what it's about. For example, this rule triggered the above event:

-w /etc/gshadow -p wa -k identity
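Taken apart, the rule has three pieces: the path to watch (-w), the kinds of access that trigger it (-p), and the key used to label the event (-k). As a small illustration, here is a Python sketch that splits such a rule into those pieces. parse_watch_rule is a hypothetical helper that only understands these three options; the meaning of the permission letters follows the auditctl man page.

```python
import shlex

# Meaning of the -p permission letters, per the auditctl man page.
PERMS = {"r": "read", "w": "write", "x": "execute", "a": "attribute change"}


def parse_watch_rule(rule):
    """Split a '-w PATH -p PERMS -k KEY' watch rule into its parts."""
    parsed = {"path": None, "perms": [], "key": None}
    args = iter(shlex.split(rule))
    for flag in args:
        value = next(args, None)  # every option handled here takes one value
        if flag == "-w":
            parsed["path"] = value
        elif flag == "-p":
            parsed["perms"] = [PERMS[letter] for letter in value]
        elif flag == "-k":
            parsed["key"] = value
    return parsed


print(parse_watch_rule("-w /etc/gshadow -p wa -k identity"))
```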

This rule says, notify me whenever the file /etc/gshadow is written to. There is a list of syscalls that are considered to have written to a file. We can see the list in the kernel's audit_dir_write.h.


In our case the rename syscall, which is mentioned on line 2 of audit_dir_write.h, is the one that triggered the event. When this happens, the kernel grabs a bunch of information about the syscall and places it in the event as a syscall record. This occurs in the syscall filter where everything about the process and syscall is readily available. The act of doing the rename occurs in several places throughout the kernel. In those places it might be too invasive or have incomplete information available that prevents writing a good event. What the kernel does is add information about what is being acted upon as auxiliary records. These are things like PATH, CWD, SOCKADDR, IPC_OBJ, etc.

Translating the output
What this means is that identifying a subject is pretty straightforward, but identifying the object takes a lot of specialized knowledge.

To understand the pieces of the event, you would need to look in the SYSCALL record at the auid and uid fields to find the subject. What they are doing is determined by the actual syscall. In this case, rename. Now, which of the 5 PATH records is the one we want? In this case we want to know the existing file that disappeared. It's the one that got renamed or acted upon. This would be a PATH record with nametype=DELETE. Two records carry that nametype here; the one we want is item=2, whose inode matches the CREATE record, and its "name" field tells us that it's /etc/gshadow+. Easy, right? :-)
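The manual lookup described above can be sketched in a few lines of Python. This only illustrates the by-hand procedure, not how auparse does it. The records are copied from the event above and trimmed to the fields used here, parse_record and renamed_source are hypothetical helpers, and comparing inodes to decide between the DELETE records is my own shortcut.

```python
import re
import shlex

# Records copied from the event above, trimmed to the fields used here.
# syscall=82 is rename on x86_64.
EVENT = """\
type=PATH msg=audit(1487184592.084:894): item=4 name="/etc/gshadow" inode=17040896 nametype=CREATE
type=PATH msg=audit(1487184592.084:894): item=3 name="/etc/gshadow" inode=17040557 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=2 name="/etc/gshadow+" inode=17040896 nametype=DELETE
type=SYSCALL msg=audit(1487184592.084:894): syscall=82 success=yes auid=4325 uid=0 exe="/usr/sbin/userdel"
"""

HEADER = re.compile(r"type=(\S+) msg=audit\((\d+\.\d+):(\d+)\): (.*)")


def parse_record(line):
    """Turn one raw audit record into a dict of its fields."""
    rec_type, timestamp, serial, body = HEADER.match(line).groups()
    fields = dict(token.split("=", 1) for token in shlex.split(body))
    fields.update(type=rec_type, timestamp=float(timestamp), serial=int(serial))
    return fields


def renamed_source(records):
    """Old name in a rename event: the DELETE path whose inode also
    appears in the CREATE path."""
    paths = [r for r in records if r["type"] == "PATH"]
    created = {r["inode"] for r in paths if r["nametype"] == "CREATE"}
    for r in paths:
        if r["nametype"] == "DELETE" and r["inode"] in created:
            return r["name"]
    return None


records = [parse_record(line) for line in EVENT.splitlines()]
syscall = next(r for r in records if r["type"] == "SYSCALL")
print("subject auid/uid:", syscall["auid"], syscall["uid"])
print("object:", renamed_source(records))
```

Even this small sketch only covers one syscall, which is exactly the problem with doing it by hand.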

You might say there has to be an easier way, especially given that this is just one syscall and there are over 300 syscalls which are very different from one another.

The audit normalizer
The easier way has just been created with the audit-2.7 release, and it's the auparse_normalizer API. It builds on top of the auparse library functionality. The auparse library has a concept of an internal cursor that walks across events, then records within an event, and then individual fields within a record. So, if you are analyzing a log, you would be iterating through events using the auparse_next_event() function call.

To get a normalized view of the current event, a programmer simply calls auparse_normalize(). This function looks up all the information about subject, object, action and assembles it into an opaque data structure which is associated with the event.

To get at the information, there are some accessor functions such as auparse_normalize_subject_primary(). This function places the internal cursor on top of the field that was determined to be the subject. Then you can use the normal field accessor functions such as auparse_interpret_field(), which, in our event, takes the auid and converts it to the account name.

There are other accessor functions in the auparse_normalize API. The ones that return an int are ones that position the internal cursor. The ones that return a char * point to internal metadata that does not exist in the actual record. These can be used directly.

Parlor Tricks
While working on the normalizer, I realized that it might be possible to turn logs into English sentences. I was thinking that we could create the sentence in the following form:

On 1-node at 2-time 3-subj 4-acting-as 5-results 6-action 7-what 8-using

Which maps to:

1) node
2) time
3) auid, failed logins=remote system
4) uid (only when uid != auid)
5) res - successfully / unsuccessfully
6) op, syscall, type, key
7) path, address, account, policy, rule
8) exe,comm

This layout forms the basis for the output from "ausearch --format text". Using this commandline option, our event turns into this:

At 13:49:52 02/15/2017 sgrubb, acting as root, successfully renamed /etc/gshadow+ using /usr/sbin/userdel

This works out pretty well in most cases. I am still looking over events and making sure that they really do state the truth about an event. I haven't made any significant changes in a couple weeks, so I think we are close to final.
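As a rough illustration of the eight-slot template, here is a minimal Python sketch that fills the slots and joins them. It is not the ausearch implementation: to_sentence and its argument names are hypothetical, and the wording rules (dropping the node when it is empty, adding "acting as" only when the two identities differ) are inferred from the template and the single example sentence above.

```python
def to_sentence(node, when, subj, acting_as, result, action, what, how):
    """Assemble the eight slots of the template into one sentence."""
    parts = [f"On {node} at {when}" if node else f"At {when}"]

    # Slot 4 only appears when the uid differs from the auid.
    who = subj
    if acting_as and acting_as != subj:
        who += f", acting as {acting_as},"
    parts.append(who)

    parts.append("successfully" if result == "success" else "unsuccessfully")
    parts.append(action)
    parts.append(what)
    if how:
        parts.append(f"using {how}")
    return " ".join(parts)


print(to_sentence("", "13:49:52 02/15/2017", "sgrubb", "root",
                  "success", "renamed", "/etc/gshadow+", "/usr/sbin/userdel"))
```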

Next post we will dive in deeper with the normalizer and discuss how the CSV output is created. It fully uses the auparse_normalizer API to give you everything it knows about an event.

Monday, February 20, 2017

Pivot Tables

Today I think we have everything ready to start pulling the pieces together and show how to use the audit normalizer to create data that can be analyzed more easily.

The first thing we need to do is make sure we have the libcurl-devel rpm package installed. Several R libraries depend on the curl library to retrieve web pages and things. Building packages will fail if it can't find the system header file. Go ahead and install it if it's not on your system.

Once that is done, start up RStudio.

R Libraries
R has a master library repository called CRAN (Comprehensive R Archive Network). We need to install a few libraries so that we can use them for our programs. The libraries can be easily installed from the R console. In RStudio the R console is the one that has text similar to the following:

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

In this window, type:


It will then download and compile the library and all its dependencies. It will install the resulting libraries to ~/R/x86_64-redhat-linux-gnu-library/3.3/ under their own subdirectory. If you have trouble building libraries, the first thing to check is your mount options. If you are security conscious, then you probably have /tmp mounted as noexec. Not all packages compile in /tmp, so you may install a couple packages and then hit this problem.

Let's go ahead and install a few more. We won't use them all today, but we will be using them in future blogs.


Gathering audit data
Ok. Now that we have some libraries installed, we need the data to analyze. Open a shell and do the following:

$ cd ~/R/audit-data
$ ausearch --start today --format csv > audit.csv

If for some reason ausearch fails because you don't have permissions to read the logs, you have two options. You can su/sudo to root and do this. But that means you also have to copy the file to your unprivileged home directory and change permissions on it. Or you can set a value for log_group in auditd.conf as suggested in a previous post. You have to stop/start auditd to have it fix directory permissions.

Pivot Table Program
I want to start with a very simple and powerful program to whet your appetite for the kinds of things R can do. In RStudio, click on File and then the New Project menu item. Click Empty Project.

Then Click New Directory.

For Directory name choose "audit".

Then click Create Project. Once this completes, let's create our script. Click on New File menu item, then select R script.

For our first program, type in the following (or copy and paste):

audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)

That's all of it. Three lines. You can single step the program by clicking on the Run icon at the top of the program file. You can also run the whole program by highlighting it and clicking the Run icon. Either way you should wind up with a pivot table loaded with your audit data in the viewer window on the lower right side of RStudio.

Click on the Zoom icon and you will get a detached window that is bigger.

The way that the pivot table works is you can drag and drop the buttons in gray to the areas in blue. This creates a matrix of information that you can view. If you click on the buttons, you can make them selective about their information.

Let's check for failed logins. Grab "action" and drop it in the blue column, grab "subj_prime" and drag it underneath "action".

Then grab "result" and drag it to the blue row above the matrix.

Click on "action" so that it opens the filter dialog. Click on the "Select None" button to clear it, and then scroll down and click on the check box for "logged-in".

Then open the "result" filter dialog and select only "failed". My screen shot below shows success because I don't have any failed logins. But I'm sure you get the idea.

If you don't like to do all the dragging and clicking, you can do this programmatically. The following code will show failed file access:

# Assumes the rpivotTable package and the audit data frame loaded earlier
rpivotTable(audit,
            cols = c("RESULT"),
            rows = c("ACTION", "OBJ_PRIME"),
            inclusions = list(ACTION = list("opened-file"),
                              RESULT = list("failed")))
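For readers who want to see what that selection computes, here is a small Python sketch of the same counting, written with only the standard library. It is an illustration of the idea, not a replacement for the interactive table: the rows are hand-written stand-ins for the parsed audit.csv, and pivot is a hypothetical helper whose inclusions argument mimics the filter dialogs.

```python
from collections import Counter

# Hand-written rows standing in for the parsed audit.csv.
ROWS = [
    {"ACTION": "opened-file", "OBJ_PRIME": "/etc/shadow", "RESULT": "failed"},
    {"ACTION": "opened-file", "OBJ_PRIME": "/etc/shadow", "RESULT": "failed"},
    {"ACTION": "opened-file", "OBJ_PRIME": "/etc/passwd", "RESULT": "success"},
    {"ACTION": "renamed", "OBJ_PRIME": "/etc/gshadow+", "RESULT": "success"},
]


def pivot(rows, row_keys, col_key, inclusions):
    """Count rows per (row_keys..., col_key) cell, keeping only rows whose
    values are in the inclusion sets."""
    table = Counter()
    for row in rows:
        if all(row[col] in allowed for col, allowed in inclusions.items()):
            cell = tuple(row[k] for k in row_keys) + (row[col_key],)
            table[cell] += 1
    return table


failed_opens = pivot(
    ROWS,
    row_keys=("ACTION", "OBJ_PRIME"),
    col_key="RESULT",
    inclusions={"ACTION": {"opened-file"}, "RESULT": {"failed"}},
)
print(failed_opens)
```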

This should give you some things to ponder. R really does make it easy to program complex things. Play around with the pivot table. You can't really hurt anything. The next blog will go into detail about what the fields in the CSV file are so that things make more sense as we start doing other visualizations.

Friday, February 17, 2017

Introduction to Linux Audit

OK, we are about done with all the prerequisites before we jump into doing the more interesting analytics. I expect that there will be some people attracted to this series of articles that have a limited understanding of the Linux audit system and how to set it up. But, I would like everyone to be able to recreate the analysis that we are going to do. So, we need to go over some basics first. If I show you neat diagrams and show you how I created them, it's nice. But if it also works on your system with your data, then it's much more valuable.

Audit Pieces
The first thing we need to discuss is all the pieces that are in play or may come up in future discussions. Having a common set of terms will let the experienced and the newbie both understand what is being presented. Below is a diagram of the various pieces of the audit system and what they do.

Fig 1

First let me explain the colors. Light blue are the things that create the events, purple is the reporting tools, red is the controller, gray is the logs, and green is the real-time components.

Audit events can be created in two ways. There are applications that send events any time something specific happens. For example, if you log in to sshd, it will send a series of events as the log in proceeds. It is considered a trusted application and it always tries to send events. If the audit system is not enabled, the event is discarded. Otherwise the kernel accepts the event, time stamps it, adds sender information to the event, and queues it for delivery to the audit daemon, auditd. The only job that the audit daemon has is to reliably dequeue and write events to the log and the event dispatcher, audispd.

The other way that events are created is by the kernel observing system activity that matches a rule loaded by auditctl. The kernel is the thing that creates most events...assuming you loaded rules. It uses a first matching rule system to decide if a syscall is of any interest.

During each system call made by a program, the kernel will look at the rules and compare attributes of the calling program with a rule. These attributes could be the loginuid, uid, executable name, login session, or many other things. You can look at the auditctl man page for a comprehensive list.

When a rule matches, the event is created which records a lot of information about the process that triggered the event. When that happens, the event is time stamped and queued to the audit daemon for disposition.
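The first-matching-rule idea can be shown with a toy model in Python. This is a deliberately simplified sketch, not the kernel's data structures or code: Rule and first_match are hypothetical names, and the two sample rules are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """A toy stand-in for a loaded audit rule (not the kernel's structure)."""
    action: str                      # "always" records the event, "never" suppresses it
    matches: Callable[[dict], bool]  # predicate over the syscall's attributes
    key: Optional[str] = None


def first_match(rules, syscall):
    """Walk the rules in order and stop at the first one that matches."""
    for rule in rules:
        if rule.matches(syscall):
            return rule
    return None  # no rule is interested in this syscall


rules = [
    Rule("never", lambda s: s["auid"] == "unset"),
    Rule("always", lambda s: s["syscall"] == "rename", key="identity"),
]

hit = first_match(rules, {"syscall": "rename", "auid": 4325, "exe": "/usr/sbin/userdel"})
print(hit.action, hit.key)
```

Because evaluation stops at the first match, the order in which rules are loaded matters: the same rename made by a daemon with an unset auid would hit the "never" rule first and produce no event.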

The audit system is configured by a program named auditctl. It can, among other things, enable/disable the audit system, load rules, list rules, and report status.

The audit daemon simply polls the kernel for events. When it sees an event is available, it looks at its configuration to determine what to do with the event. It may or may not write the event to disk. It may or may not send the event to its real-time interface. Normally, it sends events to both.

Once an event is written to disk, the reporting tools ausearch, aureport, and aulast can be used. Ausearch is specialized at picking events out of the audit trail matched by command line options. It is the recommended tool for looking at audit events because it can detangle interlaced events, group the records into a complete event, and interpret the event several ways. Aureport is a program that can summarize kinds of events. The aulast program is specialized at showing login sessions.

If the audit daemon has been configured to send events to the real-time interface (the dispatcher), it queues the event to audispd, whose job is to distribute the event to any process wanting to see the events in real-time. These could be alerting programs or remote logging programs.

Auditd configuration
There are only a couple settings I want to discuss for the purposes of getting events ready to do analysis. There are other important settings which I will save for another blog post. I will simply list the settings from /etc/audit/auditd.conf that are important for analysis and data science:

local_events = yes
write_logs = yes
log_group = audit
log_format = enriched
flush = INCREMENTAL_ASYNC
freq = 50

The first one determines if the audit daemon should record local events. The answer should be "yes" unless the daemon is running in a container that has no access to the audit netlink interface. In that case it can still be used for aggregating logs from other systems by setting this value to "no".

The write_logs setting should be set to "yes" to put events to disk. This is normal although some people prefer not writing logs and instead, immediately send all events to a remote logging system.

Normally, to see audit logs you must be root. This is not always desirable and it interferes with an easy work flow. The log_group configuration item should be set to a group that you have access to. It could be your own group name, wheel, or maybe you make a special purpose audit group to give people access to logs without the need of privileges.

The log_format should be set to "enriched" so that extra information is recorded about the event at the time it's logged so that the event can be analyzed on other systems.

The flush technique should be set to INCREMENTAL_ASYNC which gives the highest performance with a reasonable guarantee the event made it to disk. The freq setting tells how often to flush written events to disk. Set this to 50 for normal systems or 100 if you have a busy system.
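A quick way to compare a configuration against these recommendations is a small script. The following Python sketch parses "name = value" text and reports the settings that differ. parse_conf and check are hypothetical helpers, the sample text is made up, and log_group and freq are left out of the check on purpose because their best values depend on the site.

```python
# Values recommended in this post for analysis work.
RECOMMENDED = {
    "local_events": "yes",
    "write_logs": "yes",
    "log_format": "enriched",
    "flush": "incremental_async",
}

# Made-up sample text, not a real auditd.conf.
SAMPLE = """\
# sample only
local_events = yes
write_logs = yes
log_group = audit
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
"""


def parse_conf(text):
    """Parse 'name = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().lower()] = value.strip()
    return settings


def check(settings):
    """Return (name, wanted, found) for every setting that differs."""
    problems = []
    for name, wanted in RECOMMENDED.items():
        found = settings.get(name)
        if found is None or found.lower() != wanted:
            problems.append((name, wanted, found))
    return problems


for name, wanted, found in check(parse_conf(SAMPLE)):
    print(f"{name}: want {wanted}, found {found}")
```

To check a live system, the text passed to parse_conf would be read from /etc/audit/auditd.conf, which normally requires root.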

Audit Rules
To make sure we get important system events, we need to configure some rules. If you are on a Fedora system, the first thing to do is delete /etc/audit/rules.d/audit.rules. It has a rule in it that basically turns off the audit system. This was intentionally mandated by Fedora's governing board, FESCo. Anyways, undo the damage. Next let's copy a couple rules to the right place:

$ cp /usr/share/doc/audit/rules/10-base-config.rules /etc/audit/rules.d/
$ cp /usr/share/doc/audit/rules/30-stig.rules /etc/audit/rules.d/

I'd recommend one change to the stig rules. Around line 95 are 6 rules that record successful and unsuccessful uses of chmod. This creates way too many events. Either delete them or comment them out by placing a '#' at the beginning of the rule.
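If you would rather not edit the file by hand, the commenting-out step can be scripted. This Python sketch works on rule text and prefixes '#' to syscall rules whose -S list names chmod, fchmod, or fchmodat. The sample lines are illustrative and not copied from 30-stig.rules, and is_chmod_rule and comment_out_chmod are hypothetical helpers; review the result before loading it.

```python
CHMOD_SYSCALLS = {"chmod", "fchmod", "fchmodat"}

# Illustrative lines only, not copied from 30-stig.rules.
SAMPLE_RULES = """\
-a always,exit -F arch=b64 -S chmod,fchmod,fchmodat -F auid>=1000 -F key=perm_mod
-a always,exit -F arch=b64 -S chown,fchown -F auid>=1000 -F key=perm_mod
-w /etc/gshadow -p wa -k identity
"""


def is_chmod_rule(line):
    """True for a syscall rule whose -S list names one of the chmod calls."""
    tokens = line.split()
    if not tokens or tokens[0] != "-a":
        return False
    for flag, value in zip(tokens, tokens[1:]):
        if flag == "-S" and CHMOD_SYSCALLS & set(value.split(",")):
            return True
    return False


def comment_out_chmod(text):
    """Return the rule text with every chmod rule commented out."""
    lines = ["#" + line if is_chmod_rule(line) else line
             for line in text.splitlines()]
    return "\n".join(lines) + "\n"


print(comment_out_chmod(SAMPLE_RULES), end="")
```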

Standard Directories
We also want to define the locations of a couple standard directories in your home directory for use in our experiments.

└── R
    ├── audit-data
    ├── extra-data
    └── x86_64-redhat-linux-gnu-library

RStudio will create an 'R' directory for you in your home directory when you install libraries. This is where we want to keep all of our R scripts, data, and libraries. Go ahead and create this directory structure with the same names if it does not exist.
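Creating the layout is a one-time job, and it can be scripted. Here is a minimal Python sketch; make_layout is a hypothetical helper, and it only creates the two data directories, on the assumption that the library directory appears on its own when R installs packages.

```python
from pathlib import Path


def make_layout(home):
    """Create the R working directories under the given home directory."""
    created = []
    for name in ("audit-data", "extra-data"):
        path = Path(home) / "R" / name
        path.mkdir(parents=True, exist_ok=True)  # harmless if it already exists
        created.append(path)
    return created
```

Calling make_layout(Path.home()) sets things up for the current account.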

Aliases That Help
A few aliases in the .bashrc file can make for less typing later. I'd recommend:

alias autext='ausearch --format text'
alias aucsv='ausearch --format csv'

Log out and log back in. Try 'autext --start today'.

This concludes the last of the prerequisite and setup blogs. Starting next week, we will begin talking about how to use R to do some pretty interesting things with the audit logs. In the meantime it might be good to brush up on R programming using one of the tutorials I mentioned in my previous posting.

Thursday, February 16, 2017

Building R Studio

NOTE: This post has been updated (11 Aug 2017) to point to a directory to download the Rstudio SRPM from rather than to point to a specific version.

The best computer language for analyzing data is the "R" programming language. It's a different kind of language than anything I've used. When you issue a command, it typically applies to all data at once. If you find yourself writing "for loops", you're probably doing something wrong. When you really get the hang of using R, it will feel a bit like cheating. (I did that in 5 lines of code? Seriously?)

At first it kind of looks alien. But it is easy to learn. I highly recommend anyone interested in R to take the Coursera classes on Data Science from Johns Hopkins. There are others that might be equally good, but I have not seen them.

There are other methods to learn R. For example, the web site is available and teaches the basics for free. In one hour or so, you will most likely be able to read other people's programs and have a basic idea about what they are doing. Becoming proficient really requires more effort, such as writing your own programs. This is when the real questions come up.

As you get into the Data Science and "R" world, there is one recurring theme. "Start by installing a copy of RStudio on your system." The classes say it, blogs say it, tutorials say it, books say it. There's a reason for that. It's absolutely wonderful and an essential tool for wrangling data.

The best part is that RStudio is open source just like R is. In nearly all tutorials you will notice they all say to go get it from the RStudio web site. This gets you a prebuilt version of the software as an rpm, deb, exe, or tarball. One thing you will notice is that there is no srpm. Another thing you will notice is that no distribution actually ships RStudio. Why is that, you might ask?

I think the answer is that it was never written in a way to be distributed by Linux distributions. It wants to put things in /usr/local which is fine for a 3rd party application, but against packaging rules for distributions. You cannot magically pass it a $DESTDIR to prefix installations for packaging purposes. Also, it wants to use its own copies of libraries rather than the system versions. This is also against most packaging guidelines. I suspect that this is really about the developers of RStudio not having to waste time debugging odd problems on various platforms due to each one being out of sync with each other.

If you are like me, you like to have the source code "just in case" and build your own package. I have looked all over the place and have found little to no guidance on building RStudio from scratch. I see where others are trying to do it. But what I have here in today's blog is actual instructions on how to build your own copy. I have already created a spec file, which is most of the hard work. It is an older copy but it works fine. Some day I will update to the 1.0 release but it will take some effort.

I won't try to explain the spec file. It does not attempt to move things from /usr/local. It's just too much trouble and I'd rather get on with it. The build has been corrected to use system libraries, though.

So, let's build a copy of RStudio. You need to do this on a desktop system. I have tested that both Gnome and Cinnamon produce usable builds. I have not tested this on KDE or any other desktop. The reason to build on a desktop system is that some steps in the build look for desktop components. When those are missing, the build either completes but produces an unusable package or errors out in some weird, cryptic way. I have not been able to determine what this dependency is in order to add it to the BuildRequires. So, do this on a desktop from your build account. If you need to set up a build environment, see this blog article.

First, let's install the build dependencies:

$ su - root
# dnf install git clang cmake ant java R-core-devel libuuid-devel openssl-devel bzip2-devel zlib-devel libffi boost-devel pango-devel xml-commons-apis hunspell-devel pandoc qt-devel qt5-qtwebkit-devel qt5-qtlocation-devel qt5-qtsensors-devel qt5-qtsvg-devel qt5-qtxmlpatterns-devel qt5-qtwebchannel-devel gstreamer-plugins-base-devel
# exit

Next download the SRPM from here:

Now build as normal user (adjusting the rpm name for the one that you downloaded):

$ rpm -ivh R-studio-desktop-1.0.146-1.fc25.src.rpm
$ cd working/BUILD
$ rpmbuild -bb ../R-studio-desktop/R-studio-desktop.spec

To free up room after successful build:

$ rm -rf rstudio-1.0.146

When completed (adjust path for OS, version, and home dir):

$ su - root
# rpm -ivh /home/builder/working/RPMS/x86_64/R-studio-desktop-1.0.146-1.fc25.x86_64.rpm

This will take about an hour to build...maybe longer. RStudio is a huge package. The srpm is 176 MB. My x86_64 version is 19 MB. A whole lot of something gets compressed or jettisoned during the build.

After installing it you should be able to run it. The icon is in the main menu under "Programming". When you start it up, you should see something like this:

I'll explain more about RStudio in the next blog. If you choose to just install the rpm from the RStudio web site, there's nothing wrong with that. It's what all data scientists do today...because until now, there were almost no other options.

Wednesday, February 15, 2017

Setting up a rpm build environment

I know everyone would like me to jump right into talking about audit, but I want to take a detour first to make a couple of posts that will be referenced in future articles. So, I'd like to get the first one out of the way, which is building your own packages with rpm.

The reason we need to do this is that not everything that you might want is in Fedora. Sometimes a package is so hard to package that no distribution actually has it. For example, it may violate packaging guidelines as the build scripts are too complex to change without a whole lot of study. In a future blog post we will need to build one of these.

To start off with, you may want to have a specific account on your system for building packages. If so, make one and log into that account so that we can set things up.

When I build packages, I like to have things in specific places. I like to have the tar file, spec file, and patches all in one directory named after the package. I do not like all sources jumbled together. We can get this with a little planning.

I prefer to have the following directory layout:

└── working
    ├── BUILD
    ├── BUILDROOT
    ├── RPMS
    │   ├── noarch
    │   └── x86_64
    ├── SRPMS
    └── tmp

To get this layout, do the following in your build account home directory:

$ mkdir -p working/{BUILD,BUILDROOT,SRPMS,tmp}
$ mkdir -p working/RPMS/{noarch,x86_64}

Next we want to add a .rpmmacros file to the home directory that will use this structure for building packages. The following assumes the account is "builder". Copy and change it as appropriate to your build account. Save it as .rpmmacros. The explanations are all inline.

# Custom RPM macros configuration file for building RPM packages
# as a non-root user.

# %_topdir defines the top directory to be used for RPM building
# purposes. It is the default ROOT of the buildsystem.
%_topdir        /home/builder/working

# %_sourcedir is where the source code tarballs, patches, etc.
# will be placed after you do an
# "rpm -ivh somepackage.1.0-1.src.rpm"
#%_sourcedir     %{_topdir}/%{name}-%{version}
%_sourcedir     %{_topdir}/%{name}

# %_specdir is where the specfile gets placed when installing a
# src.rpm. I prefer the specfile to be in the same directory
# as the source tarballs, etc.
%_specdir       %{_sourcedir}

# %_tmppath is where temporary scripts are placed during the RPM
# build process as well as the %_buildroot where %install normally
# dumps files prior to packaging up the final binary RPM's.
%_tmppath       %{_topdir}/tmp

# %_builddir is where source code tarballs are decompressed, and
# patches then applied when building an RPM package
%_builddir      %{_topdir}/BUILD

# %_buildroot is where files get placed during the %install section
# of spec file processing prior to final packaging into rpms.
# This is oddly named and probably should have been called
# "%_installroot" back when it was initially added to RPM.
%_buildroot     %{_tmppath}/%{name}-%{version}-root

# %_rpmdir is where binary RPM packages are put after being built.
%_rpmdir        %{_topdir}/RPMS

# %_srcrpmdir is where src.rpm packages are put after being built.
%_srcrpmdir     %{_topdir}/SRPMS
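
Since %_topdir is the only account-specific line, one way to avoid hand-editing is to generate the file from the shell. The following is only a sketch: the function name write_rpmmacros and its arguments are my own invention, not anything that rpm provides. It writes the same macros as above without the comments, and it builds %_buildroot from %{_tmppath} alone because %_tmppath already starts with %{_topdir}.

```shell
#!/bin/sh
# write_rpmmacros HOME_DIR [OUTPUT_FILE]
# Hypothetical helper: writes the macros from this post with %_topdir
# pointing at HOME_DIR/working. OUTPUT_FILE defaults to
# HOME_DIR/.rpmmacros and is overwritten if it already exists.
write_rpmmacros() {
    home_dir="$1"
    out="${2:-$home_dir/.rpmmacros}"
    {
        # '%%' makes printf emit a literal '%'; %s is the home directory.
        printf '%%_topdir        %s/working\n' "$home_dir"
        # The quoted 'EOF' stops the shell from expanding anything below,
        # so the rpm macros are written out exactly as typed.
        cat <<'EOF'
%_sourcedir     %{_topdir}/%{name}
%_specdir       %{_sourcedir}
%_tmppath       %{_topdir}/tmp
%_builddir      %{_topdir}/BUILD
%_buildroot     %{_tmppath}/%{name}-%{version}-root
%_rpmdir        %{_topdir}/RPMS
%_srcrpmdir     %{_topdir}/SRPMS
EOF
    } > "$out"
}
```

After defining the function in your build account's shell, running write_rpmmacros "$HOME" creates the file. You can then run rpm --eval '%_topdir' to confirm that rpm picked up the new top directory.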

Now just a couple more changes and we are all set. If this is a brand new account, you might want to have rm, cp, and mv all asking permission to prevent accidents.

$ echo -e "alias rm='rm -i'\nalias cp='cp -i'\nalias mv='mv -i'\n" >> .bashrc

And lastly, it's also good to get a couple of prerequisite build packages installed.

$ su - root
# dnf install redhat-rpm-config rpm-build
# exit

This concludes setting up an environment to build packages for Fedora or RHEL. You can now test your setup by building the most recent audit rpm (after installing audit prerequisite rpms).

$ wget
$ rpm -ivh audit-2.7.2-2.fc24.src.rpm

Next we need to install some prerequisite packages for building the audit package:

$ su - root
# dnf install golang kernel-headers krb5-devel libcap-ng-devel openldap-devel python-devel python3-devel swig tcp_wrappers-devel audit-libs-devel
# exit

Note that under normal circumstances, you do not install audit-libs-devel to build the audit package. There is a self-test for the golang binding that needs it in the system path. One of these days I'll fix up the test so that it uses the freshly built one. In any event, we can now do the build:

$ rpmbuild -ba working/audit/audit.spec

Note that even though the above references a Fedora 24 rpm, it doesn't make any difference since it's just the source rpm. If everything goes to plan, you will have packages in working/RPMS. If you want to see the exploded audit source code, after building it's located in working/BUILD/audit-2.7.2/. If you want to see all the files that make up the audit build, they are in working/audit/. Everything is nice and neat.
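
To confirm what the build produced, something like the following lists every rpm under the output directories. It is only a sketch: the function name list_built_rpms is my own, it relies on nothing but find and sort, and it assumes the directory layout described earlier.

```shell
#!/bin/sh
# list_built_rpms [TOPDIR]
# Hypothetical helper: print every binary and source rpm found under
# TOPDIR/RPMS and TOPDIR/SRPMS, one path per line, sorted.
# TOPDIR defaults to $HOME/working.
list_built_rpms() {
    topdir="${1:-$HOME/working}"
    # Errors (for example a directory that does not exist yet) are
    # silenced so that an empty tree simply prints nothing.
    find "$topdir/RPMS" "$topdir/SRPMS" -type f -name '*.rpm' 2>/dev/null | sort
}
```

Running list_built_rpms ~/working after the audit build should show the binary packages and the rebuilt source rpm. To see what is inside any one of them, rpm -qpl followed by the file name lists its contents.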

It's good to check with a simple rpm before we build a challenging package.

Tuesday, February 14, 2017


Welcome to my new blog. I have considered doing this off and on for a year or two. Finally made the decision because I realized that I have a lot of things to explore, experiment with, and comment on.

The topics that I would like to comment on are really about the intersection of Linux Security and Data Science. (Hence the name of this blog.) It's an exciting time because a lot of the analytic tools are really powerful, simple, and give better insights. Another thing that makes security research interesting is the promise of AI and Deep Learning algorithms to shed light on mountains of data that were incomprehensible. Sometimes doing these experiments and research requires some setting up to get ready. All of these topics will be covered regularly.

For those who aren't familiar with my work, I work on the Linux Audit project, which is a realtime security event system. I also have created a library to make using capabilities simpler. I started the openscap project and have now bowed out - it's in good hands. I have also worked in committees helping to create some of the security standards in place today. I also do security research and code reviews looking for problems. All of these topics are fair game as well.

Keeping the first one short because I'd rather get on with real information sooner than later.