Thursday, March 16, 2017

The Three Level Sankey Diagram

A couple days ago we talked about a 2 level Sankey Diagram. The chart shows the relationship between items with a line whose width reflects the strength or volume of that relationship. This is useful in gauging how many times something happens. I promised a tweak to the diagram that will make it even more useful. Here it is...

The three level Sankey Diagram
I wanted to give you a detailed theory of how the thing works in the last blog. To make the explanation clear, I gave you a tool that is somewhat limiting. But it is easier to follow along than a 3 level diagram.

The way that you make a 3 level diagram is to have a left, middle, and right column of data. Then you make the connections between left and middle, then middle and right. Then you plot it. So, rather than tell you again how it works, I'm just going to give you the code so that you have it and can play with it. Start up RStudio, open a new file, copy the following into it. I will assume at this point you know how to make the audit.csv file.


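# This script will make a 3 level Sankey diagram
library(plyr)       # ddply and summarize
library(dplyr)      # filter_ (deprecated in newer dplyr releases)
library(networkD3)  # sankeyNetwork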
left_field <- as.character("EVENT")
right_field <- as.character("RESULT")
middle_field <- as.character("OBJ_KIND")

# Subset the data based on this expression
operand1 <- as.character("")
operation <- as.character("")
operand2 <- as.character("")
expr <- paste(operand1, operation, operand2)

# Read in the data and don't let strings become factors
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE, stringsAsFactors = FALSE)
audit$one <- rep(1,nrow(audit))

# Subset the audit information depending on the expression
if (expr != "  ") {
  audit <- filter_(audit, expr)
}

# Make 2 dataframes for a 3 level Sankey
left = data.frame(audit[left_field], audit[middle_field], audit$one)
colnames(left) = c("Source", "Target", "Num")
right = data.frame(audit[middle_field], audit[right_field], audit$one)
colnames(right) = c("Source", "Target", "Num")

# Now summarize and collapse to unique values to calculate edges
l <- ddply(left, .(Source,Target), summarize, Value=sum(Num))
r <- ddply(right, .(Source,Target), summarize, Value=sum(Num))

# Calculate Nodes lookup table
nodes <- c(as.character(l$Source), as.character(l$Target), as.character(r$Target))
nodes <- data.frame(unique(as.factor(nodes)))
colnames(nodes) = c("Source")
nodes$ID <- seq(from = 0, to = nrow(nodes) - 1)
nodes <- nodes[,c("ID","Source")]

# Now map Node lookup table numbers to source and target
# Merge index onto Source
l_edges <- merge(l, nodes, by.x = "Source")
l_edges$source = l_edges$ID
r_edges <- merge(r, nodes, by.x = "Source")
r_edges$source = r_edges$ID

# Merge index onto Target
names(nodes) = c("ID", "Target")
l_edges2 <- l_edges[,c("Source","Target","Value","source")]
r_edges2 <- r_edges[,c("Source","Target","Value","source")]
l_edges <- merge(l_edges2, nodes, by.x = "Target")
r_edges <- merge(r_edges2, nodes, by.x = "Target")

# rename everything so its nice and neat
names(l_edges) <- c("osrc", "otgt", "value", "source", "target")
names(r_edges) <- c("osrc", "otgt", "value", "source", "target")
names(nodes) = c("ID", "name")

# Combine into one big final data frame
edges <- rbind(l_edges, r_edges)

sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 16, nodeWidth = 30,
              height = 2000, width = 1800)
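If you want to see the two edge lists on a small scale before running the whole thing, here is a minimal sketch in base R. The three rows of data are made up for illustration, and it uses aggregate rather than ddply, so treat it as an approximation of what the script does and not a replacement for it.

```r
# Toy stand-in for audit.csv; these rows are invented
audit <- data.frame(
  EVENT    = c("SYSCALL", "SYSCALL", "USER_LOGIN"),
  OBJ_KIND = c("file", "file", "user-session"),
  RESULT   = c("success", "failed", "success"),
  stringsAsFactors = FALSE
)
audit$one <- 1

# Left edges connect the left column to the middle column
left <- aggregate(one ~ EVENT + OBJ_KIND, data = audit, FUN = sum)

# Right edges connect the middle column to the right column
right <- aggregate(one ~ OBJ_KIND + RESULT, data = audit, FUN = sum)

print(left)
print(right)
```

The middle column shows up as the target in one list and as the source in the other. That is what stitches the three levels together.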

There are 3 variables, left_field, middle_field, and right_field, that define what will get charted. This program has them set to "EVENT", "OBJ_KIND", and "RESULT". (If you need a reminder of what fields are available and what they mean, see this blog post.) This shows you what kind of objects are associated with the event and what the final result is. This is what it looks like on my system:

I would also suggest trying these:





You should play around with it. You might see something in your events you hadn't noticed before. One last tip: too much data makes it hard to see things. You might experiment with subsetting the information, either by adding command line switches to ausearch so that it narrows down what's collected or by using R code.
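As a rough sketch of the R side of that tip, here are two base R ways to narrow things down. The data frame is a made-up stand-in for audit.csv, and which values you keep is entirely up to you.

```r
# Toy stand-in for audit.csv; these rows are invented
audit <- data.frame(
  EVENT    = c("SYSCALL", "USER_LOGIN", "SYSCALL", "SERVICE_START"),
  OBJ_KIND = c("file", "user-session", "file", "service"),
  RESULT   = c("success", "failed", "failed", "success"),
  stringsAsFactors = FALSE
)

# Keep only the failures
failed <- subset(audit, RESULT == "failed")

# Or keep only the event types you care about
logins <- audit[audit$EVENT %in% c("USER_LOGIN"), ]

print(failed)
print(logins)
```

Either result can be fed to the rest of the script in place of audit.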

Wednesday, March 15, 2017

Account Names

In the last blog we looked at the 2 level Sankey Diagram. I had promised to show you an update to it that made it much more useful, but I decided to save that until next time. The reason is that I thought of a common security problem that warrants discussion. Since this is also a security blog, I thought we could take a pause on the data science and talk security.

The problem
If you have a computer on the internet, it's just a fact of life that your system is being pounded constantly by people trying to log in. So, what accounts do they go after?

To solve this problem, we need a script to gather information.

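#!/bin/sh
# Output file; the name matches the acct.csv that we count below
tfile="acct.csv"
echo "ACCT" > $tfile

# Walk every btmp log, drop lastb's 2 trailer lines, keep the account name
for log in /var/log/btmp*
do
        lastb -w -f $log | head -n -2 | awk '{ printf "%s\n", $1 }' >> $tfile
done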

Save it and run it as root.

[root@webserver]# vi get-acct
[root@webserver]# chmod +x ./get-acct
[root@webserver]# ./get-acct
[root@webserver]# wc -l acct.csv
942453 acct.csv

OK. So, let's pull this into RStudio for some analysis. What we want to do is see what the most often used accounts are. Let's run the following script:


library(dplyr)  # arrange and desc
a <- read.csv("~/R/audit-data/acct.csv", header=TRUE)
a$one <- rep(1,nrow(a))

acct <- aggregate(a$one, by=list(a$ACCT), FUN=length)
colnames(acct) = c("acct", "tries")
account <- arrange(acct, desc(tries))

Run it. Now let's see what the 50 most popular accounts are:

> head(account, n=50)
            acct  tries
1           root 866755
2        support  10315
3          admin   7856
4           user   1904
5           ubnt   1469
6              a   1436
7           test   1324
8          guest    966
9       postgres    525
10            pi    496
11        oracle    463
12       ftpuser    425
13       service    369
14        nagios    314
15       monitor    287
16 administrator    246
17        backup    240
18           git    213
19     teamspeak    212
20          sshd    193
21       manager    184
22     minecraft    180
23           ftp    179
24         super    169
25       student    168
26        ubuntu    166
27        tomcat    160
28         ADMIN    158
29        zabbix    158
30           ts3    154
31      testuser    152
32          uucp    150
33           adm    145
34      operator    144
35      PlcmSpIp    144
36          alex    140
37    teamspeak3    140
38        client    138
39       default    138
40          info    126
41        telnet    126
42           www    124
43        hadoop    123
44        upload    120
45           fax    118
46            ts    118
47     webmaster    118
48       richard    112
49        debian    110
50      informix    110
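If you don't have dplyr handy, the same kind of ranking can be sketched in base R with table and sort. The account names below are a tiny made-up sample, and the share column is my own addition to show how lopsided things are. In the real data above, root alone is about 92% of the tries (866755 of the 942452 data lines).

```r
# Tiny made-up sample in the same shape as acct.csv
a <- data.frame(
  ACCT = c("root", "admin", "root", "test", "root", "admin"),
  stringsAsFactors = FALSE
)

# Count tries per account and sort from most to least
tries <- sort(table(a$ACCT), decreasing = TRUE)
account <- data.frame(
  acct  = names(tries),
  tries = as.vector(tries),
  stringsAsFactors = FALSE
)

# Percentage of all tries that each account represents
account$share <- round(100 * account$tries / sum(account$tries), 1)
print(account)
```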

So, what does this mean?

1) Do not allow root logins. Ever. Period.
2) Do not make an account based on a job function
3) Do not make an account based on a service name
4) Make sure all service accounts cannot be logged into
5) Do not make an account based on your first name

So, how can we check for active accounts on the system? First, let's make sure everything uses shadowed passwords:

# awk -F: '$2 != "x" { print $1 }' < /etc/passwd

Any problems here should be fixed. Next we can check which accounts are active:

# egrep -v '.*:\*|:\!' /etc/shadow | awk  -F: '{ print $1 }'

If you see any services listed or simple names and the system is hooked up to the internet 24x7, you might want to look into it. If you use two factor authentication or keys, then you are also likely in good shape.

The real point of this was to show you how to check what accounts are getting hammered the hardest by people trying to get in.

Tuesday, March 14, 2017

Simple Sankey Diagram

Sometimes in security we would like to see the relationship between things. This is typically done with a network diagram. R has a real nice package, networkD3, which draws network diagrams using html widgets and the D3 javascript library. We will use it to draw a specific kind of network diagram called a Sankey Diagram. The Sankey diagram is sometimes called a river plot or alluvial diagram. The idea is that the width of the line conveys the volume or size of the relationship. Skinny lines have occasional relations, fat ones have a lot.

The idea of subsetting is to discard data that doesn't meet certain criteria. We can do this prior to importing into RStudio by using specific ausearch command line switches. Or we can do this in R after the data is imported. To illustrate this, let's look at a diagram that shows what user accounts transition to. Let's collect the data:

cd ~/R/audit-data
ausearch --start today --format csv > audit.csv

The program that I am going to explain is written with variables whose values you can change to easily create different diagrams. A two level Sankey diagram has boxes on the left and boxes on the right, all grouped vertically. These are called nodes. We also have a left and right field variable to denote which columns of data we are charting.

The subsetting is also given by variables so that you can easily change the diagram without much thinking about it. To see the relationship of what accounts transition to, we want to subset the data to whenever SUBJ_PRIME != SUBJ_SEC. That means we want records whenever the uid does not match the auid. The code that does this is as follows:


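# This script will make a 2 level Sankey diagram based on these two variables
library(plyr)       # ddply and summarize
library(dplyr)      # filter_ (deprecated in newer dplyr releases)
library(networkD3)  # sankeyNetwork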
left_field <- as.character("SUBJ_PRIME")
right_field <- as.character("SUBJ_SEC")

# Subset the data based on this expression
operand1 <- as.character("SUBJ_PRIME")
operation <- as.character("!=")
operand2 <- as.character("SUBJ_SEC")
expr <- paste(operand1, operation, operand2)

# Read in the data and don't let strings become factors
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE, stringsAsFactors = FALSE)
audit$one <- rep(1,nrow(audit))

# Subset the audit information based on the variables
if (expr != "  ") {
  audit <- filter_(audit, expr)
}

If you run the program to this point, you can use the names command to see that we have an audit variable with 16 columns.

> names(audit)
 [1] "NODE"       "EVENT"      "DATE"       "TIME"       "SERIAL_NUM" "EVENT_KIND" "SESSION"    "SUBJ_PRIME" "SUBJ_SEC"   "ACTION"
[11] "RESULT"     "OBJ_PRIME"  "OBJ_SEC"    "OBJ_KIND"   "HOW"        "one"
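If filter_ gives you trouble on a newer dplyr, the same string-driven subsetting can be sketched in base R. This is an assumption on my part about how you might swap it out, not what the program above does. It parses the expression text and evaluates it against the columns of the data frame. The four rows are made up.

```r
# Toy stand-in for the audit data; these rows are invented
audit <- data.frame(
  SUBJ_PRIME = c("sgrubb", "system", "root", "gdm"),
  SUBJ_SEC   = c("root", "system", "root", "root"),
  stringsAsFactors = FALSE
)

# Same way the program builds its expression
expr <- paste("SUBJ_PRIME", "!=", "SUBJ_SEC")

# Evaluate the text using the data frame's columns as variables
keep <- eval(parse(text = expr), envir = audit)
changed <- audit[keep, ]
print(changed)
```

Only evaluate expressions you wrote yourself. Parsing text as code will run whatever is in the string.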

The Diagram
What we need to do now is to sum up connections between SUBJ_PRIME and SUBJ_SEC. In order to collapse it down to unique rows, we need to get rid of everything that we don't need. Excess information will make the rows unique and not allow collapsing to the common values. We'll do this by making a new data frame that holds just that information and pass it to a function that does the summing.

# Make a dataframe for a 2 level Sankey diagram
left = data.frame(audit[left_field], audit[right_field], audit$one)
colnames(left) = c("Source", "Target", "Num")

# Now summarize and collapse to unique values to calculate edges
l <- ddply(left, .(Source,Target), summarize, Value=sum(Num))

Let's see how it turned out. In the R console, if you type the variable 'l' (el), it outputs the current value. As we can see, we have collapsed 1200+ events down to 7 unique lines.

> l
  Source         Target Value
1    gdm           root     4
2   root                    2
3 sgrubb           root    26
4 system                  108
5 system            gdm     2
6 system           root  1059
7 system setroubleshoot    16

In the cases where there is no value in the Target column, these are malformed kernel events that are messing up the data. We can drop those with a subset function.

> l <- subset(l, Target != "")
> l
  Source         Target Value
1    gdm           root     4
3 sgrubb           root    26
5 system            gdm     2
6 system           root  1059
7 system setroubleshoot    16

To create the nodes, we want to get a list that only has unique values. Then we want to give each one a number.

nodes <- c(as.character(l$Source), as.character(l$Target))
nodes <- data.frame(unique(as.factor(nodes)))
colnames(nodes) = c("Source")
nodes$ID <- seq(from = 0, to = nrow(nodes) - 1)
nodes <- nodes[,c("ID","Source")]

Let's see what our data looks like now

> nodes
  ID         Source
1  0            gdm
2  1         sgrubb
3  2         system
4  3           root
5  4 setroubleshoot

So far so good. I don't know if you noticed it, but I did something sneaky there. I renamed the column names halfway through. This is important later.

The next step is that we need to calculate the edges. IOW, what is connected to what. We need to say draw a line from node 2 to node 3. We will use the lookup table we just created as the canonical source of names to numbers. We can use a merge function to combine two data frames if they have a column in common. That is why I renamed the column in the last step. A merge is like doing a SQL join: rows are matched on the column indicated as common to both data frames. We will tell it to merge the left data to the node data for each value in the Source field. Afterwards you can see that the numbers from the nodes are now combined with the unique Sources.

edges <- merge(l,nodes,by.x = "Source")

> edges
  Source         Target Value ID
1    gdm           root     4  0
2 sgrubb           root    26  1
3 system            gdm     2  2
4 system           root  1059  2
5 system setroubleshoot    16  2

We don't need source anymore, so we'll drop it by setting it to NULL.

edges$Source <- NULL

Now what we need to do is come up with the ID numbers for the Target column. We start by renaming the columns so that they have Target in common. Then we do the merge again.

names(edges) = c("Target","Value","Sindx")
names(nodes) = c("ID", "Target")
edges <- merge(edges,nodes,by.x = "Target")
edges$Target <- NULL

  Value Sindx ID
1     2     2  0
2     4     0  3
3    26     1  3
4  1059     2  3
5    16     2  4

The next step is to rename the columns to something nicer.

names(edges) <- c("value","source","target")
names(nodes) = c("ID", "name")

Let's see the final product

> edges
  value source target
1     2      2      0
2     4      0      3
3    26      1      3
4  1059      2      3
5    16      2      4
> nodes
  ID           name
1  0            gdm
2  1         sgrubb
3  2         system
4  3           root
5  4 setroubleshoot
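As a cross-check on the two merges, the same numbering can be sketched with match, which looks up each name's position in the node list. This is an alternative I'm offering for comparison, not the method used above. The data is the summarized table from earlier with the blank targets already dropped.

```r
# The summarized edges from above, blank targets removed
l <- data.frame(
  Source = c("gdm", "sgrubb", "system", "system", "system"),
  Target = c("root", "root", "gdm", "root", "setroubleshoot"),
  Value  = c(4, 26, 2, 1059, 16),
  stringsAsFactors = FALSE
)

# Unique names in order of first appearance, numbered from zero
name <- unique(c(l$Source, l$Target))
nodes <- data.frame(ID = seq_along(name) - 1, name = name,
                    stringsAsFactors = FALSE)

# match returns 1-based positions, so subtract 1 for zero-based IDs
edges <- data.frame(
  value  = l$Value,
  source = match(l$Source, nodes$name) - 1,
  target = match(l$Target, nodes$name) - 1
)
print(nodes)
print(edges)
```

The rows come out in a different order than with merge, but the source and target pairs are the same.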

The edges value field shows how big the relation is and we have the source and target node numbers. The nodes variable has the node name and its node number. We are ready to plot. We do this by feeding the information into the sankeyNetwork function of networkD3.

sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 16)

In this you can see what account changes into another. Accounts that are logged into will be on the left. What gets transitioned to will be on the right.

Let's do one more. Change left_field and right_field to EVENT_KIND and OBJ_KIND. Make operand1, operand2, and operation all "". Then re-run. Below is what I get.

This diagram is more interesting. In the next blog we will make an adjustment to the Sankey Diagram that will make it much more interesting.

Thursday, March 9, 2017


In the last blog post, we learned about making bar charts. Sometimes we are looking for patterns in the data and a heatmap is the best choice. We'll go over making one of those and I think you'll see that it's super easy.

New data
To start with, we need data for the last week. This is a bit different than our other exercises. We can get the data like this:

# cd R/audit-data
# ausearch --start week-ago --format csv > audit.csv

The code
The program to do the heatmap is almost identical to what we did for the bar charts. If you want an explanation for this part go check out the last blog post. You can copy and paste this into the console section of RStudio to run it, or paste it into a new file. The code is as follows:


library(ggplot2)
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1,nrow(audit))

# Create time series data frame for aggregating
audit$posixDate <- as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")

# Create a column of hour and date to aggregate an hourly total.
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')

# Now summarize it
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)

# Put results into data frame for plotting
final = data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x
final$day <- weekdays(as.Date(final$date))
# Levels must be unique; this keeps the days in order of first appearance
final$oday <- factor(final$day, levels = unique(final$day))
final$hour <- as.numeric(format(final$date, "%H"))

OK, now we are ready for the plot function to create the chart. In this one we want to set the x-axis to the day of the week and the y-axis to be the hour of the day. We also tell the aesthetic function that we want the color of the tile to indicate the count of events during that hour. Next, we tell ggplot that we want tiles, we set the color scale to continuous and give it a legend label, we call a ylim to force the y-axis to have 24 hours, we tell it to make the black and white theme for the chart, set the y scaling, create a title, and add some labels.

# Plot as heat map x is day of week, y is hour, darkness is number of events
plot1 = ggplot(final, aes(x=oday, y=hour, fill=num)) +
  geom_tile() +
  scale_fill_continuous(low="#F7FBFF", high="#2171B5", name="Events") +
  ylim(0,23) + theme_bw() + scale_y_continuous(breaks=seq(0, 23, 4)) +
  ggtitle("Events per Hour") + labs(x="Day of Week", y="Hour of Day")

# To see what ggplot can do
print(plot1)

This makes a chart similar to this:

And just for fun, try it with plotly:

library(plotly)
ggplotly(plot1)

Now you can hover the mouse over the tiles and see the actual value.

As for what the chart reveals, you can see that the system is turned on at nearly the same time every day and the time it is turned off varies. You might use this to spot activity during off hours that would be harder to see any other way. You might also see that there is heavy activity at the same hour every day and use that as a starting point to investigate further.

Heatmaps are really just some different options to the plot function. The data is prepared the same way, but it's organized differently. You can look around on-line for ggplot tutorials if you want to investigate customizing this even more.

Wednesday, March 1, 2017

Bar Charts

Let's take a look at how to do some quantitative charting of the audit events. Sometimes it's good to get an idea of how many events are happening in a given time interval. We will take a look at how to create a bar chart summarizing the hourly number of events.

Bar Charts
Go ahead and create some events for us to experiment with:

# cd ~/R/audit-data
# ausearch --start today --format csv > audit.csv

Then go into RStudio, open a new file, and put the following in:

library(ggplot2)
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1,nrow(audit))

What this does is load a plotting library into the work space. Then it reads the audit csv file that we just created into a variable that we will call audit. By saying header=TRUE, we are telling R to use the first line of the csv file to name the columns of data. Next we add a column of ones that we will use later for summing. Let's stop here for a second and examine things.

In RStudio place the cursor on the library line. Then click on "Run" to single step the program. After executing the last line, change to the console screen below. In the console screen type in "audit". This will cause the contents of audit to be echoed to the screen. To see the column names, use the names function in the RStudio console:

> names(audit)
 [1] "NODE"       "EVENT"      "DATE"       "TIME"       "SERIAL_NUM" "EVENT_KIND" "SESSION"    "SUBJ_PRIME" "SUBJ_SEC"   "ACTION"     "RESULT"
[12] "OBJ_PRIME"  "OBJ_SEC"    "OBJ_KIND"   "HOW"        "one"

We can see all the columns that we discussed in the last blog. You can also see the column that we just added. R has an internal representation of time that combines dates and time. We have the two separated but we need that native representation so that we can follow up with a function that splits off just the hour so that we can summarize on it. Add the next lines to the program:

# Create time series data frame for aggregating
audit$posixDate=as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")

# Create a column of hour and date to aggregate an hourly total.
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')
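To see what those two lines do on their own, here is a small sketch with two made-up timestamps. I'm assuming the month/day/year layout that the format string above expects, and I pin the time zone to UTC only so the result doesn't depend on your machine.

```r
# Two made-up events in the layout the format string expects
DATE <- c("03/01/2017", "03/01/2017")
TIME <- c("09:15:42", "14:03:07")

# Glue date and time together and parse into R's native time type
stamp <- as.POSIXct(paste(DATE, TIME),
                    format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Throw away minutes and seconds so events bucket by the hour
hours <- format(stamp, format = "%Y-%m-%d %H")
print(hours)
```

Every event in the same hour ends up with the same string, which is exactly what we need for grouping.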

Next we want R to count up all the events grouped by what hour it is. We want to store the results in a temporary variable. We do this because the plotting function wants a data frame with everything it needs in the data frame and we just tell it what column contains it. So, we add the next line to the program:

# Now summarize it
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)
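Here is the same aggregate call on a handful of made-up hour strings so that you can see the shape of what comes back. The column names Group.1 and x are what aggregate picks on its own, which is why the next step refers to them.

```r
# Made-up hourly buckets: three events at 09, one at 14
hours <- c("2017-03-01 09", "2017-03-01 09",
           "2017-03-01 14", "2017-03-01 09")
one <- rep(1, length(hours))

# Count how many events fall into each bucket
temp <- aggregate(one, by = list(hours), FUN = length)
print(temp)
```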

Then we build the data frame for plotting:

# Put results into data frame for plotting
final = data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x
final$day <- weekdays(as.Date(final$date))
# Levels must be unique; this keeps the days in order of first appearance
final$oday <- factor(final$day, levels = unique(final$day))
final$hour <- as.numeric(format(final$date, "%H"))

Now, it's time to plot. ggplot can plot many kinds of charts. All you need to do is change a couple variables passed and you get an entirely different chart. What we will do is tell the plot to use the aesthetics which plots the x axis based on the time and y based on the sum that we totaled. Then we tell it we want a bar chart and to give it a nice title and group the data on the x axis by the time representation and to add an x and y axis label. This is coded up as follows:

# Plot as BAR CHART x is the hour, y is number of events
plot1 = ggplot(final, aes(x=date, y=num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + xlab("") + labs(x="Hour of Day", y="Events")

To see it, we need to print it:

print(plot1)

You can run all the lines down to here and you get something like this:

What the print statement drew was a picture. What if you wanted an interactive web page? In the console add these:

library(plotly)
ggplotly(plot1)

Over in the viewer you now have a web page that tells you the total when you place the mouse over the columns. Hmm. Our chart is gray. How about adding some color? OK, how about making the color vary with how many events we have? OK, then let's type this in at the console:

plot1 = ggplot(final, aes(x=date, y=num, fill=num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + xlab("") + labs(x="Hour of Day", y="Events") +
  scale_fill_gradient(low="black", high = "red", name="Events/Hour")


All we did is tell the plot's aesthetics that we want to fill a color based on the number and then told it what gradient to use. When you run that you can now see that we have colors:

If you want that as a web page, you can re-run the ggplotly(plot1) command. If for some reason you wanted to save the image as a file on your hard drive, you just run the following command:

ggsave("bar-hourly-events-per-hour.png", plot1, width=7, height=5)

In this way you can automate creation of charts and then pull them all together into a report.

Sunday, February 26, 2017

Audit log Normalization Part 2 - CSV format

As we discussed in the last blog, there is a new API in the auparse library that gives a unified view of the audit trail. In this post we will dig deeper into it to understand the details of what it provides.

Ausearch CSV format
Ausearch is the main tool for viewing audit logs. Two of the output formats, text and CSV, use the normalizer interface to output the event. The text format was covered in the last blog post, so this time we'll just look at the CSV output.

Ausearch has a lot of options that allow it to pick out the events that you are interested in. But with the CSV mode, we don't have to be quite as picky. We can just grab it all and use big data techniques to subset the information in a variety of ways.

The following discussion applies to the audit-2.7.3 package. Previous and future formats may differ slightly in details and quantity of information. Let's get an event to look at:

$ ausearch --start today --just-one --format csv 2>/dev/null

The first line is a header that specifies the name of each column. This is useful when you import the data into a spreadsheet or an R program. In R it becomes the name of the field within the dataframe variable holding the audit trail. This will be important in the next couple of blogs. But for now, let's take a look at each field:

NODE - This is the name of the computer that the event comes from. For it to have any value, the admin would have to set the name_format and possibly name options in auditd.conf. Normally this is missing unless you do remote logging. So, don't worry if it's empty.

EVENT - This field corresponds to the type field in an event viewed with ausearch. If the event includes a syscall record, in almost all cases it will be called a SYSCALL event.

DATE - This is the date formatted as specified for the locale of your machine.

TIME - This is the time in hours, minutes, and seconds formatted as specified for the locale of your machine.

SERIAL_NUM - This is the serial number of the event. This is given to help locate the exact event if you needed to find it.

EVENT_KIND - This is metadata about the previously given EVENT field. This is useful in helping to subset and classify data. It currently has the following values:

  • unknown - This means the event can't be classified. You should never see this. If you do please report it on the linux-audit mail list or file a bug report.
  • user-space - This is a catch all for user space originating events that are not classified another way.
  • system-services - This is system and service events. This includes system boot, shutdown, runlevel changes, service start and stop events.
  • configuration - This includes user space config changes such as setting the time. It also includes kernel changes such as loading netfilter rules and changes to the audit system.
  • TTY - This is kernel and user space TTY events.
  • user-account - This collects up all the events that relate to creating, modifying, and deleting users. It also includes events related to creating, modifying, or deleting groups.
  • user-login - This includes all the events related to authentication, authorization, assignment of credentials, login, establishing a session, ending a session, disposing of credentials, and logging out.
  • audit-daemon - These are events related specifically to the audit daemon itself.
  • mac-decision - This is a Mandatory Access Control policy decision. In terms of SELinux, it would be an AVC event.
  • anomaly - This is an anomalous event. This means the event is unusual and should be looked at carefully. This includes promiscuous mode changes for a network interface, program crashes, or programs dereferencing suspicious symlinks. In the future it will include events created by an IDS component as it identifies suspicious behavior.
  • integrity - This is integrity events coming from the IMA subsystem.
  • anomaly-response - This is for all events recorded by an IPS system that is responding to anomaly events.
  • mac - This is for any event related to the configuration of a Mandatory Access Control System.
  • crypto - This is for user space and kernel cryptography events.
  • virt - This is for any events related to the management of virtualization or containers.
  • audit-rule - This is for events that are directly related to the triggering of an audit rule. Normally this is syscall events.

SESSION - This is the session number of the user's login. When users login, a unique session identifier is created and inherited by any program in that session. This is to allow tracking an action back to an exact login. Sometimes it can be "unset" which means it's not related to any login. This would indicate it's related to a daemon.

SUBJ_PRIME - This is the main way that the subject is identified. In most cases this is the interpreted value of the auid field. This was chosen over the numeric number associated with an account because you may have several accounts with the same name but different account numbers across a data center.

SUBJ_SEC - This is the second way of identifying, or an alias to, the subject. Typically this is the interpreted uid field in an audit event. Normally the only time it's different from the prime/auid is when you su or sudo to a different account from your login.

ACTION - This is metadata about what the subject is doing to the object. This is helpful in subsetting or classifying the event. Determining the action can be based on the event type, what the syscall is, and the op field in some events. If it can't determine the syscall, it's simply "triggered-audit-rule". The list of actions is too big to list in this blog post.

RESULT - This is either success or failed.

OBJ_PRIME - This is the main way that an object is identified. It can be a file name, account, device, virtual machine, and more. Look at the OBJ_KIND description below for more ideas on what this could be.

OBJ_SEC - This is the secondary way to identify the object. For example, the path to a file is the main way to identify a file. The secondary identifier is the inode. In some cases this may be a terminal, vm, or remote port.

OBJ_KIND - This is metadata that lists what kind of object the event is about. This can be useful in subsetting or classifying the events to see what's happening to specific kinds of things. Current values are as follows (this is self explanatory):
  • unknown
  • fifo
  • character-device
  • directory
  • block-device
  • file
  • file-system
  • symlink
  • socket
  • process
  • firewall
  • service
  • account
  • user-session
  • virtual-machine
  • printer
  • system
  • admin-defined-rule
  • audit-config
  • mac-config
  • memory

HOW - This is how the subject did something to the object. Typically this is the program used to do it. In some cases the program is an interpreter. In that case the normalizer lists the command being run.
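To give a feel for how the metadata columns help with subsetting, here is a minimal R sketch. The three rows are invented for illustration (including the "example-action" and "example.service" placeholders). Only the header matches the field names described above.

```r
# Header as described above, followed by three invented rows
csv <- c(
  "NODE,EVENT,DATE,TIME,SERIAL_NUM,EVENT_KIND,SESSION,SUBJ_PRIME,SUBJ_SEC,ACTION,RESULT,OBJ_PRIME,OBJ_SEC,OBJ_KIND,HOW",
  ",SYSCALL,02/26/2017,08:01:10,101,audit-rule,4,sgrubb,root,triggered-audit-rule,success,/etc/gshadow,17040896,file,/usr/sbin/userdel",
  ",SYSCALL,02/26/2017,08:05:33,102,audit-rule,4,sgrubb,sgrubb,triggered-audit-rule,failed,/etc/shadow,17040001,file,/usr/bin/cat",
  ",SERVICE_START,02/26/2017,08:07:00,103,system-services,unset,system,root,example-action,success,example.service,,service,systemd"
)
audit <- read.csv(text = paste(csv, collapse = "\n"),
                  header = TRUE, stringsAsFactors = FALSE)

# Cross tabulate the kind of event against how it turned out
counts <- table(audit$EVENT_KIND, audit$RESULT)
print(counts)
```

With a real audit.csv you would replace the text argument with the file name.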

Extra Fields
The ausearch program has a couple command line switches that will cause even more fields to be emitted.

--extra-time - This causes ausearch to dump the following extra fields: YEAR,MONTH,DAY,WEEKDAY,HOUR,GMT_OFFSET

--extra-labels - This causes ausearch to add: SUBJ_LABEL and OBJ_LABEL which come from the Mandatory Access Control system.

--extra-keys - This causes ausearch to dump the KEY value which is only found in syscall events.

Malformed Events
If we create an audit.csv file as follows:

$ ausearch --start today --format csv > audit.csv

And open it with LibreOffice, you will notice that some rows do not have complete information. The reason is that not all events have the proper and required fields. There are one or two in user space, but the majority come from the kernel. Events that are known to be malformed are:


The fix is that these events must be updated to include all the required fields. Until then, the analysis will have faults and not work correctly. We can use techniques to pull these events out of the analysis. But in the long run, they should be fixed.
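One such technique, sketched in R: flag any row where a field you depend on is empty and set it aside. Which columns count as required here is my own assumption for the example, and the rows are made up.

```r
# Made-up rows; some are missing pieces like malformed events would be
audit <- data.frame(
  SUBJ_PRIME = c("sgrubb", "system", "", "root"),
  SUBJ_SEC   = c("root", "", "root", "root"),
  RESULT     = c("success", "success", "failed", ""),
  stringsAsFactors = FALSE
)

# Hypothetical list of fields this analysis cannot do without
required <- c("SUBJ_PRIME", "SUBJ_SEC", "RESULT")

# A row is incomplete if any required field is blank or NA
incomplete <- rowSums(audit[required] == "" | is.na(audit[required])) > 0
clean <- audit[!incomplete, ]
print(clean)
```

Keeping the incomplete rows in their own variable rather than deleting them makes it easy to see how much data you are leaving out.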

This wraps up the overview of the fields that you will find in the CSV output of ausearch. We will next start into analytical programs since we have background material about what we will be looking at.

Wednesday, February 22, 2017

Audit log Normalization Part 1

In this posting I am interested in explaining more about the audit log normalizer. It represents a fundamental shift in what information is available from the audit logs. If you take a look at a typical event as it comes from the kernel, you will see things in many places. For example, consider this event:

type=PROCTITLE msg=audit(1487184592.084:894): proctitle=7573657264656C007374657665
type=PATH msg=audit(1487184592.084:894): item=4 name="/etc/gshadow" inode=17040896 dev=08:32 mode=0100000 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:shadow_t:s0 nametype=CREATE
type=PATH msg=audit(1487184592.084:894): item=3 name="/etc/gshadow" inode=17040557 dev=08:32 mode=0100000 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:shadow_t:s0 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=2 name="/etc/gshadow+" inode=17040896 dev=08:32 mode=0100000 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:shadow_t:s0 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=1 name="/etc/" inode=17039361 dev=08:32 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 nametype=PARENT
type=PATH msg=audit(1487184592.084:894): item=0 name="/etc/" inode=17039361 dev=08:32 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:etc_t:s0 nametype=PARENT
type=CWD msg=audit(1487184592.084:894): cwd="/root"
type=SYSCALL msg=audit(1487184592.084:894): arch=c000003e syscall=82 success=yes exit=0 a0=7fff7a36b340 a1=562ee0c337c0 a2=7fff7a36b2b0 a3=20 items=5 ppid=2350 pid=6755 auid=4325 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4 comm="userdel" exe="/usr/sbin/userdel" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="identity"

How do you know what it means?
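A small part of the answer is purely mechanical. The proctitle value, for example, is hex-encoded here, with the command's arguments separated by NUL bytes. A minimal Python sketch of decoding it follows; the function name is made up for illustration, and it assumes the value is in the hex-encoded form shown above.

```python
def decode_proctitle(hex_value):
    """Decode a hex-encoded proctitle value into a readable command line."""
    raw = bytes.fromhex(hex_value)
    # The arguments are separated by NUL bytes in the raw value.
    args = raw.split(b"\x00")
    return " ".join(a.decode("utf-8", errors="replace") for a in args if a)

print(decode_proctitle("7573657264656C007374657665"))  # prints: userdel steve
```

So the event came from someone running "userdel steve". The rest of the event takes more than decoding to understand.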

The Requirements
All audit events have a basic requirement stated in Common Criteria and restated in other security guides such as PCI-DSS. The events have to capture:

* Who did something (the subject)
* What did they do
* To what was it done (the object)
* What was the result
* When did it occur

The Implementation
The way that the Linux kernel operates is that there are hardwired events and configurable events. The configurable events are based on syscall filtering. You write a rule that says whenever this action occurs, let me know about it and label it with a key to remind me what it's about. For example, this rule triggered the above event:

-w /etc/gshadow -p wa -k identity

This rule says, notify me whenever the file /etc/gshadow is written to or has its attributes changed. There is a list of syscalls that are considered to have written to a file. We can see the list here:


In our case the rename syscall, which is mentioned on line 2 of audit_dir_write.h, is the one that triggered the event. When this happens, the kernel grabs a bunch of information about the syscall and places it in the event as a syscall record. This occurs in the syscall filter where everything about the process and syscall is readily available. The act of doing the rename occurs in several places throughout the kernel. In those places it might be too invasive or have incomplete information available that prevents writing a good event. What the kernel does is add information about what is being acted upon as auxiliary records. These are things like PATH, CWD, SOCKADDR, IPC_OBJ, etc.

Translating the output
What this means is that identifying a subject is pretty straightforward, but identifying the object takes a lot of specialized knowledge.

To understand the pieces of the event, you would need to look in the SYSCALL record at the auid and uid fields to find the subject. What they are doing is determined by the actual syscall. In this case, rename. Now, which of the 5 PATH records is the one we want? In this case we want to know the existing file that disappeared. It's the one that got renamed or acted upon. This would be the PATH record with the nametype=DELETE. Then the "name" field in the same record tells us that it's /etc/gshadow+. Easy, right? :-)
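To make that manual process concrete, here is a rough Python sketch that does the same lookup by hand. The records are copied from the event above but trimmed to the fields the sketch uses. The function names are made up, and the rule used to choose between the two DELETE records (pick the one whose inode reappears in the CREATE record) is my assumption about rename, not necessarily how any audit tool decides.

```python
import re

# Records from the event above, trimmed to the fields this sketch uses.
EVENT = '''\
type=PATH msg=audit(1487184592.084:894): item=4 name="/etc/gshadow" inode=17040896 nametype=CREATE
type=PATH msg=audit(1487184592.084:894): item=3 name="/etc/gshadow" inode=17040557 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=2 name="/etc/gshadow+" inode=17040896 nametype=DELETE
type=PATH msg=audit(1487184592.084:894): item=1 name="/etc/" inode=17039361 nametype=PARENT
type=PATH msg=audit(1487184592.084:894): item=0 name="/etc/" inode=17039361 nametype=PARENT
type=SYSCALL msg=audit(1487184592.084:894): syscall=82 success=yes auid=4325 uid=0 exe="/usr/sbin/userdel" key="identity"
'''

# A field is name=value, where the value is either quoted or runs to whitespace.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_record(line):
    """Turn one raw record into a dict of field name -> value (quotes removed)."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}

def summarize(event_text):
    """Pick the subject fields and the renamed path out of a rename event."""
    records = [parse_record(l) for l in event_text.splitlines() if l.strip()]
    syscall = next(r for r in records if r["type"] == "SYSCALL")
    paths = [r for r in records if r["type"] == "PATH"]
    # Assumption: the renamed file is the DELETE entry whose inode reappears
    # in the CREATE entry; the other DELETE is the file that was replaced.
    created = {r["inode"] for r in paths if r["nametype"] == "CREATE"}
    renamed = [r["name"] for r in paths
               if r["nametype"] == "DELETE" and r["inode"] in created]
    return {"auid": syscall["auid"], "uid": syscall["uid"],
            "syscall": syscall["syscall"], "object": renamed[0]}

print(summarize(EVENT))
```

Even this only handles one syscall, and it still leaves you with raw numbers: 4325 and 0 have to be looked up to become account names, and 82 has to be looked up to become rename.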

You might say there has to be an easier way. Especially given that this is just one syscall and there are over 300 syscalls which are very different from one another.

The audit normalizer
The easier way has just been created with the audit-2.7 release, and it's the auparse_normalizer API. It builds on top of the auparse library functionality. The auparse library has a concept of an internal cursor that walks across events, then records within an event, and then individual fields within a record. So, if you are analyzing a log, you would be iterating through events using the auparse_next_event() function call.

To get a normalized view of the current event, a programmer simply calls auparse_normalize(). This function looks up all the information about subject, object, action and assembles it into an opaque data structure which is associated with the event.

To get at the information, there are some accessor functions such as auparse_normalize_subject_primary(). This function places the internal cursor on top of the field that was determined to be the subject. Then you can use the normal field accessor functions such as auparse_interpret_field(), which in our event takes the auid and converts it to the account name.

There are other accessor functions in the auparse_normalize API. The ones that return an int are ones that position the internal cursor. The ones that return a char * point to internal metadata that does not exist in the actual record. These can be used directly.
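The split between the two kinds of accessors can be illustrated with a toy model in Python. To be clear, this is not libauparse and none of these names exist in it; the class, its methods, and the return values are all invented here just to show the idea of "position the cursor, then read the field" versus "return metadata directly".

```python
class ToyEvent:
    """Toy stand-in for a parsed event with an internal field cursor."""

    def __init__(self, records, subject_at, kind):
        self.records = records          # list of dicts, one per record
        self._subject_at = subject_at   # (record index, field name)
        self._kind = kind               # metadata not present in any record
        self._cursor = None

    def position_on_subject(self):
        """Cursor-style accessor: move the cursor, return 1 on success, else 0."""
        rec, field = self._subject_at
        if field not in self.records[rec]:
            return 0
        self._cursor = (rec, field)
        return 1

    def current_field(self):
        """Generic field accessor: read whatever the cursor points at."""
        rec, field = self._cursor
        return field, self.records[rec][field]

    def event_kind(self):
        """String-style accessor: return metadata directly, no cursor needed."""
        return self._kind

event = ToyEvent([{"type": "SYSCALL", "auid": "4325", "uid": "0"}],
                 subject_at=(0, "auid"), kind="example-kind")
if event.position_on_subject() == 1:
    print(event.current_field())
print(event.event_kind())
```

The point of the design is that the cursor-style accessors let you reuse all of the existing field functions on whatever the normalizer picked, while the string-style ones hand back values that were never in the log to begin with.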

Parlor Tricks
While working on the normalizer, I realized that it might be possible to turn logs into English sentences. I was thinking that we could create the sentence in the following form:

On 1-node at 2-time 3-subj 4-acting-as 5-results 6-action 7-what 8-using

Which maps to:

1) node
2) time
3) auid, failed logins=remote system
4) uid (only when uid != auid)
5) res - successfully / unsuccessfully
6) op, syscall, type, key
7) path, address, account, policy, rule
8) exe,comm

This layout forms the basis for the output from "ausearch --format text". Using this command-line option, our event turns into this:

At 13:49:52 02/15/2017 sgrubb, acting as root, successfully renamed /etc/gshadow+ using /usr/sbin/userdel
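As a rough illustration of how the eight slots fit together, here is a Python sketch that assembles the same sentence from values that have already been normalized and interpreted. It is not how ausearch is implemented; the function and the dictionary keys are invented for this example.

```python
def to_sentence(ev):
    """Assemble a sentence from already-interpreted, normalized values."""
    parts = []
    # Slot 1: the node is only mentioned when one is present.
    if ev.get("node"):
        parts.append("On " + ev["node"])
    # Slot 2: the time starts the sentence when there is no node.
    parts.append(("at " if parts else "At ") + ev["time"])
    # Slots 3 and 4: "acting as" only appears when uid differs from auid.
    subject = ev["auid"]
    if ev["uid"] != ev["auid"]:
        subject += ", acting as " + ev["uid"] + ","
    parts.append(subject)
    # Slots 5 to 7: result, action, and what was acted upon.
    parts.append(ev["result"])
    parts.append(ev["action"])
    parts.append(ev["what"])
    # Slot 8: how it was done, when known.
    if ev.get("how"):
        parts.append("using " + ev["how"])
    return " ".join(parts)

example = {"time": "13:49:52 02/15/2017", "auid": "sgrubb", "uid": "root",
           "result": "successfully", "action": "renamed",
           "what": "/etc/gshadow+", "how": "/usr/sbin/userdel"}
print(to_sentence(example))
```

Fed the values from our event, this prints the same sentence shown above. All of the hard work is in producing those values, which is the normalizer's job.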

This works out pretty well in most cases. I am still looking over events and making sure that they really do state the truth about an event. I haven't made any significant changes in a couple weeks, so I think we are close to final.

Next post we will dive in deeper with the normalizer and discuss how the CSV output is created. It fully uses the auparse_normalizer API to give you everything it knows about an event.