Thursday, May 4, 2017


In this post we will resume our exploration of data science by learning a new visualization technique. Sometimes the relationships we need to show naturally form a tree. You have a root, and branches that show where things are similar and where they start to differ as you learn more about them. This kind of diagram is known as a dendrogram.

I See Trees
The Linux file system is a great introductory data structure that lends itself to being drawn as a tree. You have a root directory, then subdirectories, then subdirectories under those. Turning this into a program that draws your directory structure is remarkably simple. First, let's collect some data to explore. I want to choose the directories under the /usr directory, but not too many. So, run the following commands:

[sgrubb@x2 dendrogram]$ echo "pathString" > ~/R/audit-data/dirs.csv
[sgrubb@x2 dendrogram]$ find /usr -type d | head -20 >> ~/R/audit-data/dirs.csv
[sgrubb@x2 dendrogram]$ cat ~/R/audit-data/dirs.csv
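
If you would rather stay inside R, the same one-column CSV can probably be built without the shell. This is only a sketch: write_dir_csv is a made-up helper name, and list.dirs returns directories in a different order than find does, so the first 20 entries may not match the ones the shell commands pick.

```r
# Sketch: build the same one-column CSV from R instead of the shell.
# write_dir_csv() is a hypothetical helper, not part of any package.
write_dir_csv <- function(root, out, n = 20) {
  # With recursive = TRUE, list.dirs() returns the root plus every
  # directory below it
  dirs <- list.dirs(root, full.names = TRUE, recursive = TRUE)
  # Keep only the first n entries, like head -20
  dirs <- head(dirs, n)
  # The column must be named pathString, matching the header from echo
  write.csv(data.frame(pathString = dirs), out, row.names = FALSE)
  invisible(dirs)
}

# Example call mirroring the shell commands above (can be slow on a big tree):
# write_dir_csv("/usr", "~/R/audit-data/dirs.csv")
```

Unlike find piped into head, this walks the whole tree before trimming to 20 entries, so the shell version is the better choice for very large directories.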

OK. We have our data. It turns out that there is a package in R that is a perfect fit for this kind of data. It's called data.tree. If you do not have it installed in RStudio, go ahead and run

install.packages("data.tree")


The way that this package works is that it wants to see things represented as a string separated by a delimiter. It wants this mapping to be under a column called pathString. So, this is why we created the CSV file the way we did. The paths found by the find command have '/' as a delimiter. So, this makes our program dead simple:


# data.tree provides as.Node and ToListExplicit
library(data.tree)
# networkD3 provides diagonalNetwork
library(networkD3)

# Read in the paths
f <- read.csv("~/R/audit-data/dirs.csv", header=TRUE)

# Now convert to tree structure
l <- as.Node(f, pathDelimiter = "/")

# And now as a hierarchical list
b <- ToListExplicit(l, unname = TRUE)

# And visualize it
diagonalNetwork(List = b, fontSize = 10)

On my system, I get a picture something like this:

Programming in R is a lot like cheating. (Quote me on that.) Four actual lines of code to produce this. How many lines of C code would it take to do this? We can see how similar things are by how many nodes they share. Something that is very different branches away at the first node. Things that are closely related share many of the same nodes.
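
To make the idea of shared nodes concrete, here is a small base R sketch that counts how many leading path components two paths have in common. The shared_nodes helper is made up for illustration and is not part of data.tree. The example paths have no leading '/', so there is no empty first component.

```r
# Sketch: count how many leading path components two paths share.
# shared_nodes() is a hypothetical helper written for illustration.
shared_nodes <- function(p1, p2, delim = "/") {
  a <- strsplit(p1, delim, fixed = TRUE)[[1]]
  b <- strsplit(p2, delim, fixed = TRUE)[[1]]
  n <- min(length(a), length(b))
  if (n == 0) return(0L)
  # Compare the two paths component by component
  same <- a[seq_len(n)] == b[seq_len(n)]
  # Number of matching components before the first mismatch
  if (all(same)) n else which(!same)[1] - 1L
}

shared_nodes("usr/lib/systemd", "usr/lib/udev")   # 2: usr and lib
shared_nodes("usr/lib/systemd", "usr/share/man")  # 1: only usr
```

The higher the count, the later the two paths split apart in the diagram.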

So, let's use this to visualize how closely related some audit events are. Let's collect some audit data:

ausearch --start today --format csv > ~/R/audit-data/audit.csv

What we want to do with this is fashion the data into the same layout that we had with the directory paths. The audit data has 15 columns. To show how events are related to one another, we only want the EVENT_KIND and the EVENT columns. So our first step is to create a new data frame with just those. Next, we need to take each row and turn it into a string that has each column delimited by some character. We'll choose '/' again to keep it simple. But this time we need to create our own pathString column and fill it with the string. We will use the paste function to glue it all together. The rest of the program is just like the previous one.


audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)

# Subset to the fields we need
a <- audit[, c("EVENT_KIND", "EVENT")]

# Prepare to convert from data frame to tree structure by making a map
a$pathString <- paste("report", a$EVENT_KIND, a$EVENT, sep="/")

# Now convert to tree structure
l <- as.Node(a, pathDelimiter = "/")

# And now as a hierarchical list
b <- ToListExplicit(l, unname = TRUE)

# And visualize it
diagonalNetwork(List = b, fontSize = 10)

With my audit logs, I get a picture like this:

You can clearly see the grouping of events that are related to a user login, events related to system services, and Mandatory Access Control (mac) events among others. You can try this on bigger data sets.
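
Before trying a larger log, it can help to look at the path strings themselves. The sketch below uses made-up rows (the values are placeholders, not real audit output) with the same two column names as above. Rows that produce the same path string should land on the same node, so counting the distinct strings gives a rough idea of how many leaves the picture will have.

```r
# Made-up rows for illustration only; real values come from ausearch
toy <- data.frame(
  EVENT_KIND = c("kind-a", "kind-a", "kind-b"),
  EVENT      = c("event-1", "event-1", "event-2"),
  stringsAsFactors = FALSE
)

# Same paste call as in the program above
toy$pathString <- paste("report", toy$EVENT_KIND, toy$EVENT, sep = "/")

# How many rows fall on each distinct path
counts <- table(toy$pathString)
print(counts)
```

If the number of distinct paths is very large, the diagram will likely be too crowded to read, and subsetting the data first may help.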

So let's do one last dendrogram and call it a day. Using the same audit CSV file, suppose you wanted to show user, user session, time of day, and action. Look at the above program. You only have to change 2 lines. Can you guess which ones?

Here they are:

a <- audit[, c("SUBJ_PRIME", "SESSION", "TIME", "ACTION")]
a$pathString <- paste("report", a$SUBJ_PRIME, a$SESSION, a$TIME, a$ACTION, sep="/")

Which yields a picture like this from my data:

The dendrogram is a good diagram to show how things are alike and how they differ. Events tend to differ based on time and this can be useful in showing order of time series data. Creating a dendrogram is dead simple and is typically 6 lines of actual code. This adds one more tool to our toolbox for exploring security data.
