Tuesday, March 14, 2017

Simple Sankey Diagram

Sometimes in security we would like to see the relationship between things. This is typically done with a network diagram. R has a really nice package, networkD3, which draws network diagrams using htmlwidgets and the D3 JavaScript library. We will use it to draw a specific kind of network diagram called a Sankey diagram, sometimes also called a river plot or alluvial diagram. The idea is that the width of a line conveys the volume or size of the relationship: skinny lines mean occasional relations, fat ones mean a lot.

Subsetting
The idea of subsetting is to discard data that doesn't meet certain criteria. We can do this prior to importing into RStudio by using specific ausearch command line switches. Or we can do this in R after the data is imported. To illustrate this, let's look at a diagram that shows what user accounts transition to. Let's collect the data:

cd ~/R/audit-data
ausearch --start today --format csv > audit.csv


The program I am going to explain is written with variables whose values you can change to easily create different diagrams. A two-level Sankey diagram has boxes on the left and boxes on the right, each group stacked vertically. These boxes are called nodes. We have a left field and a right field variable to denote which columns of data we are charting.

The subsetting is also given by variables so that you can easily change the diagram without much thinking about it. To see the relationship of what accounts transition to, we want to subset the data to whenever SUBJ_PRIME != SUBJ_SEC. That means we want records whenever the uid does not match the auid. The code that does this is as follows:

library(plyr)
library(dplyr)
library(networkD3)

# This script will make a 2 level Sankey diagram based on these two variables
left_field <- as.character("SUBJ_PRIME")
right_field <- as.character("SUBJ_SEC")

# Subset the data based on this expression
operand1 <- as.character("SUBJ_PRIME")
operation <- as.character("!=")
operand2 <- as.character("SUBJ_SEC")
expr <- paste(operand1, operation, operand2)

# Read in the data and don't let strings become factors
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE, stringsAsFactors = FALSE)
audit$one <- rep(1,nrow(audit))

# Subset the audit information based on the variables
if (expr != "  ") {
  audit <- filter_(audit, expr)
}
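A note for newer installs: filter_() has since been deprecated in dplyr, so the subsetting step may print a warning. Below is a minimal sketch of the same idea in base R. It assumes the expression string is one you wrote yourself, since it gets evaluated as code. The sample data frame is made up for illustration, and apply_subset is a hypothetical helper name, not part of any package.

```r
# Made-up sample standing in for the imported audit.csv
audit_sample <- data.frame(
  SUBJ_PRIME = c("sgrubb", "system", "root", "sgrubb"),
  SUBJ_SEC   = c("root",   "root",   "root", "sgrubb"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows where the expression string is TRUE.
# A blank expression (operands and operation all "") returns the data untouched.
apply_subset <- function(df, expr) {
  if (trimws(expr) == "") {
    return(df)
  }
  # Evaluate the expression with the data frame's columns in scope
  keep <- eval(parse(text = expr), envir = df, enclos = parent.frame())
  df[!is.na(keep) & keep, , drop = FALSE]
}

expr <- paste("SUBJ_PRIME", "!=", "SUBJ_SEC")
changed <- apply_subset(audit_sample, expr)

# Three empty strings paste to two spaces, which is what the script checks for
blank_expr <- paste("", "", "")
unchanged <- apply_subset(audit_sample, blank_expr)
```

This also shows why the script compares expr against a string of two spaces: paste() joins the three empty pieces with a single space between each.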


If you run the program to this point, you can use the names command to see that we have an audit variable with 16 columns.

> names(audit)
 [1] "NODE"       "EVENT"      "DATE"       "TIME"       "SERIAL_NUM" "EVENT_KIND" "SESSION"    "SUBJ_PRIME" "SUBJ_SEC"   "ACTION"
[11] "RESULT"     "OBJ_PRIME"  "OBJ_SEC"    "OBJ_KIND"   "HOW"        "one"



The Diagram
What we need to do now is sum up the connections between SUBJ_PRIME and SUBJ_SEC. To collapse the data down to unique rows, we need to get rid of everything we don't need, because any extra column makes rows unique and prevents them from collapsing to the common values. We'll do this by making a new data frame that holds just the two fields and the count column, and passing it to a function that does the summing.

# Make a dataframe for a 2 level Sankey diagram
left = data.frame(audit[left_field], audit[right_field], audit$one)
colnames(left) = c("Source", "Target", "Num")

# Now summarize and collapse to unique values to calculate edges
l <- ddply(left, .(Source,Target), summarize, Value=sum(Num))

Let's see how it turned out. In the R console, if you type the variable 'l' (el), it outputs the current value. As we can see, we have collapsed 1200+ events down to 7 unique lines.

> l
  Source         Target Value
1    gdm           root     4
2   root                    2
3 sgrubb           root    26
4 system                  108
5 system            gdm     2
6 system           root  1059
7 system setroubleshoot    16

Where there is no value in the Target column, malformed kernel events are messing up the data. We can drop those rows with the subset function.

> l <- subset(l, Target != "")
> l
  Source         Target Value
1    gdm           root     4
3 sgrubb           root    26
5 system            gdm     2
6 system           root  1059
7 system setroubleshoot    16
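
The summing and cleanup can also be sketched in base R with aggregate(), which avoids the dependency on plyr. The data frame below is a small made-up sample, not the audit data from this post, and the names left_sample and l_sample are only for illustration.

```r
# Made-up events: one row per event, Num is always 1
left_sample <- data.frame(
  Source = c("sgrubb", "sgrubb", "system", "system", "system", "system"),
  Target = c("root",   "root",   "root",   "gdm",    "",       "root"),
  Num    = 1,
  stringsAsFactors = FALSE
)

# Sum Num for every unique Source/Target pair
l_sample <- aggregate(Num ~ Source + Target, data = left_sample, FUN = sum)
names(l_sample)[names(l_sample) == "Num"] <- "Value"

# Drop the pairs with an empty Target, same as the subset call above
l_sample <- subset(l_sample, Target != "")
```

Six events collapse to three links here, with the Value column carrying how many events each link stands for.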


To create the nodes, we want to get a list that has only unique values. Then we want to give each one a number.

nodes <- c(as.character(l$Source), as.character(l$Target))
nodes <- data.frame(unique(as.factor(nodes)))
colnames(nodes) = c("Source")
nodes$ID <- seq.int(from = 0, to = nrow(nodes) - 1)
nodes <- nodes[,c("ID","Source")]


Let's see what our data looks like now.

> nodes
  ID         Source
1  0            gdm
2  1         sgrubb
3  2         system
4  3           root
5  4 setroubleshoot


So far so good. I don't know if you noticed it, but I did something sneaky there. I renamed the column names halfway through. This is important later.

The next step is to calculate the edges, in other words, what is connected to what. We need to be able to say: draw a line from node 2 to node 3. We will use the lookup table we just created as the canonical mapping of names to numbers. The merge function combines two data frames if they have a column in common, which is why I renamed the column in the last step. A merge is like a SQL join: for each row, it matches on the value in the column that both data frames share. We will tell it to merge the collapsed data (l) with the node data on the value in the Source field. Afterwards you can see that the numbers from the nodes are now combined with the unique Sources.

edges <- merge(l,nodes,by.x = "Source")

> edges
  Source         Target Value ID
1    gdm           root     4  0
2 sgrubb           root    26  1
3 system            gdm     2  2
4 system           root  1059  2
5 system setroubleshoot    16  2


We don't need Source anymore, so we'll drop it by setting it to NULL.

edges$Source <- NULL

Now what we need to do is come up with the ID numbers for the Target column. We start by renaming the columns so that they have Target in common. Then we do the merge again.

names(edges) = c("Target","Value","Sindx")
names(nodes) = c("ID", "Target")
edges <- merge(edges,nodes,by.x = "Target")
edges$Target <- NULL

> edges
  Value Sindx ID
1     2     2  0
2     4     0  3
3    26     1  3
4  1059     2  3
5    16     2  4


The next step is to rename the columns to something nicer.

names(edges) <- c("value","source","target")
names(nodes) = c("ID", "name")


Let's see the final product.

> edges
  value source target
1     2      2      0
2     4      0      3
3    26      1      3
4  1059      2      3
5    16      2      4
> nodes
  ID           name
1  0            gdm
2  1         sgrubb
3  2         system
4  3           root
5  4 setroubleshoot
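
As a cross-check on the two merges, the same zero-based numbers can be looked up directly with match(). This is a sketch that retypes the five links shown earlier by hand. It is an alternative route, not the method used in this post, and the names l_check, node_names and edges_check are only for illustration.

```r
# The five collapsed links from the earlier output, typed in by hand
l_check <- data.frame(
  Source = c("gdm", "sgrubb", "system", "system", "system"),
  Target = c("root", "root", "gdm", "root", "setroubleshoot"),
  Value  = c(4, 26, 2, 1059, 16),
  stringsAsFactors = FALSE
)

# Unique node names in first-seen order: sources first, then targets
node_names <- unique(c(l_check$Source, l_check$Target))

# match() returns 1-based positions, so subtract 1 for zero-based IDs
edges_check <- data.frame(
  value  = l_check$Value,
  source = match(l_check$Source, node_names) - 1,
  target = match(l_check$Target, node_names) - 1
)
```

The rows come out in the order of l rather than sorted by target, but the value, source and target triples are the same as in the merged result.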


The value field in edges shows how big the relation is, and the source and target fields hold the node numbers. The nodes variable has each node's name and number. We are ready to plot. We do this by feeding the information into the sankeyNetwork function of networkD3.


sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 16)




In this diagram you can see which accounts change into which. Accounts that are logged into are on the left. What they transition to is on the right.
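
If a diagram ever comes out blank or with links going to the wrong boxes, one thing worth checking is the numbering. sankeyNetwork expects the source and target columns to be zero-based positions into the nodes data frame, which is why the IDs were numbered from 0. Here is a small sanity check, assuming the column names used in this post. check_links is a hypothetical helper written for this sketch, not part of networkD3.

```r
# Hypothetical helper: TRUE when every link points at an existing node row,
# counting rows from 0
check_links <- function(edges, nodes) {
  idx <- c(edges$source, edges$target)
  all(!is.na(idx)) && all(idx >= 0) && all(idx <= nrow(nodes) - 1)
}

nodes_ok <- data.frame(name = c("gdm", "sgrubb", "system", "root", "setroubleshoot"))
edges_ok <- data.frame(value  = c(2, 4, 26, 1059, 16),
                       source = c(2, 0, 1, 2, 2),
                       target = c(0, 3, 3, 3, 4))

ok <- check_links(edges_ok, nodes_ok)

# Shifting to 1-based IDs pushes the last node out of range
edges_bad <- transform(edges_ok, source = source + 1, target = target + 1)
bad <- check_links(edges_bad, nodes_ok)
```

Running the check on edges and nodes before calling sankeyNetwork gives a quick yes or no on whether the numbering lines up.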



Let's do one more. Change left_field and right_field to EVENT_KIND and OBJ_KIND. Make operand1, operand2, and operation all "". Then re-run. Below is what I get.




This diagram is more interesting. In the next blog post we will make an adjustment to the Sankey diagram that will make it much more interesting.
