The idea of subsetting is to discard data that doesn't meet certain criteria. We can do this before importing into RStudio by using specific ausearch command-line switches, or we can do it in R after the data is imported. To illustrate this, let's build a diagram that shows what user accounts transition to. First, let's collect the data:
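One way to collect the events into a flat file is ausearch's CSV output mode. The exact switches, date range, and output file name below are assumptions for illustration; adjust them for your system:

```shell
# Export recent audit events as CSV for analysis in R.
# The date range and file name are just examples.
ausearch --start this-month --format csv > audit.csv
```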
The program I am going to explain is written with variables whose values you can change to easily create different diagrams. A two-level Sankey diagram has boxes on the left and boxes on the right, each group stacked vertically. These are called nodes. We also have left and right field variables to denote which columns of data we are charting.
The subsetting is also controlled by variables so that you can easily change the diagram without much thought. To see the relationship of what accounts transition to, we want to subset the data to the records where SUBJ_PRIME != SUBJ_SEC. That means we want the records where the uid does not match the auid. The code that does this is as follows:
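A minimal sketch of that step, using a tiny hand-made data frame in place of the real import (the real data would come from read.csv("audit.csv") after the ausearch export; the rows, and the variable names left_field, right_field, operand1, operand2, and operation, are my stand-ins for the program's variables):

```r
# Toy stand-in for the imported audit data.
audit <- data.frame(
  SUBJ_PRIME = c("sgrubb", "sgrubb", "root"),   # auid: who logged in
  SUBJ_SEC   = c("sgrubb", "root",   "root"),   # uid: who the process runs as
  stringsAsFactors = FALSE
)

# Variables that control the diagram.
left_field  <- "SUBJ_PRIME"   # left-hand nodes
right_field <- "SUBJ_SEC"     # right-hand nodes

# Variables that control the subsetting; setting operation to "" disables it.
operand1  <- "SUBJ_PRIME"
operand2  <- "SUBJ_SEC"
operation <- "!="

# Keep only the records where the two operands differ.
if (operation == "!=")
  audit <- subset(audit, audit[[operand1]] != audit[[operand2]])
```

With the sample rows above, only the sgrubb-to-root transition survives the subset.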
If you run the program to this point, you can use the names command to see that we have an audit variable with 16 columns.
What we need to do now is sum up the connections between SUBJ_PRIME and SUBJ_SEC. To collapse the data down to unique rows, we need to get rid of everything we don't need; excess information makes rows unique and prevents collapsing to the common values. We'll do this by making a new data frame that holds just that information and passing it to a function that does the summing.
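The collapsing step might look like the sketch below. I'm using base R's aggregate here as one way to do the summing (the original program may use a different helper), over an invented link table:

```r
# Invented rows standing in for the subsetted audit data.
audit <- data.frame(
  SUBJ_PRIME = c("sgrubb", "sgrubb", "sgrubb", "joe"),
  SUBJ_SEC   = c("root",   "root",   "setroubleshoot", "root"),
  stringsAsFactors = FALSE
)
left_field  <- "SUBJ_PRIME"
right_field <- "SUBJ_SEC"

# New data frame holding only the two columns we are charting.
t <- data.frame(Source = audit[[left_field]],
                Target = audit[[right_field]],
                stringsAsFactors = FALSE)

# Collapse duplicate rows, counting how many events each unique pair had.
l <- aggregate(list(value = rep(1, nrow(t))),
               by = list(Source = t$Source, Target = t$Target),
               FUN = sum)
```

The four sample events collapse to three unique Source/Target pairs, with the repeated sgrubb-to-root pair carrying a value of 2.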
Let's see how it turned out. In the R console, if you type the variable 'l' (el), it outputs the current value. As we can see, we have collapsed 1200+ events down to 7 unique lines.
In the cases where there is no value in the Target column, these are malformed kernel events that are messing up the data. We can drop those with the subset function.
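A sketch of that cleanup, assuming the collapsed table l has Source, Target, and value columns and the malformed rows carry an empty string in Target:

```r
# Invented collapsed link table with one malformed (empty-Target) row.
l <- data.frame(Source = c("sgrubb", "sgrubb", "unset"),
                Target = c("root", "", "root"),
                value  = c(12, 1, 2),
                stringsAsFactors = FALSE)

# Drop the malformed kernel events that have no Target value.
l <- subset(l, Target != "")
```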
To create the nodes, we want to get a list that has only unique values. Then we want to give each one a number.
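Building the node lookup table could look like this: gather the unique names from both columns, then number them starting at zero, since networkD3 indexes nodes from zero. Naming the first column Source matters for the merge coming up (the sample rows are invented):

```r
# Invented collapsed link table.
l <- data.frame(Source = c("sgrubb", "sgrubb"),
                Target = c("root", "setroubleshoot"),
                value  = c(12, 3),
                stringsAsFactors = FALSE)

# Unique node names drawn from both sides of the relation.
nodes <- data.frame(Source = unique(c(l$Source, l$Target)),
                    stringsAsFactors = FALSE)

# Number each node, starting at 0 because networkD3 uses zero-based indices.
nodes$ID <- seq.int(from = 0, to = nrow(nodes) - 1)
```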
Let's see what our data looks like now:
So far so good. I don't know if you noticed it, but I did something sneaky there. I renamed the column names halfway through. This is important later.
The next step is to calculate the edges; in other words, what is connected to what. We need to be able to say "draw a line from node 2 to node 3." We will use the lookup table we just created as the canonical source of name-to-number mappings. We can use the merge function to combine two data frames when they have a column in common, which is why I renamed the column in the last step. A merge is like doing a SQL join: rows are matched up on the column the two data frames have in common. We will tell it to merge the link data with the node data for each value in the Source field. Afterwards you can see that the numbers from the nodes are now combined with the unique Sources.
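The first merge, sketched with the same kind of invented tables as before: because both data frames have a Source column, merge matches rows on it and each link row picks up its source node's ID:

```r
# Invented link table and node lookup table.
l <- data.frame(Source = c("sgrubb", "sgrubb"),
                Target = c("root", "setroubleshoot"),
                value  = c(12, 3),
                stringsAsFactors = FALSE)
nodes <- data.frame(Source = c("sgrubb", "root", "setroubleshoot"),
                    ID = 0:2,
                    stringsAsFactors = FALSE)

# Join on the common Source column, like a SQL join.
l <- merge(l, nodes, by = "Source")
```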
We don't need the Source column anymore, so we'll drop it by setting it to NULL.
Now we need to come up with the ID numbers for the Target column. We start by renaming the node table's columns so that the two data frames have Target in common. Then we do the merge again.
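Continuing the sketch: drop Source, rename the node table's name column to Target, and merge again so each link also picks up its target node's ID. Note that because both frames now carry an ID column, merge disambiguates them as ID.x and ID.y:

```r
# State after the first merge (invented values): source IDs already attached.
l <- data.frame(Source = c("sgrubb", "sgrubb"),
                Target = c("root", "setroubleshoot"),
                value  = c(12, 3),
                ID     = c(0, 0),
                stringsAsFactors = FALSE)
nodes <- data.frame(Source = c("sgrubb", "root", "setroubleshoot"),
                    ID = 0:2,
                    stringsAsFactors = FALSE)

l$Source <- NULL                      # the source name is no longer needed
names(nodes) <- c("Target", "ID")     # make Target the common column
l <- merge(l, nodes, by = "Target")   # both frames have ID, so we get ID.x/ID.y
```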
The next step is to rename the columns to something nicer.
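The renaming could be as simple as assigning names, so the edge table ends up with the source, target, and value column names we will hand to the plotting call later (the exact names here are my choice):

```r
# State after the second merge (invented values).
l <- data.frame(Target = c("root", "setroubleshoot"),
                value  = c(12, 3),
                ID.x   = c(0, 0),
                ID.y   = c(1, 2),
                stringsAsFactors = FALSE)

# Nicer names: keep value, and call the node-number columns source and target.
names(l) <- c("target_name", "value", "source", "target")
edges <- l
```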
Let's see the final product:
The edges' value field shows how big the relation is, and we have the source and target node numbers. The nodes variable has each node's name and its node number. We are ready to plot. We do this by feeding the information into the sankeyNetwork function of networkD3.
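Putting it together might look like the sketch below. The argument names are networkD3's real sankeyNetwork parameters, but the sample data, the name column on the node table, and the fontSize choice are mine:

```r
library(networkD3)

# Invented edge and node tables in the shape built up above.
edges <- data.frame(value = c(12, 3), source = c(0, 0), target = c(1, 2))
nodes <- data.frame(name = c("sgrubb", "root", "setroubleshoot"),
                    stringsAsFactors = FALSE)

# Draw the two-level Sankey diagram; source/target are zero-based
# row indices into the nodes table.
sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 14)
```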
In this diagram you can see what account changes into another. Accounts that are logged into are on the left; what gets transitioned to is on the right.
Let's do one more. Change left_field and right_field to EVENT_KIND and OBJ_KIND. Make operand1, operand2, and operation all "". Then re-run. Below is what I get.
This diagram is more interesting. In the next blog we will make an adjustment to the Sankey Diagram that will make it much more interesting.