Thursday, March 16, 2017

The Three Level Sankey Diagram

A couple of days ago we talked about the 2 level Sankey Diagram. The chart shows the relationship between items with a line whose width reflects the strength or volume of that relationship. This is useful for gauging how often something happens. I promised a tweak to the diagram that will make it even more useful. Here it is...

The three level Sankey Diagram
In the last blog I wanted to give you a detailed explanation of how the technique works. To make the explanation clear, I gave you a tool that is somewhat limited, but it is easier to follow along with than a 3 level diagram.

The way that you make a 3 level diagram is to have a left, middle, and right column of data. You make the connections between left and middle, then between middle and right, and then you plot it. So, rather than tell you again how it works, I'm just going to give you the code so that you have it and can play with it. Start up RStudio, open a new file, and copy the following into it. I will assume at this point you know how to make the audit.csv file.
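
If you need a refresher, it is the same collection step used throughout this series:

cd ~/R/audit-data
ausearch --start today --format csv > audit.csv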

library(plyr)
library(dplyr)
library(networkD3)

# This script will make a 3 level Sankey diagram
left_field <- as.character("EVENT")
right_field <- as.character("RESULT")
middle_field <- as.character("OBJ_KIND")

# Subset the data based on this expression
operand1 <- as.character("")
operation <- as.character("")
operand2 <- as.character("")
expr <- paste(operand1, operation, operand2)

# Read in the data and don't let strings become factors
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE, stringsAsFactors = FALSE)
audit$one <- rep(1,nrow(audit))

# Subset the audit information depending on the expression
# (paste() of three empty strings yields "  ", meaning no filter was given)
if (expr != "  ") {
  audit <- filter_(audit, expr)
}

# Make 2 dataframes for a 3 level Sankey
left = data.frame(audit[left_field], audit[middle_field], audit$one)
colnames(left) = c("Source", "Target", "Num")
right = data.frame(audit[middle_field], audit[right_field], audit$one)
colnames(right) = c("Source", "Target", "Num")

# Now summarize and collapse to unique values to calculate edges
l <- ddply(left, .(Source,Target), summarize, Value=sum(Num))
r <- ddply(right, .(Source,Target), summarize, Value=sum(Num))

# Calculate Nodes lookup table
nodes <- c(as.character(l$Source), as.character(l$Target), as.character(r$Target))
nodes <- data.frame(unique(as.factor(nodes)))
colnames(nodes) = c("Source")
nodes$ID <- seq.int(from = 0, to = nrow(nodes) - 1)
nodes <- nodes[,c("ID","Source")]

# Now map Node lookup table numbers to source and target
# Merge index onto Source
l_edges <- merge(l, nodes, by.x = "Source")
l_edges$source = l_edges$ID
r_edges <- merge(r, nodes, by.x = "Source")
r_edges$source = r_edges$ID

# Merge index onto Target
names(nodes) = c("ID", "Target")
l_edges2 <- l_edges[,c("Source","Target","Value","source")]
r_edges2 <- r_edges[,c("Source","Target","Value","source")]
l_edges <- merge(l_edges2, nodes, by.x = "Target")
r_edges <- merge(r_edges2, nodes, by.x = "Target")

# Rename everything so it's nice and neat
names(l_edges) <- c("osrc", "otgt", "value", "source", "target")
names(r_edges) <- c("osrc", "otgt", "value", "source", "target")
names(nodes) = c("ID", "name")

# Combine into one big final data frame
edges <- rbind(l_edges, r_edges)

sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 16, nodeWidth = 30,
              height = 2000, width = 1800)


There are 3 variables, left_field, middle_field, and right_field, that define what will get charted. This program has them set to "EVENT", "OBJ_KIND", and "RESULT". (If you need a reminder of what fields are available and what they mean, see this blog post.) This shows you what kind of objects are associated with each event and what the final result is. This is what it looks like on my system:




I would also recommend playing around with it. Here are some other field combinations worth trying; a sketch showing how to set one of them follows the list:

ACTION, RESULT, OBJ_KIND

HOW, OBJ_KIND, RESULT

SUBJ_PRIME, RESULT, OBJ_KIND

EVENT, ACTION, OBJ_KIND
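
For example, to try the first combination with ACTION on the left, RESULT in the middle, and OBJ_KIND on the right, the only edits are the three field variables at the top of the script:

left_field <- as.character("ACTION")
middle_field <- as.character("RESULT")
right_field <- as.character("OBJ_KIND")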




You might see something in your events that you hadn't noticed before. One last tip: too much data makes it hard to see things. You might experiment with subsetting the information, either by adding command line switches to ausearch so that it narrows down what is collected, or by using R code.
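
On the R side, the subsetting hooks are already at the top of the script. As a sketch of how you might use them (the value 'failed' is an assumption about what your RESULT column contains, so check your own data first), you could chart only the failed events like this:

operand1 <- as.character("RESULT")
operation <- as.character("==")
operand2 <- as.character("'failed'")

With those set, the filter_ call near the top of the script drops every other record before the diagram is built.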

Wednesday, March 15, 2017

Account Names

In the last blog we looked at the 2 level Sankey Diagram. I had promised to show you an update to it that makes it much more useful, but I decided to save that until next time. The reason is that I thought of a common security problem that warrants discussion. Since this is also a security blog, I thought we could take a pause from the data science and talk security.

The problem
If you have a computer on the internet, it's just a fact of life that your system is constantly being pounded by people trying to log in. So, which accounts do they go after?

To answer that question, we need a script to gather some information.

#!/bin/sh
# Collect the account names from failed login attempts recorded in the
# btmp logs and write them out as a one-column csv file.
tfile="acct.csv"
echo "ACCT" > $tfile

for log in /var/log/btmp*
do
        # lastb lists failed logins; drop its two-line footer and keep
        # just the account name column
        lastb -w -f $log | head -n -2 | awk '{ printf "%s\n", $1 }' >> $tfile
done


Save it and run it as root.

[root@webserver]# vi get-acct
[root@webserver]# chmod +x ./get-acct
[root@webserver]# ./get-acct
[root@webserver]# wc -l acct.csv
942453 acct.csv

OK. So, let's pull this into RStudio for some analysis. What we want to do is see which accounts are tried most often. Let's run the following script:

library(dplyr)

# Read the collected account names and add a column of ones for counting
a <- read.csv("~/R/audit-data/acct.csv", header=TRUE)
a$one <- rep(1,nrow(a))

# Count the attempts per account and sort with the most attacked first
acct <- aggregate(a$one, by=list(a$ACCT), FUN=length)
colnames(acct) = c("acct", "tries")
account <- arrange(acct, desc(tries))


Run it. Now let's see what the 50 most popular accounts are:

> head(account, n=50)
            acct  tries
1           root 866755
2        support  10315
3          admin   7856
4           user   1904
5           ubnt   1469
6              a   1436
7           test   1324
8          guest    966
9       postgres    525
10            pi    496
11        oracle    463
12       ftpuser    425
13       service    369
14        nagios    314
15       monitor    287
16 administrator    246
17        backup    240
18           git    213
19     teamspeak    212
20          sshd    193
21       manager    184
22     minecraft    180
23           ftp    179
24         super    169
25       student    168
26        ubuntu    166
27        tomcat    160
28         ADMIN    158
29        zabbix    158
30           ts3    154
31      testuser    152
32          uucp    150
33           adm    145
34      operator    144
35      PlcmSpIp    144
36          alex    140
37    teamspeak3    140
38        client    138
39       default    138
40          info    126
41        telnet    126
42           www    124
43        hadoop    123
44        upload    120
45           fax    118
46            ts    118
47     webmaster    118
48       richard    112
49        debian    110
50      informix    110


So, what does this mean?

1) Do not allow root logins. Ever. Period.
2) Do not make an account based on a job function.
3) Do not make an account based on a service name.
4) Make sure all service accounts cannot be logged into.
5) Do not make an account based on your first name.
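
Staying in the R console for a moment, one quick cross-check is whether any of the most-attacked names exist as real accounts on the machine. This is just a sketch and assumes R is running on the host in question (or has a copy of its /etc/passwd):

# Compare the most-attacked names against the local account names
passwd <- read.table("/etc/passwd", sep=":", quote="", comment.char="", stringsAsFactors=FALSE)
intersect(head(account$acct, 50), passwd$V1)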

So, how can we check for active accounts on the system? First, let's make sure everything uses shadowed passwords:

# awk -F: '$2 != "x" { print $1 }' < /etc/passwd


Any problems here should be fixed. Next we can check which accounts are active:

# egrep -v '.*:\*|:\!' /etc/shadow | awk  -F: '{ print $1 }'


If you see any services or simple names listed and the system is hooked up to the internet 24x7, you might want to look into it. If you use two-factor authentication or keys, then you are also likely in good shape.
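
If a service account does turn up with a usable password, locking it down is usually a one-liner. As a sketch (the account name is a placeholder, and /sbin/nologin is the path on Fedora/RHEL style systems):

# usermod -L -s /sbin/nologin SERVICE_ACCOUNT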

The real point of this was to show you how to check what accounts are getting hammered the hardest by people trying to get in.
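
And since this series is about charting, you can of course chart the result too. Here is a quick sketch with ggplot2 that continues from the account data frame built above (the top 20 cutoff is arbitrary):

library(ggplot2)

# Bar chart of the 20 most attacked account names
top <- head(account, n=20)
plot1 <- ggplot(top, aes(x=reorder(acct, tries), y=tries)) +
  geom_bar(stat="identity") + coord_flip() +
  ggtitle("Most Attacked Accounts") + labs(x="Account", y="Attempts")
print(plot1)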

Tuesday, March 14, 2017

Simple Sankey Diagram

Sometimes in security we would like to see the relationship between things. This is typically done with a network diagram. R has a really nice package, networkD3, which draws network diagrams using html widgets and the D3 javascript library. We will use it to draw a specific kind of network diagram called a Sankey diagram. The Sankey diagram is sometimes called a river plot or alluvial diagram. The idea is that the width of a line conveys the volume or size of the relationship: skinny lines represent occasional relations, fat ones represent a lot of them.

Subsetting
The idea of subsetting is to discard data that doesn't meet certain criteria. We can do this prior to importing into RStudio by using specific ausearch command line switches. Or we can do this in R after the data is imported. To illustrate this, let's look at a diagram that shows what user accounts transition to. Let's collect the data:

cd ~/R/audit-data
ausearch --start today --format csv > audit.csv


The program that I am going to explain is written with variables whose values you can change to easily create different diagrams. A two level Sankey diagram has boxes on the left and boxes on the right, each group stacked vertically. These are called nodes. We also have a left and right field variable to denote which columns of data we are charting.

The subsetting is also controlled by variables so that you can easily change the diagram without thinking too much about it. To see what accounts transition to, we want to subset the data to records where SUBJ_PRIME != SUBJ_SEC. That means we want records where the uid does not match the auid. The code that does this is as follows:

library(plyr)
library(dplyr)
library(networkD3)

# This script will make a 2 level Sankey diagram based on these two variables
left_field <- as.character("SUBJ_PRIME")
right_field <- as.character("SUBJ_SEC")

# Subset the data based on this expression
operand1 <- as.character("SUBJ_PRIME")
operation <- as.character("!=")
operand2 <- as.character("SUBJ_SEC")
expr <- paste(operand1, operation, operand2)

# Read in the data and don't let strings become factors
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE, stringsAsFactors = FALSE)
audit$one <- rep(1,nrow(audit))

# Subset the audit information based on the variables
# (paste() of three empty strings yields "  ", meaning no filter was given)
if (expr != "  ") {
  audit <- filter_(audit, expr)
}


If you run the program to this point, you can use the names command to see that we have an audit variable with 16 columns.

> names(audit)
 [1] "NODE"       "EVENT"      "DATE"       "TIME"       "SERIAL_NUM" "EVENT_KIND" "SESSION"    "SUBJ_PRIME" "SUBJ_SEC"   "ACTION"
[11] "RESULT"     "OBJ_PRIME"  "OBJ_SEC"    "OBJ_KIND"   "HOW"        "one"



The Diagram
What we need to do now is sum up the connections between SUBJ_PRIME and SUBJ_SEC. In order to collapse the data down to unique rows, we need to get rid of every column we don't need; excess information makes rows unique and prevents collapsing to the common values. We'll do this by making a new data frame that holds just the needed columns and passing it to a function that does the summing.

# Make a dataframe for a 2 level Sankey diagram
left = data.frame(audit[left_field], audit[right_field], audit$one)
colnames(left) = c("Source", "Target", "Num")

# Now summarize and collapse to unique values to calculate edges
l <- ddply(left, .(Source,Target), summarize, Value=sum(Num))

Let's see how it turned out. In the R console, if you type the variable 'l' (el), it outputs the current value. As we can see, we have collapsed 1200+ events down to 7 unique lines.

> l
  Source         Target Value
1    gdm           root     4
2   root                    2
3 sgrubb           root    26
4 system                  108
5 system            gdm     2
6 system           root  1059
7 system setroubleshoot    16

In the cases where there is no value in the Target column, these are malformed kernel events that are messing up the data. We can drop those rows with the subset function.

> l <- subset(l, Target != "")
> l
  Source         Target Value
1    gdm           root     4
3 sgrubb           root    26
5 system            gdm     2
6 system           root  1059
7 system setroubleshoot    16


To create the nodes, we want to get a list that has only unique values. Then we want to give each one a number.

nodes <- c(as.character(l$Source), as.character(l$Target))
nodes <- data.frame(unique(as.factor(nodes)))
colnames(nodes) = c("Source")
nodes$ID <- seq.int(from = 0, to = nrow(nodes) - 1)
nodes <- nodes[,c("ID","Source")]


Let's see what our data looks like now

> nodes
  ID         Source
1  0            gdm
2  1         sgrubb
3  2         system
4  3           root
5  4 setroubleshoot


So far so good. I don't know if you noticed it, but I did something sneaky there. I renamed the column names halfway through. This is important later.

The next step is to calculate the edges. In other words, what is connected to what; we need a way to say "draw a line from node 2 to node 3." We will use the lookup table we just created as the canonical mapping of names to numbers. We can use the merge function to combine two data frames if they have a column in common, which is why I renamed the column in the last step. A merge is like doing a SQL join: rows are matched on the column the two data frames share. We will tell it to merge the left data with the node data for each value in the Source field. Afterwards you can see that the numbers from the nodes are now combined with the unique Sources.

edges <- merge(l,nodes,by.x = "Source")

> edges
  Source         Target Value ID
1    gdm           root     4  0
2 sgrubb           root    26  1
3 system            gdm     2  2
4 system           root  1059  2
5 system setroubleshoot    16  2


We don't need the Source column anymore, so we'll drop it by setting it to NULL.

edges$Source <- NULL

Now what we need to do is come up with the ID numbers for the Target column. We start by renaming the columns so that they have Target in common. Then we do the merge again.

names(edges) = c("Target","Value","Sindx")
names(nodes) = c("ID", "Target")
edges <- merge(edges,nodes,by.x = "Target")
edges$Target <- NULL

>edges
  Value Sindx ID
1     2     2  0
2     4     0  3
3    26     1  3
4  1059     2  3
5    16     2  4


The next step is to rename the columns to something nicer.

names(edges) <- c("value","source","target")
names(nodes) = c("ID", "name")


Let's see the final product:

> edges
  value source target
1     2      2      0
2     4      0      3
3    26      1      3
4  1059      2      3
5    16      2      4
> nodes
  ID           name
1  0            gdm
2  1         sgrubb
3  2         system
4  3           root
5  4 setroubleshoot


The value field of edges shows how big each relation is, and we have the source and target node numbers. The nodes variable has each node's name and number. We are ready to plot. We do this by feeding the information into the sankeyNetwork function of networkD3.


sankeyNetwork(Links = edges, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 16)




In this you can see which accounts change into which others. Accounts that are logged into will be on the left. What gets transitioned to will be on the right.



Let's do one more. Change left_field and right_field to EVENT_KIND and OBJ_KIND. Make operand1, operand2, and operation all "". Then re-run. Below is what I get.
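
In other words, the only edits are to the variables at the top of the script:

left_field <- as.character("EVENT_KIND")
right_field <- as.character("OBJ_KIND")

operand1 <- as.character("")
operation <- as.character("")
operand2 <- as.character("")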




This diagram is more interesting. In the next blog we will make an adjustment to the Sankey diagram that makes it even more useful.

Thursday, March 9, 2017

Heatmaps

In the last blog post, we learned about making bar charts. Sometimes we are looking for patterns in the data, and a heatmap is the best choice for that. We'll go over making one of those, and I think you'll see that it's super easy.

New data
To start with, we need data for the last week. This is a bit different from our other exercises. We can get the data like this:

# cd R/audit-data
# ausearch --start week-ago --format csv > audit.csv


The code
The program to do the heatmap is almost identical to what we did for the bar charts. If you want an explanation of this part, go check out the last blog post. You can copy and paste this into the console section of RStudio to run it, or paste it into a new file. The code is as follows:

library(ggplot2)

audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1,nrow(audit))

# Create time series data frame for aggregating
audit$posixDate <- as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")

# Create a column of hour and date to aggregate an hourly total.
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')

# Now summarize it
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)

# Put results into data frame for plotting
final = data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x
final$day <- weekdays(as.Date(final$date))
final$oday <- factor(final$day, levels = unique(final$day))
final$hour <- as.numeric(format(final$date, "%H"))


OK, now we are ready for the plot function to create the chart. In this one we want to set the x-axis to the day of the week and the y-axis to the hour of the day. We also tell the aesthetic function that we want the color of each tile to indicate the count of events during that hour. Next, we tell ggplot that we want tiles, set the continuous color scale and give it a legend label, use the black and white theme, set the y-axis limits and breaks so it covers all 24 hours, create a title, and add some axis labels.

# Plot as heat map: x is day of week, y is hour, darkness is number of events
plot1 = ggplot(final, aes(x=final$oday, y=final$hour, fill=final$num)) +
  geom_tile() +
  scale_fill_continuous(low="#F7FBFF", high="#2171B5", name="Events") +
  theme_bw() + scale_y_continuous(limits=c(0, 23), breaks=seq(0, 23, 4)) +
  ggtitle("Events per Hour") + labs(x="Day of Week", y="Hour of Day")

# Display the chart
print(plot1)


This makes a chart similar to this:



And just for fun, try it with plotly:

library(plotly)
ggplotly(plot1)


Now you can hover the mouse over the tiles and see the actual value.

As for what the chart reveals, you can see that the system is turned on at nearly the same time every day, while the time it's turned off varies. You might use this to spot activity during off hours that would be harder to see any other way. You might also see that there is heavy activity at the same hour every day and use that as a starting point to investigate further.
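
If you do spot an hour worth investigating, you can drill back into the raw events from the same data frame. Here is a minimal sketch (the hour string is just an example; use one from your own chart):

# Pull out the raw audit rows for one suspicious hour
suspect <- subset(audit, hours == "2017-03-08 03")
head(suspect[, c("TIME", "EVENT_KIND", "SUBJ_PRIME", "ACTION", "RESULT")])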

Summary
Heatmaps are really just a different set of options to the plot function. The data is prepared the same way, but it's organized differently. You can look around on-line for ggplot tutorials if you want to investigate customizing this even more.
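
As a small example of that kind of customization, here is the same heat map with the black-to-red fill gradient from the bar chart post instead of the blue scale. This is just a sketch and reuses the final data frame built above:

# Same heat map, different fill gradient
plot2 = ggplot(final, aes(x=oday, y=hour, fill=num)) +
  geom_tile() +
  scale_fill_gradient(low="black", high="red", name="Events") +
  theme_bw() + scale_y_continuous(limits=c(0, 23), breaks=seq(0, 23, 4)) +
  ggtitle("Events per Hour") + labs(x="Day of Week", y="Hour of Day")
print(plot2)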

Wednesday, March 1, 2017

Bar Charts

Let's take a look at how to do some quantitative charting of the audit events. Sometimes it's good to get an idea of how many events are happening in a given time interval. We will take a look at how to create a bar chart summarizing the hourly number of events.

Bar Charts
Go ahead and create some events for us to experiment with:

# cd ~/R/audit-data
# ausearch --start today --format csv > audit.csv


Then go into RStudio, open a new file, and put the following in:

library(ggplot2)
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1,nrow(audit))


What this does is load a plotting library into the workspace and read the audit csv file that we just created into a variable that we will call audit. By saying header=TRUE, we are telling R to use the first line of the csv file to name the columns of data. Next we add a column of ones that we will use later for summing. Let's stop here for a second and examine things.

In RStudio, place the cursor on the library line. Then click on "Run" to single step through the program. After executing the last line, change to the console screen below. In the console, type "audit". This will cause the contents of audit to be echoed to the screen. To see the column names, use the names function in the RStudio console:

> names(audit)
 [1] "NODE"       "EVENT"      "DATE"       "TIME"       "SERIAL_NUM" "EVENT_KIND" "SESSION"    "SUBJ_PRIME" "SUBJ_SEC"   "ACTION"     "RESULT"
[12] "OBJ_PRIME"  "OBJ_SEC"    "OBJ_KIND"   "HOW"        "one"


We can see all the columns that we discussed in the last blog, plus the column we just added. R has a native representation of time that combines the date and time. We have the two separated, so we need to build that native representation first and then follow up with a function that splits off just the date and hour so we can summarize on it. Add the next lines to the program:


# Create time series data frame for aggregating
audit$posixDate=as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")

# Create a column of hour and date to aggregate an hourly total.
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')

Next we want R to count up all the events grouped by the hour in which they occurred, and store the results in a temporary variable. We do this because the plotting function wants a single data frame containing everything it needs; we just tell it which columns to use. So, we add the next line to the program:

# Now summarize it
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)


Then we build the data frame for plotting:

# Put results into data frame for plotting
final = data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x
final$day <- weekdays(as.Date(final$date))
final$oday <- factor(final$day, levels = unique(final$day))
final$hour <- as.numeric(format(final$date, "%H"))


Now, it's time to plot. ggplot can plot many kinds of charts; all you need to do is change a couple of the arguments and you get an entirely different chart. What we will do is tell the plot to use aesthetics that put the time on the x-axis and the hourly totals on the y-axis. Then we tell it we want a bar chart, give it a nice title, scale the x-axis as date/time, and add x and y axis labels. This is coded up as follows:

# Plot as BAR CHART: x is the hour, y is number of events
plot1 = ggplot(final, aes(x=final$date, y=final$num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + labs(x="Hour of Day", y="Events")


To see it, we need to print it:

print(plot1)


You can run all the lines down to here and you get something like this:




What the print statement drew was a picture. What if you wanted an interactive web page? In the console add these:

library(plotly)
ggplotly(plot1)


Over in the viewer you now have a web page that tells you the total when you place the mouse over the columns. Hmm, our chart is gray. How about adding some color? Better yet, how about making the color vary with how many events we have? Let's type this in at the console:

plot1 = ggplot(final, aes(x=final$date, y=final$num, fill=final$num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + labs(x="Hour of Day", y="Events") +
  scale_fill_gradient(low="black", high="red", name="Events/Hour")

print(plot1)


All we did was tell the plot's aesthetics that we want the fill color to be based on the number of events, and then tell it what gradient to use. When you run that, you can see that we now have colors:



If you want that as a web page, you can re-run the ggplotly(plot1) command. If for some reason you want to save the image as a file on your hard drive, just run the following command:

ggsave("bar-hourly-events-per-hour.png", plot1, width=7, height=5)


In this way you can automate the creation of charts and then pull them all together into a report.
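
As a rough sketch of that idea (the file name and the notion of running it from a scheduler are just examples, not part of the posts above), you could put the whole thing into one script and render the chart non-interactively with Rscript:

# report.R -- rebuild the hourly bar chart and save it to disk.
# Run it non-interactively with:  Rscript report.R
library(ggplot2)

audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1, nrow(audit))

# Same hourly aggregation as above
audit$posixDate <- as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)
final <- data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x

# Same colored bar chart as above
plot1 <- ggplot(final, aes(x=date, y=num, fill=num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + labs(x="Hour of Day", y="Events") +
  scale_fill_gradient(low="black", high="red", name="Events/Hour")

# Write the chart out as a png instead of displaying it
ggsave("bar-hourly-events-per-hour.png", plot1, width=7, height=5)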