Wednesday, March 1, 2017

Bar Charts

Let's take a look at how to do some quantitative charting of the audit events. Sometimes it's good to get an idea of how many events are happening in a given time interval. We will take a look at how to create a bar chart summarizing the hourly number of events.

Go ahead and create some events for us to experiment with:

# cd ~/R/audit-data
# ausearch --start today --format csv > audit.csv


Then go into RStudio, open a new file, and put the following in:

library(ggplot2)
audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1,nrow(audit))


What this does is load a plotting library into the workspace and read the audit csv file that we just created into a variable that we will call audit. By saying header=TRUE, we are telling R to use the first line of the csv file to name the columns of data. Next we add a column of ones that we will use later for summing. Let's stop here for a second and examine things.

In RStudio, place the cursor on the library line. Then click on "Run" to single-step the program. After executing the last line, change to the console screen below. In the console screen type in "audit". This will cause the contents of audit to be echoed to the screen. To see the column names, use the names function in the RStudio console:

> names(audit)
 [1] "NODE"       "EVENT"      "DATE"       "TIME"       "SERIAL_NUM" "EVENT_KIND" "SESSION"    "SUBJ_PRIME" "SUBJ_SEC"   "ACTION"     "RESULT"
[12] "OBJ_PRIME"  "OBJ_SEC"    "OBJ_KIND"   "HOW"        "one"
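
To see what header=TRUE and the column of ones do in isolation, here is a minimal sketch. It writes a tiny made-up csv file to a temporary location (the rows and the three columns are invented for illustration and are not real ausearch output), then reads it back the same way the program above does.

```r
# Write a tiny, made-up CSV to a temporary file (not real ausearch output)
path <- tempfile(fileext = ".csv")
writeLines(c("DATE,TIME,EVENT",
             "03/01/2017,09:15:02,USER_LOGIN",
             "03/01/2017,09:47:30,SYSCALL"), path)

# header=TRUE takes the column names from the first line of the file
toy <- read.csv(path, header = TRUE)

# Add a constant column of ones, one per row
toy$one <- rep(1, nrow(toy))

# Show the column names and a compact summary of each column
print(names(toy))
str(toy)
```

The names come straight from the first line of the file, with the new column appended at the end, which is the same pattern as the names(audit) output above.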


We can see all the columns that we discussed in the last blog. You can also see the column that we just added. R has an internal representation of time that combines dates and time. We have the two separated but we need that native representation so that we can follow up with a function that splits off just the hour so that we can summarize on it. Add the next lines to the program:


# Create time series data frame for aggregating
audit$posixDate=as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")

# Create a column of hour and date to aggregate an hourly total.
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')
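
As a quick check of what those two lines do, here is a small sketch on made-up values laid out like the DATE and TIME columns. The variable names and the tz="UTC" argument are assumptions added only to make the example reproducible. The program above does not set a time zone at this step.

```r
# Toy values in the same layout as the DATE and TIME columns (made up)
date_col <- c("03/01/2017", "03/01/2017", "03/01/2017")
time_col <- c("09:15:02", "09:47:30", "10:05:11")

# Paste the two strings together and parse them into R's native date-time
# class. tz = "UTC" is an assumption for reproducibility only.
stamp <- as.POSIXct(paste(date_col, time_col),
                    format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Truncate each timestamp to its hour by formatting without minutes/seconds
hours <- format(stamp, format = "%Y-%m-%d %H")
print(hours)
# [1] "2017-03-01 09" "2017-03-01 09" "2017-03-01 10"
```

Two events fall in the 09 hour and one in the 10 hour, so the hour string is what we can group on.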

Next we want R to count up all the events grouped by what hour it is. We want to store the results in a temporary variable. We do this because the plotting function wants a data frame with everything it needs in the data frame and we just tell it what column contains it. So, we add the next line to the program:

# Now summarize it
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)
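
Here is a small sketch of what aggregate returns, using a made-up data frame in place of audit. It also shows that, for a column of ones, counting the rows with length gives the same totals as adding them up with sum, so either function works here.

```r
# Made-up stand-in for the audit data: three events across two hours
toy <- data.frame(
  hours = c("2017-03-01 09", "2017-03-01 09", "2017-03-01 10"),
  one   = 1
)

# Count rows per hour, as the program above does
by_length <- aggregate(toy$one, by = list(toy$hours), FUN = length)

# Sum the column of ones per hour; same totals for this data
by_sum <- aggregate(toy$one, by = list(toy$hours), FUN = sum)

# The result has a Group.1 column (the hour) and an x column (the total)
print(by_length)
```

The Group.1 and x column names are the defaults aggregate assigns when given an unnamed grouping list, which is why the next step reads temp$Group.1 and temp$x.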


Then we build the data frame for plotting:

# Put results into data frame for plotting
final = data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x
final$day <- weekdays(as.Date(final$date))
final$oday <- factor(final$day, levels = unique(final$day))
final$hour <- as.numeric(format(final$date, "%H"))


Now, it's time to plot. ggplot can plot many kinds of charts. All you need to do is change a couple of the variables passed and you get an entirely different chart. What we will do is tell the plot to use aesthetics that put the time on the x axis and the hourly total on the y axis. Then we tell it we want a bar chart, give it a nice title, and add x and y axis labels. This is coded up as follows:

# Plot as BAR CHART: x is the hour, y is number of events
# Columns are named directly in aes() because ggplot looks them up in final
plot1 = ggplot(final, aes(x=date, y=num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + labs(x="Hour of Day", y="Events")


To see it, we need to print it:

print(plot1)


You can run all the lines down to here and you get something like this:




What the print statement drew was a picture. What if you wanted an interactive web page? In the console add these:

library(plotly)
ggplotly(plot1)


Over in the viewer you now have a web page that tells you the total when you place the mouse over the columns. Hmm. Our chart is gray. How about adding some color? OK, how about making the color vary with how many events we have? OK, then let's type this in at the console:

plot1 = ggplot(final, aes(x=date, y=num, fill=num)) +
  geom_bar(stat="identity") + ggtitle("Events per Hour") +
  scale_x_datetime() + labs(x="Hour of Day", y="Events") +
  scale_fill_gradient(low="black", high = "red", name="Events/Hour")

print(plot1)


All we did was tell the plot's aesthetics to fill each bar with a color based on the number of events, and then tell it what gradient to use. When you run that you can now see that we have colors:



If you want that as a web page, you can re-run the ggplotly(plot1) command. If for some reason you wanted to save the image as a file on your hard drive, you just run the following command:

ggsave("bar-hourly-events-per-hour.png", plot1, width=7, height=5)


In this way you can automate the creation of charts and then pull them all together into a report.
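
One way to approach that automation is to wrap the plot in a function, as in the rough sketch below. The helper name hourly_bar, the toy data, and the temporary output path are all hypothetical. The sketch assumes a data frame with a POSIXct column called date and a numeric column called num, like the final data frame built earlier.

```r
library(ggplot2)

# Hypothetical helper: build the colored hourly bar chart from a data frame
# that has a POSIXct column `date` and a numeric column `num`.
hourly_bar <- function(df, title = "Events per Hour") {
  ggplot(df, aes(x = date, y = num, fill = num)) +
    geom_bar(stat = "identity") +
    ggtitle(title) +
    scale_x_datetime() +
    labs(x = "Hour of Day", y = "Events") +
    scale_fill_gradient(low = "black", high = "red", name = "Events/Hour")
}

# Made-up counts standing in for real audit data
toy <- data.frame(
  date = as.POSIXct(c("2017-03-01 09", "2017-03-01 10"),
                    format = "%Y-%m-%d %H", tz = "GMT"),
  num  = c(12, 30)
)

chart <- hourly_bar(toy, "Events per Hour (toy data)")

# Save to a temporary PNG; a real script would choose a stable path instead
out <- file.path(tempdir(), "bar-hourly-toy.png")
ggsave(out, chart, width = 7, height = 5)
```

A script run from cron could call a function like this once per csv file and hand the resulting images to whatever builds the report.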
