Thursday, March 9, 2017

Heatmaps

In the last blog post, we learned about making bar charts. Sometimes we are looking for patterns in the data, and a heatmap is the best choice. We'll go over making one of those, and I think you'll see that it's super easy.

New data
To start with, we need data for the last week. This is a bit different than our other exercises. We can get the data like this:

# cd R/audit-data
# ausearch --start week-ago --format csv > audit.csv
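
Before going further, it can be worth confirming that the export has the DATE and TIME columns that the script below relies on, and that they parse cleanly. Here is a minimal sketch of that check. It uses a made-up two-row file written to a temporary location so it runs on its own; for the real thing you would point read.csv at audit.csv instead. The names sample_csv, check, parsed, and bad_rows are just for illustration.

```r
# Hypothetical two-row sample in the layout the script expects; the real
# file comes from the ausearch command above and has more columns.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c("DATE,TIME",
             "03/08/2017,08:15:02",
             "03/08/2017,23:59:59"), sample_csv)

check <- read.csv(sample_csv, header = TRUE)

# The heatmap script depends on these two columns being present
stopifnot(all(c("DATE", "TIME") %in% names(check)))

# Parse the same way the script does; an NA means the format did not match
parsed <- as.POSIXct(paste(check$DATE, check$TIME),
                     format = "%m/%d/%Y %H:%M:%S", tz = "GMT")
bad_rows <- sum(is.na(parsed))

print(range(parsed))  # earliest and latest event in the file
print(bad_rows)       # number of rows that failed to parse
```

If bad_rows is not zero on your own data, the date format string is the first thing to look at.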


The code
The program to do the heatmap is almost identical to what we did for the bar charts. If you want an explanation for this part, go check out the last blog post. You can copy and paste this into the console section of RStudio to run it, or paste it into a new file. The code is as follows:

library(ggplot2)

audit <- read.csv("~/R/audit-data/audit.csv", header=TRUE)
audit$one <- rep(1,nrow(audit))

# Create time series data frame for aggregating
audit$posixDate <- as.POSIXct(paste(audit$DATE, audit$TIME), format="%m/%d/%Y %H:%M:%S")

# Create a column of hour and date to aggregate an hourly total.
audit$hours <- format(audit$posixDate, format = '%Y-%m-%d %H')

# Now summarize it
temp <- aggregate(audit$one, by = list(audit$hours), FUN = length)

# Put results into data frame for plotting
final = data.frame(date=as.POSIXct(temp$Group.1, format="%Y-%m-%d %H", tz="GMT"))
final$num <- temp$x
final$day <- weekdays(as.Date(final$date))
final$oday <- factor(final$day, levels = unique(final$day))
final$hour <- as.numeric(format(final$date, "%H"))
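
If you want to see what the hourly aggregation does without loading a week of audit data, here is a small sketch of the same steps on four made-up timestamps. The values and the names stamps and demo are assumptions for illustration only; the calls are the same base R ones used above.

```r
# Made-up timestamps standing in for audit$posixDate
stamps <- as.POSIXct(c("2017-03-06 08:05:00", "2017-03-06 08:40:00",
                       "2017-03-06 09:10:00", "2017-03-07 08:20:00"),
                     tz = "GMT")

# Truncate each timestamp to its hour, then count the rows in each hour bucket
hours <- format(stamps, "%Y-%m-%d %H")
temp <- aggregate(rep(1, length(hours)), by = list(hours), FUN = length)

# Rebuild a small plotting frame shaped like `final`
demo <- data.frame(date = as.POSIXct(temp$Group.1, format = "%Y-%m-%d %H", tz = "GMT"))
demo$num <- temp$x
demo$hour <- as.numeric(format(demo$date, "%H"))

print(demo)
```

Each row of the result is one hour bucket, which is exactly what becomes one tile in the heatmap.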


OK, now we are ready for the plot function to create the chart. In this one we map the x-axis to the day of the week and the y-axis to the hour of the day. We also tell the aesthetic function that we want the fill color of each tile to indicate the count of events during that hour. Next, we tell ggplot that we want tiles, set the color scale to continuous and give it a legend label, set the y-axis limits so that all 24 hours are shown, apply the black and white theme, set the y-axis breaks, create a title, and add some labels.

# Plot as heat map: x is day of week, y is hour, darkness is number of events.
# Column names are used bare inside aes(), and the y range is set with
# coord_cartesian() so that scale_y_continuous() does not replace it.
plot1 = ggplot(final, aes(x=oday, y=hour, fill=num)) +
  geom_tile() +
  scale_fill_continuous(low="#F7FBFF", high="#2171B5", name="Events") +
  coord_cartesian(ylim=c(-0.5, 23.5)) + theme_bw() + scale_y_continuous(breaks=seq(0, 23, 4)) +
  ggtitle("Events per Hour") + labs(x="Day of Week", y="Hour of Day")

# To see what ggplot can do
print(plot1)


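This makes a chart similar to this:

[Image: heatmap titled "Events per Hour", with day of week on the x-axis and hour of day on the y-axis]
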
And just for fun, try it with plotly:

library(plotly)
ggplotly(plot1)


Now you can hover the mouse over the tiles and see the actual value.
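
The same numbers can also be read without hovering, by cross-tabulating the counts into a table with one row per hour and one column per day. This is only a sketch: demo is a made-up stand-in for the final data frame, and xtabs is the base R function doing the work.

```r
# Made-up stand-in for `final` (day label, hour of day, event count)
demo <- data.frame(
  oday = factor(c("Monday", "Monday", "Tuesday"), levels = c("Monday", "Tuesday")),
  hour = c(8, 9, 8),
  num  = c(2, 1, 1)
)

# One row per hour, one column per day; cells hold the event counts,
# and hour/day combinations with no events show up as 0
grid <- xtabs(num ~ hour + oday, data = demo)

print(grid)
```

With the real data you would pass final instead of demo to get the full week.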

As for what the chart reveals, you can see that the system is turned on at nearly the same time every day, while the time it is turned off varies. You might use this to spot activity during off hours that would be harder to see any other way. You might also see that there is heavy activity at the same hour every day and use that as a starting point to investigate further.
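
If you would rather list the off-hours activity than eyeball it, something like the following sketch could work. The working-hours window (07:00 to 19:00) and the data in demo are assumptions made up for illustration; with real data you would filter the final data frame and pick hours that match how the machine is normally used.

```r
# Made-up stand-in for `final`
demo <- data.frame(
  day  = c("Monday", "Monday", "Tuesday", "Tuesday"),
  hour = c(3, 9, 14, 23),
  num  = c(12, 40, 55, 7)
)

# Assumed working hours; adjust to the machine's normal usage
work_start <- 7
work_end   <- 19

# Keep only the hourly buckets that fall outside the working window
off_hours <- demo[demo$hour < work_start | demo$hour >= work_end, ]

# Busiest off-hours buckets first
off_hours <- off_hours[order(-off_hours$num), ]

print(off_hours)
```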

Summary
Heatmaps are really just some different options to the plot function. The data is prepared the same way, but it's organized differently. You can look around online for ggplot tutorials if you want to investigate customizing this even more.
