The best computer language for analyzing data is the "R" programming language. It's a different kind of language from anything I've used. When you issue a command, it typically applies to all of the data at once. If you find yourself writing "for loops", you're probably doing something wrong. When you really get the hang of using R, it will feel a bit like cheating. (I did that in 5 lines of code? Seriously?)
At first it looks kind of alien, but it is easy to learn. I highly recommend that anyone interested in R take the Coursera classes on Data Science from Johns Hopkins. There are others that might be equally good, but I have not seen them.
There are other ways to learn R. For example, the tryr.codeschool.com web site teaches the basics for free. In an hour or so, you will most likely be able to read other people's programs and have a basic idea of what they are doing. Becoming proficient takes more effort, such as writing your own programs. That is when the real questions come up.
As you get into the Data Science and "R" world, there is one recurring theme: "Start by installing a copy of RStudio on your system." The classes say it, blogs say it, tutorials say it, books say it. There's a reason for that. It's absolutely wonderful and an essential tool for wrangling data.
The best part is that RStudio is open source, just like R. Nearly all tutorials tell you to get it from the RStudio web site, https://www.rstudio.com/products/rstudio/download. This gets you a prebuilt version of the software as an rpm, deb, exe, or tarball. One thing you will notice is that there is no srpm. Another thing you will notice is that no distribution actually ships RStudio. Why is that, you might ask?
I think the answer is that it was never written with packaging by Linux distributions in mind. It wants to put things in /usr/local, which is fine for a 3rd party application but against the packaging rules of distributions. You cannot simply pass it a $DESTDIR to prefix the installation paths for packaging purposes. It also wants to use its own copies of libraries rather than the system versions, which is against most packaging guidelines as well. I suspect this is really about the developers of RStudio not wanting to waste time debugging odd problems on various platforms because each one ships different library versions.
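For readers who have not met the $DESTDIR convention, here is a minimal sketch of a staged install. It is only an illustration: `install_app` and `demo-app` are made-up names, not anything from the RStudio build. The idea is that the program is still configured to live under /usr/local, but the files are copied into a temporary staging root so a packaging tool can collect them.

```shell
#!/bin/sh
# Minimal sketch of a DESTDIR-style staged install.
# install_app and demo-app are hypothetical, for illustration only.
set -e

# Where the software expects to live once it is really installed.
PREFIX="/usr/local"

install_app() {
    # DESTDIR is prepended only while copying files; when it is empty,
    # the files land directly under $PREFIX.
    target="${DESTDIR}${PREFIX}/bin"
    mkdir -p "$target"
    printf '#!/bin/sh\necho hello\n' > "$target/demo-app"
    chmod 755 "$target/demo-app"
}

# Stage into a throwaway directory instead of touching the real /usr/local.
DESTDIR="$(mktemp -d)"
install_app

# List what a packaging tool would pick up from the staging root.
find "$DESTDIR" -type f
```

A build system that honors this convention lets rpmbuild install into its own build root as an unprivileged user, which is what the distributions expect.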
If you are like me, you like to have the source code "just in case" and build your own package. I have looked all over the place and have found little to no guidance on building RStudio from scratch. I see where others are trying to do it. But what I have in today's blog are actual instructions on how to build your own copy. I have already created a spec file, which is most of the hard work. It is for an older copy, but it works fine. Some day I will update to the 1.0 release, but it will take some effort.
I won't try to explain the spec file. It does not attempt to move things out of /usr/local. It's just too much trouble and I'd rather get on with it. The build has been corrected to use system libraries, though.
So, let's build a copy of RStudio. You need to do this on a desktop system. I have tested that both Gnome and Cinnamon produce usable builds. I have not tested this on KDE or any other desktop. The reason to build on a desktop system is that the build has some dependency on desktop components. Without them, the build either completes but is not usable, or errors out in some weird, cryptic way. I have not been able to determine what this dependency is in order to add it to the BuildRequires. So, do this on a desktop from your build account. If you need to set up a build environment, see this blog article.
First, let's install the build dependencies:
Now build as a normal user:
To free up room after a successful build:
When completed (adjust path for OS, version, and home dir):
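The steps above can be sketched roughly as the dry run below. This is a generic rebuild-from-srpm outline, not the exact commands for this package: the srpm name, the `~/rpmbuild` top directory, and the `rstudio-*.rpm` glob are all assumptions, and `dnf builddep` needs the dnf-plugins-core package. The script only prints the commands so you can adjust them before running anything.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# SRPM, TOPDIR and the rpm file glob are assumed placeholders; adjust
# them for your OS, version, and home directory.
set -e

SRPM="${SRPM:-rstudio-VERSION.src.rpm}"   # hypothetical srpm file name
TOPDIR="${TOPDIR:-$HOME/rpmbuild}"        # assumed rpmbuild top directory
ARCH="$(uname -m)"

plan() {
    # 1. Install the build dependencies listed in the srpm (as root).
    echo "sudo dnf builddep $SRPM"
    # 2. Rebuild the binary package as a normal user.
    echo "rpmbuild --rebuild $SRPM"
    # 3. Free up room by clearing any leftover build tree.
    echo "rm -rf $TOPDIR/BUILD/*"
    # 4. Install the freshly built package (as root).
    echo "sudo dnf install $TOPDIR/RPMS/$ARCH/rstudio-*.rpm"
}

plan
```

Printing the plan first is a deliberate choice here, since the rm line in particular is one you want to read before you run it.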
This will take about an hour to build, maybe longer. RStudio is a huge package. The srpm is 176 MB. My x86_64 version is 19 MB. A whole lot of something gets compressed or jettisoned during the build.
After installing it you should be able to run it. The icon is in the main menu under "Programming". When you start it up, you should see something like this:
I'll explain more about RStudio in the next blog. If you choose to just install the rpm from the RStudio web site, there's nothing wrong with that. It's what all data scientists do today, because until now there were almost no other options.