R Language – from nothing to a plot of your own data

What is R?

R is a programming language built for statistical computing, with built-in statistical functions and the ability to generate graphical plots of data.

Let’s install some software (all of the software below is open source)

Use CRAN (the Comprehensive R Archive Network) to download and install R. Go to https://cran.r-project.org/mirrors.html, select a mirror close to your location, and choose the installer for your platform. This installs the R language itself.

Then, let’s install RStudio.

Go to https://www.rstudio.com/products/rstudio/download/ and choose “RStudio Desktop – Open Source License”.

Create the Case Study

The case study used here: how many bananas did two people eat per day over one week (7 days)? That gives seven daily counts for each person.

Launch RStudio

[Screenshot of RStudio, to explain the tabs outline]

Observe the basic elements of the IDE: the script editor is at the upper left, with the console below it. Once the commands are executed, the values of the variables are displayed at the top right. As seen in the screenshot, four objects have been created, two vectors and two data frames, each assigned to a variable. The two vectors contain our data (who ate how many bananas), and the data frames are used for the plots: the first holds the raw daily counts, the second the computed statistics (mean and standard deviation).
A quick note for those double-checking the numbers: the functions divide by N-1 instead of N, meaning they treat the data set as a sample rather than a population. The function names map between Excel and R as follows: variance is VAR in Excel and var() in R; standard deviation is STDEV in Excel and sd() in R. Excel additionally has the functions VAR.P and STDEV.P, where the divisor is N, which treat the data as the entire population rather than a sample.
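To make the N-1 versus N distinction concrete, here is a small sketch using the seven daily counts for Radka that are entered later in this post. The variable names (x, sample_var, pop_var and so on) are only illustrative. Base R has no built-in population variance, so it is computed by hand here.

```r
# Radka's seven daily banana counts from the case study
x <- c(2, 3, 6, 0, 1, 5, 4)
n <- length(x)

sample_var <- var(x)                    # divides by n - 1 (like Excel's VAR)
pop_var <- sum((x - mean(x))^2) / n     # divides by n (like Excel's VAR.P)

sample_sd <- sd(x)                      # square root of the sample variance
pop_sd <- sqrt(pop_var)                 # like Excel's STDEV.P

print(c(sample_var = sample_var, pop_var = pop_var))   # 4.666667 and 4
print(c(sample_sd = sample_sd, pop_sd = pop_sd))       # 2.160247 and 2
```

The sum of squared deviations from the mean (3) is 28, so the sample variance is 28/6 and the population variance is 28/7.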

Now we need to install the graphics library and make it available to our code. For this, we need to type two commands. The first connects to CRAN (so an Internet connection is required) to download the package; the second loads it into the current session.

Write our code and enter data

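install.packages(c("ggplot2", "plyr"))  # plyr is needed for the summary step further down
library(ggplot2)                        # load ggplot2 into the current session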
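Running install.packages() every time the script is executed downloads the packages again. A possible sketch for checking first which packages are missing is shown below. The helper name missing_packages is hypothetical (it is not part of R or of any package), and the actual install call is left commented out so the snippet has no side effects.

```r
# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggplot2", "plyr"))
print(to_install)

# Uncomment to install only what is missing (requires an Internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

requireNamespace() returns TRUE or FALSE instead of stopping with an error, which is why it is used here rather than library().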

Now let’s create some sample data. We will create a “data frame” consisting of two vectors: one holds the measurements (bananas eaten per day), the other holds the name of the person each measurement belongs to.

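# Person for each measurement: 7 days for Radka, then 7 days for Natalie
cName <- c('Radka','Radka','Radka','Radka','Radka','Radka','Radka',
           'Natalie','Natalie','Natalie','Natalie','Natalie','Natalie','Natalie')

# Bananas per day, in the same order as cName
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)

# Combine both vectors into one data frame with columns cName and BpD
dfNP <- data.frame(cName, BpD)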
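As a side note, the repeated names do not have to be typed out one by one. The sketch below builds the same data with rep() and then checks the result with str() and table(). It is an alternative way of entering the data, not the one used in the rest of this post.

```r
# Same data as above, with rep() repeating each name 7 times
cName <- rep(c("Radka", "Natalie"), each = 7)
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)
dfNP <- data.frame(cName, BpD)

str(dfNP)           # structure: 14 observations of 2 variables
table(dfNP$cName)   # number of rows per person (7 each)
```

Checking the row count per person is a quick way to catch a name or a number that was left out while typing.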

Now we want to calculate the mean and standard deviation (see the next post for an explanation of the terms) for each person, and put them in their own data frame:

dsTATNP <- plyr::ddply(dfNP, "cName", plyr::summarise, mean = mean(BpD), sd = sd(BpD))
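If the plyr package is not available, roughly the same summary can be produced with base R only. The following is a sketch using tapply(); the name stats_base is just an illustrative choice, and the rows come out in alphabetical order of the names.

```r
cName <- rep(c("Radka", "Natalie"), each = 7)
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)
dfNP <- data.frame(cName, BpD)

# One value per person; tapply() groups BpD by cName
means <- tapply(dfNP$BpD, dfNP$cName, mean)
sds <- tapply(dfNP$BpD, dfNP$cName, sd)

# Assemble the per-person statistics into a data frame
stats_base <- data.frame(cName = names(means),
                         mean = as.vector(means),
                         sd = as.vector(sds))
print(stats_base)
```

The resulting data frame has the same three columns (cName, mean, sd) as the one created with ddply, so it could be passed to the plot in the same way.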

And finally, draw the plot itself, using these commands:

p <- ggplot() +
  xlab("Participant Name") +
  ylab("Bananas per Day") +
  geom_point(data = dfNP, aes(cName, BpD)) +
  geom_point(data = dsTATNP, aes(cName, mean), colour = 'red', size = 3) +
  geom_point(data = dsTATNP, aes(cName, sd), colour = 'green', size = 4)
p + labs(title = "R/ggplot demo, Nikolas Perdikis May 2018", subtitle = "Visit my starter Big Data blog: http://www.smalldeskbigdata.com")

It should look something like this:

R/ggplot demo
Black dots are the daily fruit consumption,
Green dot is the standard deviation,
Red dot is the mean value

Additional steps

Use the following commands as practice:

- Type the name of a variable in the Console to display its value. In our case, type cName and BpD

- Get help on commands with help(). For the commands we have used, try help(c) and help(ggplot)

See how the values of the variables are displayed in the top-right window, while help, output and other elements are displayed in the bottom-right window.
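The practice steps above could look like this in the Console. This is a small sketch that recreates the two vectors so it can run on its own; length(), summary() and ls() are extra base R commands not used elsewhere in this post.

```r
cName <- rep(c("Radka", "Natalie"), each = 7)
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)

cName          # typing a name displays its value
BpD
length(BpD)    # number of measurements
summary(BpD)   # minimum, quartiles, mean and maximum
ls()           # names of all variables in the current session

help("c")      # opens the help page for c()
# help("ggplot") works the same way once library(ggplot2) has been run
```

In RStudio the help page opens in the Help tab of the bottom-right window rather than in the Console.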

Environment and Output
In the lower window, switch between the tabs to see the relevant information.