What is R?
R is a programming language and environment for statistical computing, with built-in statistical functions and the ability to generate graphical plots of data.
Let’s install some software (all software below is open source)
Use CRAN (the Comprehensive R Archive Network) to download and install R. Go to CRAN at https://cran.r-project.org/mirrors.html, select the mirror closest to your location, and choose your platform. This installs the R language itself.
Then, let’s install RStudio.
Go to https://www.rstudio.com/products/rstudio/download/ and choose “RStudio Desktop – Open Source License”
Create the Case Study
The case study used here is simple: how many bananas did two people eat each day over one week (7 days)?
Observe the basic elements of the IDE: the script is in the upper left, with the console below it. Once the commands are executed, the values of the variables are displayed in the top right. As seen in the screenshot, four elements are created, two vectors and two data frames, and all are assigned to variables. The two vectors contain our data (who ate how many bananas) and the data frames are used for the plots. The first data frame is used to determine the axis, the second to hold the computed statistics.
Quick note for those double-checking the numbers: the functions use N-1 instead of N as the divisor, meaning they treat the data set as a sample rather than a population. The function names map between Excel and R as follows: variance (Excel: VAR, R: var) and standard deviation (Excel: STDEV, R: sd). Note that R is case-sensitive, so the R names must be typed in lowercase. Excel additionally has the functions VAR.P and STDEV.P, where the divisor is N, giving a calculation that treats the data as the entire population and not a sample.
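To make the difference between the two divisors concrete, here is a small sketch in base R. The vector x holds seven illustrative daily counts, and the names sample_var, pop_var, sample_sd and pop_sd are made up for this example.

```r
# Seven illustrative daily counts (one person's week).
x <- c(2, 3, 6, 0, 1, 5, 4)
n <- length(x)

# Sample statistics: R's var() and sd() divide by n - 1.
sample_var <- var(x)
sample_sd  <- sd(x)

# Population statistics: divide by n instead
# (the convention used by Excel's VAR.P and STDEV.P).
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

print(c(sample_var = sample_var, pop_var = pop_var))
print(c(sample_sd = sample_sd, pop_sd = pop_sd))
```

For these values the squared deviations from the mean add up to 28, so the sample variance is 28/6 and the population variance is 28/7 = 4.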
Now we need to add the graphics libraries to our installation and make them available to our code. For this, we need to type two commands. The first connects to CRAN (so an Internet connection is required) to download the libraries; the second loads them into the current session.
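The two commands are not reproduced in this text. A likely pair is sketched below, under the assumption that ggplot2 and plyr are the packages required; this is inferred from the ggplot() and plyr:: calls in the code further down, not confirmed by the post.

```r
# Assumption: ggplot2 (plotting) and plyr (grouped summaries) are the
# packages the rest of this post relies on.

# Command 1: download and install from CRAN (needs an Internet connection).
install.packages(c("ggplot2", "plyr"))

# Command 2: attach ggplot2 so ggplot(), aes() and geom_point()
# can be called without a package prefix in this session.
library(ggplot2)
```

plyr does not need to be attached here because the code calls its functions with the plyr:: prefix.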
Write our code and enter data
Now let’s create some sample data. We will create a “data frame”, which will consist of two vectors: one holding the person linked to each measurement, and the other holding the measurements themselves (fruit consumed per day).
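# Person for each measurement: 7 days per person, matching the 14 values in BpD
cName <- c('Radka','Radka','Radka','Radka','Radka','Radka','Radka',
           'Natalie','Natalie','Natalie','Natalie','Natalie','Natalie','Natalie')
# Bananas per day: the first 7 values belong to Radka, the last 7 to Natalie
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)
dfNP <- data.frame(cName, BpD)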
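As a side note, the repeated names can be generated instead of typed out. The sketch below uses base R's rep() and assumes each person has seven consecutive entries, so that the names line up with the 14 values in BpD.

```r
# Assumption: 7 consecutive measurements per person, 14 in total.
cName <- rep(c("Radka", "Natalie"), each = 7)
BpD   <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)

# data.frame() requires both vectors to have the same length.
dfNP <- data.frame(cName, BpD)

# Quick checks on the result: structure, and number of rows per person.
str(dfNP)
print(table(dfNP$cName))
```

If the two vectors had different lengths that do not divide evenly, data.frame() would stop with an error, which is a handy guard against typing mistakes.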
Now, we want to calculate the mean and Standard Deviation (see next post for explanation of terms) of our data, and put them in their own data frame:
dsTATNP <- plyr::ddply(dfNP, "cName", plyr::summarise, mean = mean(BpD), sd = sd(BpD))
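To see what this summary step computes, here is a rough base R equivalent that does not need plyr. It is only a sketch: the names grp_mean, grp_sd and stats_base are made up for this example, and the data frame is rebuilt here (assuming seven entries per person) so the snippet runs on its own.

```r
# Rebuild the sample data so this snippet is self-contained.
dfNP <- data.frame(
  cName = rep(c("Radka", "Natalie"), each = 7),
  BpD   = c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)
)

# tapply() splits BpD by person and applies a function to each group.
grp_mean <- tapply(dfNP$BpD, dfNP$cName, mean)
grp_sd   <- tapply(dfNP$BpD, dfNP$cName, sd)

# Collect the results in one data frame: one row per person.
stats_base <- data.frame(
  cName = names(grp_mean),
  mean  = as.vector(grp_mean),
  sd    = as.vector(grp_sd)
)
print(stats_base)
```

The result has the same shape as the plyr version: one row per person, with a mean column and an sd column.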
And finally, perform the plot itself, using this command:
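# Requires ggplot2 to be loaded, i.e. library(ggplot2)
p <- ggplot() +
  ylab("Bananas per Day") +
  # Red points: mean per person
  geom_point(data = dsTATNP, aes(cName, mean), colour = 'red', size = 3) +
  # Green points: standard deviation per person
  geom_point(data = dsTATNP, aes(cName, sd), colour = 'green', size = 4)
# Add the title and subtitle, then display the plot
p + labs(title = "R/ggplot demo, Nikolas Perdikis May 2018", subtitle = "Visit my starter Big Data blog: http://www.smalldeskbigdata.com")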
It should look something like this:
Use the following commands as practice:
- Type the name of a variable in the Console to display its value.
- Use help() to look up a command. This works for any of the commands we have used.
See how the values of the variables are displayed in the top right window, while help, output and other elements are displayed in the bottom right window.
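As an illustration of these practice commands, here is a short sketch. The choice of BpD and sd as examples is an assumption; any variable or function used in this post works the same way.

```r
# Re-create one of the vectors so this snippet runs on its own.
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)

# Typing a variable's name prints its value in the Console.
BpD

# str() gives a compact description: type, length and first values.
str(BpD)

# help() opens the documentation for a function;
# in RStudio it appears in the bottom right window.
help("sd")
```

The shorthand ?sd does the same as help("sd").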