Machine Learning, Classification and Regression

As I have stated before, I do not claim to be a statistician. However, I do feel the need to provide a stripped-down definition of some terms, as they are used in our future exercises in R, which rely on classification and regression formulas. These exercises build on and expand the current bananas case study.

In my own words, “Machine Learning” is a term to imply that a decision (or computation, or derivation of a value) inside a piece of software is not based on a fixed, hand-coded algorithm. Instead, a statistical model is fitted to an existing set of (x, y) values, for example to model the relationship between the x and y values. We identify two main categories of Machine Learning: Classification and Regression.

Classification groups data into categories. For example, in a satellite image, a classification could provide the type of land use depicted in the image, so the result would be one of a set of predefined values such as “urban area”, “rural area”, “sea”. Classification is divided into two methodologies: supervised (the categories are defined in advance from labelled examples) and unsupervised (the categories are discovered from the data itself). In the land-use example, different parameters can be used (such as the variance/covariance of adjacent picture elements) to try to predefine the characteristics of each type of use. For example, in a rural area we would expect a high degree of uniformity in the shading (even more so in the sea), with clearly defined rectangular boundaries between the fields and the distinctive shapes of rivers or other topographical features.
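
To make the supervised idea concrete, here is a minimal sketch in R. The numbers are invented for illustration only (they are not real satellite measurements), and the names train, centroids and classify are my own. Each class is summarised by the average shading variance of its labelled patches, and a new patch is assigned to the class whose average is nearest.

```r
# Hypothetical training data: shading variance of image patches whose
# land use is already known (all values are made up for illustration)
train <- data.frame(
  variance = c(0.2, 0.3, 0.25, 2.1, 1.8, 2.4, 6.5, 7.2, 5.9),
  landUse  = c("sea", "sea", "sea",
               "rural area", "rural area", "rural area",
               "urban area", "urban area", "urban area")
)

# Summarise each class by the mean of its training values
centroids <- tapply(train$variance, train$landUse, mean)

# Assign a new patch to the class with the nearest mean
classify <- function(v) {
  names(centroids)[which.min(abs(centroids - v))]
}

classify(0.4)  # a very uniform patch
classify(6.0)  # a patch with a lot of variation
```

A real classifier would use many more features than one number per patch, but the principle of comparing new data against known classes is the same.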

Regression attempts, from a discrete set of pairs of values, to provide a continuous mapping between the two values of each pair, one that also covers values not initially present in the data set. For example, if the pairs are (1,1), (2,4), (4,16), (5,25), (9,81), we have reason to believe that the relation within each pair is y=x^2. Having determined the formula that governs the relationship, we can easily predict the value of a new pair. For example, we can extend the above list with (10,100).
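
The same example can be sketched in R with the built-in lm() function. Note the assumption: we tell the model that y depends on x squared, and lm() only estimates the coefficients. The variable names x, y, fit and pred are my own.

```r
# The five known pairs from the example
x <- c(1, 2, 4, 5, 9)
y <- c(1, 4, 16, 25, 81)

# Fit y = a + b * x^2; the quadratic form is our assumption,
# the values of a and b are estimated from the data
fit <- lm(y ~ I(x^2))

# Predict y for a value of x that is not in the data set
pred <- predict(fit, newdata = data.frame(x = 10))
pred  # approximately 100
```

With real, noisy data the fit would not be exact, and choosing the form of the relationship is the hard part.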

While in the first case the difficulty lies in properly defining the classes and their attributes, the second case requires the researcher to apply the proper regression method for the acceptable margin of error. I like to think that both cases are governed by the law of Causality.


Quick explanation of the statistical terms used in the case study

The numbers below refer to the bananas case study: how many bananas one person ate on each of 7 days.


Statistical Mean

The average value of a set of numbers.
How to calculate: Add all the values, and divide by the number of values
Example: In our case, (2+3+6+0+1+5+4)/7 = 3. So this person had a mean consumption of 3 bananas per day
Use: The Mean allows us to produce a single value which most closely represents a data set
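
As a quick check in R (the variable name bananas is my own), the manual calculation and the built-in mean() function agree:

```r
# Daily banana consumption of one person over 7 days
bananas <- c(2, 3, 6, 0, 1, 5, 4)

# Manual calculation: sum of the values divided by how many there are
manualMean <- sum(bananas) / length(bananas)

# Built-in function
builtinMean <- mean(bananas)

manualMean   # 3
builtinMean  # 3
```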


Variance

An indication of how “spread out” a data set is: the bigger the number, the greater the spread.
How to calculate: Subtract the Mean from each value (the result is called the deviation from the mean), square it, sum the squares, and divide by the number of values
Example: [(2-3)^2+(3-3)^2+(6-3)^2+(0-3)^2+(1-3)^2+(5-3)^2+(4-3)^2]/7=4
Use: It can be easily seen that if the person had exactly the same number of bananas every day, all the deviations would be zero, so there would be no variance. A lower variance would mean that the daily numbers of fruit are closer to each other, making predictions of future consumption much safer
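
The steps above can be followed one by one in R. This is a sketch of the manual calculation with N as the divisor, exactly as in the example; the variable names are my own.

```r
bananas <- c(2, 3, 6, 0, 1, 5, 4)

# Step 1: deviation of each value from the mean
deviations <- bananas - mean(bananas)

# Step 2: square the deviations and sum them
sumOfSquares <- sum(deviations^2)

# Step 3: divide by the number of values
popVariance <- sumOfSquares / length(bananas)

sumOfSquares  # 28
popVariance   # 4
```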

Standard Deviation

How to calculate: Take the square root of the Variance (in our case, 2)
Use: While the Variance shows the dispersion of the data set, the Standard Deviation expresses that dispersion in the same units as the data (bananas per day) and is used to measure confidence in the statistical conclusions we reach. Imagine that our own sample had fewer values, and some of them were very different from the rest. In this case a very large Standard Deviation would mean that it is very hard to predict how many bananas the person will eat
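
In R, this manual calculation (again with N as the divisor, as in the Variance example) is one more line. The variable names are my own.

```r
bananas <- c(2, 3, 6, 0, 1, 5, 4)

# Variance with N as the divisor: the mean of the squared deviations
popVariance <- mean((bananas - mean(bananas))^2)

# Standard Deviation: the square root of the Variance
popSd <- sqrt(popVariance)

popSd  # 2
```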


R Language – from nothing to a plot of your own data

What is R?

R is a programming language for statistical computing, with built-in statistical functions and the ability to generate graphical plots of data

Let’s install some software (all software below is open source)

Use CRAN (Comprehensive R Archive Network) to download and install R. Go to CRAN, select the mirror closest to your location, and choose your platform. This will install the R language itself

Then, let’s install RStudio.

Go to the RStudio download page and choose “RStudio Desktop – Open Source License”

Create the Case Study

The case study used here is how many bananas 2 people ate each day, for one week (7 days).

Launch RStudio

Screenshot of RStudio, to explain the tabs outline

Observe the basic elements of the IDE: script upper left, console below. Once the commands are executed, the values of the variables are displayed top-right. As seen in the screenshot, there are 4 elements created, 2 vectors and 2 dataframes, all assigned to variables. The two vectors contain our data (who ate how many bananas) and the dataframes are used for the plots. The first dataframe is used to determine the axes, the second to compute the statistical data.
Quick note, for those double-checking the numbers: the R functions use N-1 instead of N as the divisor, meaning they treat the data set as a sample instead of a population. The mapping of the function names between Excel and R is VAR and var() for Variance, and STDEV and sd() for Standard Deviation. Excel additionally has the functions VAR.P and STDEV.P, where the divisor is N, which treat the data as the entire population and not a sample.
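
The difference between the two divisors is easy to see in R. The sketch below uses the first person's data; the variable names are my own. Multiplying the sample variance by (N-1)/N gives the population value.

```r
bananas <- c(2, 3, 6, 0, 1, 5, 4)
n <- length(bananas)

# Built-in functions: divisor is N-1 (sample)
sampleVar <- var(bananas)   # 28/6, roughly 4.67
sampleSd  <- sd(bananas)    # square root of the above

# Rescale to get the divisor-N (population) values
popVar <- sampleVar * (n - 1) / n   # 4
popSd  <- sqrt(popVar)              # 2
```

So if your numbers from R do not match a manual calculation, check which divisor was used.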

Now we need to install the graphics libraries and make them available to our code. For this, we need to type two commands. The first connects to CRAN (so an Internet connection is required) to get the libraries; the second loads them into the current session.
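
The two commands probably look like the sketch below. I am assuming the packages are ggplot2 (for the plot) and plyr (for the summary step), since those are the names used by the code later in this post.

```r
# Assumed package names: ggplot2 for plotting, plyr for the summary step.
# Downloads from CRAN; this only needs to be run once per installation.
install.packages(c("ggplot2", "plyr"))

# Loads ggplot2 for the current session; run this every time you start R.
# plyr is called below with the plyr:: prefix, so it does not need loading.
library(ggplot2)
```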

Write our code and enter data


Now let’s create some sample data. We will create a “data frame”, which will consist of two vectors: one holding the measurements (fruit consumed per day) and the other holding the person linked to each measurement

cName <- c('Radka','Radka','Radka','Radka','Radka','Radka','Radka',
           'Natalie','Natalie','Natalie','Natalie','Natalie','Natalie','Natalie')

BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)

dfNP <- data.frame(cName,BpD)

Now, we want to calculate the Mean and Standard Deviation (see the quick explanation of the statistical terms above) of our data, and put them in their own data frame:

dsTATNP <- plyr::ddply(dfNP, "cName", plyr::summarise, mean = mean(BpD), sd = sd(BpD))
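
If you want to double-check the summary without plyr, base R can produce the same numbers. This is only a cross-check sketch; the names means, sds and checkNP are my own, and the data is repeated so the snippet runs on its own.

```r
cName <- c(rep('Radka', 7), rep('Natalie', 7))
BpD <- c(2, 3, 6, 0, 1, 5, 4, 1, 4, 5, 7, 3, 9, 1)

# Apply mean() and sd() to the measurements of each person
means <- tapply(BpD, cName, mean)
sds   <- tapply(BpD, cName, sd)

# Put the results in a data frame with the same columns as dsTATNP
checkNP <- data.frame(cName = names(means),
                      mean  = as.vector(means),
                      sd    = as.vector(sds))
checkNP
```

Remember that sd() uses N-1 as the divisor, so the Standard Deviation shown for the first person is a little above the value of 2 from the manual example.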

And finally, perform the plot itself, using this command:

library(ggplot2)

p <- ggplot() +
  xlab("Participant Name") +
  ylab("Bananas per Day") +
  # black dots: daily values; red dot: mean; green dot: standard deviation
  geom_point(data = dfNP, aes(cName, BpD)) +
  geom_point(data = dsTATNP, aes(cName, mean), colour = 'red', size = 3) +
  geom_point(data = dsTATNP, aes(cName, sd), colour = 'green', size = 4)
p + labs(title = "R/ggplot demo, Nikolas Perdikis May 2018", subtitle = "Visit my starter Big Data blog:")

It should look something like this:

R/ggplot demo
Black dots are the daily consumption of fruit,
Green dot is Standard Deviation,
Red dot is Mean value

Additional steps

Use the following commands as practice:

– Type the name of a variable in the Console to display its value. In our case, type cName and BpD

– Use help() to get help for commands. For the commands we have used, you can type help(c) or help(ggplot)

See how the values of the variables are displayed in the top right window, while help, output and other elements are displayed in the bottom right window

Environment and Output
In the lower window, switch between the tabs to see the relevant information