The Java Development Kit (JDK) and Java Runtime Environment (JRE) are distributed by Oracle. “Java 8” refers to release 1.8. Java 10, the latest release as of June 2018, is the first I have seen that uses the new naming convention. Put simply, a Java installation allows Java programs to run on our Windows 10 computer. There are two main flavors of Java: the JDK, which allows the development of Java programs, and the JRE, which allows the execution of JARs. In this case we will install the JDK, as it will allow us to build our Spark installation in the future.

To find the installer, search online for “java jdk 1.8 download” and find the correct page on oracle.com.

In the setup dialogs, select the file path of the Java installation. I prefer not to install the Java home under “Program Files”, as I am uncertain whether all code will be able to handle the space in the folder name, or whether the path will have to be written in its short form, PROGRA~1. The Java path will also have to be added to the machine’s PATH, which is another reason to keep the directory tree short.

Once the installation wizard has finished, we need to create the environment variable JAVA_HOME and add the path to the java.exe program to the computer’s PATH system variable.

The way to do this in Windows 10 is to click the magnifying glass button, right next to the Start menu.

Windows will offer “Edit the system environment variables (Control Panel)”, so we select that. Click on “New…” to add the JAVA_HOME user variable for your account, and then append %JAVA_HOME%\bin to the PATH.

A machine can have more than one JRE/JDK installed. Changing JAVA_HOME and PATH is all it takes to “activate” a certain version of Java. To confirm that the installation was successful, and that we are indeed using the version we need, we open a Windows command prompt (again we can use the magnifier and type “cmd”) and then type java -version. This validates both that the correct version of Java is active and that the Java executable is reachable through the system PATH.
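For readers who will later call Java from R, the same check can be sketched from an R session. This is only a minimal sketch and not part of the original steps: it reads the JAVA_HOME and PATH variables described above using base R functions, and the variable names java_home, path_entries and java_on_path are my own.

```r
# Read JAVA_HOME; an empty string means the variable is not set
java_home <- Sys.getenv("JAVA_HOME")

# Split PATH into its entries (";" on Windows, ":" elsewhere)
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]

# Sys.which() returns "" when no java executable is found on the PATH
java_on_path <- nzchar(unname(Sys.which("java")))

cat("JAVA_HOME:", if (nzchar(java_home)) java_home else "(not set)", "\n")
cat("Number of PATH entries:", length(path_entries), "\n")
cat("java found on PATH:", java_on_path, "\n")
```

If the last line reports FALSE, the %JAVA_HOME%\bin entry is probably missing from PATH, or the R session was started before the variables were changed.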

A word of caution: if other Oracle software is already installed, it may have placed its own Java path at the beginning of the PATH string, in which case that Java is found first.

Instead of creating the data frame programmatically, why not use an existing spreadsheet, available online? A simple HTTP file server, free and open source, is HFS: http://rejetto.com/hfs/?f=
Once HFS is installed, uploading a spreadsheet is as simple as dragging and dropping the existing file.

Once the file is in place, we can get its URL directly by right-clicking on the file, and selecting “Copy URL address – Ctrl+C”

Before a spreadsheet can be read into R for the first time, some more software is needed.

Download and install a Perl distribution on the machine. For Windows, Strawberry Perl can be used (link: http://strawberryperl.com/ ).

Next, the package that reads the spreadsheet needs to be installed into R. The package name is “gdata”, so we run

install.packages("gdata")
require("gdata")

At this point, we are ready to load the spreadsheet directly into a new data frame:

dfXLBAN <- read.xls("http://192.168.0.107/BaRanas.xlsx")
trying URL 'http://192.168.0.107/BaRanas.xlsx'
Content type 'application/octet-stream' length 8786 bytes
downloaded 8786 bytes

Let’s display the data frame:

Since the spreadsheet contained one more column than our example with inline data, we need to discard the Date column so that the data frame is exactly the same:

dfB2 <- data.frame(dfXLBAN$Name,dfXLBAN$Bananas)
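One detail worth knowing about the data.frame(dfXLBAN$Name, dfXLBAN$Bananas) call is that the new columns are named after the expressions that produced them. The sketch below shows this next to an alternative that selects the columns by name. It uses a small stand-in for dfXLBAN so that it runs without the file server; the dates, names and values in it are hypothetical, as are the variable names dfB2_by_name and dfB2_rebuilt.

```r
# Hypothetical stand-in for the spreadsheet (the real one is read with read.xls)
dfXLBAN <- data.frame(
  Date    = as.Date("2018-06-04") + 0:2,
  Name    = c("Person A", "Person A", "Person B"),
  Bananas = c(2, 3, 1),
  stringsAsFactors = FALSE
)

# Alternative: keep only the columns we need, selected by name
dfB2_by_name <- dfXLBAN[, c("Name", "Bananas")]
print(names(dfB2_by_name))   # "Name" "Bananas"

# The call used in the text: columns are renamed after the expressions
dfB2_rebuilt <- data.frame(dfXLBAN$Name, dfXLBAN$Bananas)
print(names(dfB2_rebuilt))   # "dfXLBAN.Name" "dfXLBAN.Bananas"
```

Either form discards the Date column. Selecting by name simply avoids having to rename the columns afterwards.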

As I have stated before, I do not claim to be a statistician. However, I do feel the need to provide a very stripped-down definition of some terms, as they are used in our future exercises in R with classification and regression formulas, which build and expand on the current bananas case study.

In my own words, “Machine Learning” is a term implying that a decision (or computation, or derivation of a value) inside a piece of software is not based on an explicitly programmed algorithm. Instead, a statistical model is fitted to an existing set of (x, y) values, for example to model the relationship between the x and y values. We identify two main categories of Machine Learning: Classification and Regression.

Classification groups sparse data into categories. For example, for a satellite image, a classification could provide the type of land use depicted in the image, choosing among predefined values such as “urban area”, “rural area” and “sea”. Classification is divided into two methodologies, supervised and unsupervised. In the land-use example, different parameters can be used (such as the variance/covariance of adjacent picture elements) to try to predefine the characteristics of each type of use. For example, we would expect a rural area to have a high degree of uniformity in the shading (even more so the sea), with clearly defined rectangular boundaries between the fields and distinctive shapes of rivers or other topographical features.

Regression attempts, from a discrete set of pairs of values, to provide a continuous mapping that also covers pairs not initially present in the data set. For example, if the pairs are (1,1), (2,4), (4,16), (5,25), (9,81), we have reason to believe that the relation within each pair is y=x^2. Having determined the formula that describes the relationship, we can easily predict the value of a new pair. For example, we can extend the above set to (1,1), (2,4), (4,16), (5,25), (9,81), (10,100).
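The pairs in this example can be tried out in R with the built-in lm function. This is a minimal sketch under one assumption that I supply myself: the model has the form y = a + b*x^2. Regression does not discover that form on its own; it only estimates a and b for the form we propose. The variable names are mine.

```r
# The pairs from the text, split into x and y values
x <- c(1, 2, 4, 5, 9)
y <- c(1, 4, 16, 25, 81)

# Fit y = a + b * x^2; I() makes R treat x^2 as arithmetic, not formula syntax
fit <- lm(y ~ I(x^2))

# Estimated coefficients: a is (numerically) zero and b is one
print(round(coef(fit), 6))

# Predict y for a value of x that was not in the data set
prediction <- unname(predict(fit, newdata = data.frame(x = 10)))
print(round(prediction, 6))   # 100
```

With real measurements the points would not sit exactly on the curve, and the quality of the fit would have to be judged against the acceptable margin of error.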

While in the first case the difficulty lies in properly defining the classes and their attributes, in the second case the researcher needs to apply the proper regression method for the acceptable margin of error. I like to think that both cases are governed by the law of causality.

Mean

The average value of a set of numbers.
How to calculate: Add all the values, and divide by the number of values
Example: In our case, (2+3+6+0+1+5+4)/7 = 3. So this person had a mean consumption of 3 bananas per day
Use: The Mean allows us to produce a single value that summarizes a data set

Variance

An indication of how “spread out” a data set is: the bigger the number, the greater the spread.
How to calculate: Subtract the Mean from each value (the result is also called the deviation from the mean), square it, sum the squares, and divide by the number of values
Example: [(2-3)^2+(3-3)^2+(6-3)^2+(0-3)^2+(1-3)^2+(5-3)^2+(4-3)^2]/7=4
Use: It can be easily seen that if the person had exactly the same number of bananas every day, all the deviations would be zero, so there would be no variance. A lower variance would mean that the daily numbers of fruit are closer to each other, making predictions of future consumption much safer

Standard Deviation

How to calculate: The square root of the Variance (in our case, 2)
Use: While the Variance shows the dispersion of the data set, the Standard Deviation is used to measure confidence in the statistical conclusions we reach. Imagine that our sample had fewer values, and that some of them were very different from the rest. In this case a very large Standard Deviation would mean that it is impossible to predict how many bananas the person will eat
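The three calculations above can be reproduced step by step in R. The sketch below follows the “How to calculate” descriptions literally, dividing by the number of values, and uses the seven daily values from the Mean example. The variable names are my own.

```r
# Daily banana consumption of one person over 7 days (from the Mean example)
bananas <- c(2, 3, 6, 0, 1, 5, 4)
n <- length(bananas)

# Mean: add all the values and divide by how many there are
mean_value <- sum(bananas) / n
print(mean_value)       # 3

# Variance: square each deviation from the mean, sum, divide by n
deviations <- bananas - mean_value
variance_value <- sum(deviations^2) / n
print(variance_value)   # 4

# Standard deviation: the square root of the variance
sd_value <- sqrt(variance_value)
print(sd_value)         # 2
```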

R is an interpreted programming language with built-in statistical functions and the ability to generate graphical plots of data.

Let’s install some software (all the software below is open source).

Use CRAN (the Comprehensive R Archive Network) to download and install R. Go to the CRAN mirror list at https://cran.r-project.org/mirrors.html, select the mirror closest to your location, and choose your platform. This will install the R language itself.

The data set used here is how many bananas two people ate over one week (7 days).

Launch RStudio

Observe the basic elements of the IDE: the script editor at the upper left, with the console below it. Once the commands are executed, the values of the variables are displayed at the top right. As seen in the screen to the left, there are 4 elements created, 2 vectors and 2 data frames, all assigned to variables. The two vectors contain our data (who ate how many bananas) and the data frames are used for the plots. The first data frame is used to determine the axes, the second to compute the statistical data.
Quick note, for those double-checking the numbers: the functions use N-1 instead of N as the divisor, meaning they treat the data set as a sample instead of a population. The mapping of the function names between Excel and R is VAR in Excel and var in R for Variance, and STDEV in Excel and sd in R for Standard Deviation. Excel additionally has the functions VAR.P and STDEV.P, where the divisor is N, providing a calculation that treats the data as the entire population and not a sample.
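The difference between the N-1 and N divisors can be checked directly. The sketch below uses R's built-in var and sd functions on the seven values from the Mean example and then rescales the results to the N divisor. The rescaling lines and variable names are mine; base R has no separate population-variance function that I am aware of.

```r
bananas <- c(2, 3, 6, 0, 1, 5, 4)
n <- length(bananas)

# Built-in functions divide by N-1 (sample statistics)
sample_var <- var(bananas)
sample_sd  <- sd(bananas)
print(sample_var)   # 4.666667
print(sample_sd)    # 2.160247

# Rescale to the N divisor (population statistics, like VAR.P and STDEV.P)
population_var <- sample_var * (n - 1) / n
population_sd  <- sqrt(population_var)
print(population_var)   # 4
print(population_sd)    # 2
```

The sample values are slightly larger than the hand-calculated 4 and 2 because the same sum of squares, 28, is divided by 6 instead of 7.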

Now we need to install the graphics libraries and make them available to our code. For this, we need to type two commands. The first connects to CRAN (so an Internet connection is required) to download the library; the second loads it into the current session.

Write our code and enter data

install.packages("ggplot2")
library(ggplot2)

Now let’s create some sample data. We will create a “data frame”, which will consist of two vectors: the first holds the measurements (fruit consumed per day) and the second holds the person linked to each measurement.
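As a rough sketch of what such a data frame can look like, the code below builds one from two vectors. The first seven values are the ones used in the Mean example. The second person's values, the labels “Person A” and “Person B”, and the variable names are hypothetical and only there to make the example runnable.

```r
# Measurements: 7 days for the first person, then 7 days for the second
# (the second person's values are hypothetical)
bananas <- c(2, 3, 6, 0, 1, 5, 4,
             1, 2, 2, 3, 1, 2, 3)

# The person linked to each measurement (labels are hypothetical)
person <- rep(c("Person A", "Person B"), each = 7)

# Combine the two vectors into a data frame, one row per person and day
dfBananas <- data.frame(Name = person, Bananas = bananas,
                        stringsAsFactors = FALSE)
print(dfBananas)

# Mean consumption per person
mean_per_person <- tapply(dfBananas$Bananas, dfBananas$Name, mean)
print(mean_per_person)
```

Keeping one row per measurement, with the person in its own column, is the layout that ggplot2 generally expects when grouping or coloring by person.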