As I have stated before, I do not claim to be a statistician. However, I do feel the need to provide a very pared-down definition of some terms, as these are used in our future exercises in R, which rely on classification and regression formulas. This builds and expands on the current bananas case study.
In my own words, “Machine Learning” is a term implying that a decision (or computation, or derivation of a value) inside a piece of software is not based on an explicitly coded algorithm. Instead, a statistical model is fitted to an existing set of (x, y) values, for example to model the relationship between the x and y values. We identify two main categories of Machine Learning: Classification and Regression.
Classification groups sparse data into categories. For example, for a satellite image, a classification could provide the type of land use depicted in the image, chosen from a predefined set of values such as “urban area”, “rural area” and “sea”. Classification is divided into two methodologies, supervised and unsupervised. In the land-use example, different parameters can be used (such as the variance/covariance of adjacent picture elements) to try to predefine the characteristics of each type of use. For example, we would expect a rural area to have a high degree of uniformity in its shading (the sea even more so), with clearly defined rectangular limits between the fields and the distinctive shapes of rivers or other topographical features.
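To make the supervised case a bit more concrete, here is a minimal sketch in Python. It assumes each image patch has already been reduced to two features, mean brightness and local variance, and it uses a nearest-centroid rule, which is just one simple supervised method that I picked for brevity. The feature values, the `training` table and the `classify` helper are all invented for illustration and do not come from real satellite data.

```python
import math

# Hypothetical training samples: (mean brightness, local variance) per image patch.
# All numbers are invented for illustration only.
training = {
    "urban area": [(0.60, 0.30), (0.55, 0.35), (0.65, 0.28)],
    "rural area": [(0.40, 0.10), (0.45, 0.12), (0.38, 0.09)],
    "sea":        [(0.20, 0.02), (0.18, 0.01), (0.22, 0.03)],
}


def centroid(samples):
    """Average each feature over the samples of one class."""
    n = len(samples)
    return tuple(sum(s[i] for s in samples) / n for i in range(len(samples[0])))


# "Training" here is nothing more than computing one centroid per class.
centroids = {label: centroid(samples) for label, samples in training.items()}


def classify(patch):
    """Return the label whose centroid is closest (Euclidean distance) to the patch."""
    return min(centroids, key=lambda label: math.dist(patch, centroids[label]))


label = classify((0.21, 0.02))  # a dark, very uniform patch
```

With these made-up numbers, a dark and very uniform patch ends up closest to the “sea” centroid. In the unsupervised case, the labels in `training` would not be given and the groups would have to be discovered from the feature values alone.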
Regression attempts, from a discrete set of pairs of values, to provide a continuous mapping between the values in each pair, even for values that are not initially present in the data set. For example, if the pairs are (1,1), (2,4), (4,16), (5,25), (9,81), we have reason to believe that the relation within each pair is y = x^2. Having determined the formula that governs the relationship, we can easily predict the value of a new pair. For example, we can extend the above array to
While in the first case the difficulty lies in properly defining the classes and their attributes, in the second case the researcher must apply a regression method suited to the acceptable margin of error. I like to think that both cases are governed by the law of causality.
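The regression example with the squared pairs can be sketched in a few lines of Python. This is only an illustration, under the assumption that the relationship is a power law y = c * x^p; in that case an ordinary least-squares line through the (log x, log y) points recovers c and p. The helper names `fit_power_law` and `predict` are my own, and choosing the power-law form is exactly the kind of decision about the "proper regression method" that the researcher has to make.

```python
import math

# The pairs from the regression example, written as (x, y).
pairs = [(1, 1), (2, 4), (4, 16), (5, 25), (9, 81)]


def fit_power_law(points):
    """Least-squares fit of log(y) = log(c) + p * log(x); returns (c, p)."""
    logs = [(math.log(x), math.log(y)) for x, y in points]
    n = len(logs)
    mean_u = sum(u for u, _ in logs) / n
    mean_v = sum(v for _, v in logs) / n
    # Slope of the ordinary least-squares line through the log-log points.
    p = (sum((u - mean_u) * (v - mean_v) for u, v in logs)
         / sum((u - mean_u) ** 2 for u, _ in logs))
    # The intercept of that line is log(c).
    c = math.exp(mean_v - p * mean_u)
    return c, p


def predict(x, c, p):
    """Evaluate the fitted power law at a new x."""
    return c * x ** p


c, p = fit_power_law(pairs)
prediction = predict(10, c, p)  # an x value that is not in the original pairs
```

For these pairs the fitted exponent comes out at about 2 and the coefficient at about 1, so the prediction for x = 10 is about 100, up to floating-point rounding. With noisy real-world data the fit would not be exact, and that is where the acceptable margin of error comes in.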