My take, as a starter, on “Why Big Data?”

Whether in human or machine intelligence, one can think of two main categories of problems. The first is the kind of problem that has a deterministic, rule-based solution. The second is the kind of problem where a decision, or outcome, cannot be derived from a mathematical formula, or from a correlation of factors that are fairly constant in number and equal in weight. How are those problems solved? By data. Lots and lots of data. While we may not know how to get from X to Y (whether Y is a category, a yes/no answer or a prediction), we have enough (X,Y) pairs to feel confident that we can apply different models and decide which one fits the dataset most closely, with the least amount of error or uncertainty.
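
To make the "apply different models and decide which fits the closest" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the five (X,Y) pairs are made up, the two candidate models (a straight line and a scaled square) are simply the two I chose to compare, and mean squared error is only one of several possible ways to measure fit.

```python
# Hypothetical (X, Y) observations, invented for this illustration only.
XS = [1, 2, 3, 4, 5]
YS = [1.1, 3.9, 9.2, 15.8, 25.1]

def fit_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_scaled_square(xs, ys):
    """Least-squares fit of y = a * x^2 (a single free parameter)."""
    a = sum(x * x * y for x, y in zip(xs, ys)) / sum(x ** 4 for x in xs)
    return lambda x: a * x * x

def mean_squared_error(model, xs, ys):
    """Average squared gap between the model's output and the observed y."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def pick_best_model(candidates, xs, ys):
    """Fit every candidate and return the name with the lowest error."""
    errors = {}
    for name, fit in candidates.items():
        model = fit(xs, ys)
        errors[name] = mean_squared_error(model, xs, ys)
    best = min(errors, key=errors.get)
    return best, errors

CANDIDATES = {"linear": fit_linear, "quadratic": fit_scaled_square}
best, errors = pick_best_model(CANDIDATES, XS, YS)
print("errors per model:", errors)
print("closest fit:", best)
```

In a real setting one would measure the error on data the model has not seen during fitting, otherwise the most flexible model tends to win by default.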

Current technology, both in hardware processing and in software solutions, has allowed us to design systems that can store and analyze such datasets much more economically and scalably than before. Big Data is anything that encompasses those datasets. The data itself, the technology and software solutions that store it efficiently at scale, the procedures that unify different data sources and generalize or prepare the data for decision making, the intuition of the Data Scientists who understand the nature of the data, and the choice of tools for a given application are all parts of the Big Data revolution.

It would be interesting to hear, in your comments, which kinds of problems you would place in each of the two categories (or possibly in a different one). Thank you in advance.


Machine Learning, Classification and Regression

As I have stated before, I do not claim to be a statistician. However, I do feel the need to provide a very pared-down definition of some terms, as they will be used in our future exercises in R, which rely on classification and regression formulas. This builds on and expands the current bananas case study.

In my own words, “Machine Learning” is a term implying that a decision (or computation, or derivation of a value) inside a piece of software is not based on a hand-written, rule-based algorithm. Instead, a statistical model is fitted to an existing set of (X,Y) values in order to capture the relationship between the x and y values. We identify two main categories of Machine Learning: Classification and Regression.

Classification assigns individual observations to categories. For example, given a satellite image, a classification could provide the type of land use depicted in the image, chosen from a predefined set of values such as “urban area”, “rural area” and “sea”. Classification is divided into two methodologies: supervised, where the classes are defined in advance, and unsupervised, where the groups emerge from the data itself. In the land-use example, different parameters (such as the variance/covariance of adjacent picture elements) can be used to try to predefine the characteristics of each type of use. For example, we would expect a rural area to have a high degree of uniformity in its shading (the sea even more so), with clearly defined rectangular limits between the fields, and distinctive shapes of rivers or other topographical features.
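
As a rough illustration of the supervised flavour, here is a small Python sketch of a nearest-centroid classifier for the land-use example. Please treat all of it as an assumption on my part: the two features (local variance and mean brightness of an image patch), the training values and the choice of nearest-centroid as the method are invented for the illustration, not taken from any real satellite dataset.

```python
import math

# Hypothetical labelled patches: (local_variance, mean_brightness), both in 0..1.
# These numbers are invented; real features would be computed from image pixels.
TRAINING = {
    "urban area": [(0.80, 0.55), (0.75, 0.60), (0.85, 0.50)],
    "rural area": [(0.30, 0.45), (0.35, 0.40), (0.25, 0.50)],
    "sea": [(0.05, 0.15), (0.08, 0.10), (0.04, 0.20)],
}

def centroid(points):
    """Average of a list of equally sized feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(training):
    """'Learning' here is just one average feature vector per class."""
    return {label: centroid(points) for label, points in training.items()}

def classify(centroids, features):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

centroids = train(TRAINING)
for patch in [(0.06, 0.12), (0.78, 0.58), (0.28, 0.47)]:
    print(patch, "->", classify(centroids, patch))
```

The point of the sketch is only that the classes are fixed up front and the labelled examples tell the model what each class looks like; an unsupervised method would have to discover the groups without the labels.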

Regression attempts, from a discrete set of pairs of values, to provide a continuous mapping between them, so that it also covers values that are not present in the original dataset. For example, if the pairs are (1,1), (2,4), (4,16), (5,25), (9,81), we have reason to believe that the relation within each pair is y=x^2. Having determined the formula that describes the relationship, we can easily predict the value of a new pair. For example, we can extend the above list to (1,1), (2,4), (4,16), (5,25), (9,81), (10,100).
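
For the curious, this is roughly how the exponent could be recovered from the pairs instead of being guessed by eye. The sketch below is in Python and rests on one assumption of mine: that the relationship has the power-law form y = c * x^p, which lets us run an ordinary least-squares line fit on the logarithms of the values. The function names are made up for the illustration.

```python
import math

# The pairs from the text above, as (x, y) tuples.
PAIRS = [(1, 1), (2, 4), (4, 16), (5, 25), (9, 81)]

def fit_power_law(pairs):
    """Least-squares fit of log(y) = log(c) + p * log(x); returns (c, p)."""
    log_x = [math.log(x) for x, _ in pairs]
    log_y = [math.log(y) for _, y in pairs]
    n = len(pairs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    p = sxy / sxx                       # slope of the line = exponent
    c = math.exp(mean_y - p * mean_x)   # intercept of the line = log of the scale
    return c, p

def predict(c, p, x):
    """Evaluate the fitted relation y = c * x^p at a new x."""
    return c * x ** p

c, p = fit_power_law(PAIRS)
print(f"fitted exponent p = {p:.3f}, scale c = {c:.3f}")
print(f"predicted y for x = 10: {predict(c, p, 10):.1f}")
```

Because the five pairs lie exactly on y=x^2, the fitted exponent comes out very close to 2 and the prediction for x = 10 very close to 100; with noisy real-world data the fit would only be approximate.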

While in the first case the difficulty lies in properly defining the classes and their attributes, the second case requires the researcher to apply a regression method that keeps the error within an acceptable margin. I like to think that both cases are governed by the law of Causality.