This is a simple test case: we create a random 2×2 matrix, invert it, and multiply the matrix by its inverse. We will also use MS Excel to check our computations.
Create a 2×2 matrix
> mNP <- matrix(rnorm(4),nrow=2,ncol=2)
This command uses the matrix() function. The first argument, the data, uses the rnorm() function to generate 4 random variates, and the next two arguments specify that those are to be arranged in a matrix of 2 rows by 2 columns. Now let’s display the matrix
Next, we use the solve() function to invert the matrix, and display the output
> mNP_inv <- solve(mNP)
[1,] 0.3360953 1.051306
[2,] -1.7260381 1.263961
Finally, we use the %*% operator to do the matrix multiplication of the two matrices and check that we arrive at the identity matrix I
> mNP %*% mNP_inv
[1,] 1 0
[2,] 0 1
If you want to repeat the last operation in Excel as a check, suppose that the first matrix sits in cells A1, B1, A2, B2 and the second one sits in cells D1, E1, D2, E2. The cells of their product would be:
Top left: =A1*D1+B1*D2
Top right: =A1*E1+B1*E2
Bottom left: =A2*D1+B2*D2
Bottom right: =A2*E1+B2*E2
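The same check can also be sketched outside R and Excel. The following is a minimal Python sketch using plain lists and no libraries; the function names invert_2x2 and multiply_2x2 are illustrative and not part of the original test case. It inverts a 2×2 matrix with the closed-form formula and multiplies using the same four cell formulas as the Excel check.

```python
import random


def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]] using the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and cannot be inverted")
    return [[d / det, -b / det], [-c / det, a / det]]


def multiply_2x2(m, n):
    """Multiply two 2x2 matrices; the variable names mirror the Excel cells."""
    (a1, b1), (a2, b2) = m
    (d1, e1), (d2, e2) = n
    return [[a1 * d1 + b1 * d2, a1 * e1 + b1 * e2],
            [a2 * d1 + b2 * d2, a2 * e1 + b2 * e2]]


# Four standard-normal variates, similar in spirit to rnorm(4) in R.
random.seed(1)
m = [[random.gauss(0, 1) for _ in range(2)] for _ in range(2)]
product = multiply_2x2(m, invert_2x2(m))
for row in product:
    # Rounding hides tiny floating-point residue around 1 and 0.
    print([round(value, 6) for value in row])
```

Up to floating-point rounding, the product should be the identity matrix, which is the same result the R session shows.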
Looking back to what my generation considers “programming”, a term later changed to “development”, we see a gradual shift of programming languages, from tools that help us talk to a machine in its native language (which is the instruction set of its processor(s)), to a toolset that comes ever closer to understanding more business terms, and needs less delving into the binary reality of a processor. What has remained the same? The need to implement certain functionality, whether it be displaying a scatter plot on screen, or calculating a standard deviation, or handling files, anything that may be required.
From that point of view, programming has become “easier”, while development has become “harder”. This is not a controversial statement. It is easier now to create an array that will hold data, for example. Less and less complexity has to be dealt with, whether in working with files, managing memory, or parallelizing. At the same time, the plethora of available tools and the complexity of the modern IT ecosystem, along with the relative simplicity of each tool, put the developer in the position of conducting a full-sized orchestra. Knowledge of procedural programming and of the libraries and tools relevant to the business, together with intimacy with and “instinct” for the data at hand, are all considered necessary assets.
This is where Python, and similar solutions like R, stand. Working with files and memory, or writing your program logic, is not very unlike using a previous-generation language. Many features, like using argv to get the executable path/name, are very similar to ANSI C. Their true strength, and their complexity, lies in becoming skilled with the data and the available functions at hand. This may take much more time to learn than just going through files and printing the famous “Hello World”. Back to the original question, and keeping in mind the title of this article: Python is easy to “program”, yet can be infinitely hard to “develop”.
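As a small illustration of the argv similarity, here is a minimal sketch; the helper name program_name is made up for this example. In Python, sys.argv[0] holds the path of the running script, much as argv[0] holds the program name in C.

```python
import os
import sys


def program_name(argv):
    """Return the file name part of argv[0], the path the script was started with."""
    return os.path.basename(argv[0])


if __name__ == "__main__":
    print("Running:", program_name(sys.argv))
    print("Arguments:", sys.argv[1:])
```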
Who should learn Python
Python is a general-purpose programming language commonly associated with Machine Learning, Artificial Intelligence and Big Data. If engineering in those domains sounds interesting to you, it is probably time to start. While it may not be part of every solution, it is a very common tool to use, along with R.
Simple things to get you started
First, install the environment. The packages are located here: https://www.python.org/downloads/ . If readers ask for it on my site, I can write a step-by-step installation guide.
If you plan to use graphics, choose a website that offers graphics functions, and become familiar with it. I started using https://plot.ly/#/ , which I think I discovered through my Google feed.
How to get inspired
The site https://www.learnpython.org/ offers online courses to get you started with the language (I am not affiliated with them, but I did find their content useful). It is highly recommended to really work through the exercises rather than scroll through the code. It took me a while to realize that indentation (tabs or spaces) marks nested blocks, for example. So start with the simplest examples and work your way up.
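To make the point about indentation concrete, here is a minimal sketch; the function classify is a made-up example, not taken from the course. Each deeper level of indentation belongs to the statement above it.

```python
def classify(numbers):
    """Label each number as 'even' or 'odd'."""
    labels = []
    for n in numbers:              # the loop body is indented one level
        if n % 2 == 0:             # the if body is indented one level further
            labels.append("even")
        else:
            labels.append("odd")
    return labels                  # back at function level: runs after the loop ends


print(classify([1, 2, 3]))  # prints ['odd', 'even', 'odd']
```

Moving the return line one level to the right would place it inside the loop, and the function would return after the first number.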
The test case was implemented in Python 3.6.5 running on an Ubuntu Linux 18.04 64-bit virtual machine. In order to carry out this test case you will need to create an account on plot.ly and create the credentials file on the host you will be running Python from. All instructions are on their website.
Suppose we have a CSV file whose first row we want to use as the X-axis of our plot, and two further rows which we want as the data on the Y-axis. It could be something like this:
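A minimal sketch of reading such a layout with Python's standard csv module follows. The sample values and the function name read_series are hypothetical, chosen only to show the layout of one X row followed by two Y rows.

```python
import csv
import io

# Hypothetical sample data: row 1 holds the X values, rows 2 and 3 hold two Y series.
SAMPLE = """1,2,3,4
10,20,30,40
5,15,25,35
"""


def read_series(text):
    """Parse CSV text into (x, y1, y2) lists of floats, one list per row."""
    rows = [[float(value) for value in row]
            for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) < 3:
        raise ValueError("expected one X row and two Y rows")
    return rows[0], rows[1], rows[2]


x, y1, y2 = read_series(SAMPLE)
print("X:", x)
print("Y1:", y1)
print("Y2:", y2)
```

The three lists can then be handed to the plotting library as the X values and the two Y series. For a file on disk, the same parsing would apply to the text read from that file.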
You need two pieces of software to get a Linux machine running on your laptop: the virtualization software, and an image of the operating system you plan to install in your Virtual Machine (VM). To keep this article as short as possible, hardware requirements and licensing for production use are left out of scope.
Identify the combination of virtualization software and Linux platform you need. This demo will install Oracle VM VirtualBox and deploy Ubuntu 18.04 64-bit on a laptop running Windows 10 Home.
Once done, it is time to launch VirtualBox and create the virtual machine. Find the “Oracle VM VirtualBox” icon on your desktop or program group. It looks like this:
When it has launched, click on the left-most icon in the toolbar (“New”), then provide a name for your VM, the type and the version. These should match the Linux distro you have downloaded.
In the next three pages of the install process you need to create a Virtual Hard Disk. The simplest choice is to select VHD (Virtual Hard Disk)/Fixed size. The distro notes should point to a minimum disk requirement (see above screenshot).
…The first time you launch your VM, it will ask for a start-up disk. This will be the Linux distribution file that we downloaded:
From then on, there will be a Welcome dialogue (similar to the initial setup screens on smartphones).
Whether in human or machine intelligence, one can think of two main categories of problems. The first is the kind of problem that has a deterministic, rule-based solution. The second is the kind of problem where a decision, or outcome, cannot be derived from a mathematical formula or from a correlation of factors that are fairly constant in number and equal in weight. How are those problems solved? By data. Lots and lots of data. While how we go from X to Y (whether Y is a category, a yes/no answer or a prediction) may not be known, we have enough (X,Y) pairs to feel confident that we can apply different models and decide which fits the data set most closely, with the least amount of error or uncertainty.
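The idea of trying different models on a set of (X,Y) pairs and keeping the one with the least error can be sketched as follows. This is a toy illustration under stated assumptions: the observations are made up, the two candidate models (a constant and a least-squares straight line) are picked only for simplicity, and all function names are hypothetical.

```python
def mean_squared_error(predict, pairs):
    """Average squared difference between predictions and observed Y values."""
    return sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)


def fit_constant(pairs):
    """Model 1: always predict the mean of the observed Y values."""
    mean_y = sum(y for _, y in pairs) / len(pairs)
    return lambda x: mean_y


def fit_line(pairs):
    """Model 2: ordinary least-squares straight line y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Hypothetical observations (X, Y).
pairs = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

models = {"constant": fit_constant(pairs), "line": fit_line(pairs)}
errors = {name: mean_squared_error(model, pairs) for name, model in models.items()}
best = min(errors, key=errors.get)
print(errors)
print("Best fit:", best)
```

Here the error is measured on the same data used for fitting, which always favors the more flexible model. Real projects usually hold back part of the data to compare models fairly.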
Current technology, both in hardware and in software, has allowed us to design systems that can store and analyze such datasets much more economically and at much greater scale than before. Big Data encompasses everything around those datasets. The data itself, the technology and software solutions to store it efficiently at scale, the procedures to unify different data sources and prepare the data for decision making, the intuition of the Data Scientists who understand the nature of the data, and the choice of tools for a certain application, are all parts of the Big Data revolution.
It would be interesting to hear in the comments which kinds of problems you would place in each of the two categories (or possibly in a different one). Thank you in advance.
The Java Development Kit and Runtime Environment are distributed by Oracle. “Java 8” refers to release 1.8. Java 10, the latest release (as of June 2018), is the first I have seen under the new naming convention. The simplest way to put it is that a Java installation allows Java programs to run on our Windows 10 computer. There are two main flavors of Java: the JDK, which allows development in Java, and the JRE, which allows the execution of JARs. In this case we will install the JDK, as it will allow us in the future to build our Spark installation.
To find the executable for the install, search online for “java jdk 1.8 download” and find the correct page on oracle.com. It will look something like this:
In the setup dialogs, select the file path of the Java installation. I prefer not to install the Java home under “Program Files”, as I am not certain that all code will be able to handle the space in the folder name, or whether it will have to be written as PROGRA~1. Also, the Java path will have to be added to the machine’s PATH, which is another reason to keep the tree short.
Once the installation wizard has finished, we need to create the environment variable JAVA_HOME, and add the path to the java.exe program to the computer’s PATH system variable.
The way to do this in Windows 10 is to click on the magnifying glass button, right next to the Start menu. It looks like this:
Windows will offer “Edit the system environment variables (Control Panel)”, so we select that. Click on “New…” to add JAVA_HOME as a user variable for your account, and then append %JAVA_HOME%\bin to the PATH.
A machine can have more than one path to a JRE/JDK. Changing JAVA_HOME and PATH is all it takes to “activate” a certain version of Java. To confirm that the installation was successful, and that we are indeed using the version we need, we open a Windows command prompt (again we can use the magnifier and type “cmd”) and then type java -version. This validates both that the correct version of Java is active and that the Java executable can be found through the system PATH.
A word of caution: if Oracle software is already installed, it may have placed its own Java path at the beginning of the PATH string, in which case that Java is found first.
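The relationship between JAVA_HOME and PATH can also be checked programmatically. The sketch below is only an illustration: the function name java_bin_position and the example paths are hypothetical, and it works on a dictionary of variables rather than on the live system. It reports where the JAVA_HOME bin folder sits within PATH, which matters because entries are searched from left to right.

```python
import ntpath  # Windows path rules, usable on any operating system


def java_bin_position(env):
    """Return the index of JAVA_HOME's bin folder within PATH, or None if absent."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    expected = ntpath.normcase(ntpath.normpath(ntpath.join(java_home, "bin")))
    entries = [entry for entry in env.get("PATH", "").split(";") if entry]
    for index, entry in enumerate(entries):
        # Expand a literal %JAVA_HOME% reference before comparing.
        expanded = entry.replace("%JAVA_HOME%", java_home)
        if ntpath.normcase(ntpath.normpath(expanded)) == expected:
            return index
    return None


# Hypothetical environment, for illustration only.
example_env = {
    "JAVA_HOME": r"C:\Java\jdk1.8",
    "PATH": r"C:\Windows\system32;%JAVA_HOME%\bin",
}
print("Position of the Java bin folder in PATH:", java_bin_position(example_env))
```

A position greater than zero means earlier PATH entries are searched first, so another java.exe located in one of them would take precedence.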
Instead of creating the data frame programmatically, why not use an existing spreadsheet that is available online? A simple HTTP file server that is free and open source is HFS: http://rejetto.com/hfs/?f=
Once HFS is installed, uploading the spreadsheet is as simple as dragging and dropping an existing spreadsheet. Here is how it looks on my machine:
Once the file is in place, we can get its URL directly by right-clicking on the file, and selecting “Copy URL address – Ctrl+C”
The first time you need to read a spreadsheet into R, some additional software is needed
Since the spreadsheet contained one more column, we need to discard the Date column (in order to have the data frame exactly the same as in our example with inline data):
> dfB2 <- data.frame(dfXLBAN$Name,dfXLBAN$Bananas)