Learning about Machine Learning

Softray Solutions
7 min read · Jun 25, 2020


Written by Armin Hrncic, Software Developer at Softray Solutions

Big Data, Machine Learning, Artificial Intelligence… We have been hearing these buzzwords constantly over the last couple of years. What do they mean, and why now? Why does everyone suddenly want to implement or use them? After reading this post you will have a clearer picture of what these terms mean, and even if you already know all of this, you may find a few interesting facts and analogies you haven't heard yet. We will also write a little machine learning program. We can be excited about technological progress, or perhaps scared of robots taking over the world, but one thing is for sure — we should not be indifferent to it.

Up next on the 1950s news: Artificial Intelligence

Not to offend anyone from the 1950s, but we can say that the term AI is surprisingly old. Ok, fine, middle-aged if you wish. It was coined by John McCarthy in 1956, inspired by Alan Turing, who had proposed the famous Turing Test five years earlier. The test is considered the door that opened the AI field. Its goal was to answer the question: can we build a computer that imitates a human well enough that a suspicious judge cannot tell the difference between human and machine? The concept of the test involves a machine, a human, and a human questioner. The questioner's job is to decide which is the machine and which is the human based on their answers. Since the test was formulated, many AI programs have been claimed to pass it; one of the earliest was ELIZA, created by Joseph Weizenbaum. Based on this, we can describe AI as the capability of a machine to imitate intelligent human behavior.

McCarthy and Turing — founding fathers of Artificial Intelligence

With all their contributions, we can say that both Turing and McCarthy are founding fathers of Artificial Intelligence. With all their talk of thinking machines and their nerdy hipster looks, we can be assured that both were ahead of their time. Fun fact about McCarthy: he is the inventor of garbage collection, for which we are all very thankful.

Ok, enough with the history lesson, let’s talk about some actual machine learning facts.

Machine Learning

We already said that AI is, in essence, machines learning to act like humans. How can machines do such a thing? The answer is by processing input data and recognizing attributes and patterns in it. So we can conclude that ML is the subset of AI focused on the 'learning' part. Machine learning is the study of computer algorithms that improve automatically through experience. ML algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. In other words, the machine gets data, applies mathematically advanced algorithms, and gives different output for different input. Just like a human, right?

For example, consider a picture of a box as input. The algorithm processes it and recognizes attributes that describe a box: it has 4 sides, it is closed, it is connected — it must be a box. The more attributes the algorithm can recognize, the better its prediction will be.

ML is a term from the late 80s and early 90s — not old, but still almost 30 years ago — yet its peak is now.

What is the big deal about Big data?

The term Big Data represents data that is so large, fast, or complex that it is difficult or impossible to process with traditional methods. Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media, and more. Storing all this data used to be a problem, but nowadays it is a lot cheaper thanks to cloud storage and data lakes. We can finally say that ML is at its peak because of all the data now available. We have larger amounts of data, better hardware, and better software. Statistics show that by 2020 about 44 zettabytes of data (44 trillion gigabytes) will have accumulated, and algorithms are highly advanced, so everything is prepared for AI to take over.

There is a great analogy between the first AI developers and the builders of the first churches. A church was built over decades, and most of its builders never saw the final product. Just like them, most early AI developers worked on creating components and algorithms that would only later become part of the final structure. Well, ladies and gentlemen, the future is now.

Iris Classification program

Ok, let's get our hands dirty and try something. We will build the Iris classification program, also known as machine learning's equivalent of the "Hello World" program, to actually start learning about machine learning. We will need Anaconda, a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning). After installation, we open the Anaconda prompt and run the command: jupyter notebook. Jupyter Notebook combines live code, graphics, visualizations, and text in shareable notebooks that run in a web browser. A new page will open in the browser at http://localhost:8888/tree. Now we select a new Python 3 notebook and we are ready to write our code:

Iris

Iris is a genus of flowering plants that contains several species, such as setosa, versicolor, and virginica. We need to create a model that can classify the different species of the Iris flower based on the given features.

Dataset

First, we are going to import the Iris dataset from the scikit-learn library. The dataset contains 150 samples of iris flowers, 3 labels — the species of Iris (Iris setosa, Iris virginica, and Iris versicolor) — and 4 features: sepal length, sepal width, petal length, and petal width, all in cm.

The dataset is an object that contains the data (sample features), the target (sample classifications), and the feature and target names.
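A minimal sketch of this loading step, assuming scikit-learn is installed (the variable name `iris` is our own choice):

```python
# Load the Iris dataset that ships with scikit-learn
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)      # (150, 4): 150 samples, 4 features each
print(iris.target_names)    # the 3 species labels
print(iris.feature_names)   # sepal/petal length and width, in cm
```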

Algorithm

So we have the dataset — what next? The next step is finding the proper algorithm to use on our dataset. There are lots of algorithms used in machine learning, and we should really explore and understand them in order to use them for our predictions.

But for now, let's quickly analyze our dataset. We can easily represent the sepal features on a graph by creating a scatter plot using the matplotlib library.
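A sketch of that scatter plot, assuming matplotlib is installed alongside scikit-learn (the choice of which columns to plot follows the sepal features mentioned above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()

# Sepal length (column 0) vs. sepal width (column 1), colored by species
scatter = plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.legend(*scatter.legend_elements(), title="species")
plt.show()
```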

Each color represents one iris group, and by looking at the graph we can see that for each dot, most of its neighbors belong to the same group. We can use the nearest neighbor algorithm, which classifies an item based on the features of its neighbors. We just need to import this algorithm and split our data into a training set and a test set. The sklearn method train_test_split will separate our dataset into a training set, which gets 75% of the data, and a test set, which gets the remaining 25%. By calling the fit method of our classifier, we train our model. The difference is that the training labels are available to the model, whereas the test data is unseen data for which predictions have to be made.
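The split-and-train step can be sketched as follows; the variable names and `random_state=0` are our own choices, used only to make the run reproducible:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# train_test_split defaults to a 75% / 25% split;
# random_state=0 fixes the shuffle so the run is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Classify each sample by its single nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
```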

Finally, by calling the score method, our model makes predictions for the test targets, and the score shows us how often the prediction was correct.
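The scoring step, as a self-contained sketch using the same kind of 75/25 split described in the text (again with an arbitrary `random_state=0` of our own choosing):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# score() predicts labels for X_test and returns the
# fraction that match the true labels in y_test
accuracy = knn.score(X_test, y_test)
print(accuracy)
```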

So we managed to score 97%, which is a solid result. To finish, let's make a prediction with our model:
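A sketch of such a prediction; the sample measurements below are hypothetical values chosen for illustration, not taken from the article:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Hypothetical measurements in cm: sepal length, sepal width,
# petal length, petal width -- the tiny petal suggests a setosa
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
prediction = knn.predict(X_new)
print(iris.target_names[prediction[0]])
```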

Based on the input features, the model predicts that this is an iris setosa. If you found this interesting, you can try some other algorithm or do some feature engineering to get closer to 100%.

If you enjoyed reading this, click on the clap button so others can find this post.
