Intro to Classification with Python and Sklearn Part_2 (train and test part)

Ben Brown
2 min read · Apr 13, 2020

In the first tutorial, the classification process used data we created ourselves. This time, the classification is done with imported data.

from sklearn.datasets import load_iris

load_iris is a function in the ‘datasets’ module. As the next line shows, calling it returns the data as a ‘Bunch’ object, a dictionary-like container.

iris = load_iris()
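To get a feel for what the returned object holds, a short sketch like the one below can help. It only inspects the object; the printed shapes reflect the standard iris dataset that ships with scikit-learn.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary whose keys can also be read as attributes
print(sorted(iris.keys()))

# 150 flowers, each described by 4 measurements
print(iris.data.shape)    # (150, 4)

# One class label (0, 1 or 2) per flower
print(iris.target.shape)  # (150,)
```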

The classification problem here uses the famous iris flower dataset, which contains 3 different species of iris flowers. To predict the correct species, four characteristics are considered: sepal length, sepal width, petal length, and petal width. The sklearn.datasets module makes this information easy to inspect with the following code:

print(iris.feature_names)
print(iris.target_names)

This tutorial follows the same main steps as the first, but this time the data is separated into train data and test data.

The test feature data will be extracted from the feature data, and the test target data will be extracted from the target data. This will be done by indexing. Know this: samples 0-49 are classified as ‘setosa’, samples 50-99 as ‘versicolor’, and samples 100-149 as ‘virginica’.
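The position of each class in the dataset can be checked directly rather than taken on faith. The sketch below is one way to do that; the variable name `ranges` is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# For each class label, find the first and last row index where it appears
ranges = {}
for label, name in enumerate(iris.target_names):
    rows = np.where(iris.target == label)[0]
    ranges[str(name)] = (int(rows.min()), int(rows.max()), len(rows))
    print(name, ranges[str(name)])

# setosa (0, 49, 50)
# versicolor (50, 99, 50)
# virginica (100, 149, 50)
```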

The test data will contain one sample from each class. To create a test variable with the needed data, implement the code below:

# iris.data is the array that contains the features
test_extract = [0,50,100]
test_features = iris.data[test_extract]

The same thing can be done with the target data. The ‘test_extract’ variable is a list because indexing a NumPy array with a list of row positions returns all of those rows in one step.

# iris.target has classification results based on feature data
test_target = iris.target[test_extract]
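As a quick check on what the indexing produced, the sketch below prints the shape of the extracted features and translates the numeric targets back into species names. It repeats the extraction so that it can be run on its own.

```python
from sklearn.datasets import load_iris

iris = load_iris()
test_extract = [0, 50, 100]

# Indexing with a list returns one row per listed position
test_features = iris.data[test_extract]
test_target = iris.target[test_extract]

print(test_features.shape)             # (3, 4)
print(test_target)                     # [0 1 2]

# target_names is also an array, so the labels can be used as indices
print(iris.target_names[test_target])  # ['setosa' 'versicolor' 'virginica']
```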

The difference between the first tutorial and this one is that this one applies the principle of training data and testing data. The purpose is to use the testing data to check how good the model is based on DATA IT HAS NOT SEEN BEFORE. What better way to test it?

Since the correct output is known for the testing data, the model's predictions can be compared against it directly to see how well the model performs.

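The same classifier code from the previous tutorial is used here, with one change: the three test samples are removed from the data before training, so the model really has not seen them. The beautiful thing about importing this data from the scikit-learn library is that it is already formatted and cleaned, which makes testing models VERY easy, since data cleaning and wrangling is the major time-consuming step in the machine learning pipeline.

import numpy as np
from sklearn import tree
# Remove the three test rows so the model is not trained on them
train_features = np.delete(iris.data, test_extract, axis=0)
train_target = np.delete(iris.target, test_extract)
clf = tree.DecisionTreeClassifier().fit(train_features, train_target)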

To compare the model's predictions to the correct output, pass the test features to the classifier's predict method.

print("predicted output:", clf.predict(test_features))
print("correct output:  ", test_target)
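Putting the pieces together, the whole train and test flow might look like the sketch below. It holds the three test rows out of the training data with np.delete and scores the predictions with accuracy_score. Both of those, along with random_state=0 and the names train_features, train_target, predictions and accuracy, are choices made for this illustration and not necessarily what the linked code does.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()
test_extract = [0, 50, 100]

# Test set: one sample from each class
test_features = iris.data[test_extract]
test_target = iris.target[test_extract]

# Training set: everything except the three test rows
train_features = np.delete(iris.data, test_extract, axis=0)
train_target = np.delete(iris.target, test_extract)

# random_state is fixed only so repeated runs build the same tree
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_features, train_target)

predictions = clf.predict(test_features)
accuracy = accuracy_score(test_target, predictions)

print("predicted:", predictions)
print("correct:  ", test_target)
print("accuracy: ", accuracy)
```

With only three test samples the accuracy figure is very coarse, so it is best read as a sanity check rather than a real measure of model quality.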

To check out the whole code go here.
