CART: Classification AND Regression Trees
Introduction
The goal of the Decision Tree Regression Algorithm is to split your data into groups. The algorithm uses something called information entropy, which is a mathematical process. The point of the algorithm is to split the data in such a way that information is added in the form of groupings. It stops splitting the data when it’s unable to add any more information to the dataset.
For example, given a scatter plot with two Independent Variables and with ‘y’ being in the third dimension like seen below, we can see how by splitting the data into groups we have a clearer picture of what’s going on.
Each split is called a leaf and the final leaves are called terminal leaves.
So, let’s say we add a new point and (where = 30 and = 150) In order to predict the ‘y’ value we take the average of all the values with the terminal leaf and that value will be assigned to our new point.
So, here in this case the green boxes represent the average of the terminal leaf. If we take P(30, 150), our prediction for ‘y’ is -64.1.
Another way to visualize the dataset is like this.
Here, we simply follow the tree. So in the case of P(30, 150), < 20 false. < 170 is true. < 40 is true. So the prediction will be again -64.1.
Building Decision Tree Regression
Use the Regression Template to preprocess the data.
Creating the Regressor
Import the DecisionTreeRegressor
class from the library sklearn.tree
and create an object of the class.
DecisionTreeRegressor takes the arguments:
random_state
: to get the same result use the same number for random_state
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
Visualize the dataset
The problem is that we need to make sure the resolution is higher using the template. This is because a Decision Tree is not a continous function but instead a step wise function.
#HIGH RESOLUTION AND SMOOTHER CURVE
X_grid = np.arrange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color='red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Title of my Model)
plt.xlabel('X Axis Title')
plt.ylabel('Y Axis Title')
plt.show()