Learn With Other Kaggle Users

Classify forest types based on information about the area

This is a beginner-friendly Kaggle competition, so we will learn machine learning together using the forestry data set.

Welcome to our first, invite-only, beginner-friendly competition.

Some people associate Kaggle with ultra-competitive machine learning masterminds. But we’re also a community of people learning together. In this competition we’re skipping the medals, points, and prizes so people who are just getting started can share ideas, experiment with different techniques and learn from each other.

The challenge:

In this competition you’ll predict what types of trees there are in an area based on various geographic features.

The competition dataset comes from a study conducted in four wilderness areas within the beautiful Roosevelt National Forest of northern Colorado. These areas represent forests with very little human disturbance; the existing forest cover types there are more a result of ecological processes than of forest management practices.

The data is in raw form and contains categorical data such as wilderness areas and soil type.

Evaluation

Submissions are evaluated on categorization accuracy.

Accuracy is simply the fraction of predictions you got right. You will want to use classifier models such as RandomForestClassifier rather than regression models. With a classifier, the predict method returns the most likely category for each row.
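As a minimal sketch of the metric (the labels below are made up for illustration and are not taken from the competition data), categorization accuracy is the number of matching predictions divided by the total number of predictions:

```python
def categorization_accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical cover types for four patches; three of the four predictions match.
y_true = [1, 2, 5, 7]
y_pred = [1, 2, 5, 3]
print(categorization_accuracy(y_true, y_pred))  # 0.75
```

This is the same quantity that sklearn.metrics.accuracy_score returns with its default settings.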

Submission File

For each ID in the test set, you must predict a class for the TARGET variable. The file should contain a header and have the following format:

ID,TARGET
2,0
5,3
6,2
etc.

Data Description

The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

    • Spruce/Fir
    • Lodgepole Pine
    • Ponderosa Pine
    • Cottonwood/Willow
    • Aspen
    • Douglas-fir
    • Krummholz

The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations).

Data Fields

  • Elevation - Elevation in meters
  • Aspect - Aspect in degrees azimuth
  • Slope - Slope in degrees
  • Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
  • Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
  • Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
  • Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
  • Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
  • Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
  • Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
  • Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
  • Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
  • Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

The wilderness areas are:

    • Rawah Wilderness Area
    • Neota Wilderness Area
    • Comanche Peak Wilderness Area
    • Cache la Poudre Wilderness Area

The soil types are:

  1. Cathedral family - Rock outcrop complex, extremely stony.
  2. Vanet - Ratake families complex, very stony.
  3. Haploborolis - Rock outcrop complex, rubbly.
  4. Ratake family - Rock outcrop complex, rubbly.
  5. Vanet family - Rock outcrop complex, rubbly.
  6. Vanet - Wetmore families - Rock outcrop complex, stony.
  7. Gothic family.
  8. Supervisor - Limber families complex.
  9. Troutville family, very stony.
  10. Bullwark - Catamount families - Rock outcrop complex, rubbly.
  11. Bullwark - Catamount families - Rock land complex, rubbly.
  12. Legault family - Rock land complex, stony.
  13. Catamount family - Rock land - Bullwark family complex, rubbly.
  14. Pachic Argiborolis - Aquolis complex.
  15. unspecified in the USFS Soil and ELU Survey.
  16. Cryaquolis - Cryoborolis complex.
  17. Gateview family - Cryaquolis complex.
  18. Rogert family, very stony.
  19. Typic Cryaquolis - Borohemists complex.
  20. Typic Cryaquepts - Typic Cryaquolls complex.
  21. Typic Cryaquolls - Leighcan family, till substratum complex.
  22. Leighcan family, till substratum, extremely bouldery.
  23. Leighcan family, till substratum - Typic Cryaquolls complex.
  24. Leighcan family, extremely stony.
  25. Leighcan family, warm, extremely stony.
  26. Granile - Catamount families complex, very stony.
  27. Leighcan family, warm - Rock outcrop complex, extremely stony.
  28. Leighcan family - Rock outcrop complex, extremely stony.
  29. Como - Legault families complex, extremely stony.
  30. Como family - Rock land - Legault family complex, extremely stony.
  31. Leighcan - Catamount families complex, extremely stony.
  32. Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
  33. Leighcan - Catamount families - Rock outcrop complex, extremely stony.
  34. Cryorthents - Rock land complex, extremely stony.
  35. Cryumbrepts - Rock outcrop - Cryaquepts complex.
  36. Bross family - Rock land - Cryumbrepts complex, extremely stony.
  37. Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
  38. Leighcan - Moran families - Cryaquolls complex, extremely stony.
  39. Moran family - Cryorthents - Leighcan family complex, extremely stony.
  40. Moran family - Cryorthents - Rock land complex, extremely stony.

Data Analysis and Visualization

OK, so before diving into the modeling part, we will do some data analysis and visualization.

There is a great notebook by Fatih Bilgin that does this well: https://www.kaggle.com/fatihbilgin/quick-visualization-and-eda-for-beginners

from google.colab import drive
drive.mount('/gdrive')
Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
PATH = "/gdrive/My\ Drive/DataSciLab/competitions/learn-together/"
Path = "/gdrive/My Drive/DataSciLab/competitions/learn-together/"
!ls {PATH}
sample_submission.csv  test.csv  train.csv
import pandas as pd


train = pd.read_csv(f"{Path}train.csv", index_col = "Id")
test = pd.read_csv(f"{Path}test.csv", index_col = "Id")
train.head()
Elevation Aspect Slope Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways Hillshade_9am Hillshade_Noon Hillshade_3pm Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2 Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3 Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9 Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15 Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21 Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27 Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39 Soil_Type40 Cover_Type
Id
1 2596 51 3 258 0 510 221 232 148 6279 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 5
2 2590 56 2 212 -6 390 220 235 151 6225 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 5
3 2804 139 9 268 65 3180 234 238 135 6121 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
4 2785 155 18 242 118 3090 238 238 122 6211 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2
5 2595 45 2 153 -1 391 220 234 150 6172 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 5
print(f"Training Data Set Shape: {train.shape} \nTest Data Set Shape: {test.shape}")
Training Data Set Shape: (15120, 55)
Test Data Set Shape: (565892, 54)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15120 entries, 1 to 15120
Data columns (total 55 columns):
Elevation                             15120 non-null int64
Aspect                                15120 non-null int64
Slope                                 15120 non-null int64
Horizontal_Distance_To_Hydrology      15120 non-null int64
Vertical_Distance_To_Hydrology        15120 non-null int64
Horizontal_Distance_To_Roadways       15120 non-null int64
Hillshade_9am                         15120 non-null int64
Hillshade_Noon                        15120 non-null int64
Hillshade_3pm                         15120 non-null int64
Horizontal_Distance_To_Fire_Points    15120 non-null int64
Wilderness_Area1                      15120 non-null int64
Wilderness_Area2                      15120 non-null int64
Wilderness_Area3                      15120 non-null int64
Wilderness_Area4                      15120 non-null int64
Soil_Type1                            15120 non-null int64
Soil_Type2                            15120 non-null int64
Soil_Type3                            15120 non-null int64
Soil_Type4                            15120 non-null int64
Soil_Type5                            15120 non-null int64
Soil_Type6                            15120 non-null int64
Soil_Type7                            15120 non-null int64
Soil_Type8                            15120 non-null int64
Soil_Type9                            15120 non-null int64
Soil_Type10                           15120 non-null int64
Soil_Type11                           15120 non-null int64
Soil_Type12                           15120 non-null int64
Soil_Type13                           15120 non-null int64
Soil_Type14                           15120 non-null int64
Soil_Type15                           15120 non-null int64
Soil_Type16                           15120 non-null int64
Soil_Type17                           15120 non-null int64
Soil_Type18                           15120 non-null int64
Soil_Type19                           15120 non-null int64
Soil_Type20                           15120 non-null int64
Soil_Type21                           15120 non-null int64
Soil_Type22                           15120 non-null int64
Soil_Type23                           15120 non-null int64
Soil_Type24                           15120 non-null int64
Soil_Type25                           15120 non-null int64
Soil_Type26                           15120 non-null int64
Soil_Type27                           15120 non-null int64
Soil_Type28                           15120 non-null int64
Soil_Type29                           15120 non-null int64
Soil_Type30                           15120 non-null int64
Soil_Type31                           15120 non-null int64
Soil_Type32                           15120 non-null int64
Soil_Type33                           15120 non-null int64
Soil_Type34                           15120 non-null int64
Soil_Type35                           15120 non-null int64
Soil_Type36                           15120 non-null int64
Soil_Type37                           15120 non-null int64
Soil_Type38                           15120 non-null int64
Soil_Type39                           15120 non-null int64
Soil_Type40                           15120 non-null int64
Cover_Type                            15120 non-null int64
dtypes: int64(55)
memory usage: 6.5 MB
train.describe().T
count mean std min 25% 50% 75% max
Elevation 15120.0 2749.322553 417.678187 1863.0 2376.0 2752.0 3104.00 3849.0
Aspect 15120.0 156.676653 110.085801 0.0 65.0 126.0 261.00 360.0
Slope 15120.0 16.501587 8.453927 0.0 10.0 15.0 22.00 52.0
Horizontal_Distance_To_Hydrology 15120.0 227.195701 210.075296 0.0 67.0 180.0 330.00 1343.0
Vertical_Distance_To_Hydrology 15120.0 51.076521 61.239406 -146.0 5.0 32.0 79.00 554.0
Horizontal_Distance_To_Roadways 15120.0 1714.023214 1325.066358 0.0 764.0 1316.0 2270.00 6890.0
Hillshade_9am 15120.0 212.704299 30.561287 0.0 196.0 220.0 235.00 254.0
Hillshade_Noon 15120.0 218.965608 22.801966 99.0 207.0 223.0 235.00 254.0
Hillshade_3pm 15120.0 135.091997 45.895189 0.0 106.0 138.0 167.00 248.0
Horizontal_Distance_To_Fire_Points 15120.0 1511.147288 1099.936493 0.0 730.0 1256.0 1988.25 6993.0
Wilderness_Area1 15120.0 0.237897 0.425810 0.0 0.0 0.0 0.00 1.0
Wilderness_Area2 15120.0 0.033003 0.178649 0.0 0.0 0.0 0.00 1.0
Wilderness_Area3 15120.0 0.419907 0.493560 0.0 0.0 0.0 1.00 1.0
Wilderness_Area4 15120.0 0.309193 0.462176 0.0 0.0 0.0 1.00 1.0
Soil_Type1 15120.0 0.023479 0.151424 0.0 0.0 0.0 0.00 1.0
Soil_Type2 15120.0 0.041204 0.198768 0.0 0.0 0.0 0.00 1.0
Soil_Type3 15120.0 0.063624 0.244091 0.0 0.0 0.0 0.00 1.0
Soil_Type4 15120.0 0.055754 0.229454 0.0 0.0 0.0 0.00 1.0
Soil_Type5 15120.0 0.010913 0.103896 0.0 0.0 0.0 0.00 1.0
Soil_Type6 15120.0 0.042989 0.202840 0.0 0.0 0.0 0.00 1.0
Soil_Type7 15120.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Soil_Type8 15120.0 0.000066 0.008133 0.0 0.0 0.0 0.00 1.0
Soil_Type9 15120.0 0.000661 0.025710 0.0 0.0 0.0 0.00 1.0
Soil_Type10 15120.0 0.141667 0.348719 0.0 0.0 0.0 0.00 1.0
Soil_Type11 15120.0 0.026852 0.161656 0.0 0.0 0.0 0.00 1.0
Soil_Type12 15120.0 0.015013 0.121609 0.0 0.0 0.0 0.00 1.0
Soil_Type13 15120.0 0.031481 0.174621 0.0 0.0 0.0 0.00 1.0
Soil_Type14 15120.0 0.011177 0.105133 0.0 0.0 0.0 0.00 1.0
Soil_Type15 15120.0 0.000000 0.000000 0.0 0.0 0.0 0.00 0.0
Soil_Type16 15120.0 0.007540 0.086506 0.0 0.0 0.0 0.00 1.0
Soil_Type17 15120.0 0.040476 0.197080 0.0 0.0 0.0 0.00 1.0
Soil_Type18 15120.0 0.003968 0.062871 0.0 0.0 0.0 0.00 1.0
Soil_Type19 15120.0 0.003042 0.055075 0.0 0.0 0.0 0.00 1.0
Soil_Type20 15120.0 0.009193 0.095442 0.0 0.0 0.0 0.00 1.0
Soil_Type21 15120.0 0.001058 0.032514 0.0 0.0 0.0 0.00 1.0
Soil_Type22 15120.0 0.022817 0.149326 0.0 0.0 0.0 0.00 1.0
Soil_Type23 15120.0 0.050066 0.218089 0.0 0.0 0.0 0.00 1.0
Soil_Type24 15120.0 0.016997 0.129265 0.0 0.0 0.0 0.00 1.0
Soil_Type25 15120.0 0.000066 0.008133 0.0 0.0 0.0 0.00 1.0
Soil_Type26 15120.0 0.003571 0.059657 0.0 0.0 0.0 0.00 1.0
Soil_Type27 15120.0 0.000992 0.031482 0.0 0.0 0.0 0.00 1.0
Soil_Type28 15120.0 0.000595 0.024391 0.0 0.0 0.0 0.00 1.0
Soil_Type29 15120.0 0.085384 0.279461 0.0 0.0 0.0 0.00 1.0
Soil_Type30 15120.0 0.047950 0.213667 0.0 0.0 0.0 0.00 1.0
Soil_Type31 15120.0 0.021958 0.146550 0.0 0.0 0.0 0.00 1.0
Soil_Type32 15120.0 0.045635 0.208699 0.0 0.0 0.0 0.00 1.0
Soil_Type33 15120.0 0.040741 0.197696 0.0 0.0 0.0 0.00 1.0
Soil_Type34 15120.0 0.001455 0.038118 0.0 0.0 0.0 0.00 1.0
Soil_Type35 15120.0 0.006746 0.081859 0.0 0.0 0.0 0.00 1.0
Soil_Type36 15120.0 0.000661 0.025710 0.0 0.0 0.0 0.00 1.0
Soil_Type37 15120.0 0.002249 0.047368 0.0 0.0 0.0 0.00 1.0
Soil_Type38 15120.0 0.048148 0.214086 0.0 0.0 0.0 0.00 1.0
Soil_Type39 15120.0 0.043452 0.203880 0.0 0.0 0.0 0.00 1.0
Soil_Type40 15120.0 0.030357 0.171574 0.0 0.0 0.0 0.00 1.0
Cover_Type 15120.0 4.000000 2.000066 1.0 2.0 4.0 6.00 7.0

All Wilderness_Area and Soil_Type columns have values between 0 and 1. Quite likely these columns are binary indicators that take only the values 0 and 1. To validate this, I’m checking the distinct values of the following columns:

print(train.iloc[:, 10:-1].columns)
Index(['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3',
       'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3',
       'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8',
       'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12',
       'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16',
       'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20',
       'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24',
       'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28',
       'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32',
       'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36',
       'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40'],
      dtype='object')
pd.unique(train.iloc[:,10:-1].values.ravel())
array([1, 0])

Yes, all wilderness area and soil type columns consist of 0 and 1. In other words, they are categorical, so I’m converting these columns to categorical ones.

train.iloc[:,10:-1] = train.iloc[:,10:-1].astype("category")
test.iloc[:,10:] = test.iloc[:,10:].astype("category")
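Since each group of binary columns encodes one categorical variable, the group can also be collapsed into a single label column. The sketch below uses a toy frame with invented values and assumes that exactly one indicator is set per row, which the data description does not state outright, so the code checks it first. The column name Wilderness_Area for the collapsed result is my own choice.

```python
import pandas as pd

# Toy frame with the same naming scheme as the competition data (values invented).
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
    "Wilderness_Area4": [0, 0, 0],
})

area_cols = [c for c in toy.columns if c.startswith("Wilderness_Area")]

# Assumption being checked: exactly one indicator is set in every row.
assert (toy[area_cols].sum(axis=1) == 1).all()

# idxmax(axis=1) returns the name of the column holding the 1 in each row.
toy["Wilderness_Area"] = toy[area_cols].idxmax(axis=1).astype("category")
print(toy["Wilderness_Area"].tolist())
```

Tree-based models work fine with the original indicator columns, so this is mainly useful for plotting and grouping.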

In this step I’m using a heatmap to look for correlations between columns.

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('dark_background')

f,ax = plt.subplots(figsize=(8,6))
sns.heatmap(train.corr(),annot=True, linewidths=.5, fmt='.1f', ax=ax)
plt.show()

The strongest correlations appear to be between “Horizontal Distance To Hydrology” and “Vertical Distance To Hydrology” (about 0.7), between “Aspect” and “Hillshade 3pm” (about 0.6), between “Hillshade Noon” and “Hillshade 3pm” (about 0.6), and between “Elevation” and “Horizontal Distance To Roadways” (about 0.6). Let’s see how they look.
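Reading the strongest pairs off a heatmap by eye is error-prone, so here is a small sketch that lists them from the correlation matrix instead. The helper name top_correlated_pairs and the demo data are my own and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Invented data: b is a linear function of a, c alternates in sign.
demo = pd.DataFrame({
    "a": np.arange(10.0),
    "b": 2.0 * np.arange(10.0) + 1.0,
    "c": np.tile([1.0, -1.0], 5),
})
print(top_correlated_pairs(demo, n=2))
```

On the notebook's data, a call such as top_correlated_pairs(train.iloc[:, :10]) would restrict the search to the ten continuous columns.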

train.plot(kind='scatter', x='Vertical_Distance_To_Hydrology', y='Horizontal_Distance_To_Hydrology', alpha=0.5, color='yellow', figsize = (12,9))
plt.title('Vertical And Horizontal Distance To Hydrology')
plt.xlabel("Vertical Distance")
plt.ylabel("Horizontal Distance")
plt.show()

train.plot(kind='scatter', x='Aspect', y='Hillshade_3pm', alpha=0.5, color='maroon', figsize = (12,9))
plt.title('Aspect and Hillshade 3pm Relation')
plt.xlabel("Aspect")
plt.ylabel("Hillshade 3pm")
plt.show()

train.plot(kind='scatter', x='Hillshade_Noon', y='Hillshade_3pm', alpha=0.5, color='purple', figsize = (12,9))
plt.title('Hillshade Noon and Hillshade 3pm Relation')
plt.xlabel("Hillshade_Noon")
plt.ylabel("Hillshade 3pm")
plt.show()

There are obvious patterns if we ignore the outliers, and these are the patterns our model will learn from.

Box plots can be used to see outliers. For a better visualization I will use plotly this time.

Wow, this is a very interesting plot! I need to learn this too!

!pip install plotly
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (4.1.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from plotly) (1.12.0)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly) (1.3.3)
import plotly.graph_objs as go
from plotly.offline import iplot

trace1 = go.Box(
    y= train["Vertical_Distance_To_Hydrology"],
    name = 'Vertical Distance',
    marker = dict(color = 'rgb(0,145,119)')
)
trace2 = go.Box(
    y= train["Horizontal_Distance_To_Hydrology"],
    name = 'Horizontal Distance',
    marker = dict(color = 'rgb(5, 79, 174)')
)

data = [trace1, trace2]
layout = dict(autosize=False, width=700,height=500, title='Distance To Hydrology', paper_bgcolor='rgb(243, 243, 243)',
              plot_bgcolor='rgb(243, 243, 243)', margin=dict(l=40,r=30,b=80,t=100,))
fig = dict(data=data, layout=layout)
iplot(fig)
trace1 = go.Box(
    y= train["Hillshade_Noon"],
    name = 'Hillshade Noon',
    marker = dict(color = 'rgb(255,111,145)')
)
trace2 = go.Box(
    y= train["Hillshade_3pm"],
    name = 'Hillshade 3pm',
    marker = dict(color = 'rgb(132,94,194)')
)

data = [trace1, trace2]
layout = dict(autosize=False, width=700,height=500, title='Hillshade 3pm and Noon', paper_bgcolor='rgb(243, 243, 243)',
              plot_bgcolor='rgb(243, 243, 243)', margin=dict(l=40,r=30,b=80,t=100,))
fig = dict(data=data, layout=layout)
iplot(fig)
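Box plots conventionally mark points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a rough sketch under that convention (the function name and the sample distances are invented), the same points can be counted directly:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Invented distances: one value sits far above the rest.
distances = [0, 30, 60, 90, 120, 150, 1300]
mask = iqr_outlier_mask(distances)
print(int(mask.sum()))  # 1
```

Plotting libraries may compute quartiles slightly differently, so the count can differ a little from what a given box plot shows.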

This time I’ll compare the vertical and horizontal distances to hydrology with histograms.

f,ax=plt.subplots(1,2,figsize=(15,7))
train.Vertical_Distance_To_Hydrology.plot.hist(ax=ax[0],bins=30,edgecolor='black',color='crimson')
ax[0].set_title('Vertical Distance To Hydrology')
x1=list(range(-150,350,50))
ax[0].set_xticks(x1)
train.Horizontal_Distance_To_Hydrology.plot.hist(ax=ax[1],bins=30,edgecolor='black',color='darkmagenta')
ax[1].set_title('Horizontal Distance To Hydrology')
x2=list(range(0,1000,100))
ax[1].set_xticks(x2)
plt.show()

Let’s take a look at our categorical variables, soil types and wilderness areas.

soil_types = train.iloc[:,14:-1].sum(axis=0)

plt.figure(figsize=(18,9))
sns.barplot(x=soil_types.index, y=soil_types.values, palette="rocket")
plt.xticks(rotation= 75)
plt.ylabel('Total')
plt.title('Count of Soil Types With Value 1',color = 'darkred',fontsize=12)
Text(0.5, 1.0, 'Count of Soil Types With Value 1')

Type 7, Type 8, Type 15 and Type 25 have either no rows or very few. These need careful examination before building a model.
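The summary table earlier agrees with this: Soil_Type7 and Soil_Type15 have a mean of 0, and a mean of 0.000066 over 15120 rows corresponds to a single row for Soil_Type8 and Soil_Type25. Here is a small sketch for flagging such columns automatically. The helper name, the min_count threshold and the toy data are all my own assumptions.

```python
import pandas as pd

def rare_indicator_columns(df, min_count=5):
    """Return indicator columns whose number of 1s is below min_count."""
    counts = df.sum(axis=0)
    return counts[counts < min_count].sort_values()

# Invented indicators: Soil_TypeA is never set, Soil_TypeB is set once.
toy_soil = pd.DataFrame({
    "Soil_TypeA": [0, 0, 0, 0, 0, 0],
    "Soil_TypeB": [0, 1, 0, 0, 0, 0],
    "Soil_TypeC": [1, 0, 1, 1, 1, 1],
})
print(rare_indicator_columns(toy_soil, min_count=2))
```

A column that is constant in the training set carries no information for the model, although the same soil type may still occur in the much larger test set.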

wilderness_areas = train.iloc[:,10:14].sum(axis=0)

plt.figure(figsize=(7,5))
sns.barplot(x=wilderness_areas.index,y=wilderness_areas.values, palette="Blues_d")
plt.xticks(rotation=90)
plt.title('Wilderness Areas',color = 'darkred',fontsize=12)
plt.ylabel('Total')
plt.show()

I wonder how many labels (y) we have in each class. I’ll take a look at the last column (Cover_Type) for this.

import plotly.express as px

cover_type = train["Cover_Type"].value_counts()
df_cover_type = pd.DataFrame({'CoverType': cover_type.index, 'Total':cover_type.values})

fig = px.bar(df_cover_type, x='CoverType', y='Total', height=400, width=650)
fig.show()

There is exactly the same amount of data for each class.
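With 15120 training rows and 7 classes, equal counts means 2160 rows per class, so always guessing one class would score 1/7, or about 14%, on the training data. Nothing here says the test set is balanced in the same way. A quick sketch of the balance check follows; the helper name and labels are invented.

```python
import pandas as pd

def class_balance(labels):
    """Return per-class counts and whether all classes are equally frequent."""
    counts = pd.Series(labels).value_counts().sort_index()
    return counts, counts.nunique() == 1

# Invented labels: three classes with two rows each.
counts, balanced = class_balance([1, 2, 3, 1, 2, 3])
print(counts.to_dict(), balanced)
```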

The charts below show how each class is distributed in terms of horizontal distance to fire points, roadways and hydrology.

f,ax=plt.subplots(1,3,figsize=(21,7))
train.plot.scatter(ax=ax[0],x='Cover_Type', y='Horizontal_Distance_To_Fire_Points', alpha=0.5, color='purple')
ax[0].set_title('Horizontal Distance To Fire Points')
x1=list(range(1,8,1))
ax[0].set_ylabel("")
ax[0].set_xlabel("Cover Type")
train.plot.scatter(ax=ax[1],x='Cover_Type', y='Horizontal_Distance_To_Roadways', alpha=0.5, color='purple')
ax[1].set_title('Horizontal Distance To Roadways')
x2=list(range(1,8,1))
ax[1].set_ylabel("")
ax[1].set_xlabel("Cover Type")
train.plot.scatter(ax=ax[2],x='Cover_Type', y='Horizontal_Distance_To_Hydrology', alpha=0.5, color='purple')
ax[2].set_title('Horizontal Distance To Hydrology')
x2=list(range(1,8,1))
ax[2].set_ylabel("")
ax[2].set_xlabel("Cover Type")
plt.show()

Pandas Profiling

Actually, there is a faster way to do exploratory data analysis. The pandas-profiling package generates powerful HTML profiling reports from a pandas DataFrame. It’s like magic! You can click the “Overview”, “Variables” and other tabs for a quick tour.

import pandas_profiling as pp

report = pp.ProfileReport(train)
report.to_file("report.html")

report

Modeling

Let’s do it step by step. First I’ll split the training data to make a validation set, then try out different classification algorithms and compare them.

# Separate the target variable
y = train.Cover_Type
train.drop(['Cover_Type'], axis = 1, inplace = True)
# Split Data
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(train, y, random_state= 0)

print(f"Train Shape: {train_X.shape}, {train_y.shape}")
print(f"Validation Shape: {val_X.shape}, {val_y.shape}")
Train Shape: (11340, 54), (11340,)
Validation Shape: (3780, 54), (3780,)
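The shapes above correspond to train_test_split's default of holding out 25% of the rows. Because the classes are perfectly balanced, one variation worth trying is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses synthetic data, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 140 rows, 7 balanced classes (labels 1 to 7), 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 3))
y = np.repeat(np.arange(1, 8), 20)

# test_size=0.25 matches the default used above; stratify=y keeps the
# class proportions equal in the training and validation parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
val_counts = np.bincount(y_val)[1:]
print(val_counts)
```

Without stratify, the validation set can end up with somewhat more rows of one cover type than another purely by chance.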
# Let's fit the Decision Tree Model first

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

# Fit the model
model.fit(train_X, train_y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
# Get predictions for the validation set

predictions = model.predict(val_X)
# Let's check our categorization accuracy

from sklearn.metrics import accuracy_score

accuracy_score(val_y, predictions)
0.7603174603174603

That is good for our first model. I know we can improve on this, but let’s see where this very basic model puts us on the leaderboard. Let’s create a submission file, submit the scores, and then move on to try different models.

# Predict the test data set
predictions_test = model.predict(test)
submission = pd.DataFrame({
    "ID": test.index,
    "Cover_Type": predictions_test
})

submission.to_csv("submission.csv", index = False)
submission.head()
ID Cover_Type
0 15121 1
1 15122 1
2 15123 1
3 15124 1
4 15125 1

Okay, this worked! My score on the leaderboard with this was 0.64322, and I was ranked 435th of 504. Now I’ll try a few other models and variations. Let’s try Random Forest, but first let’s try Decision Tree with some other parameters and see if we can improve its accuracy. With all default parameters, our validation set accuracy is about 0.76.

# Let's create a helper function

def calc_tree_accuracy(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state = 0)

    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    acc = accuracy_score(val_y, preds)

    return(acc)
# Now let's fit our model with different max leaf nodes
accuracies = {}
for max_leaf_nodes in range(5, 2000, 50) :
    my_acc = calc_tree_accuracy(max_leaf_nodes, train_X, val_X, train_y, val_y)
    accuracies[max_leaf_nodes] = my_acc
    print(f"Max leaf nodes: {max_leaf_nodes} - - Accuracy: {my_acc}")

print("Best Max Leaf Nodes Parameter:", max(accuracies.items(), key = lambda k : k[1]))
Max leaf nodes: 5 - - Accuracy: 0.557936507936508
Max leaf nodes: 55 - - Accuracy: 0.7092592592592593
Max leaf nodes: 105 - - Accuracy: 0.7394179894179894
Max leaf nodes: 155 - - Accuracy: 0.7497354497354497
Max leaf nodes: 205 - - Accuracy: 0.7579365079365079
Max leaf nodes: 255 - - Accuracy: 0.7584656084656085
Max leaf nodes: 305 - - Accuracy: 0.7619047619047619
Max leaf nodes: 355 - - Accuracy: 0.762962962962963
Max leaf nodes: 405 - - Accuracy: 0.7658730158730159
Max leaf nodes: 455 - - Accuracy: 0.7714285714285715
Max leaf nodes: 505 - - Accuracy: 0.7698412698412699
Max leaf nodes: 555 - - Accuracy: 0.7701058201058201
Max leaf nodes: 605 - - Accuracy: 0.7690476190476191
Max leaf nodes: 655 - - Accuracy: 0.7738095238095238
Max leaf nodes: 705 - - Accuracy: 0.7748677248677248
Max leaf nodes: 755 - - Accuracy: 0.7753968253968254
Max leaf nodes: 805 - - Accuracy: 0.7761904761904762
Max leaf nodes: 855 - - Accuracy: 0.7764550264550265
Max leaf nodes: 905 - - Accuracy: 0.7753968253968254
Max leaf nodes: 955 - - Accuracy: 0.7746031746031746
Max leaf nodes: 1005 - - Accuracy: 0.7738095238095238
Max leaf nodes: 1055 - - Accuracy: 0.773015873015873
Max leaf nodes: 1105 - - Accuracy: 0.7746031746031746
Max leaf nodes: 1155 - - Accuracy: 0.7746031746031746
Max leaf nodes: 1205 - - Accuracy: 0.7711640211640212
Max leaf nodes: 1255 - - Accuracy: 0.7706349206349207
Max leaf nodes: 1305 - - Accuracy: 0.7695767195767196
Max leaf nodes: 1355 - - Accuracy: 0.7674603174603175
Max leaf nodes: 1405 - - Accuracy: 0.7666666666666667
Max leaf nodes: 1455 - - Accuracy: 0.7669312169312169
Max leaf nodes: 1505 - - Accuracy: 0.7653439153439153
Max leaf nodes: 1555 - - Accuracy: 0.7637566137566137
Max leaf nodes: 1605 - - Accuracy: 0.7634920634920634
Max leaf nodes: 1655 - - Accuracy: 0.7603174603174603
Max leaf nodes: 1705 - - Accuracy: 0.7600529100529101
Max leaf nodes: 1755 - - Accuracy: 0.7600529100529101
Max leaf nodes: 1805 - - Accuracy: 0.7600529100529101
Max leaf nodes: 1855 - - Accuracy: 0.7600529100529101
Max leaf nodes: 1905 - - Accuracy: 0.7600529100529101
Max leaf nodes: 1955 - - Accuracy: 0.7600529100529101
Best Max Leaf Nodes Parameter: (855, 0.7764550264550265)

That’s amazing! My accuracy with 5 max leaf nodes was about 56%. It rose to a peak of about 0.776 at 855 max leaf nodes, then declined slowly and levelled off at about 0.760 from 1705 max leaf nodes onward.
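To make the shape of that curve concrete, here is a small sketch using a handful of the accuracies printed above, rounded to four decimals. It picks the best setting and measures how much accuracy is lost by letting the tree grow to the largest size tried.

```python
# A few (max_leaf_nodes, validation accuracy) pairs copied from the output
# above, rounded to four decimals.
accuracies = {5: 0.5579, 455: 0.7714, 855: 0.7765, 1205: 0.7712, 1955: 0.7601}

# Key with the highest accuracy.
best_nodes = max(accuracies, key=accuracies.get)
best_acc = accuracies[best_nodes]

# Accuracy lost by growing the tree to the largest size tried.
drop_at_largest = best_acc - accuracies[max(accuracies)]
print(best_nodes, round(drop_at_largest, 4))
```

The gap is small, and it comes from a single validation split, so the exact best value of max_leaf_nodes should not be read too precisely.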

# Let's create Random Forest model now
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state = 0)
model.fit(train_X, train_y)

predictions = model.predict(val_X)

print("Random Forest Accuracy: ", accuracy_score(val_y, predictions))
Random Forest Accuracy:  0.814021164021164


/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

That is a huge improvement! Let’s see where this leads us on the Leaderboard

predictions_test = model.predict(test)
submission = pd.DataFrame({
    "ID": test.index,
    "Cover_Type": predictions_test
})

submission.to_csv("submission.csv", index = False)

Now my score on the leaderboard is 0.70138, ranked 398th out of 505.

Let’s try different parameters using grid search with cross-validation.

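%time
# Note: a bare %time line times an empty statement, so the timings printed below
# do not measure the grid search; %%time as the first line of the cell would.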

from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier(random_state = 0)

param_grid = {
    'n_estimators': [50, 100, 500, 1000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [8, 9, 10, 11, 12],
    'criterion' :['gini', 'entropy']
}

model_CV = GridSearchCV(estimator = model, param_grid = param_grid, cv = 5)

# GridSearchCV cross-validates internally, so we do not need a separate validation set and can fit on the full training set
model_CV.fit(train, y)

# Let's check the best parameters
model_CV.best_params_
CPU times: user 0 ns, sys: 2 µs, total: 2 µs
Wall time: 5.48 µs





{'criterion': 'entropy',
 'max_depth': 12,
 'max_features': 'auto',
 'n_estimators': 500}
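The grid above has 4 x 3 x 5 x 2 = 120 parameter combinations, and with cv = 5 that means 600 model fits plus one final refit. The fitted search object also keeps the winning model, which avoids retyping the parameters by hand. The sketch below shows this on synthetic data with a deliberately tiny grid of my own choosing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in for the training data (not the competition data).
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Deliberately tiny, hypothetical grid so the sketch runs in seconds.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
n_fits = len(ParameterGrid(param_grid)) * 3  # combinations x folds

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

# With the default refit=True, best_estimator_ has already been retrained on
# all of X, y using best_params_, so nothing needs to be copied by hand.
best_model = search.best_estimator_
print(n_fits, search.best_params_)
```

In the notebook this would correspond to using model_CV.best_estimator_ directly for predictions.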

Okay so let’s train with the best parameters now

model = RandomForestClassifier(random_state=0, max_features='auto', criterion='entropy', max_depth=10, n_estimators=500)

model.fit(train_X, train_y)

predictions = model.predict(val_X)

print("Best Parameters Random Forest Accuracy Is: ", accuracy_score(val_y, predictions))
Best Parameters Random Forest Accuracy Is:  0.7962962962962963

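This is slightly lower than the 0.814 of the default random forest. Note that the model above uses max_depth=10, not the max_depth=12 that the grid search selected. Let’s still check on the test set.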

predictions_test = model.predict(test)
submission = pd.DataFrame({
    "ID": test.index,
    "Cover_Type": predictions_test
})

submission.to_csv("submission.csv", index = False)
