Before we proceed to training, we need to perform the feature engineering step, in which we transform all categorical variables into numeric ones.
The models cannot be trained with categorical variables, so we need to convert our categorical data into a numeric matrix. The most popular technique for doing this is one-hot encoding.
For example, suppose we pick one client from the group, Jimmy, who is a blue-collar and married client:
Jimmy’s job record has an active, or “hot”, value for the blue-collar job, so that column gets a 1, while the remaining job values are not active and get a 0. The same happens with the marital variable.
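To make the idea concrete, here is a minimal sketch of one-hot encoding done by hand for Jimmy’s two features (the category lists below are only an assumed subset of the dataset’s values):

```python
# Hand-rolled one-hot encoding for a single (hypothetical) client.
jimmy = {'job': 'blue-collar', 'marital': 'married'}

job_values = ['admin.', 'blue-collar', 'entrepreneur']   # assumed subset of job categories
marital_values = ['divorced', 'married', 'single']

encoded = {f'job={v}': int(jimmy['job'] == v) for v in job_values}
encoded.update({f'marital={v}': int(jimmy['marital'] == v) for v in marital_values})
print(encoded)
# {'job=admin.': 0, 'job=blue-collar': 1, 'job=entrepreneur': 0,
#  'marital=divorced': 0, 'marital=married': 1, 'marital=single': 0}
```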
Luckily, we don’t need to implement this by hand. We can use Scikit-Learn’s `DictVectorizer` class. It takes a list of dictionaries (one per record) and vectorizes them:
train_dicts = X_train.to_dict(orient='records')
val_dicts = X_val.to_dict(orient='records')
test_dicts = X_test.to_dict(orient='records')
train_dicts[0]
#{'age': 45,
# 'job': 'entrepreneur',
# 'marital': 'married',
# 'education': 'primary',
# 'default': 'no',
# 'balance': -100,
# 'housing': 'yes',
# 'contact': 'unknown',
# 'day': 27,
# 'month': 'may',
# 'campaign': 6,
# 'pdays': -1,
# 'previous': 0,
# 'poutcome': 'unknown'}
Now we can use `DictVectorizer`:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False).set_output(transform='pandas').fit(train_dicts)
X_train = dv.transform(train_dicts)
X_val = dv.transform(val_dicts)
X_test = dv.transform(test_dicts)
X_train.head(5)
 | age | balance | campaign | contact=cellular | contact=telephone | contact=unknown | day | default=no | default=yes | education=primary | ... | month=may | month=nov | month=oct | month=sep | pdays | poutcome=failure | poutcome=other | poutcome=success | poutcome=unknown | previous
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 45.0 | -100.0 | 6.0 | 0.0 | 0.0 | 1.0 | 27.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 29.0 | 166.0 | 8.0 | 1.0 | 0.0 | 0.0 | 28.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 31.0 | 121.0 | 1.0 | 0.0 | 0.0 | 1.0 | 20.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 40.0 | 1693.0 | 1.0 | 1.0 | 0.0 | 0.0 | 17.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 28.0 | 317.0 | 3.0 | 0.0 | 0.0 | 1.0 | 16.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
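To double-check which columns the vectorizer produced, we can inspect its learned feature names (a quick sanity check on the `dv` object fitted above): numeric features keep their original name, while categorical features become `feature=value` columns.

```python
# Columns produced by the fitted DictVectorizer, e.g. 'age', 'contact=cellular', ...
print(dv.get_feature_names_out()[:10])
```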
Once we have transformed all categorical variables into numerical ones, we are ready to train the first classification model.
Logistic regression is a linear model, but unlike linear regression, it is used for classification rather than regression. Recall the linear regression formula in sum notation:
$$ y_i = g(x_i) = w_0 + x_i^Tw = w_0 + \sum_{j=1}^{n}x_{ij}w_j $$
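As a small illustration of this formula, the sketch below computes $g(x_i)$ for a single feature vector with NumPy; the weights `w0` and `w` are arbitrary placeholders, not trained values:

```python
import numpy as np

# Hypothetical bias term and per-feature weights (n = 3 features).
w0 = 0.1
w = np.array([0.02, -0.005, 0.3])

# One feature vector x_i.
x_i = np.array([45.0, -100.0, 1.0])

# g(x_i) = w0 + sum_j x_ij * w_j, i.e. the dot product plus the bias.
score = w0 + x_i.dot(w)
print(score)  # ≈ 1.8
```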