Feature engineering

Before we proceed to training, we need to perform the feature engineering step: transforming all categorical variables into numeric ones.

One-hot encoding for categorical variables

The models cannot be trained with categorical variables, so we need to convert our categorical data into a numeric matrix form. The most popular technique for doing this is one-hot encoding.

For example, suppose we pick one client from the group, Jimmy, who has a blue-collar job and is married:

[Image: one-hot encoding of Jimmy's job and marital variables]

Jimmy’s job record has an active, or "hot", value for blue-collar, so that column gets a 1, whereas the remaining job columns are not active, so they get a 0. The same happens with the marital variable.
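
To make the idea concrete, here is a minimal hand-rolled sketch of one-hot encoding for a record like Jimmy's (the category lists are shortened for illustration; the real dataset has more values):

# Illustrative category lists (not the full set from the dataset)
jobs = ['admin.', 'blue-collar', 'entrepreneur']
marital_statuses = ['divorced', 'married', 'single']

record = {'job': 'blue-collar', 'marital': 'married'}

# Each category becomes its own column: 1.0 if active ("hot"), else 0.0
encoded = {}
for value in jobs:
    encoded[f'job={value}'] = 1.0 if record['job'] == value else 0.0
for value in marital_statuses:
    encoded[f'marital={value}'] = 1.0 if record['marital'] == value else 0.0

encoded
#{'job=admin.': 0.0, 'job=blue-collar': 1.0, 'job=entrepreneur': 0.0,
# 'marital=divorced': 0.0, 'marital=married': 1.0, 'marital=single': 0.0}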

Luckily, we don’t need to implement this by hand. We can use a Scikit-Learn class called DictVectorizer.

This class takes in a list of dictionaries and vectorizes them. First we convert each split into a list of per-row dictionaries:

# Convert each DataFrame into a list of dictionaries, one per row
train_dicts = X_train.to_dict(orient='records')
val_dicts = X_val.to_dict(orient='records')
test_dicts = X_test.to_dict(orient='records')

# Inspect the first training record
train_dicts[0]

#{'age': 45,
# 'job': 'entrepreneur',
# 'marital': 'married',
# 'education': 'primary',
# 'default': 'no',
# 'balance': -100,
# 'housing': 'yes',
# 'contact': 'unknown',
# 'day': 27,
# 'month': 'may',
# 'campaign': 6,
# 'pdays': -1,
# 'previous': 0,
# 'poutcome': 'unknown'}

Now we can use DictVectorizer:

from sklearn.feature_extraction import DictVectorizer

# Fit on the training records so the vectorizer learns all feature names
dv = DictVectorizer(sparse=False).set_output(transform='pandas').fit(train_dicts)

# Apply the same learned mapping to all three splits
X_train = dv.transform(train_dicts)
X_val = dv.transform(val_dicts)
X_test = dv.transform(test_dicts)

X_train.head(5)
age balance campaign contact=cellular contact=telephone contact=unknown day default=no default=yes education=primary ... month=may month=nov month=oct month=sep pdays poutcome=failure poutcome=other poutcome=success poutcome=unknown previous
0 45.0 -100.0 6.0 0.0 0.0 1.0 27.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 1.0 0.0
1 29.0 166.0 8.0 1.0 0.0 0.0 28.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 1.0 0.0
2 31.0 121.0 1.0 0.0 0.0 1.0 20.0 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 1.0 0.0
3 40.0 1693.0 1.0 1.0 0.0 0.0 17.0 1.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 -1.0 0.0 0.0 0.0 1.0 0.0
4 28.0 317.0 3.0 0.0 0.0 1.0 16.0 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 1.0 0.0
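
To double-check how the columns were generated, we can ask the vectorizer for its feature names: numeric columns keep their original name, while categorical columns are expanded to column=value:

list(dv.get_feature_names_out())[:5]

#['age', 'balance', 'campaign', 'contact=cellular', 'contact=telephone']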

Logistic Regression

From scratch

Now that we have transformed all categorical variables into numerical ones, we are ready to train our first classification model.

Logistic regression is a linear model, but unlike linear regression, it is used for classification, not regression. Recall the linear regression form written in sum notation:

$$ y_i = g(x_i) = w_0 + x_i^Tw = w_0 + \sum_{j=1}^{n}x_{ij}w_j $$
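
As a quick numeric sketch of this formula (the weights and features below are made up purely to illustrate the computation):

import numpy as np

w0 = 0.5                         # bias term
w = np.array([0.1, -0.2, 0.3])   # weights w_j, one per feature
x_i = np.array([1.0, 2.0, 3.0])  # features of one observation

score = w0 + x_i.dot(w)          # w0 + sum_j x_ij * w_j
score
#1.1  (= 0.5 + 0.1 - 0.4 + 0.9)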