Before we proceed to training, we need to perform the feature engineering step, in which we transform all categorical variables into numeric ones.
The models cannot be trained with categorical variables, so we need to convert our categorical data into a numeric matrix. The most popular technique for doing this is one-hot encoding.
For example, suppose we pick one client from the group, Jimmy, who is a blue-collar and married client:
Jimmy’s job record has an active, or “hot”, value for the blue-collar job, so that column gets a 1, while the remaining job values are not active and get a 0. The same happens with the marital variable.
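To make the idea concrete, here is a minimal sketch of one-hot encoding done by hand for Jimmy’s two features (the category lists below are only an assumed subset of the dataset’s values):

```python
# Hand-rolled one-hot encoding for a single (hypothetical) client.
jimmy = {'job': 'blue-collar', 'marital': 'married'}

job_values = ['admin.', 'blue-collar', 'entrepreneur']   # assumed subset of job categories
marital_values = ['divorced', 'married', 'single']

encoded = {f'job={v}': int(jimmy['job'] == v) for v in job_values}
encoded.update({f'marital={v}': int(jimmy['marital'] == v) for v in marital_values})
print(encoded)
# {'job=admin.': 0, 'job=blue-collar': 1, 'job=entrepreneur': 0,
#  'marital=divorced': 0, 'marital=married': 1, 'marital=single': 0}
```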
Luckily, we don’t need to implement this by hand. We can use Scikit-Learn’s `DictVectorizer` class. It takes a list of dictionaries (one per record) and vectorizes them:
train_dicts = X_train.to_dict(orient='records')
val_dicts = X_val.to_dict(orient='records')
test_dicts = X_test.to_dict(orient='records')
train_dicts[0]
#{'age': 45,
# 'job': 'entrepreneur',
# 'marital': 'married',
# 'education': 'primary',
# 'default': 'no',
# 'balance': -100,
# 'housing': 'yes',
# 'contact': 'unknown',
# 'day': 27,
# 'month': 'may',
# 'campaign': 6,
# 'pdays': -1,
# 'previous': 0,
# 'poutcome': 'unknown'}
Now we can use `DictVectorizer`:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False).set_output(transform='pandas').fit(train_dicts)
X_train = dv.transform(train_dicts)
X_val = dv.transform(val_dicts)
X_test = dv.transform(test_dicts)
X_train.head(5)
 | age | balance | campaign | contact=cellular | contact=telephone | contact=unknown | day | default=no | default=yes | education=primary | ... | month=may | month=nov | month=oct | month=sep | pdays | poutcome=failure | poutcome=other | poutcome=success | poutcome=unknown | previous
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 45.0 | -100.0 | 6.0 | 0.0 | 0.0 | 1.0 | 27.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 29.0 | 166.0 | 8.0 | 1.0 | 0.0 | 0.0 | 28.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 31.0 | 121.0 | 1.0 | 0.0 | 0.0 | 1.0 | 20.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 40.0 | 1693.0 | 1.0 | 1.0 | 0.0 | 0.0 | 17.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 28.0 | 317.0 | 3.0 | 0.0 | 0.0 | 1.0 | 16.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
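To double-check which columns the vectorizer produced, we can inspect its learned feature names (a quick sanity check on the `dv` object fitted above): numeric features keep their original name, while categorical features become `feature=value` columns.

```python
# Columns produced by the fitted DictVectorizer, e.g. 'age', 'contact=cellular', ...
print(dv.get_feature_names_out()[:10])
```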
Once we have transformed all categorical variables into numerical ones, we are ready to train the first classification model.
Logistic regression is a linear model, but unlike linear regression, it is used for classification rather than regression. Recall the linear regression formula in sum notation:
$$ y_i = g(x_i) = w_0 + x_i^Tw = w_0 + \sum_{j=1}^{n}x_{ij}w_j $$
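As a small illustration of this formula, the sketch below computes $g(x_i)$ for a single feature vector with NumPy; the weights `w0` and `w` are arbitrary placeholders, not trained values:

```python
import numpy as np

# Hypothetical bias term and per-feature weights (n = 3 features).
w0 = 0.1
w = np.array([0.02, -0.005, 0.3])

# One feature vector x_i.
x_i = np.array([45.0, -100.0, 1.0])

# g(x_i) = w0 + sum_j x_ij * w_j, i.e. the dot product plus the bias.
score = w0 + x_i.dot(w)
print(score)  # ≈ 1.8
```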