This week, we’ll use the following dataset:
!wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv
Let’s load the dataset with pandas:
import pandas as pd
# wget saved the file to the working directory (/content on Google Colab)
df_raw = pd.read_csv("/content/jamb_exam_results.csv")
df_raw.head(5)
| | JAMB_Score | Study_Hours_Per_Week | Attendance_Rate | Teacher_Quality | Distance_To_School | School_Type | School_Location | Extra_Tutorials | Access_To_Learning_Materials | Parent_Involvement | IT_Knowledge | Student_ID | Age | Gender | Socioeconomic_Status | Parent_Education_Level | Assignments_Completed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 192 | 22 | 78 | 4 | 12.4 | Public | Urban | Yes | Yes | High | Medium | 1 | 17 | Male | Low | Tertiary | 2 |
1 | 207 | 14 | 88 | 4 | 2.7 | Public | Rural | No | Yes | High | High | 2 | 15 | Male | High | NaN | 1 |
2 | 182 | 29 | 87 | 2 | 9.6 | Public | Rural | Yes | Yes | High | Medium | 3 | 20 | Female | High | Tertiary | 2 |
3 | 210 | 29 | 99 | 2 | 2.6 | Public | Urban | No | Yes | Medium | High | 4 | 22 | Female | Medium | Tertiary | 1 |
4 | 199 | 12 | 98 | 3 | 8.8 | Public | Urban | No | Yes | Medium | Medium | 5 | 22 | Female | Medium | Tertiary | 1 |
In previous chapters, we learned several data preparation techniques that we’ll also use in this project.
Lowercasing the column names makes the features easier to handle:
df_raw = df_raw.rename(str.lower, axis='columns')
df_raw.columns
#Index(['jamb_score', 'study_hours_per_week', 'attendance_rate',
# 'teacher_quality', 'distance_to_school', 'school_type',
# 'school_location', 'extra_tutorials', 'access_to_learning_materials',
# 'parent_involvement', 'it_knowledge', 'student_id', 'age', 'gender',
# 'socioeconomic_status', 'parent_education_level',
# 'assignments_completed'],
# dtype='object')
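To see what the rename step does in isolation, here is a minimal sketch on a toy frame. The two-column DataFrame and the names `toy`, `renamed`, and `alt` are made up for illustration and are not part of the original notebook. It also shows an alternative spelling through the column index's string accessor, which should give the same labels.

```python
import pandas as pd

# Toy frame with mixed-case column names (illustrative, not the real dataset).
toy = pd.DataFrame({"JAMB_Score": [192, 207], "School_Type": ["Public", "Public"]})

# With axis='columns', rename() applies the callable to every column label
# and returns a new DataFrame; the original is left untouched.
renamed = toy.rename(str.lower, axis="columns")

# Alternative: lowercase the labels through the Index string accessor.
alt = toy.copy()
alt.columns = alt.columns.str.lower()

print(list(renamed.columns))
print(list(alt.columns))
```

Either form works here; `rename` is convenient when chaining, because it returns a new frame instead of modifying the existing one.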
The student_id column is only an identifier and is irrelevant to our objective, so we remove it:
df = df_raw.drop(["student_id"], axis=1)
df.head(5)
| | jamb_score | study_hours_per_week | attendance_rate | teacher_quality | distance_to_school | school_type | school_location | extra_tutorials | access_to_learning_materials | parent_involvement | it_knowledge | age | gender | socioeconomic_status | parent_education_level | assignments_completed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 192 | 22 | 78 | 4 | 12.4 | Public | Urban | Yes | Yes | High | Medium | 17 | Male | Low | Tertiary | 2 |
1 | 207 | 14 | 88 | 4 | 2.7 | Public | Rural | No | Yes | High | High | 15 | Male | High | NaN | 1 |
2 | 182 | 29 | 87 | 2 | 9.6 | Public | Rural | Yes | Yes | High | Medium | 20 | Female | High | Tertiary | 2 |
3 | 210 | 29 | 99 | 2 | 2.6 | Public | Urban | No | Yes | Medium | High | 22 | Female | Medium | Tertiary | 1 |
4 | 199 | 12 | 98 | 3 | 8.8 | Public | Urban | No | Yes | Medium | Medium | 22 | Female | Medium | Tertiary | 1 |
Our models can’t handle missing values, so let’s count how many each column contains:
df.isna().sum()
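As a rough illustration of how the per-column count reads, and of one possible way to deal with what it finds, here is a small sketch. The toy frame copies three columns from the first three rows shown above. The placeholder value "unknown" and the names `toy`, `missing_counts`, and `filled` are assumptions for the example, not necessarily the strategy used in this project.

```python
import numpy as np
import pandas as pd

# Three columns from the first three rows of the preview above.
toy = pd.DataFrame({
    "jamb_score": [192, 207, 182],
    "parent_education_level": ["Tertiary", np.nan, "Tertiary"],
    "age": [17, 15, 20],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Keep only the columns that actually contain missing values.
print(missing_counts[missing_counts > 0])

# Assumption: replace missing categories with a placeholder label.
# A dict passed to fillna() maps column names to fill values.
filled = toy.fillna({"parent_education_level": "unknown"})
print(filled)
```

Filling with a constant keeps every row, at the cost of treating "missing" as its own category; whether that suits the model is something to check against validation results.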