Classification
In this tutorial, we use Safe-DS on the Titanic passenger data to predict who survived the disaster and who did not.
Loading Data¶
The data is available from the Kaggle competition Titanic - Machine Learning from Disaster:
from safeds.data.tabular.containers import Table
raw_data = Table.from_csv_file("data/titanic.csv")
# For visualisation purposes we only print out the first 15 rows.
raw_data.slice_rows(length=15)
id | name | sex | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | f64 | i64 | i64 | str | i64 | f64 | str | str | i64 |
0 | "Abbing, Mr. Anthony" | "male" | 42.0 | 0 | 0 | "C.A. 5547" | 3 | 7.55 | null | "Southampton" | 0 |
1 | "Abbott, Master. Eugene Joseph" | "male" | 13.0 | 0 | 2 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 0 |
2 | "Abbott, Mr. Rossmore Edward" | "male" | 16.0 | 1 | 1 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 0 |
3 | "Abbott, Mrs. Stanton (Rosa Hun… | "female" | 35.0 | 1 | 1 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 1 |
4 | "Abelseth, Miss. Karen Marie" | "female" | 16.0 | 0 | 0 | "348125" | 3 | 7.65 | null | "Southampton" | 1 |
… | … | … | … | … | … | … | … | … | … | … | … |
10 | "Adahl, Mr. Mauritz Nils Martin" | "male" | 30.0 | 0 | 0 | "C 7076" | 3 | 7.25 | null | "Southampton" | 0 |
11 | "Adams, Mr. John" | "male" | 26.0 | 0 | 0 | "341826" | 3 | 8.05 | null | "Southampton" | 0 |
12 | "Ahlin, Mrs. Johan (Johanna Per… | "female" | 40.0 | 1 | 0 | "7546" | 3 | 9.475 | null | "Southampton" | 0 |
13 | "Aks, Master. Philip Frank" | "male" | 0.8333 | 0 | 1 | "392091" | 3 | 9.35 | null | "Southampton" | 1 |
14 | "Aks, Mrs. Sam (Leah Rosen)" | "female" | 18.0 | 0 | 1 | "392091" | 3 | 9.35 | null | "Southampton" | 1 |
Splitting Data into Train and Test Sets¶
We split the data into two parts:

- Training set: contains 60% of the data and is used to train the model.
- Testing set: contains the remaining 40% and is used to evaluate the model's accuracy.
train_table, test_table = raw_data.shuffle_rows().split_rows(0.6)
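The shuffle-then-split step can be sketched in plain Python. This is a hypothetical illustration of the idea, not the Safe-DS implementation:

```python
import random

def shuffle_and_split(rows, ratio, seed=42):
    """Shuffle rows, then split them at the given ratio."""
    shuffled = rows[:]  # copy so the original list stays untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train, test = shuffle_and_split(rows, 0.6)
print(len(train), len(test))  # 6 4
```

Shuffling before splitting matters because the raw data is sorted by name; without it, the train and test sets would not be representative samples.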
Removing Low-Quality Columns¶
train_table.summarize_statistics()
statistic | id | name | sex | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived |
---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | str | str | f64 | f64 | f64 | str | f64 | f64 | str | str | f64 |
"min" | 0.0 | "Abbing, Mr. Anthony" | "female" | 0.1667 | 0.0 | 0.0 | "110152" | 1.0 | 0.0 | "A10" | "Cherbourg" | 0.0 |
"max" | 1307.0 | "van Billiard, Master. James Wi… | "male" | 80.0 | 8.0 | 9.0 | "WE/P 5735" | 3.0 | 512.3292 | "T" | "Southampton" | 1.0 |
"mean" | 647.119745 | null | null | 29.851124 | 0.512102 | 0.374522 | null | 2.278981 | 34.78715 | null | null | 0.370701 |
"median" | 660.0 | null | null | 28.0 | 0.0 | 0.0 | null | 3.0 | 14.4542 | null | null | 0.0 |
"standard deviation" | 377.03706 | null | null | 14.601355 | 1.060741 | 0.860033 | null | 0.843571 | 55.244376 | null | null | 0.4833 |
"missing value ratio" | 0.0 | "0.0" | "0.0" | 0.206369 | 0.0 | 0.0 | "0.0" | 0.0 | 0.0 | "0.7668789808917198" | "0.0012738853503184713" | 0.0 |
"stability" | 0.001274 | "0.0025477707006369425" | "0.6522292993630573" | 0.046549 | 0.677707 | 0.770701 | "0.008917197452229299" | 0.533758 | 0.054777 | "0.0273224043715847" | "0.701530612244898" | 0.629299 |
"idness" | 1.0 | "0.9987261146496815" | "0.0025477707006369425" | 0.118471 | 0.008917 | 0.010191 | "0.7834394904458599" | 0.003822 | 0.28535 | "0.17070063694267515" | "0.005095541401273885" | 0.002548 |
We remove certain columns for the following reasons:

- high idness: id, ticket
- high stability: parents_children
- high missing value ratio: cabin
train_table = train_table.remove_columns(["id", "ticket", "parents_children", "cabin"])
test_table = test_table.remove_columns(["id", "ticket", "parents_children", "cabin"])
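The column-quality metrics above can be approximated in plain Python. As a rough sketch, assuming idness is the share of distinct values, stability the share of the most common non-missing value, and the missing value ratio the share of missing entries:

```python
from collections import Counter

def column_quality(values):
    """Compute rough quality metrics for a single column."""
    non_missing = [v for v in values if v is not None]
    idness = len(set(non_missing)) / len(values)
    stability = Counter(non_missing).most_common(1)[0][1] / len(non_missing)
    missing_ratio = values.count(None) / len(values)
    return idness, stability, missing_ratio

# A column where most values are missing, like "cabin":
cabin = [None, None, None, "A10", None, None, "B5", None]
print(column_quality(cabin))  # (0.25, 0.5, 0.75)
```

Intuitively, a column with idness near 1 (such as id) identifies rows rather than describing them, and a column with very high stability carries almost no information, so neither helps the model.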
Handling Missing Values¶
We fill in missing values in the age and fare columns with the mean of each respective column.
from safeds.data.tabular.transformation import SimpleImputer
simple_imputer = SimpleImputer(selector=["age", "fare"], strategy=SimpleImputer.Strategy.mean())
fitted_simple_imputer_train, transformed_train_data = simple_imputer.fit_and_transform(train_table)
transformed_test_data = fitted_simple_imputer_train.transform(test_table)
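Conceptually, mean imputation replaces each missing entry with the mean of the non-missing values in that column. A minimal plain-Python sketch of the idea, not the SimpleImputer implementation:

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

ages = [42.0, None, 16.0, None, 26.0]
print(impute_mean(ages))  # [42.0, 28.0, 16.0, 28.0, 26.0]
```

Note that the imputer is fitted on the training set only and then reused on the test set, so no information from the test data leaks into training.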
Handling Nominal Categorical Data¶
We use OneHotEncoder to transform categorical, non-numerical values into numerical representations of zero and one. In this example, we transform the values of the sex and port_embarked columns so they can be used by the model to predict passenger survival.

- Use the fit_and_transform function of the OneHotEncoder, passing the table and the names of the columns to be encoded.
from safeds.data.tabular.transformation import OneHotEncoder
fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(
selector=["sex", "port_embarked"],
).fit_and_transform(transformed_train_data)
transformed_test_data = fitted_one_hot_encoder_train.transform(transformed_test_data)
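One-hot encoding gives each category its own zero/one indicator column, which is why the table below has sex__female and sex__male instead of a single sex column. A plain-Python sketch of the transformation:

```python
def one_hot_encode(values):
    """Map each category to its own zero/one indicator column."""
    categories = sorted(set(values))
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
    }

sex = ["male", "female", "female", "male"]
print(one_hot_encode(sex))
# {'female': [0, 1, 1, 0], 'male': [1, 0, 0, 1]}
```

As with the imputer, the encoder is fitted on the training data and the same fitted encoder is applied to the test data, so both tables end up with identical columns.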
Statistics after Data Processing¶
Check the data after cleaning and transformation to ensure the changes were made correctly.
transformed_train_data.summarize_statistics()
statistic | name | age | siblings_spouses | travel_class | fare | survived | sex__female | sex__male | port_embarked__Southampton | port_embarked__Queenstown | port_embarked__Cherbourg |
---|---|---|---|---|---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"min" | "Abbing, Mr. Anthony" | 0.1667 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"max" | "van Billiard, Master. James Wi… | 80.0 | 8.0 | 3.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
"mean" | null | 29.851124 | 0.512102 | 2.278981 | 34.78715 | 0.370701 | 0.347771 | 0.652229 | 0.700637 | 0.092994 | 0.205096 |
"median" | null | 29.851124 | 0.0 | 3.0 | 14.4542 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
"standard deviation" | null | 13.005598 | 1.060741 | 0.843571 | 55.244376 | 0.4833 | 0.476566 | 0.476566 | 0.458271 | 0.290609 | 0.404029 |
"missing value ratio" | "0.0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"stability" | "0.0025477707006369425" | 0.206369 | 0.677707 | 0.533758 | 0.054777 | 0.629299 | 0.652229 | 0.652229 | 0.700637 | 0.907006 | 0.794904 |
"idness" | "0.9987261146496815" | 0.118471 | 0.008917 | 0.003822 | 0.28535 | 0.002548 | 0.002548 | 0.002548 | 0.002548 | 0.002548 | 0.002548 |
Marking the Target Column¶
Here, we set the target, extra, and feature columns using to_tabular_dataset. This ensures the model knows which column to predict and which columns to use as features during training.

- target: survived
- extra: name
- features: all columns except target and extra
tagged_train_table = transformed_train_data.to_tabular_dataset("survived", extra_names=["name"])
Fitting a Classifier¶
We use the RandomForestClassifier as our model and pass the training dataset to its fit function to train it.
from safeds.ml.classical.classification import RandomForestClassifier
classifier = RandomForestClassifier()
fitted_classifier = classifier.fit(tagged_train_table)
Predicting with the Classifier¶
We use the trained RandomForestClassifier model to predict the survival of the passengers in the test dataset. We pass the transformed_test_data into the predict function, which uses our trained model for the prediction.
prediction = fitted_classifier.predict(transformed_test_data)
Reverse-Transforming the Prediction¶
After making a prediction, the values are still in the transformed format. To interpret the results in terms of the original values, we reverse this transformation using inverse_transform_table with those fitted transformers that support inverse transformation.
reverse_transformed_prediction = prediction.to_table().inverse_transform_table(fitted_one_hot_encoder_train)
# For visualisation purposes we only print out the first 15 rows.
reverse_transformed_prediction.slice_rows(length=15)
name | age | siblings_spouses | travel_class | fare | survived | sex | port_embarked |
---|---|---|---|---|---|---|---|
str | f64 | i64 | i64 | f64 | i64 | str | str |
"Dantcheff, Mr. Ristiu" | 25.0 | 0 | 3 | 7.8958 | 0 | "male" | "Southampton" |
"Chip, Mr. Chang" | 32.0 | 0 | 3 | 56.4958 | 1 | "male" | "Southampton" |
"McEvoy, Mr. Michael" | 29.851124 | 0 | 3 | 15.5 | 0 | "male" | "Queenstown" |
"Harrison, Mr. William" | 40.0 | 0 | 1 | 0.0 | 0 | "male" | "Southampton" |
"West, Mrs. Edwy Arthur (Ada Ma… | 33.0 | 1 | 2 | 27.75 | 1 | "female" | "Southampton" |
… | … | … | … | … | … | … | … |
"Andersson, Miss. Ida Augusta M… | 38.0 | 4 | 3 | 7.775 | 0 | "female" | "Southampton" |
"Sheerlinck, Mr. Jan Baptist" | 29.0 | 0 | 3 | 9.5 | 0 | "male" | "Southampton" |
"Drew, Master. Marshall Brines" | 8.0 | 0 | 2 | 32.5 | 1 | "male" | "Southampton" |
"O'Connell, Mr. Patrick D" | 29.851124 | 0 | 3 | 7.7333 | 0 | "male" | "Queenstown" |
"Foley, Mr. William" | 29.851124 | 0 | 3 | 7.75 | 0 | "male" | "Queenstown" |
Testing the Accuracy of the Model¶
We evaluate the performance of the trained model by computing its accuracy on the transformed test data using the accuracy method.
accuracy = fitted_classifier.accuracy(transformed_test_data) * 100
f"Accuracy on test data: {accuracy:.4f}%"
'Accuracy on test data: 78.2443%'
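Accuracy itself is simply the fraction of predictions that match the true labels. A plain-Python sketch with hypothetical labels:

```python
def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

predicted = [0, 1, 0, 0, 1]
actual    = [0, 1, 1, 0, 1]
print(f"{accuracy(predicted, actual) * 100:.4f}%")  # 80.0000%
```

Because the test set was held out from training, this number estimates how well the model generalizes to passengers it has never seen.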