Classification
In this tutorial, we use Safe-DS on the Titanic passenger data to predict who survived the disaster and who did not.
Loading Data¶
The data is available from the Kaggle competition Titanic - Machine Learning from Disaster:
from safeds.data.tabular.containers import Table
raw_data = Table.from_csv_file("data/titanic.csv")
# For visualisation purposes we only print out the first 15 rows.
raw_data.slice_rows(length=15)
id | name | sex | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | f64 | i64 | i64 | str | i64 | f64 | str | str | i64 |
0 | "Abbing, Mr. Anthony" | "male" | 42.0 | 0 | 0 | "C.A. 5547" | 3 | 7.55 | null | "Southampton" | 0 |
1 | "Abbott, Master. Eugene Joseph" | "male" | 13.0 | 0 | 2 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 0 |
2 | "Abbott, Mr. Rossmore Edward" | "male" | 16.0 | 1 | 1 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 0 |
3 | "Abbott, Mrs. Stanton (Rosa Hun… | "female" | 35.0 | 1 | 1 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 1 |
4 | "Abelseth, Miss. Karen Marie" | "female" | 16.0 | 0 | 0 | "348125" | 3 | 7.65 | null | "Southampton" | 1 |
… | … | … | … | … | … | … | … | … | … | … | … |
10 | "Adahl, Mr. Mauritz Nils Martin" | "male" | 30.0 | 0 | 0 | "C 7076" | 3 | 7.25 | null | "Southampton" | 0 |
11 | "Adams, Mr. John" | "male" | 26.0 | 0 | 0 | "341826" | 3 | 8.05 | null | "Southampton" | 0 |
12 | "Ahlin, Mrs. Johan (Johanna Per… | "female" | 40.0 | 1 | 0 | "7546" | 3 | 9.475 | null | "Southampton" | 0 |
13 | "Aks, Master. Philip Frank" | "male" | 0.8333 | 0 | 1 | "392091" | 3 | 9.35 | null | "Southampton" | 1 |
14 | "Aks, Mrs. Sam (Leah Rosen)" | "female" | 18.0 | 0 | 1 | "392091" | 3 | 9.35 | null | "Southampton" | 1 |
Splitting Data into Train and Test Sets¶
We split the data into two parts:

- Training set: contains 60% of the data and is used to train the model.
- Testing set: contains the remaining 40% and is used to evaluate the model's accuracy.
train_table, test_table = raw_data.shuffle_rows().split_rows(0.6)
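The shuffle-then-split step can be sketched in plain Python. This is a hypothetical illustration of the idea, not the Safe-DS implementation:

```python
import random

def shuffle_and_split(rows, ratio, seed=42):
    """Shuffle rows, then split them at the given ratio."""
    shuffled = rows[:]  # copy so the original list stays untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train, test = shuffle_and_split(rows, 0.6)
print(len(train), len(test))  # 6 4
```

Shuffling before splitting matters because the raw data is sorted by name; without it, the train and test sets would not be representative samples.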
Removing Low-Quality Columns¶
train_table.summarize_statistics()
statistic | id | name | sex | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived |
---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | str | str | f64 | f64 | f64 | str | f64 | f64 | str | str | f64 |
"min" | 0.0 | "Abbing, Mr. Anthony" | "female" | 0.1667 | 0.0 | 0.0 | "110152" | 1.0 | 0.0 | "A10" | "Cherbourg" | 0.0 |
"max" | 1307.0 | "van Billiard, Master. James Wi… | "male" | 80.0 | 8.0 | 9.0 | "WE/P 5735" | 3.0 | 512.3292 | "T" | "Southampton" | 1.0 |
"mean" | 647.119745 | null | null | 29.851124 | 0.512102 | 0.374522 | null | 2.278981 | 34.78715 | null | null | 0.370701 |
"median" | 660.0 | null | null | 28.0 | 0.0 | 0.0 | null | 3.0 | 14.4542 | null | null | 0.0 |
"standard deviation" | 377.03706 | null | null | 14.601355 | 1.060741 | 0.860033 | null | 0.843571 | 55.244376 | null | null | 0.4833 |
"missing value ratio" | 0.0 | "0.0" | "0.0" | 0.206369 | 0.0 | 0.0 | "0.0" | 0.0 | 0.0 | "0.7668789808917198" | "0.0012738853503184713" | 0.0 |
"stability" | 0.001274 | "0.0025477707006369425" | "0.6522292993630573" | 0.046549 | 0.677707 | 0.770701 | "0.008917197452229299" | 0.533758 | 0.054777 | "0.0273224043715847" | "0.701530612244898" | 0.629299 |
"idness" | 1.0 | "0.9987261146496815" | "0.0025477707006369425" | 0.118471 | 0.008917 | 0.010191 | "0.7834394904458599" | 0.003822 | 0.28535 | "0.17070063694267515" | "0.005095541401273885" | 0.002548 |
We remove certain columns for the following reasons:

- high idness: id, ticket
- high stability: parents_children
- high missing value ratio: cabin
train_table = train_table.remove_columns(["id", "ticket", "parents_children", "cabin"])
test_table = test_table.remove_columns(["id", "ticket", "parents_children", "cabin"])
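The column-quality metrics above can be approximated in plain Python. As a rough sketch, assuming idness is the share of distinct values, stability the share of the most common non-missing value, and the missing value ratio the share of missing entries:

```python
from collections import Counter

def column_quality(values):
    """Compute rough quality metrics for a single column."""
    non_missing = [v for v in values if v is not None]
    idness = len(set(non_missing)) / len(values)
    stability = Counter(non_missing).most_common(1)[0][1] / len(non_missing)
    missing_ratio = values.count(None) / len(values)
    return idness, stability, missing_ratio

# A column where most values are missing, like "cabin":
cabin = [None, None, None, "A10", None, None, "B5", None]
print(column_quality(cabin))  # (0.25, 0.5, 0.75)
```

Intuitively, a column with idness near 1 (such as id) identifies rows rather than describing them, and a column with very high stability carries almost no information, so neither helps the model.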
Handling Missing Values¶
We fill in missing values in the age and fare columns with the mean of each respective column.
from safeds.data.tabular.transformation import SimpleImputer
simple_imputer = SimpleImputer(selector=["age", "fare"], strategy=SimpleImputer.Strategy.mean())
fitted_simple_imputer_train, transformed_train_data = simple_imputer.fit_and_transform(train_table)
transformed_test_data = fitted_simple_imputer_train.transform(test_table)
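Conceptually, mean imputation replaces each missing entry with the mean of the non-missing values in that column. A minimal plain-Python sketch of the idea, not the SimpleImputer implementation:

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

ages = [42.0, None, 16.0, None, 26.0]
print(impute_mean(ages))  # [42.0, 28.0, 16.0, 28.0, 26.0]
```

Note that the imputer is fitted on the training set only and then reused on the test set, so no information from the test data leaks into training.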
Handling Nominal Categorical Data¶
We use OneHotEncoder to transform categorical, non-numerical values into numerical representations of zero and one. In this example, we transform the values of the sex and port_embarked columns so they can be used by the model to predict passenger survival.

- Use the fit_and_transform function of the OneHotEncoder, passing the table and the names of the columns to be encoded.
from safeds.data.tabular.transformation import OneHotEncoder
fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(
selector=["sex", "port_embarked"],
).fit_and_transform(transformed_train_data)
transformed_test_data = fitted_one_hot_encoder_train.transform(transformed_test_data)
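One-hot encoding gives each category its own zero/one indicator column, which is why the table below has sex__female and sex__male instead of a single sex column. A plain-Python sketch of the transformation:

```python
def one_hot_encode(values):
    """Map each category to its own zero/one indicator column."""
    categories = sorted(set(values))
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
    }

sex = ["male", "female", "female", "male"]
print(one_hot_encode(sex))
# {'female': [0, 1, 1, 0], 'male': [1, 0, 0, 1]}
```

As with the imputer, the encoder is fitted on the training data and the same fitted encoder is applied to the test data, so both tables end up with identical columns.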
Statistics after Data Processing¶
Check the data after cleaning and transformation to ensure the changes were made correctly.
transformed_train_data.summarize_statistics()
statistic | name | age | siblings_spouses | travel_class | fare | survived | sex__female | sex__male | port_embarked__Southampton | port_embarked__Queenstown | port_embarked__Cherbourg |
---|---|---|---|---|---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"min" | "Abbing, Mr. Anthony" | 0.1667 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"max" | "van Billiard, Master. James Wi… | 80.0 | 8.0 | 3.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
"mean" | null | 29.851124 | 0.512102 | 2.278981 | 34.78715 | 0.370701 | 0.347771 | 0.652229 | 0.700637 | 0.092994 | 0.205096 |
"median" | null | 29.851124 | 0.0 | 3.0 | 14.4542 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
"standard deviation" | null | 13.005598 | 1.060741 | 0.843571 | 55.244376 | 0.4833 | 0.476566 | 0.476566 | 0.458271 | 0.290609 | 0.404029 |
"missing value ratio" | "0.0" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"stability" | "0.0025477707006369425" | 0.206369 | 0.677707 | 0.533758 | 0.054777 | 0.629299 | 0.652229 | 0.652229 | 0.700637 | 0.907006 | 0.794904 |
"idness" | "0.9987261146496815" | 0.118471 | 0.008917 | 0.003822 | 0.28535 | 0.002548 | 0.002548 | 0.002548 | 0.002548 | 0.002548 | 0.002548 |
Marking the Target Column¶
Here, we set the target, extra, and feature columns using to_tabular_dataset. This ensures the model knows which column to predict and which columns to use as features during training.

- target: survived
- extra: name
- features: all columns except target and extra
tagged_train_table = transformed_train_data.to_tabular_dataset("survived", extra_names=["name"])
Fitting a Classifier¶
We use the RandomForestClassifier as our model and pass the training dataset to its fit function to train it.
from safeds.ml.classical.classification import RandomForestClassifier
classifier = RandomForestClassifier()
fitted_classifier = classifier.fit(tagged_train_table)
Predicting with the Classifier¶
We use the trained RandomForestClassifier model to predict the survival of the passengers in the test dataset. We pass the transformed_test_data into the predict function, which uses our trained model for the prediction.
prediction = fitted_classifier.predict(transformed_test_data)
Reverse-Transforming the Prediction¶
After making a prediction, the values are still in the transformed format. To interpret the results in terms of the original values, we reverse this transformation using inverse_transform_table with those fitted transformers that support inverse transformation.
reverse_transformed_prediction = prediction.to_table().inverse_transform_table(fitted_one_hot_encoder_train)
# For visualisation purposes we only print out the first 15 rows.
reverse_transformed_prediction.slice_rows(length=15)
name | age | siblings_spouses | travel_class | fare | survived | sex | port_embarked |
---|---|---|---|---|---|---|---|
str | f64 | i64 | i64 | f64 | i64 | str | str |
"Dantcheff, Mr. Ristiu" | 25.0 | 0 | 3 | 7.8958 | 0 | "male" | "Southampton" |
"Chip, Mr. Chang" | 32.0 | 0 | 3 | 56.4958 | 1 | "male" | "Southampton" |
"McEvoy, Mr. Michael" | 29.851124 | 0 | 3 | 15.5 | 0 | "male" | "Queenstown" |
"Harrison, Mr. William" | 40.0 | 0 | 1 | 0.0 | 0 | "male" | "Southampton" |
"West, Mrs. Edwy Arthur (Ada Ma… | 33.0 | 1 | 2 | 27.75 | 1 | "female" | "Southampton" |
… | … | … | … | … | … | … | … |
"Andersson, Miss. Ida Augusta M… | 38.0 | 4 | 3 | 7.775 | 0 | "female" | "Southampton" |
"Sheerlinck, Mr. Jan Baptist" | 29.0 | 0 | 3 | 9.5 | 0 | "male" | "Southampton" |
"Drew, Master. Marshall Brines" | 8.0 | 0 | 2 | 32.5 | 1 | "male" | "Southampton" |
"O'Connell, Mr. Patrick D" | 29.851124 | 0 | 3 | 7.7333 | 0 | "male" | "Queenstown" |
"Foley, Mr. William" | 29.851124 | 0 | 3 | 7.75 | 0 | "male" | "Queenstown" |
Testing the Accuracy of the Model¶
We evaluate the performance of the trained model by computing its accuracy on the transformed test data using the accuracy method.
accuracy = fitted_classifier.accuracy(transformed_test_data) * 100
f"Accuracy on test data: {accuracy:.4f}%"
'Accuracy on test data: 78.2443%'
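Accuracy itself is simply the fraction of predictions that match the true labels. A plain-Python sketch with hypothetical labels:

```python
def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

predicted = [0, 1, 0, 0, 1]
actual    = [0, 1, 1, 0, 1]
print(f"{accuracy(predicted, actual) * 100:.4f}%")  # 80.0000%
```

Because the test set was held out from training, this number estimates how well the model generalizes to passengers it has never seen.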