Classification

This tutorial uses safeds on Titanic passenger data to predict which passengers survived, using sex as a feature for the prediction.

Load your data into a Table. The data is available under docs/tutorials/data/titanic.csv:

In [1]:

from safeds.data.tabular.containers import Table

titanic = Table.from_csv_file("data/titanic.csv")
# For visualisation purposes, we only show the first 15 rows.
titanic.slice_rows(0, 15)

Out[1]:

	id	name	sex	age	siblings_spouses	parents_children	ticket	travel_class	fare	cabin	port_embarked	survived
0	0	Abbing, Mr. Anthony	male	42.0000	0	0	C.A. 5547	3	7.5500	NaN	Southampton	0
1	1	Abbott, Master. Eugene Joseph	male	13.0000	0	2	C.A. 2673	3	20.2500	NaN	Southampton	0
2	2	Abbott, Mr. Rossmore Edward	male	16.0000	1	1	C.A. 2673	3	20.2500	NaN	Southampton	0
3	3	Abbott, Mrs. Stanton (Rosa Hunt)	female	35.0000	1	1	C.A. 2673	3	20.2500	NaN	Southampton	1
4	4	Abelseth, Miss. Karen Marie	female	16.0000	0	0	348125	3	7.6500	NaN	Southampton	1
5	5	Abelseth, Mr. Olaus Jorgensen	male	25.0000	0	0	348122	3	7.6500	F G63	Southampton	1
6	6	Abelson, Mr. Samuel	male	30.0000	1	0	P/PP 3381	2	24.0000	NaN	Cherbourg	0
7	7	Abelson, Mrs. Samuel (Hannah Wizosky)	female	28.0000	1	0	P/PP 3381	2	24.0000	NaN	Cherbourg	1
8	8	Abrahamsson, Mr. Abraham August Johannes	male	20.0000	0	0	SOTON/O2 3101284	3	7.9250	NaN	Southampton	1
9	9	Abrahim, Mrs. Joseph (Sophie Halaut Easu)	female	18.0000	0	0	2657	3	7.2292	NaN	Cherbourg	1
10	10	Adahl, Mr. Mauritz Nils Martin	male	30.0000	0	0	C 7076	3	7.2500	NaN	Southampton	0
11	11	Adams, Mr. John	male	26.0000	0	0	341826	3	8.0500	NaN	Southampton	0
12	12	Ahlin, Mrs. Johan (Johanna Persdotter Larsson)	female	40.0000	1	0	7546	3	9.4750	NaN	Southampton	0
13	13	Aks, Master. Philip Frank	male	0.8333	0	1	392091	3	9.3500	NaN	Southampton	1
14	14	Aks, Mrs. Sam (Leah Rosen)	female	18.0000	0	1	392091	3	9.3500	NaN	Southampton	1

Split the Titanic dataset into two tables: a training set containing 60% of the data, which we will use later to train a model that predicts the survival of passengers, and a testing set containing the rest of the data. Remove the survived column from the test set so that it can be predicted later, and shuffle its rows:

In [2]:

train_table, testing_table = titanic.split_rows(0.6)

test_table = testing_table.remove_columns(["survived"]).shuffle_rows()
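
The split described above can be illustrated in plain Python. This is only a sketch of the idea and not the safeds implementation: the helper `split_rows` below is a hypothetical stand-in that works on a list of dictionaries and assumes the split keeps the original row order.

```python
def split_rows(rows, fraction):
    """Return two lists; the first holds roughly `fraction` of the rows (hypothetical helper)."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    cut = round(len(rows) * fraction)
    return rows[:cut], rows[cut:]


# Toy data standing in for the Titanic table.
rows = [{"id": i, "survived": i % 2} for i in range(10)]
train_rows, test_rows = split_rows(rows, 0.6)

# Drop the target from the test rows so that it has to be predicted.
test_rows_without_target = [
    {key: value for key, value in row.items() if key != "survived"}
    for row in test_rows
]
```

With ten toy rows, the first six end up in the training part and the remaining four in the testing part.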

Create a OneHotEncoder, which will be used later to transform the training table.

A OneHotEncoder transforms non-numerical categorical values into numerical representations with values of zero or one. In this example, we transform the values of the sex column so that they can be used in the model for predicting the survival of passengers.
Use the fit function of the OneHotEncoder to pass it the table and the names of the columns to encode, which will be used as features to predict who survives.
OneHotEncoder replaces the fitted columns with new, differently named columns, so save the original column names beforehand if you need them later:

In [3]:

from safeds.data.tabular.transformation import OneHotEncoder

encoder = OneHotEncoder().fit(train_table, ["sex"])
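
To make the fit and transform steps more concrete, here is a minimal plain-Python sketch of one-hot encoding. The class `SimpleOneHotEncoder` is hypothetical and exists only for illustration; it is not how safeds implements its OneHotEncoder. The `column__category` naming follows the column names visible in the prediction output of this tutorial.

```python
class SimpleOneHotEncoder:
    """Hypothetical, minimal one-hot encoder for one column of a list-of-dicts table."""

    def fit(self, rows, column):
        # Remember the column and its categories in order of first appearance.
        self.column = column
        self.categories = list(dict.fromkeys(row[column] for row in rows))
        return self

    def transform(self, rows):
        encoded = []
        for row in rows:
            # Copy every column except the one being encoded.
            new_row = {k: v for k, v in row.items() if k != self.column}
            # Add one 0/1 column per category learned during fit.
            for category in self.categories:
                is_match = row[self.column] == category
                new_row[f"{self.column}__{category}"] = 1.0 if is_match else 0.0
            encoded.append(new_row)
        return encoded


train = [{"id": 0, "sex": "male"}, {"id": 3, "sex": "female"}]
toy_encoder = SimpleOneHotEncoder().fit(train, "sex")
encoded_train = toy_encoder.transform(train)

# The same fitted encoder can be applied to other rows and yields the same columns.
encoded_other = toy_encoder.transform([{"id": 9, "sex": "female"}])
```

Because the categories are learned once in fit, every table transformed with the same encoder ends up with the same set of encoded columns.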

Transform the training table using the fitted encoder. The sex column is replaced by the new columns sex__male and sex__female:

In [4]:

transformed_table = encoder.transform(train_table)

Mark the survived column as the target variable to be predicted. Mark some columns as extra columns, which are completely ignored by the model; all remaining columns are used as features:

In [5]:

extra_names = ["id", "name", "ticket", "cabin", "port_embarked", "age", "fare"]

train_tabular_dataset = transformed_table.to_tabular_dataset("survived", extra_names)
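
The following plain-Python sketch shows which columns end up in which role under the assumption that every column that is neither the target nor an extra column is treated as a feature. The helper `assign_column_roles` is hypothetical and not part of safeds; the column names are taken from the encoded table shown in this tutorial.

```python
def assign_column_roles(column_names, target_name, extra_names):
    """Split column names into target, extras, and features (hypothetical helper)."""
    if target_name not in column_names:
        raise ValueError(f"unknown target column: {target_name}")
    unknown = [name for name in extra_names if name not in column_names]
    if unknown:
        raise ValueError(f"unknown extra columns: {unknown}")
    # Everything that is neither target nor extra is assumed to be a feature.
    features = [
        name
        for name in column_names
        if name != target_name and name not in extra_names
    ]
    return {"target": target_name, "extras": list(extra_names), "features": features}


columns = [
    "id", "name", "sex__male", "sex__female", "age", "siblings_spouses",
    "parents_children", "ticket", "travel_class", "fare", "cabin",
    "port_embarked", "survived",
]
extras = ["id", "name", "ticket", "cabin", "port_embarked", "age", "fare"]
roles = assign_column_roles(columns, "survived", extras)
```

Under this assumption, the features are the two encoded sex columns plus siblings_spouses, parents_children, and travel_class.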

Use a RandomForestClassifier as the model for the classification. Pass train_tabular_dataset to the fit function of the model:

In [6]:

from safeds.ml.classical.classification import RandomForestClassifier

model = RandomForestClassifier()
fitted_model = model.fit(train_tabular_dataset)

Use the fitted random forest model, which we trained on the training dataset, to predict the survival of passengers in the test dataset. First transform the test data with a OneHotEncoder so that its columns match those the model was trained on, then pass it to the predict function of the fitted model:

In [7]:

# Reuse the encoder fitted on the training table, so the test table gets the same columns.
transformed_test_table = encoder.transform(test_table)

prediction = fitted_model.predict(transformed_test_table)
# For visualisation purposes, we only show the first 15 rows.
prediction.to_table().slice_rows(start=0, end=15)

Out[7]:

	id	name	sex__male	sex__female	age	siblings_spouses	parents_children	ticket	travel_class	fare	cabin	port_embarked	survived
0	1116	Slabenoff, Mr. Petco	1.0	0.0	NaN	0	0	349214	3	7.8958	NaN	Southampton	0
1	1153	Stokes, Mr. Philip Joseph	1.0	0.0	25.0	0	0	F.C.C. 13540	2	10.5000	NaN	Southampton	0
2	1284	Williams, Mr. Charles Eugene	1.0	0.0	NaN	0	0	244373	2	13.0000	NaN	Southampton	0
3	1003	Rice, Master. Eric	1.0	0.0	7.0	4	1	382652	3	29.1250	NaN	Queenstown	0
4	786	McGovern, Miss. Mary	0.0	1.0	NaN	0	0	330931	3	7.8792	NaN	Queenstown	1
5	985	Pulbaum, Mr. Franz	1.0	0.0	27.0	0	0	SC/PARIS 2168	2	15.0333	NaN	Cherbourg	0
6	1011	Ridsdale, Miss. Lucy	0.0	1.0	50.0	0	0	W./C. 14258	2	10.5000	NaN	Southampton	1
7	967	Petranec, Miss. Matilda	0.0	1.0	28.0	0	0	349245	3	7.8958	NaN	Southampton	1
8	989	Radeff, Mr. Alexander	1.0	0.0	NaN	0	0	349223	3	7.8958	NaN	Southampton	0
9	1295	Wright, Miss. Marion	0.0	1.0	26.0	0	0	220844	2	13.5000	NaN	Southampton	1
10	846	Myles, Mr. Thomas Francis	1.0	0.0	62.0	0	0	240276	2	9.6875	NaN	Queenstown	0
11	1245	Warren, Mr. Frank Manley	1.0	0.0	64.0	1	0	110813	1	75.2500	D37	Cherbourg	0
12	953	Peltomaki, Mr. Nikolai Johannes	1.0	0.0	25.0	0	0	STON/O 2. 3101291	3	7.9250	NaN	Southampton	0
13	1023	Rogers, Mr. Reginald Harry	1.0	0.0	19.0	0	0	28004	2	10.5000	NaN	Southampton	0
14	913	Osman, Mrs. Mara	0.0	1.0	31.0	0	0	349244	3	8.6833	NaN	Southampton	1

You can measure the accuracy of the model on the initial testing_table, which still contains the survived column, as follows:

In [8]:

# Encode the sex column of the testing table with the existing encoder.
testing_table = encoder.transform(testing_table)

test_tabular_dataset = testing_table.to_tabular_dataset("survived", extra_names)
fitted_model.accuracy(test_tabular_dataset)

Out[8]:

0.7958015267175572
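
Accuracy is the fraction of rows for which the predicted class equals the actual class. As a minimal sketch in plain Python, with a hypothetical helper `accuracy` and made-up toy labels rather than the Titanic results:

```python
def accuracy(predicted, actual):
    """Return the share of predictions that match the actual values (hypothetical helper)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty dataset")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels: four of the five predictions match.
predicted = [0, 0, 1, 1, 0]
actual = [0, 1, 1, 1, 0]
score = accuracy(predicted, actual)
```

Read this way, the value of about 0.80 above means that the model predicted the survived column correctly for roughly four out of five passengers in the testing table.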