Classification¶
This tutorial uses safeds on titanic passenger data to predict who will survive and who will not, using sex as a feature for the prediction.
- Load your data into a
Table
, the data is available underdocs/tutorials/data/titanic.csv
:
In [1]:
Copied!
from safeds.data.tabular.containers import Table
titanic = Table.from_csv_file("data/titanic.csv")
#For visualisation purposes we only print out the first 15 rows.
titanic.slice_rows(0,15)
from safeds.data.tabular.containers import Table
titanic = Table.from_csv_file("data/titanic.csv")
#For visualisation purposes we only print out the first 15 rows.
titanic.slice_rows(0,15)
Out[1]:
id | name | sex | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Abbing, Mr. Anthony | male | 42.0000 | 0 | 0 | C.A. 5547 | 3 | 7.5500 | NaN | Southampton | 0 |
1 | 1 | Abbott, Master. Eugene Joseph | male | 13.0000 | 0 | 2 | C.A. 2673 | 3 | 20.2500 | NaN | Southampton | 0 |
2 | 2 | Abbott, Mr. Rossmore Edward | male | 16.0000 | 1 | 1 | C.A. 2673 | 3 | 20.2500 | NaN | Southampton | 0 |
3 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35.0000 | 1 | 1 | C.A. 2673 | 3 | 20.2500 | NaN | Southampton | 1 |
4 | 4 | Abelseth, Miss. Karen Marie | female | 16.0000 | 0 | 0 | 348125 | 3 | 7.6500 | NaN | Southampton | 1 |
5 | 5 | Abelseth, Mr. Olaus Jorgensen | male | 25.0000 | 0 | 0 | 348122 | 3 | 7.6500 | F G63 | Southampton | 1 |
6 | 6 | Abelson, Mr. Samuel | male | 30.0000 | 1 | 0 | P/PP 3381 | 2 | 24.0000 | NaN | Cherbourg | 0 |
7 | 7 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0000 | 1 | 0 | P/PP 3381 | 2 | 24.0000 | NaN | Cherbourg | 1 |
8 | 8 | Abrahamsson, Mr. Abraham August Johannes | male | 20.0000 | 0 | 0 | SOTON/O2 3101284 | 3 | 7.9250 | NaN | Southampton | 1 |
9 | 9 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0000 | 0 | 0 | 2657 | 3 | 7.2292 | NaN | Cherbourg | 1 |
10 | 10 | Adahl, Mr. Mauritz Nils Martin | male | 30.0000 | 0 | 0 | C 7076 | 3 | 7.2500 | NaN | Southampton | 0 |
11 | 11 | Adams, Mr. John | male | 26.0000 | 0 | 0 | 341826 | 3 | 8.0500 | NaN | Southampton | 0 |
12 | 12 | Ahlin, Mrs. Johan (Johanna Persdotter Larsson) | female | 40.0000 | 1 | 0 | 7546 | 3 | 9.4750 | NaN | Southampton | 0 |
13 | 13 | Aks, Master. Philip Frank | male | 0.8333 | 0 | 1 | 392091 | 3 | 9.3500 | NaN | Southampton | 1 |
14 | 14 | Aks, Mrs. Sam (Leah Rosen) | female | 18.0000 | 0 | 1 | 392091 | 3 | 9.3500 | NaN | Southampton | 1 |
- Split the titanic dataset into two tables. A training set, that we will use later to implement a training model to predict the survival of passengers, containing 60% of the data, and a testing set containing the rest of the data.
Delete the column
survived
from the test set, to be able to predict it later:
In [2]:
Copied!
split_tuple = titanic.split_rows(0.60)
train_table = split_tuple[0]
testing_table = split_tuple[1]
test_table = testing_table.remove_columns(["survived"]).shuffle_rows()
split_tuple = titanic.split_rows(0.60)
train_table = split_tuple[0]
testing_table = split_tuple[1]
test_table = testing_table.remove_columns(["survived"]).shuffle_rows()
- Use
OneHotEncoder
to create an encoder, that will be used later to transform the training table.
- We use
OneHotEncoder
to transform non-numerical categorical values into numerical representations with values of zero or one. In this example we will transform the values of the sex column, hence they will be used in the model for predicting the surviving of passengers. - Use the
fit
function of theOneHotEncoder
to pass the table and the column names, that will be used as features to predict who will survive to the encoder. - The names of the column before transformation need to be saved, because
OneHotEncoder
changes the names of the fittedColumn
s:
In [3]:
Copied!
from safeds.data.tabular.transformation import OneHotEncoder
old_column_names = train_table.column_names
encoder = OneHotEncoder().fit(train_table, ["sex"])
from safeds.data.tabular.transformation import OneHotEncoder
old_column_names = train_table.column_names
encoder = OneHotEncoder().fit(train_table, ["sex"])
- Transform the training table using the fitted encoder, and create a set with the new names of the fitted
Column
s:
In [4]:
Copied!
transformed_table = encoder.transform(train_table)
new_column_names = transformed_table.column_names
new_columns= set(new_column_names) - set(old_column_names)
transformed_table = encoder.transform(train_table)
new_column_names = transformed_table.column_names
new_columns= set(new_column_names) - set(old_column_names)
- Tag the
survived
Column
as the target variable to be predicted. Use the new names of the fittedColumn
s as features, which will be used to make predictions based on the target variable.
In [5]:
Copied!
tagged_train_table= transformed_table.tag_columns("survived", feature_names=[
*new_columns
])
tagged_train_table= transformed_table.tag_columns("survived", feature_names=[
*new_columns
])
- Use
RandomForest
classifier as a model for the classification. Pass the "tagged_titanic" table to the fit function of the model:
In [6]:
Copied!
from safeds.ml.classical.classification import RandomForest
model = RandomForest()
fitted_model= model.fit(tagged_train_table)
from safeds.ml.classical.classification import RandomForest
model = RandomForest()
fitted_model= model.fit(tagged_train_table)
- Use the fitted random forest model, that we trained on the training dataset to predict the survival rate of passengers in the test dataset.
Transform the test data with
OneHotEncoder
first, to be able to pass it to the predict function, that uses our fitted random forest model for prediction:
In [7]:
Copied!
encoder = OneHotEncoder().fit(test_table, ["sex"])
transformed_test_table = encoder.transform(test_table)
predicition = fitted_model.predict(
transformed_test_table
)
#For visualisation purposes we only print out the first 15 rows.
predicition.slice_rows(0,15)
encoder = OneHotEncoder().fit(test_table, ["sex"])
transformed_test_table = encoder.transform(test_table)
predicition = fitted_model.predict(
transformed_test_table
)
#For visualisation purposes we only print out the first 15 rows.
predicition.slice_rows(0,15)
Out[7]:
id | name | sex__male | sex__female | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 816 | Mock, Mr. Philipp Edmund | 1.0 | 0.0 | 30.0000 | 1 | 0 | 13236 | 1 | 57.7500 | C78 | Cherbourg | 0 |
1 | 1134 | Somerton, Mr. Francis William | 1.0 | 0.0 | 30.0000 | 0 | 0 | A.5. 18509 | 3 | 8.0500 | NaN | Southampton | 0 |
2 | 997 | Reuchlin, Jonkheer. John George | 1.0 | 0.0 | 38.0000 | 0 | 0 | 19972 | 1 | 0.0000 | NaN | Southampton | 0 |
3 | 1043 | Ryan, Mr. Patrick | 1.0 | 0.0 | NaN | 0 | 0 | 371110 | 3 | 24.1500 | NaN | Queenstown | 0 |
4 | 1244 | Warren, Mr. Charles William | 1.0 | 0.0 | NaN | 0 | 0 | C.A. 49867 | 3 | 7.5500 | NaN | Southampton | 0 |
5 | 1184 | Thomas, Master. Assad Alexander | 1.0 | 0.0 | 0.4167 | 0 | 1 | 2625 | 3 | 8.5167 | NaN | Cherbourg | 0 |
6 | 986 | Quick, Miss. Phyllis May | 0.0 | 1.0 | 2.0000 | 1 | 1 | 26360 | 2 | 26.0000 | NaN | Southampton | 1 |
7 | 988 | Quick, Mrs. Frederick Charles (Jane Richards) | 0.0 | 1.0 | 33.0000 | 0 | 2 | 26360 | 2 | 26.0000 | NaN | Southampton | 1 |
8 | 1063 | Sage, Mr. Frederick | 1.0 | 0.0 | NaN | 8 | 2 | CA. 2343 | 3 | 69.5500 | NaN | Southampton | 0 |
9 | 961 | Peruschitz, Rev. Joseph Maria | 1.0 | 0.0 | 41.0000 | 0 | 0 | 237393 | 2 | 13.0000 | NaN | Southampton | 0 |
10 | 832 | Moss, Mr. Albert Johan | 1.0 | 0.0 | NaN | 0 | 0 | 312991 | 3 | 7.7750 | NaN | Southampton | 0 |
11 | 1098 | Silvey, Mr. William Baird | 1.0 | 0.0 | 50.0000 | 1 | 0 | 13507 | 1 | 55.9000 | E44 | Southampton | 0 |
12 | 981 | Ponesell, Mr. Martin | 1.0 | 0.0 | 34.0000 | 0 | 0 | 250647 | 2 | 13.0000 | NaN | Southampton | 0 |
13 | 1290 | Windelov, Mr. Einar | 1.0 | 0.0 | 21.0000 | 0 | 0 | SOTON/OQ 3101317 | 3 | 7.2500 | NaN | Southampton | 0 |
14 | 848 | Najib, Miss. Adele Kiamie 'Jane' | 0.0 | 1.0 | 15.0000 | 0 | 0 | 2667 | 3 | 7.2250 | NaN | Cherbourg | 1 |
- You can test the accuracy of that model with the initial testing_table as follows:
In [8]:
Copied!
encoder = OneHotEncoder().fit(test_table, ["sex"])
testing_table = encoder.transform(testing_table)
tagged_test_table= testing_table.tag_columns("survived", feature_names=[
*new_columns
])
fitted_model.accuracy(tagged_train_table)
encoder = OneHotEncoder().fit(test_table, ["sex"])
testing_table = encoder.transform(testing_table)
tagged_test_table= testing_table.tag_columns("survived", feature_names=[
*new_columns
])
fitted_model.accuracy(tagged_train_table)
Out[8]:
0.7745222929936306