Regression¶
This tutorial uses Safe-DS on house sales data to predict house prices.
File and Imports¶
Start by creating a Python file with the suffix `.py`.
from safeds.data.tabular.containers import Table
pricing = Table.from_csv_file("data/house_sales.csv")
# For visualisation purposes, we only print out the first 15 rows.
pricing.slice_rows(length=15)
id | year | month | day | zipcode | latitude | longitude | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
0 | 2014 | 5 | 2 | 98001 | 47.3406 | -122.269 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
1 | 2014 | 5 | 2 | 98003 | 47.3537 | -122.303 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.5 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
2 | 2014 | 5 | 2 | 98006 | 47.5443 | -122.177 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
3 | 2014 | 5 | 2 | 98006 | 47.5746 | -122.135 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
4 | 2014 | 5 | 2 | 98006 | 47.5725 | -122.133 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.5 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
10 | 2014 | 5 | 2 | 98023 | 47.3256 | -122.378 | 33151 | 3240 | 3240 | 0 | 2.0 | 3 | 2.5 | 0 | 3 | 3 | 10 | 1995 | 0 | 24967 | 4050 | 604000 |
11 | 2014 | 5 | 2 | 98024 | 47.5643 | -121.897 | 16215 | 1580 | 1580 | 0 | 1.0 | 3 | 2.25 | 0 | 1 | 4 | 7 | 1978 | 0 | 16215 | 1450 | 335000 |
12 | 2014 | 5 | 2 | 98027 | 47.4635 | -121.991 | 35100 | 1970 | 1970 | 0 | 2.0 | 3 | 2.25 | 0 | 1 | 4 | 9 | 1977 | 0 | 35100 | 2340 | 437500 |
13 | 2014 | 5 | 2 | 98027 | 47.4634 | -121.987 | 37277 | 2710 | 2710 | 0 | 2.0 | 4 | 2.75 | 0 | 1 | 3 | 9 | 2000 | 0 | 39299 | 2390 | 630000 |
14 | 2014 | 5 | 2 | 98029 | 47.5794 | -122.025 | 67518 | 2820 | 2820 | 0 | 2.0 | 5 | 2.5 | 0 | 1 | 3 | 8 | 1979 | 0 | 48351 | 2820 | 675000 |
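If you do not have the `data/house_sales.csv` file at hand, you can experiment with a small in-memory table instead. The following is a minimal sketch, reusing the `Table` class imported above and assuming it accepts a dictionary that maps column names to lists of values; the values are made-up toy data, not part of the real dataset:
# Sketch only: toy values, and the dict-based Table constructor is an
# assumption about your installed Safe-DS version.
toy_pricing = Table({
    "id": [0, 1, 2],
    "sqft_living": [2200, 2090, 2160],
    "bedrooms": [4, 3, 4],
    "price": [285000, 285000, 440000],
})
toy_pricing.slice_rows(length=3)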
Cleaning your Data¶
At this point, it is usual to clean the data. Here's an example of how to do so:
pricing_columns = (
# Removes columns "latitude" and "longitude" from table
pricing.remove_columns(["latitude", "longitude"])
# Removes rows which contain missing values
.remove_rows_with_missing_values()
# Removes rows which contain outliers
.remove_rows_with_outliers()
)
# For visualisation purposes, we only print out the first 5 rows.
pricing_columns.slice_rows(length=5)
id | year | month | day | zipcode | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
0 | 2014 | 5 | 2 | 98001 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
1 | 2014 | 5 | 2 | 98003 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.5 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
2 | 2014 | 5 | 2 | 98006 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
3 | 2014 | 5 | 2 | 98006 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
4 | 2014 | 5 | 2 | 98006 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.5 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
See how to perform further data cleaning in the dedicated Data Processing Tutorial.
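As a small illustration of such further cleaning, the sketch below assumes your Safe-DS version also provides `remove_duplicate_rows` on `Table`; check the API reference of your installed version before relying on it:
# Sketch only: remove_duplicate_rows is assumed to exist in your Safe-DS version.
pricing_deduplicated = pricing_columns.remove_duplicate_rows()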
Create Training and Testing Set¶
Split the house sales dataset into two tables: a training set containing 60% of the data, which will later be used to train a model that predicts house prices, and a testing set containing the remaining 40%.
train_table, testing_table = pricing_columns.split_rows(0.60)
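To sanity-check the split, you can compare the sizes of the two tables. The sketch below assumes the table exposes a `row_count` property; depending on your Safe-DS version, it may be called `number_of_rows` instead:
# Sketch only: row_count is an assumption about the Table API
# (some Safe-DS versions name it number_of_rows).
print(train_table.row_count)    # about 60% of the cleaned rows
print(testing_table.row_count)  # about 40% of the cleaned rows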
Mark the `price` column as the target variable to be predicted. Include the `id` column only as an extra column, which is completely ignored by the model:
extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
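The resulting tabular dataset bundles the feature columns, the target column, and the extra columns. The sketch below assumes `TabularDataset` exposes `features`, `target`, and `extras` properties in your Safe-DS version:
# Sketch only: these properties are assumptions about the TabularDataset API.
train_tabular_dataset.features  # every remaining column except "price" and "id"
train_tabular_dataset.target    # the "price" column
train_tabular_dataset.extras    # the "id" column, ignored during training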
Creating and Fitting a Regressor¶
Use a decision tree regressor as the model for the regression. Pass the `train_tabular_dataset` to the `fit` method of the model:
from safeds.ml.classical.regression import DecisionTreeRegressor
fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)
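Note that `fit` returns a new, fitted model and leaves the original regressor unchanged. The sketch below assumes the model exposes an `is_fitted` property:
# Sketch only: is_fitted is an assumed property of Safe-DS models.
model = DecisionTreeRegressor()
fitted_model = model.fit(train_tabular_dataset)
print(model.is_fitted)         # False: the original model stays unfitted
print(fitted_model.is_fitted)  # True: only the returned model is fitted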
Predicting with the Fitted Regressor¶
Use the fitted decision tree regressor, which we trained on the training dataset, to predict the prices of the houses in the testing set.
prediction = fitted_model.predict(testing_table)
# For visualisation purposes we only print out the first 15 rows.
prediction.to_table().slice_rows(length=15)
id | year | month | day | zipcode | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 |
10549 | 2014 | 10 | 13 | 98103 | 5000 | 1240 | 1000 | 240 | 1.0 | 2 | 1.0 | 0 | 1 | 3 | 7 | 1920 | 0 | 3500 | 1480 | 550661.111111 |
17590 | 2015 | 3 | 14 | 98002 | 7312 | 2010 | 2010 | 0 | 1.0 | 4 | 2.0 | 0 | 1 | 4 | 7 | 1976 | 0 | 7650 | 2010 | 269944.444444 |
10889 | 2014 | 10 | 17 | 98103 | 3220 | 1120 | 1120 | 0 | 1.0 | 2 | 1.0 | 0 | 1 | 4 | 7 | 1923 | 0 | 3220 | 1440 | 550661.111111 |
12511 | 2014 | 11 | 14 | 98144 | 2457 | 1950 | 1950 | 0 | 3.0 | 2 | 2.5 | 0 | 1 | 3 | 8 | 2009 | 0 | 1639 | 1650 | 382300.0 |
20572 | 2015 | 4 | 27 | 98056 | 5038 | 1220 | 1220 | 0 | 1.0 | 3 | 1.0 | 0 | 1 | 5 | 6 | 1942 | 0 | 5038 | 1140 | 195130.555556 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
9097 | 2014 | 9 | 18 | 98003 | 7400 | 1130 | 1130 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1969 | 0 | 7379 | 1540 | 223916.666667 |
8454 | 2014 | 9 | 8 | 98023 | 8470 | 840 | 840 | 0 | 1.0 | 3 | 1.0 | 0 | 1 | 4 | 6 | 1961 | 0 | 8450 | 840 | 109291.666667 |
7829 | 2014 | 8 | 26 | 98126 | 4025 | 820 | 820 | 0 | 1.0 | 2 | 1.0 | 0 | 3 | 5 | 6 | 1922 | 0 | 5750 | 1410 | 330190.0 |
19952 | 2015 | 4 | 17 | 98198 | 10187 | 1120 | 1120 | 0 | 1.0 | 3 | 1.75 | 0 | 1 | 3 | 7 | 1968 | 0 | 8736 | 1900 | 201880.0 |
18382 | 2015 | 3 | 26 | 98042 | 5929 | 2210 | 2210 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 2004 | 0 | 5901 | 2200 | 311491.666667 |
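To compare the predictions with the actual sale prices, you can pull the `price` column out of both tables. The sketch below reuses the `Table` class imported above and assumes `get_column`, `Column.to_list`, and the dict-based `Table` constructor are available in your Safe-DS version:
# Sketch only: get_column, to_list, and Table(dict) are assumptions
# about the Table/Column API in your Safe-DS version.
predicted_prices = prediction.to_table().get_column("price").to_list()
actual_prices = testing_table.get_column("price").to_list()
comparison = Table({"actual": actual_prices, "predicted": predicted_prices})
comparison.slice_rows(length=5)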
Evaluating the Fitted Regressor¶
You can compute the mean absolute error of the model on the `testing_table` as follows:
fitted_model.mean_absolute_error(testing_table)
92480.98339349499
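A mean absolute error of roughly 92,000 means that, on average, a predicted price is off by about that many dollars. Depending on your Safe-DS version, the fitted regressor may offer further metrics; the sketch below assumes `mean_squared_error` and `coefficient_of_determination` are available:
# Sketch only: these metric methods are assumptions about the regressor API
# in your Safe-DS version.
fitted_model.mean_squared_error(testing_table)
fitted_model.coefficient_of_determination(testing_table)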
Full Code¶
from safeds.data.tabular.containers import Table
from safeds.ml.classical.regression import DecisionTreeRegressor
pricing = Table.from_csv_file("data/house_sales.csv")
pricing_columns = (
pricing.remove_columns(["latitude", "longitude"])
.remove_rows_with_missing_values()
.remove_rows_with_outliers()
)
train_table, testing_table = pricing_columns.split_rows(0.60)
extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)
prediction = fitted_model.predict(testing_table)
fitted_model.mean_absolute_error(testing_table)
92552.00101077871