Regression
This tutorial uses Safe-DS on house sales data to predict house prices.
File and Imports
Start by creating a Python file with the suffix .py.
from safeds.data.tabular.containers import Table
pricing = Table.from_csv_file("data/house_sales.csv")
# For visualisation purposes, we only print out the first 15 rows.
pricing.slice_rows(length=15)
| id | year | month | day | zipcode | latitude | longitude | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | 2014 | 5 | 2 | 98001 | 47.3406 | -122.269 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
| 1 | 2014 | 5 | 2 | 98003 | 47.3537 | -122.303 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.5 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
| 2 | 2014 | 5 | 2 | 98006 | 47.5443 | -122.177 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
| 3 | 2014 | 5 | 2 | 98006 | 47.5746 | -122.135 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
| 4 | 2014 | 5 | 2 | 98006 | 47.5725 | -122.133 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.5 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 10 | 2014 | 5 | 2 | 98023 | 47.3256 | -122.378 | 33151 | 3240 | 3240 | 0 | 2.0 | 3 | 2.5 | 0 | 3 | 3 | 10 | 1995 | 0 | 24967 | 4050 | 604000 |
| 11 | 2014 | 5 | 2 | 98024 | 47.5643 | -121.897 | 16215 | 1580 | 1580 | 0 | 1.0 | 3 | 2.25 | 0 | 1 | 4 | 7 | 1978 | 0 | 16215 | 1450 | 335000 |
| 12 | 2014 | 5 | 2 | 98027 | 47.4635 | -121.991 | 35100 | 1970 | 1970 | 0 | 2.0 | 3 | 2.25 | 0 | 1 | 4 | 9 | 1977 | 0 | 35100 | 2340 | 437500 |
| 13 | 2014 | 5 | 2 | 98027 | 47.4634 | -121.987 | 37277 | 2710 | 2710 | 0 | 2.0 | 4 | 2.75 | 0 | 1 | 3 | 9 | 2000 | 0 | 39299 | 2390 | 630000 |
| 14 | 2014 | 5 | 2 | 98029 | 47.5794 | -122.025 | 67518 | 2820 | 2820 | 0 | 2.0 | 5 | 2.5 | 0 | 1 | 3 | 8 | 1979 | 0 | 48351 | 2820 | 675000 |
Cleaning your Data
At this point, it is common to clean the data. Here is an example of how to do so:
pricing_columns = (
    # Removes columns "latitude" and "longitude" from table
    pricing.remove_columns(["latitude", "longitude"])
    # Removes rows which contain missing values
    .remove_rows_with_missing_values()
    # Removes rows which contain outliers
    .remove_rows_with_outliers()
)
# For visualisation purposes, we only print out the first 5 rows.
pricing_columns.slice_rows(length=5)
| id | year | month | day | zipcode | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | 2014 | 5 | 2 | 98001 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
| 1 | 2014 | 5 | 2 | 98003 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.5 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
| 2 | 2014 | 5 | 2 | 98006 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
| 3 | 2014 | 5 | 2 | 98006 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
| 4 | 2014 | 5 | 2 | 98006 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.5 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
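To make the two row filters above more concrete, here is a minimal plain-Python sketch of the idea. It is not the Safe-DS implementation: the helper names drop_missing and drop_outliers are hypothetical, the toy rows are made up, and the z-score rule with a threshold of 2.0 is only an assumption, since this tutorial does not state which outlier criterion the library applies.

```python
from statistics import mean, pstdev


def drop_missing(rows):
    """Keep only rows (dicts) in which no value is None."""
    return [row for row in rows if all(v is not None for v in row.values())]


def drop_outliers(rows, z_threshold=2.0):
    """Drop rows where any value lies more than z_threshold standard
    deviations from its column mean (assumed criterion, for illustration)."""
    if not rows:
        return []
    # Compute mean and population standard deviation once per column.
    stats = {
        name: (mean(row[name] for row in rows), pstdev(row[name] for row in rows))
        for name in rows[0]
    }

    def is_outlier(row):
        return any(
            sigma > 0 and abs(row[name] - mu) / sigma > z_threshold
            for name, (mu, sigma) in stats.items()
        )

    return [row for row in rows if not is_outlier(row)]


# Made-up toy data: nine ordinary rows, one extreme price, one missing value.
rows = [{"sqft_living": 1000 + 10 * i, "price": 100} for i in range(9)]
rows.append({"sqft_living": 1050, "price": 10000})
rows.append({"sqft_living": None, "price": 100})

cleaned = drop_outliers(drop_missing(rows))
print(len(rows), "->", len(cleaned))
```

Missing values are removed first so that the column statistics are computed on complete rows only, mirroring the order of the chained calls above.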
See how to perform further data cleaning in the dedicated Data Processing Tutorial.
Create Training and Testing Set
Split the house sales dataset into two tables: a training set, which is used later to fit a model that predicts house prices, and a testing set. The training set contains 60% of the data; the testing set contains the remaining 40%.
train_table, testing_table = pricing_columns.split_rows(0.60)
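As a rough illustration of what a 60/40 row split does, the following plain-Python sketch shuffles a list of rows and cuts it at 60%. The helper split_rows_sketch and its seed parameter are hypothetical, and the shuffling step is an assumption; this tutorial does not say whether or how Safe-DS shuffles the rows.

```python
import random


def split_rows_sketch(rows, fraction_in_first, seed=0):
    """Shuffle a copy of rows and split it into two lists.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= fraction_in_first <= 1.0:
        raise ValueError("fraction_in_first must be between 0 and 1")
    shuffled = list(rows)                  # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * fraction_in_first)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten table rows
train_rows, test_rows = split_rows_sketch(rows, 0.60)
print(len(train_rows), len(test_rows))  # prints "6 4"
```

Every row ends up in exactly one of the two lists, so the model is never evaluated on rows it was trained on.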
Mark the price column as the target variable to be predicted. Include the id column only as an extra column, which the model ignores entirely:
extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
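The three column roles can be pictured with a small plain-Python sketch. The function split_roles is hypothetical and only illustrates the idea of features, target, and extras; it says nothing about how Safe-DS stores a tabular dataset internally. The sample row uses a few values from the first row of the table shown earlier.

```python
def split_roles(row, target_name, extra_names=()):
    """Partition one row (a dict) into features, target, and extras."""
    if target_name not in row:
        raise KeyError(f"unknown target column: {target_name}")
    extras = {name: row[name] for name in extra_names}
    target = row[target_name]
    # Everything that is neither the target nor an extra is a feature.
    features = {
        name: value
        for name, value in row.items()
        if name != target_name and name not in extra_names
    }
    return features, target, extras


row = {"id": 0, "sqft_living": 2200, "bedrooms": 4, "price": 285000}
features, target, extras = split_roles(row, "price", extra_names=["id"])
print(features)  # the model learns from these
print(target)    # the model learns to predict this
print(extras)    # carried along, but not used for learning
```

Keeping id out of the features avoids the model treating an arbitrary row identifier as a predictive signal.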
Creating and Fitting a Regressor
Use a decision tree regressor as the model for the regression. Pass train_tabular_dataset to the fit method of the model:
from safeds.ml.classical.regression import DecisionTreeRegressor
fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)
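For intuition about what fitting a decision tree regressor involves, here is a minimal sketch of a single split (a "stump") on one feature, chosen to minimise the squared error. A full tree would apply this step recursively and across all features. The functions best_split and predict_with_split and the toy data are hypothetical; this is not the algorithm as implemented by the library.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the lowest squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict_with_split(x, split):
    """Predict the mean of the side of the threshold that x falls on."""
    threshold, left_mean, right_mean = split
    return left_mean if x <= threshold else right_mean


# Made-up toy data: living area and price (in thousands).
sqft = [800, 900, 1000, 2000, 2100, 2200]
price = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]

split = best_split(sqft, price)
print(split)                            # (1500.0, 110.0, 310.0)
print(predict_with_split(950, split))   # 110.0
print(predict_with_split(2050, split))  # 310.0
```

Because each leaf predicts the mean of its training targets, predictions are averages of observed prices rather than arbitrary values.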
Predicting with the Fitted Regressor
Use the fitted decision tree regression model, which we trained on the training dataset, to predict the prices of the houses in the testing dataset.
prediction = fitted_model.predict(testing_table)
# For visualisation purposes, we only print out the first 15 rows.
prediction.to_table().slice_rows(length=15)
| id | year | month | day | zipcode | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 |
| 18183 | 2015 | 3 | 24 | 98103 | 3880 | 1350 | 950 | 400 | 1.0 | 2 | 1.0 | 0 | 1 | 3 | 6 | 1927 | 0 | 3920 | 1670 | 449500.0 |
| 2557 | 2014 | 6 | 12 | 98092 | 8465 | 2210 | 1490 | 720 | 1.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1990 | 0 | 7917 | 2210 | 299533.2 |
| 9685 | 2014 | 9 | 26 | 98075 | 8808 | 3320 | 3320 | 0 | 2.0 | 4 | 3.5 | 0 | 1 | 3 | 9 | 2005 | 0 | 9226 | 3160 | 764285.714286 |
| 9249 | 2014 | 9 | 19 | 98144 | 1610 | 1950 | 1950 | 0 | 3.0 | 2 | 2.75 | 0 | 1 | 3 | 8 | 2009 | 0 | 1745 | 910 | 441380.0 |
| 11429 | 2014 | 10 | 28 | 98010 | 5233 | 1050 | 1050 | 0 | 1.0 | 3 | 1.0 | 0 | 1 | 5 | 5 | 1906 | 0 | 7500 | 970 | 374200.0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 20938 | 2015 | 4 | 30 | 98106 | 6954 | 1060 | 1060 | 0 | 1.0 | 3 | 1.5 | 0 | 1 | 4 | 6 | 1983 | 0 | 6372 | 1560 | 313333.166667 |
| 15430 | 2015 | 1 | 27 | 98031 | 8314 | 1560 | 1560 | 0 | 1.0 | 3 | 1.5 | 0 | 1 | 3 | 7 | 1962 | 0 | 8925 | 1820 | 237414.285714 |
| 16242 | 2015 | 2 | 18 | 98023 | 8120 | 2260 | 2260 | 0 | 2.0 | 3 | 2.5 | 0 | 1 | 3 | 8 | 2004 | 0 | 7784 | 2250 | 315466.666667 |
| 15039 | 2015 | 1 | 14 | 98125 | 6332 | 1500 | 1500 | 0 | 1.0 | 3 | 1.5 | 0 | 1 | 3 | 7 | 1953 | 0 | 6337 | 1500 | 570375.0 |
| 4663 | 2014 | 7 | 11 | 98002 | 6697 | 810 | 810 | 0 | 1.0 | 2 | 1.0 | 0 | 1 | 4 | 6 | 1923 | 0 | 6695 | 1140 | 146107.142857 |
Evaluating the Fitted Regressor
You can compute the mean absolute error of the model on the initial testing_table as follows:
fitted_model.mean_absolute_error(testing_table)
88119.9660738172
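The mean absolute error is the average of the absolute differences between actual and predicted values, so the number above is in the same unit as the price column. A minimal plain-Python sketch of the computation follows; the function mean_absolute_error_sketch and the toy numbers are made up for illustration.

```python
def mean_absolute_error_sketch(actual, predicted):
    """Average of |actual - predicted| over all pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one value")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_prices = [100, 200, 300, 400]       # made-up ground truth
predicted_prices = [110, 190, 330, 390]    # made-up predictions

# Absolute errors are 10, 10, 30, 10, so the mean is 60 / 4.
mae = mean_absolute_error_sketch(actual_prices, predicted_prices)
print(mae)  # 15.0
```

Unlike squared-error metrics, each error contributes in proportion to its size, so a single badly mispredicted house does not dominate the score.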
Full Code
from safeds.data.tabular.containers import Table
from safeds.ml.classical.regression import DecisionTreeRegressor
pricing = Table.from_csv_file("data/house_sales.csv")
pricing_columns = (
    pricing.remove_columns(["latitude", "longitude"])
    .remove_rows_with_missing_values()
    .remove_rows_with_outliers()
)
train_table, testing_table = pricing_columns.split_rows(0.60)
extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)
prediction = fitted_model.predict(testing_table)
fitted_model.mean_absolute_error(testing_table)
88146.59355509149