Regression¶
This tutorial uses Safe-DS on house sales data to predict house prices.
- Load your data into a `Table`. The data is available under `docs/tutorials/data/house_sales.csv`:
In [1]:
from safeds.data.tabular.containers import Table
pricing = Table.from_csv_file("data/house_sales.csv")
# For visualisation purposes we only print out the first 15 rows.
pricing.slice_rows(0, 15)
Out[1]:
 | id | year | month | day | zipcode | latitude | longitude | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 2014 | 5 | 2 | 98001 | 47.3406 | -122.269 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.50 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
1 | 1 | 2014 | 5 | 2 | 98003 | 47.3537 | -122.303 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.50 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
2 | 2 | 2014 | 5 | 2 | 98006 | 47.5443 | -122.177 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
3 | 3 | 2014 | 5 | 2 | 98006 | 47.5746 | -122.135 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.00 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
4 | 4 | 2014 | 5 | 2 | 98006 | 47.5725 | -122.133 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.50 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
5 | 5 | 2014 | 5 | 2 | 98007 | 47.6022 | -122.134 | 6700 | 1570 | 1570 | 0 | 1.0 | 3 | 1.50 | 0 | 1 | 4 | 7 | 1956 | 0 | 7300 | 1570 | 419000 |
6 | 6 | 2014 | 5 | 2 | 98008 | 47.6188 | -122.114 | 8030 | 2000 | 1000 | 1000 | 1.0 | 3 | 2.25 | 0 | 1 | 4 | 8 | 1963 | 0 | 8250 | 2070 | 420000 |
7 | 7 | 2014 | 5 | 2 | 98011 | 47.7698 | -122.222 | 9655 | 2210 | 1460 | 750 | 1.0 | 5 | 2.50 | 0 | 1 | 3 | 8 | 1976 | 0 | 8633 | 2080 | 470000 |
8 | 8 | 2014 | 5 | 2 | 98011 | 47.7419 | -122.205 | 12261 | 2730 | 2730 | 0 | 2.0 | 4 | 2.50 | 0 | 1 | 3 | 9 | 1991 | 0 | 10872 | 2730 | 612500 |
9 | 9 | 2014 | 5 | 2 | 98014 | 47.6517 | -121.906 | 23103 | 1800 | 1800 | 0 | 1.0 | 3 | 1.75 | 0 | 1 | 3 | 7 | 1968 | 0 | 18163 | 1410 | 284000 |
10 | 10 | 2014 | 5 | 2 | 98023 | 47.3256 | -122.378 | 33151 | 3240 | 3240 | 0 | 2.0 | 3 | 2.50 | 0 | 3 | 3 | 10 | 1995 | 0 | 24967 | 4050 | 604000 |
11 | 11 | 2014 | 5 | 2 | 98024 | 47.5643 | -121.897 | 16215 | 1580 | 1580 | 0 | 1.0 | 3 | 2.25 | 0 | 1 | 4 | 7 | 1978 | 0 | 16215 | 1450 | 335000 |
12 | 12 | 2014 | 5 | 2 | 98027 | 47.4635 | -121.991 | 35100 | 1970 | 1970 | 0 | 2.0 | 3 | 2.25 | 0 | 1 | 4 | 9 | 1977 | 0 | 35100 | 2340 | 437500 |
13 | 13 | 2014 | 5 | 2 | 98027 | 47.4634 | -121.987 | 37277 | 2710 | 2710 | 0 | 2.0 | 4 | 2.75 | 0 | 1 | 3 | 9 | 2000 | 0 | 39299 | 2390 | 630000 |
14 | 14 | 2014 | 5 | 2 | 98029 | 47.5794 | -122.025 | 67518 | 2820 | 2820 | 0 | 2.0 | 5 | 2.50 | 0 | 1 | 3 | 8 | 1979 | 0 | 48351 | 2820 | 675000 |
- Split the house sales dataset into two tables: a training set containing 60% of the data, which we will use to train a model that predicts house prices, and a testing set containing the rest of the data. Remove the column `price` from the test set, so that we can predict it later:
In [2]:
split_tuple = pricing.split_rows(0.60)
train_table = split_tuple[0]
testing_table = split_tuple[1]
test_table = testing_table.remove_columns(["price"]).shuffle_rows()
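Conceptually, `split_rows(0.60)` partitions the table's rows at the 60% mark and returns both parts. A plain-Python sketch of that behaviour (the helper `split_at` is made up for illustration and is not part of Safe-DS):

```python
# Hypothetical sketch of a ratio-based row split; not the Safe-DS implementation.
def split_at(rows, ratio):
    """Split a list of rows into (first_part, rest) at the given ratio."""
    cut = round(len(rows) * ratio)
    return rows[:cut], rows[cut:]

rows = list(range(10))            # stand-in for ten table rows
train, test = split_at(rows, 0.60)
print(len(train), len(test))      # -> 6 4
```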
- Tag the `price` `Column` as the target variable to be predicted. Use all remaining `Column`s except `id` as features, which the model will use to make predictions:
In [3]:
feature_columns = set(train_table.column_names) - set(["price", "id"])
tagged_train_table = train_table.tag_columns("price", feature_names=[*feature_columns])
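Note that selecting features via set difference does not preserve the original column order, since Python sets are unordered. If order matters, a list comprehension achieves the same selection while keeping it (a small sketch with made-up column names):

```python
# Illustrative only: selecting feature names by excluding target/ID columns.
column_names = ["id", "zipcode", "sqft_living", "bedrooms", "price"]
excluded = {"price", "id"}

# A list comprehension keeps the original column order,
# whereas a set difference does not guarantee any order.
feature_columns = [name for name in column_names if name not in excluded]
print(feature_columns)  # -> ['zipcode', 'sqft_living', 'bedrooms']
```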
- Use a `DecisionTree` regressor as the model for the regression. Pass the tagged table `tagged_train_table` to the `fit` method of the model:
In [4]:
from safeds.ml.classical.regression import DecisionTree
model = DecisionTree()
fitted_model = model.fit(tagged_train_table)
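`fit` returns a new, fitted model and leaves the original model untouched. A toy class, unrelated to the real `DecisionTree` internals, can illustrate that pattern:

```python
# Toy illustration of the fit-returns-a-new-model pattern; not Safe-DS code.
class ToyModel:
    def __init__(self, mean=None):
        self.mean = mean  # a "fitted" toy model predicts the training mean

    def fit(self, targets):
        # Return a new fitted instance; self stays unfitted.
        return ToyModel(sum(targets) / len(targets))

    def predict(self, n_rows):
        return [self.mean] * n_rows

model = ToyModel()
fitted = model.fit([100, 200, 300])
print(model.mean, fitted.mean)  # -> None 200.0
```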
- Use the fitted decision tree regression model, which we trained on the training dataset, to predict the price of the houses in the test dataset.
In [5]:
prediction = fitted_model.predict(test_table)
# For visualisation purposes we only print out the first 15 rows.
prediction.slice_rows(0, 15)
Out[5]:
 | id | year | month | day | zipcode | latitude | longitude | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 14656 | 2015 | 1 | 2 | 98058 | 47.4650 | -122.123 | 35171 | 2760 | 2760 | 0 | 2.0 | 3 | 2.50 | 0 | 1 | 3 | 9 | 1990 | 0 | 35171 | 2720 | 530000.0 |
1 | 19472 | 2015 | 4 | 10 | 98168 | 47.4717 | -122.323 | 9086 | 1480 | 1480 | 0 | 1.0 | 3 | 1.50 | 0 | 1 | 3 | 7 | 1962 | 0 | 9750 | 1540 | 248000.0 |
2 | 18494 | 2015 | 3 | 27 | 98053 | 47.6640 | -122.041 | 46538 | 3030 | 3030 | 0 | 2.0 | 3 | 2.50 | 0 | 1 | 3 | 10 | 1997 | 0 | 51450 | 3370 | 999000.0 |
3 | 14318 | 2014 | 12 | 22 | 98072 | 47.7434 | -122.106 | 19991 | 1990 | 1340 | 650 | 1.0 | 3 | 2.75 | 0 | 1 | 3 | 7 | 1977 | 0 | 9775 | 1750 | 450000.0 |
4 | 20542 | 2015 | 4 | 27 | 98028 | 47.7637 | -122.266 | 23030 | 1140 | 1140 | 0 | 1.0 | 2 | 1.00 | 0 | 1 | 3 | 8 | 1980 | 0 | 14260 | 1850 | 375000.0 |
5 | 21194 | 2015 | 5 | 5 | 98034 | 47.7190 | -122.173 | 8155 | 1770 | 1770 | 0 | 1.5 | 4 | 2.50 | 0 | 1 | 4 | 6 | 1970 | 1993 | 7360 | 1460 | 487585.0 |
6 | 14763 | 2015 | 1 | 6 | 98106 | 47.5352 | -122.361 | 4380 | 1230 | 1230 | 0 | 1.0 | 3 | 1.00 | 0 | 1 | 3 | 6 | 1947 | 0 | 6026 | 1525 | 343000.0 |
7 | 19223 | 2015 | 4 | 8 | 98033 | 47.6882 | -122.171 | 5107 | 1810 | 1810 | 0 | 2.0 | 3 | 2.25 | 0 | 1 | 3 | 8 | 1989 | 0 | 5454 | 1760 | 472500.0 |
8 | 14796 | 2015 | 1 | 7 | 98042 | 47.3774 | -122.160 | 13482 | 2980 | 1730 | 1250 | 1.0 | 5 | 2.75 | 0 | 1 | 4 | 8 | 1975 | 0 | 14800 | 2900 | 375000.0 |
9 | 21458 | 2015 | 5 | 8 | 98125 | 47.7153 | -122.284 | 5759 | 910 | 910 | 0 | 1.0 | 2 | 1.00 | 0 | 1 | 3 | 6 | 1951 | 0 | 7518 | 1520 | 371000.0 |
10 | 20634 | 2015 | 4 | 27 | 98144 | 47.5920 | -122.295 | 2268 | 1220 | 610 | 610 | 1.0 | 2 | 1.75 | 0 | 1 | 4 | 6 | 1909 | 0 | 1675 | 1240 | 850000.0 |
11 | 19912 | 2015 | 4 | 17 | 98092 | 47.3163 | -122.188 | 5250 | 1570 | 1570 | 0 | 1.0 | 3 | 2.00 | 0 | 1 | 3 | 7 | 1998 | 0 | 5250 | 1570 | 265000.0 |
12 | 21053 | 2015 | 5 | 3 | 98119 | 47.6501 | -122.370 | 5210 | 2010 | 1890 | 120 | 1.5 | 5 | 1.00 | 0 | 1 | 3 | 9 | 1927 | 0 | 5000 | 2330 | 1370000.0 |
13 | 17670 | 2015 | 3 | 16 | 98146 | 47.5088 | -122.371 | 7921 | 1430 | 1430 | 0 | 1.0 | 2 | 1.75 | 0 | 1 | 3 | 7 | 1983 | 0 | 8040 | 1290 | 361810.0 |
14 | 14826 | 2015 | 1 | 7 | 98148 | 47.4536 | -122.330 | 7582 | 1400 | 1400 | 0 | 1.0 | 3 | 1.50 | 0 | 1 | 3 | 7 | 1956 | 0 | 7872 | 1280 | 225000.0 |
- You can evaluate the mean absolute error of the model on the initial `testing_table` as follows:
In [6]:
tagged_test_table = testing_table.tag_columns("price", feature_names=[*feature_columns])
fitted_model.mean_absolute_error(tagged_test_table)
Out[6]:
105232.47698091382
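The mean absolute error is simply the average absolute difference between the true and predicted prices, so a result of roughly 105 000 means the model's predictions are off by about that much on average. A hand-rolled version, independent of Safe-DS and using made-up prices:

```python
# Mean absolute error: mean of |actual - predicted| over all rows.
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [285000, 440000, 612500]      # made-up true prices
predicted = [280000, 450000, 600000]   # made-up predictions
print(mean_absolute_error(actual, predicted))  # about 9166.67
```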