Regression¶
This tutorial uses Safe-DS on house sales data to predict house prices.
File and Imports¶
Start by creating a Python file with the suffix `.py`.
from safeds.data.tabular.containers import Table
pricing = Table.from_csv_file("data/house_sales.csv")
# For visualisation purposes, we only print out the first 15 rows.
pricing.slice_rows(length=15)
id | year | month | day | zipcode | latitude | longitude | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
0 | 2014 | 5 | 2 | 98001 | 47.3406 | -122.269 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
1 | 2014 | 5 | 2 | 98003 | 47.3537 | -122.303 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.5 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
2 | 2014 | 5 | 2 | 98006 | 47.5443 | -122.177 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
3 | 2014 | 5 | 2 | 98006 | 47.5746 | -122.135 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
4 | 2014 | 5 | 2 | 98006 | 47.5725 | -122.133 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.5 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
10 | 2014 | 5 | 2 | 98023 | 47.3256 | -122.378 | 33151 | 3240 | 3240 | 0 | 2.0 | 3 | 2.5 | 0 | 3 | 3 | 10 | 1995 | 0 | 24967 | 4050 | 604000 |
11 | 2014 | 5 | 2 | 98024 | 47.5643 | -121.897 | 16215 | 1580 | 1580 | 0 | 1.0 | 3 | 2.25 | 0 | 1 | 4 | 7 | 1978 | 0 | 16215 | 1450 | 335000 |
12 | 2014 | 5 | 2 | 98027 | 47.4635 | -121.991 | 35100 | 1970 | 1970 | 0 | 2.0 | 3 | 2.25 | 0 | 1 | 4 | 9 | 1977 | 0 | 35100 | 2340 | 437500 |
13 | 2014 | 5 | 2 | 98027 | 47.4634 | -121.987 | 37277 | 2710 | 2710 | 0 | 2.0 | 4 | 2.75 | 0 | 1 | 3 | 9 | 2000 | 0 | 39299 | 2390 | 630000 |
14 | 2014 | 5 | 2 | 98029 | 47.5794 | -122.025 | 67518 | 2820 | 2820 | 0 | 2.0 | 5 | 2.5 | 0 | 1 | 3 | 8 | 1979 | 0 | 48351 | 2820 | 675000 |
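If you do not have the `data/house_sales.csv` file at hand, you can experiment with a small in-memory table instead. The following is a minimal sketch, reusing the `Table` class imported above and assuming it accepts a dictionary that maps column names to lists of values; the values are made-up toy data, not part of the real dataset:
# Sketch only: toy values, and the dict-based Table constructor is an
# assumption about your installed Safe-DS version.
toy_pricing = Table({
    "id": [0, 1, 2],
    "sqft_living": [2200, 2090, 2160],
    "bedrooms": [4, 3, 4],
    "price": [285000, 285000, 440000],
})
toy_pricing.slice_rows(length=3)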
Cleaning your Data¶
At this point, it is usual to clean the data. Here's an example of how to do so:
pricing_columns = (
# Removes columns "latitude" and "longitude" from table
pricing.remove_columns(["latitude", "longitude"])
# Removes rows which contain missing values
.remove_rows_with_missing_values()
# Removes rows which contain outliers
.remove_rows_with_outliers()
)
# For visualisation purposes, we only print out the first 5 rows.
pricing_columns.slice_rows(length=5)
id | year | month | day | zipcode | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
0 | 2014 | 5 | 2 | 98001 | 9397 | 2200 | 2200 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 1987 | 0 | 9176 | 2310 | 285000 |
1 | 2014 | 5 | 2 | 98003 | 10834 | 2090 | 1360 | 730 | 1.0 | 3 | 2.5 | 0 | 1 | 4 | 8 | 1987 | 0 | 8595 | 1750 | 285000 |
2 | 2014 | 5 | 2 | 98006 | 8119 | 2160 | 1080 | 1080 | 1.0 | 4 | 2.25 | 0 | 1 | 3 | 8 | 1966 | 0 | 9000 | 1850 | 440000 |
3 | 2014 | 5 | 2 | 98006 | 8800 | 1450 | 1450 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1954 | 0 | 8942 | 1260 | 435000 |
4 | 2014 | 5 | 2 | 98006 | 10000 | 1920 | 1070 | 850 | 1.0 | 4 | 1.5 | 0 | 1 | 4 | 7 | 1954 | 0 | 10836 | 1450 | 430000 |
See how to perform further data cleaning in the dedicated Data Processing Tutorial.
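As a small illustration of such further cleaning, the sketch below assumes your Safe-DS version also provides `remove_duplicate_rows` on `Table`; check the API reference of your installed version before relying on it:
# Sketch only: remove_duplicate_rows is assumed to exist in your Safe-DS version.
pricing_deduplicated = pricing_columns.remove_duplicate_rows()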
Create Training and Testing Set¶
Split the house sales dataset into two tables: a training set containing 60% of the data, which will later be used to train a model that predicts house prices, and a testing set containing the remaining 40%.
train_table, testing_table = pricing_columns.split_rows(0.60)
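To sanity-check the split, you can compare the sizes of the two tables. The sketch below assumes the table exposes a `row_count` property; depending on your Safe-DS version, it may be called `number_of_rows` instead:
# Sketch only: row_count is an assumption about the Table API
# (some Safe-DS versions name it number_of_rows).
print(train_table.row_count)    # about 60% of the cleaned rows
print(testing_table.row_count)  # about 40% of the cleaned rows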
Mark the `price` column as the target variable to be predicted. Include the `id` column only as an extra column, which is completely ignored by the model:
extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
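The resulting tabular dataset bundles the feature columns, the target column, and the extra columns. The sketch below assumes `TabularDataset` exposes `features`, `target`, and `extras` properties in your Safe-DS version:
# Sketch only: these properties are assumptions about the TabularDataset API.
train_tabular_dataset.features  # every remaining column except "price" and "id"
train_tabular_dataset.target    # the "price" column
train_tabular_dataset.extras    # the "id" column, ignored during training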
Creating and Fitting a Regressor¶
Use a decision tree regressor as the model for the regression. Pass the `train_tabular_dataset` to the `fit` method of the model:
from safeds.ml.classical.regression import DecisionTreeRegressor
fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)
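Note that `fit` returns a new, fitted model and leaves the original regressor unchanged. The sketch below assumes the model exposes an `is_fitted` property:
# Sketch only: is_fitted is an assumed property of Safe-DS models.
model = DecisionTreeRegressor()
fitted_model = model.fit(train_tabular_dataset)
print(model.is_fitted)         # False: the original model stays unfitted
print(fitted_model.is_fitted)  # True: only the returned model is fitted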
Predicting with the Fitted Regressor¶
Use the fitted decision tree regressor, which we trained on the training dataset, to predict the prices of the houses in the testing set.
prediction = fitted_model.predict(testing_table)
# For visualisation purposes we only print out the first 15 rows.
prediction.to_table().slice_rows(length=15)
id | year | month | day | zipcode | sqft_lot | sqft_living | sqft_above | sqft_basement | floors | bedrooms | bathrooms | waterfront | view | condition | grade | year_built | year_renovated | sqft_lot_15nn | sqft_living_15nn | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | i64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 |
10549 | 2014 | 10 | 13 | 98103 | 5000 | 1240 | 1000 | 240 | 1.0 | 2 | 1.0 | 0 | 1 | 3 | 7 | 1920 | 0 | 3500 | 1480 | 550661.111111 |
17590 | 2015 | 3 | 14 | 98002 | 7312 | 2010 | 2010 | 0 | 1.0 | 4 | 2.0 | 0 | 1 | 4 | 7 | 1976 | 0 | 7650 | 2010 | 269944.444444 |
10889 | 2014 | 10 | 17 | 98103 | 3220 | 1120 | 1120 | 0 | 1.0 | 2 | 1.0 | 0 | 1 | 4 | 7 | 1923 | 0 | 3220 | 1440 | 550661.111111 |
12511 | 2014 | 11 | 14 | 98144 | 2457 | 1950 | 1950 | 0 | 3.0 | 2 | 2.5 | 0 | 1 | 3 | 8 | 2009 | 0 | 1639 | 1650 | 382300.0 |
20572 | 2015 | 4 | 27 | 98056 | 5038 | 1220 | 1220 | 0 | 1.0 | 3 | 1.0 | 0 | 1 | 5 | 6 | 1942 | 0 | 5038 | 1140 | 195130.555556 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
9097 | 2014 | 9 | 18 | 98003 | 7400 | 1130 | 1130 | 0 | 1.0 | 4 | 1.0 | 0 | 1 | 4 | 7 | 1969 | 0 | 7379 | 1540 | 223916.666667 |
8454 | 2014 | 9 | 8 | 98023 | 8470 | 840 | 840 | 0 | 1.0 | 3 | 1.0 | 0 | 1 | 4 | 6 | 1961 | 0 | 8450 | 840 | 109291.666667 |
7829 | 2014 | 8 | 26 | 98126 | 4025 | 820 | 820 | 0 | 1.0 | 2 | 1.0 | 0 | 3 | 5 | 6 | 1922 | 0 | 5750 | 1410 | 330190.0 |
19952 | 2015 | 4 | 17 | 98198 | 10187 | 1120 | 1120 | 0 | 1.0 | 3 | 1.75 | 0 | 1 | 3 | 7 | 1968 | 0 | 8736 | 1900 | 201880.0 |
18382 | 2015 | 3 | 26 | 98042 | 5929 | 2210 | 2210 | 0 | 2.0 | 4 | 2.5 | 0 | 1 | 3 | 8 | 2004 | 0 | 5901 | 2200 | 311491.666667 |
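To compare the predictions with the actual sale prices, you can pull the `price` column out of both tables. The sketch below reuses the `Table` class imported above and assumes `get_column`, `Column.to_list`, and the dict-based `Table` constructor are available in your Safe-DS version:
# Sketch only: get_column, to_list, and Table(dict) are assumptions
# about the Table/Column API in your Safe-DS version.
predicted_prices = prediction.to_table().get_column("price").to_list()
actual_prices = testing_table.get_column("price").to_list()
comparison = Table({"actual": actual_prices, "predicted": predicted_prices})
comparison.slice_rows(length=5)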
Evaluating the Fitted Regressor¶
You can compute the mean absolute error of the model on the `testing_table` as follows:
fitted_model.mean_absolute_error(testing_table)
92480.98339349499
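A mean absolute error of roughly 92,000 means that, on average, a predicted price is off by about that many dollars. Depending on your Safe-DS version, the fitted regressor may offer further metrics; the sketch below assumes `mean_squared_error` and `coefficient_of_determination` are available:
# Sketch only: these metric methods are assumptions about the regressor API
# in your Safe-DS version.
fitted_model.mean_squared_error(testing_table)
fitted_model.coefficient_of_determination(testing_table)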
Full Code¶
from safeds.data.tabular.containers import Table
from safeds.ml.classical.regression import DecisionTreeRegressor
pricing = Table.from_csv_file("data/house_sales.csv")
pricing_columns = (
pricing.remove_columns(["latitude", "longitude"])
.remove_rows_with_missing_values()
.remove_rows_with_outliers()
)
train_table, testing_table = pricing_columns.split_rows(0.60)
extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)
prediction = fitted_model.predict(testing_table)
fitted_model.mean_absolute_error(testing_table)
92552.00101077871