Data Visualization¶
This tutorial shows you how to use the safeds.plotting
module to visualize your data and understand it better.
The data¶
First, we need some data to visualize. For this, we use the common example of the Titanic disaster, which is also included in our safe-ds-examples
package. If you want to know more about the dataset, check out its documentation. Naturally, you can also use your own data.
from safeds.data.tabular.containers import Table
titanic = Table.from_csv_file("data/titanic.csv")
Let's have a quick look at the first 10 rows of the data:
titanic.slice_rows(length=10)
id | name | sex | age | siblings_spouses | parents_children | ticket | travel_class | fare | cabin | port_embarked | survived |
---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | f64 | i64 | i64 | str | i64 | f64 | str | str | i64 |
0 | "Abbing, Mr. Anthony" | "male" | 42.0 | 0 | 0 | "C.A. 5547" | 3 | 7.55 | null | "Southampton" | 0 |
1 | "Abbott, Master. Eugene Joseph" | "male" | 13.0 | 0 | 2 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 0 |
2 | "Abbott, Mr. Rossmore Edward" | "male" | 16.0 | 1 | 1 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 0 |
3 | "Abbott, Mrs. Stanton (Rosa Hun… | "female" | 35.0 | 1 | 1 | "C.A. 2673" | 3 | 20.25 | null | "Southampton" | 1 |
4 | "Abelseth, Miss. Karen Marie" | "female" | 16.0 | 0 | 0 | "348125" | 3 | 7.65 | null | "Southampton" | 1 |
5 | "Abelseth, Mr. Olaus Jorgensen" | "male" | 25.0 | 0 | 0 | "348122" | 3 | 7.65 | "F G63" | "Southampton" | 1 |
6 | "Abelson, Mr. Samuel" | "male" | 30.0 | 1 | 0 | "P/PP 3381" | 2 | 24.0 | null | "Cherbourg" | 0 |
7 | "Abelson, Mrs. Samuel (Hannah W… | "female" | 28.0 | 1 | 0 | "P/PP 3381" | 2 | 24.0 | null | "Cherbourg" | 1 |
8 | "Abrahamsson, Mr. Abraham Augus… | "male" | 20.0 | 0 | 0 | "SOTON/O2 3101284" | 3 | 7.925 | null | "Southampton" | 1 |
9 | "Abrahim, Mrs. Joseph (Sophie H… | "female" | 18.0 | 0 | 0 | "2657" | 3 | 7.2292 | null | "Cherbourg" | 1 |
The visualizations presented in this tutorial work on numerical data only, so let's remove the non-numerical columns. While we are at it, let's also drop the id column:
titanic_numerical = titanic.remove_columns(
    ["id", "name", "sex", "ticket", "cabin", "port_embarked"],
)
Correlation heatmap¶
The correlation heatmap is ideal for getting a quick overview of the relationships between the columns in your dataset. Each cell shows the correlation between two columns as a value between -1 and 1:
- A negative value (blue) means that if one column increases, the other tends to decrease.
- A positive value (red) means that if one column increases, the other tends to increase as well.
- A value of 0 (white) means that the two columns are not correlated at all.
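To make the meaning of a single heatmap cell concrete, here is a minimal sketch that computes one correlation by hand. It assumes that the cells hold Pearson correlation coefficients, which this tutorial does not state explicitly. The helper pearson is hypothetical and not part of Safe-DS. The values are the travel_class and fare entries of the ten rows shown above.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (the common 1/n factor cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# travel_class and fare of the ten rows shown above
travel_class = [3, 3, 3, 3, 3, 3, 2, 2, 3, 3]
fare = [7.55, 20.25, 20.25, 20.25, 7.65, 7.65, 24.0, 24.0, 7.925, 7.2292]

r_class_fare = pearson(travel_class, fare)
r_self = pearson(fare, fare)
print(f"travel_class vs. fare: {r_class_fare:.2f}")  # negative for these rows
print(f"fare vs. fare: {r_self:.2f}")  # a column correlates perfectly with itself
```

Ten rows are only enough to illustrate the mechanics. The heatmap itself is computed on the full table.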
Let's look at the correlations involving the travel_class column. Keep in mind that a lower class number means a better travel class:
- travel_class is negatively correlated with age. This means that older passengers tended to travel in better travel classes (1st/2nd class) than younger passengers.
- travel_class has no strong correlation with either siblings_spouses or parents_children.
- travel_class is, unsurprisingly, positively correlated with itself. You'll always find that the diagonal of a correlation heatmap is bright red.
- travel_class is negatively correlated with fare. Naturally, better travel classes were more expensive.
- travel_class is negatively correlated with survived. People in better travel classes were more likely to survive the accident.
Lineplot¶
Next, we use a lineplot to better understand the relationship between survived and fare. The line represents the mean value, and the shaded area around it shows a 95% confidence interval around the mean.
We can conclude that survivors paid around twice as much for their ticket as non-survivors.
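The following sketch shows one common way to obtain a mean and a 95% confidence interval per group: the normal approximation, mean ± 1.96 standard errors. This is an assumption for illustration. The plotting library may compute the interval differently, for example by bootstrapping. The helper mean_with_ci is hypothetical, and the values are the survived and fare entries of the ten rows shown above.

```python
import math
import statistics


def mean_with_ci(values: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation of the 95% CI."""
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * sem, mean + z * sem


# survived and fare of the ten rows shown above
survived = [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]
fare = [7.55, 20.25, 20.25, 20.25, 7.65, 7.65, 24.0, 24.0, 7.925, 7.2292]

# Group the fares by the value of survived, one group per x position.
groups: dict[int, list[float]] = {}
for s, f in zip(survived, fare):
    groups.setdefault(s, []).append(f)

summary = {s: mean_with_ci(fares) for s, fares in sorted(groups.items())}
for s, (mean, lower, upper) in summary.items():
    print(f"survived={s}: mean fare {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With only ten rows the intervals are wide and the group means do not reflect the full dataset. The sketch illustrates the calculation, not the conclusion drawn from the plot.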
Boxplot¶
A boxplot is a univariate plot that shows the distribution of a numerical column. Univariate means that a single variable is analyzed at a time.
- It displays the minimum, first quartile (25th percentile), median (50th percentile), third quartile (75th percentile), and maximum values.
- It also shows any potential outliers in the data.
- The boxplot of the age column covers the range from 0 to 80, including outliers. The box shows that the middle 50% of passengers were between 20 and 40 years old. Unlike the full range, this interval is barely influenced by outliers.
- Moreover, the boxplot clearly shows the outliers as dots above the T-shaped whisker. The median tells us that half of the passengers were younger than 28 years and the other half were older.
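The quantities a boxplot displays can be computed directly. The sketch below uses Python's statistics module and flags outliers with the common 1.5 x IQR whisker rule. Both the quartile method and the whisker rule are assumptions, since the tutorial does not say which ones the plot uses. The helper names are hypothetical, and the ages are those of the ten rows shown earlier.

```python
import statistics


def five_number_summary(values: list[float]) -> dict[str, float]:
    """Minimum, quartiles, and maximum of a sample with at least two values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def find_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Values lying more than k * IQR below Q1 or above Q3."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    lower_fence = summary["q1"] - k * iqr
    upper_fence = summary["q3"] + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# age of the ten rows shown earlier
ages = [42.0, 13.0, 16.0, 35.0, 16.0, 25.0, 30.0, 28.0, 20.0, 18.0]

age_summary = five_number_summary(ages)
print(age_summary)
print("outliers:", find_outliers(ages))
```

The box spans q1 to q3, the line inside the box marks the median, and every value returned by find_outliers would be drawn as a separate dot.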
Boxplot of all numerical columns¶
Let's look at the boxplots of all numerical columns:
- The second boxplot shows the siblings_spouses column, i.e. the number of siblings/spouses that passengers had aboard the Titanic, ranging from 0 to 8. The middle 50% of passengers had either 0 or 1 sibling/spouse. Values beyond the whiskers were flagged as outliers.
- The parents_children boxplot shows the number of parents/children that passengers had aboard. At least 75% of the passengers had neither parents nor children with them, because the box and both whiskers collapse onto 0. All other values were flagged as outliers.
- In the travel_class boxplot, the box spans the 2nd and 3rd class, so the middle 50% of passengers travelled in one of these two classes. The remaining passengers travelled first class.
- In the fare boxplot, the box lies roughly between 10 and 40, so the middle 50% of passengers paid a fare in that range. The median shows that about half of the passengers paid less than 20 and the other half paid more. The plot is also dominated by extreme outliers. An extreme outlier is a value that lies more than 3 times the interquartile range (IQR, the difference between the 75th and the 25th percentile) below the 25th percentile or above the 75th percentile.
- The last boxplot, survived, covers the range from 0.0 to 1.0. A boxplot is clearly not a good choice for such a variable, because it takes only two values: a passenger either survived (1) or died (0).
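The 3 x IQR rule for extreme outliers can be sketched as follows. The fare values are made up for illustration and are not taken from the Titanic dataset. The helper classify and the label "mild outlier" for values beyond 1.5 x IQR are likewise assumptions, not Safe-DS terminology.

```python
import statistics


def classify(values: list[float]) -> list[tuple[float, str]]:
    """Label each value as regular, mild outlier, or extreme outlier."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labelled = []
    for v in values:
        # Distance of v from the box; 0 if v lies inside the box.
        distance = max(q1 - v, v - q3, 0.0)
        if distance > 3 * iqr:
            label = "extreme outlier"
        elif distance > 1.5 * iqr:
            label = "mild outlier"
        else:
            label = "regular"
        labelled.append((v, label))
    return labelled


# Hypothetical fares, chosen so that both kinds of outliers occur.
example_fares = [8.0, 10.0, 12.0, 14.0, 16.0, 20.0, 26.0, 30.0, 40.0, 75.0, 250.0]

labelled_fares = classify(example_fares)
for value, label in labelled_fares:
    print(f"{value:6.1f}  {label}")
```

For these values the quartiles are 13 and 35, so the IQR is 22. A fare of 75 lies 40 above the box and is a mild outlier, while 250 lies 215 above the box and is an extreme outlier.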
Violinplot¶
Like boxplots, violinplots are also univariate but can offer a more precise visualisation of the underlying distribution of your data.
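The outline of a violin is typically a smoothed estimate of the data's density, mirrored around the axis. The sketch below assumes a Gaussian kernel density estimate with a hand-picked bandwidth of 5 years. Both choices are assumptions for illustration, and the plotting library may use a different kernel or bandwidth. The helper gaussian_kde is hypothetical, and the ages are those of the ten rows shown earlier.

```python
import math
from typing import Callable


def gaussian_kde(values: list[float], bandwidth: float) -> Callable[[float], float]:
    """Return a function that estimates the density of values at a point."""
    if not values or bandwidth <= 0:
        raise ValueError("need at least one value and a positive bandwidth")
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        # Sum of one Gaussian bump per observation, centred on that observation.
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density


# age of the ten rows shown earlier
ages = [42.0, 13.0, 16.0, 35.0, 16.0, 25.0, 30.0, 28.0, 20.0, 18.0]
density = gaussian_kde(ages, bandwidth=5.0)  # bandwidth chosen by hand

# Crude text rendering: the longer the bar, the wider the violin at that age.
for age in range(0, 61, 5):
    print(f"{age:3d} " + "#" * round(density(age) * 400))
```

Where a boxplot reduces the column to five numbers, the density curve also reveals features such as multiple peaks.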