When I run the Box command and choose the “Violin PDF” type, a setting labeled “Window” appears below, with a default value of 1. It appears to be the size of the bins used to draw the contour line that envelops the data. Is that correct? I could not find any documentation on it.
Also, when I overlay a regular box plot with the IQR option, it shows many outliers, but the violin doesn’t quite go up that far. Shouldn’t the violin outline extend up to include the outliers, even if it’s very skinny (i.e. essentially a line)?
Q1 – Yes. You can imagine the window as the bin width for a corresponding histogram. The violin uses a density estimate to create a smooth representation of the data.
To illustrate, here is some data we graphed with the Box command.
Here is the Box command with the default Whisker, and then three other representations: Violin PDF, Probability (sideways histogram), and a Smooth version of the histogram. All three of these have a window option.
Here is what it looks like with the window set to 0.5. The violin is the same as the smooth version but as a mirror image. They are derived with a kernel density estimation.
A smaller window shows more detail.
A larger window has more smoothing.
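To make the effect of the window concrete, here is a minimal Python sketch of a Gaussian kernel density estimate. It is not the software's actual implementation: treating the window as the kernel's standard deviation is an assumption, and the names `gaussian_kde` and `count_peaks` and the sample data are made up for illustration. It counts the peaks in the smoothed curve for a small and a large window.

```python
import math

def gaussian_kde(data, grid, window):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    `window` is used as the kernel's standard deviation. That mapping is
    an assumption for this sketch, not the software's documented behavior.
    """
    norm = 1.0 / (len(data) * window * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - d) / window) ** 2) for d in data)
        for x in grid
    ]

def count_peaks(density):
    """Count strict local maxima in a sampled curve."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )

# Hypothetical sample: two clusters, one near 2 and one near 6.
data = [1.8, 2.0, 2.1, 2.3, 5.7, 5.9, 6.0, 6.2, 6.4]
grid = [i * 0.05 for i in range(-40, 201)]  # -2.0 to 10.0

narrow = gaussian_kde(data, grid, window=0.5)
wide = gaussian_kde(data, grid, window=3.0)

peaks_narrow = count_peaks(narrow)  # small window: both clusters stay visible
peaks_wide = count_peaks(wide)      # large window: clusters blur together
print(peaks_narrow, peaks_wide)
```

With the small window the two clusters show up as separate bumps; with the large window they merge into a single smooth hump.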
Maybe more detail than you need, but hopefully helpful.
One approach we have seen is cutting off the ends of the violins where they exceed the data min/max. We do not do that and instead show the full result of the kernel density estimation. If this is an option anyone needs, let us know.
We’ll answer Q2 in the following post.
Q2 – When you have outliers they typically show up in the violin as disconnected sections.
Here is an example with a lot of data clustered in one section and then just a couple of outliers above the violin. You see them as very small lines because their width is scaled relative to the density of the entire data set.
We did some testing and found that when the graph width was reduced these disconnected regions may not be rendered. Here is the same graph now at a smaller width and you do not see the outlier with the violin.
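A rough Python sketch of why a narrow graph can lose the outlier: the outlier's bump is only a small fraction of the violin's peak width. The data, the function name `kde_at`, the pixel widths, and the idea that anything thinner than about one pixel may vanish are all assumptions for illustration. How the software actually rasterizes thin regions is not stated here.

```python
import math

def kde_at(data, x, window):
    """Gaussian kernel density estimate at a single point x."""
    norm = 1.0 / (len(data) * window * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / window) ** 2) for d in data)

# Hypothetical data: 200 clustered values between 9 and 11, plus two outliers.
cluster = [9.0 + 0.01 * i for i in range(200)]
outliers = [20.0, 21.5]
data = cluster + outliers
window = 0.5

# Peak density of the main body, sampled on a grid from 8 to 12.
peak = max(kde_at(data, 8.0 + 0.05 * i, window) for i in range(81))
# Density at the first outlier.
bump = kde_at(data, 20.0, window)
ratio = bump / peak

# If the peak is drawn at `half_width_px`, the outlier bump is drawn at
# ratio * half_width_px. The pixel values are illustrative only.
for half_width_px in (200, 40):
    bump_px = ratio * half_width_px
    print(f"half width {half_width_px}px -> outlier bump {bump_px:.2f}px")
```

In this sketch the bump is wider than a pixel on the wide graph and thinner than a pixel on the narrow one, which matches what we saw in testing.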
It’s possible that this is happening to you. I would be curious: if you make the graph wider, do you then see the location of the outliers?
In any case, this is an issue we need to check into.
For my current dataset, increasing the width to the maximum did not make a line appear extending out to the max and min values. So I overlaid a boxplot on top, so the reader can see where the max and min are:
Without boxplot overlay:
With boxplot overlay:
However, that uncovered another issue: I set the boxplots so that the whiskers represent min and max, and the violin plots extend below the min. You can see that in the purple and green violins above, where the min whiskers end about 0.1 (or more?) above the bottoms of the violins. In the violin plot settings, I have the window set to 0.05. Maybe that’s just how violin plots work, and they don’t tightly wrap around all data points? But it sure seems misleading to have the reader think that a distribution goes negative when the boxplot whiskers clearly show the min values stop at or near zero.
(Otherwise, this was very easy to use, compared to other software! So thanks.)
Overlaying the box plot is a great solution. In fact, there are many interesting combinations you can create using the Box command (overlaying points, …)
In the Beta version, we have a new Point + Interval option that you might also want to try.
One thing to clarify: when we found that increasing the width would help, it was the width of the graph itself in the Canvas settings. That said, we applied a fix in the latest Beta so you will always have at least a line showing where some data exists.
Here is a redo of the graph we posted before, using the latest Beta. The violin plot is shown for two window settings. In this case, when the window = 2, you see less detail in the distribution of the points. Also, the line where the outlier is located is wider compared to the version with the window = 1.
You’ll also notice that the entire range of the violin gets larger with a larger window, and in both cases it extends beyond the data. This is expected with kernel density estimation: the approach is fitting a series of Gaussian curves to your data, and those curves have tails. So unless your data is a perfect Gaussian, the ends will absolutely extend beyond the min and max.
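The following Python sketch shows the effect numerically. It is an illustration under assumptions, not our implementation: the window is treated as the standard deviation of a Gaussian kernel, the "visible" part of the violin is taken to be where the density is at least 1% of its peak, and the data and the name `visible_range` are hypothetical.

```python
import math

def gaussian_kde(data, grid, window):
    """Gaussian kernel density estimate; `window` is the kernel std dev."""
    norm = 1.0 / (len(data) * window * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - d) / window) ** 2) for d in data)
        for x in grid
    ]

def visible_range(data, window, step=0.01, cutoff=0.01):
    """Return (low, high): the outermost grid points whose density is at
    least `cutoff` times the peak density. The 1% cutoff is arbitrary."""
    lo = min(data) - 5 * window
    hi = max(data) + 5 * window
    n = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n)]
    dens = gaussian_kde(data, grid, window)
    peak = max(dens)
    visible = [x for x, d in zip(grid, dens) if d >= cutoff * peak]
    return visible[0], visible[-1]

# Hypothetical non-negative data with a hard lower limit at zero.
data = [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2]

low_small, high_small = visible_range(data, window=0.05)
low_large, high_large = visible_range(data, window=0.2)

print("data range:  ", min(data), max(data))
print("window 0.05: ", round(low_small, 2), round(high_small, 2))
print("window 0.2:  ", round(low_large, 2), round(high_large, 2))
```

Both estimates reach below zero even though no data value is negative, and the larger window reaches further in both directions.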
Here is an example to illustrate. This is data that is normally distributed with a mean of 0 and a standard deviation of 1. We plotted the data with the box plot, the violin plot, and a sideways histogram. You can see how the violin plot and the sideways histogram extend basically to the min and the max of the data. These both have a window of 0.5 here.
Of course the great thing about violin plots is that they can show you detail that a box plot simply cannot. Here’s another example where we have a bimodal distribution.
Some software programs do, in all cases, cap the violin plots at the min and the max values. We do not recommend this. If you find the violin plots you are creating extend too far beyond the data, it is better to adjust the window. To date, no one has asked us for a capping option. If this is something people want, then it would be good to let us know.
You also might want to consider using the sideways histogram. It is a nice balance: it still shows the detail of the distribution, but it does not smooth the data, so you would not have issues with long tails.
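A small Python sketch of the difference, assuming bins of width equal to the window that start at the data minimum (how the software actually aligns its bins is an assumption here, and the data and function names are hypothetical):

```python
import math

def histogram(data, window):
    """Counts for bins of width `window`, with the first bin starting at
    min(data). The bin alignment is an assumption for this sketch."""
    start = min(data)
    n_bins = int((max(data) - start) // window) + 1
    counts = [0] * n_bins
    for d in data:
        index = min(int((d - start) / window), n_bins - 1)
        counts[index] += 1
    edges = [start + i * window for i in range(n_bins + 1)]
    return edges, counts

def kde_at(data, x, window):
    """Gaussian kernel density estimate at a single point x."""
    norm = 1.0 / (len(data) * window * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / window) ** 2) for d in data)

# Hypothetical non-negative data.
data = [0.0, 0.1, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2]
window = 0.25

edges, counts = histogram(data, window)
density_below_zero = kde_at(data, -0.1, window)

print("lowest bin edge:", edges[0])           # never below the data minimum
print("bin counts:", counts)
print("KDE at -0.1:", density_below_zero)     # positive: the smooth curve leaks
```

The histogram has nothing below the smallest data value, while the smoothed estimate still assigns some density to negative values.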
Again, maybe more detail than you need but hopefully this is helpful! Glad you are finding the software easy to use 🙂
It would also be great if you could try the Beta and confirm that you now see a line anywhere a point exists. You can always switch back to the release version if you prefer.
This is a nice explanation. Thanks.
I think you’re saying that when you fit a Gaussian curve to a distribution that ends abruptly, the curve can’t bend sharply enough to end exactly at the data minimum. While I understand that concept, it seems it could be problematic. In my case, the violins extended into negative values, and with that particular dataset it didn’t make sense to suggest to the reader that there are negative values in that distribution. I guess I should probably use the sideways histogram.