Histograms are column-shaped charts, in which each column represents a range of the values, and the height of a column corresponds to how many values are in that range.
Histograms are among the most useful tools for summarizing a set of numeric values. Compared to other summarizing methods, they offer rich descriptive power and are quick to interpret, because people take in visual information readily.
However, if you are not careful, viewers will not be able to understand your histogram, or you may fail to get the most out of it. It is especially important to specify the optimal bin size.
Histogram vs Bar Graph
Histograms may look identical to bar graphs at first sight. Both consist of rectangles standing side by side. But the two are used for very different purposes.
Bar Graphs are used to compare different categories of data, and the scaling is applied to measure the extreme values of the categories within one chart. Columns can be placed vertically or horizontally.
If they are vertical, we speak of a column chart, where the vertical axis contains the scale while the horizontal axis shows categories such as age group, year, or month. A major disadvantage of this chart is that the columns cannot be labeled legibly if there are too many categories.
If you want to see how your data is distributed, a histogram is the better choice. In a histogram, the horizontal axis shows the value intervals (for example, ranges of ages or durations), and the vertical axis shows the frequency, that is, how many values fall into each interval. This way, you get a picture of the data distribution and can clearly see the outliers in your data set.
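As a rough illustration of what a histogram counts, the sketch below groups a handful of made-up ages into equal-width ranges and prints one text bar per range. The data, the `bin_counts` helper, and the bin width of 10 are assumptions chosen for this example, not values from the article.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each equal-width range.

    Each range is identified by its lower edge and covers
    [lower, lower + bin_width). Empty ranges do not appear in the result.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample: customers' ages.
ages = [23, 27, 31, 34, 35, 38, 41, 42, 44, 47, 52, 58, 63]
width = 10
counts = bin_counts(ages, width)

# One row per range; the bar length is the number of values in that range.
for lower, count in counts.items():
    print(f"{lower}-{lower + width - 1}: {'#' * count}")
```

A charting tool draws the same counts as columns; the text bars here only show which numbers end up behind each column.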
Why Choose Histograms?
If you have a set of data values, you probably want to share this information with your boss or co-workers to make better business decisions based on the information contained in these data. Instead of showing the numbers it is better if you use visuals.
Examples of the most common data types visualized on a histogram:
- Customers’ ages
- Monthly revenues
- Length of time visitors spend on your website
- The number of cars sold per agent
- Any other important numbers related to your business
You should share the information in a compact way because nobody wants to read numeric values one by one.
Alternatives are Wrong
Suppose you have a set of numbers: 1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999
Mean
The mean value (112.45) is very sensitive to outliers. Almost all real-world data has outliers, so the mean value can be very misleading.
Median
The median value (25) does not tell you anything about the distribution.
Full range
The full range (1 – 999) just shows the outliers.
Standard deviation
The standard deviation (294.1436) can be hard to interpret without a statistical background.
Variance
The variance (86520.47) can also be hard to interpret without a statistical background.
Interquartile range
The interquartile range (IQR) (24.5 – 28.5) covers the central 50% of your values and does not tell you anything about the other 50%.
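The figures quoted for the example set can be reproduced with Python's standard `statistics` module, as in the sketch below. It assumes the sample (n - 1) forms of variance and standard deviation and the "inclusive" quartile method, because those choices match the numbers given here; other tools may default to different conventions.

```python
import statistics

values = [1, 23, 24, 25, 25, 25, 26, 27, 30, 32, 999]

mean = statistics.mean(values)          # pulled upward by the outlier 999
median = statistics.median(values)      # middle value of the sorted list
variance = statistics.variance(values)  # sample variance (divides by n - 1)
stdev = statistics.stdev(values)        # square root of the sample variance

# Quartiles with linear interpolation between data points (assumed method).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"mean={mean:.2f} median={median} range={min(values)}-{max(values)}")
print(f"stdev={stdev:.4f} variance={variance:.2f} IQR={q1}-{q3}")
```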
The statistical values listed above are very useful, but always use them in context with other information, not as standalone metrics.
These numeric summarizing techniques do not include any information about spikes, or the shape of the distribution. Therefore, we suggest using a histogram to communicate the distribution. Additionally you can place the statistical numbers on the visualization to share more information.
How to Bin a Histogram?
The wider the range (bin width) you use, the fewer columns (bins) you will have.
number_of_bins = ceil( (maximum_value - minimum_value) / bin_width )
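The relationship above can be checked with a few lines of Python. This is a minimal sketch; the function name and the bin widths tried are assumptions for illustration, while the limits 12 and 69 are those of the example dataset described in this section.

```python
import math


def number_of_bins(minimum_value, maximum_value, bin_width):
    """Number of equal-width bins needed to cover the full value range."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return math.ceil((maximum_value - minimum_value) / bin_width)


# Values between 12 and 69, tried with a few round bin widths.
for width in (1, 2, 5, 10):
    print(width, number_of_bins(12, 69, width))
```

Doubling the bin width roughly halves the number of bins, which is the trade-off between detail and noise.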
Bins that are too wide can hide important details about distribution while bins that are too narrow can cause a lot of noise and hide important information about the distribution as well.
The width of the bins should be equal, and you should only use round values like 1, 2, 5, 10, 20, 25, 50, 100, and so on to make it easier for the viewer to interpret the data.
These histograms were created from the same example dataset that contains 550 values between 12 and 69.
- Too wide: the bins are too wide to reveal the unusual spike at around 53.
- Too narrow: the bins are too narrow, so many spikes appear just by coincidence.
- Unpretty: hard to read, because the bins have an unpretty width of 7.
- Unequal: hard to read, because the widths of the bins are not equal.
- Ideal: this one is good.
If you have a small amount of data, use wider bins to eliminate noise. If you have a lot of data, use narrower bins because the histogram will not be that noisy.
The Methods of Histogram Binning
For the dataset used above (550 values between 12 and 69), the methods give the following results:
Metric | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
---|---|---|---|---|---|
Number of bins | 23 | 11 | 17 | 14 | 16 |
Bin width | 2 | 5 | 3 | 4 | 4 |
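For orientation, the sketch below shows the commonly cited textbook forms of the five rules named in the table. These formulas are an assumption, since this article does not state which variants it used, and rounding conventions differ between tools, so a result can be off by one from the table (the square-root rule with rounding up gives 24 for 550 values, not 23). Scott's rule and the Freedman-Diaconis rule produce a bin width from the data itself, so they are written as functions of the values.

```python
import math
import statistics


def square_root_bins(n):
    """Square-root rule: number of bins grows with sqrt(n)."""
    return math.ceil(math.sqrt(n))


def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins."""
    return math.ceil(math.log2(n)) + 1


def rice_bins(n):
    """Rice rule: twice the cube root of n, rounded up."""
    return math.ceil(2 * n ** (1 / 3))


def scott_width(values):
    """Scott's rule: bin width from the sample standard deviation."""
    return 3.49 * statistics.stdev(values) * len(values) ** (-1 / 3)


def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: bin width from the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) * len(values) ** (-1 / 3)


n = 550
print(square_root_bins(n), sturges_bins(n), rice_bins(n))
```

The first three rules depend only on the number of values, which is why they cannot react to outliers or to the shape of the distribution.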
More examples
100/1000/10000 normally distributed numbers with mean 50 and standard deviation 10: (Number of bins)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
---|---|---|---|---|---|
#100 | 10 | 8 | 10 | 6 | 7 |
#1000 | 32 | 11 | 20 | 20 | 26 |
#10000 | 100 | 15 | 44 | 51 | 66 |
100/1000/10000 normally distributed numbers with mean 50 and standard deviation 10: (Bin width)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
---|---|---|---|---|---|
#100 | 4 | 5 | 4 | 8 | 6 |
#1000 | 2 | 6 | 3 | 4 | 3 |
#10000 | 1 | 6 | 2 | 2 | 1 |
100/1000/10000 uniformly distributed numbers with mean 50 and standard deviation 10: (Number of bins)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
---|---|---|---|---|---|
#100 | 10 | 8 | 10 | 5 | 5 |
#1000 | 32 | 11 | 20 | 10 | 10 |
#10000 | 100 | 15 | 44 | 21 | 21 |
100/1000/10000 uniformly distributed numbers with mean 50 and standard deviation 10: (Bin width)
Sample size | Square-root | Sturges | Rice | Scott | Freedman-Diaconis |
---|---|---|---|---|---|
#100 | 10 | 12 | 10 | 20 | 19 |
#1000 | 3 | 9 | 5 | 10 | 10 |
#10000 | 1 | 7 | 2 | 5 | 5 |
Open or Closed Histogram Bins?
This is where it gets tricky. If you look at a histogram with bins at 10-15-20-25…, are the occurrences of the value “20” represented in the second column, the third column, or both? Each value must be placed into exactly one bin.
Two options are available for doing so:
Option A - All bins should have left-open, right-closed intervals
First bin: | (10,15] | Contains these values: | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|
Second bin: | (15,20] | Contains these values: | 16 | 17 | 18 | 19 | 20 |
Third bin: | (20,25] | Contains these values: | 21 | 22 | 23 | 24 | 25 |
Option B - All bins should have left-closed, right-open intervals
First bin: | [10,15) | Contains these values: | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|
Second bin: | [15,20) | Contains these values: | 15 | 16 | 17 | 18 | 19 |
Third bin: | [20,25) | Contains these values: | 20 | 21 | 22 | 23 | 24 |
Avoid the Trap
You are free to choose either of these options, but be careful! With either option, one boundary value will not be included in the histogram. If you choose Option A, the value “10” will not be included in any of the bins. If you choose Option B, the value “25” will not be included in any of the bins.
The solution is to make the first or the last bin a fully closed interval. We suggest you do this with the last bin when using Option B, because uniform bins are usually more important on the left side than on the right. If you have integer values, it is recommended to label the bins “10-14,” “15-19,” and “20-25” instead of writing out only the edges “10,” “15,” “20,” “25.” Viewers of the histogram can then see exactly which values each bin contains.
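The suggested fix can be sketched as follows: bins are left-closed and right-open, except the last one, which is closed on both sides. The helper names `bin_index` and `integer_labels` are hypothetical and exist only for this example; the edges 10, 15, 20, 25 are the ones discussed above.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin holding value, or None if it is outside.

    Bins are [edge, next_edge), except the last bin, which also
    includes its right edge so the maximum value is not dropped.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2  # the last bin is fully closed
    return bisect_right(edges, value) - 1


def integer_labels(edges):
    """Labels such as '10-14' that show the integers each bin contains."""
    labels = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        upper = edges[i + 1] if i == last else edges[i + 1] - 1
        labels.append(f"{edges[i]}-{upper}")
    return labels


edges = [10, 15, 20, 25]
print(integer_labels(edges))
print([bin_index(v, edges) for v in (10, 20, 25)])
```

With this rule, the value “20” lands only in the third bin, and both “10” and “25” are counted.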
Summary
Remember to ask for a histogram whenever you are about to be tricked by a single average:
- If your marketing specialist says that your campaigns usually reach 1000 people
- If your salesman tells you that your purchasers spend approximately $100 in your shop
- If your car mechanic says your vehicle will be ready in seven days
- If your family physician tells you that you will recover from the disease in five days
- If your mom says that the lunch is going to be ready in approximately 15 minutes
Keep in Mind
AnswerMiner helps you to create automatic histograms, so you do not need to bother with finding ideal settings.
AnswerMiner is an exploratory data analysis platform with which you can create histograms and many other visualizations without coding or math. With the tool, you can explore and understand your data, create visualizations and dashboards, analyze correlations, and build a prediction tree.
As an intro, take a look at one of our free calculators to quickly use what you have learned reading this article. If you want to go beyond histograms, you can also try the platform.