Visualizing and Analyzing Distributions of Nominal Variables

library(nomiShape)

Visualizing Nominal Distributions with nomiShape

Data can be measured on different scales, which fundamentally affects how they can be analyzed and visualized (Table 1). Four commonly recognized measurement scales are nominal, ordinal, interval, and ratio. Variables measured on continuous scales can take any value within a range and are often modeled using continuous probability distributions, whereas variables that take a finite or countable set of values are described by discrete distributions.

Among discrete and qualitative variables, nominal variables are unique in that they classify observations into categories without any inherent order, ranking, or numerical meaning. Nominal categories indicate membership only: an observation either belongs to a category or it does not. No information about magnitude, distance, or direction is implied. Common examples of nominal variables include species identities in an ecological community, political attitudes or party affiliation in social surveys, behavioral categories in ethological or psychological studies (e.g. play, aggression, vigilance), word types in a linguistic corpus, or thematic codes in qualitative research.

Although nominal variables lack intrinsic numeric structure, the frequency with which categories occur provides rich information about the organization of the system under study. Count data derived from nominal variables can reveal patterns of dominance, rarity, symmetry, and tail structure—features that are rarely formalized but are often visually apparent. The nomiShape package is designed to make these distributional properties explicit by combining centered visualizations with quantitative indices and model-based comparisons tailored specifically to nominal data.

Table 1. Summary of Nominal Data Characteristics and Visualization and Analysis Tools in the nomiShape package

Concept | Description
--- | ---
Variable Type | Nominal (categorical, unordered)
Core Properties | Discrete categories with no intrinsic order or numeric meaning
Typical Examples | Species in a biological community; political attitudes (e.g. conservative, liberal, undecided); behavioral categories (e.g. play, aggression, grooming); word types in a text corpus; qualitative themes or codes
What Can Be Counted | Frequencies, proportions, dominance, rarity
What Cannot Be Computed | Means, medians, variances, distances, or ranks derived from numeric magnitude
Common Visualizations | Standard bar plots (unordered or frequency-ranked)
Often-Ignored Distributional Structure | Dominance, symmetry, central concentration, tail heaviness
Main Analytical Challenge | Distributional “shape” exists but is difficult to formalize for nominal data
Visual Tools in nomiShape | Centered Bar Plot, Centered Dot Plot, Ranked Bar Plot, Ranked Dot Plot, Pareto Chart
Analytical Tools in nomiShape | Pielou’s evenness, Dominance index, Central concentration, Tail index
Model-Based Shape Comparison | AIC-based comparison of uniform, triangular, normal-like, and exponential (Pareto-like) shapes
Design Philosophy | Reveal latent distributional structure visually (via centering and ranking), then formalize it analytically

Handling nominal (categorical) data is an essential part of data analysis. Almost every data science project involves such variables, and students and practitioners alike should know how to store, summarize, visualize, and manipulate them. Traditional visualizations of nominal variables are typically unordered bar plots or bar plots sorted from high to low frequency, which emphasize category counts but rarely provide insight into distributional structure. As a result, concepts such as symmetry, skewness, dominance, and tail behavior, which are routinely discussed for numerical variables, are seldom considered for nominal data. Pareto charts and other ranked visualizations are exceptions: they can highlight the “vital few” categories of the 80/20 rule or reveal long-tailed distributions, as in ecological rank-abundance plots, where a few species are typically common and most are relatively rare. Such visualizations give insight into dominance and rarity patterns even for nominal variables.

The nomiShape package is designed to further explore the shape of nominal distributions. It offers multiple plotting functions, including classic visualizations such as Pareto charts and ranked bar plots, as well as novel centered bar and dot plots. These functions help users understand frequency structures, dominance patterns, and distributional characteristics of nominal variables, facilitating more nuanced analysis of categorical data.

This vignette demonstrates how to visualize and analyze the distributions of nominal variables with the nomiShape package. It covers ranked bar plots, ranked dot plots, Pareto charts, centered bar plots, and centered dot plots, followed by shape indices (evenness, dominance, central concentration, tail index) and an AIC-based comparison with theoretical shapes.

Plotting Shapes of Nominal Distributions

Ranked Bar Plots

Ranked bar plots order categories from the most frequent to the least frequent, providing a clear view of category dominance and distribution.

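# categories, categories2 and categories3 are not created in this vignette;
# they are presumably example data frames with an `animal` column.
ranked_barplot(categories, "animal")

# Same plot for the second example data set
ranked_barplot(categories2, "animal")

# Same plot for the third example data set
ranked_barplot(categories3, "animal")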
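
As a rough base R sketch of what ranking involves, a stand-in for categories3 can be rebuilt from the frequencies printed in the Pareto chart section and then ordered by count. This assumes the plotting functions take a data frame with one row per observation; the names freq, animals_df and ranked are illustrative and not part of nomiShape.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c("Sea sponge" = 110, "Starfish" = 75, "Octopus" = 20, "Crab" = 12,
          "Squirrel" = 9, "Copepod" = 7, "Snail" = 6, "Pufferfish" = 5,
          "Whale" = 3, "Lobster" = 2, "Sea god" = 1)

# One row per observation (assumed input layout, not confirmed by the package)
animals_df <- data.frame(animal = rep(names(freq), times = freq))

# Count each category and order from most to least frequent
ranked <- sort(table(animals_df$animal), decreasing = TRUE)
print(ranked)
```

Passing ranked to barplot() would give a plain base R counterpart of a ranked bar plot.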

Ranked Dot Plots

Ranked dot plots display categories as points ordered from the most frequent to the least frequent, allowing for easy comparison of category frequencies.

# Example usage of ranked_dotplot
ranked_dotplot(categories, "animal", connect = TRUE)

# Example usage of ranked_dotplot
ranked_dotplot(categories2, "animal", connect = TRUE, shade = TRUE)

# Example usage of ranked_dotplot
ranked_dotplot(categories3, "animal", connect = FALSE, shade = TRUE)

Pareto Charts

Pareto charts combine bar plots and line graphs to highlight the most significant categories in a nominal variable. They help identify the “vital few” categories that contribute most to the overall distribution.

# Example usage of pareto
pareto(categories3, "animal")
#>      Category Freq cumulative cumulative_percentage
#> 1  Sea sponge  110        110                  44.0
#> 2    Starfish   75        185                  74.0
#> 3     Octopus   20        205                  82.0
#> 4        Crab   12        217                  86.8
#> 5    Squirrel    9        226                  90.4
#> 6     Copepod    7        233                  93.2
#> 7       Snail    6        239                  95.6
#> 8  Pufferfish    5        244                  97.6
#> 9       Whale    3        247                  98.8
#> 10    Lobster    2        249                  99.6
#> 11    Sea god    1        250                 100.0
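
The cumulative columns in this table can be reproduced with a few lines of base R. The sketch below starts from the printed frequencies rather than from categories3 itself, and the object names (freq, pareto_tab, vital_few) are illustrative only.

```r
# Frequencies copied from the pareto() output above
freq <- c("Sea sponge" = 110, "Starfish" = 75, "Octopus" = 20, "Crab" = 12,
          "Squirrel" = 9, "Copepod" = 7, "Snail" = 6, "Pufferfish" = 5,
          "Whale" = 3, "Lobster" = 2, "Sea god" = 1)

# Rank from most to least frequent, then accumulate
freq <- sort(freq, decreasing = TRUE)
pareto_tab <- data.frame(
  Category = names(freq),
  Freq = unname(freq),
  cumulative = cumsum(unname(freq)),
  cumulative_percentage = round(100 * cumsum(unname(freq)) / sum(freq), 1)
)

# Number of top-ranked categories needed to reach 80% of all observations
vital_few <- which(pareto_tab$cumulative_percentage >= 80)[1]
print(pareto_tab)
print(vital_few)
```

Here three of the eleven categories already account for more than 80% of the observations, which is the kind of "vital few" pattern a Pareto chart is meant to expose.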

Centered Bar Plots

Centered bar plots arrange categories symmetrically around the center, with the most frequent categories in the middle and less frequent ones towards the edges. This layout helps to visualize the distribution shape effectively.

# Example usage of centered_barplot
centered_barplot(categories, "animal")

# Example usage of centered_barplot
centered_barplot(categories2, "animal", scale = "percent")

# Example usage of centered_barplot
centered_barplot(categories3, "animal")
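
One way to obtain such a centered layout is to rank the categories and then place them alternately to the right and left of the most frequent one. The sketch below is an assumption about how centering could be done, not the ordering rule used inside centered_barplot; center_order is a hypothetical helper.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c("Sea sponge" = 110, "Starfish" = 75, "Octopus" = 20, "Crab" = 12,
          "Squirrel" = 9, "Copepod" = 7, "Snail" = 6, "Pufferfish" = 5,
          "Whale" = 3, "Lobster" = 2, "Sea god" = 1)

# Hypothetical helper: odd ranks (1, 3, 5, ...) go to the right of the peak,
# even ranks (2, 4, 6, ...) go to the left in reversed order
center_order <- function(counts) {
  ranked <- sort(counts, decreasing = TRUE)
  idx <- seq_along(ranked)
  left <- ranked[idx %% 2 == 0]
  right <- ranked[idx %% 2 == 1]
  c(rev(left), right)
}

centered <- center_order(freq)
print(centered)
```

With eleven categories the most frequent one lands in position six, and frequencies fall away towards both edges. Which side receives the second-ranked category is an arbitrary choice in this sketch.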

Centered Dot Plots

Centered dot plots display categories as points arranged symmetrically around the center, with the most frequent categories in the middle. Optionally, points can be connected with lines to highlight trends.

# Example usage of centered_dotplot
centered_dotplot(categories, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories2, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories3, "animal", connect = TRUE, shade = TRUE)

Measuring Shapes of Nominal Distributions

Evenness

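Pielou’s evenness quantifies how evenly observations are distributed across the categories of a nominal variable. It ranges from 0 to 1, with values near 1 indicating that all categories are about equally frequent.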

# Example usage of pielou_evenness
pielou_evenness(categories, "animal")
#> [1] 0.9981314
# Example usage of pielou_evenness
pielou_evenness(categories2, "animal")
#> [1] 0.9462875
# Example usage of pielou_evenness
pielou_evenness(categories3, "animal")
#> [1] 0.6553931
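
Pielou's evenness is the Shannon entropy of the category proportions divided by its maximum, the logarithm of the number of categories. The sketch below applies that textbook formula to the frequencies printed in the Pareto chart section; it illustrates the index and is not the package's source code.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

p <- freq / sum(freq)          # category proportions
H <- -sum(p * log(p))          # Shannon entropy (natural log)
J <- H / log(length(p))        # Pielou's evenness: H relative to its maximum
print(J)
```

For these frequencies the result agrees with the value printed for categories3 above. The ratio does not depend on the base of the logarithm, as long as the same base is used in numerator and denominator.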

Dominance Index

The dominance index quantifies the degree to which a few categories dominate the distribution of a nominal variable.

# Example usage of dominance_index
dominance_index(categories, "animal")
#> [1] 0.091712
# Example usage of dominance_index
dominance_index(categories2, "animal")
#> [1] 0.113056
# Example usage of dominance_index
dominance_index(categories3, "animal")
#> [1] 0.295584
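
The value printed for categories3 is consistent with Simpson's dominance, the sum of squared category proportions. The sketch below shows that calculation on the frequencies from the Pareto chart section; that dominance_index uses exactly this definition is inferred from the output rather than stated in the vignette.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

p <- freq / sum(freq)   # category proportions
D <- sum(p^2)           # Simpson's dominance: probability that two random
                        # observations (with replacement) share a category
print(D)
```

The index equals 1/k for k equally frequent categories (about 0.091 for eleven categories) and approaches 1 when a single category holds almost all observations.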

Central Concentration

The central concentration quantifies how concentrated the distribution of a nominal variable is around its most frequent categories.

# Example usage of central_concentration
central_concentration(categories, "animal")
#> [1] 0.3
# Example usage of central_concentration
central_concentration(categories2, "animal")
#> [1] 0.448
# Example usage of central_concentration
central_concentration(categories3, "animal")
#> [1] 0.82
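
The vignette does not give a formula for this index. One definition that reproduces the value printed for categories3 is the share of observations held by the three most frequent categories, which are the three central bars in a centered plot. The sketch below rests on that assumption; the choice of three categories and the name top_k are guesses, not documented behavior.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

top_k <- 3                                  # assumed number of central categories
p <- sort(freq, decreasing = TRUE) / sum(freq)
cc <- sum(p[seq_len(top_k)])                # share held by the top_k categories
print(cc)
```

Under this reading, a perfectly uniform variable with eleven categories would score 3/11, and values close to 1 would mean that nearly all observations sit in the center of the plot.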

Tail Index

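The tail index quantifies the proportion of categories contributing to the lower part of the distribution, which is useful for identifying long-tail structures in nominal data. By default, it uses a threshold of 0.8, following the Pareto principle, but this can be adjusted through the threshold argument. Note that the three examples below use different thresholds (the default 0.8, then 0.9 and 0.75), so their values are not directly comparable.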

# Example usage of tail_index
tail_index(categories, "animal")
#> [1] 0.1818182
# Example usage of tail_index
tail_index(categories2, "animal", threshold = 0.9)
#> [1] 0.1818182
# Example usage of tail_index
tail_index(categories3, "animal", threshold = 0.75)
#> [1] 0.7272727
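
A reading of the index that matches the value printed for categories3 is the following: rank the categories, count how many are needed for the cumulative proportion to reach the threshold, and report the remaining categories as a fraction of all categories. The sketch below implements that reading. The helper tail_share is hypothetical, and the handling of the boundary (>= rather than >) is a guess.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

# Hypothetical helper: fraction of categories lying beyond the "head",
# where the head is the smallest set of top-ranked categories whose
# cumulative proportion reaches the threshold
tail_share <- function(counts, threshold = 0.8) {
  p <- sort(counts, decreasing = TRUE) / sum(counts)
  n_head <- which(cumsum(p) >= threshold)[1]
  (length(p) - n_head) / length(p)
}

print(tail_share(freq, threshold = 0.75))
```

With a threshold of 0.75 the first three categories form the head, so eight of the eleven categories fall in the tail.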

Detecting Theoretical Distributions in Nominal Variables

Visualizing Theoretical Shapes

The shape_comp_plot function allows users to visualize common theoretical distribution shapes (uniform, triangular, normal-like, and exponential/Pareto-like) for nominal variables in comparison with the observed distribution. This helps in understanding how different distributions appear when plotted.

# Example usage of shape_comp_plot
shape_comp_plot(categories, "animal")

# Example usage of shape_comp_plot
shape_comp_plot(categories2, "animal")

# Example usage of shape_comp_plot
shape_comp_plot(categories3, "animal")

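# starwars is not defined in this vignette; this assumes the data set of
# that name from the dplyr package, which has a `species` column
shape_comp_plot(dplyr::starwars, "species")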

AIC Comparison of Theoretical Shapes

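The shape_aic function computes the Akaike Information Criterion (AIC) for different theoretical shape models fitted to the distribution of a nominal variable, so that users can compare quantitatively how well each shape describes the observed data. Lower AIC values indicate better support, and DeltaAIC is each model's difference from the lowest AIC in the table. In the examples below, the uniform shape ranks first for categories, the triangular shape for categories2 (with the exponential shape only about 2 AIC units behind), and the exponential shape for categories3.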

# Example usage of shape_aic
shape_aic(categories, "animal")
#>         Shape      AIC   DeltaAIC
#> 1     Uniform 1198.948    0.00000
#> 2  Triangular 1267.201   68.25385
#> 3 Exponential 1994.144  795.19640
#> 4      Normal 2897.166 1698.21859
# Example usage of shape_aic
shape_aic(categories2, "animal")
#>         Shape      AIC   DeltaAIC
#> 1  Triangular 1137.919   0.000000
#> 2 Exponential 1140.017   2.098416
#> 3     Uniform 1198.948  61.028672
#> 4      Normal 1986.456 848.536855
# Example usage of shape_aic
shape_aic(categories3, "animal")
#>         Shape       AIC DeltaAIC
#> 1 Exponential  822.6242   0.0000
#> 2      Normal  961.7948 139.1706
#> 3  Triangular  981.4556 158.8314
#> 4     Uniform 1198.9476 376.3234
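
The uniform row can be checked by hand. Under a uniform shape every one of the k categories has probability 1/k, so the log-likelihood of n observations is n * log(1/k) and, with no free parameters, AIC = -2 * log-likelihood. The sketch below uses the frequencies from the Pareto chart section. The helper multinom_loglik is illustrative, it omits the multinomial coefficient (a constant that does not affect model comparison), and how nomiShape parameterizes the other three shapes is not shown here.

```r
# Frequencies copied from the pareto() output shown in this vignette
freq <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

# Log-likelihood of observed counts under given category probabilities
# (multinomial coefficient omitted)
multinom_loglik <- function(counts, probs) sum(counts * log(probs))

k <- length(freq)
ll_uniform <- multinom_loglik(freq, rep(1 / k, k))
n_params <- 0                                  # uniform shape: nothing to estimate
aic_uniform <- 2 * n_params - 2 * ll_uniform
print(aic_uniform)
```

Because this quantity depends only on n and k, it is the same for any data set with 250 observations in eleven categories, which is consistent with the identical uniform AIC reported in all three tables above.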