This article was originally published in Perspective, Volume 18, Number 2, 1995, pp. 15-24.


DIAMOND and Ice: Visual Exploratory Data Analysis Tools

by Matthew Schall, Ph.D.

Introduction

Suppose someone gives you a dataset, say the earthquake record from the week that includes the January 17, 1994 Northridge quake in California. You have no initial hypotheses about the data, but you want to explore the relationships between the location, time of day and magnitude of earthquake occurrences. This is a classic exploratory data analysis (EDA) problem.

The earthquake data along with a second dataset will be used to demonstrate two new visual exploratory data analysis tools available at OAC. The datasets are in some sense the problem: we have data, but do not have a clear understanding of what relationships to look for between the variables. DIAMOND and Ice, as visual exploratory data analytic tools, are the solution to the problem, in that DIAMOND and Ice can be used to reveal relationships in data. The example datasets used here provide a framework to demonstrate both some of what DIAMOND and Ice can do and some of the challenges facing you when confronted with data you want to explore.

The Exploratory Data Analysis Tools

DIAMOND and Ice (BMDP, 1993) are visualization tools that can be used to reveal relationships between variables that can be very difficult to see using conventional statistical approaches. These two new software products overcome the three-dimensional barrier that makes it difficult to see pictures of more than just a few variables and their relationships. The best way to see how this works is to take a moment to look at the graphics that accompany this article, then come back to the article to read about their meaning.

We are used to looking at a three-dimensional world. But when there are nine variables to explore in a dataset, we need a nine-dimensional viewer. Both DIAMOND and Ice address the need for high-dimension (more than three dimensions) data visualization tools. In a quad-wise plot, one can see the relationships between four variables simultaneously. In a parallel coordinate plot, the number of visible dimensions is limited only by the size and resolution of the monitor being used. Both quad-wise and parallel coordinate plots are a part of DIAMOND. Ice displays up to nine simultaneous dimensions.

Visualizations of data, like those available in DIAMOND and Ice, are useful for more than just exploratory data analysis. They are an excellent means of presenting complex relationships in data to an audience that is not statistically sophisticated. A good picture may mean far more than the results of any numeric analysis. That is, a good picture may be worth a thousand words, but to a data analyst, a good picture may be worth a million numbers.

When data do not meet the requirements (e.g. multivariate normality) for conventional statistical tests, a picture can always be presented. As you will see in this article, you can visually evaluate hypotheses using DIAMOND and Ice, though their evaluation is only qualitative.

I will not attempt to demonstrate all of the capabilities of DIAMOND and Ice in this article. Instead the focus is on just a few examples of how both products can be used to reveal characteristics in data. Both products are far more useful than I can describe in a short article. Both DIAMOND and Ice are available at the OAC Visualization Laboratory, and I am advised by Jan de Leeuw, via the UCLA-stat email list, that DIAMOND and Ice are also currently available on laplace.stat.ucla.edu.

My exploratory questions concern the location, time, and magnitude of earthquakes in the earthquake data, and alcohol use among heavy drinkers in the alcohol data. In both cases I want to explore the data, both out of interest and potentially to generate research questions.

The Data

The earthquake data are based on the January 13-19, 1994 Weekly Earthquake Report for Southern California. The report is prepared by the following people associated with the California Institute of Technology: Kate Hutton, Seismological Laboratory; Egill Hauksson, Seismological Laboratory; Lucy Jones, U.S. Geological Survey. My thanks to Jan de Leeuw, who retrieved the data from the Internet.

The variables include the DATE in January on which an earthquake occurred, the TIME in 24-hour format, north LATITUDE, west LONGITUDE, and MAGNITUDE.

The second dataset has information on alcohol use and related measures, personality measures, and demographic measures. It was gathered from 600 introductory psychology students at UCLA in 1983. The alcohol-related measures are: ALCOHOL use, the number of locations visited in the last month where alcohol was AVAILABLE, each person's CONCERN over how much alcohol is consumed, and a scale which measures how each student uses alcohol as a way to COPE. The personality measures are EXTROVERsion and DISINHIBition. In addition, the dataset includes GENDER and GPA. This is a subset of the variables from a much larger study conducted between 1983 and 1986 by Irving Maltzman, Matthew Schall, and Allon Shiff, all of the Behavior and Alcohol Laboratory at UCLA.

Reading data into either DIAMOND or Ice is easy. Both programs accept space-delimited rectangular data, that is, one case per line with the same number of values on every line. A separate title file provides the variable names for the programs.
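
For readers who want to prepare or check such files outside DIAMOND and Ice, the following is a minimal Python sketch of reading space-delimited rectangular data together with a title file. The layout of the title file (one variable name per line) is my assumption, not something documented here, the function name read_rectangular is hypothetical, and the two data rows are made-up values used only to show the shape of the input.

```python
import io

def read_rectangular(data_file, title_file):
    """Return (names, rows) from a space-delimited data file and a title file."""
    # Assumption: the title file lists one variable name per line.
    names = [line.strip() for line in title_file if line.strip()]
    rows = []
    for line in data_file:
        fields = line.split()      # split on any run of whitespace
        if not fields:
            continue               # skip blank lines
        if len(fields) != len(names):
            # Rectangular data: every case must have one value per variable.
            raise ValueError(f"expected {len(names)} values, got {len(fields)}")
        rows.append([float(f) for f in fields])
    return names, rows

# Made-up values, not taken from the earthquake report.
titles = io.StringIO("DATE\nTIME\nLATITUDE\nLONGITUDE\nMAGNITUDE\n")
data = io.StringIO("17 4.5 34.2 118.5 6.0\n17 5.1 34.3 118.6 4.0\n")
names, rows = read_rectangular(data, titles)
```

Checking that every line has the same number of values before loading the file catches the most common reason a rectangular dataset fails to read.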

DIAMOND Main Window

The initial display is a matrix of all possible bivariate scatter plots. This scatter plot matrix presents a clear and easy to comprehend picture of all the bivariate relationships in the data. It has been described as a visual correlation matrix. The display is far more informative than the traditional correlation matrix, which provides information only about the strength of the linear relationship between two variables.
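
To make the limitation of the correlation matrix concrete, here is a small Python sketch with made-up numbers. It computes the ordinary Pearson correlation for a straight-line relationship and for a perfectly curved one; the function name pearson and the data are illustrative assumptions and have nothing to do with how DIAMOND computes its displays.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
r_line = pearson(x, [2 * v + 1 for v in x])   # exact straight line
r_curve = pearson(x, [v * v for v in x])      # exact parabola
```

The straight line gives a correlation of one, while the parabola gives a correlation of zero even though Y is completely determined by X. A scatter plot shows the curve immediately; the correlation coefficient hides it.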

With scatter plots, you can qualitatively evaluate nonlinearity, homogeneity of variance, and the strength of the relationships. Companion histograms allow you to look at the distributions of the individual variables, making it very easy to make a qualitative assessment of the distribution of each variable. All the plots in DIAMOND, as well as traditional univariate measures of central tendency, variability, skew, kurtosis, and a measure of nonlinearity, are available after DIAMOND displays the main menu of scatter plots. For example, pressing the Enter key with the cursor over any scatter plot causes another window to open, which explodes the plot of the two variables, and presents their correlation.
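
As a rough illustration of the univariate measures mentioned above, the following Python sketch computes the mean, standard deviation, skew, and kurtosis from the usual population moment formulas. I do not know which formulas DIAMOND uses, so these are an assumption, and its measure of nonlinearity is not reproduced here. The function name summary and the data are made up.

```python
import math

def summary(xs):
    """Mean, standard deviation, skew and excess kurtosis (population moments)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    return {
        "mean": mean,
        "sd": math.sqrt(m2),
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,          # zero for a normal distribution
    }

symmetric = summary([1, 2, 3, 4, 5])
right_tailed = summary([1, 1, 1, 2, 10])
```

The symmetric values have a skew of zero, while the second set, with one large value, has a positive skew. This is the same information a histogram conveys at a glance.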

Much of the information a data analyst needs before applying statistics to a problem can be qualitatively determined from a scatter plot. Traditionally the variable X has values plotted on the horizontal axis and the variable Y is plotted on the vertical, though DIAMOND makes no such distinction. Two very highly correlated variables will produce a scatter plot with an almost one-to-one linear mapping between values on the horizontal and vertical axes of the plot. A nonlinear relationship between the variables will show up in the scatter plot as curvature in the pattern of plotted points. Low correlations between variables produce spherical, rectangular or irregular distributions of points. One of the characteristics of data that needs to be evaluated before many statistical tools can be used is homogeneity of variance, that is, whether the variance of one variable is consistent at different levels of the other variable. For example, a scatter plot of X and Y in which the variance of X increases as Y increases will look like a horn.
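
The horn shape can also be checked numerically. The sketch below, in Python with made-up data, sorts the cases by Y, cuts them into a few consecutive bands, and computes the variance of X within each band. The banding scheme and the function name variance_by_level are my own illustrative choices; this is a quick check, not a formal test of homogeneity of variance.

```python
from statistics import pvariance

def variance_by_level(xs, ys, bands=3):
    """Variance of X within consecutive bands of Y (cases sorted by Y)."""
    pairs = sorted(zip(ys, xs))            # order the cases by their Y value
    size = len(pairs) // bands
    result = []
    for b in range(bands):
        start = b * size
        # The last band takes any leftover cases.
        stop = (b + 1) * size if b < bands - 1 else len(pairs)
        result.append(pvariance([x for _, x in pairs[start:stop]]))
    return result

# Made-up horn: X spreads out more and more as Y increases.
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9]
xs = [10, 11, 9, 5, 15, 10, 0, 20, 10]
spread = variance_by_level(xs, ys)
```

Variances that climb steadily from the first band to the last are the numeric counterpart of the horn in the scatter plot.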

Additional information is available by the application of color to the scatter plots. Figure 1 shows the alcohol dataset with the heavy drinkers colored in red. Just by glancing across the bottom row of the picture, which contains the scatter plots of ALCOHOL with all other variables, a comparison can be made between the data distributions of the heavy drinkers in red to that for the non- to moderate-drinkers in white.

Parametric Snake

The parametric snake plot represents the effect of one variable on the relationship between two other variables. The basic representation is that of a scatter plot with the added feature that the points are connected in order by the value of a third variable. The result is a connect-the-dots picture that visually conveys information that would otherwise be very difficult to understand.

In a statistical context, the parametric snake presents one more level of information than a scatter plot. With a parametric snake plot, I can look for the cause of a failure of homogeneity of variance. Irregularity, that is deviation from a straight line, in the connected plot of points indicates variability introduced by a third variable on the relationship between the other two. A line that is very irregular in one portion of the plot, and forms a much narrower band in another portion of the plot, indicates the connect-the-dots variable is associated with nonhomogeneous variances in different regions of the X-Y plot.
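
The construction of the snake is simple enough to sketch in a few lines of Python. The code below orders the X-Y points by a third variable and, as a crude stand-in for "irregularity," adds up the length of the connecting line. The path length is my own illustrative measure, not something DIAMOND reports, and the names snake_path and path_length and the data are made up.

```python
import math

def snake_path(xs, ys, ts):
    """Return the (x, y) points in the order given by the third variable."""
    order = sorted(range(len(ts)), key=lambda i: ts[i])
    return [(xs[i], ys[i]) for i in order]

def path_length(points):
    """Total length of the connect-the-dots line through the points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

xs = [0, 1, 2, 3]
ys = [0, 1, 2, 3]
# Third variable that follows the X-Y pattern: the line is drawn once, end to end.
smooth = path_length(snake_path(xs, ys, [1, 2, 3, 4]))
# Third variable that ignores the X-Y pattern: the line doubles back on itself.
jagged = path_length(snake_path(xs, ys, [1, 3, 2, 4]))
```

The same four points produce a longer, back-and-forth line when the connecting variable is unrelated to their position, which is what a very irregular snake looks like on the screen.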

In Figure 2, the earthquake data are used to show that the time of the earthquake is unrelated to its MAGNITUDE or LONGITUDE. The big quake is highlighted in green. MAGNITUDE is on the vertical axis, LONGITUDE on the horizontal axis, and the points are connected in order of TIME of day. From the very irregular pattern of the connected lines, in which they almost alternate from one end of the graph to the other, we can see that there is no relationship between the TIME of day at which an earthquake occurs and its MAGNITUDE or LONGITUDE. However, the two sets of closely packed lines, one vertical and one angled at about 50 degrees, show that MAGNITUDE and LONGITUDE are related to each other such that high magnitude earthquakes tend to occur in the same geographical area.

In Figure 3, the white picture shows no clear effect of COPE (the connecting variable) on ALCOHOL use (vertical axis) and alcohol AVAILABILITY (horizontal axis). Notice the horn shape in the scatter plot, which indicates a failure of homogeneity of variance. The faint red and white dots without any lines through them indicate missing values.

The failure of homogeneity of variance in Figure 3 is explained by composing the same graph for just the heavy drinkers in Figure 4. As you can see, the heavy drinkers are high on ALCOHOL use and the number of AVAILABILITY locations, but widely scattered on their self-reported use of alcohol to COPE. This variability causes the center and right portions of the white graph to be wide. Since nondrinkers do not use alcohol as a coping mechanism, and tend to go to fewer locations where alcohol is available, the left side of the white graph has less dispersion. The cause of a failure of homogeneity of variance is very difficult to isolate using conventional statistical tools, but is very easy to see with the parametric snake plot.

Parallel Coordinates

The wealth of information in any dataset can be overwhelming. For example, there are 45 unique bivariate scatter plots possible in a dataset with just 10 variables. And those are just the pairwise relationships. One of the nicest ways to look at the patterns of relationships between more than a few variables is a parallel coordinate plot. Using Figure 5 as an example, we can see two locational groups of earthquakes from looking at the longitude axis. The axis has a point on it corresponding to the longitude of each earthquake, with lower values towards the bottom and higher values towards the top. All the axes are structured in the same manner. The lines across the vertical variable axes connect the cases. For example, the "big one" is indicated by a purple line that is easy to find at the top of the MAGNITUDE axis. It is clear that the earthquakes that occurred at the greater LONGITUDE are those of greater MAGNITUDE. Since almost all the quakes occurred at approximately the same LATITUDE, the white and red lines cross there. Since all the red lines intersect the DATE axis on the higher valued (later) days, we can easily see that there is a relationship between DATE, LONGITUDE and MAGNITUDE.
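
The count of 45 scatter plots is just the number of ways to choose 2 variables out of 10. The Python sketch below confirms the arithmetic and shows one plausible way a parallel coordinate plot can be built: each variable is rescaled onto its own axis from 0 (lowest value) to 1 (highest value), and each case becomes a line through its position on every axis. The rescaling rule and the function name parallel_coordinates are my assumptions, not a description of DIAMOND's internals, and the data are made up.

```python
from math import comb

def parallel_coordinates(rows):
    """Rescale each variable to 0..1 and return one polyline per case."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for row in rows:
        line = []
        for value, lo, hi in zip(row, lows, highs):
            # A constant variable has no spread; place it mid-axis.
            line.append(0.5 if hi == lo else (value - lo) / (hi - lo))
        lines.append(line)
    return lines

pairs_for_ten = comb(10, 2)                 # number of bivariate scatter plots
lines = parallel_coordinates([[1, 10], [3, 20], [2, 15]])
```

Each inner list gives the height at which one case crosses each vertical axis, so drawing the plot is a matter of connecting those heights from left to right.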

A qualitative assessment of the correlations between adjoining variables is very straightforward in a parallel coordinate plot. If, as the values on one variable increase, so do the values on the other variable, there is a positive correlation. If the lines are all straight, or about a third are slightly positively sloped, a third slightly negatively sloped, and the rest are horizontal, there is no strong correlation between the variables. A crossing pattern, where the correlation is positive for some cases and negative for others, indicates an interaction effect due to whatever makes those two groups of cases different. This is an interaction effect in the classic analysis of variance sense of the word. I regularly recommend people use this type of plot to identify interaction effects in their data.
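
One way to put a number on what the eye sees between two adjoining axes is to count how often the lines cross. Two lines cross exactly when one case is higher than the other on the left axis but lower on the right axis. The Python sketch below computes the fraction of case pairs that cross; this counting approach is my own illustration (it is closely related to rank correlation) and is not a feature of DIAMOND. The function name crossing_fraction is hypothetical.

```python
from itertools import combinations

def crossing_fraction(left, right):
    """Fraction of case pairs whose lines cross between two adjoining axes."""
    pairs = list(combinations(range(len(left)), 2))
    crossings = sum(
        1 for i, j in pairs
        # Opposite ordering on the two axes means the two lines cross.
        if (left[i] - left[j]) * (right[i] - right[j]) < 0
    )
    return crossings / len(pairs)

no_crossings = crossing_fraction([1, 2, 3], [1, 2, 3])    # positive relationship
all_crossings = crossing_fraction([1, 2, 3], [3, 2, 1])   # negative relationship
```

A fraction near zero goes with a positive correlation, a fraction near one with a negative correlation, and a fraction near one half with no strong relationship.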

A look at Figure 6, which is a parallel coordinate plot of the alcohol data, shows a clear interaction between EXTROVERsion and GENDER, with males having higher extroversion scores than females. By looking at the red lines of the heavy drinkers at the bottom of the concern axes, we can see that the heaviest drinkers are not the people who are most CONCERNed about their drinking. Further, heavy drinkers who are moderate or low on CONCERN tend to have moderate to high scores on using alcohol to COPE. This is a complicated set of relationships that is easy to show in DIAMOND, but hard for the statistically unsophisticated to see in cross-tables or statistical results.

Quad-Wise Plots

A quad-wise plot in DIAMOND is two scatter plots with a line connecting the points that represent the same case in both plots. By connecting two scatter plots in this way, we can see the interrelationships between two pairs of variables. In Figure 7, ALCOHOL use is on the horizontal axis in the left plot, and DISINHIBition is on the vertical axis. In the right plot, CONCERN over drinking is on the horizontal axis and using alcohol to COPE is on the vertical axis. Heavy drinkers are in red. We can see a complex interaction between alcohol use, disinhibition, and concern. The heavy drinking red lines almost all slope down, while the non-drinking to moderate drinking white lines slope in all directions. This is because the heavy drinkers (red) tend to have comparatively higher scores on both DISINHIBition and ALCOHOL use, and lower scores on CONCERN over how much they drink (with one exception). The using alcohol-to-COPE scores for heavy drinkers are almost as variable as they are for everybody else.

Ice

Ice has one window, which uses glyphs, that is, graphical structures such as polygons or faces that convey information by their shape, size, color, and orientation. Ice was used to create Figures 8 and 9. Take a look at them to make the description that follows more concrete.

The glyphs are displayed in a 3-D orthogonal XYZ grid. The grid is based on the values of three variables that you select. Each cell in the grid represents a different conjunction of the data values for the three variables. In each of the cells a glyph is presented. The location of the glyph in this XYZ coordinate system can be as informative as location of a point in a point cloud, while the shape, orientation, and color of the glyph can reveal additional information about the values of other variables for the cases in that grid.
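
The idea of a cell as a conjunction of values on three variables can be sketched as follows in Python. Each of the three selected variables is cut into a few equal-width intervals, and each case is filed under the triple of interval numbers it falls into. The equal-width intervals, the number of intervals, and the function names are my assumptions; I do not know how Ice chooses its cell boundaries. The cases are made up.

```python
from collections import defaultdict

def cell_index(value, low, high, bins):
    """Which of `bins` equal-width intervals between low and high holds value."""
    if high == low:
        return 0
    k = int((value - low) / (high - low) * bins)
    return min(k, bins - 1)            # the maximum value goes in the last interval

def grid_cells(cases, axes, bins=3):
    """Group cases (dicts) into XYZ cells defined by three variable names."""
    ranges = {a: (min(c[a] for c in cases), max(c[a] for c in cases)) for a in axes}
    cells = defaultdict(list)
    for case in cases:
        key = []
        for a in axes:
            low, high = ranges[a]
            key.append(cell_index(case[a], low, high, bins))
        cells[tuple(key)].append(case)
    return dict(cells)

# Made-up cases, not taken from the alcohol dataset.
cases = [
    {"EXTROVER": 0, "AVAILABLE": 0, "GENDER": 0},
    {"EXTROVER": 10, "AVAILABLE": 5, "GENDER": 1},
    {"EXTROVER": 9, "AVAILABLE": 5, "GENDER": 1},
]
cells = grid_cells(cases, ["EXTROVER", "AVAILABLE", "GENDER"], bins=2)
```

Here the second and third cases land in the same cell, so a single glyph would summarize both of them.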

Ice uses a box-shaped glyph. The form of the glyph in a particular XYZ cell describes the characteristics of the cases in that cell. The attributes of the glyph are assigned to different variables. The amount (for example the degree of rotation) these attributes change across the XYZ coordinates can reveal very complex relationships in data.

The glyphs can have a hole in their centers, where the size of the hole can be used to convey information. This is called fullness, where fuller means less hole inside the glyph. Fullness is mapped onto variation, where less hole means more variability in the cases in that cell. The thickness of the glyph is mapped onto population, with a thicker glyph indicating more cases in that cell. The glyphs can also be reoriented. For example, the more alcohol is used to COPE by the people in a cell, the more the glyph is elevated or tilted upwards. Variables can be assigned to three orientation specifiers: azimuth (turning the face with the hole from side to side), elevation (tilting the face with the hole up or down), and roll (rotation around the hole in the glyph). Hue, the color of the glyph, is used in Figures 8 and 9 to show how much ALCOHOL the people in that cell drink. The colors are ordered white, magenta, blue, cyan, green, yellow, and red, with white being the least and red being most. Finally, X, Y, and Z convey information just as they do in a three-dimensional scatter plot. These nine attributes may be assigned to any of the variables, though by default thickness is mapped onto the number of cases in each cell and fullness is mapped onto the variation of the data inside each cell. With these nine attributes, we can look at nine variables in a dataset simultaneously, which is a wonderful data exploration capability.
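
To show how a handful of cases turns into one glyph, here is a hedged Python sketch for a single cell. Thickness is the number of cases, hue is chosen from the ordered color list by the cell's average on the hue variable, and fullness is taken to be the standard deviation of that same variable. That last choice is purely my assumption, since I have not said which variable's variation Ice uses, and the equal-width color steps and function names are likewise hypothetical.

```python
from statistics import mean, pstdev

# Color order as described in the text, from least to most.
HUES = ["white", "magenta", "blue", "cyan", "green", "yellow", "red"]

def hue_for(value, low, high):
    """Pick a color by where value falls between low and high (equal steps)."""
    if high == low:
        return HUES[0]
    k = int((value - low) / (high - low) * len(HUES))
    return HUES[min(k, len(HUES) - 1)]

def glyph(cell_values, low, high):
    """Glyph attributes for one cell, given its cases' values on the hue variable."""
    return {
        "thickness": len(cell_values),        # population of the cell
        "fullness": pstdev(cell_values),      # assumed measure of variation
        "hue": hue_for(mean(cell_values), low, high),
    }

# Made-up drinking scores on a 0-70 scale.
light = glyph([0, 0], 0, 70)
middle = glyph([30, 40], 0, 70)
heavy = glyph([70], 0, 70)
```

Orientation attributes such as elevation could be computed the same way, by rescaling a cell average onto an angle instead of onto the color list.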

Figure 8 shows a picture with EXTROVERsion mapped onto the X axis, number of places where alcohol was AVAILABLE mapped onto the Y axis, and GENDER on the Z axis. Variation and Population are mapped onto fullness and thickness, respectively. ALCOHOL use is mapped onto hue (color). Males are to the left of the picture, and females are to the right. The closer a cell is to the front of the cube, the higher the extroversion score. The closer to the top of the cube, the greater the average score on AVAILABLE for the people in that cell. At the bottom border are the non-drinkers in white. The heaviest drinkers are towards the upper left front corner. The yellow square is the heaviest drinker in the group, the greens are the next heaviest drinkers, the blues and lavender are moderate drinkers, and the reds are light drinkers.

From the picture we can see that: a) the heaviest drinkers are male, b) both males and females tend to be extroverted, and c) the heaviest male drinkers frequent more places where ALCOHOL is available than anyone else.

Figure 9 shows how easy it is to add another dimension to the data. Figure 9 is the same picture as Figure 8, with the addition of mapping COPE onto elevation. By looking at how much the glyphs are rotated upwards, we can see that the heaviest male drinkers use alcohol to COPE the most, and that male drinkers use alcohol to COPE more than female drinkers. The picture in Figure 9 displays seven variables at once: EXTROVERsion, AVAILABLEness, GENDER, using alcohol to COPE, quantity of ALCOHOL consumed in the last month, variation, and population.

Two more variables can be added to the picture, and their effects on the picture can be turned off and on easily, which allows you to flip between pictures to see the effects of a particular variable.

Conclusions

I do believe we have a lot of pictures in our future. There are a lot of reasons for saying this. A major use of DIAMOND and Ice is the exploration of data to develop hypotheses as a prelude to statistical analysis. But even without the use of formal statistics, a picture can be used to make convincing arguments by showing the complex interrelationships among variables. In addition, DIAMOND and Ice are useful because it is easier to describe the nature of data to a non-statistical audience with pictures than with numbers. But all these reasons devolve to my original premise: a picture can be worth a million numbers.

References


Figure 1: Initial display in DIAMOND of a matrix of all possible bivariate scatterplots in alcohol dataset


Figure 2: Parametric snake that plots magnitude and longitude of earthquakes connected in order of time of day


Figure 3: Plot of alcohol use and alcohol availability in order of using alcohol to cope


Figure 4: Plot of alcohol use and alcohol availability in order of using alcohol to cope, for just the heavy drinkers


Figure 5: Parallel coordinate plot of earthquake data, colored by longitude


Figure 6: Parallel coordinate plot of the alcohol data with heavy drinkers in red


Figure 7: Quad-wise plot of disinhibition by alcohol use and using alcohol to cope by concern over alcohol consumption


Figure 8: Six-dimensional visualization of extroversion, alcohol availability, gender, alcohol use, variation and population


Figure 9: Seven-dimensional visualization with cope added as glyph elevation


Matthew Schall, Ph.D., is a Consultant at OAC who regularly contributes to the American Statistical Association's Statistical Computing and Graphics, has lectured on high-dimension visualization at IBM, and is in frequent contact with visualization and exploratory data analysis software developers around the country.

*OAC/CS 21 Jun 95; Rev. 19 Dec 95