This article was originally published in Perspective, Volume 19, Number 1, 1995, pp. 28-32.


Using XLISP-STAT to Explore New and Unusual Statistical Paradigms

by Michael Mitchell, Ph.D.

XLISP-STAT is an extension of the Xlisp programming language that provides fundamental statistical building blocks. These can be combined to explore new and unusual statistical paradigms more easily than with traditional software packages. Written by Luke Tierney of the University of Minnesota School of Statistics, XLISP-STAT expands upon the Xlisp language, which is itself an extension of Common Lisp. This article briefly describes the capabilities of XLISP-STAT and provides an example XLISP-STAT program that uses simulated data to examine how the type I error rate is affected by violations of ANOVA assumptions.

XLISP-STAT contains functions for graphical display of data (e.g. histograms, scatter plots, and 3-d spin plots), descriptive statistics (mean, median, standard deviation), and basic statistical procedures (regression and one-way analysis of variance). These fundamental building blocks, when combined with custom-written routines, enable a wide range of statistical problems to be studied. Such custom routines can be written by the user or developed by a third party. A growing collection of third-party routines has been assembled by Jan de Leeuw, Professor of Psychology and Mathematics, Director Interdivisional Program in Statistics, and can be obtained via anonymous ftp to ftp.stat.ucla.edu. Some of these offerings provide functions for:

More information about XLISP-STAT and Jan de Leeuw's UCLA Statistics Program can be viewed on the UCLA Statistics Home Page via the World Wide Web (WWW) at the URL http://www.stat.ucla.edu.

Pros and Cons of Using XLISP-STAT

Whether users choose to write their own routines or use those written by others, they should consider the pros and cons associated with each approach. Writing your own routines opens up a world of novel statistical procedures which are not commonly available in standard statistical packages. However, familiarity with the Xlisp language is needed to write and debug such routines, which may be quite time consuming. By contrast, routines written by others may save development time, but users should verify that the routine is performing the expected analysis, and that the results of the analysis are valid.

While XLISP-STAT offers capabilities not commonly found in traditional statistics packages (like SAS, SPSS, BMDP), it lacks some basic features found in these packages. For example, XLISP-STAT does not currently support missing values and can only read space-delimited text files. Over time, these gaps may be filled as new versions of XLISP-STAT are released and as third parties develop custom routines (such as the routines mentioned above). Meanwhile, XLISP-STAT may be best used in tandem with a traditional statistics package such as SAS. A package like SAS could be used for data management and validation; once the data are ready for analysis, SAS could then produce a space-delimited text file to be read by XLISP-STAT.

Version 2.1 of XLISP-STAT is available on OAC's SPx/cluster and can be invoked by entering the command:

    xlispstat

LoadLeveler should be used when running XLISP-STAT in batch mode. Instructions on using LoadLeveler and sample LoadLeveler command files can be found on the SPx gopher server at URL: gopher://cluster.oac.ucla.edu or by using the "xgopher" command when logged onto the SPx/cluster.

If you need assistance in using XLISP-STAT or in transferring XLISP-STAT files to AIX on the SPx/cluster, contact OAC Consulting at (82)5-7452 for an appointment. Unfortunately, OAC Consultants cannot help you debug custom XLISP-STAT routines, written either by users or third parties. If you would like an SPx/cluster account for using XLISP-STAT, contact the OAC User Relations Office in 4302 MSA.

An XLISP-STAT Example: Analysis of Variance Simulation

The building blocks provided by XLISP-STAT are well suited for writing programs which perform Monte Carlo simulation studies. Such studies perform a particular statistical analysis repeatedly (say 10,000 times) and then examine how the results vary across different conditions. Such analyses are possible within a program like SAS, by creating simulated input and outputting the results of statistical procedures, but this could lead to the creation of some very large datasets. For example, a modest study examining 10 different conditions, with 10,000 iterations per condition and 40 data records per iteration, would need 10 * 10,000 * 40 = 4,000,000 records for the input data set, and would create 10 * 10,000 = 100,000 output records, each containing the summary statistics for one iteration within one condition. Perhaps clever SAS programming would avoid the creation of all of these records at one time, but standard SAS procedures would generally require the creation of this many input and output records via SAS datasets. By contrast, at each iteration XLISP-STAT can create the simulated input data in memory variables, analyze the simulated data, and make the summary statistics immediately available in other memory variables for processing. The information carried over from one iteration to the next can be as little as one memory variable, a much more frugal use of the computer's resources.
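The record counts in this example, and the contrast with a single running tally, can be sketched as follows. This is an illustrative sketch in Python rather than XLISP-STAT, and it is not the article's program: the function names run_one_iteration and count_rejections are hypothetical, and the "analysis" inside is only a placeholder decision rule.

```python
import random

# Figures taken from the example in the text.
conditions = 10
iterations = 10_000
records_per_iteration = 40

# Dataset-based approach: every simulated record and every summary row is stored.
input_records = conditions * iterations * records_per_iteration
output_records = conditions * iterations


def run_one_iteration(rng, n):
    """Hypothetical stand-in for: simulate data, analyze it, return a yes/no result."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sample_mean = sum(sample) / n
    return abs(sample_mean) > 1.96 / n ** 0.5  # placeholder decision rule


def count_rejections(iterations, n, seed=0):
    """In-memory approach: the only state carried between iterations is one counter."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(iterations):
        if run_one_iteration(rng, n):  # the sample is discarded after each pass
            rejections += 1
    return rejections


print(input_records, output_records)
```

The point of the design is that each simulated sample exists only for the duration of one loop pass, so storage does not grow with the number of iterations.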

Such a simulation is illustrated here, examining the behavior of Analysis of Variance when one of its assumptions is violated. Analysis of Variance (ANOVA) is a statistical procedure which evaluates whether differences among the means of K samples reflect true differences among the means of the K populations. When all of the assumptions of ANOVA are met and the test is carried out at the conventional 5% significance level, the researcher knows that 5% of the time the results will suggest true differences among the K population means when no such differences exist. This error rate (of claiming differences when none exist) is called the type I error rate.

When the ANOVA assumptions have been violated, the researcher may believe that the type I error rate is 5% when the actual rate is quite different. One such assumption is the homogeneity of variance assumption, i.e. that the variances of the scores in the K populations are all equal. Violation of this assumption can alter the actual type I error rate. There is one exception: when the sample sizes of the K groups are equal, the type I error rate should remain close to the expected 5% even if the variances of the K groups differ.
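To make the role of the equal-variance assumption concrete, the F ratio used in one-way ANOVA can be sketched as below. The sketch is in Python rather than XLISP-STAT, and the function name f_statistic is hypothetical. Note that the denominator pools the within-group variation of all K groups into a single mean square, which is a sensible estimate only when the K population variances are equal.

```python
def f_statistic(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups is a list of K samples, each a list of numbers.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the scores around their own group mean, pooled over all groups.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

The null hypothesis of equal means is rejected when F exceeds the critical value of the F distribution with df_between and df_within degrees of freedom.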

The effects of heterogeneous variances on type I error rates in ANOVA can be studied using simulation, by intentionally creating simulated data where the K means are equal, but with different variances. By performing repeated analyses, we can then observe how the actual type I error rate varies from the expected type I error rate under various conditions. Table 1 shows the eight conditions studied. The first four have equal variances, hence their type I error rate would be expected to be 5%. Conditions 5 and 6 have different variances, but the sample sizes are the same, so their type I error rate should still be 5%. Conditions 7 and 8 have both different sample sizes and different variances, hence their simulated type I error rate should differ from 5%.

Table 1: Simulation Conditions and Simulation Results

Table 2 shows the XLISP-STAT program used to create and analyze the simulated data. The program begins by defining a function (called SimAnova) which accepts one parameter, the number of iterations per simulation (1). The function begins by describing the conditions we wish to simulate, i.e. the values of the sample sizes and variances for groups 1 and 2; then TotalConditions and type1ErrorList are initialized (2).

Then, a loop is set up to cycle through all of the conditions described above (3). Next, NumSignificant is set to 0, and N1, N2, V1, V2 are assigned the values for the current condition (4).

A loop is then set up to perform the following steps for the requested number of iterations (5). Simulated data are created for groups 1 and 2, both having means of 0 (i.e. the means of the two populations are known to be equal) and with the sample sizes and variances for the current condition, N1, N2, V1, V2 (6). Then, an analysis of variance is performed on the data (7), and the F value is computed as well as Fcritical (8). If the computed F exceeds Fcritical, the number of significant results is incremented (9). The proportion significant is saved in type1ErrorList (10) for each condition.

Table 2: XLISP-STAT code for Analysis of Variance Simulation

The function is called in step (11), with the requested number of iterations per condition, 1000. Finally, the results are saved in a file (called SimAnova.Lsp) for further examination (12).
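The steps described above can be sketched in Python as follows. This is not the XLISP-STAT program of Table 2, only an illustration of the same loop structure under stated assumptions: the four conditions are hypothetical (they are not the conditions of Table 1), the critical value is passed in as a constant taken from a standard F table instead of being computed as in step (8), and the file-saving step (12) is omitted. The comments map to the step numbers used in the text.

```python
import random


def f_statistic(group1, group2):
    """One-way ANOVA F ratio for two groups (df = 1 and n1 + n2 - 2)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    grand = (sum(group1) + sum(group2)) / (n1 + n2)
    ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ss_within = (sum((x - m1) ** 2 for x in group1)
                 + sum((x - m2) ** 2 for x in group2))
    return ss_between / (ss_within / (n1 + n2 - 2))


def sim_anova(iterations, conditions, f_critical, seed=1995):    # (1)
    """Return the observed type I error rate for each condition."""
    rng = random.Random(seed)
    type1_error_list = []                                        # (2)
    for n1, n2, v1, v2 in conditions:                            # (3), (4)
        num_significant = 0
        for _ in range(iterations):                              # (5)
            # (6) both populations have mean 0, so any rejection is a type I error
            g1 = [rng.gauss(0.0, v1 ** 0.5) for _ in range(n1)]
            g2 = [rng.gauss(0.0, v2 ** 0.5) for _ in range(n2)]
            if f_statistic(g1, g2) > f_critical:                 # (7), (8)
                num_significant += 1                             # (9)
        type1_error_list.append(num_significant / iterations)   # (10)
    return type1_error_list


# Hypothetical conditions as (n1, n2, variance1, variance2); every one has
# n1 + n2 = 40, so the same critical value applies to all of them.
CONDITIONS = [
    (20, 20, 1.0, 1.0),    # equal sample sizes, equal variances
    (20, 20, 1.0, 16.0),   # equal sample sizes, unequal variances
    (10, 30, 1.0, 16.0),   # larger sample has the larger variance
    (30, 10, 1.0, 16.0),   # larger sample has the smaller variance
]
F_CRITICAL_1_38 = 4.098    # upper 5% point of F(1, 38), from a standard table

rates = sim_anova(1000, CONDITIONS, F_CRITICAL_1_38)            # (11)
for condition, rate in zip(CONDITIONS, rates):
    print(condition, rate)
```

Because only num_significant survives from one iteration to the next, the memory used is independent of the number of iterations requested.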

The results of running this simulation with 1000 iterations per condition are shown in the last column of Table 1. (The SPx/cluster completed this simulation in 163 CPU seconds, or approximately 20 minutes of elapsed time.) As expected, the first four conditions (which conform to the ANOVA assumptions) yield an observed type I error rate quite close to the expected type I error rate. And even though conditions 5 and 6 violate the ANOVA assumption of homogeneity of variance, these conditions yield acceptable type I error rates because the sample sizes are equal. The results for conditions 7 and 8 illustrate that violating the homogeneity of variance assumption does affect the type I error rate, but the direction of the effect depends on the pattern of the sample sizes and variances. In condition 7, where the larger sample size is associated with the larger variance, the type I error rate is deflated (to 1.34%). This would make it harder to detect true differences among population means, leading to an increased rate of believing that no mean differences exist among the K populations when the means do in fact differ. In condition 8, where the larger sample size is associated with the smaller variance, the type I error rate is inflated (to 14.54%). Data like this could mislead a researcher into believing that there are mean differences among the K populations when the means are actually all the same.

The results for conditions 7 and 8 suggest that researchers using ANOVA should be attentive not only to whether they violate the assumption of homogeneity of variance, but also to how the assumption is violated. The conclusions could be trusted if a researcher found evidence for mean differences under condition 7, or found evidence of no mean differences under condition 8. However, a finding of no mean differences under condition 7, or a finding of mean differences under condition 8, would be ambiguous, being attributable either to the true nature of the data or to the violation of the homogeneity of variance assumption. In these latter two cases, researchers would need to use an alternate statistic which is robust to violations of the homogeneity of variance assumption (Reference 1).
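One well-known statistic of this kind for the two-group case is Welch's unequal-variance t statistic, sketched below in Python. It is shown only as an illustration of what "robust to unequal variances" can mean and is not necessarily the procedure recommended in Reference 1; the function name welch_t is hypothetical. Instead of pooling the two sample variances, it uses each group's own variance and adjusts the degrees of freedom accordingly.

```python
from statistics import mean, variance


def welch_t(group1, group2):
    """Return (t, df) for Welch's two-sample test, which does not pool variances."""
    n1, n2 = len(group1), len(group2)
    # Squared standard error contributed by each group, using its own variance.
    se1 = variance(group1) / n1
    se2 = variance(group2) / n2
    t = (mean(group1) - mean(group2)) / (se1 + se2) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df
```

The resulting t is compared against a t distribution with df degrees of freedom, which is generally not a whole number.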

Summary

The program illustrated in this article shows the most basic way in which XLISP-STAT can be used to perform simulation studies. With some minor modifications, this simple program could be extended to accommodate any number of groups (not just 2), and/or to study violations of the assumption of normality (by varying the skewness and kurtosis of the data). The general framework of this program could be used for Monte Carlo simulations of other statistical procedures (e.g. to study violations of the homoscedasticity assumption in regression). Yet simulations are just one of the domains where XLISP-STAT may be preferred over a standard statistics package (like SAS, SPSS, BMDP). XLISP-STAT is well suited for a number of other tasks, including the visual display of data, bootstrap resampling, and analysis of data using general linear models. XLISP-STAT does not replace SAS, SPSS or BMDP, but it is a flexible tool which augments such packages by providing an open environment for the study of new and non-traditional statistical methods.

References

  1. Wilcox, R. R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29-60.

Michael Mitchell, Ph.D., is an OAC Statistical Consultant who provides guidance to users in the implementation of statistical methodologies and the analysis of large and complex databases.

*OAC/CS

Jan 95; Rev. 19 Dec 95