Stata FAQ How can I analyze an unbalanced randomized block design?

Randomized block type designs are relatively common in certain fields. Balanced randomized designs can be analyzed using traditional anova and regression methods but unbalanced designs require the use of maximum likelihood methods.

We will begin by analyzing a balanced design with four levels of variable a and 8 subjects denoted s on response variable y using tradition anova.

use http://www.ats.ucla.edu/stat/stata/faq/randomblock, clear

describe

Contains data from http://www.ats.ucla.edu/stat/stata/faq/randomblock.dta
obs:            32
vars:             3                          1 May 2002 09:01
size:           224 (99.9% of memory free)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
s               byte   %8.0g
a               byte   %8.0g
y               byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:

/* balanced design -- traditional anova */

anova y a s

Number of obs =      32     R-squared     =  0.7318
Root MSE      = 1.18523     Adj R-squared =  0.6041

Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
Model |        80.5    10        8.05       5.73     0.0004
|
a |          49     3  16.3333333      11.63     0.0001
s |        31.5     7         4.5       3.20     0.0180
|
Residual |        29.5    21   1.4047619
-----------+----------------------------------------------------
Total |         110    31   3.5483871

anova, regress

Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F( 10,    21) =    5.73
Model |        80.5    10        8.05           Prob > F      =  0.0004
Residual |        29.5    21   1.4047619           R-squared     =  0.7318
Total |         110    31   3.5483871           Root MSE      =  1.1852

------------------------------------------------------------------------------
y        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------------------------------
_cons                   8   .6949006    11.51   0.000     6.554875    9.445125
a
1        -3.25   .5926133    -5.48   0.000    -4.482407   -2.017593
2        -2.75   .5926133    -4.64   0.000    -3.982407   -1.517593
3           -2   .5926133    -3.37   0.003    -3.232407   -.7675933
4    (dropped)
s
1         -2.5   .8380817    -2.98   0.007    -4.242886   -.7571137
2        -2.25   .8380817    -2.68   0.014    -3.992886   -.5071137
3         -2.5   .8380817    -2.98   0.007    -4.242886   -.7571137
4         -2.5   .8380817    -2.98   0.007    -4.242886   -.7571137
5         -2.5   .8380817    -2.98   0.007    -4.242886   -.7571137
6         -1.5   .8380817    -1.79   0.088    -3.242886    .2428863
7         -.25   .8380817    -0.30   0.768    -1.992886    1.492886
8    (dropped)
------------------------------------------------------------------------------
Next we will set two of the observations to missing.
replace y=. if a==1 & s==2
replace y=. if a==2 & s==3

generate complete = 1
replace complete = 0 if s==2 | s==3

tabulate a, gen(a)

a |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          7       23.33       23.33
2 |          7       23.33       46.67
3 |          8       26.67       73.33
4 |          8       26.67      100.00
------------+-----------------------------------
Total |         30      100.00

tabulate s if complete

s |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          4       16.67       16.67
4 |          4       16.67       33.33
5 |          4       16.67       50.00
6 |          4       16.67       66.67
7 |          4       16.67       83.33
8 |          4       16.67      100.00
------------+-----------------------------------
Total |         24      100.00
In traditional anova, particularly procedures that use with wide data, only subjects with complete data are included in the analysis. In this case, only the six subjects with complete data would be used.
/* unbalanced design -- traditional anova */

anova y a s if complete

Number of obs =      24     R-squared     =  0.7105
Root MSE      = 1.33229     Adj R-squared =  0.5560

Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
Model |  65.3333333     8  8.16666667       4.60     0.0054
|
a |      38.125     3  12.7083333       7.16     0.0033
s |  27.2083333     5  5.44166667       3.07     0.0420
|
Residual |      26.625    15       1.775
-----------+----------------------------------------------------
Total |  91.9583333    23  3.99818841

anova, regress

Source |       SS       df       MS              Number of obs =      24
-------------+------------------------------           F(  8,    15) =    4.60
Model |  65.3333333     8  8.16666667           Prob > F      =  0.0054
Residual |      26.625    15       1.775           R-squared     =  0.7105
Total |  91.9583333    23  3.99818841           Root MSE      =  1.3323

------------------------------------------------------------------------------
y        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------------------------------
_cons            8.041667   .8158584     9.86   0.000     6.302706    9.780628
a
1    -3.166667   .7691987    -4.12   0.001    -4.806175   -1.527158
2           -3   .7691987    -3.90   0.001    -4.639508   -1.360492
3           -2   .7691987    -2.60   0.020    -3.639508   -.3604917
4    (dropped)
s
1         -2.5   .9420722    -2.65   0.018    -4.507979   -.4920207
4         -2.5   .9420722    -2.65   0.018    -4.507979   -.4920207
5         -2.5   .9420722    -2.65   0.018    -4.507979   -.4920207
6         -1.5   .9420722    -1.59   0.132    -3.507979    .5079793
7         -.25   .9420722    -0.27   0.794    -2.257979    1.757979
8    (dropped)
------------------------------------------------------------------------------

If you have Stata 9 you can use xtmixed. We will run it using both the default REML and full maximum likelihood. With maximum likelihood we can run the analysis with all observations not just on subjects with complete data.
xtmixed y a1 a2 a3 || s:, variance

Performing EM optimization:

Iteration 0:   log restricted-likelihood =  -49.98054
Iteration 1:   log restricted-likelihood =  -49.98054

Computing standard errors:

Mixed-effects REML regression                   Number of obs      =        30
Group variable: s                               Number of groups   =         8

Obs per group: min =         3
avg =       3.8
max =         4

Wald chi2(3)       =     29.29
Log restricted-likelihood =  -49.98054          Prob > chi2        =    0.0000

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
a1 |  -3.135557   .6416073    -4.89   0.000    -4.393084    -1.87803
a2 |  -2.753089   .6416073    -4.29   0.000    -4.010616   -1.495562
a3 |         -2   .6157501    -3.25   0.001    -3.206848   -.7931519
_cons |       6.25   .5327192    11.73   0.000      5.20589     7.29411
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
s: Identity                  |
var(_cons) |   .7537252   .6242815      .1486605     3.82147
-----------------------------+------------------------------------------------
var(Residual) |   1.516593   .4889405      .8062073    2.852931
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =     3.43 Prob >= chibar2 = 0.0320

test a1 a2 a3

( 1)  [y]a1 = 0
( 2)  [y]a2 = 0
( 3)  [y]a3 = 0

chi2(  3) =   29.29
Prob > chi2 =    0.0000

/* approximate F-test */

display r(chi2)/r(df)

9.7619818

/* next with full maximum likelihood */

xtmixed y a1 a2 a3 || s:, variance mle

Performing EM optimization:

Iteration 0:   log likelihood = -50.864974
Iteration 1:   log likelihood = -50.864974

Computing standard errors:

Mixed-effects ML regression                     Number of obs      =        30
Group variable: s                               Number of groups   =         8

Obs per group: min =         3
avg =       3.8
max =         4

Wald chi2(3)       =     33.86
Log likelihood = -50.864974                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
a1 |   -3.13568   .5967199    -5.25   0.000    -4.305229    -1.96613
a2 |  -2.753432   .5967199    -4.61   0.000    -3.922982   -1.583883
a3 |         -2    .572654    -3.49   0.000    -3.122381   -.8776188
_cons |       6.25    .496394    12.59   0.000     5.277086    7.222914
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
s: Identity                  |
var(_cons) |   .6595255   .5088202      .1453899    2.991774
-----------------------------+------------------------------------------------
var(Residual) |    1.31173    .393547      .7285611    2.361691
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) =     3.99 Prob >= chibar2 = 0.0229

test a1 a2 a3

( 1)  [y]a1 = 0
( 2)  [y]a2 = 0
( 3)  [y]a3 = 0

chi2(  3) =   33.86
Prob > chi2 =    0.0000

/* approximate F-test */

display r(chi2)/r(df)

11.287767

If you don't have Stata 9 (and therefore don't have xtmixed) you can use xtreg with the mle (maximum likelihood) option.
/*  unbalanced design -- maximum likelihood using xtreg */

xtreg y a1 a2 a3, i(s) mle

Random-effects ML regression                    Number of obs      =        30
Group variable (i): s                           Number of groups   =         8

Random effects u_i ~ Gaussian                   Obs per group: min =         3
avg =       3.8
max =         4

LR chi2(3)         =     20.35
Log likelihood  = -50.864974                    Prob > chi2        =    0.0001

------------------------------------------------------------------------------
y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
a1 |   -3.13568   .5967916    -5.25   0.000     -4.30537    -1.96599
a2 |  -2.753432   .5972855    -4.61   0.000     -3.92409   -1.582774
a3 |         -2   .5726538    -3.49   0.000    -3.122381   -.8776192
_cons |       6.25   .4963939    12.59   0.000     5.277086    7.222914
-------------+----------------------------------------------------------------
/sigma_u |   .8121118   .3132698     2.59   0.010     .1981143    1.426109
/sigma_e |   1.145308   .1718083     6.67   0.000       .80857    1.482046
-------------+----------------------------------------------------------------
rho |   .3345713   .1958727                      .0692199    .7346624
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)=    3.99 Prob>=chibar2 = 0.023

test a1 a2 a3

( 1)  [y]a1 = 0
( 2)  [y]a2 = 0
( 3)  [y]a3 = 0

chi2(  3) =   33.83
Prob > chi2 =    0.0000

/* approximate F-test */

display r(chi2)/r(df)

11.277428
As you can see the maximum likelihood solutions are much closer to the balanced design results then is the analysis using only subjects complete data.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.