Saturday, November 13, 2010

Reporting Standard Errors for USL Coefficients

In a recent Guerrilla CaP Group discussion, Baron S. wrote:
....
BS> Using gnuplot against the dataset I gave, I get 
BS>    sigma   0.0207163 +/- 0.001323 (6.385%) 
BS>    kappa   0.000861226 +/- 5.414e-05 (6.287%) 
The Gnuplot output includes the errors for each of the universal scalability law (USL) coefficients. A question about the magnitude of these errors also arose in a recent talk I gave. Typically, this question doesn't come up because there's more focus on assessing the residual errors as a measure of fit for the USL against the data set. Also, statistical accuracy can be a bigger issue when there are only a small number of samples. Barron reported 32 data points, so that's not an problem in this case.

All this led me to wonder if the same error information is available in R. The answer is, yes, and here's how to get hold of it. In the following, where Baron used coefficient names sigma and kappa (as in Chapter 4 of my GCaP book), I'll use alpha and beta to mean the same thing respectively because the context is processes (i.e., thread scalability) rather than processors (i.e., hardware configuration scalability as discussed in Chap. 4).

First, let's check the coefficient values determined by R when it fits the nonlinear USL model against Baron's data:

> coef(usl)
       alpha         beta 
0.0207164433 0.0008612175 
They agree. More complete statistics are available in R using the summary() command:

> summary(usl)

Formula: Norm ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1))

Parameters:
       Estimate Std. Error t value Pr(>|t|)    
alpha 2.072e-02  1.322e-03   15.67 5.48e-16 ***
beta  8.612e-04  5.412e-05   15.91 3.61e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2005 on 30 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 8.628e-08 
Indeed, we can now see that the standard errors are reported in the formatted column labeled "Std. Error" but the question is, how can we get hold of those values individually? For that, we need to understand the underlying formatting data structure being used by R. The str() function in R provides that information:

> str(summary(usl))
List of 11
 $ formula     :Class 'formula' length 3 Norm ~ N/(1 + alpha * (N - 1) + beta * N * (N - 1))
  .. ..- attr(*, ".Environment")= 
 $ residuals   : num [1:32] 0 0.011 -0.0522 -0.0144 -0.0269 ...
 $ sigma       : num 0.2
 $ df          : int [1:2] 2 30
 $ cov.unscaled: num [1:2, 1:2] 4.35e-05 -1.73e-06 -1.73e-06 7.29e-08
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "alpha" "beta"
  .. ..$ : chr [1:2] "alpha" "beta"
 $ call        : language nls(formula = Norm ~ N/(1 + alpha * (N - 1) + beta * N * (N -      1)), data = input, start = c(alpha = 0.1, beta = 0.01), algorithm = "default",  ...
 $ convInfo    :List of 5
  ..$ isConv     : logi TRUE
  ..$ finIter    : int 7
  ..$ finTol     : num 8.63e-08
  ..$ stopCode   : int 0
  ..$ stopMessage: chr "converged"
 $ control     :List of 5
  ..$ maxiter  : num 50
  ..$ tol      : num 1e-05
  ..$ minFactor: num 0.000977
  ..$ printEval: logi FALSE
  ..$ warnOnly : logi FALSE
 $ na.action   : NULL
 $ coefficients: num [1:2, 1:4] 2.07e-02 8.61e-04 1.32e-03 5.41e-05 1.57e+01 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "alpha" "beta"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ parameters  : num [1:2, 1:4] 2.07e-02 8.61e-04 1.32e-03 5.41e-05 1.57e+01 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "alpha" "beta"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 - attr(*, "class")= chr "summary.nls"
From this output we see that the α and β coefficients—the same ones reported by the coef() function—can be accessed by indexing them as a 2-dimensional array.

> summary(usl)$coefficients[1,1]
[1] 0.02071644
> summary(usl)$coefficients[2,1]
[1] 0.0008612175
Similarly, the corresponding standard errors can be accessed as:

> summary(usl)$coefficients[1,2]
[1] 0.001322276
> summary(usl)$coefficients[2,2]
[1] 5.412075e-05
and they also agree with the standard errors reported in Gnuplot.

No comments: