Thursday, November 29, 2012

PDQ 6.0 from a Developer Standpoint

This is a guest post by Paul Puglia, who contibuted significant development effort for the PDQ 6.0 release; especially as it relates to interfacing with R. Here, Paul provides more details about the motivation described in my eariler announcement.


PDQ was designed and implemented around a couple of basic assumptions. First, the library would be a C-language API running on some variant of the Unix operating system where we could reasonably assume that we'd be able to link it against a standard C library. Second, programs built using this API would be "stand-alone" executables in the sense that they'd run in their own, dedicated memory address spaces, could route its I/O through the standard streams (stdout or stderr), and had complete control over how error conditions would be handled.

Not surprisingly, the above assumptions drove a set of implementation decisions for the library, namely:

  • All I/O would be pushed through the standard stream library functions like printf and fprintf
  • Memory for internal data structures would be allocated and released through calls to the standard library functions calloc and free
  • Error conditions would result in PDQ causing the model execution to stop with an explicit call to exit().
These aren’t usual decisions for a C API.

With the arrival of PDQ 2.0, we introduced foreign interfaces programming environments (PERL, Python and R) that allowed PDQ to be called from these other environments. All these new foreign interfaces were built and released using the SWIG interface building tool, which allows us to build these interfaces with absolutely no modification to the underlying PDQ code—a major benefit when you’ve got a mature, debugged API that you really want to remain that way. For the most part this arrangement worked pretty well—at least for those environments where it was natural to write and execute PDQ models like standalone C-programs (you can also read this as PERL and Python).

When it came to R, however, our early implementation decisions weren’t such a great fit for how R is commonly used, which is as an interactive environment, similar to programs like Mathematica, Maple, and Matlab. Like these other environments, R users do most of their interaction with a REPL (Read-Execute-Print Loop) usually wrapped in either full-fledged GUI interface or a terminal-like interface called the console.

It turns out that most of PDQ's implementation decisions could (and do) interfere with using R interactively. In particular:

  • Calling the exit() function results in the entire R environment exiting – not a good feature for an interactive environment.
  • Writing directly to the stdout and strerr using fprintf, bypasses R's own internal I/O mechanisms and prevents internal /O functions (like the sink() command) from working properly.
  • Using calloc() and free() functions interfere with R's own internal memory management mechanisms and would prove to be a major impediment for any Windows version of the interface.

Not only do these severely degrade the interactive experience for R users, their use also gets flagged by R’s extension building mechanism when it does a consistency check. And not passing that check would prove a major impediment for getting the PDQ's R interface accepted on CRAN (Comprehensive R Archive Network).

Luckily, none of the fixes for these issues are particularly hard to implement. Most are either fairly simple substitutions of the R API calls for C library routines or/and localized changes to PDQ library. And, while all of this does potentially create a risk of introducing bugs in the PDQ library, the reward for taking that risk is a stable R interface that can be eventually be submitted to CRAN. A version of the PDQ library can be easily built under Windows™ using the Rtools utilities.

No comments: