A language should be focused on writing Internal DSLs

I’ve been reading Martin Fowler’s book, Domain-Specific Languages, this weekend. He covered a number of ways in which you can structure your code to achieve what he terms an Internal DSL. Quite a bit is focused on the discussion of a fluent interface. It turns out in many languages there are only so many ways to structure the code so it looks like a DSL rather than the host language. Chapter 4 of the book explores a number of these techniques, and summarizes the results in a nice table, mapping language structure (for a DSL) to host language patterns:

Structure BNF Consider…
Mandatory list parent ::= first second third Nested Function
Optional list parent ::= first maybeSecond? maybeThird? Method Chaining, Literal Map
Homogenous bag parent ::= child* Literal List, Function Sequence
Heterogenous bag parent ::= (this | that | theOther)* Method Chaining
Set n/a Literal Map

What’s interesting about this mapping is that, we can use it to engineer a language whose sole purpose would be to solve problems by authoring and combining DSL’s (what the Lisper’s are always claiming they do through the use of macro’s).
The focus here is to get a function call specification that can do a good job of representing all the control structures that are present within a grammar.
So let’s take a careful look at Wirth’s syntax notation to get a good idea of what needs to be modeled.

  • Sequencing is conveyed by a space-delimited list of grammar elements: a b c d
  • Repitition is denoted by curly brackets: {a} stands for ε | a | aa | aaa | …
  • Optionality is expressed by square brackets: [a]b stands for ab | b
  • Grouping is indicated by parenthesis: (a|b)c stands for ac | ab

Fowler gives some interesting suggestions, both in the above table and in the surrounding discussion within the book. I’d like to take some space here and explore each of the options, and how it relates to a Wirth-style grammar. Throughout this exposition, we’ll be using programming language constructs such as function calls and scoping to model the grammar rules typically seen in the above table. We’ll also assume that each grammar rule can be made to correspond to either a method or type (or both) in the implementors programming language.

  • Mandatory list. Within a grammar, a mandatory list is indistinguishable from a sequence. Within a programming language, the mandatory list can also be expressed as a list of arguments that must be provided to a function. We can even invoke compile-time checking by having each grammar rule correspond to a type. Fowler doesn’t quite go that far, but he does give this illustrative example:

    Nested Function Example declarations
    computer(
        processor(
            cores(2),
            speed(2500),
            i386
        ),
        disk(
            size(150)
        ),
        disk(
            size(75),
            speed(7200),
            sata()
        )
    )
    ComputerT computer(ProcessorT, DiskT);
    ProcessorT processor(CoresT, SpeedT, ArchT);
    DiskT disk(SizeT);

  • Optional List. As long as each list can be clearly expressed, we can achieve an optional list via function overloading. This way the compiler will match up directly to the instance of the correct rule. However, this amount of explicitness comes at a cost: either we keep the number of optional choices in the grammar to a minimum, or we commit ourselves to writing a large number of very similar functions (as the optionality can explode exponentially). If optionality is given in the grammar as a grouping, then it may make sense to create a new function representing only that grouping. This can reduce the amount of function overloading by reducing the number of function signatures that need to be supported. Fowler recommends two options:

    Method Chaining Literal Map
    computer()
        .processor()
            .cores(2),
            .speed(2500),
            .i386()
        .disk()
            .size(150)
        .disk()
            .size(75)
            .speed(7200)
            .sata()
        .end()
    computer(
        processor(
            :cores => 2,
            :speed => 2500,
            :type => :i386
        ),
        disk(
            :size => 150
        ),
        disk(
            :size => 75,
            :speed => 7200,
            :interface => :sata
        )
    )

    Although the Method Chaining approach readily allows for the assumption of default parameters (for example the speed of the first disk), the hierarchical structure here is not fully captured by the chain of method calls. To support such a chain as that in the example, it would be necessary to hold a ContextVariable that each method automatically recognizes. (for example, so that size() can refer to the last disk inserted. Because this context is kept implicit, the compiler will not be able to capture an invalid sequence of method calls (instead this much be caught by validity checking code at runtime). Furthermore, it is important to keep straight the terminology of each different component. For example, it is not clear that speed(7200) should refer to the last mentioned disk or to the last mentioned processor. Reshuffling the order of method calls can be highly problematic, and can result in unwanted side-effects if terminology becomes mixed up.

    The Literal Map approach works pretty well for dynamic languages (python and ruby) which allow functions to accept a dictionary mapping terms to values. It would be nice to have named parameters in more of the static languages (Ada and C# have them, C/C++ and Java do not). Having a statically type-checked language implement this feature, including the ability to overload with default values, would be ideal. There is also nothing that preserves the order of entries (which isn’t all that important in this example).

  • Homogeneous Bag. This is directly about modeling the repetition rule in Wirth’s syntax. In most programming languages, there do not exist easily used mechanisms for passing a variable length list of items.

    Literal List Function Sequence
    martin.follows(
        "WardCunningham",
        "bigballofmud",
        "KentBeck",
        "neal4d");
    computer();
        processor();
            cores(2);
            speed(2500);
            i386();
        disk();
            size(150);
        disk();
            size(75);
            speed(7200);
            sata();

    The Literal List approach can be implemented, in languages such as C, with the varargs mechanism. But it forgoes type-checking, and definitely doesn’t feel ‘built-in’. Dynamic languages are more permissive in this regard. For example, Python has the *args mechanism, and JavaScript allows access to the arguments array. In either case, it turns out, aside from creating, explicitly, your own separate list to pass in, there just really isn’t a great way to implement this pattern. All is not quite lost, however, as some language allow for the in-line initialization of arrays. For example,

    • C++ allows passing of a streamed object: m(StringArray() << "blah" << "hey" << "yo"), which isn’t quite what we want.
    • Java does a bit better with m(new String[]{"blah", "hey", "yo"}); which sacrifices readability, but achieves the desired goal.
    • Java also allows declaration of a type-checked varargs m(String...); which is called like m("blah", "hey", "yo"), which is precisely what we want.

    The biggest drawback of using the varargs mechanism is that it lacks sequential composability. You cannot easily declare a function for the rule: A := B* C*. In this case, the more crufty, in-line instantiation mechanisms turn out to be better.

    The Function Sequence approach lacks pretty much everything I’ve considered thus far: it’s sequencing isn’t type-checked, terminology can be confused, it requires a State Variable to maintain context. The only way in which it isn’t worse than Method Chaining is that the compiler can tell you on what line you crashed, provided you responsibly littered your code with assert‘s.

  • Heterogeneous Bag. Because this is more about repetition of optionality, it’s probably best handled by combining previous approaches. For example, you could choose to factor out the group as it’s own separate type, and then have a vararg listing of these types. Method Chaining can be used for list of configuration parameters, especially if order isn’t really the concern. You couldn’t implement this with a dictionary or map though, because of the repetition aspect.
  • Set. Clearly the only approach to implementing an unordered collection of possibly heterogeneous types is a dictionary or associative array, best represented in the the Literal Map approach. Again, we’d like to have a clean syntax for this in our programming languages: type-checked keyword arguments.

For me, the most important aspect of this exposition has been the progressive realization that type-checked keyword arguments and inline-initialization are necessary elements to supporting a mini-DSL within your own programming language. Without these features, syntax too easily obscures what you are tying to express. It’s probably worth complicating both the semantics and parsing of the host language to provide these elements.

We also find that, using function calls as our implementation interface, the most difficult aspect of Wirth’s syntax is repetition. All the other rules fit very nicely into existing function interfaces. It would be very nice for more languages to support varargs in the way that Java has. However, I would like for this to be extended a bit further: it should be possible to have multiple vararg arguments, as long as they are type-distinguishable. For example, m(String..., Integer...), should be callable with m("a", "b", "c", 3, 1, 4, 1, 5, 9).