Comp* – eric the fruitbat

C++ casting operators

erich — Sat, 20 Apr 2013 02:28:58 +0000

Today in my research, I came across an interesting challenge. I’m editing an older version of JavaScriptCore (JSC), redefining the most basic typedef in the system, EncodedJSValue. Previously it was a simple void*, but I need to give it some special secret sauce, so I changed it to a struct.

typedef struct EncodedJSValue {
    void *value;

    // additional functionality elided
} EncodedJSValue;

Most of the additional functionality could be filled in by following compiler errors. But one error deep in the system took some special cleverness. My initial (and failing) attempts at fixing the issue led me into a strange corner of C++: overloading the cast operators.

g++’s error messages are confusing, so I originally (wrongly) thought that EncodedJSValues were sometimes used to hold function pointers of the type typedef int(*jitStub)(void**); So, how do I convert an EncodedJSValue to a function pointer?

Well, I know from online reading that it’s possible to implement an operator that converts from EncodedJSValue to int.

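typedef struct EncodedJSValue {
    void *value;

    // User-defined conversion: lets an EncodedJSValue be used where an int is expected.
    operator int() {
        // A void* cannot be static_cast to an integer; go through intptr_t (from <cstdint>).
        return static_cast<int>(reinterpret_cast<intptr_t>(value));
    }
} EncodedJSValue;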

So what would be the syntax for writing a typecast operator from EncodedJSValue to a function pointer? Although I know the syntax for function pointers (see the jitStub typedef previously), I was pretty sure that I could not just paste that into the operator definition. Besides that, function pointers don’t read very nicely, and keeping a typedef around helps the code to be more self-explanatory. So, I tried making an operator jitStub, and it worked. No complaints from the compiler!

typedef int(*jitStub)(void**);

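typedef struct EncodedJSValue {
    void *value;

    // Conversion to a function pointer, named through the jitStub typedef.
    operator jitStub() {
        // Data pointer to function pointer needs reinterpret_cast (conditionally supported).
        return reinterpret_cast<jitStub>(value);
    }
} EncodedJSValue;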

Also, by writing a separate test, I found out that the conversion can be invoked implicitly.

EncodedJSValue jsv;
void **arg;

// jsv is implicitly converted to jitStub, and the resulting pointer is called with arg.
int x = jsv(arg);

Although this didn’t solve my compilation problem, I did learn another dark corner of the labyrinthine C++. Using the much better error reporting in clang helped me to interpret my actual problem.

Flexible Iterators

erich — Thu, 31 Jan 2013 09:18:58 +0000

Java has some odd quirks which make it far more inflexible than it needs to be. For example, many programs have data structures which need to be iterated both forwards and backwards, and some algorithms require treating the first or last element differently than the others. My goal here is to find a tweak to the Java language that would permit use of the foreach loop in all of these cases. To achieve that, it would be nice if the data structure in question could return different iterators appropriate to the task at hand.

Let’s first review the laborious, multi-step process of equipping a class to support the foreach syntactic sugar.

Declare the class so that it implements Iterable<Data>.
Define a method Iterator<Data> iterator().
Implement an appropriate class DataIterator implements Iterator<Data>.
Define the methods supported by all Iterators: next(), hasNext(), and remove().
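The four steps can be sketched as follows. This is a minimal illustration, not code from any real project: DataList is a hypothetical container, and it holds String elements instead of Data so that the example is self-contained.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical container, used only to illustrate the four steps.
class DataList implements Iterable<String> {          // step 1
    private final String[] items;

    DataList(String... items) {
        this.items = items.clone();
    }

    @Override
    public Iterator<String> iterator() {               // step 2
        return new DataIterator();
    }

    // Step 3: Iterator is an interface, so the class implements it.
    private final class DataIterator implements Iterator<String> {
        private int position = 0;

        @Override
        public boolean hasNext() {                     // step 4
            return position < items.length;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return items[position++];
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException("read-only container");
        }
    }

    // Joins all elements with a foreach loop, to show the sugar in use.
    static String join(DataList list) {
        StringBuilder out = new StringBuilder();
        for (String item : list) {
            out.append(item);
        }
        return out.toString();
    }
}
```

Note how much ceremony surrounds the one line that matters, items[position++].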

The foreach loop, which looks like:

for(Data d : collection) {
  // do something with d
}

desugars into

Iterator<Data> it = collection.iterator();
while (it.hasNext()) {
  Data d = it.next();
  // do something with d
}

Knowing this implementation, I can achieve my stated desire in one of two ways: (1) a syntax change or (2) a semantic change. I shall first present the syntax change because it’s such a horrible idea.

Syntax Modification

We first observe that the foreach loop desugars into a call to iterator() passing no arguments. This constraint forces each class into a box where it can only implement one kind of iteration. Immediately, my object oriented (damaged) brain thought to alleviate this constraint by overloading of the iterator() method with versions accepting different arguments. For example, a DataIterator which allowed slicing might be called with three arguments: start, end, and stride. The foreach loop syntax can be extended to perform this lookup, by a small tweak to the desugaring:

for(Data d : collection)(start, end, stride) {
  // do something with d
}

The syntactic change can even be made backwards compatible with the use of varargs.

Semantic Modification

Change the semantics of the foreach desugarer so that it can handle both Iterables and Iterators.
This is by far the easiest change, because I find it comparatively easy to give my class many well-named methods which each return an Iterator.
The same example reads much better now:

for (Data d : collection.slice(start, end, stride)) {
  // do something with d
}

Instead of slice I could also implement reverse or any other kind of iteration.
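Java as it stands already allows something close to this, provided the well-named methods return an Iterable instead of an Iterator. A minimal sketch, assuming a hypothetical Views helper that works over java.util.List rather than the collection class from the examples above:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper; the method names mirror the slice/reverse idea in the text.
class Views {
    // Every stride-th element of list[start, end).
    static <T> Iterable<T> slice(final List<T> list, final int start,
                                 final int end, final int stride) {
        if (start < 0 || stride <= 0) {
            throw new IllegalArgumentException("need start >= 0 and stride > 0");
        }
        final int limit = Math.min(end, list.size());
        return new Iterable<T>() {
            @Override
            public Iterator<T> iterator() {
                return new Iterator<T>() {
                    private int index = start;

                    @Override
                    public boolean hasNext() {
                        return index < limit;
                    }

                    @Override
                    public T next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        T value = list.get(index);
                        index += stride;
                        return value;
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

    // The elements of list, last to first.
    static <T> Iterable<T> reverse(final List<T> list) {
        return new Iterable<T>() {
            @Override
            public Iterator<T> iterator() {
                return new Iterator<T>() {
                    private int index = list.size() - 1;

                    @Override
                    public boolean hasNext() {
                        return index >= 0;
                    }

                    @Override
                    public T next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        return list.get(index--);
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

    // Drains a view with a foreach loop; used to show the views in action.
    static <T> List<T> collect(Iterable<T> view) {
        List<T> out = new ArrayList<T>();
        for (T item : view) {
            out.add(item);
        }
        return out;
    }
}
```

With this in place, for (Data d : Views.slice(list, start, end, stride)) compiles in current Java, at the cost of a wrapper object or two per loop and a good deal of boilerplate per view.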

I wondered why the Java implementors did not already put in the ability to handle both Iterables and Iterators.
An answer on StackOverflow, of course, held the standard objection:

The reason for-each loops require an iterable is to allow the same object to be traversed multiple times (so that you can use multiple for-each loops over the same object without surprising behaviour), whereas an iterator only allows one traversal. If for-each allowed iterators to be used, the behaviour would be surprising to programmers who didn’t realise that their iterators would be exhausted after the loop is run.
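The objection in the quote can be seen directly with today’s library classes. A small sketch (the class and method names are hypothetical) that makes two passes over the same Iterator and then two passes over the same Iterable:

```java
import java.util.Iterator;
import java.util.List;

class Exhaustion {
    // Counts the elements an Iterator still has to offer, consuming them.
    static int drain(Iterator<?> it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    // Two passes over the same Iterator: the second pass finds nothing left.
    static int[] twoPassesOverIterator(List<String> data) {
        Iterator<String> it = data.iterator();
        int first = drain(it);
        int second = drain(it);
        return new int[] { first, second };
    }

    // Two foreach loops over the same Iterable: each loop asks for a fresh Iterator.
    static int[] twoPassesOverIterable(Iterable<String> data) {
        int first = 0;
        for (String ignored : data) {
            first++;
        }
        int second = 0;
        for (String ignored : data) {
            second++;
        }
        return new int[] { first, second };
    }
}
```

For a three-element list the first method yields {3, 0} and the second {3, 3}, which is the surprise the answer is warning about.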

Closing Remarks

My postdoc Per pointed out that Python programmers obviously have an easier time. Because Python has dynamic lookup mechanisms that allow the for-in loop to accept any iterator, it supports much more composition, allowing filters, generators, and iterators.

Notes: CSTA CS+IT Conference

erich — Wed, 11 Jul 2012 09:29:10 +0000

Today I attended the 2012 ACM Computer Science and Information Technology Conference. It was focused heavily on the advocacy of teaching Computer Science material in the K-12 system.

I most liked the fact that every session I attended had some kind of interactive portion, including trial activities and Think-Pair-Share.

Think positively. Have your students relate 3 items of good news every day. (I’ve read in the positive psychology literature that this works to increase feelings of happiness.)
Act as the host that elicits solutions from your audience. Don’t tell students the answers, instead get them to help each other or model the learning/investigation process.
Debrief on your process after every problem. How did you solve it? How does it apply to other things? Make sure the steps of the solution are clear.
Model the inquiry process to help students find the answer through their own knowledge.
The first course may be more about building community than it is about content.
Have every student journal (think-pair-share)
Use scaffolding: play guess the number for binary search, fill out a cooking recipe for discovering a Gantt chart.
Scaffolding is especially important for group work. Assign roles so that everyone has a clear part to play. Make the steps to solution clear and incremental. Rotate the roles.
Don’t skip the orientation part to save time. People need a context for the problem.
Ask meta-questions: 1. Directed 2. Convergent (pattern identification) 3. Divergent (pattern application to other place)
www.pogil.org and cspogil.org

A couple of new terms: Brogrammer and Hacker Hostel.

A couple of references:

The Art of Changing the Brain: Enriching the Practice of Teaching by Exploring the Biology of Learning, by Zull.
Stuck in the Shallow End: Education, Race, and Computing, by Margolis

By the way, I’m actually appalled at how low the minority and women turnout in CS education is. I would rather like to think that, behind the impersonal interface of text, minorities and women would have MORE opportunity to express themselves without being shut down via automatic cultural biases. It should be a source of freedom. We have an image problem if the cultural messages are so strong as to thwart those opportunities.

I’m still left with a question: What have I done to welcome students into my classroom? What have I done to be more inclusive? to break down the nerdy white male stereotype? How can I do this without appearing condescending or patronizing?

Nanopass Compiler

erich — Sat, 19 May 2012 19:12:43 +0000

Through a friend, I got hold of a provocative paper, A Nanopass Framework for Compiler Education, by Sarkar, Waddell, and Dybvig. They describe a compiler written in Scheme that makes 50ish passes. Each pass is described as a language transform, where both the input and output languages must have precise grammars. To reduce the code bloat, language descriptions can be inherited, so the developer really only has to specify what each transform changes. The remove-not pass was 7 lines.

The framework also provides some nice debugging features. Since each pass is quite small, it’s easier to get right. It’s also logically separated from other transforms. Adherence to each IR language is checked via Scheme’s dynamic typing system. And there’s also a parser/unparser pair that allows visualization of the IR language.
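To make the idea of a pass as a small language-to-language transform concrete, here is a rough sketch in Java. It is not the paper’s code and makes no claim about what the paper’s remove-not pass actually does: the toy IR, the class names, and the rewrite of (not e) into (if e #f #t) are all assumptions chosen for illustration. The output language is the input language minus Not, and a separate check confirms that a tree conforms to it.

```java
// Toy IR with four node kinds. All class names here are hypothetical.
abstract class Expr {}

final class Lit extends Expr {
    final boolean value;
    Lit(boolean value) { this.value = value; }
    @Override public String toString() { return value ? "#t" : "#f"; }
}

final class Var extends Expr {
    final String name;
    Var(String name) { this.name = name; }
    @Override public String toString() { return name; }
}

final class Not extends Expr {
    final Expr arg;
    Not(Expr arg) { this.arg = arg; }
    @Override public String toString() { return "(not " + arg + ")"; }
}

final class If extends Expr {
    final Expr test, then, otherwise;
    If(Expr test, Expr then, Expr otherwise) {
        this.test = test;
        this.then = then;
        this.otherwise = otherwise;
    }
    @Override public String toString() {
        return "(if " + test + " " + then + " " + otherwise + ")";
    }
}

class RemoveNot {
    // Input language: Lit | Var | Not | If.  Output language: Lit | Var | If.
    static Expr run(Expr e) {
        if (e instanceof Not) {
            // Assumed rewrite: (not x) becomes (if x #f #t).
            return new If(run(((Not) e).arg), new Lit(false), new Lit(true));
        }
        if (e instanceof If) {
            If node = (If) e;
            return new If(run(node.test), run(node.then), run(node.otherwise));
        }
        return e; // Lit and Var pass through unchanged
    }

    // Dynamic check that a tree belongs to the output language (contains no Not).
    static boolean inOutputLanguage(Expr e) {
        if (e instanceof Not) {
            return false;
        }
        if (e instanceof If) {
            If node = (If) e;
            return inOutputLanguage(node.test)
                && inOutputLanguage(node.then)
                && inOutputLanguage(node.otherwise);
        }
        return true;
    }
}
```

Even in this toy, most of the code is the traversal boilerplate for nodes the pass does not care about, which is the part a framework of this kind would presumably generate.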

I find most fascinating the extreme language-centric view of the entire approach. It strikes me as something that only Scheme/Lisp people would invent. The framework itself reads as a DSL for describing incremental language transforms. The DSL removes the crufty parts from my idea: that the compiler is a Visitor of visitors.

Embedded Languages

erich — Sun, 29 Apr 2012 02:12:47 +0000

I don’t like them.

I’ve ranted before about how the Web is a festering polyglot made horrific by Postel’s Law. Many, including Tim Bray, advocate more knowledge at the client end, when an error occurs in parsing the steaming pile of HTML that forms today’s Web pages. I almost fell in line with this reasoning, because more information is better, right? I thought a draconian policy would so irritate customers that businesses would be quick to fix it, and expend much effort on prevention. So, all the Web becomes well-formed.

Oh how wrong I was!
Jeff Atwood recounts an interesting tale, originally told by Mark Pilgrim at http://diveintomark.org/archives/2004/01/14/thought_experiment:

Imagine that you posted a long rant about how this is the way the world should work, that clients should be the gatekeepers of wellformedness, and strictly reject any invalid XML that comes their way. You click ‘Publish’, you double-check that your page validates, and you merrily close your laptop and get on with your life.

A few hours later, you start getting email from your readers that your site is broken. Some of them are nice enough to include a URL, others simply scream at you incoherently and tell you that you suck. (This part of the thought experiment should not be terribly difficult to imagine either, for anyone who has ever dealt with end-user bug reports.) You test the page, and lo and behold, they are correct: the page that you so happily and validly authored is now not well-formed, and it not showing up at all in any browser. You try validating the page with a third-party validator service, only to discover that it gives you an error message you’ve never seen before and that you don’t understand.

You pore through the raw source code of the page and find what you think is the problem, but it’s not in your content. In fact, it’s in an auto-generated part of the page that you have no control over. What happened was, someone linked to you, and when they linked to you they sent a trackback with some illegal characters (illegal for you, not for them, since they declare a different character set than you do). But your publishing tool had a bug, and it automatically inserted their illegal characters into your carefully and validly authored page, and now all hell has broken loose.

You desperately jump to your administration page to delete the offending trackback, but oh no! The administration page itself tries to display the trackbacks you’ve received, and you get an XML processing error. The same bug that was preventing your readers from reading your published page is now preventing you from fixing it! You’re caught in a catch-22. … All the while, your page is completely inaccessible and visibly broken, and readers are emailing you telling you this over and over again.

…

Here’s the thing: that wasn’t a thought experiment; it all really happened. It’s a funny story, actually, because it happened to Nick Bradbury, on the very page where he was explaining why it was so important for clients to reject non-wellformed XML. His original post was valid XHTML, and his surrounding page was valid XHTML, but a trackback came in with a character that wasn’t in his character set, and Typepad didn’t catch it, and suddenly his page became non-wellformed XML.

The moral of the story is actually not about well-formedness and draconian client validation, but about security. It should not be possible for somebody else to break your system. The mechanism by which we include foreign content in our pages is fundamentally broken. HTML systems usually function as templated string processing, a practice which results in the above problems. It’s an issue of content injection and a lack of sandboxing that only masquerades as one of well-formedness and validation. Embedded languages shall never escape this quagmire.
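The difference between pasting foreign text into markup and treating it as data can be sketched in a few lines. This is an illustration under assumptions: TrackbackList and its methods are hypothetical, and escaping the markup-significant characters addresses injection in general, not the specific character-set mismatch in the story above.

```java
// Hypothetical page-fragment renderer; names are illustrative only.
class TrackbackList {
    // Templated string processing: foreign text is pasted straight into the markup.
    static String renderNaive(String trackback) {
        return "<li>" + trackback + "</li>";
    }

    // Treats the foreign text as data by escaping markup-significant characters.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    static String renderEscaped(String trackback) {
        return "<li>" + escape(trackback) + "</li>";
    }
}
```

With the naive version, a trackback containing an unclosed tag changes the structure of the host page. With the escaped version it can only ever show up as text. The host language still has to remember to call escape at every boundary, which is exactly the fragility of embedding one language in another as strings.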

Scaling Automated CS Education

erich — Sat, 28 Apr 2012 23:46:06 +0000

The success of Salman Khan’s Academy and other instances of disruptive education has started me thinking about how computer science education might scale. Let’s first analyze how Khan is organizing the learning experience.

First: Have a huge collection of videos. Khan’s library has been organically grown. Each video introduces only a single topic, through the use of an example. The videos are never longer than 15 minutes, so the example and explanation have to be concise and straightforward. Any context or motivation for the topic/idea needs to be embedded in the example itself, and does not warrant its own digression. Extended lecture is one of the least effective mechanisms for learning, so the short video format works well. Additionally, the video only has to be made once, and it can be repeated as often as necessary for every student.

Second: Map out the dependencies. Once the video library grew large enough, it had to be organized. Some subjects, such as mathematics, have a clear dependency chain of topics. But all subjects can be broken down and categorized into a network of interdependent topics. This web is not necessarily topologically orderable, but an ordering roughly corresponding to historical development likely minimizes the forward references and digressions to another leaf that occur when two seemingly unrelated topics need to be introduced before they are later unified. The mapping itself has a visualization.

Third: Merit badges for topic understanding. If the discipline itself can be broken down into separate topics, then you can develop exercises that test each individual concept. Mastery of each topic can be measured by performance on the topic exercises. Each time a student can prove they ‘get it’, a merit badge can be awarded that unlocks access to the more abstract, more complex, more involved topics that follow. This ensures that each student builds the skills needed before they continue. Everyone can develop mastery at their own pace, and we don’t have to keep a strict schedule. Understanding and development become the focus while examination and assessment are de-emphasized. You can visualize your achievements and progress in the topic dependency visualization.

Fourth: Lack of penalization for failure. Each student is allowed to develop at their own pace, watch the topic video as many times as necessary, and try the exercises as many times as necessary until they ‘get it’. No ‘Fail’ or other negative mark appears on your permanent record just because you were unable to keep pace with the rest of the class and the teacher’s schedule. You are free to go back and review topics that you have forgotten, and will likely do so because the video is less than 15 minutes. Understanding the material to collect merit badges becomes its own reward.

That sounds like a wonderful system! And it works out quite nicely for subjects that have short, simple exercises that can assess individual isolated topics. But for subjects other than math/science I’m not sure that it works quite as well. There might be issues scaling this educational framework.

Apply the above organization to Computer Science, and we get something like Codecademy. It doesn’t have a graphical visualization of the dependency map of topics, but it does have a way to track and share your progress. The exercises are simple, but I find them too knowledge-based. Yes, you can easily learn how for loops work, and the interface encourages exploration. But it doesn’t teach the most important part of programming skill: design trade-offs and decision making.

An automated system is pretty good at teaching the basics: types, variables, memory, assignments, loops, conditionals, etc. Yes, it can be gamed and deceived, but if you do that you probably have more understanding than the automated assessment expects. But it’s not good at giving design feedback. The exercises don’t scale. It doesn’t have a way for me to learn the GoF Design Patterns. It doesn’t have a library of common solutions for common problems in that particular language. It doesn’t tell me that I have to practice good indentation and naming lest I confound myself. It doesn’t tell me that I should make many small functions with good names, that I should handle edge cases first to get them out of the way, that I should prefer local variables over global variables. In short, it allows me to acquire coding habits (like use of global variables and flags that control loops) that work on small exercises but will ultimately prevent me from becoming a professional developer.

If we are going to automate our education by breaking it up into small pieces, we have to make sure that we also teach how those pieces interconnect. It’s not enough to know all the different Lego pieces; I also have to know how to assemble them. Unfortunately, design issues are much fuzzier. The trade-offs are trickier. The feedback is more about understanding explanations and reasons why than about rote knowledge and brute-force logic.

People tend to ignore their designs until they run into maintenance problems, and only then do they begin to desire and learn about code organization and design. Only when they’ve created complexity beyond their capacity to manage it do they feel a need for a way out. So, our educational system has to have built into it a mechanism for waiting until the student is ready for the knowledge. For example, as Steve Yegge recounts in his reading of Fowler’s book on Refactoring, even smart developers that continually practice their craft can be blind for years before realizing this need.

I’ll quit with a question: How can you automate feedback about design trade-offs and code organization?

Measuring Effectiveness of a Domain Specific Language

erich — Thu, 05 Apr 2012 22:43:50 +0000

Also, at CGO I met Hassan Chafi, who is working on a graph-based Domain Specific Language. Even though I never seem to find time that I can explicitly devote to studying them, DSLs are, to me, a compulsively fascinating topic. A day or so after the discussion it occurred to me that we need some metrics by which a DSL can be measured. Now, in the general purpose language field Wirth has come up with what is, in my opinion, a very elegant metric: language complexity can be measured by the size of the self-hosting compiler. That works great for general purpose languages that have to do string processing, parsing, data structures, traversals, modeling, etc., each of which is a component of the self-hosting compiler. But it works less well for a DSL, because the focus on a particular domain means it isn’t general purpose.

In the case of a DSL for graphs though, I think the case is clear: It should run graph algorithms well. But which ones? And how do you measure expressibility? It took a couple of days for the answer to arrive in my head. I had at one point encountered a wonderful paper, On variants of shortest-path betweenness centrality and their generic computation, by Ulrik Brandes. This paper provides a dozen related graph algorithms. It is presented in a way that emphasizes the changes between the base centrality algorithm and each variant. This style of presentation helps to measure how well the DSL allows similar algorithmic changes.

So I think it’s a good start to answer the general question, “How to measure the effectiveness of a DSL?”, with a case study. Make a list representative of what you wish to do, and try it out, looking for patterns and variations on a theme.

Express yourself: to the compiler and to your fellow developer.

erich — Thu, 05 Apr 2012 05:23:19 +0000

The keynote speaker at CGO 2012 (Chris Lattner, LLVM) put some crazy thoughts into my head.

Want compiler to know about:

memory disjointness
aliasing
Usage of data structures (array of struct vs struct of arrays)
whether arithmetic is done on a pointer (and the bounds)
invariants (in loops and between methods)
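As one concrete instance from this list, here is the array-of-structs versus struct-of-arrays choice sketched in Java. The Particle example and all names are hypothetical, and the sketch makes no performance claim: which layout is faster depends on the runtime and the access pattern. The point is only that the choice is buried in the shape of the declarations, with no way to state the intent.

```java
// Two layouts for the same particle data. All names are hypothetical.
class Layouts {
    // Array of structs: one object per particle, its fields travel together.
    static final class Particle {
        final double x;
        final double mass;

        Particle(double x, double mass) {
            this.x = x;
            this.mass = mass;
        }
    }

    static double totalMassAoS(Particle[] particles) {
        double sum = 0;
        for (Particle p : particles) {
            sum += p.mass;
        }
        return sum;
    }

    // Struct of arrays: one primitive array per field.
    static final class Particles {
        final double[] x;
        final double[] mass;

        Particles(int n) {
            x = new double[n];
            mass = new double[n];
        }
    }

    static double totalMassSoA(Particles ps) {
        double sum = 0;
        for (double m : ps.mass) {
            sum += m;
        }
        return sum;
    }
}
```

Both functions compute the same sum, but the second touches only the mass array while the first walks through every Particle object to reach one field.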

A language needs to be able to express some of these concerns. Not just because the analysis within the compiler benefits from having the data, but because programmers themselves should be documenting these properties. A great programmer knows about the analysis the compiler can perform. A great programmer knows about the assumptions that such analysis requires. And a great language supports the great programmer by allowing her to express these properties within the code itself.

Creating a language that supports these higher-level descriptions allows other programmers to see why a certain portion of code is structured the way that it is. It keeps them from innocently restructuring the code so that the compiler’s analysis fails (and performance is lost). It makes clearer what you shouldn’t say as a programmer.

I’m not familiar with Eiffel, so I may be completely out of place with this example. But it seems to me that Eiffel’s choice to allow the programmer to express explicitly, in source, the invariants of their programs has two distinct benefits: (1) the compiler has more information to work with during program analysis, (2) the programmers are encouraged to think more deeply about their code’s structure. Much compiler research has tried to investigate “How much can we do this automatically (so that we don’t have to change existing code and so that programmers don’t have to learn anything new)?” I’m using Eiffel as an example of why these objectives are actually harmful. In my experience as a programmer, knowing more has always helped. I desire a language that encourages me to know more, and the best encouragement is to have linguistic support for describing higher-level properties. In Eiffel’s case it’s program invariants, but why not also include some of those things in the list above?

The compiler shouldn’t be a magic black box! It should be a tool that yields increasing benefit the more a programmer devotes to learning how to use it. The benefits should build on each other incrementally, so that even though complete mastery takes 10 years, each incremental step is worth taking.

Information Uncertainty Principle

erich — Mon, 26 Mar 2012 01:35:15 +0000

Through Denning’s Presentation Great Principles of Computing I heard of this fascinating tale regarding Buridan’s ass.

It refers to a hypothetical situation wherein an ass is placed precisely midway between a stack of hay and a pail of water. Since the paradox assumes the ass will always go to whichever is closer, it will die of both hunger and thirst since it cannot make any rational decision to choose one over the other.

What’s more interesting than the philosophy behind the donkey’s dilemma, are the actual applications:

Leslie Lamport calls this result Buridan’s principle, and states it as: A discrete decision based upon an input having a continuous range of values cannot be made within a bounded length of time.
…
A version of Buridan’s principle actually occurs in electrical engineering. Specifically, the input to a digital logic gate must convert a continuous voltage value into either a 0 or a 1 which is typically sampled and then processed. If the input is changing and at an intermediate value when sampled, the input stage acts like a comparator. The voltage value can then be likened to the position of the ass, and the values 0 and 1 represent the bales of hay. Like the situation of the starving ass, there exists an input on which the converter cannot make a proper decision, resulting in a metastable state. Having the converter make an arbitrary choice in ambiguous situations does not solve the problem, as the boundary between ambiguous values and unambiguous values introduces another binary decision with its own metastable state.

I think that it might be possible to turn this into an Information uncertainty principle, analogous to the physicist’s quantum uncertainty principle.

Specifically, an observer of a Turing machine calculation can only be in one of two scenarios:

uncertain about the decision a computation will take within a given (determined) time, or
certain about the decision, but uncertain about the time it takes to decide.

If the knowledge trade-off between two extremes can be quantified, in the same way that ħ quantifies the quantum uncertainty trade-off between position and momentum, we might really have something substantial!
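For reference, the quantum relation alluded to is usually written as the first line below. The second line is only a placeholder for the shape such an analogue might take: the symbols \Delta d (uncertainty about the decision), \Delta t (uncertainty about the time to decide), and the constant \kappa are hypothetical notation, and nothing here shows that a bound of this form actually holds.

```latex
\Delta x \,\Delta p \;\ge\; \frac{\hbar}{2}
\qquad \text{(position and momentum)}

\Delta d \,\Delta t \;\ge\; \kappa
\qquad \text{(conjectured: decision and time to decide)}
```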

Segregate Third-Party JS Libraries

erich — Mon, 12 Mar 2012 20:50:54 +0000

Typically, web authors simply load whatever library they’d like to use with full trust. In JS, such loading amounts essentially to a #include. I’m flabbergasted that this practice remains normal. It could be paranoia, but even without invoking all the security concerns, I’d be reluctant to include other people’s code simply because of the potential for a naming conflict on a global variable.

But, recently, in reviewing a new paper about JS information flow security, I had an interesting thought. (I should credit this thought to my co-worker, Christoph Kerschbaumer, because it was his phrasing of a bullet point that made me question existing practice.) Why don’t we use iframes to segregate the included code?

Currently, that can’t be done because the inter-iframe communication channel is hideous: it uses fragment identifiers in a URL to re-navigate a frame, which continuously polls the document URL looking for that update. Of course, re-navigation has its own security issues [Securing Frame Communication in Browsers]. This mechanism just operates too slowly and unreliably to use with a library.

Additionally, there’s a strong reason to keep iframes completely separate from each other: advertisements. Where a JS library is third-party code that you want to use, advertisement code is something that you expressly don’t want polluting your namespace or accessing your variables and memory. Because of syndication, it’s also code that you shouldn’t trust in any way. iframes currently provide the best mechanism to segregate an advertisement from the rest of the page.

But what if the iframes came equipped with a channel specifically meant for inter-frame communication? This channel should, of course, come with a mechanism to attach security monitors so that each use can be customized for an appropriate level of security. The presence of the channel should also act as a telephone: that is, JS on either side should be able to opt out and not answer a request for communication from the transmitter. The channel will need to be high-bandwidth, so that it can be used with libraries. And the segregation can use either message-passing semantics [good for memory separation (parallelism), might cost copies during communication (cf. Erlang for how to optimize this)] or reference-passing semantics [less communication overhead, bad for memory separation, potential for abuses: library might access complete data structure via referential transitivity].

I think the advantages here are pretty cool:

If the browser can execute iframes in separate memory spaces, perhaps this could be a way to have parallel code execution.
The iframe keeps untrusted JS behind a monitored firewall.
The ability to attach a security monitor to the channel should mollify most complaints about cross-iframe script contamination.
Talking to an iframe can have its own linguistic support by creating a JS object with overloaded getter/setter that wraps the communication. Calling library methods should look nearly the same as before, modulo a prefix for the interface (which is good, more like including a module).

Update Tue Mar 13 20:45:34 PDT 2012: Christoph alerted me that I have to be more careful in my reading. Indeed, Barth’s paper (cited above) actually contains exactly this proposal, as well as an implementation (adopted in WebKit), postMessage. The remaining difficulty lies in library adoption. Many JS libs now have to be rewritten to use the new communication mechanism (or clients have to care enough to write wrappers).