Artifact Evaluation for Software Conferences

Shriram Krishnamurthi

First version: 2012-10-14. Revision: 2013-11-13. Revision: 2014-06-07.

This page is now obsolete. All the useful and current information from it has been moved to the Artifact Evaluation site. The page remains here purely for historical reasons.

What Is It and Why Do We Need It?

For many years, some of us in programming languages and software engineering have been concerned by the insufficient respect paid to the artifacts that back papers. We find it especially ironic that areas so centered on software, models, and specifications would not want to evaluate them as part of the paper review process, as well as archive them with the final paper. Not examining artifacts enables everything from mere sloppiness to, in extreme cases, dishonesty. More subtly, it also imposes a penalty on people who take the trouble to rigorously implement and test their ideas.

In 2011, Andreas Zeller, the program chair for ESEC/FSE, decided to institute a committee to address this problem. Andreas asked Carlo Ghezzi and me to run this process. This document reports on our experiences, but because it includes personal opinions, I have written it in the first person. Be aware, however, that all the work was done jointly with Carlo, Andreas, and the AEC members. Very, very special thanks go to the committee members who, as this document describes, went well beyond the call of duty.

An aside on naming. I have long wanted to create such a committee and call it the “Program Committee” (ha, ha). However, not only is that name taken, but we also wanted to be open to all sorts of artifacts that are not programs (not only models but also data sets, etc.). We therefore called this the Artifact Evaluation Committee (AEC). Someone will surely come up with a better name for this eventually.

My overall view is that, despite some concerns, the process was well received by the community, considered important by the program committee, well participated in, and useful. I provide more information on each of these below. Owing to its initial success, I strongly encourage the continued use of such a process, and in particular encourage future chairs to either adopt the AEC structure we used or modify it with caution.

Design Criteria

For several months before the deadline, we consulted with several software engineering community leaders about the wisdom of having an AEC. Most responded positively; a few were tepid; a small number were negative and gave us constructive feedback. The issues they raised directly shaped the design decisions described below.

It became clear that there was a strong desire to be conservative in the design of this process, at least initially. We therefore decided that artifacts would be treated under the same confidentiality rules as papers (as we had always intended) and that they did not need to be made public: it was sufficient for only the AEC to see them. (Obviously, we encouraged authors to upload them to the supplement section of the ACM DL and/or make them public on their own sites.)

We also made two especially important decisions:

  1. We chose to erect a Chinese Wall between the paper and artifact evaluation processes: that is, the outcome of artifact evaluation would have no bearing at all on the paper's decision. The simplest way to assure the public of this was via temporal ordering: that is, artifact submission and evaluation began only after paper decisions had been published. Confident authors could, of course, provide artifact links in their papers, as some already do, but they were not required to do so.
  2. The outcome of artifact evaluation for individual papers would be reported not by us but by the authors, who could choose to suppress this information in case of a negative review.

These decisions seemed to reassure several people that the process would truly be a conservative extension of current review mechanisms, and would not adversely affect the conference.

AEC Composition

We decided to populate the AEC entirely with either PhD students or very recent graduates (post-docs). We contacted a small set of researchers whom we felt represented the standards we wished to enforce, and asked them to suggest students. Almost all replied affirmatively (and a handful even offered themselves in addition!). Our reasoning for using the student population was fourfold:

  1. they may be more familiar with the mechanics of modern-day software tools;
  2. they may have more time to conduct detailed experiments with the artifacts;
  3. they might be more responsive to the short deadlines; and,
  4. participating in the review process might give them some perspective on the importance of artifacts, and influence the people likely to become the next generation of leaders.

This was perhaps our most controversial decision beyond that of having an AEC at all. I admit that I am very cynical about the ability of regular PC members to perform artifact evaluation on their own. Many of them fail on the first three criteria above (and do not contribute to the fourth). The first criterion is the most important, and it makes no sense to rule someone out of the PC because they lack this strength, yet that is what making PC members perform artifact evaluation implies.

I have also seen too many PC members simply delegate their review work to others and then pass it along, sometimes even unread. This often leads to friction at PC meetings, followed by an embarrassed, “Yes, this review was too (positive/negative), I should not have passed it along. But let me read the paper tonight and form my own views.” PC members are entirely capable of reading the papers for themselves overnight and truly forming their own views. If they cannot do the same with the artifacts, then we end up with a proxy discussion that sheds little light. Since most of the artifacts will be evaluated by graduate students anyway, having the students discuss the artifacts directly with one another seems far preferable, and it also gives those students more direct credit.

In practice, the AEC exceeded even our high expectations. When's the last time you delegated part of the process to the committee, and they organized themselves and efficiently converged on a good decision? When's the last time you gave 72 hours for bidding and had all the bids done in 48 hours? When's the last time you sent mail to your committee saying “The following four submissions have insufficient reviewers” and within minutes got multiple replies saying, “Then treat those as my bid!”? Reviews were being submitted even before all the artifacts had been uploaded! Therefore, we cannot recommend the use of young researchers strongly enough. And once again, these are the people who made Carlo and me look good.

Artifact Review Process

Every artifact was reviewed by two committee members. They were given the accepted version of the paper to read before examining the artifact, and were then asked to individually submit a detailed review. In retrospect, it would also have been useful to ask reviewers to indicate (as some already did) the environment in which they evaluated the artifact, so that any problems they found could more easily be reproduced by the authors.

We asked for an assessment at one of five levels:

  1. Significantly exceeded expectations
  2. Exceeded expectations
  3. Met expectations
  4. Fell below expectations
  5. Significantly fell below expectations

If the two reviewers differed in their overall assessment, we asked them to reconcile the difference between themselves. This process went smoothly, and only in rare cases were they unable to agree (and even then, their assessments were off by one). At that point the chairs helped break ties. The final decision was communicated to the authors, along with the reviews.

We conveyed decisions about a week before the camera-ready deadline so that authors could include the outcome in their papers. We stipulated nothing about how (or even whether) they reported it, leaving that decision to them. If they chose to mention it, they could do so in the abstract, in the introduction, in a separate section, with a rosette icon on the paper, or in any other way they saw fit.

One complication with the current AEC structure, which runs after papers have been accepted, is that reviewers were required to base their views on the paper as is. Sometimes reviewers were unhappy with the paper itself. They needed to be reminded that they could not allow this to bias their views too much: we were charged only with deciding whether the artifact met the expectations set by the paper, no matter how low or incorrect those might be! This is, of course, a difficult emotional hurdle to overcome.

Conflicts of Interest

In general, we treated conflicts of interest precisely as we would for regular conference papers.

One issue we failed to consider was who could submit artifacts. If artifacts become part of the evaluation process of papers, the status of AEC members becomes akin to that of PC members. Since we went with a two-phase approach, where paper decisions were made independently of and before artifact evaluation, we decided that everyone except those with direct conflicts with the chairs could submit artifacts. This meant, however, that the conflicted authors' papers would not be able to indicate their artifact quality, which might make readers question it. We therefore informed such authors that they were welcome to mention in their paper that they were prohibited from submitting an artifact due to the conflict of interest.

Packaging

We published detailed packaging guidelines. In retrospect, we should have asked the authors to set up a Web page with links to everything. Because we did not, many authors sent us a link to an archive file, but then sent instructions (often quite complex) by email.

The artifact packaging instructions say that authors should

make a genuine effort not to learn the identity of the reviewers through logs. This may mean turning off analytics, creating a fresh, analytics-free instance, or referring only to sites with high enough usage that AEC accesses will not stand out.

This instruction should apply to all forms of download, not only to live running instances.

We did not tackle subtle questions such as what to do with results that take many machine-months to compute, or require proprietary hardware that cannot easily be made Web-accessible, and so on. We leave these as challenges for future consideration.

Details from ESEC/FSE 2011

These now live on their own page.