Artifact Evaluation for Software Conferences

Shriram Krishnamurthi

First version: 2012-10-14. Revision: 2013-11-13.

If you don't know what artifact evaluation is about, skip ahead. In a month or two I will thoroughly reorganize this page with lots more information from other artifact evaluation events, at which point this will break up into multiple interlinked pages that will continue to reside here.

Proposed AEC Structure Guidelines

A team consisting of Matthias Hauswirth, Steve Blackburn, Jan Vitek, and me (with much of the impetus from Matthias) has tried to codify guidelines that we believe reflect good practice for structuring the chairing of AECs. The goal of these guidelines is to create a process that is smooth and transfers knowledge across years:

Selection of the observer (and thus future co-chair) should consider these guidelines:

Additionally, we recommend that program chairs of the associated conferences consider putting one of the two co-chairs [I, C] on the PC, possibly with a reduced reviewing load. This enables the co-chair to participate in the PC process and take notes on the expectations set by the paper.

What Is It and Why Do We Need It?

For many years, some of us in programming languages and software engineering have been concerned by the insufficient respect paid to the artifacts that back papers. We find it especially ironic that areas so centered on software, models, and specifications would not want to evaluate them as part of the paper review process, or archive them with the final paper. Not examining artifacts enables everything from mere sloppiness to, in extreme cases, dishonesty. More subtly, it also penalizes the people who take the trouble to rigorously implement and test their ideas.

In 2011, Andreas Zeller, the program chair for ESEC/FSE, decided to institute a committee to address this problem. Andreas asked Carlo Ghezzi and me to run this process. This document reports on our experiences but, because it includes personal opinions, I have written this document in the first person. Be aware, however, that all the work was jointly done with Carlo, Andreas, and the AEC members. Very, very special thanks go to the committee members who, as this document describes, went well beyond the call of duty.

An aside on naming. I have long wanted to create such a committee and call it the “Program Committee” (ha, ha). However, not only is that name taken, we also wanted to be open-minded to all sorts of artifacts that are not programs (not only models but also data sets, etc.). We therefore called this the Artifact Evaluation Committee (AEC). Someone will surely come up with a better name for this eventually.

My overall view is that, despite some concerns, the process was well received by the community, considered important by the program committee, well participated in, and useful. I provide more information on each of these below. Owing to its initial success, I strongly encourage the continued use of such a process; in particular, I encourage future chairs to adopt the AEC structure we used, or to modify it with caution.

Design Criteria

For several months before the deadline, we consulted with several software engineering community leaders about the wisdom of having an AEC. Most responded positively; a few were tepid; a small number were negative and gave us constructive feedback. Here are some of the most prominent issues that people raised, which I present without judgment:

It became clear that there was a strong desire to be conservative in the design of this process, at least initially. We therefore decided that, in addition to treating artifacts with the same confidentiality rules as papers (as we had always intended), they did not need to be made public: it was sufficient if only the AEC saw them. (Obviously, we encouraged authors to upload them to the supplement section of the ACM DL and/or make them public on their own sites.)

We also made two especially important decisions:

  1. We chose to erect a Chinese Wall between the paper and artifact evaluation processes: that is, the outcome of artifact evaluation would have no bearing at all on the paper's decision. The simplest way to assure the public of this was via temporal ordering: that is, artifact submission and evaluation began only after paper decisions had been published. Confident authors could, of course, provide artifact links in their papers, as some already do, but they were not required to do so.
  2. The outcome of artifact evaluation for individual papers would be reported not by us but by the authors, who had a choice of suppressing this information in case of a negative review.

These decisions seemed to reassure several people that the process would truly be a conservative extension of current review mechanisms, and would not adversely affect the conference.

AEC Composition

We decided to populate the AEC entirely with either PhD students or very recent graduates (post-docs). We contacted a small set of researchers who we felt represented the standards we wished to enforce, and asked them to suggest students. Almost all replied affirmatively (a handful even offering themselves in addition!). Our reasoning for using the student population was four-fold:

  1. they may be more familiar with the mechanics of modern-day software tools;
  2. they may have more time to conduct detailed experiments with the artifacts;
  3. they might be more responsive to the short deadlines; and,
  4. participating in the review process might give them some perspective on the importance of artifacts, and influence the people likely to become the next generation of leaders.

This was perhaps our most controversial decision beyond that of having an AEC at all. I admit that I am very cynical about the ability of regular PC members to perform the artifact evaluation on their own. Many of them fail on the first three criteria above (and do not contribute to the fourth). The first criterion is the most important, and it makes no sense to rule someone out of the PC because they lack this strength, which is what is implied by making PC members perform artifact evaluation.

I have also seen too many PC members simply delegate their review work to others and then pass it along, even unread. This often leads to friction at PC meetings, typically followed by an embarrassed, “Yes, this review was too (positive/negative), I should not have passed it along. But let me read the paper tonight and form my own views.” PC members are entirely capable of reading the papers for themselves overnight and truly forming their own views. If they cannot do the same with the artifacts, then we end up with a proxy discussion that sheds little light. Since most of the artifacts will be evaluated by graduate students anyway, having the students discuss the artifacts directly with one another seems far preferable, and also gives those students more direct credit.

In practice, the AEC exceeded even our high expectations. When's the last time you delegated part of the process to the committee, and they organized themselves and efficiently converged on a good decision? When's the last time you gave 72 hours for bidding and had all the bids done in 48 hours? When's the last time you sent mail to your committee saying “The following four submissions have insufficient reviewers” and within minutes got multiple replies saying, “Then treat those as my bid!”? Reviews were being submitted even before all the artifacts had been uploaded! Therefore, we cannot recommend the use of young researchers strongly enough. And once again, these are the people who made Carlo and me look good.

Artifact Review Process

Every artifact was reviewed by two committee members. They were given the accepted version of the paper to read before examining the artifacts. They were then asked to individually submit a detailed review along these lines:

In retrospect, it would also have been useful to indicate (as some reviewers already did)

so any problems found could more easily be reproduced by the authors.

We asked for an assessment at one of five levels:

  1. Significantly exceeded expectations
  2. Exceeded expectations
  3. Met expectations
  4. Fell below expectations
  5. Significantly fell below expectations

If there was a difference in overall assessment between the two reviewers, we asked them to reconcile the difference between themselves. This process went smoothly and only in rare cases were they unable to agree (and even then, their assessments were off-by-one). At this point the chairs helped break ties. This decision was communicated to the authors, along with the reviews.

We conveyed decisions about a week before the camera-ready deadline so that authors could include it in their paper. We stipulated nothing about how (or even whether) they reported it, leaving the decision to them. If they chose to mention it, they could do so in the abstract, the introduction, as a separate section, as a rosette icon on their paper, or indeed anything else.

One complication with the current AEC structure, which runs after papers have been accepted, is that reviewers were required to base their views on the paper as is. Sometimes reviewers were unhappy with the paper itself. They needed to be reminded that they could not allow this to bias their views too much: we were charged only with deciding whether the artifact met the expectations set by the paper, no matter how low or incorrect those might be! This is of course a difficult emotional barrier to overcome.

Conflicts of Interest

In general, we treated conflicts of interest precisely as we would for regular conference papers.

One issue we failed to consider is who could submit artifacts. If artifacts become part of the evaluation process of papers, the status of AEC members becomes akin to that of PC members. Since we went with a two-phase approach, where paper decisions were made independently of and before artifact evaluation, we decided that all authors except those with direct conflicts with the chairs could submit artifacts. This meant, however, that the conflicted authors' papers would not be able to indicate their artifact quality, which might make readers question it. We therefore informed such authors that they were welcome to mention in their paper that they were prohibited from submitting an artifact due to the conflict of interest.

Packaging

We published detailed packaging guidelines. In retrospect, we should have asked the authors to set up a Web page with links to everything. Because we did not, many authors sent us a link to an archive file, but then sent instructions (often quite complex) by email.

The artifact packaging instructions say that authors should

make a genuine effort to not learn the identity of the reviewers through logs. This may mean either turning off analytics or creating a fresh, analytics-free instance, or only referring to sites with high enough usage that AEC accesses will not stand out.

This instruction should apply to all forms of download, not only for live running instances.

We did not tackle subtle questions such as what to do with results that take many machine-months to compute, or require proprietary hardware that cannot easily be made Web-accessible, and so on. We leave these as challenges for future consideration.

Details from ESEC/FSE 2011

A Prize!

Andreas Zeller was able to secure a cash prize of USD 1000 from Microsoft Research for a Distinguished Artifact, chosen by the AEC members. We chose to not call this the “Best” artifact since that term is too subjective and implicitly makes a value judgment that may be unfair to other artifacts. In principle it should be possible to have several Distinguished Artifacts, though we chose only one.

Artifact Submissions

We contacted the authors of all 34 accepted papers, inviting them to submit an artifact. Of these, only 21 replied. Of those who replied:

Our message was ambiguous: it set June 8 as the deadline for response but did not make clear that this was also the deadline for submitting the artifact. Therefore, authors were given an extension until June 12. Eleven authors submitted by June 8, and the rest a few days later. (On the other hand, the ambiguity meant some authors thought they had much longer than they actually did to prepare the artifact; some thought it was only due in August. The count we received is thus arguably a lower bound on the artifacts we would have received with a clear deadline.)

In short, out of 34 accepted papers, 16 indicated they would submit an artifact and 14 actually did so. With better deadline announcements, presumably about 18 would have submitted artifacts; this is over half the papers! This ratio indicates widespread interest in submitting artifacts for evaluation.

Evaluation Outcome

Of the 14 submissions, 7 met or exceeded expectations, while the rest, sadly, did not. Some of the failures were easy to fix; indeed, within a day or two of receiving the decision the authors had created a fresh submission addressing what the AEC found (in one case, the student had simply submitted the wrong version), and asked for reconsideration. Unfortunately, we had to inform them that the AEC had been disbanded and it would be unfair to ask it to re-convene. All but one author accepted the AEC's negative decision with grace (indeed, some thanked us for finding problems in their distribution).

Only one author was dissatisfied with the AEC outcome. Though our exchange was pleasant, we were unable to arrive at a mutually satisfactory conclusion. He did, however, provide the following useful suggestion: “My conclusion is that it may be a good idea to request authors to submit a brief description of what they think the artifacts are that are valuable for the community”. This would indeed be a useful component of the artifact package, because it provides an “interface specification” and, like all good interfaces, helps identify mismatches early.

In turn, in my opinion, one AEC member was excessively harsh on the artifacts. For instance, on one submission he remarked, “In its present state, the tool could not be adopted by software engineers” and listed several concrete usability problems. While these were all genuine problems from the perspective of a product, we do not usually expect a research artifact to stand up to these standards. This felt no different from the excessive harshness that some PC members exhibit, and required the same moderation from chairs.

Scaling to All Papers

To make the AEC outcome part of the paper decision process, we would need to consider artifacts submitted with all papers. Is this feasible?

It is unlikely that the same ratio will hold in the general population (one presumes that papers with artifacts suitable for dissemination are more likely to have strong experimental results, and thus more likely to be accepted). Still, even if only a quarter of all submissions were accompanied by artifacts, this would still mean about fifty submissions, roughly four times the current load. We believe it is entirely reasonable to double the load per reviewer (reviewers were assigned 2-3 artifacts each this time), especially with significantly more time. Similarly, the committee (currently only 12 people, in contrast to a paper PC of 27) could be made larger. Through these measures, it should be possible to allow all authors to submit artifacts, to be evaluated in parallel with the papers.
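The load arithmetic above can be sanity-checked with a quick sketch. The scaled committee size of 24 below is my illustrative assumption (the text only says the committee "could be made larger"); the other numbers are from this report.

```python
# Back-of-the-envelope check of the reviewer-load arithmetic.
reviews_per_artifact = 2  # each artifact received two reviews

# ESEC/FSE 2011 as run: 14 artifacts, 12-person AEC.
current_load = 14 * reviews_per_artifact / 12   # artifacts reviewed per person

# Hypothetical all-papers scenario: ~50 artifacts (a quarter of submissions),
# with an assumed committee doubled to 24 members.
scaled_load = 50 * reviews_per_artifact / 24

# Doubling the per-reviewer load would permit up to ~4.7, so ~4.2 fits.
print(round(current_load, 1), round(scaled_load, 1))  # 2.3 4.2
```

In other words, doubling both the committee and the per-reviewer load comfortably covers a four-fold increase in artifacts.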

Of course, the artifacts that accompany weaker papers will presumably not be packaged as well, which may consume more time and lead to more frustration. One approach would be to delay the start of artifact evaluation. Suppose that in the two-month review period, all papers are expected to get two reviews in the first month. At this point many papers already fall below threshold. The artifacts of these papers presumably need not be reviewed at all.

One virtue of evaluating all papers is that the AEC and PC can work cooperatively. In particular, as a PC member reads the paper, she can send questions to the AEC about issues that were not covered by the authors but whose outcome would affect her decision. In turn, I can even imagine the AEC making observations that may suggest issues that PC members had not considered from the paper text.

Program Committee Survey

The program committee meeting discussed 58 papers. After a decision had been made about each paper, we asked all the reviewers to answer the following two questions:

Each reviewer voted independently of the others, and we did not attempt to form a single, unified opinion per paper. Though the intent was to vote quickly and move on to the next paper, in a few instances there was a brief, spontaneous discussion between reviewers that occasionally caused some to change their vote.

The outcome of this survey is as follows. (Of course, these numbers may be slightly inflated: the presence of Andreas, Carlo and me in the room may have made people vote more positively.)

First, let's see whether the reviewers thought the paper even had an artifact to review. In 51 of the papers, all reviewers felt that it did. For six papers, two reviewers felt it did while one reviewer felt it didn't. On only one paper did a majority of reviewers feel it did not, and for that paper, all three reviewers agreed there was no artifact to review. (Incidentally, that paper was rejected.)

Next, let's look at how reviewers regarded the utility of artifact evaluation. On 29 papers, all reviewers felt it would have played a significant role. On 6 papers, two felt significance and the third felt utility; on 5 more, one felt significance and the other two felt utility. On another 9 papers, all felt utility. On 3 more papers, two felt utility and the third felt it was of no use. This leaves only six papers! One of these is, of course, the paper on which the reviewers felt there was no artifact. Thus, there were only five papers on which the overall value of artifact evaluation was deemed to not be at least useful by a majority of reviewers.
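These tallies can be checked mechanically. The sketch below just encodes the vote patterns reported above (the tuple encoding is mine, not the survey's) and confirms the counts.

```python
# PC survey on artifact-evaluation utility: 58 papers, 3 reviewers each.
# Each entry: (number of papers, votes as (significant, useful, no_use)).
patterns = [
    (29, (3, 0, 0)),  # all three reviewers: significant
    (6,  (2, 1, 0)),  # two significant, one useful
    (5,  (1, 2, 0)),  # one significant, two useful
    (9,  (0, 3, 0)),  # all three: useful
    (3,  (0, 2, 1)),  # two useful, one "no use"
]

total = 58
covered = sum(n for n, _ in patterns)
# In every pattern above, a majority (2 of 3) voted "useful" or better.
majority_at_least_useful = all(sig + use >= 2 for _, (sig, use, _) in patterns)
remaining = total - covered  # papers lacking such a majority

print(covered, remaining, majority_at_least_useful)  # 52 6 True
```

Subtracting the one paper deemed to have no artifact from the six remaining papers leaves the five papers mentioned above.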

I also examined whether there was a difference in estimation of utility between accepted and rejected papers. There was no perceptible difference: the spread of utilities was almost exactly the same, in keeping with the ratio of the number of accepted and rejected papers themselves. It is not clear whether this should be surprising.

Finally, as mentioned, there was sometimes a brief discussion about artifact evaluation for each paper. Some of the comments recorded during this discussion include the following (in parentheses, the paper's outcome):

We note that most of these comments are for rejected papers, presumably because on accepted papers the paper already contained enough to convince the reviewers, but for rejected papers the evaluation would have helped answer open questions.

Conclusion

I hope the data from 2011 show that there was real appreciation from both the PC and authors for the AE process. Unfortunately, neither ESEC 2012 nor FSE 2012 appears to have repeated our experiment in this form. However, Jan Vitek and I were discussing this, and Jan was sufficiently excited that he is implementing it, mostly unchanged, for ECOOP 2013. With luck, therefore, the process will live on through other conferences.