Artifact Evaluation for Software Conferences
What is Artifact Evaluation?
See the Web site for the process.
This page only reports on the experience from the first major software conference artifact evaluation process, run for ESEC/FSE 2011.
What follows is meant to be a report about the process. Though it contains actual data and real quotes, it also includes some of my opinions. These should be easy to tell apart.
Details from ESEC/FSE 2011
Andreas Zeller was able to secure a cash prize of USD 1000 from Microsoft Research for a Distinguished Artifact, chosen by the AEC members. We chose to not call this the “Best” artifact since that term is too subjective and implicitly makes a value judgment that may be unfair to other artifacts. In principle it should be possible to have several Distinguished Artifacts, though we chose only one.
We contacted the authors of all 34 accepted papers, inviting them to submit an artifact. Of these, only 21 replied. Of those who replied:
- Two said they did not believe they had an artifact to submit.
- One indicated a desire to submit but said the lead student was busy with exams so they could not prepare the artifact in time.
- One more indicated a desire to submit but said the software was not in a suitable state to distribute for reproduction. We note that this team has been writing papers about versions of this tool for three years now, with no public software releases.
- One intended to submit but was unable to get the packaging done in time for the deadline.
- One intended to submit but suffered a catastrophic disk crash and was therefore unable to create a distribution.
- One was unable to get permission in time from lawyers to prepare a distributable binary.
Our message was ambiguous: it set June 8 as the deadline for response but did not make clear that this was also the deadline for submitting the artifact itself. Authors were therefore given an extension until June 12. Eleven authors submitted by June 8, and the rest a few days later. (On the other hand, the ambiguity meant some authors thought they had much longer than they actually did to prepare the artifact, with some believing it was only due in August, so the count we received is arguably a lower bound on the artifacts we would have gotten with a clearer announcement.)
In short, out of 34 accepted papers, 16 indicated they would submit an artifact and 14 actually did so. With clearer deadline announcements, presumably about 18 would have submitted artifacts. That is over half the papers! This ratio indicates that there is widespread interest in submitting artifacts for evaluation.
Of the 14 submissions, 7 met or exceeded expectations, while the rest, sadly, did not. Some of the problems were easy to fix; indeed, within a day or two of receiving the decision, the authors had created a fresh submission addressing what the AEC found (in one case, the student had simply submitted the wrong version) and asked for reconsideration. Unfortunately, we had to inform them that the AEC had been disbanded and it would be unfair to ask it to re-convene. All but one author accepted the AEC's negative decision with grace (indeed, some thanked us for finding problems in their distribution).
Only one author was dissatisfied with the AEC outcome. Though our exchange was pleasant, we were unable to arrive at a mutually satisfactory conclusion. He did, however, provide the following useful suggestion: “My conclusion is that it may be a good idea to request authors to submit a brief description of what they think the artifacts are that are valuable for the community”. This would indeed be a useful component of the artifact package, because it provides an “interface specification” and, like all good interfaces, helps identify mismatches early.
In turn, in my opinion, one AEC member was excessively harsh on the artifacts. For instance, on one submission he remarked, “In its present state, the tool could not be adopted by software engineers” and listed several concrete usability problems. While these were all genuine problems from the perspective of a product, we do not usually hold a research artifact to such standards. This felt no different from the excessive harshness that some PC members exhibit, and required the same moderation from the chairs.
Scaling to All Papers
To make the AEC outcome part of the paper decision process, we would need to consider artifacts submitted with all papers. Is this feasible?
It is unlikely that the same ratio will hold in the general population (one presumes that papers with artifacts suitable for dissemination are more likely to have strong experimental results, and thus be more likely to be accepted). Still, even if only a quarter of all submissions were accompanied by artifacts, this would still mean about fifty artifacts, nearly four times the current load. We believe it is entirely reasonable to double the load per reviewer (this year, reviewers were assigned 2-3 artifacts each), especially with significantly more time. Similarly, the committee (currently only 12 people, in contrast to a paper PC of 27) could be made larger. Through these measures, it should be possible to allow all authors to submit artifacts, to be evaluated in parallel with the papers.
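As a rough sanity check on this scaling argument, here is a back-of-the-envelope sketch. The figure of 3 reviews per artifact and the enlarged committee size of 20 are my assumptions for illustration, not numbers from the report:

```python
# Back-of-the-envelope reviewer load. Assumptions (not stated in the report):
# each artifact gets 3 reviews; the committee grows from 12 to 20 members.
def load_per_reviewer(artifacts: int, reviews_each: int, committee: int) -> float:
    """Average number of artifacts each committee member must review."""
    return artifacts * reviews_each / committee

load_2011 = load_per_reviewer(14, 3, 12)    # 3.5, near the reported 2-3
load_scaled = load_per_reviewer(50, 3, 20)  # 7.5, roughly double the load
```

Under these assumptions, fifty artifacts with a modestly larger committee lands at about double the per-reviewer load, which is what the text argues is tolerable.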
Of course, the artifacts that accompany weaker papers will presumably not be packaged as well, which may consume more time and lead to more frustration. One approach would be to delay the start of artifact evaluation. Suppose that in the two-month review period, all papers are expected to get two reviews in the first month. At this point many papers already fall below threshold. The artifacts of these papers presumably need not be reviewed at all.
One virtue of evaluating all papers is that the AEC and PC can work cooperatively. In particular, as a PC member reads the paper, she can send the AEC questions about issues that the authors did not cover but whose outcome would affect her decision. In turn, I can even imagine the AEC making observations that raise issues a PC member would not have noticed from the paper text alone.
Program Committee Survey
The program committee meeting discussed 58 papers. After a decision had been made about each paper, we asked all the reviewers to answer the following two questions:
- Did you think this paper had an artifact to review?
- If so, what impact might the AEC have had on the decision?
  - The AEC would have answered important questions
  - The AEC would have been useful but was not essential
  - The AEC was unnecessary
Each reviewer voted independently of the others, and we did not attempt to form a single, unified opinion per paper. Though the intent was to vote quickly and move on to the next paper, in a few instances there was a brief, spontaneous discussion between reviewers that caused some to change their vote.
The outcome of this survey is as follows. (Of course, these numbers may be slightly inflated: the presence of Andreas, Carlo and me in the room may have made people vote more positively.)
First, let's see whether the reviewers thought the paper even had an artifact to review. For 51 of the papers, all reviewers felt that it did. For six papers, two reviewers felt it did while one felt it didn't. On only one paper did a majority feel it did not; in fact, on that paper all three reviewers agreed there was no artifact to review. (Incidentally, that paper was rejected.)
Next, let's look at how reviewers regarded the utility of artifact evaluation. For brevity, call a vote for “would have answered important questions” important and a vote for “would have been useful but was not essential” useful. On 29 papers, all three reviewers voted important. On 6 papers, two voted important and the third useful; on 5 more, one voted important and the other two useful. On another 9 papers, all three voted useful. On 3 more papers, two voted useful and the third unnecessary. This leaves only six papers! One of these is, of course, the paper on which the reviewers felt there was no artifact. Thus, there were only five papers on which a majority of reviewers did not consider artifact evaluation to be at least useful.
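The arithmetic above can be checked with a short tally. The vote triples below are my reconstruction from the counts in the text, not the raw survey data:

```python
# Reconstructed tally of the utility votes, one (paper count, votes) pair
# per group reported in the text. 'I' = "would have answered important
# questions", 'U' = "useful but not essential", 'N' = "unnecessary".
groups = [
    (29, ('I', 'I', 'I')),  # all three reviewers: important
    (6,  ('I', 'I', 'U')),
    (5,  ('I', 'U', 'U')),
    (9,  ('U', 'U', 'U')),
    (3,  ('U', 'U', 'N')),
]

accounted = sum(n for n, _ in groups)   # papers covered by the groups above
remaining = 58 - accounted              # the "six papers" left over

# Papers where a majority of the three reviewers voted at least "useful":
majority_useful = sum(n for n, votes in groups
                      if sum(v in ('I', 'U') for v in votes) >= 2)
```

The tally confirms the text: the five groups account for 52 of the 58 papers, leaving six, and in every one of those 52 a majority of reviewers rated artifact evaluation at least useful.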
I also examined whether there was a difference in estimation of utility between accepted and rejected papers. There was no perceptible difference: the spread of utilities was almost exactly the same, in keeping with the ratio of the number of accepted and rejected papers themselves. It is not clear whether this should be surprising.
Finally, as mentioned, there was sometimes a brief discussion about artifact evaluation for each paper. Some of the comments recorded during this discussion include the following (in parentheses, the paper's outcome):
- "real input data would have been very, very important" (accepted)
- "would have helped determine the standard deviation, which was important" (rejected)
- "would have loved to see an evaluation" (rejected)
- "enthusiastically yes" (rejected)
- "data was the foundation for what they were doing" (rejected)
- "I would have tried their examples!" (rejected)
- "The interview transcripts would have been useful" (rejected)
- "Could have explored more models and scopes" (rejected)
We note that most of these comments are for rejected papers, presumably because for accepted papers the paper already contained enough to convince the reviewers, whereas for rejected papers the evaluation would have helped answer open questions.
I hope the data from 2011 show that there was real appreciation from both the PC and authors for the AE process. Unfortunately, neither ESEC 2012 nor FSE 2012 appears to have repeated our experiment in this form. However, Jan Vitek and I were discussing this, and Jan was sufficiently excited that he is implementing it, mostly unchanged, for ECOOP 2013. With luck, therefore, the process will live on through other conferences.