Hi! I'm Danaë, and this page is a summary of my DREU research experience this summer. I'm working with Professor Michael Bernstein at Stanford University in the Human-Computer Interaction Group (based in the Gates Building, pictured above). Below you'll find more detailed information about me and my time here:
My Stanford graduate student mentor for this project is Niloufar Salehi. Niloufar just finished her first year as a Ph.D. student at Stanford, and she's been a huge help, offering guidance and advice whenever I need it, and letting me figure things out on my own the rest of the time.
Gili Rusak is a rising senior at Shaker High School in New York. She may still be in high school, but she already has experience doing research, and she's a very capable and motivated project partner.
There is a huge disconnect between the content we see on the web and people's actual lived experiences. Searching for "chest pain" on Google, for example, returns pages of results focusing on severe medical issues, though most people who experience momentary chest pain only suffer a temporary annoyance. Similarly, searching for pages related to a topic such as smartphones might lead an Internet user to mistakenly conclude that a large proportion of smartphone users own Blackberry or Windows Phone devices, though this is far from the truth.
This is the dark matter of the web: the vast majority of activity that goes undocumented, overshadowed by the small subset that is archived by search engines such as Google. Our goal for this project is to quantify that dark matter. Through a combination of web crawl data, crowdsourced annotation, and data science techniques, we hope to better understand the extent to which the web accurately represents us, as opposed to a thin, widely disseminated slice of our collective experiences.
This first week was mostly coding. The first phase of this project is to crawl the web, beginning with a seed search query and collecting URLs relevant to a topic by following links and other search engine queries - a guided/random walk of the web. To that end, I've been working with Scrapy, an open-source web scraping framework for Python.
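To make the idea of a guided/random walk concrete, here is a minimal sketch in plain Python. It is not the project's actual Scrapy crawler: the in-memory link graph `TOY_WEB`, the function `guided_random_walk`, and the `is_relevant` filter are all hypothetical stand-ins for real pages, real link extraction, and a real relevance check.

```python
import random

# Hypothetical toy "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["seed"],
}


def guided_random_walk(links, seed_url, is_relevant, max_steps, rng):
    """Walk the link graph from seed_url, collecting relevant URLs.

    At each step the walker records the current URL (once, and only if
    is_relevant accepts it), then hops to a randomly chosen outgoing
    link. At a dead end it restarts from the seed.
    """
    collected = []
    seen = set()
    current = seed_url
    for _ in range(max_steps):
        if current not in seen:
            seen.add(current)
            if is_relevant(current):
                collected.append(current)
        neighbors = links.get(current, [])
        # Assumption: restarting at the seed is one simple way to
        # recover from a page with no outgoing links.
        current = rng.choice(neighbors) if neighbors else seed_url
    return collected


# Example run: treat every page except "d" as on-topic.
urls = guided_random_walk(
    TOY_WEB,
    seed_url="seed",
    is_relevant=lambda url: url != "d",
    max_steps=50,
    rng=random.Random(0),
)
```

In a real crawl, the dictionary lookup would be replaced by fetching the page and extracting its links, and the relevance check would look at page content rather than the URL string.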
Our web crawler had been developed in advance by Lucas Throckmorton, a Stanford undergraduate, so I spent the week refactoring that code and making it generalizable. I altered the crawler's algorithm to accommodate different topics, and changed the structure of the data processing pipeline. By the end of this week, I was completely familiar and comfortable with the code base that underlies the first phase of this project.
P.S. In addition to work, I've been adjusting to living here for the summer, an easy transition - Stanford is beautiful, and the weather is always perfect!
In my second week, I began data collection. I set up an EC2 instance to run the crawler remotely in the cloud, a process that turned out to be a somewhat unexpected challenge. In the process of trying to configure the EC2 instance to run the crawler, Niloufar and I apparently/probably/somehow managed to break the instance so that it was inaccessible... A bit of a problem, since we were not the only team using that instance! At this point everyone's data has been recovered, but we felt bad, since our mistake definitely caused some extra work for some others in the lab.
Aside from the adventures with EC2, Gili started work this week and together we began finding "ground truth" statistics and brainstorming topics for our data collection. We also started piping crawl data into crowdsourcing tasks, running small jobs to learn crowdsourcing task design.
Week 3 was a continuation of data collection, as well as our first foray into reading related work. Gili and I have continued to create and run crowdsourcing jobs, including our first job at a larger scale, in which workers analyzed 1000 websites for us. Crowdsourcing, it turns out, is a much more complicated and delicate process than coding. I have prior experience with crowdsourcing (see the Faculty Dataset project with Professor Jeff Huang at Brown), so I already knew this to be true, but this project has definitely confirmed my previous experience. Unlike writing code, when you get something wrong in crowdsourced work, you get a lot of angry people yelling at you in all-caps through your computer screen, something neither Gili nor I particularly enjoyed (though at times it was a bit amusing!). On the other hand, getting it right and managing to communicate effectively with a large, varied group of anonymous people definitely feels exciting.
At the end of the week we had our first Hackathon: in the HCI lab here, 2-5pm on Fridays are designated "hackathons," when the entire lab meets up to work in the same room. At the beginning of the hackathon, each person states a goal for the afternoon, and there's candy and good company throughout, as well as a reward at the end! Our first hackathon was a great success, and I can't wait for next week's!
This week we worked on collecting data for more topics. We're starting to see a trend in the numbers we've collected, so the next few steps are an iterative strategy of identifying features of interest in the data and then expanding those by collecting more data that we expect might fall in line with or unsettle the patterns we're seeing so far.
This week is notable because we came across a big issue late in the week: as with all crowdsourcing work, there's a chance that so-called "lazy" or "eager-beaver" workers will produce low-quality work. As it turns out, due to a combination of our being overly trusting and our crowdsourcing platform being overly permissive, we had fallen prey to many lazy workers. These workers were either always choosing the first option for each question or, in the case of multiple-choice questions, choosing every answer. This resulted in data that was supposedly very accurate (because there was high agreement between lazy workers gaming the system in the same way), but completely useless. Our next step will be to debug these crowdbugs. To do so, we are considering several adjustments, including changes we can make to our current jobs to ensure higher quality, and the possibility of moving to a different platform altogether.
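The two patterns described above (always picking the first option, or ticking every option) can be checked for mechanically. The following is a rough sketch under assumptions, not the check we actually ran: the function `flag_lazy_workers`, the response format, and the sample worker IDs are all made up for illustration, and it assumes every question has the same number of options.

```python
def flag_lazy_workers(responses, num_options):
    """Flag workers whose answers match a simple "lazy" pattern.

    responses maps a worker ID to a list of answers, one per question,
    where each answer is the set of option indices the worker selected.
    Returns a dict of worker ID -> short reason string.
    """
    flagged = {}
    for worker, answers in responses.items():
        if not answers:
            continue  # no answers, nothing to judge
        if all(answer == {0} for answer in answers):
            flagged[worker] = "always first option"
        elif all(len(answer) == num_options for answer in answers):
            flagged[worker] = "selects every option"
    return flagged


# Hypothetical sample data: three workers, three options per question.
sample_responses = {
    "w1": [{0}, {0}, {0}],
    "w2": [{0, 1, 2}, {0, 1, 2}],
    "w3": [{1}, {0, 2}, {0}],
}
lazy = flag_lazy_workers(sample_responses, num_options=3)
# lazy == {"w1": "always first option", "w2": "selects every option"}
```

A check like this only catches the crudest patterns, and a careful worker could legitimately match one of them on a short job, so flagged workers would still need a manual look. It also shows why agreement alone is misleading: "w1"-style workers agree with each other perfectly.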
In other news, I organized the Hackathon this week, and I think it was a success! Mostly that meant sending out a survey and then passing along its results to the people who actually make things happen - like Jillian Lentz, Michael's assistant, who did an incredible job meeting everyone's requests better than we could've imagined. We held it outside this week, on the Gates Building's back patio, which was beautiful.
After discovering some serious issues with our crowdsourcing last week, this week we ran some initial experiments and decided to switch to a combination of hand-picked workers found through personal interactions on Turker forums and a new crowdsourcing platform. This way we hope to be able to more closely control and troubleshoot the crowdsourced phase of data collection. The immense effect that personal communication with crowdworkers can have is something that continually impresses me. Reaching out to workers yields a more productive, pleasant experience for both workers and employers, though it seems to be a bit of an under-utilized strategy. At the end of this week, we also made some big progress nailing down a conceptual framework for the topics we've been exploring. While there's not much more to say about this week, it was definitely one of the most productive for me so far.
Danaë Metaxa-Kakavouli, Summer 2014