Quality Inference for Mechanical Turk Results

You can use the form below to calculate the quality of the workers that submit answers to your tasks on Mechanical Turk. The algorithm does not rely on having access to known ("gold") answers; we simply assume that multiple workers complete the same HIT. Optionally, you may also provide a few known ("gold") answers. See Worker Evaluation in Crowdsourcing: Gold Data or Multiple Workers? for a discussion of how much help we can get from using the "gold" answers.
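To make the gold-free setting concrete, here is a minimal sketch of the kind of expectation-maximization estimation (in the spirit of Dawid and Skene) that this style of quality management builds on. The function name, the smoothing constant, and the tiny dataset are illustrative assumptions, not the site's actual code:

```python
# Illustrative EM sketch: jointly estimate per-object class posteriors and
# per-worker confusion matrices from redundant labels, with no gold answers.

def dawid_skene(labels, classes, iters=20):
    """labels: list of (worker, obj, label) triples.
    Returns (posterior per object, confusion matrix per worker)."""
    objs = sorted({o for _, o, _ in labels})
    workers = sorted({w for w, _, _ in labels})
    k = len(classes)
    idx = {c: i for i, c in enumerate(classes)}

    # Initialize each object's posterior with the vote shares it received.
    post = {}
    for o in objs:
        counts = [0.0] * k
        for _, oo, l in labels:
            if oo == o:
                counts[idx[l]] += 1.0
        s = sum(counts)
        post[o] = [c / s for c in counts]

    for _ in range(iters):
        # M-step: re-estimate class priors and worker confusion matrices
        # from the current soft labels.
        priors = [sum(post[o][c] for o in objs) / len(objs) for c in range(k)]
        conf = {w: [[1e-6] * k for _ in range(k)] for w in workers}  # smoothed
        for w, o, l in labels:
            for t in range(k):
                conf[w][t][idx[l]] += post[o][t]
        for w in workers:
            for t in range(k):
                s = sum(conf[w][t])
                conf[w][t] = [v / s for v in conf[w][t]]
        # E-step: recompute each object's posterior given worker behavior.
        for o in objs:
            p = list(priors)
            for w, oo, l in labels:
                if oo == o:
                    p = [p[t] * conf[w][t][idx[l]] for t in range(k)]
            s = sum(p)
            post[o] = [v / s for v in p]
    return post, conf

# Hypothetical toy data: worker "w1" disagrees with two more reliable workers.
votes = [("w1", "url1", "porn"), ("w2", "url1", "notporn"),
         ("w3", "url1", "notporn"),
         ("w1", "url2", "porn"), ("w2", "url2", "porn"),
         ("w3", "url2", "porn")]
post, conf = dawid_skene(votes, ["notporn", "porn"])
# post["url1"] ends up putting most of its mass on "notporn".
```

The E-step and M-step reinforce each other: objects with confident posteriors sharpen the confusion-matrix estimates, which in turn sharpen the posteriors of the remaining objects.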

Example Application: Porn or Not Porn? The pre-populated example has five workers (worker1 to worker5), five objects (url1 to url5), and two classes (porn and notporn). The correct (but unknown) label for url1, url3, and url5 is notporn; for url2 and url4 it is porn.

Goal: The goal is to find the correct class of each example using the labels provided by the workers.

The workers have varying levels of quality.
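The simplest way to pursue this goal is plain majority voting, which ignores worker quality entirely; the hypothetical votes below show the idea (the triple format and names are illustrative):

```python
# Baseline: pick the most common label per object, treating all workers as
# equally reliable. Varying worker quality is exactly what this ignores.
from collections import Counter

def majority_vote(labels):
    """labels: list of (worker, obj, label) -> {obj: winning label}."""
    by_obj = {}
    for _, obj, lab in labels:
        by_obj.setdefault(obj, []).append(lab)
    return {obj: Counter(labs).most_common(1)[0][0]
            for obj, labs in by_obj.items()}

votes = [("worker1", "url1", "porn"), ("worker2", "url1", "notporn"),
         ("worker3", "url1", "notporn"), ("worker4", "url1", "notporn"),
         ("worker5", "url1", "porn")]
print(majority_vote(votes))  # {'url1': 'notporn'}
```

When workers are biased or adversarial, majority voting can be badly wrong, which is why the bias-correcting analysis below is useful.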

In this example, we know the correct classification for two of the examples, and we pasted the correct labels into the "Gold" Answers area.

Press submit to see the analysis of the results.

Most of the results are self-explanatory. The basic novelty is the computation of the "Quality" score, which is computed after correcting for the worker's biases. The "Quality" metric measures the "non-recoverable" error cost of a worker.

For example, compare worker1 and worker5. While worker5 is malicious, we can perfectly reconstruct the correct labels from whatever worker5 submits (just invert everything!), so the quality of worker5 is 100%. In contrast, worker1 always says "porn", so its labels carry no information at all, and its quality is 0%.
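This bias-corrected quality can be sketched as an expected-cost computation over a worker's confusion matrix; the formulas below are my reading of the idea (normalize a worker's expected misclassification cost against that of an uninformative "spammer"), not the site's actual code:

```python
# Sketch of a bias-corrected quality score. confusion[t][l] is the
# probability that the worker reports label l when the true class is t.

def soft_label_cost(soft, cost):
    # Expected misclassification cost of a probabilistic ("soft") label.
    n = len(soft)
    return sum(soft[i] * soft[j] * cost[i][j]
               for i in range(n) for j in range(n))

def worker_quality(confusion, priors, cost):
    n = len(priors)
    exp_cost = 0.0
    for l in range(n):
        # Probability the worker emits label l, and the posterior over true
        # classes given that it did (Bayes' rule).
        p_l = sum(priors[t] * confusion[t][l] for t in range(n))
        if p_l == 0:
            continue
        posterior = [priors[t] * confusion[t][l] / p_l for t in range(n)]
        exp_cost += p_l * soft_label_cost(posterior, cost)
    # A spammer's labels carry no information: the best guess after seeing
    # them is just the class priors. Normalize against that cost.
    spammer_cost = soft_label_cost(priors, cost)
    return 1.0 - exp_cost / spammer_cost

priors = [0.5, 0.5]               # [notporn, porn]
cost01 = [[0, 1], [1, 0]]         # 0/1 misclassification costs
inverter = [[0, 1], [1, 0]]       # always says the opposite (like worker5)
always_porn = [[0, 1], [0, 1]]    # always says "porn" (like worker1)
print(worker_quality(inverter, priors, cost01))     # 1.0
print(worker_quality(always_porn, priors, cost01))  # 0.0
```

The inverter's labels pin down the true class exactly (zero posterior uncertainty), so its expected cost is zero and its quality is 100%; the always-"porn" worker leaves the posterior equal to the prior, matching the spammer's cost, so its quality is 0%.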

Feel free to paste your own data above, and let me know about the results!

For any bug reports or feature requests, contact Panos Ipeirotis.
The code is available at Google Code.
Powered by Google App Engine