Project Blackhat Instructions
Loom Video:
https://www.loom.com/share/8237c34fed5c4d82a6c696c844a907ae?sid=b52cc5dc-51c8-4aeb-
b87e-b5a90b457cb6
In this task, you will be rating and comparing two responses that were generated from the same
coding related prompt by an AI chatbot.
The form is structured in four sections:
1. You will be shown two different responses to the prompt, and asked if you are able to
directly attempt to run and test the code blocks present
2. You will be asked to rate the correctness of both responses individually.
3. You will be asked to rate how well both responses follow the instructions given in the
prompt
4. You will then provide a rating on the relative quality of the two responses.
Important notes:
● Questions come with detailed instructions. Please read these carefully. Many questions
require explanations depending on the response.
● Some questions require explanations of issues that were found, if any. Aim to write
reasonably concise, but not overly generic explanations that would allow a skilled
programmer to identify and remedy the problem. Expect to write 1-2 sentences on
average.
Out of scope items:
If a prompt is out of scope (e.g. not code related), please select "Cannot Assess" in section 3,
and explain why it is out of scope. You can select the other answers arbitrarily; they will be
filtered out of the metric calculation.
Skipping items:
You should only rate items where you are familiar with the subject material and very confident
that the assessments you’re making are accurate. If you are completely unfamiliar with the
programming language, unable to assess the correctness of responses, or unable to confidently
compare the responses in a reasonable amount of time (~30 minutes), skip the item.
Assessing Correctness:
It is very important to be confident that the ratings you’re providing are accurate. Many aspects
of the rating task are very subtle and nuanced, where relatively small differences between
responses are crucially important, or very reasonable sounding explanations are not factual.
Even if at first glance a response seems perfectly appropriate, it is still important to
, ● Explicitly check each claim and
● Run code blocks yourself (when appropriate and possible) to verify the code runs and
works as expected
Rating Task
The primary focus of this rating task is on the correctness of the model responses, and how
well the responses follow instructions. Here, correctness refers both to the
truthfulness/factuality of any claim made in the response and the functionality of any code
provided. Instruction following refers to how well the response addresses the goals and
questions, and requirements of the prompt. It is your job to:
(a) Explicitly research and check these textual claims and
(b) Run and inspect the output of any code provided (where applicable) to confirm its
executability and functionality.
You will likely need to utilize google search, stack overflow, documentation, or online code
executors/compilers (e.g., jsfiddle, colab, programiz, etc.) to confirm that a response is
reasonable and that your rating is accurate.
The rating task consists of three parts:
1. Rate the correctness of each of the two responses individually (single-sided).
2. Rate how well each of the two responses follow instructions given in the prompt (single-
sided).
3. Rate the relative quality of each response in a side-by-side score.
Task Details:
(1) Code Response Executability:
Are you able to run code provided in the responses?
● Options: Yes-Fully, Yes-Partially, No, No Code Present
● Instructions: Run all code (snippets, functions, programs, etc.) provided in either of the
responses, to check both its executability and its correctness.
○ This may not always be possible. For instance, the code may only make sense
embedded inside a larger program, or it may require some external file/API
dependency for which no execution sandbox is readily available. If the code
cannot be executed, select “No”.
Loom Video:
https://www.loom.com/share/8237c34fed5c4d82a6c696c844a907ae?sid=b52cc5dc-51c8-4aeb-
b87e-b5a90b457cb6
In this task, you will be rating and comparing two responses that were generated from the same
coding related prompt by an AI chatbot.
The form is structured in four sections:
1. You will be shown two different responses to the prompt, and asked if you are able to
directly attempt to run and test the code blocks present
2. You will be asked to rate the correctness of both responses individually.
3. You will be asked to rate how well both responses follow the instructions given in the
prompt
4. You will then provide a rating on the relative quality of the two responses.
Important notes:
● Questions come with detailed instructions. Please read these carefully. Many questions
require explanations depending on the response.
● Some questions require explanations of issues that were found, if any. Aim to write
reasonably concise, but not overly generic explanations that would allow a skilled
programmer to identify and remedy the problem. Expect to write 1-2 sentences on
average.
Out of scope items:
If a prompt is out of scope (e.g. not code related), please select "Cannot Assess" in section 3,
and explain why it is out of scope. You can select the other answers arbitrarily; they will be
filtered out of the metric calculation.
Skipping items:
You should only rate items where you are familiar with the subject material and very confident
that the assessments you’re making are accurate. If you are completely unfamiliar with the
programming language, unable to assess the correctness of responses, or unable to confidently
compare the responses in a reasonable amount of time (~30 minutes), skip the item.
Assessing Correctness:
It is very important to be confident that the ratings you’re providing are accurate. Many aspects
of the rating task are very subtle and nuanced, where relatively small differences between
responses are crucially important, or very reasonable sounding explanations are not factual.
Even if at first glance a response seems perfectly appropriate, it is still important to
, ● Explicitly check each claim and
● Run code blocks yourself (when appropriate and possible) to verify the code runs and
works as expected
Rating Task
The primary focus of this rating task is on the correctness of the model responses, and how
well the responses follow instructions. Here, correctness refers both to the
truthfulness/factuality of any claim made in the response and the functionality of any code
provided. Instruction following refers to how well the response addresses the goals and
questions, and requirements of the prompt. It is your job to:
(a) Explicitly research and check these textual claims and
(b) Run and inspect the output of any code provided (where applicable) to confirm its
executability and functionality.
You will likely need to utilize google search, stack overflow, documentation, or online code
executors/compilers (e.g., jsfiddle, colab, programiz, etc.) to confirm that a response is
reasonable and that your rating is accurate.
The rating task consists of three parts:
1. Rate the correctness of each of the two responses individually (single-sided).
2. Rate how well each of the two responses follow instructions given in the prompt (single-
sided).
3. Rate the relative quality of each response in a side-by-side score.
Task Details:
(1) Code Response Executability:
Are you able to run code provided in the responses?
● Options: Yes-Fully, Yes-Partially, No, No Code Present
● Instructions: Run all code (snippets, functions, programs, etc.) provided in either of the
responses, to check both its executability and its correctness.
○ This may not always be possible. For instance, the code may only make sense
embedded inside a larger program, or it may require some external file/API
dependency for which no execution sandbox is readily available. If the code
cannot be executed, select “No”.