Evaluating LLMs in HR
Name
Course
Instructor
Date of Submission
Organizational Context and Assumptions
Large language models can be used to support recruitment and performance management in a large Canadian organization. The hypothetical company operates nationally, employs a few thousand people, and has a centralized HR department. Its diverse workforce includes Indigenous people (First Nations, Métis, and Inuit) and members of other equity-deserving groups. The company is subject to Canadian human rights, employment, and privacy legislation and has formal Equity, Diversity, Inclusion, and Indigeneity (EDII) commitments (Hudson et al., 2023).
The organization is assumed to be exploring generative AI tools as part of a digital transformation strategy. Senior executives want to know how large language models can improve efficiency and consistency in HR procedures while advancing EDII goals. Potential uses include writing job descriptions and advertisements, screening applications against minimum qualifications, summarizing resumes, generating interview questions, synthesizing performance data into a draft narrative, and helping managers write performance commentary or development plans. At the same time, leaders and HR professionals recognize the risks of using AI in people management, such as algorithmic discrimination, unfairness, reliance on biased historical data, and privacy intrusion (Soleimani et al., 2025). Particular attention is paid to culturally unsafe practices and the erasure of Indigenous voices that can occur when AI-generated outputs are accepted uncritically. For this reason, experimentation with large language models is framed as a structured learning exercise to be evaluated against both technical quality and EDII principles, rather than an initiative whose primary goal is efficiency.
Research Design
The study follows a comparative evaluation design in which three large language models are assessed against a human baseline. The models are Microsoft Copilot (Bing), Google Gemini, and OpenAI ChatGPT, all accessed through their standard web interfaces. Within this hypothetical organization, responses are elicited for ten open-ended prompts covering recruitment and performance management. Four answers are obtained for each prompt: one from each model and one human-generated ideal answer.
The research process proceeds in four phases. First, the prompts are written to mirror real-life HR tasks and ethical issues, ensuring that both recruitment and performance management processes are covered. Second, the same prompts are posed to all three models and to the human baseline, and the first complete answer generated by each tool is recorded as its response. Third, an anonymized assessment instrument is developed, and human assessors score each response using a structured rubric covering accuracy, clarity, ethical and EDII sensitivity, and overall usefulness. Fourth, quantitative ratings and qualitative comments are analyzed to compare performance across sources and question types and to identify common strengths and weaknesses.
Development of Prompts
Ten prompts are created to represent typical decision scenarios in recruitment and performance management. Seven prompts involve technical and procedural reasoning. They address issues such as designing a fair screening process for professional roles, using AI to screen resumes transparently, interpreting ambiguous performance information, and proposing performance improvement plans or calibration strategies. Each technical prompt is formulated to require structured reasoning, explicit assumptions about the organizational context, and specific recommendations.
The remaining three prompts address ethical, EDII, and Indigenous issues. These include identifying potential bias in shortlisting results, responding to Indigenous employees' concerns that AI screening may be culturally unsafe or opaque, and designing a framework for using AI screening responsibly without infringing on Indigenous rights and knowledge. Each ethical prompt requires attention to multiple stakeholders and to the risk of harm or exclusion (Ong et al., 2024). Together, the ten prompts form a balanced set of tasks for comparing the responses of large language models and the human baseline on technical and ethical issues in HR.
LLM Prompts and Responses
Prompt 1 (technical recruitment):