Lecture 1 (Test construction):
In scientific research tests are used for:
- Testing theories
- Testing hypotheses
- Studying development and change
- Predicting behaviour
In practice tests are used for:
- Diagnosis
- Psychological and educational decision making
Origins of psychometrics:
Modern testing has three roots:
1. Civil service examinations → 1115 B.C. in China. Candidates for government
positions were examined on:
- Music
- Archery
- Horsemanship
- Writing
- Arithmetic
- The rites and ceremonies of public and private life
In 1791, long after the Chinese abolished this system, France started using it. In 1833 the
UK started using it and in 1870 the US started using it.
2. The assessment of academic achievement → the first examinations were for the
study of law in 1219.
- 13th century → oral examinations in France
- 1441 → Belgium: examinations graded as honor, satisfactory, charity pass or failure
- 16th century → written test for the placement of students and the evaluation of
educational instruction
- 1599 → final version of the rules for the conduct of examinations in school
3. The study of individual differences in behaviour in the 19th century
- England → Galton
- United States → Cattell
- Germany → Kraepelin (personality testing)
- First intelligence test by Binet and Simon → 1905. It was developed in response to a
request for recommendations on special education for children with intellectual disabilities
- Edgeworth → grades assigned to the same examination paper can vary considerably
between examiners. This observation underlies the concepts of measurement error and
reliability.
- Spearman → classical theory of test scores
Since the 20th century there has been a development of a large number of tests for
measuring abilities, achievements, personality characteristics, attitudes and motor skills.
Terms:
Test → an instrument for the measurement of a person’s maximum or typical performance
under standardized conditions, where the performance is assumed to reflect one or more
latent attributes
A test can be used for many different purposes, but it is first of all a measurement
instrument. Other uses, such as prediction, are important applications of the measurements
(e.g. using an intelligence test to predict future educational achievement).
The test instructions, the test materials and the administration procedure are fixed for
different test takers and on different occasions. This makes test performances comparable
between persons and occasions.
It is assumed that one or more unobservable (latent) attributes underlie the test
performance and affect it.
Surveys contain questions which are answered by a respondent, just as for typical
performance tests. However, in surveys it is not assumed that survey questions reflect a
latent attribute. These questions can, however, be used to form a measurement index (e.g. an
index of stress formed by combining the answers to a list of negative life events). It is
assumed that these events lead to stress, but not that stress is a latent attribute that
causes the events.
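The stress index described above can be sketched in Python. This is only a minimal illustration, not a validated instrument: the event names and the function stress_index are hypothetical, and the index is assumed to be a simple count of the events a respondent reports.

```python
# Hypothetical checklist of negative life events (illustrative names only).
LIFE_EVENTS = ["job_loss", "divorce", "serious_illness", "death_of_relative"]


def stress_index(answers: dict[str, bool]) -> int:
    """Count the negative life events a respondent reports.

    The events are treated as causes that add up to the index; no latent
    attribute is assumed to cause the answers.
    """
    return sum(1 for event in LIFE_EVENTS if answers.get(event, False))


respondent = {"job_loss": True, "divorce": False, "serious_illness": True}
print(stress_index(respondent))  # 2
```

Unanswered events are counted as not experienced here, which is itself an assumption of the sketch.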
Subtest → an independent part of a test (e.g: SAT-M, mathematics)
Item → an item is the smallest possible subtest of a test. A test consists of n items and is
called an n-item test.
Dimensionality → the number of latent attributes (variables) that affect test
performance.
A test that measures one latent attribute is called a unidimensional test.
A test that measures more than one latent attribute is called a multidimensional test (with
two attributes it is a two-dimensional test, with three a three-dimensional test, etc.).
Psychological and educational measurement instruments are divided into:
1. Mental tests → consist of cognitive tasks (e.g. problems and questions)
2. Physical tests → consist of instruments to make somatic or physiological
measurements (e.g. an ECG for heart rate)
Another way to distinguish tests is:
1. Maximum performance tests → can be subdivided according to:
- The type of maximum performance
- The latent attribute that is measured by the test
2. Typical performance tests → can be subdivided according to:
- The attribute that is measured
1. Maximum performance tests:
- The type of maximum performance:
A performance is considered maximum in two different respects: the performance is
accurate and the performance is fast. If you look at accuracy, you can distinguish the
following maximum performance tests:
A) Pure power test → it consists of problems that the test taker tries to solve. The test
taker has enough time to work on each of the test items, even on the most difficult
items.
B) Time-limited power test → it consists of problems that the test taker tries to solve.
The majority of the test takers have enough time to solve the problems and only a
small minority needs more time.
If you look at speed, you can distinguish the following maximum performance test:
C) Speed test → it measures how fast the test taker solves problems. It usually consists of
very easy items that can be solved by all test takers.
- According to the attribute:
A) Ability test (aptitude test) → measuring a person’s best performance in an area that
is not explicitly taught in training and educational programs
B) Achievement test → measuring a person’s best performance in an area that is
explicitly taught in training and educational programs
An aptitude test is an ability test that is used to predict future competences.
2. Typical performance tests
They are often called questionnaires or inventories. There are three types of typical
performance tests:
A) Personality tests → measure personality characteristics
B) Interest inventories → measure a person’s interests
C) Attitude questionnaires → measure a person’s attitude towards a certain topic
Test construction:
The first question that needs to be asked when one wants to construct a test is:
- What kind of test does one want to make/construct?
The answer to this question depends on the purpose of the test. There are two types of
tests:
1. Maximum performance tests → you have to solve a problem (e.g. continue the series
2 5 11 23) and the response is (partly) correct or incorrect. The person is asked to do
his/her best to solve one or more problems. An example is an intelligence test or an exam.
The test takers are asked to actually perform a task.
2. Typical performance tests → you respond to a task in the way that is typical for you
(e.g. do you like parties? yes/no). The person is asked to respond to one or more
tasks, and there is no right or wrong response. An example is a test used in personality
research.
These two types of tests are constructed in different ways.
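The difference in scoring between the two types can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, and "double and add one" is only one rule that fits the series 2 5 11 23. A maximum performance item is scored against a correct answer, whereas a typical performance item is merely coded.

```python
def next_in_series(series: list[int]) -> int:
    """Continue a series under the assumed rule x -> 2 * x + 1."""
    # Check that the assumed rule actually fits the given terms.
    for previous, current in zip(series, series[1:]):
        if current != 2 * previous + 1:
            raise ValueError("series does not follow the assumed rule")
    return 2 * series[-1] + 1


def score_maximum_item(response: int, key: int) -> int:
    """Maximum performance: the response is scored against a correct answer."""
    return 1 if response == key else 0


def code_typical_item(response: str) -> int:
    """Typical performance: the response is only coded; no answer is correct."""
    return {"yes": 1, "no": 0}[response]


key = next_in_series([2, 5, 11, 23])
print(key)                          # 47
print(score_maximum_item(47, key))  # 1
print(code_typical_item("no"))      # 0
```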
There are seven steps in the test construction process:
1. The construct of interest
2. The measurement mode
3. The objective
4. The population
5. The conceptual framework
6. The response mode
7. The administration mode
To construct a maximum or a typical performance test you go through all seven steps,
but not necessarily in the order given here.