Assessment Tests and Tools
Assessment is an essential part of counselor preparation, involving the review, selection, administration, and interpretation of evaluative procedures. Assessments are designed to prevent test-takers from achieving perfect scores in order to differentiate between individuals and recognize their unique abilities and characteristics.
This section describes test theory, the assessment of ability, clinical assessment, and special issues in assessment.
1. Key Principles of Test Construction
Why is it important for counselors to understand test construction? How do you choose an accurate and credible assessment for measuring depression in adult clients? This section explores validity, reliability, standard error of measurement (SEM), item analysis, test development theory, and common assessment scales.
1.1. Validity
Validity refers to the accuracy of an instrument in measuring a specific construct. It determines how well the instrument measures what it intends to measure and allows meaningful inferences to be made from the results. It’s important to understand that validity is not an inherent quality of the instrument itself but rather of the scores obtained from using the instrument. The validity of scores can vary depending on the purpose of the test and the population being assessed. For example, an instrument measuring anxiety in adults may not have high validity when used with disruptive children but can have high validity when used with anxious adults. Therefore, validity should always be considered in relation to the specific purpose and target population of the test.
Validity in test reports is typically expressed as a correlation coefficient or regression equation. Correlation coefficients measure the relationship between test scores and the criterion, while regression equations predict future scores based on current test results. However, predictions are not 100% accurate, and the standard error of estimate accounts for the expected margin of error in predicted scores due to test imperfections.
The standard error of estimate is calculated by the following equation, where σest is the standard error of the estimate, Y is an actual score, Y′ is a predicted score, and N is the number of pairs of scores. The numerator is the sum of squared differences between the actual scores and the predicted scores.
σest = √( Σ(Y − Y′)² / N )
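The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the four pairs of scores are hypothetical, and the formula assumed is the one the paragraph describes (the sum of squared differences between actual and predicted scores, divided by N, under a square root).

```python
import math

def standard_error_of_estimate(actual, predicted):
    """Square root of [sum of squared (Y - Y') differences divided by N]."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two equal-length, non-empty lists of scores")
    # Numerator: sum of squared differences between actual and predicted scores.
    squared_diffs = sum((y - y_pred) ** 2 for y, y_pred in zip(actual, predicted))
    return math.sqrt(squared_diffs / len(actual))

# Hypothetical data: actual and predicted criterion scores for four clients.
actual_scores = [10, 12, 15, 19]
predicted_scores = [11, 12, 14, 17]
sigma_est = standard_error_of_estimate(actual_scores, predicted_scores)
print(round(sigma_est, 3))
```

A smaller value means the predicted scores sit closer to the actual scores, so the prediction carries less expected error.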
Professional counselors use psychological tests to improve decision-making in client diagnosis, treatment, and placement. They may administer tests, such as a depression inventory, to enhance accuracy in assessing client symptoms. To ensure the reliability of these tests, counselors evaluate their decision accuracy, which measures how well the instruments support counselor decisions. The following table provides an overview of the terms commonly associated with decision accuracy.
Terms Commonly Associated with Decision Accuracy.
Sensitivity | The instrument’s ability to accurately identify the presence of a phenomenon. |
Specificity | The instrument’s ability to accurately identify the absence of a phenomenon. |
False positive error | An instrument inaccurately identifies the presence of a phenomenon. |
Efficiency | The ratio of total correct decisions divided by the total number of decisions. |
Incremental validity | Concerned with the extent to which an instrument enhances the accuracy of prediction of a specific criterion (e.g., job performance and college GPA). |
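As a rough illustration of how these terms relate, the sketch below computes sensitivity, specificity, and efficiency from the four possible outcomes of a screening decision. The counts are hypothetical, and the formulas are the conventional operational definitions rather than anything prescribed by a particular instrument. A false negative, the counterpart of a false positive error, is a case in which the instrument misses a phenomenon that is present.

```python
def decision_accuracy(true_pos, false_pos, true_neg, false_neg):
    """Return sensitivity, specificity, and efficiency as proportions (0 to 1)."""
    total = true_pos + false_pos + true_neg + false_neg
    return {
        # Share of clients WITH the phenomenon whom the instrument flags.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Share of clients WITHOUT the phenomenon whom the instrument clears.
        "specificity": true_neg / (true_neg + false_pos),
        # Total correct decisions divided by total decisions.
        "efficiency": (true_pos + true_neg) / total,
    }

# Hypothetical screening results for 100 clients on a depression inventory.
stats = decision_accuracy(true_pos=40, false_pos=10, true_neg=45, false_neg=5)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```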
1.2. Reliability
Reliability refers to the consistency of scores on the same test across multiple administrations. Although we expect individuals to receive the same score each time, there is always some degree of error due to various factors. This error makes it challenging for individuals to obtain identical scores on retesting, resulting in a distinction between true scores and observed scores.
A person’s observed score (X) is equal to his or her true score (T) plus the amount of error (e) present during test administration:
X = T + e
Reliability measures the extent of error in instrument administration. It helps estimate the impact of personal and environmental factors on obtained scores. Researchers assess reliability to determine the instrument’s freedom from measurement error. Methods such as test-retest reliability, alternative form reliability, and internal consistency approximate reliability.
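Because a split-half estimate correlates two half-length versions of a test, it is typically corrected to full test length with the Spearman-Brown prophecy formula:
Spearman-Brown Prophecy Formula = 2rhh / (1 + rhh)
with rhh representing the split-half reliability estimate.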
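A minimal sketch of the Spearman-Brown correction follows. The function name and the split-half value of .70 are hypothetical and chosen only to show the arithmetic.

```python
def spearman_brown(split_half_r):
    """Estimate full-length reliability from a split-half estimate (rhh)."""
    return (2 * split_half_r) / (1 + split_half_r)

# Hypothetical split-half reliability estimate of .70.
full_test_r = spearman_brown(0.70)
print(round(full_test_r, 3))
```

The corrected coefficient is higher than the split-half estimate, which reflects the general tendency of longer tests to yield more reliable scores.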
Reliability in test reports and manuals is expressed as a correlation, known as a reliability coefficient. A higher reliability coefficient, closer to 1.00, indicates more reliable scores. Reliability coefficients below 1.00 indicate the presence of error in test scores. Typically, reliability coefficients ranging from .80 to .95 are considered useful, but the acceptable level depends on the test’s purpose. Nationally normed achievement and aptitude tests aim for reliability coefficients above .90, while personality inventories can have lower coefficients and still be considered reliable.
The true score of a student’s understanding of the material cannot be directly determined due to the presence of measurement error. To estimate the distribution of scores around the true score, the standard error of measurement (SEM) is used. The SEM provides an estimate of the variability of scores that can be expected from repeated administrations of the same test to the same individual. The SEM is computed using the standard deviation (SD) and reliability coefficient (r) of the test instrument:
SEM = SD × √(1 − r)
The SEM, or standard error of measurement, is the standard deviation of a person’s scores when taking the same test multiple times. It provides an estimate of the distribution of scores around the true score. The SEM is inversely related to reliability, meaning that a larger SEM indicates lower test reliability. The SEM is often reported as a confidence interval, representing the range of scores where the true score is likely to fall.
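The sketch below shows one way to compute an SEM and a score band around an observed score. It assumes the conventional formula, SEM = SD × √(1 − r), where r is the reliability coefficient. The function names and the example values (SD = 15, r = .91, observed score = 100) are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability coefficient must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)

def score_band(observed, sem, n_sems=1):
    """Range of observed score plus or minus n_sems standard errors."""
    return (observed - n_sems * sem, observed + n_sems * sem)

# Hypothetical test: SD = 15, reliability = .91, observed score = 100.
sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = score_band(observed=100, sem=sem)
print(f"SEM = {sem:.2f}; band = {low:.1f} to {high:.1f}")
```

If measurement error is assumed to be normally distributed, a band of ±1 SEM corresponds to roughly a 68% confidence interval and ±2 SEM to roughly 95%. As reliability approaches 1.00, the SEM shrinks toward zero.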
At least five influences on reliability of test scores have been noted:
While both reliability and validity are necessary for credible test scores, it is possible for scores to be reliable but not valid. On the other hand, valid test scores are always reliable.
1.3. Item Analysis
1.4. Test Theory
Test theory aims to ensure the empirical measurement of test constructs, emphasizing the reduction of test error and enhancement of construct reliability and validity. Professional counselors should be familiar with the following test theories:
1.5. The Development of Instrument Scales
A scale combines items or questions to form a score on a variable. It can measure discrete or continuous variables and present data quantitatively or qualitatively. Quantitative data are numeric, whereas data presented qualitatively use forms other than numbers (e.g., “Check Yes or No”).
Understanding measurement scales is crucial for developing instrument scales. There are four types of measurement scales: nominal, ordinal, interval, and ratio.
1 | 2 | 3 | 4 | 5 |
Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree |
How do you feel about your NCE scores?
Bad_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _Good
Statement | Agree | Disagree |
I feel lonely all the time. | [ ] | [ ] |
It is difficult to carry out my daily tasks. | [ ] | [ ] |
I am sleeping more than usual. | [ ] | [ ] |
I have frequent thoughts of death and suicide. | [ ] | [ ] |
Please place a check next to each statement that you agree with.
____________ Are you willing to permit gay students to attend your university?
____________ Are you willing to permit gay students to live in the university dorms?
____________ Are you willing to permit gay students to live in your dorm?
____________ Are you willing to permit gay students to live next door to you?
____________ Are you willing to have a gay student as a roommate?
2. Derived Scores
A derived score converts a raw score into a meaningful result by comparing it to a norm group. In this section, we cover the normal distribution, its connection to derived scores, the use of norm-referenced assessments, and how to calculate and interpret percentile ranks, standard scores, and developmental scores.
2.1. The Normal Distribution
Normal distributions are bell-shaped distributions where most scores cluster around the average, with few scores at the extremes. They are commonly observed in natural and psychological measurements. Their characteristics are valuable in the field of assessment.
A normal distribution is a bell-shaped curve that is symmetrical and asymptotic. It has a highest point at the center and lowest points on either side. Normal distributions also have measures of central tendency and variability.
The Normal Curve.
Derived scores, such as percentiles, normal curve equivalents, and z-scores, are based on the principles of normal distributions. These scores allow for comparisons between clients’ scores and facilitate comparisons across different tests. Normal distributions provide the mathematical framework for these derived scores.
2.2. Norm-Referenced Assessment
Derived scores are common in norm-referenced assessments, where scores are compared to the average performance of a group. Norms represent the typical score against which others are evaluated. By comparing an individual’s score to the group average, we can determine their relative performance. The following table lists examples of commonly used norm-referenced assessments.
Examples of Norm-Referenced Assessments.
College admissions exams | GRE, SAT, ACT, MCAT, GMAT |
Intelligence testing | Stanford-Binet, Wechsler |
Personality inventories | MBTI, CPI |
2.3. Percentiles
Percentage scores are easily confused with percentiles. A percentage score is simply the raw score (i.e., the number of correct items) divided by the total number of test items. Although the percentage score calculation is straightforward, it must be compared to some criterion or norm to give it interpretive meaning.
A percentile rank compares a person’s score to a norm group and indicates the percentage of scores at or below that score. Percentile ranks range from 1 to 99, with a mean of 50. They are not equal units of measurement and tend to exaggerate differences near the mean.
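To make the distinction concrete, the sketch below computes both a percentage score and a percentile rank. The norm group is a hypothetical set of ten raw scores, and the percentile rank is taken as the percentage of norm-group scores at or below the given score, as defined above. Published norm tables typically limit reported ranks to 1 through 99, which this simple version does not do.

```python
def percentage_score(num_correct, num_items):
    """Raw score expressed as a percentage of total items."""
    return 100 * num_correct / num_items

def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100 * at_or_below / len(norm_scores)

# Hypothetical norm group of ten raw scores.
norm_group = [52, 58, 61, 64, 67, 70, 73, 75, 80, 88]

pct = percentage_score(num_correct=35, num_items=50)
rank = percentile_rank(70, norm_group)
print(pct, rank)
```

The percentage score depends only on the test-taker's own answers, whereas the percentile rank changes whenever the norm group changes.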
2.4. Standardized Scores
Standardization in assessment involves converting raw scores to standard scores, which serve as reference points for comparing test results. A standardized score is obtained by applying formulas that convert the raw score to a new score based on a norm group. These scores indicate the number of standard deviations a score is above or below the mean.
Standardized scores allow for comparisons across different tests and are used in norm-referenced assessments. Examples of standardized scores include z-scores, T scores, deviation IQ, stanine scores, and normal curve equivalent scores.
The z-score is the most basic type of standard score. A z-score distribution has a mean of 0 and a standard deviation of 1. It simply represents the number of standard deviation units above or below the mean at which a given score falls. z-scores are derived by subtracting the sample mean from the individual’s raw score and then dividing the difference by the sample standard deviation.
z = (X − M) / SD
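As a small worked example of the z-score formula, the sketch below uses hypothetical values: a raw score of 85 on a test with a mean of 70 and a standard deviation of 10.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score falls above or below the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd

# Hypothetical values: raw score 85, sample mean 70, sample SD 10.
z = z_score(raw=85, mean=70, sd=10)
print(z)
```

A positive result indicates a score above the mean, and a negative result indicates a score below it.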
A T score is a type of standard score that has an adjusted mean of 50 and a standard deviation of 10. These scores are commonly used when reporting the results of personality, interest, and aptitude measures. T scores are easily derived from z-scores by multiplying an individual’s z-score by the T score standard deviation (i.e., 10) and adding it to the T score mean (i.e., 50):
T = 10 (z) + 50
For example, a z-score of +1 yields T = 10 (+1) + 50 = 60.
T scores are interpreted similarly to z-scores, meaning that a T score above 50 represents a raw score that is above the mean, and a T score below 50 represents a raw score that is below the mean.
Deviation IQ scores are used in intelligence testing. Although deviation IQs are a type of standardized score, they are often referred to simply as standard scores (SS), because they are commonly used to interpret scores from achievement and aptitude tests. Deviation IQs have a mean of 100 and standard deviation of 15 and are derived by multiplying an individual’s z-score by the deviation IQ standard deviation (15) and adding it to the deviation IQ mean (100).
SS = 15 (z) + 100
Deviation IQ scores are interpreted similarly to z-scores and T scores, meaning that a score above 100 represents a raw score that is above the mean, and a score below 100 represents a raw score that is below the mean.
A stanine is a standard score used on achievement tests that divides the normal distribution into nine intervals. Each interval, except the open-ended first and ninth, has a width of half a standard deviation. Stanine scores represent a range of z-scores and percentiles. For example, a stanine score of 4 represents a z-score range of −.75 to −.26 and a percentile range of 23 to 40. Stanines have a mean of 5, a standard deviation of 2, and a range of 1 to 9. To calculate stanine scores, multiply the stanine standard deviation by the z-score and add it to the stanine mean. Stanines are rounded to the nearest whole number.
Stanine = 2 (z) + 5
The normal curve equivalent (NCE) is a score used to rank individuals in comparison to their peers. It ranges from 1 to 99, dividing the normal curve into 100 equal parts. The NCE has a mean of 50 and a standard deviation of 21.06. NCEs can be derived from a z-score by multiplying the NCE standard deviation (SD = 21.06) by an individual’s z-score and adding the NCE mean (M = 50).
NCE = 21.06 (z) + 50
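The conversions in this section all follow the same pattern: multiply the z-score by the new standard deviation and add the new mean. The sketch below collects them in one place. The function names are hypothetical. The stanine function rounds halves upward and limits results to the range 1 to 9, which is an assumption chosen to match the stanine boundaries described above.

```python
import math

def t_score(z):
    """T score: mean 50, SD 10."""
    return 10 * z + 50

def deviation_iq(z):
    """Deviation IQ / standard score (SS): mean 100, SD 15."""
    return 15 * z + 100

def stanine(z):
    """Stanine: mean 5, SD 2, rounded to a whole number from 1 to 9."""
    rounded = math.floor(2 * z + 5 + 0.5)  # round halves upward
    return max(1, min(9, rounded))

def nce(z):
    """Normal curve equivalent: mean 50, SD 21.06."""
    return 21.06 * z + 50

# Hypothetical client scoring one standard deviation above the mean.
z = 1.0
print(t_score(z), deviation_iq(z), stanine(z), round(nce(z), 2))
```

Because every conversion starts from the same z-score, a client's standing relative to the norm group is preserved and only the reporting scale changes.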
2.5. Developmental Scores
Developmental scores place an individual’s raw score on a developmental continuum. They directly compare a person’s score to others of the same age or grade level, providing context for their performance. Developmental scores are commonly used for assessing children and young adolescents.
3. Assessment Tests and Tools
3.1. Ability Assessment
Ability assessment encompasses a range of assessment tools that evaluate cognitive skills. These skills involve various cognitive processes such as knowledge, comprehension, application, analysis, synthesis, and evaluation. Ability assessment includes tests that measure both achievement and aptitude. The main types of ability tests are as follows:
Achievement tests are designed to assess what one has learned at the time of testing.
Aptitude tests attempt to predict how well that individual will perform in the future.
Intelligence tests broadly assess an individual’s cognitive abilities.
High-stakes testing refers to the use of standardized test outcomes to make major educational decisions concerning promotion, retention, educational placement, and entrance into college. Some common high-stakes tests are:
3.2. Clinical Assessment
Clinical assessment involves a comprehensive evaluation of clients using various methods, including personality testing, observation, interviews, and performance assessments. It aims to enhance clients’ self-awareness and aid professional counselors in understanding clients’ needs and developing treatment plans.
Personality tests focus on evaluating an individual’s emotional and behavioral traits that tend to remain consistent throughout adulthood, such as temperament and behavioral patterns. These tests are classified as either objective or projective personality tests.
Informal assessments refer to subjective assessment techniques that are developed for specific needs. Types of informal assessment include observation, clinical interviewing, rating scales, and classification systems.
Other forms of assessments often used in counseling include the mental status exam, performance assessment, suicide assessment, and trauma assessment.