Assessment Tests and Tools

Assessment is an essential part of counselor preparation, involving the review, selection, administration, and interpretation of evaluative procedures. Assessments are deliberately constructed so that test-takers rarely achieve perfect scores, which allows them to differentiate between individuals and recognize their unique abilities and characteristics.

This section will describe test theory, assessment of ability, clinical assessment, and special issues in assessment.

1. Key Principles of Test Construction

Why is it important for counselors to understand test construction? How do you choose an accurate and credible assessment for measuring depression in adult clients? This section explores validity, reliability, standard error of measurement (SEM), item analysis, test development theory, and common assessment scales.

1.1. Validity

Validity refers to the accuracy of an instrument in measuring a specific construct. It determines how well the instrument measures what it intends to measure and allows meaningful inferences to be made from the results. It’s important to understand that validity is not an inherent quality of the instrument itself but rather of the scores obtained from using the instrument. The validity of scores can vary depending on the purpose of the test and the population being assessed. For example, an instrument measuring anxiety in adults may not have high validity when used with disruptive children but can have high validity when used with anxious adults. Therefore, validity should always be considered in relation to the specific purpose and target population of the test.

  • Types of Validity
  • Content validity ensures that an instrument’s content is appropriate for its intended purpose. To establish content validity, test items should cover all major content areas of the domain being measured. For example, if developing a depression assessment, items should address physical, psychological, and cognitive factors. The number of items in each content area should reflect their importance.
  • Criterion validity assesses how well an instrument predicts an individual’s performance on a specific criterion. It is determined by examining the relationship between the instrument’s data and the criterion. There are two types of criterion validity: concurrent and predictive validity.
  • Concurrent validity examines the relationship between an instrument’s results and a currently obtainable criterion, collected at the same time. For example, comparing depression scores from the instrument with hospital admissions for suicidal ideation in the past six months establishes concurrent validity.
  • Predictive validity, on the other hand, assesses the relationship between an instrument’s results and a criterion collected in the future. By predicting future performance, the instrument’s scores are compared to future hospitalizations for suicidal ideation occurring two years later. Positive correlation indicates predictive validity in anticipating future hospitalization.
  • Construct validity assesses how well an instrument measures a theoretical concept. It is crucial for instruments that measure abstract constructs like personality traits. Construct validity is established through methods such as experimental designs, factor analysis, comparison with similar measures, and differentiation from unrelated measures.
  • Experimental design validity: Involves using an experimental design to demonstrate that an instrument measures a specific construct. For example, administering the depression instrument to clients before and after therapy to show a decrease in scores.
  • Factor analysis: A statistical technique that analyzes relationships between items to identify latent traits or factors. Construct validity is supported when subscales of the instrument relate to each other and the larger construct of depression.
  • Convergent validity: Demonstrates that the assessment is related to other measures of the same construct. Correlating a new depression test with an established measure like the Beck Depression Inventory II (BDI-II) indicates convergent validity.
  • Discriminant validity: Shows that measures of unrelated constructs have no relationship. Establishing discriminant validity involves demonstrating that depression scores are not related to scores on an achievement instrument.
  • Face validity is often misunderstood as a type of validity, but it is not. It refers to the superficial appearance of an instrument and whether it “looks” valid or credible. However, it does not provide strong evidence of the instrument’s actual validity. 
  • It is important to establish multiple types of validity to ensure the credibility of the instrument.
  • Reporting Validity

Validity in test reports is typically expressed as a correlation coefficient or regression equation. Correlation coefficients measure the relationship between test scores and the criterion, while regression equations predict future scores based on current test results. However, predictions are not 100% accurate, and the standard error of estimate accounts for the expected margin of error in predicted scores due to test imperfections.

The standard error of estimate is calculated by the following equation, where σest is the standard error of the estimate, Y is an actual score, Y′ is a predicted score, and N is the number of pairs of scores. The numerator is the sum of squared differences between the actual scores and the predicted scores.

σest = √[ Σ(Y − Y′)² / N ]
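As a concrete sketch, the standard error of estimate can be computed directly from pairs of actual and predicted criterion scores; the scores below are hypothetical, invented for illustration:

```python
import math

def standard_error_of_estimate(actual, predicted):
    """sigma_est = sqrt( sum((Y - Y')^2) / N )."""
    n = len(actual)
    squared_errors = sum((y - y_pred) ** 2 for y, y_pred in zip(actual, predicted))
    return math.sqrt(squared_errors / n)

# Hypothetical actual criterion scores vs. regression-predicted scores
actual = [10, 12, 9, 15, 11]
predicted = [11, 12, 10, 13, 12]
print(round(standard_error_of_estimate(actual, predicted), 2))  # 1.18
```

With these made-up scores, predicted criterion values miss actual values by roughly 1.2 points on average.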
  • Decision Accuracy

Professional counselors use psychological tests to improve decision-making in client diagnosis, treatment, and placement. They may administer tests, such as a depression inventory, to enhance accuracy in assessing client symptoms. To ensure the reliability of these tests, counselors evaluate their decision accuracy, which measures how well the instruments support counselor decisions. The following table provides an overview of the terms commonly associated with decision accuracy.

Terms Commonly Associated with Decision Accuracy.

  • Sensitivity: The instrument’s ability to accurately identify the presence of a phenomenon.
  • Specificity: The instrument’s ability to accurately identify the absence of a phenomenon.
  • False positive error: An instrument inaccurately identifies the presence of a phenomenon.
  • Efficiency: The ratio of total correct decisions divided by the total number of decisions.
  • Incremental validity: The extent to which an instrument enhances the accuracy of prediction of a specific criterion (e.g., job performance or college GPA).
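Sensitivity, specificity, and efficiency reduce to simple ratios over a 2×2 decision table. The sketch below uses made-up counts to show the calculations:

```python
def decision_accuracy(tp, fp, tn, fn):
    """Decision-accuracy indices from a 2x2 decision table.

    tp: phenomenon present, instrument says present
    fp: phenomenon absent,  instrument says present (false positive error)
    tn: phenomenon absent,  instrument says absent
    fn: phenomenon present, instrument says absent
    """
    sensitivity = tp / (tp + fn)                   # identifies presence
    specificity = tn / (tn + fp)                   # identifies absence
    efficiency = (tp + tn) / (tp + fp + tn + fn)   # correct / total decisions
    return sensitivity, specificity, efficiency

# Hypothetical screening results for 100 clients
sens, spec, eff = decision_accuracy(tp=40, fp=10, tn=45, fn=5)
print(round(sens, 2), round(spec, 2), round(eff, 2))  # 0.89 0.82 0.85
```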

1.2 Reliability

Reliability refers to the consistency of scores on the same test across multiple administrations. Although we expect individuals to receive the same score each time, there is always some degree of error due to various factors. This error makes it challenging for individuals to obtain identical scores on retesting, resulting in a distinction between true scores and observed scores.

A person’s observed score (X) is equal to his or her true score (T) plus the amount of error (e) present during test administration:

X = T + e

Reliability measures the extent of error in instrument administration. It helps estimate the impact of personal and environmental factors on obtained scores. Researchers assess reliability to determine the instrument’s freedom from measurement error. Methods such as test-retest reliability, alternative form reliability, and internal consistency approximate reliability.

  • Types of Reliability
  • Test-retest reliability examines the consistency of scores across two administrations of the same test over time.
  • Alternative form reliability compares the consistency of scores between two equivalent forms of the same test administered to the same group.
  • Internal consistency measures the consistency of responses to different items within a single administration of the test.
  • Split-half reliability is a form of internal consistency that compares one half of a test to the other. However, it can be challenging to divide tests into comparable halves. Additionally, split-half reliability reduces the length of the test, which can result in less reliable scores compared to longer tests. To compensate mathematically for the shorter length, the Spearman–Brown Prophecy Formula can be used to estimate reliability: 

Spearman–Brown Prophecy Formula = 2rhh / (1 + rhh)

where rhh represents the split-half reliability estimate.
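A quick sketch of the formula in code, using an assumed split-half correlation of .70:

```python
def spearman_brown(r_hh):
    """Estimate full-length test reliability from a split-half correlation r_hh."""
    return 2 * r_hh / (1 + r_hh)

# An assumed split-half correlation of .70 projects to about .82
# for the full-length test
print(round(spearman_brown(0.70), 2))  # 0.82
```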

  • Inter-item consistency compares individual test item responses with each other and the total test score, providing an estimate of internal consistency. Formulas like Kuder-Richardson Formula 20 (for dichotomous items) and Cronbach’s coefficient alpha (for multipoint responses) are used to calculate reliability.
  • Inter-scorer reliability (or inter-rater reliability) assesses the consistency of ratings or assessments between multiple observers or interviewers evaluating the same behavior or individual. This type of reliability is crucial when scorer judgment is involved, such as in subjective responses. For instance, to establish inter-scorer reliability for a depression test with open-ended questions, multiple clinicians would independently score the test.
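Cronbach’s coefficient alpha can be sketched from first principles as a ratio of item variances to total-score variance; the response matrix below is invented for illustration:

```python
def cronbach_alpha(responses):
    """Coefficient alpha for a response matrix.

    responses: one row per test-taker, one column per item.
    """
    k = len(responses[0])   # number of items

    def pvariance(xs):      # population variance
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical multipoint responses: 4 test-takers x 3 items
responses = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1]]
print(round(cronbach_alpha(responses), 2))  # 0.99
```

The high alpha here simply reflects that the invented items rise and fall together across test-takers.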
  • Reporting Reliability

Reliability in test reports and manuals is expressed as a correlation, known as a reliability coefficient. A higher reliability coefficient, closer to 1.00, indicates more reliable scores. Reliability coefficients below 1.00 indicate the presence of error in test scores. Typically, reliability coefficients ranging from .80 to .95 are considered useful, but the acceptable level depends on the test’s purpose. Nationally normed achievement and aptitude tests aim for reliability coefficients above .90, while personality inventories can have lower coefficients and still be considered reliable.

  • Standard Error of Measurement

The true score of a student’s understanding of the material cannot be directly determined due to the presence of measurement error. To estimate the distribution of scores around the true score, the standard error of measurement (SEM) is used. The SEM provides an estimate of the variability of scores that can be expected from repeated administrations of the same test to the same individual. The SEM is computed using the standard deviation (SD) and reliability coefficient of the test instrument:

SEM = SD √(1 − r)

where SD is the instrument’s standard deviation and r is its reliability coefficient.

The SEM, or standard error of measurement, is the standard deviation of a person’s scores when taking the same test multiple times. It provides an estimate of the distribution of scores around the true score. The SEM is inversely related to reliability, meaning that a larger SEM indicates lower test reliability. The SEM is often reported as a confidence interval, representing the range of scores where the true score is likely to fall.
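A minimal sketch of the SEM computation and its use as a 68% confidence band (observed score ± 1 SEM); the SD and reliability values are assumed:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_band_68(observed, sd, reliability):
    """Range within which the true score falls about 68% of the time."""
    error = sem(sd, reliability)
    return observed - error, observed + error

# Assumed instrument: SD = 15, reliability = .91 -> SEM = 4.5
print(round(sem(15, 0.91), 1))           # 4.5
print(confidence_band_68(100, 15, 0.91))
```

For an observed score of 100 on this assumed instrument, the true score likely falls between about 95.5 and 104.5.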

  • Factors That Influence Reliability

At least five influences on reliability of test scores have been noted:

  • Test length. Longer tests are generally more reliable than shorter tests.
  • Homogeneity of test items. Lower reliability estimates are reported when test items vary greatly in content.
  • Range restriction. The reliability of test scores will be lowered by a restriction in the range of test scores.
  • Heterogeneity of test group. Test-takers who are heterogeneous on the characteristic being measured yield higher reliability estimates.
  • Speed tests. These yield spuriously high reliability coefficients because nearly every test-taker gets nearly every item correct.
  • The Relationship Between Validity and Reliability

While both reliability and validity are necessary for credible test scores, it is possible for scores to be reliable but not valid. On the other hand, valid test scores are always reliable.

1.3. Item Analysis

  • Item analysis involves statistically examining test-taker responses to assess the quality of individual test items and the test as a whole. It helps eliminate confusing, overly easy, and overly difficult items from future use. Item difficulty is the percentage of test-takers who answer an item correctly, with higher values indicating easier items. 
  • Item discrimination measures a test item’s ability to differentiate between test-takers based on the construct being measured. Positive item discrimination occurs when higher-scoring test-takers perform better on an item than lower-scoring test-takers, while zero or negative discrimination indicates poor item quality.
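Both indices reduce to simple proportions. The sketch below uses invented right/wrong (1/0) responses and computes discrimination as the difference in difficulty between high- and low-scoring groups, one common index among several:

```python
def item_difficulty(responses):
    """Proportion of test-takers answering the item correctly (1/0 scored)."""
    return sum(responses) / len(responses)

def item_discrimination(upper_group, lower_group):
    """Difficulty for high scorers minus difficulty for low scorers."""
    return item_difficulty(upper_group) - item_difficulty(lower_group)

# Hypothetical item: 8 of 10 high scorers and 3 of 10 low scorers answered correctly
upper = [1] * 8 + [0] * 2
lower = [1] * 3 + [0] * 7
print(item_difficulty(upper + lower))               # 0.55
print(round(item_discrimination(upper, lower), 2))  # 0.5
```

A positive discrimination of .5 indicates the item separates higher- from lower-scoring test-takers well.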

1.4. Test Theory

Test theory aims to ensure the empirical measurement of test constructs, emphasizing the reduction of test error and enhancement of construct reliability and validity. Professional counselors should be familiar with the following test theories:

  • Classical test theory views the observed score as a combination of the true score and measurement error. Its focus is on increasing test score reliability.
  • Item response theory (IRT), also known as modern test theory, uses mathematical models to analyze assessment data. IRT helps evaluate individual test items and the overall test, detecting item bias, equating scores across different tests, and tailoring items to test-takers.
  • The construct-based validity model considers validity as a comprehensive construct, integrating internal and external aspects. It emphasizes the exploration of internal structure and external factors to understand score validity holistically.

1.5. The Development of Instrument Scales

A scale combines items or questions to form a score on a variable. It can measure discrete or continuous variables and present data quantitatively or qualitatively. Quantitative data are numeric, whereas data presented qualitatively use forms other than numbers (e.g., “Check Yes or No”).

  • Scales of Measurement

Understanding measurement scales is crucial for developing instrument scales. There are four types of measurement scales: nominal, ordinal, interval, and ratio.

  • Nominal scale is the simplest, only naming data without order or equal intervals. Examples include gender, which can be labeled as 0 for males and 1 for females.
  • Ordinal scale ranks data without equal intervals, like Likert scales that measure degrees of satisfaction.
  • Interval scale has equal intervals between points on the scale, but no absolute zero point. Temperature measurements are an example.
  • Ratio scale combines nominal, ordinal, and interval qualities, with an absolute zero point. Height is measured on a ratio scale.
  • Types of Scales
  • A Likert scale is used to assess attitudes or opinions. It includes a statement followed by answer choices ranging from Strongly Agree to Strongly Disagree. For example: “GRE scores accurately predict future graduate school performance.” 
Strongly Disagree     Disagree     Neutral     Agree     Strongly Agree
  • Semantic differential is a scaling technique that asks test-takers to place a mark between two opposite adjectives to assess their response to an affective question.

How do you feel about your NCE scores?

Bad_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _Good

  • A Thurstone scale measures attitudes across multiple dimensions using agreement or disagreement with item statements. It uses equal intervals and a paired comparison method.
I feel lonely all the time. Agree [ ] Disagree [ ]
It is difficult to carry out my daily tasks. Agree [ ] Disagree [ ]
I am sleeping more than usual. Agree [ ] Disagree [ ]
I have frequent thoughts of death and suicide. Agree [ ] Disagree [ ]
  • A Guttman scale measures the intensity of a variable being measured. Items are presented in a progressive order so that a respondent, who agrees with an extreme test item, will also agree with all previous, less extreme items.

Please place a check next to each statement that you agree with.

____________ Are you willing to permit gay students to attend your university?

____________ Are you willing to permit gay students to live in the university dorms?

____________ Are you willing to permit gay students to live in your dorm?

____________ Are you willing to permit gay students to live next door to you?

____________ Are you willing to have a gay student as a roommate?

2. Derived Scores

A derived score converts a raw score into a meaningful result by comparing it to a norm group. In this section, we cover the normal distribution, its connection to derived scores, the use of norm-referenced assessments, and how to calculate and interpret percentile ranks, standard scores, and developmental scores.

2.1. The Normal Distribution

Normal distributions are bell-shaped distributions where most scores cluster around the average, with few scores at the extremes. They are commonly observed in natural and psychological measurements. Their characteristics are valuable in the field of assessment.

  • Characteristics of Normal Distributions

A normal distribution is a bell-shaped curve that is symmetrical and asymptotic, with its highest point at the center and progressively lower points toward either tail. Normal distributions also have defined measures of central tendency and variability.

The Normal Curve.

  • The Relationship Between Normal Distributions and Test Scores

Derived scores, such as percentiles, normal curve equivalents, and z-scores, are based on the principles of normal distributions. These scores allow for comparisons between clients’ scores and facilitate comparisons across different tests. Normal distributions provide the mathematical framework for these derived scores.

2.2. Norm-Referenced Assessment

Derived scores are common in norm-referenced assessments, where scores are compared to the average performance of a group. Norms represent the typical score against which others are evaluated. By comparing an individual’s score to the group average, we can determine their relative performance. The following table lists examples of commonly used norm-referenced assessments. 

Examples of Norm-Referenced Assessments.

  • College admissions exams: GRE, SAT, ACT, MCAT, GMAT
  • Intelligence testing: Stanford-Binet, Wechsler
  • Personality inventories: MBTI, CPI

2.3. Percentiles

Percentage scores are easily confused with percentiles. A percentage score is simply the raw score (i.e., the number of correct items) divided by the total number of test items. Although the percentage score calculation is straightforward, it must be compared to some criterion or norm to give it interpretive meaning.

A percentile rank compares a person’s score to a norm group and indicates the percentage of scores at or below that score. Percentile ranks range from 1 to 99, with a mean of 50. They are not equal units of measurement and tend to exaggerate differences near the mean.
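Percentile rank can be sketched as a simple count against a norm group; the norm scores below are invented:

```python
def percentile_rank(score, norm_group):
    """Percentage of norm-group scores at or below the given score."""
    at_or_below = sum(1 for s in norm_group if s <= score)
    return 100 * at_or_below / len(norm_group)

# Hypothetical norm group of 10 scores
norm_group = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90]
print(percentile_rank(72, norm_group))  # 60.0
```

A raw score of 72 here sits at the 60th percentile because 6 of the 10 norm-group scores fall at or below it.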

2.4. Standardized Scores

Standardization in assessment involves converting raw scores to standard scores, which serve as reference points for comparing test results. A standardized score is obtained by applying formulas that convert the raw score to a new score based on a norm group. These scores indicate the number of standard deviations a score is above or below the mean. 

Standardized scores allow for comparisons across different tests and are used in norm-referenced assessments. Examples of standardized scores include z-scores, T scores, deviation IQ, stanine scores, and normal curve equivalent scores.

  • Z-Scores

The z-score is the most basic type of standard score. A z-score distribution has a mean of 0 and a standard deviation of 1. It simply represents the number of standard deviation units above or below the mean at which a given score falls. z-scores are derived by subtracting the sample mean from the individual’s raw score and then dividing the difference by the sample standard deviation.

z = (X −M) / SD

  • In the preceding equation, X is an individual’s raw score, M represents the sample mean, and SD is the sample’s standard deviation. You must know the sample’s mean and standard deviation to convert an individual’s raw score into a z-score. 
  • A z-score represents an individual’s raw score in standard deviation units. Therefore, we can tell if an individual’s score is above or below the mean and where it falls on the normal curve. 
  • It is important to remember that z-scores are the first step in analyzing a raw score, because almost any other type of derived score can be found using the z-score formula.
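The z-score formula in code, with an assumed sample mean of 100 and standard deviation of 15:

```python
def z_score(raw, mean, sd):
    """z = (X - M) / SD."""
    return (raw - mean) / sd

# A raw score of 115 on a test with M = 100 and SD = 15
print(z_score(115, 100, 15))  # 1.0
```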
  • T Scores

A T score is a type of standard score that has an adjusted mean of 50 and a standard deviation of 10. These scores are commonly used when reporting the results of personality, interest, and aptitude measures. T scores are easily derived from z-scores by multiplying an individual’s z-score by the T score standard deviation (i.e., 10) and adding the product to the T score mean (i.e., 50):

T = 10(z) + 50

For example, a z-score of +1 converts to T = 10(1) + 50 = 60.

T scores are interpreted similarly to z-scores, meaning that a T score above 50 represents a raw score that is above the mean, and a T score below 50 represents a raw score that is below the mean.

  • Deviation IQ or Standard Score

Deviation IQ scores are used in intelligence testing. Although deviation IQs are a type of standardized score, they are often referred to simply as standard scores (SS), because they are commonly used to interpret scores from achievement and aptitude tests. Deviation IQs have a mean of 100 and standard deviation of 15 and are derived by multiplying an individual’s z-score by the deviation IQ standard deviation (15) and adding it to the deviation IQ mean (100).

SS = 15 (z) + 100

Deviation IQ scores are interpreted similarly to z-scores and T scores, meaning that a score above 100 represents a raw score that is above the mean, and a score below 100 represents a raw score that is below the mean.

  • Stanines

A stanine is a standard score used on achievement tests that divides the normal distribution into nine intervals. Each interval has a width of half a standard deviation. Stanine scores represent a range of z-scores and percentiles. For example, a stanine score of 4 represents a z-score range of −.75 to −.26 and a percentile range of 23 to 40. Stanines have a mean of 5, a standard deviation of 2, and a range of 1 to 9. To calculate stanine scores, multiply the stanine standard deviation by the z-score and add it to the stanine mean. Stanines are rounded to the nearest whole number.

Stanine = 2 (z) + 5

  • Normal Curve Equivalents

The normal curve equivalent (NCE) is a score used to rank individuals in comparison to their peers. It ranges from 1 to 99, dividing the normal curve into 100 equal parts. The NCE has a mean of 50 and a standard deviation of 21.06. An NCE score can be converted from a z-score by multiplying the NCE standard deviation (SD = 21.06) by an individual’s z-score and adding the NCE mean (M = 50).

NCE = 21.06 (z) + 50
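Because every standard score above is a linear transformation of z (plus rounding and range limits for stanines), the conversions can be sketched together; the z-score of 1.0 is an arbitrary example:

```python
def t_score(z):
    return 10 * z + 50       # T score: M = 50, SD = 10

def deviation_iq(z):
    return 15 * z + 100      # Deviation IQ / SS: M = 100, SD = 15

def stanine(z):
    # M = 5, SD = 2, rounded and kept within the 1-9 range
    return min(9, max(1, round(2 * z + 5)))

def nce(z):
    return 21.06 * z + 50    # NCE: M = 50, SD = 21.06

z = 1.0  # one standard deviation above the mean
print(t_score(z), deviation_iq(z), stanine(z), round(nce(z), 2))  # 60.0 115.0 7 71.06
```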

2.5. Developmental Scores

Developmental scores place an individual’s raw score on a developmental continuum. They directly compare a person’s score to others of the same age or grade level, providing context for their performance. Developmental scores are commonly used for assessing children and young adolescents.

  • Age-equivalent scores compare an individual’s score with the average score of peers of the same age, reported in years and months. For example, a 7-year-5-month-old child with an age equivalent score of 8.2 in height is of average height for an 8-year-2-month-old.
  • Grade-equivalent scores compare an individual’s score with the average score of peers at the same grade level, reported as a decimal representing grade level and months. For example, a grade equivalent score of 5.6 means the individual scored the average score of a student who has completed 6 months of the fifth-grade year.
  • Grade equivalents indicate how an individual’s score compares to peers at the same grade level, but they don’t determine readiness for a higher or lower grade. A seventh-grader with a grade equivalent of 10.2 in math should not be moved to 10th-grade math. Grade equivalents show relative performance within the grade level, not absolute skill level.

3. Assessment Tests and Tools

3.1. Ability Assessment

Ability assessment encompasses a range of assessment tools that evaluate cognitive skills. These skills involve various cognitive processes such as knowledge, comprehension, application, analysis, synthesis, and evaluation. Ability assessment includes tests that measure both achievement and aptitude. The main categories of ability assessment tests are described below:

  • Achievement Tests

Achievement tests are designed to assess what one has learned at the time of testing.

  •  Survey Batteries: Survey batteries refer to a collection of tests that measure individuals’ knowledge across broad content areas. Survey batteries are usually administered in school settings and are used to assess academic progress; examples include the Stanford Achievement Test (SAT 10), Iowa Test of Basic Skills, Metropolitan Achievement Test (MAT8), and TerraNova, Third Edition.
  • Diagnostic Tests: Diagnostic tests are designed to identify learning disabilities or specific learning difficulties in a given academic area, such as Wide Range Achievement Test (WRAT4), Key Math Diagnostic Test (Key Math 3), Woodcock Johnson IV–Tests of Achievement (WJ IV ACH), Peabody Individual Achievement Test–Revised, and Test of Adult Basic Education (TABE).
  • Readiness Testing: Readiness tests refer to a group of criterion-referenced achievement assessments that indicate the minimum level of skills needed to move from one grade level to the next.
  • Aptitude Tests

Aptitude tests attempt to predict how well an individual will perform in the future.

  • Cognitive Ability Tests: Cognitive ability tests make predictions about an individual’s ability to perform in future grade levels, colleges, and graduate schools. Some common cognitive ability tests are:
  • The Cognitive Abilities Test (CogAT® Form 6)
  • Otis-Lennon School Ability Test® (OLSAT8)
  • ACT Assessment
  • SAT® Reasoning Test
  • GRE® Revised General Test
  • Miller Analogies Test (MAT)
  • Law School Admission Test (LSAT)
  • Medical College Admission Test (MCAT)
  • Vocational Aptitude Testing: Vocational aptitude testing refers to a set of predictive tests designed to measure one’s potential for occupational success. Vocational aptitude testing includes multiple aptitude tests and special aptitude tests, such as The Armed Services Vocational Aptitude Battery (ASVAB) and The Differential Aptitude Test® Fifth Edition (DAT).
  • Intelligence Tests

Intelligence tests broadly assess an individual’s cognitive abilities.

  • Theories of Intelligence: The following theories inform the intelligence tests in common use:
  • Francis Galton: Emphasized heritability and eugenics.
  • Alfred Binet and Theodore Simon: Developed the Binet-Simon scale.
  • Lewis Terman: Revised and translated the Binet-Simon scale.
  • William Stern: Developed the ratio intelligence quotient.
  • Charles Spearman: Two-factor approach to intelligence.
  • Louis Thurstone: Multifactor theory recognizing seven primary mental abilities.
  • Philip Vernon: Hierarchical model of intelligence.
  • J. P. Guilford: Multidimensional model of 180 factors involving three types of cognitive ability.
  • Raymond Cattell: Fluid and crystallized intelligence model.
  • Robert Sternberg: Triarchic theory of intelligence.
  • Howard Gardner: Theory of multiple intelligences.
  • Cattell–Horn–Carroll (CHC) theory of cognitive abilities.
  • Intelligence Tests: Specific intelligence tests are grounded in the theories of intelligence described above.
  • High Stakes Testing

High stakes testing refers to the use of standardized test outcomes to make a major educational decision concerning promotion, retention, educational placement, and entrance into college. Common examples of high-stakes tests include:

  • Standardized tests administered to measure school progress under No Child Left Behind (NCLB)
  • Advanced placement exams
  • High school exit exam
  • Driver’s license tests
  • Professional licensure and certification examinations

3.2. Clinical Assessment

Clinical assessment involves a comprehensive evaluation of clients using various methods, including personality testing, observation, interviews, and performance assessments. It aims to enhance clients’ self-awareness and aid professional counselors in understanding clients’ needs and developing treatment plans.

  • Assessment of Personality

Personality tests focus on evaluating an individual’s emotional and behavioral traits that tend to remain consistent throughout adulthood, such as temperament and behavioral patterns. These tests are classified as either objective or projective personality tests.

  • Objective Personality Tests: Objective personality tests identify personality types, personality traits, personality states, and self-concept. The most commonly administered objective personality tests are Myers-Briggs Type Indicator (MBTI), The Sixteen Personality Factors Questionnaire (16PF), Minnesota Multiphasic Personality Inventory-2 (MMPI-2), Millon Clinical Multiaxial Inventory, California Psychological Inventory (CPI 434), The NEO Personality Inventory (NEO PI-3), and Coopersmith Self-Esteem Inventories (SEI).
  • Projective Personality Tests: Projective personality tests assess personality factors by interpreting a client’s response to ambiguous stimuli. The most commonly administered projective personality tests are Rorschach inkblot test, Thematic Apperception Test (TAT), House-Tree-Person (HTP), and Sentence completion tests.
  • Informal Assessments

Informal assessments refer to subjective assessment techniques that are developed for specific needs. Types of informal assessment include observation, clinical interviewing, rating scales, and classification systems.

  • Observation: Observation refers to the systematic observation and recording of an individual’s overt behaviors. Behavioral assessments can be conducted by gathering data from direct or indirect observation.
  • Clinical Interviewing: Clinical interviewing refers to the process by which a professional counselor uses clinical skills to obtain client information that will facilitate the course of counseling. Many different types of interviews exist, and they can all be classified as structured, semistructured, or unstructured.
  • Rating Scales: Rating scales typically evaluate the quantity of an attribute. 
  • Classification Systems: Classification systems are used to assess the presence or absence of an attribute. Three commonly used classification systems are Behavior and feeling word checklists, Sociometric instruments, and Situational tests. 
  • Other Types of Assessments

Other forms of assessments often used in counseling include the mental status exam, performance assessment, suicide assessment, and trauma assessment.

  • Mental Status Exam: The mental status exam (MSE) is used by professional counselors to obtain a snapshot of a client’s mental symptoms and psychological state. The MSE addresses several key areas, including appearance (physical aspects of a client), attitude, and movement and behavior.
  • Performance Assessment: Performance assessments are a nonverbal form of assessment that entails minimal verbal communication to measure broad attributes. 
  • Suicide Assessment: Suicide assessment refers to determining a client’s potential for committing suicide. Specifically, the professional counselor must make a clinical judgment concerning the client’s suicide lethality.
  • Trauma Assessment: Trauma can be broadly defined as an emotional response to an event(s) where an individual experiences physical and/or emotional harm.