Reliability, minimum detectable change and sociodemographic biases of selected neuropsychological tests among people living with HIV in south-eastern Nigeria

Verification of the psychometric properties of neuropsychological (NP) tests in each society of people living with HIV (PLWHIV) will facilitate accurate classification of HIV-associated neurocognitive disorder. This study aimed to determine the reliability, minimum detectable change (MDC) and sociodemographic biases of selected NP tests among PLWHIV. The study took place at the HIV clinic of the University of Nigeria Teaching Hospital, Enugu. A total of 60 PLWHIV were randomly recruited into two groups of 30 each. The first group was evaluated by two independent examiners (inter-rater) and the other by a single rater (intra-rater). The NP tests utilised included the Hopkins Verbal Learning Test-Revised (HVLT-R), Controlled Oral Word Association Test (COWAT), Trail Making Test-A (TMT-A) and -B (TMT-B), Digit Span Test-forward (DST-f) and -backward (DST-b). We examined agreement using intra-class correlation (ICC), standard error of measurement and MDC. We verified the influence of sociodemographic variables on test performance using the Mann–Whitney U-test and the Kruskal–Wallis test. The HVLT-R-delay recall (DR), TMT-A, TMT-B and COWAT showed excellent inter-rater reliability with ICC values of 0.83, 0.86, 0.78 and 0.89, respectively. The HVLT-R-verbal learning (VL), DST-f and DST-b showed moderate inter-rater reliability with ICCs of 0.4.99, 0.52 and 0.60, respectively. The HVLT-R-DR, TMT-A, DST-b and COWAT showed excellent intra-rater reliability, with ICC values of 0.76, 0.80, 0.84 and 0.97, respectively. The TMT-A, DST-f and DST-b were free from sociodemographic bias. The HVLT-R-DR, TMT-A, TMT-B, DST-f, DST-b and COWAT are reliable candidate NP tests for PLWHIV in our setting.


Introduction
Human immunodeficiency virus (HIV)-associated neurocognitive disorder (HAND) remains prevalent in sub-Saharan Africa, especially in Nigeria, which has the world's second-highest HIV and AIDS incidence (Awofala & Ogundele, 2018; Nweke, Akinpelu, & Ezema, 2019; Yakasai et al., 2015). HIV targets the central nervous system (CNS) and compromises the blood-brain barrier, causing HIV-infected microglia to infiltrate the CNS and secrete neurotoxic viral proteins such as Tat and proinflammatory cytokines, as well as disrupting neurogenesis, dysregulating CD4+ T cells and damaging synaptodendritic networks (Cody & Vance, 2016). These occurrences result in damage to specific brain structures and neural circuits, increasing the brain's predisposition for acquiring subsequent neuropsychological (NP) diseases, such as neurocognitive disorder (Watkins & Treisman, 2012). HAND is a spectrum of disorders comprising HIV-associated dementia (HAD), mild neurocognitive disorder and asymptomatic neurocognitive impairment (Antinori et al., 2007). Memory, learning, information processing speed, executive function, attention and/or concentration and verbal fluency are the most frequently impaired cognitive abilities in people living with HIV (PLWHIV) (Yakasai et al., 2015). The widespread use of antiretroviral therapy (ART) has reduced the burden of HAD; notwithstanding, the prevalence of mild but limiting phenotypes of HAND remains high (Yakasai et al., 2015). Therefore, proper screening tools are necessary to ensure effective evaluation and prompt initiation of treatment for HAND.
Until now, in sub-Saharan Africa, brief measures such as the HIV Dementia Scale and the International HIV Dementia Scale, Montreal Cognitive Assessment Test, NeuroScreen and CogState Brief Battery have dominated neurocognitive screening among PLWHIV (Mwangala, Newton, Abas, & Abubakar, 2019). Their simplicity of administration is attractive, but they are characterised by diagnostic weaknesses, such as an inability to detect asymptomatic or mild cognitive impairment, that make them unacceptable when pursuing a definitive diagnosis of HAND (Singh et al., 2010). For the diagnosis of HAND, the comprehensive NP battery tests are the tools of choice (Robertson, Liner, & Heaton, 2009).
They are the most useful instruments for identifying and classifying the impact of HIV or AIDS vis-a-vis the CNS (Robertson et al., 2009).
In a resource-limited environment where sophisticated laboratory and neuroimaging techniques are not accessible, the application of NP screening to characterise neurocognitive functioning among PLWHIV is essential to successful diagnosis and treatment of HAND (Robertson et al., 2009). Before now, the utility of the NP battery tests in sub-Saharan Africa (SSA) was narrow because they require instrumentation that is not common in the region. Also, their administration demands previous clinical or research experience among people living with HAND (Nweke, Mshunqane, Govender, & Akinpelu, 2021; Singh et al., 2010). In Nigeria, the use of NP tests in the diagnosis of HAND has gained momentum in recent times (Jumar et al., 2017; Robertson et al., 2016). The NP tests are the gold standard for the diagnosis of HAND, but they are cumbersome and culture-specific (Fernandez & Marcopulos, 2008), and this is one of the well-known challenges faced by neuro-HIV researchers and clinicians (Yeatesh & Taylor, 2011). They often require that two or more raters administer them to participants to reduce their burden (Andrews, Janzen, & Saklofske, 2001). This is especially true of clinical trials or normative studies where many participants undergo screening following the diagnostic criteria stipulated by the American Academy of Neurology's research nomenclature and diagnostic method (Antinori et al., 2007). Attrition is a common observation in clinical trials and results from premature discontinuation and missed or flawed assessments (Sridhara, Mandrekar, & Dodd, 2013). The cumbersomeness of NP testing may put the attrition rate above the acceptable attrition margin (20%) (Nunan, Aronson, & Bankhead, 2018), thus undermining the feasibility of a clinical trial or observational study requiring large-scale testing. In the research setting, where the same assessor or other assessors may follow up treatment outcomes, there is a need to examine the reliability of commonly used NP tests (McHugh, 2012). 
The significance of rater reliability stems from the fact that it indicates the degree to which the data obtained are accurate representations of the variables being examined (McHugh, 2012). In clinical research, reliability is most commonly verified using intraclass correlation (ICC) (Lee et al., 2013). However, the use of ICC in the estimation of reliability presents a weakness as it is a measure of relative reliability, thus underscoring the necessity of the estimation of minimum detectable change (MDC) (Lee et al., 2013). The MDC is a precise measure of measurement error and consistency (Lee et al., 2013). Based on the data available to us, no study has examined the psychometric properties of NP tests in the Nigerian context. Hence, this study aimed at examining the reliability and MDC of selected NP tests and the effects of demographic characteristics.

Participants
This was a cross-sectional study conducted among HIV-positive adults aged between 18 and 64 years receiving ART at the HIV clinic of the University of Nigeria Teaching Hospital, Ituku-Ozalla, Enugu, Nigeria. The study took place between June and August 2020. It constitutes one of the preliminary steps undertaken in pursuit of a clinical trial investigating the 'effects of an aerobic exercise programme on neurocognitive disorder in Nigeria'. Owing to the elaborateness of the NP evaluations, the study took place in two batches using different study samples, both to save patients' waiting time, as many of the participants visited the clinic from distant locations, and to reduce contact time with patients in the wake of coronavirus disease 2019 (COVID-19). Simple random sampling by balloting (3 'YES' : 1 'NO') was used to select study participants. Those who picked 'YES' from the ballot box were recruited into the study and consecutively assigned to raters who carried out the independent evaluation. The first batch involved a sample of 30 HIV-positive adults recruited to examine the inter-rater reliability between a neurological physiotherapist and a clinical psychologist, while the second batch, which also involved a sample of 30 HIV-positive adults, examined the intra-rater reliability.
The study population was PLWHIV on ART. In line with Viechtbauer et al. (2015), we estimated the sample size for this pilot study using a pilot sample size calculator available at http://www.pilotsamplesize.com. Using a HAND prevalence rate of 21.5% (Yusuf et al., 2014) and an acceptable level of withdrawal or incomplete assessment of 10%, we required a minimum sample size of 28 to detect the problem with 95% confidence. The inclusion criteria were being HIV positive, being an adult aged between 18 and 65 years, being on ART for at least three months, having formal education (at least primary six) with an ability to use English, and having the capacity to consent. For the inter-rater examination, we used consecutive assessment to assign participants to the raters. We excluded individuals above 65 years of age; smokers; alcohol-dependent individuals; substance abusers; individuals with cardiorespiratory disease (heart attack, asthma or chronic obstructive pulmonary disease [COPD]); those with a history of focal neurological deficit, traumatic brain injury with loss of consciousness, stroke or psychiatric illness including depression; those with an opportunistic infection such as TB, candidiasis or hepatitis; those with a recorded blood pressure over 140/90 mmHg; and those engaged in structured physical activity or taking cognitive-enhancing medication. Exclusion criteria were determined through self-reports obtained during recruitment. Blood pressure, depression, alcohol abuse and substance abuse were measured using appropriate instruments.
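The sample-size arithmetic can be illustrated with a minimal sketch. It assumes that the online calculator applies the formula published by Viechtbauer et al. (2015), n = ln(1 − γ) / ln(1 − π), where γ is the confidence level and π is the probability that a problem occurs in any one participant. Treating the 10% withdrawal or incomplete-assessment rate as π is our reading of the text; how the 21.5% prevalence enters the calculator is not clear from the source. The function name is hypothetical.

```python
import math


def pilot_sample_size(problem_probability: float, confidence: float = 0.95) -> float:
    """Unrounded pilot sample size, n = ln(1 - confidence) / ln(1 - probability).

    Assumption: this is the formula behind the calculator cited in the text.
    """
    if not 0 < problem_probability < 1 or not 0 < confidence < 1:
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.log(1 - confidence) / math.log(1 - problem_probability)


# Assumed inputs: 10% problem probability, 95% confidence.
n_unrounded = pilot_sample_size(0.10, 0.95)
print(round(n_unrounded, 1))
```

Under these assumptions the unrounded value is about 28.4, which is in line with the reported minimum of 28; how the calculator rounds the result is not stated in the source.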

Instruments
Neurocognitive evaluations were undertaken with the aid of NP tests. The NP tests constitute the gold standard instrument in screening and diagnosis of HAND (Yakasai et al., 2015). The Frascati criteria stipulate that NP evaluation for PLWHIV should cover at least five ability domains commonly impaired among PLWHIV, including verbal learning (VL), memory, working memory, attention, abstraction and/or executive function, information processing speed and verbal fluency (Antinori et al., 2007). In this study, we examined four NP tests covering seven ability domains. They include the Hopkins Verbal Learning Test-Revised (HVLT-R), with test-retest reliability indices of 0.537-0.818 (O'Neil-Pirozzi, Goldstein, Strangman, & Glenn, 2012); the Trail Making Test (TMT), with an ICC of 0.7-0.98 (Salthouse, 2011); the Digit Span Test (DST), with a test-retest reliability index of 0.7-0.78 (Groth-Marnat & Baker, 2003); and the Controlled Oral Word Association Test (COWAT, F-A-S letter fluency), with an excellent inter-rater reliability index ≥ 0.9 (Ross et al., 2007). These tests have been used in African and Nigerian settings (Singh et al., 2010; Yakasai et al., 2015) and in a recent clinical trial on HAND (Towe, Puja Patel, & Meade, 2017). They were selected mainly for their utility in the African setting and their ease of administration.

Procedure
A clinical psychologist (evaluator 1) was trained by the lead investigator (evaluator 2), who is familiar with NP testing, and an acceptable degree of agreement between them was established. In the first batch of the study, NP testing was carried out by the two independent evaluators, with each participant being assessed twice on the same day. In the second batch, only evaluator 1 completed NP testing, with each individual being assessed twice, one assessment after the other. Data from the first evaluation by evaluator 1 and data from the second evaluation by evaluator 2, completed on the same day, were utilised for the inter-rater reliability evaluation (Van Lummel et al., 2016). Data from the first and second measurements of evaluator 1 were used to assess intra-rater reliability. Every effort was made to ensure that participants who were being tested stayed away from the consulting bench to avoid contamination.

Data analysis
Exploratory statistics showed that some data sets were not normally distributed and that log-transformation did not improve the distributions; we therefore did not transform the data, as the non-normality could result from sociodemographic biases. Data were summarised using descriptive statistics. We used the ICC to verify rater reliability. Specifically, inter- and intra-rater reliability were examined using ICC(2,1) and ICC(1,1), respectively. Both ICC(2,1) and ICC(1,1) were based on absolute agreement in a two-way (mixed-effects) repeated-measures analysis of variance model. According to the literature, the ICC values were graded as follows: weak (ICC < 0.40), moderate (ICC between 0.40 and 0.75) and excellent (ICC > 0.75) (Sedrez et al., 2016). The following formula was used to calculate the standard error of measurement:
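The formula itself is missing from this text. As a hedged illustration only, the sketch below assumes the definitions commonly used in reliability studies: SEM = SD × √(1 − ICC), MDC at 95% confidence = 1.96 × √2 × SEM, and MDC% = 100 × MDC / mean score. It also shows the Shrout and Fleiss mean-squares form of ICC(2,1) for absolute agreement and the grading bands quoted above. All function names are hypothetical, and the choice of SD (for example, the SD of the pooled scores) is an assumption that the source does not confirm.

```python
import math
from statistics import mean


def icc_2_1(scores):
    """ICC(2,1): two-way model, absolute agreement, single measurement.

    scores is a list of rows, one row per participant, one column per rating.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                # between-participants mean square
    msc = ss_cols / (k - 1)                # between-ratings mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def grade_icc(icc):
    """Grade an ICC with the bands quoted in the text (Sedrez et al., 2016)."""
    if icc < 0.40:
        return "weak"
    if icc <= 0.75:
        return "moderate"
    return "excellent"


def sem(sd, icc):
    """Standard error of measurement (assumed form: SD * sqrt(1 - ICC))."""
    return sd * math.sqrt(1 - icc)


def mdc95(sem_value):
    """Minimum detectable change at 95% confidence (assumed form)."""
    return 1.96 * math.sqrt(2) * sem_value


def mdc_percent(mdc, mean_score):
    """MDC expressed as a percentage of the mean score."""
    return 100 * mdc / mean_score


# Made-up ratings for three participants scored by two raters.
example = [[1, 2], [3, 4], [5, 6]]
print(grade_icc(icc_2_1(example)))
```

An MDC% computed this way can be compared with the 30% acceptance limit used in the Results; the sketch does not cover ICC(1,1).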

Results
A total of 60 PLWHIV (30 each for the inter- and intra-rater reliability groups) with a mean age of 43.5 ± 8.2 years took part in the study. Participants in both groups were comparable concerning the sampled clinical and sociodemographic characteristics. In both groups, there were twice as many females as males. Seventy per cent of the participants had secondary to tertiary education (Table 1).
Evaluation of the inter-rater reliability showed moderate to excellent reliability, with limited to acceptable levels of random measurement error. Specifically, the HVLT-R-DR domain, TMT-A, TMT-B and COWAT showed excellent inter-rater reliability with ICC values of 0.83, 0.86, 0.78 and 0.89, respectively. Three tests, namely the HVLT-R-VL, DST-f and DST-b, showed moderate inter-rater reliability with ICCs of 0.4.99, 0.52 and 0.60, respectively. The results show that all the tests except the HVLT-R had an MDC percentage within the predefined limit of acceptance (< 30%). There was no proportional bias throughout the inter-rater evaluations (p > 0.05) (Table 2).
Examination of the sociodemographic biases of the NP tests revealed that three of the seven tests, namely the TMT-A, DST-f and DST-b, were free from any form of sociodemographic bias, while four tests, namely the HVLT-R-VL, HVLT-R-DR, TMT-B and COWAT, showed at least one form of sociodemographic bias. Specifically, the inter-rater performance of the COWAT was significantly related to age (Rho = 0.36; p = 0.048). The intra-rater performance of both the HVLT-R-VL (p = 0.001) and HVLT-R-DR (p = 0.010) was affected by sex, with females showing better performance than their male counterparts. The inter-rater performance of the HVLT-R-VL (p = 0.046) and the intra-rater performance of the TMT-B (p = 0.003) showed education bias, with individuals with tertiary and secondary education showing better performance (p = 0.007) than those with primary education (Table 4).

Discussion
All the NP tests showed moderate to excellent inter-rater reliability, with acceptable levels of measurement error as determined by the MDC percentage. To facilitate the right interpretation of the results obtained during clinical or research follow-up, it is essential to note the variability inherent in the measurement as defined by the SEM. The SEM shows that the measurement's precision varied from 0.2 to 5.3. We can consider these values clinically acceptable, indicating that NP evaluation using the selected tests can be reliably conducted by two or more raters and yield similar results. By implication, NP evaluations using these tests can be undertaken by two or more raters with similar experience or training during a clinical or research follow-up. There is a paucity of literature on the inter-rater reliability of the selected NP tests among PLWHIV, even though they remain the gold standard instrument in the diagnosis of neurocognitive disorders among PLWHIV (Antinori et al., 2007). The finding of this study corroborates that of Singh et al. (2010), in which an ICC of 0.89 was obtained between a psychiatrist and a psychologist. Similarly, Fals-Stewart (1992) reported excellent inter-rater reliability with the use of TMT-A and -B, although in an HIV-seronegative population. Using two or more evaluators in the assessment of neurocognitive performance is especially important in normative studies and clinical trials with many participants. The MDC percentage for the seven tests was within the predefined limit of acceptance, thus suggesting that the tests possess good sensitivity among PLWHIV in Nigeria. This provides clinicians and researchers with a baseline for the evaluation of the impact of a clinical intervention on NP performance of PLWHIV in Nigeria. 
There was no inter-rater proportional bias, indicating that measurement approaches agree evenly over the measurement range. This means that the boundaries of the agreement are unaffected by the measurement itself (Bland & Altman, 1999).
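A check for proportional bias can be sketched in the spirit of Bland and Altman (1999): compute each participant's between-rater difference and pairwise mean, then see whether the differences drift as the mean increases. The source does not describe the exact procedure used, so the regression-slope approach and the function name below are assumptions, and the ratings are made up for illustration.

```python
from statistics import mean, stdev


def bland_altman(rater_a, rater_b):
    """Return the mean difference, the 95% limits of agreement and the slope
    of the differences regressed on the pairwise means (least squares)."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    means = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    mean_of_means = mean(means)
    sxx = sum((m - mean_of_means) ** 2 for m in means)
    sxy = sum((m - mean_of_means) * (d - bias) for m, d in zip(means, diffs))
    slope = sxy / sxx  # a slope near zero suggests no proportional bias
    return bias, (bias - half_width, bias + half_width), slope


# Made-up scores: rater B is always 2 points higher, so the offset is constant.
bias, limits, slope = bland_altman([10, 20, 30, 40], [12, 22, 32, 42])
print(bias, limits, slope)
```

A constant offset between raters gives a slope of zero, whereas a disagreement that grows with the score gives a non-zero slope. Deciding whether a slope differs significantly from zero, as a reported p-value implies, would need an additional significance test that this sketch omits.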
Regarding intra-rater reliability, all but the HVLT-R-VL showed good to excellent intra-rater reliability, with ICC values between 0.65 and 0.97. The MDC percentage values showed limited measurement error except for the HVLT-R-VL, where the measurement error was greater than the predefined mark of 30%. The MDC values for the HVLT-R-VL, HVLT-R-DR, TMT-A, TMT-B, DST-f, DST-b and COWAT were approximately eight words, one word, 18 s, 23 s, one digit, one digit and one word, respectively. The poor intra-rater reliability of the HVLT-R-VL reflects the considerable level of measurement error, which was the result of interruptions experienced during evaluation. In this study, we assessed participants who were waiting to collect their drugs in an ART clinic. Occasionally, a few participants' attention was drawn away during NP evaluation, and this may constitute a bias for the HVLT-R-VL. For these participants, we ensured, most of the time, that they completed the test at hand before responding to the call. We also emphasised the need to pay attention during the test. However, the poor intra-rater reliability of the HVLT-R-VL suggests that the HVLT-R-VL is sensitive to minute distractions.
Regarding the effects of sociodemographic characteristics on the selected NP tests, previous studies have been inconsistent in their support of age as a variable influencing verbal fluency performance. Our finding supports the postulation that age affects verbal fluency performance, which agrees with the findings of Barry, Bates and Labouvie (2008). To nullify the potential age-related bias, verbal fluency, when used as a candidate NP test, requires the use of age-matched HIV-negative controls. However, contrary to Barry and colleagues, our study showed that older adults performed better than younger adults. Although verbal ability is a crystallised ability that does not deteriorate with age or minor brain dysfunction, phonemic fluency necessitates executive ability, specifically the ability to initiate and maintain effort, as well as the ability to organise information for retrieval, both of which are abilities that are sensitive to nuanced cerebral dysfunction and ageing (Burke, Crowder, Hagan-Burke, & Zou, 2009; Henry & Beatty, 2006; Plumet, Gil, & Gaonac'h, 2005). Hence, the better COWAT performance of the older adults relative to younger adults found in this study could be because of the small sample size, with a skewed age distribution: older adults made up only 10% of the participants. Although education is a potential predictor of COWAT performance, in our study, educational level and sex did not influence verbal fluency. This suggests that education- and sex-adjusted norms may not be necessary for this measure in our setting. The finding that age and sex affected the performance of the HVLT-R-DR is consistent with the result obtained by Vanderploeg et al. (2000), in which age and sex had a significant impact on the performance of the HVLT-R. 
However, unlike Vanderploeg and colleagues, our study showed that educational level did not influence participants' performance on the HVLT-R, suggesting that its selection as a candidate NP test for PLWHIV in our setting may not require educational consideration. It is likely that our sample size, unlike that of Vanderploeg et al. (2000), was insufficient to identify an influence of education on the HVLT-R, or that educational differences were not correctly reflected by years of education. Although Feng et al. (2014) found that being single was linked to a higher risk of cognitive impairment than being married, our study is the first to report a significant effect of marital status on HVLT-R performance among PLWHIV. However, contrary to Feng et al. (2014), this study revealed better performance among single and widowed participants compared with those who were married. The discrepancy could be because of the difference in the study population, with the former conducted amongst elderly Chinese.
In this study, the performance of TMT-B was influenced by sex. This agrees with the findings of Singh et al. (2010), in which gender was associated with TMT performance. It implies that the use of TMT-B in the diagnosis of HAND among PLWHIV must be based on sex-matched norms. However, the fact that education influenced the TMT-B in this study is consistent with Mitrushina, Boone, Razani and D'Elia (1999) and Tombaugh (2004) but contrary to Singh et al. (2010), which is a South African-based study among PLWHIV. This discrepancy may point to variation in sample size and in how well years of education reflect actual experience in the two societies. Just like other NP tests, the effect of sociodemographic characteristics on DST performance is controversial (Choi et al., 2014). Some studies reported significantly higher DST scores in females than in males (Ostrosky-Solis & Azucena, 2006; Singh et al., 2010), while others showed a minor or non-existent gender effect, implying that no gender-related modifications to the normative data are required (Anstey, Matters, Brown, & Lord, 2000; Pena-Casanova et al., 2009). Our study showed that DST performance was not subject to any sex, age or educational bias, thus indicating that sociodemographically matched norms may not be necessary when it is used as a candidate NP test to aid HAND diagnosis. Overall, the imperfect agreement between the findings of this study and those of previous studies supports the postulation that the use of cross-cultural data for an NP test will lead to errors in HAND classification (Fernandez & Marcopulos, 2008).
The limitations of the study include the use of unequal sample categories, practice effects and the distractions encountered during NP evaluation; healthcare professionals drew the attention of some participants during NP assessment. The study draws its strength from being the first of its kind to report reliability and MDC for seven cognitive ability domains relevant to PLWHIV in Nigeria. The HVLT-R-DR, TMT-A, TMT-B, DST-f, DST-b and COWAT are reliable candidate NP tests for PLWHIV. In our setting, the TMT-A, DST-f and DST-b are free from demographic bias and hence may not require demographically adjusted norms, while the TMT-B, HVLT-R and COWAT exhibit at least one form of sociodemographic bias. This may require that appropriate adjustments be made when they are selected as candidate NP tests in pursuit of HAND diagnosis. Extra care should be taken to minimise distraction when administering the HVLT-R. Further studies with larger sample sizes are recommended to establish normative scores for the selected NP tests and to examine the effect of socioeconomic status on NP test performance.