Which criteria are most effective for identifying invalid responses on Goodville questionnaires?

Smartphones, tablets, and wearable sensors become more popular every year. Mobile phones and wearable sensors can collect physiological, social, emotional, and behavioral data in real time with little effort on the client's part (Areán et al., 2016). These features have made smartphones an important tool in psychological research. Mobile versions of questionnaires are an effective method for obtaining information about people's subjective psychological states (Harari et al., 2016). Mobile psychological testing data can inform treatment decisions and track therapeutic responses (Areán et al., 2016).

Developing novel and meaningful behavioral measures requires a psychological approach that considers reliability, validity, and generalizability. Inferences made from smartphone data may underestimate or overestimate certain behaviors, which makes this an important area for future research. Validity and reliability of clinical outcome assessments are essential in mental health apps. Tools whose diagnostic properties have not been tested may produce false results. Most mobile mental health applications use well-known, psychometrically sound assessment tools. Despite the excellent psychometric properties of these tools, however, mobile apps adopt them in a new format almost blindly. Respondents receive no clear guidance on how to answer questions in a mobile application.

In addition to being a mobile game, the Goodville app includes a module for assessing emotional wellness. The emotional well-being assessment module includes several self-assessment questionnaires for screening depression (QIDS-SR16, PHQ-8) and anxiety (GAD-7). A gamified emotional health assessment is accessible to participants from a wide range of groups. However, game motivation can negatively affect responses to the module's questionnaire items. Some respondents may answer the questionnaire items incorrectly. Players who lack motivation often click on the same response categories or provide random answers without reading the items' content. At the same time, it is assumed that a substantial portion of participants will provide accurate emotional health information. It is therefore extremely important to determine the extent to which the mobile psychological data obtained serve the diagnostic purpose stated for the data collection.

It is important that the data acquired by mobile devices capture the concept of interest (e.g., depression or anxiety). Because data interpretation is based on total scores, results also need to rest on a suitable measurement model, which means that the summary score obtained should be valid and measurable from a psychometric perspective (Embretson & Reise, 2013; Mari et al., 2021). The underlying theory specifies how indicators (items) that are related to a construct (e.g., depression) must be used to measure the construct. A person's endorsement of an item should be determined only by their ability (for example, the level of depression) and not by other factors (Mari et al., 2021; McClimans et al., 2017; Mohamad et al., 2015). An individual who is more depressed is more likely to endorse an item describing a low energy level than an individual who is not depressed. Even though this item does not measure depression directly, it contributes to a depression total score when combined with other items related to depression. Traditionally, self-reported outcome questionnaires have been evaluated using classical test theory (CTT) (Cappelleri et al., 2014), which has been the basis of test development since the 1930s. However, CTT does not provide methods for detecting fake (invalid) responses in mobile test data. Because all mobile data are accepted as valid, study results become significantly less accurate and valid (Cappelleri et al., 2014; de Champlain, 2010; Embretson & Reise, 2013; McClimans et al., 2017). Owing to these shortcomings, the situation has changed significantly in the last few decades. Modern item response theory (IRT) can be used to determine the psychometric requirements that items in mobile apps must meet to measure a particular construct (Mari et al., 2021; McClimans et al., 2017; Mohamad et al., 2015).
Recent research has demonstrated that modern statistical approaches, such as Rasch Measurement Theory (RMT), can contribute to the improvement of diagnostic instruments. RMT has some particularly desirable attributes, such as interval-level scaling of model parameters, sample-free calibration of tests, and item-free measurement of individuals. In the RMT framework, a unidimensional relationship is calculated between item difficulty (e.g., the depression level expressed by the item) and person ability (e.g., the person's depression level) by evaluating the number of positive and negative endorsements and expressing the difference as log-odds. In RMT, the likelihood of an individual endorsing an item is logistically related to the difference between the person's ability and the item's difficulty. RMT has an advantage over other IRT models in that model parameters are estimated from a summary score. Unlike other measurement models, RMT integrates a total score with an analysis of individual responses. In the measurement process, Rasch analysis determines whether the total score is a sufficient statistic for assessing the severity of the ability measured (Engelhard, 2013; Wilson & Fisher, n.d.).
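The logistic relationship described above can be illustrated with a minimal Python sketch. It assumes the simplest dichotomous form of the Rasch model (an item is either endorsed or not), whereas the questionnaires discussed here use four response categories; the function names are illustrative and are not taken from any Rasch software.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch model.

    theta      -- person ability in logits (e.g. depression level)
    difficulty -- item difficulty in logits (severity expressed by the item)
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def log_odds(p: float) -> float:
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# A person located 1 logit above an item endorses it with probability ~0.73,
# and the log-odds of that probability recover the ability-difficulty gap.
p = rasch_probability(theta=1.5, difficulty=0.5)
gap = log_odds(p)
```

In this form the log-odds of endorsement equal the difference between ability and difficulty, which is why both parameters can be placed on the same interval (logit) scale.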

To achieve the properties of the Rasch model, the empirical mobile psychological data must fit the model's predictions. Rasch fit statistics indicate whether the data fit the model well and whether the mobile measurement scale is useful (Andrich & Marais, 2019). Rasch fit statistics are chi-squares divided by their degrees of freedom, also known as mean squares. They are calculated by squaring the differences between the observed and predicted responses and averaging these squared residuals (B. D. Wright & Masters, 1990). Rasch fit statistics are divided into two categories. Person fit statistics measure how well a person's responses to the items match the model's predictions. Item fit statistics indicate how well items contribute to a one-dimensional scale. For each category, two chi-square statistics, Outfit and Infit, are calculated to determine the quality of respondents' responses and of items (Andrich & Marais, 2019; Green & Frantom, 2002). The Outfit is the unweighted mean square (UMS): squared residuals are standardized by the variance of the expected response and then averaged over items (person UMS) or over respondents (item UMS). The Infit is the weighted mean square (WMS): non-standardized squared residuals are summed and divided by the summed variance of the expected responses for persons (person WMS) or for items (item WMS). Both UMS and WMS have standardized versions that are referred to the standard normal distribution and expressed in Z-units. The Wilson-Hilferty cube-root transformation is used to standardize the mean squares (B. D. Wright & Masters, 1990). Thus, in RMT, four fit indices are calculated for each person and item: UMS, WMS, and their standardized versions, ZstdUMS and ZstdWMS.
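The two person mean squares can be sketched in a few lines of Python. This is a simplified illustration that assumes dichotomous items and already-known ability and difficulty values; the study itself used polytomous items and Rasch software, so the function names and example numbers below are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Dichotomous Rasch model: probability of endorsing the item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def person_fit(responses, theta, difficulties):
    """Return (UMS, WMS) for one person's 0/1 responses.

    UMS (outfit): mean of squared residuals, each divided by its model variance.
    WMS (infit):  sum of squared residuals divided by the sum of model variances.
    """
    squared_residuals = []
    variances = []
    for x, b in zip(responses, difficulties):
        p = rasch_probability(theta, b)       # expected response
        variances.append(p * (1.0 - p))       # model variance of the response
        squared_residuals.append((x - p) ** 2)
    ums = sum(r / w for r, w in zip(squared_residuals, variances)) / len(responses)
    wms = sum(squared_residuals) / sum(variances)
    return ums, wms


# Hypothetical five-item scale, person located at 0 logits.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
orderly = person_fit([1, 1, 1, 0, 0], 0.0, difficulties)   # easy items endorsed
erratic = person_fit([0, 1, 1, 0, 1], 0.0, difficulties)   # easiest missed, hardest endorsed
```

In this toy example the orderly pattern yields mean squares below 1, while the erratic pattern pushes both statistics above 1.5, with the unweighted statistic reacting more strongly.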


In the context of this article, person fit statistics are of particular interest. Person fit indices indicate whether individuals respond to items consistently or whether they respond idiosyncratically or irregularly. Inattentiveness, boredom, confusion, or unusual salience may alter people's responses and make them inconsistent. A respondent who is in a hurry or unwilling to make an effort can simply choose the same predictable answers to the questionnaire items. Rasch person fit statistics make it possible to identify invalid response patterns that are unsuitable for mobile psychological measurement. In general, responses with mean square fit statistics between 0.50 and 1.50 (B. Wright & Linacre, 1980) are considered productive for measurement, while fit statistics below 0.50 indicate highly predictable responses with little dispersion. Fit statistics above 1.50 distort measurement because of an irregular, inconsistent, or bizarre response structure. As for standardized statistics, values between -2.0 and 2.0 (Embretson & Reise, 2000; B. D. Wright & Masters, 1990) are appropriate for measurement, values below -2.0 are excessively predictable, and values above 2.0 indicate distortion. The rationale behind these criteria is mainly theoretical, and they are primarily used to assess the quality of items. How these values apply to evaluating respondents' responses remains unclear. The UMS (outfit) is sensitive to unexpected responses from people who give irregular, chaotic, and invalid responses. Its value increases as the data sample exhibits more unpredictable response patterns. The WMS is information-weighted: it emphasizes residuals from well-matched person-item encounters and places less weight on residuals from highly unexpected responses. Because big data samples obtained through mobile assessments may contain a significant number of unexpected or invalid responses, the values of the two person fit statistics may differ significantly.
There are no empirical studies on which type of Rasch person fit statistic is more informative for selecting valid responses from big data samples obtained during mobile psychological testing.

The aim of this study was to determine the most effective criteria for selecting valid responses from big data samples obtained using mobile applications for questionnaires based on Rasch person fit statistics.

A validity analysis was conducted on the items of the Quick Inventory of Depressive Symptomatology (QIDS-SR16). The QIDS-SR16 was developed in 2000 using items from the 30-item Inventory of Depressive Symptomatology (IDS) (Bernstein et al., 2010; Brown et al., 2008; Rush et al., 2003; Trivedi et al., 2004). Its 16 items are converted into the nine domains of the DSM-IV symptom profile for major depression. Items 1-4 describe different types of sleep disturbance. Item 5 evaluates depressive mood. Items 6 and 7 describe decreased appetite and increased appetite, respectively. Weight loss and weight gain are the topics of items 8 and 9. The next items evaluate such symptoms of depression as concentration difficulties (item 10), self-criticism (item 11), suicidal thoughts (item 12), low interest (item 13), and low energy (item 14). The last two items (15 and 16) address slowness and restlessness. Each item is scored from 0 to 3, with higher scores representing greater psychopathology (Rush et al., 2003). The QIDS-SR16 was chosen for determining the criteria of response validity because of the unique features of its item set. Some items on the questionnaire evaluate states that are opposite to each other: "decreased appetite" and "increased appetite", "decreased weight" and "increased weight", "slowness" and "restless". Because the questionnaire evaluates the respondent's state over the past seven days, it is unlikely that a person could have both significantly decreased and significantly increased appetite or weight during that period.
For example, response category 2 of the "decreased appetite" item reads: "I eat much less than usual and only with personal effort," while the equivalent response category of the "increased appetite" item reads: "I regularly eat more often and/or greater amounts of food than usual." It is unlikely that, within the same 7 days, a respondent both had to make an effort to eat and regularly ate more food than usual. Similarly, response category 3 of the "decreased weight" item reads "I have lost 5 pounds or more", while the corresponding category of the "increased weight" item reads "I have gained 5 pounds or more". In general, it is impossible to gain 5 pounds or more and immediately lose them within 1-2 weeks. Among the QIDS-SR16 items, "decreased appetite" and "decreased weight" reflect typical symptoms of depression, while "increased appetite" and "increased weight" measure atypical symptoms. A person with depression cannot experience typical and atypical somatic symptoms at the same time. To a lesser extent, a similar situation may apply to the items "slowness" and "restless". Feelings of slowing down and restlessness in depression, however, are not as stable as appetite and body weight symptoms. Thus, when a respondent selects the maximum response categories on the items "decreased appetite" and "decreased weight", zero responses are expected on "increased appetite" and "increased weight", indicating the absence of these symptoms. High categories on all of these items cannot be considered valid, since a person cannot experience opposite somatic symptoms of depression within a short period of time. It follows that high-scoring response patterns on both "decreased appetite" and "increased appetite", or on both "decreased weight" and "increased weight", can be considered clear cases of invalid responses.
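The content-based rule described above can be expressed as a small check. The sketch below is an illustration only: the article does not state the exact score that counts as "high", so the cut-off of 2 and the function name are assumptions.

```python
HIGH = 2  # assumed cut-off for a "high" response category (scores range 0-3)


def is_contradictory(decreased_appetite, increased_appetite,
                     decreased_weight, increased_weight, high=HIGH):
    """Flag a pattern that reports opposite somatic symptoms at high severity."""
    appetite_conflict = decreased_appetite >= high and increased_appetite >= high
    weight_conflict = decreased_weight >= high and increased_weight >= high
    return appetite_conflict or weight_conflict


# Raw scores in the order: decreased appetite, increased appetite,
# decreased weight, increased weight.
flag_conflicting = is_contradictory(3, 2, 3, 3)  # opposite symptoms both high
flag_consistent = is_contradictory(3, 0, 2, 0)   # atypical symptoms absent
```

A rule of this kind uses item content, which the Rasch fit statistics themselves do not see, and so it can serve as an external reference for judging how well the fit statistics detect invalid patterns.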

Person fit statistics are calculated using functional relationships between model parameters (item difficulty, ability, and total score) without considering the content of items. Generally, these functional relations follow the law of order. According to this law, the higher the total score on the scale, the higher the expected response categories. In this case, the items do not contradict one another, and the scale is evaluated as construct-valid. Consequently, if all the answers to the items "decreased appetite", "increased appetite", "decreased weight", and "increased weight" have high scores and the same increasing order of response categories (0 = absence of the symptom, 3 = maximum severity of the symptom), then such response patterns are considered valid under the Rasch system. Realistically, however, such response patterns are invalid. Therefore, to make these patterns visible as invalid to the Rasch analysis, we reversed the order of response categories for the items "increased appetite" and "increased weight" (0 = maximum severity of the symptom, 3 = no symptom). After this recoding, response patterns on these four items such as 3-0-2-0 appeared valid, whereas patterns such as 3-2-3-3 were clearly treated as invalid. To determine the validity of response patterns more accurately, we included only the following items in the analysis: "depressive mood", "decreased appetite", "increased appetite", "decreased weight", "increased weight", "concentration difficulties", "self-criticism", "suicidal thoughts", "low interest", and "low energy". We excluded the items describing sleep disturbances and psychomotor symptoms from the analysis because of their potential impact on the validity evaluation.
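The reversal of response categories amounts to replacing each score x on the two atypical-symptom items with 3 - x. A minimal sketch follows; the item keys are hypothetical labels, and the example patterns are read here as raw (unreversed) responses.

```python
MAX_CATEGORY = 3
REVERSED_ITEMS = {"increased_appetite", "increased_weight"}  # hypothetical keys


def recode(pattern):
    """Reverse-score the atypical-symptom items; leave other items unchanged."""
    return {
        item: (MAX_CATEGORY - score if item in REVERSED_ITEMS else score)
        for item, score in pattern.items()
    }


consistent = recode({
    "decreased_appetite": 3, "increased_appetite": 0,
    "decreased_weight": 2, "increased_weight": 0,
})
conflicting = recode({
    "decreased_appetite": 3, "increased_appetite": 2,
    "decreased_weight": 3, "increased_weight": 3,
})
```

After recoding, the consistent raw pattern 3-0-2-0 becomes 3-3-2-3, which agrees with a high total score, while the conflicting raw pattern 3-2-3-3 becomes 3-1-3-0, so its low recoded values stand out as unexpected next to the high scores on the other items.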


The data sample included 275,799 response patterns provided by respondents from the United States from April 2021 to September 2022. Rasch analysis was applied to all the data. Four fit statistics were calculated for each response pattern to determine the validity of the responses: unweighted mean square (UMS), standardized unweighted mean square (ZstdUMS), weighted mean square (WMS), and standardized weighted mean square (ZstdWMS). The validity assessment used three ranges of fit statistic values. For the non-standardized indices, values up to 0.5 were considered excessively predictable, values between 0.5 and 1.5 were identified as valid, and values above 1.5 were regarded as invalid. For the standardized indices, values up to -2.0 indicated excessively predictable responses, values from -2.0 to 2.0 indicated valid responses, and values over 2.0 indicated invalid responses. Two groups of response patterns were selected for the validity evaluation. The first group contained patterns with contradictory responses on the appetite and weight items. The second group consisted of patterns with consistent response sequences on these items. There were 10,111 presumably invalid patterns in the first group. The second group contained 5,491 patterns with correct responses on the appetite and weight items.
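The three ranges can be written as a single classification function. This is a sketch of the decision rule only; how values falling exactly on a boundary were treated is not stated, so the handling below (boundary values counted as valid) is an assumption.

```python
def classify_fit(value, low, high):
    """Assign a fit statistic to one of the three ranges used for validity."""
    if value < low:
        return "excessively predictable"
    if value > high:
        return "invalid"
    return "valid"


def classify_mean_square(value):
    """Non-standardized indices (UMS, WMS): valid range 0.5 to 1.5."""
    return classify_fit(value, low=0.5, high=1.5)


def classify_standardized(value):
    """Standardized indices (ZstdUMS, ZstdWMS): valid range -2.0 to 2.0."""
    return classify_fit(value, low=-2.0, high=2.0)


labels = [classify_mean_square(v) for v in (0.3, 1.0, 2.4)]
```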

Figure 1 shows the results of using Rasch person fit statistics to recognize invalid responses in the first group.

Figure 1. The distribution of invalid and valid responses in the first group based on the person fit statistic indices

According to Figure 1, the non-standardized unweighted mean square index (UMS) was the only one that correctly identified invalid responses. The other indices identified only about half of the invalid patterns. Because all fit statistics were expected to recognize the vast majority of incorrect patterns in the first group, these results are unexpected. All patterns in the first group had high scores on the items "decreased appetite", "increased appetite", "decreased weight", and "increased weight". In every response pattern, at least two answers to the items on appetite and body weight were invalid, since they contradicted the other two items on the same symptoms. Two unexpected responses in the maximal category can shift the total score of a pattern by six points. In Rasch analysis, all four person fit statistics are considered when evaluating the validity of respondents' responses. Given that three of the four fit statistics revealed only about 50%-60% of the invalid responses in the first group, one might conclude that the UMS overestimated invalidity, had the validity of those responses not been known in advance. This study, however, used entirely invalid response patterns as the first group. Thus, the WMS and both standardized person fit statistics tend to treat invalid response patterns as patterns of good quality.

Next, the validity of the second group's response patterns was evaluated. In contrast to the first group, the validity of the response patterns in the second group was unknown before the analysis. This group contained only correct responses to the items describing appetite and body weight symptoms. It could, however, still contain invalid patterns caused by incorrect responses to other items. Figure 2 shows histograms of the distribution of response patterns by quality, according to the person fit statistic criteria, for the second group.

Figure 2. Distribution of responses from the second group by quality according to the indices of the person fit statistic

According to the histograms in Figure 2, almost all responses from the second group were valid by the criteria of the standardized fit statistics. The non-standardized WMS and UMS criteria evaluated 74%-77% of responses as valid. In contrast to the first group, a small portion of the response patterns in the second group were considered excessively predictable, which was to be expected. Excessively predictable responses, which correspond to low fit statistics, cannot be interpreted unambiguously as invalid or valid. Response patterns with a high degree of predictability have very low variance in item response scores. Such patterns often have the same score for most or all answers. Response patterns consisting entirely of identical scores are most likely invalid and should be excluded from the study's results. However, it is difficult to conclude that predictable, low-dispersion patterns with non-identical responses are invalid. Because the purpose of the study was not to analyze excessively predictable responses, we did not treat them as invalid. All the invalid responses identified by the ZstdWMS and UMS were also identified by the WMS. Additionally, the WMS fit index revealed 10% more invalid responses than its standardized version and 6% more than the UMS. By the ZstdUMS criteria, none of the responses in the second group were invalid. Figure 3 shows examples of response patterns that were invalid according to the WMS criteria but valid according to the other fit statistics.

Figure 3. Patterns of responses that were assessed as invalid by the WMS and as valid by the other person fit statistics 

It is important to note that all the patterns of the second group had zero responses to the items "increased appetite" and "increased weight". Rasch analysis generally treated zero responses to these items as correct because they were accompanied by high scores for "decreased appetite" and "decreased weight". The patterns in Figure 3 contain some responses with high scores and some with zero or low scores. As mentioned above, zero responses to the items indicating an increase in appetite or weight were recorded in the reverse order of categories. As a result, such zero responses were not evaluated as incorrect during Rasch analysis. Zero answers to other items, by contrast, were evaluated by the Rasch system as zero scores. Thus, the patterns shown in Figure 3 are irregular and contain contradictory answers: high severity of some depression symptoms is accompanied by low severity or absence of others. Because such combinations of symptoms rarely occur in reality, the patterns in Figure 3 should be considered invalid. Therefore, the weighted mean square was better than the other fit statistics at identifying invalid patterns in the second group.

Two types of response patterns on the QIDS-SR16 items were evaluated in this study. The first type consisted of invalid patterns with two clearly incorrect responses. The unweighted mean square index (UMS) detected the low quality of these patterns considerably better than the other indices. The UMS criteria identified 94% of the invalid patterns, while the remaining fit statistics identified about half as many. In the second group, patterns presumably did not contain incorrect response sequences, but their validity was largely unknown. It turned out that this group did contain invalid patterns, which were best detected by the weighted mean square criteria. In our opinion, the advantage of the UMS in the first group of patterns is due to its sensitivity to large residuals between expected and observed responses when the variance of the expected response is relatively low. When the variance of the expected response is low, the Rasch model's prediction is highly accurate, which means that the modeled response probability agrees with the observed response in most cases. Consequently, Rasch analysis evaluates response patterns that lack this agreement as grossly inconsistent with the model's expectations. In other words, response patterns exhibiting a sharp discrepancy between the expected and observed responses can be interpreted as outliers. All response patterns in the first group, with their unexpected discrepancy in responses to the items describing changes in appetite and body weight in depression, were among these outliers. For this reason, almost all of these response patterns were identified as invalid by the unweighted mean square index. The weighted mean square index (WMS) is a fit statistic that is weighted by the variance of the expected response. Outliers in single responses will not significantly affect WMS values if the residuals for the other responses in the pattern are small and the variance of the expected responses is high enough.
Because the Rasch measurement system uses the same fit statistics to assess the quality of both respondents' responses and questionnaire items, the weighted mean square was developed first and foremost to reduce the impact of outliers on the assessment of item quality. Sharply unexpected responses (i.e., outliers) are caused not so much by the quality of items as by respondents' unexpected behavior. As a result, 40% of the patterns with outliers in the first group were not recognized by the WMS. It is likely that these patterns contained outliers only on the two items concerning appetite and body weight and relatively predictable answers on the other items. Although outliers may not have a significant impact on item quality, it is essential to exclude them when selecting valid response patterns for further analysis. In total, 3.6% of the data sample consisted of patterns with outliers in responses to two items on appetite and body weight. The whole sample, however, may contain a substantial number of patterns with outliers in responses to other items. It was not possible to analyze these patterns in this study.
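The different sensitivity of the two indices to a single outlier can be reproduced with a toy calculation. The sketch below assumes dichotomous items and hypothetical difficulty values rather than the polytomous QIDS-SR16 data: a person at 0 logits answers eight moderately targeted items as expected and gives one sharply unexpected response to a very easy item.

```python
import math


def rasch_probability(theta, difficulty):
    """Dichotomous Rasch model: probability of endorsing the item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def person_fit(responses, theta, difficulties):
    """Return (UMS, WMS): unweighted and variance-weighted mean squares."""
    probs = [rasch_probability(theta, b) for b in difficulties]
    variances = [p * (1.0 - p) for p in probs]
    squared = [(x - p) ** 2 for x, p in zip(responses, probs)]
    ums = sum(s / w for s, w in zip(squared, variances)) / len(responses)
    wms = sum(squared) / sum(variances)
    return ums, wms


# One very easy item (-3 logits) left unendorsed: the outlier.
# Eight items at -1 / +1 logits answered as the model expects.
difficulties = [-3.0] + [-1.0] * 4 + [1.0] * 4
responses = [0] + [1] * 4 + [0] * 4
ums, wms = person_fit(responses, theta=0.0, difficulties=difficulties)
```

With these assumed values the UMS is about 2.6, above the 1.5 threshold, while the WMS is about 0.9 and falls inside the valid range, because the outlier occurs on an item with low expected-response variance and therefore receives little weight.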

In the second group of patterns, responses describing disturbances in appetite and body weight did not include outliers. As a result, the WMS and UMS criteria identified almost the same number of invalid patterns. The weighted index nevertheless identified a few more invalid patterns, because it is more sensitive to systematic irregular responses that are not outliers.

According to the study results, the standardized versions of the unweighted and weighted fit statistics were ineffective at identifying invalid responses. The standardized person fit statistics overestimated the validity of the patterns in both the first and the second group. We believe that this is due to the peculiarities of transforming unweighted and weighted fit statistics into standardized Z-scores using the Wilson-Hilferty cube-root transformation.
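To show how the two kinds of criteria can disagree, the sketch below applies the commonly cited form of the Wilson-Hilferty transformation, Z = (MS^(1/3) - 1)(3/q) + q/3, where q is the model standard deviation of the mean square. The value of q is an assumed, illustrative number; in practice it is derived from the model variances of the responses, and it tends to be larger for short response strings.

```python
def wilson_hilferty(mean_square, q):
    """Standardize a mean square fit statistic.

    mean_square -- UMS or WMS value (positive)
    q           -- model standard deviation of the mean square (assumed input)
    """
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0


# Illustrative values only: a mean square just above the 1.5 threshold,
# with an assumed model standard deviation of 0.6.
z = wilson_hilferty(1.6, q=0.6)
```

Under these assumptions a mean square of 1.6 is converted to a Z-value of about 1.05, well inside the -2.0 to 2.0 range, so the same pattern would count as invalid by the mean square criterion and as valid by the standardized one.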


The results of this study suggest that the best way to evaluate respondents' response patterns in massive data samples obtained from mobile psychological assessments is to use the non-standardized Rasch person fit statistics: the unweighted (UMS) and weighted (WMS) mean squares. The UMS criteria better identify patterns with sharply mismatched responses. The WMS effectively evaluates patterns that do not have single outliers. The validity of respondents' response patterns can be assessed using a threshold value of 1.5, which corresponds to the upper limit of valid WMS and UMS values. Any pattern with person fit statistics greater than 1.5 should not be considered suitable for further analysis or for drawing scientifically grounded conclusions. Because they overestimate validity, standardized fit statistics (ZstdUMS, ZstdWMS) cannot be used to verify respondents' response patterns.
