Data Quality in Survey-Based Research and Time Series for Data Verification

The best online survey-based research requires high-quality data. A comprehensive approach is needed to detect and avoid fraud, flag and remove inattentive and unengaged respondents, and minimize biases and noise in the data. The data should be representative of the population and relevant to the research objectives.

Big Village constantly monitors and improves data quality at every stage of research. Our data quality portfolio consists of a wide range of tools and methods, including automated data quality tools, quality flags, and Advanced Analytics methods. We recommend using time series analysis to verify data in long-running tracker studies.

High-quality survey data provides strong support for your business decision-making, but research based on low-quality data can lead to erroneous conclusions and inaccurate recommendations. The Alation State of Data Culture survey, conducted in 2021 in the US and Europe among 300 data and analytics leaders, found that 39% of respondents believed that data quality was the top challenge in using data to drive business value at their companies. Big Village is constantly evaluating and improving ways to protect survey data integrity.

The process starts with the optimal survey design.

We make sure that our surveys are clear, engaging, relevant, and not too lengthy. We incorporate multiple questions to uncover poor respondents. The set of recommended quality flags includes open ends, red herrings, low-incidence questions, and attentiveness and consistency checks. It is important to use multiple questions to assess the quality of responses, since a single incident or concern might not be representative and would not be sufficient to exclude a respondent from a dataset. We also make sure that data quality questions are related to the survey content, do not disrupt respondents’ perception of the survey, and don’t unnecessarily extend the survey’s length.
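The multiple-flag rule described above can be sketched in a few lines of Python. This is only an illustration: the flag names, the respondent IDs, and the threshold of two flags are assumptions made for the example, not Big Village's actual rules.

```python
from typing import Dict, List

# Hypothetical flag names, loosely based on the checks listed above.
FLAG_NAMES = ["open_end", "red_herring", "low_incidence", "attentiveness", "consistency"]


def count_flags(flags: Dict[str, bool]) -> int:
    """Count how many known quality flags a respondent has raised."""
    return sum(1 for name in FLAG_NAMES if flags.get(name, False))


def respondents_to_review(
    responses: Dict[str, Dict[str, bool]], min_flags: int = 2
) -> List[str]:
    """Return IDs of respondents who raised at least `min_flags` flags.

    A single flag is not treated as sufficient; `min_flags=2` is an assumed
    threshold chosen only for illustration.
    """
    return sorted(
        rid for rid, flags in responses.items() if count_flags(flags) >= min_flags
    )


# Made-up example data: r002 raises two flags, the others raise one or none.
example = {
    "r001": {"red_herring": True},
    "r002": {"red_herring": True, "consistency": True},
    "r003": {},
}
candidates = respondents_to_review(example)
```

Respondents returned by such a rule would be candidates for review rather than automatic removal, since the final decision can still be made manually.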

In addition to that, Big Village is actively using automated data-quality tools to validate respondents’ IP addresses, verify data accuracy, automate review processes, and target fraudsters. Modern technology for automated data quality checks helps to fight bots and survey farms.

Pre-survey checks are necessary to eliminate obvious fraud, but we continue to monitor data quality while respondents are going through questionnaires. Some fraudulent responses can be detected only after survey entry. Lastly, in some cases, respondents must be removed manually after survey completion if they raise multiple data quality flags. Data quality monitoring throughout fielding helps to make sure the sample composition is not negatively affected.

Advanced analytical techniques are often more sensitive to data quality than traditional research approaches.

Noise in the data can be misinterpreted as real information, then fitted and even amplified in modeling. This could result in misleading conclusions and recommendations. No data is perfect, so to avoid problems it is very important to choose analytical approaches that are robust to realistic levels of data quality. In addition, researchers need to pay special attention to sample size, sample design, and survey design when collecting data for advanced statistical modeling.

Advanced analytics and data science do not just make demands on data quality. They can also contribute to improving and validating survey data. For these purposes, we apply a wide range of approaches, from traditional statistical techniques such as imputation and outlier detection to the latest advances in Machine Learning and Artificial Intelligence.

At Big Village, we use time series analysis to validate data quality in long-running tracker studies.

Time series analysis is a way of studying a sequence of data points collected at consistent intervals over a set period. The analysis uncovers the underlying structure of the data and shows how variables are changing, presenting a dynamic view on the data.

Let us consider a set of key metrics (such as unaided or aided awareness, consideration, sales, etc.) in a large brand tracker collected monthly over a few years. A standard way to approach this data would be to look at the current values and perhaps compare them with the values from last month or last year. Time series analysis treats the dataset collected over the years as a whole and uses all the accumulated information for conclusions and predictions. This analysis helps to uncover the underlying forces and structure that shape the data and allows us to proceed to forecasting, monitoring, or even control.

One of the descriptive time series analysis methods is decomposition.

It breaks a time series into three basic components: trend, seasonality, and a random component, or residual. Trend is the overall movement of a series over a long period of time; seasonality is a regular and predictable change that recurs every year; and the residual is the variation that remains once the trend and seasonality are excluded. Decomposition provides important information about any data organized as a time series. It is widely used in many industries and marketing tools, including Google Analytics. For data quality monitoring purposes, we suggest examining the residual of a time series, which should be expected to stay within a certain range depending on the nature of the series. In our time series tests, the range for key metrics has not exceeded +/-4%.
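As an illustration of the residual check, here is a minimal sketch of a classical additive decomposition in plain Python. It assumes monthly data (a period of 12), an additive model, and a metric expressed in percentage points, so the 4.0 tolerance is read in those units; the function names are hypothetical and this is not necessarily the procedure used in the tests mentioned above.

```python
from typing import List, Optional, Tuple


def decompose_additive(
    series: List[float], period: int = 12
) -> Tuple[List[Optional[float]], List[float], List[Optional[float]]]:
    """Classical additive decomposition: series = trend + seasonal + residual."""
    n = len(series)
    half = period // 2
    if period % 2 != 0 or n < 2 * period:
        raise ValueError("this sketch expects an even period and at least two full cycles")

    # Trend: centered moving average over one full cycle,
    # with half weights on the two end points of the window.
    trend: List[Optional[float]] = [None] * n
    for t in range(half, n - half):
        inner = series[t - half + 1 : t + half]
        total = 0.5 * series[t - half] + sum(inner) + 0.5 * series[t + half]
        trend[t] = total / period

    # Seasonality: average detrended value for each position in the cycle,
    # shifted so the seasonal effects sum to zero over one cycle.
    buckets: List[List[float]] = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period
    seasonal = [raw[t % period] - offset for t in range(n)]

    # Residual: what is left after removing trend and seasonality.
    # It is undefined (None) where the moving average has no full window.
    residual: List[Optional[float]] = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


def flag_residuals(residual: List[Optional[float]], tolerance: float = 4.0) -> List[int]:
    """Return the positions whose residual falls outside +/- tolerance."""
    return [t for t, r in enumerate(residual) if r is not None and abs(r) > tolerance]
```

Calling `decompose_additive` on a list of monthly values and passing the residual to `flag_residuals` returns the months that would merit a closer look. Because the centered moving average needs half a cycle on each side, the first and last six months have no residual in this sketch, so the most recent waves are better covered by a forecast-based check.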

One of the most popular applications for time series is predictive analysis.

Time series forecasting uses modeling to predict future values in a time series and to support strategic decisions. There are many ways to build predictive models for time series; autoregressions, moving averages, ARIMA, and Exponential Smoothing are some of them. For data verification purposes, we use Exponential Smoothing to predict the value of a key metric in a tracker and compare it with the actual value for the current period. If the key metrics’ current values are not far from the corresponding predictions made using the historic data, we conclude that the current measurements are consistent with the whole time series and the data is validated. Time series built from long-running tracker data in many categories showed a high level of predictability: the forecast error does not exceed 2%-3% and often stays well below that.
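The forecast-versus-actual comparison can be sketched as follows. The article does not say which Exponential Smoothing variant is used, so this example assumes the simplest one (level only, no trend or seasonal terms). The smoothing weight `alpha=0.3`, the 3% threshold read as a relative error, and the function names are likewise assumptions for illustration.

```python
from typing import List, Tuple


def ses_forecast(history: List[float], alpha: float = 0.3) -> float:
    """One-step-ahead forecast from simple exponential smoothing.

    The smoothed level is updated as alpha * observation + (1 - alpha) * level,
    and the final level serves as the forecast for the next period.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def validate_wave(
    history: List[float],
    actual: float,
    alpha: float = 0.3,
    max_error_pct: float = 3.0,
) -> Tuple[float, float, bool]:
    """Compare the newest measurement with the forecast built from past waves.

    Returns (forecast, error in percent of the actual value, is_consistent).
    Assumes `actual` is non-zero.
    """
    forecast = ses_forecast(history, alpha)
    error_pct = abs(actual - forecast) / abs(actual) * 100
    return forecast, error_pct, error_pct <= max_error_pct
```

For example, `validate_wave(past_awareness, current_awareness)` would return the forecast, the percentage error, and whether the new wave is consistent with the history, where both arguments are hypothetical names for the tracker's past and current values. A tracker with a clear trend or seasonal pattern would likely need a variant of exponential smoothing that models those components.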

Data quality is essential for online survey-based research. Strong data helps researchers and clients back up important decisions, evaluate the success of brands and products, and grow revenue. Big Village constantly monitors data quality and uses an extensive toolkit to improve it. Ensure your data quality with us today.

Written by Faina Shmulyian, VP Insights at Big Village
