Whether modeling a brand’s performance, forecasting key changes to enhance brand revenues, or creating clusters of customers with varying needs and wants from a brand, a train-test methodology can help fine-tune brand building models.
While circumstances are not always conducive to applying a train-test methodology, researchers should recognize, plan for, and seize the opportunity when it presents itself.
The application detailed below is straightforward, and confirming brand models on respondents who were not used to build them demonstrates their robustness and builds confidence in the future use of a brand measurement tool.
Take advantage of train-test methodology when applicable
It is not always possible in survey research to obtain a sample large enough to apply a train-test methodology. However, it is an important step in verifying clustering algorithms, regression models, classification tree predictions, and the like.
It may not always make sense to split into train and test files. For instance, if the data set is small, creating even smaller fractions just makes both sets less robust.
When we do have the luxury of a larger data file that allows us to take advantage of splitting into train and test, it’s important to know the benefits and best practices.
Drawbacks of building a model from single source data
A model built using only a single data set might pick up random effects and overfit to a unique pattern in that particular data, which may not generalize to future unseen data. Therefore, when a good estimate of model performance is critical, train-test is most appropriate.
Benefits of train-test for a good estimate of model performance
A train-test split is beneficial in estimating the performance of machine-learning algorithms. A model is developed based on observations in a train data set, and then that model is applied to the test data to determine how well it predicts on “unknown” data.
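As an illustration of the idea, here is a minimal Python sketch (standard library only). The data are synthetic and purely hypothetical: an "ad exposure" score driving a "brand rating" plus noise. The model is a simple least-squares line, chosen only to keep the example short. The point is the workflow: fit on train, then score on test.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_abs_error(model, xs, ys):
    """Average absolute gap between observed and predicted values."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)  # fixed seed so the example is reproducible

# Hypothetical respondents: (ad_exposure, brand_rating)
rows = []
for _ in range(500):
    exposure = rng.uniform(0, 10)
    rating = 2.0 + 0.5 * exposure + rng.gauss(0, 1)
    rows.append((exposure, rating))

# 70% train, 30% test
rng.shuffle(rows)
cut = round(0.7 * len(rows))
train, test = rows[:cut], rows[cut:]

train_x = [x for x, _ in train]
train_y = [y for _, y in train]
test_x = [x for x, _ in test]
test_y = [y for _, y in test]

# Build the model on train only, then evaluate it on both sets
model = fit_line(train_x, train_y)
train_mae = mean_abs_error(model, train_x, train_y)
test_mae = mean_abs_error(model, test_x, test_y)

print(f"train MAE: {train_mae:.3f}")
print(f"test MAE:  {test_mae:.3f}")
```

A test error close to the train error suggests the model generalizes; a much larger test error is a sign of overfitting to the train data.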
Where appropriate, perform any general cleaning on the overall data file before splitting, such as removing straight-liners, rectifying quality issues, and dropping problem respondents, so it only needs to be done once.
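A minimal Python sketch of one such cleaning step, run before any split. The field name `grid` and the rule used (every item in a rating grid answered identically) are assumptions for illustration; real quality checks are usually more nuanced.

```python
def is_straight_liner(answers):
    """True when every item in a rating grid received the same answer."""
    return len(answers) > 1 and len(set(answers)) == 1

# Hypothetical respondents with a five-item rating grid
respondents = [
    {"id": 1, "grid": [4, 4, 4, 4, 4]},
    {"id": 2, "grid": [5, 3, 4, 2, 4]},
    {"id": 3, "grid": [1, 1, 1, 1, 1]},
    {"id": 4, "grid": [3, 4, 3, 5, 2]},
]

# Clean the full file once; the train-test split happens afterwards
cleaned = [r for r in respondents if not is_straight_liner(r["grid"])]
print([r["id"] for r in cleaned])  # [2, 4]
```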
What’s the right amount of train versus test data?
Next, determine what percentage of your data should go into the train and test data sets. The majority of the data should go to the train set and the remainder to the test set. There is no fixed rule for the proportions; typically 67-80% is allotted to the train data and the remainder to test. Considerations include computational resources, the number of observations, and the representativeness of each data set.
If the project is global and includes more than one country, we recommend splitting into train and test files at the country level and then aggregating all train files and all test files into global train and test data sets.
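To make the proportions concrete, here is a small Python sketch of a simple random split at a few train fractions. The function name `split_train_test`, the seed value, and the respondent IDs are hypothetical.

```python
import random

def split_train_test(rows, train_frac=0.7, seed=2024):
    """Randomly split rows into (train, test) without modifying the input."""
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file of 1,000 respondent IDs
respondents = list(range(1, 1001))

for frac in (0.67, 0.70, 0.80):
    train, test = split_train_test(respondents, frac)
    print(f"{frac:.0%} train: {len(train)} train rows, {len(test)} test rows")
```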
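The country-level approach can be sketched in Python as follows. The country codes, sample sizes, and the function name `split_by_country` are hypothetical; the sketch splits within each country and then pools the pieces.

```python
import random
from collections import defaultdict

def split_by_country(rows, train_frac=0.7, seed=2024):
    """Split within each country, then aggregate into global train and test."""
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for row in rows:
        by_country[row["country"]].append(row)

    global_train, global_test = [], []
    for country in sorted(by_country):  # sorted for a reproducible order
        group = by_country[country]
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        global_train.extend(group[:cut])
        global_test.extend(group[cut:])
    return global_train, global_test

# Hypothetical global file with unequal country sample sizes
sizes = (("US", 1000), ("UK", 500), ("DE", 200))
rows = [{"id": f"{c}-{i}", "country": c} for c, n in sizes for i in range(n)]

train, test = split_by_country(rows)
print(len(train), len(test))  # 1190 510
```

Because the split happens inside each country, every country contributes 70% of its own respondents to train, regardless of how large or small its sample is.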
Use stratification to ensure balance across key variables
Often, the data is split on a random basis; however, there is no guarantee that key variables such as demographics are balanced in the two resulting data sets. If one of the data sets happens to contain more younger than older respondents, or an imbalance of gender or other key variables, results may be affected.
To ensure each train and test file is representative of the original sample, Big Village recommends applying the split with a stratified sampling technique so that the files match across key variables.
Simple R syntax for stratification on key variables
R syntax is readily available to ensure a balanced split. In the following example, the data file is split 70% train, 30% test, and key variables of age, gender, and region are stratified. Setting a seed guarantees consistent files will be generated if you need to re-run.
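The R syntax itself does not appear in this text. As a stand-in, the sketch below shows the same idea in plain Python (standard library only), not the original listing: a 70/30 split stratified on age, gender, and region, with a fixed seed. The function name, seed value, category labels, and respondent data are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

def stratified_split(rows, keys, train_frac=0.7, seed=123):
    """Split rows so that each combination of `keys` is divided in the same ratio."""
    rng = random.Random(seed)  # fixed seed: re-running reproduces the same files
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in keys)].append(row)

    train, test = [], []
    for stratum in sorted(strata):  # sorted for a reproducible order
        group = strata[stratum]
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical respondents: 50 per age x gender x region cell
ages = ["18-34", "35-54", "55+"]
genders = ["F", "M"]
regions = ["North", "South", "East", "West"]
respondents = [
    {"id": f"{a}|{g}|{r}|{i}", "age": a, "gender": g, "region": r}
    for a, g, r in product(ages, genders, regions)
    for i in range(50)
]

train, test = stratified_split(respondents, keys=("age", "gender", "region"))
print(len(train), len(test))  # 840 360
```

One caveat with this approach: when a cell contains only a handful of respondents, rounding can leave it with no test cases, so very fine stratification may need coarser categories.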
Once files are split into train and test, compare to verify key variables are balanced.
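A balance check of this kind can be sketched in Python by comparing the share of each category in the two files. The data below are hypothetical and deliberately imbalanced on age, and the 5-point tolerance is an arbitrary assumption, not a recommended standard.

```python
from collections import Counter

def shares(rows, key):
    """Proportion of rows falling in each level of `key`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

def balance_report(train, test, keys):
    """For each key and level, return (train share, test share)."""
    report = {}
    for key in keys:
        p_train, p_test = shares(train, key), shares(test, key)
        levels = sorted(set(p_train) | set(p_test))
        report[key] = {
            lvl: (p_train.get(lvl, 0.0), p_test.get(lvl, 0.0)) for lvl in levels
        }
    return report

def make_rows(age, gender, n):
    return [{"age": age, "gender": gender} for _ in range(n)]

# Hypothetical files: the test set over-represents the 18-34 group
train = (make_rows("18-34", "F", 20) + make_rows("18-34", "M", 15)
         + make_rows("35+", "F", 15) + make_rows("35+", "M", 20))
test = (make_rows("18-34", "F", 10) + make_rows("18-34", "M", 8)
        + make_rows("35+", "F", 5) + make_rows("35+", "M", 7))

report = balance_report(train, test, keys=("age", "gender"))
TOLERANCE = 0.05  # assumed threshold for flagging a gap

for key, levels in report.items():
    for level, (p_train, p_test) in levels.items():
        flag = "  <-- check" if abs(p_train - p_test) > TOLERANCE else ""
        print(f"{key}={level}: train {p_train:.1%}, test {p_test:.1%}{flag}")
```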
Models are built using the train data set. The test data is then run through the same model, with the same parameters, to confirm it on a new set of respondents that were not used in model building.
A recent train-test application at Big Village
Big Village most recently applied this methodology to segment a brand’s customers, and potential customers, into groups based on similar needs from brands in the category. To ensure the recommended clusters would be highly reproducible, we applied the train-test methodology. Numerous train-test data files were split 70%/30%, with demographic stratification at the country level, then aggregated into total train and total test data.
Due to a large number of total respondents, splitting the data files was also beneficial in reducing the run time of algorithms.
The train set was used to develop initial cluster solutions utilizing various algorithms such as k-means, bicluster, and ensemble methods. For key proposed segments, the same algorithms were applied to the remaining test respondents to ensure the same clusters were found. We were able to demonstrate with high accuracy that the clusters found with the train data were consistently found with the test data, so the client was highly confident about targeting key segment groups.
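One way to check that clusters reproduce can be sketched in Python. This is an illustration under stated assumptions, not the procedure used in the project. It uses only k-means (not bicluster or ensemble methods), synthetic two-dimensional "needs" scores for three hypothetical segments, and a farthest-point initialization chosen to keep the example deterministic. It clusters train and test separately, matches the centroids, and measures how often the two solutions agree on test respondents.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with farthest-point initialization (a simplification)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids

# Hypothetical respondents drawn around three segment profiles
rng = random.Random(7)
true_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
respondents = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in true_centers
    for _ in range(200)
]
rng.shuffle(respondents)
cut = round(0.7 * len(respondents))
train, test = respondents[:cut], respondents[cut:]

# Run the same algorithm independently on each file
train_centroids = kmeans(train, k=3)
test_centroids = kmeans(test, k=3)

# Match each train cluster to its closest test cluster
mapping = [nearest(c, test_centroids) for c in train_centroids]
max_shift = max(
    math.dist(c, test_centroids[j]) for c, j in zip(train_centroids, mapping)
)

# Share of test respondents placed in the same segment by both solutions
agreement = sum(
    mapping[nearest(p, train_centroids)] == nearest(p, test_centroids)
    for p in test
) / len(test)

print(f"largest centroid shift: {max_shift:.3f}")
print(f"assignment agreement on test: {agreement:.1%}")
```

Small centroid shifts and high agreement indicate that the segments found in the train data reappear in the test data. With real survey data the segments are rarely this well separated, so lower agreement should be expected.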
Creating a model to build brand strength with train data and applying the results to a test data set can be very powerful in demonstrating to clients the robustness of the model’s application on future unknown data. So, when possible, Big Village recommends taking advantage of every opportunity to apply a train-test methodology.
Written by Sheilah Wagner, Director, Data Sciences, at Big Village Insights.