We describe a system built using Python and Apache Spark to measure the effectiveness of different query configurations of an Apache Solr search platform, using click logs for a reference query set of 80,000+ user queries. The system replays the click logs against the engine to compute the Average Click Rank (ACR) metric as a proxy for user satisfaction, providing a way to identify quality improvements without a production deployment and ensuring that only improved configurations are submitted to a slow and expensive A/B testing process.
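As a rough illustration of the replay step, a sketch along the following lines could issue a logged query to Solr's /select handler with a candidate set of query parameters and return the position of the document the user actually clicked; the endpoint URL, collection name, and parameter overrides here are hypothetical, not the production configuration.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/docs/select"  # hypothetical collection


def click_rank(query, clicked_doc_id, config_params, rows=100):
    """Replay a logged query against Solr and return the 1-based rank of the
    document the user clicked, or None if it no longer appears in the top rows."""
    params = {"q": query, "rows": rows, "fl": "id", "wt": "json"}
    params.update(config_params)  # e.g. overrides for defType, qf, pf, bq, ...
    docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
    for rank, doc in enumerate(docs, start=1):
        if doc["id"] == clicked_doc_id:
            return rank
    return None


# Example: rank of the clicked document under a candidate edismax configuration.
rank = click_rank("solar panels", "doc-123", {"defType": "edismax", "qf": "title^2 body"})
```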
For each search engine configuration, the ACR is recomputed by replaying the query logs against it and finding the position (or click rank) of the user's selected document. The ACR is computed by averaging those positions across all user queries in the query log. A lower click rank indicates a better engine configuration for that query, since it implies the user found what they were looking for nearer the top of the results. Similarly, a low ACR across all queries is an indicator of a good search engine configuration as a whole.
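A minimal PySpark sketch of the aggregation step, assuming the per-query click ranks from the replay have already been collected into a DataFrame (the column names and sample rows are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("acr-replay").getOrCreate()

# Hypothetical replay output: one row per (query, clicked document) with the
# rank at which the clicked document appeared under the candidate configuration.
ranks = spark.createDataFrame(
    [("solar panels", "doc-123", 2),
     ("inverter specs", "doc-456", 7),
     ("battery warranty", "doc-789", 1)],
    ["query", "clicked_doc_id", "click_rank"],
)

# ACR for the configuration: the mean click rank over every query in the
# reference set; lower is better.
acr = ranks.agg(F.avg("click_rank").alias("acr")).first()["acr"]
print(f"ACR = {acr:.2f}")
```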
This system has also been used to analyze user behavior, by partitioning the results across content types, response times, and so on, and analyzing differences in click rank distribution. It has also been used to identify and investigate slow queries, resulting in performance improvements that have also benefited the search application.
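One way such slicing could be expressed in PySpark, assuming the click-rank records are enriched with a content type and the Solr query time for each request (the field names, sample rows, and bucket size are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("acr-slices").getOrCreate()

# Hypothetical click-rank records enriched with content type and Solr QTime (ms).
ranks = spark.createDataFrame(
    [("solar panels", "article", 2, 45),
     ("inverter specs", "datasheet", 7, 230),
     ("battery warranty", "article", 1, 60)],
    ["query", "content_type", "click_rank", "qtime_ms"],
)

# Compare click-rank distributions across content types and latency buckets;
# a slice with a high average or long-tailed click rank flags queries to investigate.
summary = (
    ranks
    .withColumn("latency_bucket_ms", F.floor(F.col("qtime_ms") / 100) * 100)
    .groupBy("content_type", "latency_bucket_ms")
    .agg(
        F.count("*").alias("queries"),
        F.avg("click_rank").alias("avg_click_rank"),
        F.expr("percentile_approx(click_rank, 0.9)").alias("p90_click_rank"),
    )
    .orderBy("content_type", "latency_bucket_ms")
)
summary.show()
```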