Summary
Using human labels in an open e-commerce search dataset as a baseline, I measure an LLM's preference between products and check whether it matches the human raters. If it does, I can use my laptop as the search relevance judge, guiding search quality tuning and iteration without an expensive OpenAI bill.
For each product, I crafted prompts from four product attributes: product name, taxonomic categorization, product description, and product name plus product description. Each prompt tests the LLM's ability to choose between two products using only that attribute.
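To make that concrete, here's a minimal sketch of how such per-attribute prompts might be built. The field names and prompt wording are my assumptions for illustration, not the exact prompts from the experiments:

```python
# Sketch: one prompt per attribute, asking the LLM to pick between two
# products using only that attribute. Field names here are assumptions.
FIELDS = {
    "name": lambda p: p["name"],
    "category": lambda p: p["category"],  # taxonomic categorization
    "description": lambda p: p["description"],
    "name_and_description": lambda p: p["name"] + "\n" + p["description"],
}

def build_prompt(query, lhs, rhs, field):
    """Ask which product (LHS or RHS) better matches the query."""
    extract = FIELDS[field]
    return (
        f"Which product better matches the query '{query}'?\n"
        f"LHS: {extract(lhs)}\n"
        f"RHS: {extract(rhs)}\n"
        f"Answer with LHS or RHS."
    )
```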
The experiments permute three settings: which fields go into the prompt, whether the LLM double-checks its answer, and whether "Neither" is an allowed response. I've run every permutation, with results below.
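As a sketch, that experiment grid could be enumerated like this; `run_experiment` is a hypothetical stand-in for whatever runs one evaluation pass over the dataset:

```python
import itertools

def run_experiment(field, double_check, allow_neither):
    """Hypothetical stand-in for one evaluation pass over the dataset."""
    print(f"running: field={field} "
          f"double_check={double_check} allow_neither={allow_neither}")

fields = ["name", "category", "description", "name_and_description"]

# Every permutation of prompt field, double-checking, and allowing "Neither".
for settings in itertools.product(fields, [True, False], [True, False]):
    run_experiment(*settings)
```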
For the query "entrance table", comparing "alea coffee table" (LHS) against "marta coffee table" (RHS), the attribute-specific agents vote: Neither, LHS, LHS, RHS, LHS, RHS, RHS, LHS.

Whichever side gets the most votes from these little attribute-specific agents wins! If they mostly point to LHS, then just choose LHS. If they mainly point to RHS, just choose RHS. OR… OR… Or… and hear me out.
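The naive ensemble is just a majority vote. A sketch, using the votes from the example above:

```python
from collections import Counter

def majority_vote(votes):
    """Return whichever answer the attribute-specific agents cast most often."""
    return Counter(votes).most_common(1)[0][0]

votes = ["Neither", "LHS", "LHS", "RHS", "LHS", "RHS", "RHS", "LHS"]
print(majority_vote(votes))  # -> LHS
```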
We have a set of features: the individual predictions of each field. We have a label we want to predict: the human preference. The table above becomes a classification problem! We can "learn" the right ensemble.
The training script builds out our training features; each agent's preference becomes one feature. As a basic spike, I use a simple scikit-learn decision tree to try to predict the human preference.
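A minimal sketch of that spike, assuming one row per judged product pair with the agents' votes as columns (the column names and rows here are illustrative, not real results):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative rows: each agent's vote per field, plus the human label.
rows = pd.DataFrame([
    {"name": "LHS", "category": "LHS", "description": "RHS",
     "name_and_description": "LHS", "human": "LHS"},
    {"name": "RHS", "category": "RHS", "description": "Neither",
     "name_and_description": "RHS", "human": "RHS"},
])

# One-hot encode the categorical LHS/RHS/Neither votes into features.
X = pd.get_dummies(rows.drop(columns=["human"]))
y = rows["human"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
```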
Trees allow us to see the dependencies between features when predicting relevance. We can use this as an exploratory tool to guide search solutions. According to this tree, any search solution might want to focus on category first before considering name.
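To inspect which fields the tree splits on first, `sklearn.tree.export_text` prints the learned rules; continuing the sketch above:

```python
from sklearn.tree import export_text

# Print the learned splits; the topmost split is the field the tree
# found most informative (e.g. category before name).
print(export_text(tree, feature_names=list(X.columns)))
```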
