Classifying unsafe ad content is a problem space where large language models (LLMs) show great promise. Because identifying policy-violating content requires deep contextual and cultural understanding, LLMs hold an advantage over traditional machine learning systems. But fine-tuning LLMs for such complex tasks requires high-fidelity training data that is difficult and expensive to curate at the necessary quality and scale. Standard data-intensive training approaches are costly, especially given the need to handle concept drift as safety policies evolve or new types of unsafe ad content emerge; in the worst case, the model must be retrained on an entirely new dataset. Reducing the amount of training data needed is therefore paramount.
With this in mind, we present a novel, scalable curation process for active learning that can dramatically reduce the amount of training data required to fine-tune LLMs while significantly improving model alignment with human experts. The process scales to datasets of hundreds of billions of examples: it iteratively identifies the examples for which expert annotation would be most valuable and then uses the resulting expert labels for fine-tuning.
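To make the shape of that loop concrete, here is a minimal structural sketch in Python. Every callable is an injected placeholder for an internal component (pool labeling, example selection, expert review, evaluation, and fine-tuning), so the names and signatures are illustrative assumptions rather than our production interface; the selection step itself is described in detail below.

```python
# Structural sketch of the active learning loop described above; every callable is a
# placeholder for an internal component, so names and signatures are assumptions only.
def curation_loop(model, pool, label_fn, select_fn, expert_review_fn,
                  split_fn, evaluate_fn, fine_tune_fn, review_budget, rounds):
    for _ in range(rounds):
        model_labels = label_fn(model, pool)                       # label the large pool
        candidates = select_fn(pool, model_labels, review_budget)  # most valuable examples
        expert_labels = expert_review_fn(candidates)               # human expert labels
        eval_set, finetune_set = split_fn(expert_labels)           # random split
        evaluate_fn(model, eval_set)                               # track alignment metrics
        model = fine_tune_fn(model, finetune_set)                  # next model iteration
    return model
```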
In our experiments, we were able to reduce the amount of training data needed from 100,000 examples to under 500 while increasing model alignment with human experts by up to 65%. Production systems using larger models have seen even greater reductions, using up to four orders of magnitude less data while maintaining or improving quality.
Curation process
Our process starts with a zero- or few-shot initial model (LLM-0), which we provide with a prompt describing the content of interest, e.g., defining clickbait and asking "Is this ad clickbait?" LLM-0 then labels a large set of ads as either benign (blue in the figure below) or clickbait (orange), producing the initial labeled dataset shown in (1). This initial dataset is typically highly imbalanced, since in production traffic only very few (<1%) ads are actually clickbait, and the LLM's true positive rate is also low because it has not yet been fine-tuned. We then cluster the examples labeled "clickbait" and "benign" separately to find the most instructive examples. Some of these clusters overlap, indicating potential model confusion between clickbait and benign examples (2). For each overlapping cluster pair, we find the pairs of differently labeled examples that lie nearest each other (3) and send them to human experts for review. If necessary to stay within our review budget, we prioritize pairs of examples that cover a larger portion of the search space (4). Because it contains the most confusable examples along the decision boundary, the resulting curated set is both informative and diverse. The expert-provided labels are split randomly into two sets. The first is used for model evaluation, based on two key alignment metrics: the internal alignment, which measures how much the experts agree with each other, and the model–human alignment between the current model and the human experts. The second set is used to fine-tune the current model, producing the next model iteration. The process repeats until the model–human alignment either matches the internal alignment or plateaus and cannot be improved further.
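The selection steps (2) through (4) can be approximated with off-the-shelf clustering tools. The sketch below is a simplified illustration rather than our production implementation: it assumes each ad is represented by an embedding vector, uses k-means from scikit-learn as a stand-in for the actual clustering method, and uses a rough centroid-distance percentile both as the notion of "overlap" and as the prioritization heuristic for staying within the review budget.

```python
# Simplified sketch of steps (2)-(4): cluster model-labeled "benign" and "clickbait"
# examples separately, treat nearby cluster pairs as overlapping (likely confusion),
# and pick the nearest cross-label example pair from each for expert review.
# The embeddings, k-means, and thresholds are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def select_review_pairs(benign_emb, clickbait_emb, n_clusters=50, budget=100):
    benign_km = KMeans(n_clusters=n_clusters, n_init=10).fit(benign_emb)
    click_km = KMeans(n_clusters=n_clusters, n_init=10).fit(clickbait_emb)

    # Cluster pairs whose centroids are unusually close are treated as "overlapping".
    centroid_dist = cdist(benign_km.cluster_centers_, click_km.cluster_centers_)
    overlap_cutoff = np.percentile(centroid_dist, 10)

    pairs = []
    for i, j in zip(*np.where(centroid_dist <= overlap_cutoff)):
        b_idx = np.where(benign_km.labels_ == i)[0]
        c_idx = np.where(click_km.labels_ == j)[0]
        # Nearest differently-labeled pair within this overlapping cluster pair (step 3).
        d = cdist(benign_emb[b_idx], clickbait_emb[c_idx])
        bi, ci = np.unravel_index(np.argmin(d), d.shape)
        pairs.append((b_idx[bi], c_idx[ci], centroid_dist[i, j]))

    # Step (4): if over budget, keep pairs that spread widest over the space
    # (approximated here by preferring pairs from more distant cluster centers).
    pairs.sort(key=lambda p: -p[2])
    return [(b, c) for b, c, _ in pairs[:budget]]
```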
Metric
Our curation process does not assume the existence of ground truth. Many classification problems in the ads safety space, like content moderation or fraud detection, are inherently ambiguous and require interpretation and discussion, even among policy experts. As a result, we cannot rely on standard metrics like precision and recall, which require ground truth labels. Instead, we use Cohen's Kappa, which measures the degree to which two independent annotators agree beyond what would be expected by chance. In our experiments, Cohen's Kappa serves both as a quality indicator for datasets (including model evaluation during the curation process, as noted above) and as a measure of model performance. Values closer to 1 indicate higher alignment, 0 indicates no alignment beyond chance, and negative values indicate systematic disagreement. While standards for interpreting these scores vary, Kappa values above .8 are widely considered exceptionally good, and values above .4 are generally considered acceptable.
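For reference, Cohen's Kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The toy example below computes it with scikit-learn; the library choice and the labels are purely illustrative.

```python
# Toy example: Cohen's Kappa between two annotators labeling the same five ads.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["benign", "clickbait", "benign", "benign", "clickbait"]
annotator_b = ["benign", "clickbait", "benign", "clickbait", "clickbait"]

# Observed agreement is 0.8 and chance agreement is 0.48,
# so Kappa = (0.8 - 0.48) / (1 - 0.48), about 0.62.
print(cohen_kappa_score(annotator_a, annotator_b))
```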
Experiments
We wanted to understand which models and tasks would benefit most from our curation method. As baselines, we used crowdsourced labels to fine-tune two LLMs of different sizes, Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters, on two tasks of lower and higher complexity (as measured by expert alignment). Each crowdsourced dataset has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.
We compared each of these four baseline conditions to the corresponding curated condition, in which each model (Nano-1 and Nano-2) was fine-tuned over multiple rounds of the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning, as described above. We stopped at six iterations (400 fine-tuning and 250 evaluation samples) for the lower complexity task and five iterations (250 fine-tuning and 150 evaluation samples) for the higher complexity task, as model alignment plateaued before reaching parity with the experts' internal alignment. (Note that the lower complexity task had a larger variety of examples, which may account for the additional iterations needed to converge.) The final class balance for both datasets was under 40% positive examples. The scale and quality of the data used in each condition are summarized in the table below. During the curation process, experts achieved an average pairwise Cohen's Kappa of .81 for the lower complexity task and .78 for the higher complexity task; these serve as our benchmark for model performance. To assess the quality of the crowdsourced data, we also calculated Kappa alignment between the crowdsourced annotations and the experts on our complete curated set: .59 for the lower complexity task and .41 for the higher complexity task.
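As a small illustration of how these alignment numbers are obtained, the sketch below computes an average pairwise Cohen's Kappa across experts (the internal alignment benchmark) and a Kappa between crowdsourced labels and an expert reference. The toy labels and the use of scikit-learn are assumptions for illustration only.

```python
# Illustrative: average pairwise Cohen's Kappa among experts (internal alignment)
# and Kappa between crowdsourced labels and a single expert reference.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(labels_by_annotator):
    pairs = list(combinations(labels_by_annotator, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Toy labels: three experts and one crowdsourced label set over the same five ads.
experts = [
    ["benign", "clickbait", "benign", "benign", "clickbait"],
    ["benign", "clickbait", "benign", "clickbait", "clickbait"],
    ["benign", "clickbait", "clickbait", "benign", "clickbait"],
]
crowd = ["benign", "benign", "benign", "benign", "clickbait"]

print(average_pairwise_kappa(experts))       # internal alignment benchmark
print(cohen_kappa_score(crowd, experts[0]))  # crowdsourced vs. expert alignment
```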