This blog post is written entirely in English, purely to practice my English writing. Please let me know if you have any suggestions.
Basic Information
- Title: An End-to-End Submodular Framework for Data-Efficient In-Context Learning [1]
- Authors: Lilly Kumari, Shengjie Wang, Arnav Das, Tianyi Zhou, Jeff Bilmes
- Conference: NAACL 2024
- Open Access: https://aclanthology.org/2024.findings-naacl.209
The Problem to Solve
The paper tackles the problem of annotating, selecting, and ordering in-context exemplars for Large Language Models (LLMs).
The In-Context Learning (ICL) performance of LLMs is strongly affected by which exemplars are selected and how they are ordered, so we need a principled methodology for choosing in-context exemplars according to the query.
The Proposed Algorithm
In practice, most of the data we have is unannotated, i.e., queries without answers. Some recent methods [2] therefore suggest selecting and annotating a subset of a huge unannotated dataset $\mathcal X_\mathrm{unlabeled}$ to form a small annotated dataset $\mathcal X_\mathrm{labeled}$, and choosing the in-context exemplars from this subset during evaluation. This paper follows that paradigm and proposes Div-S3, a two-stage, data-efficient, learning-free framework for exemplar selection.
- The first stage (Div): Exemplar Annotation, from $\mathcal X_\mathrm{unlabeled}$ to $\mathcal X_\mathrm{labeled}$.
- The second stage (S3): Exemplar Retrieval, from $\mathcal X_\mathrm{labeled}$ and the query set $Q$ to the in-context exemplars $\mathcal D_\mathrm{context}$.
Exemplar Annotation
Problem Definition. The first stage of Div-S3, as mentioned above, selects a subset of the unannotated data $\mathcal X_\mathrm{unlabeled}$ and lets homo sapiens do the annotation, producing an informative and diverse subset $\mathcal X_\mathrm{labeled}$ under the constraint $|\mathcal X_\mathrm{labeled}| \ll |\mathcal X_\mathrm{unlabeled}|$. This process resembles one iteration of pool-based active learning. With the objectives of diversity and reduced redundancy, it can be formulated as a set optimization problem [3]: $$\max_{A\subset \mathcal X_\mathrm{unlabeled},|A|\le k} f(A),$$ where $k$ is a hyperparameter representing the annotation budget, and $f:~2^{\mathcal X_\mathrm{unlabeled}}\rightarrow\mathbb R$ is a submodular function mapping each subset of $\mathcal X_\mathrm{unlabeled}$ to a score. The higher the score, the better the subset.
The submodular function must satisfy two properties, monotonicity and diminishing marginal gain, i.e., respectively, $$\begin{aligned}\forall A\subseteq T, \quad&f(A) \le f(T), \\ \forall A\subset T,~ \forall x\not\in T, \quad&f(\{x\}|A) \ge f(\{x\}|T), \end{aligned}$$ where $f(\{x\}|T) = f(\{x\}\cup T) - f(T)$ is the marginal gain of adding $\{x\}$ to $T$.
The authors set the submodular function to be facility location, which is defined as $$f(A) = \sum_{s_i\in\mathcal X_\mathrm{unlabeled}}\max_{s_j\in A}\mathrm{sim}(s_i,s_j),$$ where $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity of two queries’ embeddings, generated by sentence-BERT [4].
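To make the objective concrete, here is a minimal NumPy sketch of the facility-location score. The function name and the precomputed similarity matrix `sim` (e.g., pairwise cosine similarities of sentence-BERT embeddings) are my own assumptions, not code from the paper.

```python
import numpy as np

def facility_location(sim: np.ndarray, selected: list[int]) -> float:
    """Facility-location score f(A): every point in X_unlabeled is credited
    with its best similarity to some selected exemplar, and the credits are
    summed. `sim` is the precomputed n x n cosine-similarity matrix of the
    sentence-BERT embeddings; `selected` holds the indices of the subset A."""
    if not selected:
        return 0.0  # convention: f of the empty set is 0
    # sim[:, selected] has shape (n, |A|); the max over axis 1 picks,
    # for each point, its most similar selected exemplar.
    return float(sim[:, selected].max(axis=1).sum())
```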
Intuitive Interpretation. Maximizing facility location can be intuitively and geometrically interpreted as minimizing the sum of distances between each node and its nearest selected node, which is the k-medoids problem and closely resembles k-means clustering.
Solution. The authors use a greedy algorithm proposed by Nemhauser et al. [5], called Greedy Submodular Maximization (GSM). Its fundamental idea is to greedily add the item with the maximum marginal gain for $k$ iterations, i.e. $$A\leftarrow A\cup \Big\{\argmax_{v\in V\setminus A} \big(f(A\cup \{v\}) - f(A)\big) \Big\}.$$ Taking the cost of evaluating $f$ into account, the algorithm runs in $O((nk)^2)$ time. The paper claims $O(n^2+nk)$, but I beg to differ: that figure seems to ignore the cost of evaluating $f$, which is $\Theta(nm)$ per evaluation even with a pre-computed similarity matrix, where $m=|A|$ ranges over $0,1,\ldots,k-1$ across the iterations. I do agree with the authors that the algorithm can be accelerated by techniques such as caching and priority queues.
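Below is a sketch of GSM for facility location that demonstrates the caching trick just mentioned: by remembering each point's best similarity to the current $A$, each marginal gain costs $O(n)$ instead of $\Theta(nm)$, bringing the total down to $O(n^2k)$. All names are mine, and the sketch assumes non-negative similarities.

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list[int]:
    """Greedy Submodular Maximization (GSM) for facility location.
    `sim[i, v]` is the similarity between point i and candidate v; candidates
    are the columns, so a rectangular slice of `sim` also works.
    Assumes similarities are non-negative (e.g., clipped cosine)."""
    selected: list[int] = []
    # best_so_far[i] = max similarity of point i to the current set A; 0 while A is empty
    best_so_far = np.zeros(sim.shape[0])
    for _ in range(k):
        # marginal gain of candidate v: sum_i max(0, sim[i, v] - best_so_far[i])
        gains = np.maximum(sim - best_so_far[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # forbid re-selecting the same column
        v = int(gains.argmax())
        selected.append(v)
        best_so_far = np.maximum(best_so_far, sim[:, v])
    return selected
```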
Exemplar Retrieval
Exemplar retrieval is intended to find the best in-context exemplars $\mathcal D_\mathrm{context}$ for the given query set $Q$ from $\mathcal X_\mathrm{labeled}$. Previous similarity-based methods usually yield in-context exemplars with redundant information. To reduce such redundancy, the authors formalize exemplar retrieval as a conditional submodular subset selection problem. The purpose of this stage is to “obtain a set of exemplars that are not only relevant to the test query but also encompass diverse aspects crucial for aiding the LLM in the target task” [1]. They come up with a two-phase method called Submodular Span Summarization (S3), which was published prior to this paper [6].
Phase 1 of S3
Problem Definition. This phase aims to select a relatively large subset that is relevant to the query set $Q$ but may still contain redundancy. The original problem might be difficult to solve directly, so the paper considers its dual instead: $$\min_{A\subseteq V - Q,|A|\ge k_1}f(A|Q),$$ which is a cardinality-constrained submodular minimization problem. The authors approximate $f(A|Q)$ by the modular function $m_Q(A) = \sum_{a\in A}f(a|Q)$, which is an upper bound of $f(A|Q)$.
Solution. Although the paper doesn't state it explicitly, I believe that, given the modular approximation, the algorithm is simply to pick the $k_1$ exemplars in $V - Q$ with the smallest values of $f(a|Q)$. This matches the time complexity $O(k+k\log k_1)$ stated in Appendix B, if we ignore the time for evaluating $f$.
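Under that reading, phase 1 reduces to a top-$k_1$ selection by conditional gain. A hedged sketch for the facility-location case follows; `s3_phase1` and its argument names are my own invention.

```python
import heapq
import numpy as np

def s3_phase1(sim: np.ndarray, query_idx: list[int],
              candidates: list[int], k1: int) -> list[int]:
    """Rank every candidate a by its conditional gain f({a}|Q) and keep the
    k1 smallest, i.e. the exemplars that Q already 'covers' best."""
    covered = sim[:, query_idx].max(axis=1)  # each point's best similarity to Q

    def gain(a: int) -> float:
        # for facility location: f({a}|Q) = sum_i max(0, sim[i, a] - covered[i])
        return float(np.maximum(sim[:, a] - covered, 0.0).sum())

    # a size-k1 heap over the candidates: O(k log k1) comparisons
    return heapq.nsmallest(k1, candidates, key=gain)
```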
Phase 2 of S3
This phase selects the most representative exemplars from the output of phase 1, and is mathematically identical to Exemplar Annotation: $$\max_{A\subset A_Q,|A|\le k_2} f(A),$$ where $A_Q$ is the set of exemplars selected in phase 1. Optionally, a knapsack constraint can be applied to the problem, which is then solved with a modified version of GSM.
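Putting the two phases together, all of S3 is a few lines of glue on top of the earlier sketches. As before, the composition below and its index conventions are mine, not the paper's reference implementation.

```python
import numpy as np

def s3(sim: np.ndarray, query_idx: list[int],
       labeled_idx: list[int], k1: int, k2: int) -> list[int]:
    """End-to-end S3 sketch: phase 1 shrinks X_labeled to a pool A_Q relevant
    to Q; phase 2 greedily picks k2 diverse exemplars out of A_Q."""
    a_q = s3_phase1(sim, query_idx, labeled_idx, k1)
    # phase 2 is greedy facility location restricted to the columns of A_Q;
    # greedy_facility_location accepts the rectangular slice as-is.
    picked = greedy_facility_location(sim[:, a_q], k2)
    return [a_q[j] for j in picked]  # map column positions back to dataset indices
```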
Comments
The authors propose a learning-free framework named Div-S3 for selecting in-context exemplars from unlabeled datasets, and evaluate its effectiveness with ablation experiments across 7 Natural Language Processing (NLP) tasks and 5 LLMs. Although the algorithm doesn't take ordering into consideration, the paper shows empirically that it is insensitive to the order of exemplars.
However, I still have some questions about the paper:
- In the Exemplar Retrieval stage, what is the point of computing $f(Q)$ against all of the unlabeled data $\mathcal X_\mathrm{unlabeled}$? Given that $\mathcal X_\mathrm{labeled}$ is a representative subset of $\mathcal X_\mathrm{unlabeled}$, would it be computationally cheaper to calculate $f(Q)$ and $f(a|Q)$ only against $\mathcal X_\mathrm{labeled}\cup Q$?
- What is the purpose of handling the queries as a whole (the query set $Q$) instead of retrieving the best exemplars for each individual query $q$?
- The experiments don't compare Div-S3 with some recent algorithms, especially learning-based ones.
- Would the algorithm perform better on average if we handcrafted a rule to order the selected exemplars? Also, in Figure 3, the variances do not look small.