Research Talks ➤ #089 · 24.04.2024

Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

Sivaganeshan Aravinth
Slides · Video · Paper

LLMs have high potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that LLMs cannot solve directly: for problems with structured outputs, an LLM can be prompted to perform the task in the reverse direction, generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. The approach is demonstrated on closed information extraction, where collecting ground-truth data is challenging and no satisfactory dataset exists to date. The authors synthetically generate a dataset of 1.8M data points, establish its superior quality over existing datasets in a human evaluation, and use it to fine-tune small models (220M and 770M parameters), termed SynthIE, that outperform the prior state of the art of equal model size by a substantial margin: 57 absolute points in micro-F1 and 79 points in macro-F1.
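To make the reverse-direction idea concrete, below is a minimal Python sketch of one generation step: the target output structure (a set of subject-relation-object triplets) is fixed first, and an LLM is asked to write plausible input text that expresses it. The `complete` callable, the prompt wording, and the triplet format are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of reverse-direction synthetic data generation: fix the target
# output structure, then prompt an LLM for a plausible input text.
# `complete(prompt) -> str` is a hypothetical stand-in for a real LLM call.

Triplet = tuple[str, str, str]  # (subject, relation, object)

def build_reverse_prompt(triplets: list[Triplet]) -> str:
    """Ask the LLM to write text expressing exactly the target triplets."""
    facts = "\n".join(f"- ({s}; {r}; {o})" for s, r, o in triplets)
    return (
        "Write a short, natural paragraph that expresses all of the "
        "following facts, and no others:\n"
        f"{facts}\n"
        "Paragraph:"
    )

def generate_synthetic_pair(triplets: list[Triplet], complete) -> dict:
    """Return one (input text, output structure) training pair.

    The output (triplet set) is chosen in advance; the LLM only solves
    the easier reverse task of producing plausible input text for it.
    """
    text = complete(build_reverse_prompt(triplets))
    return {"text": text, "triplets": triplets}

if __name__ == "__main__":
    # A stub in place of a real LLM API so the sketch runs standalone.
    stub = lambda prompt: "Ludwig van Beethoven was born in Bonn."
    pair = generate_synthetic_pair(
        [("Ludwig van Beethoven", "place of birth", "Bonn")], stub
    )
    print(pair)
```

Because the triplet set is known exactly before the text is generated, each pair comes with a perfect structured label by construction, which is what allows the dataset to scale without manual annotation.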
