A methodology for synthetic corpus engineering
DOI:
https://doi.org/10.7764/onomazein.70.11
Keywords:
synthetic corpus engineering, corpus linguistics, language model, text classification, social problem
Abstract
This article describes a methodological framework for developing synthetic corpora with a small language model from a prompt engineering perspective. Instead of the typical text mining approach, which prioritises corpus size in model training, this study applies a corpus linguistics methodology that accounts for corpus domain and distribution to generate diverse and realistic texts. The synthetic corpora were evaluated by integrating them into a text classification system that detects social problems. The objective is to determine whether a theoretically sound methodology based on corpus linguistics can improve the performance of systems trained with such synthetic corpora. The study concludes that factors such as stratification and proportionality in the sampling method have a greater impact than corpus size.
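
As an illustration of what stratification and proportionality can mean when planning a synthetic corpus, the following minimal sketch allocates a fixed number of texts to generate across strata in proportion to a reference distribution. It is not the procedure used in the study: the function name `proportional_allocation`, the stratum labels, and the counts are all hypothetical, and the largest-remainder rounding is just one reasonable choice.

```python
def proportional_allocation(stratum_sizes, total):
    """Split `total` texts across strata in proportion to `stratum_sizes`.

    Uses largest-remainder rounding so the allocations sum exactly to `total`.
    `stratum_sizes` maps a (hypothetical) stratum label to its reference count.
    """
    population = sum(stratum_sizes.values())
    if population <= 0:
        raise ValueError("stratum sizes must sum to a positive number")

    # Exact (fractional) share of the total for each stratum.
    quotas = {name: total * size / population
              for name, size in stratum_sizes.items()}

    # Start from the integer part of each quota.
    allocation = {name: int(quota) for name, quota in quotas.items()}

    # Hand out the remaining texts to the strata with the largest remainders.
    leftover = total - sum(allocation.values())
    by_remainder = sorted(quotas,
                          key=lambda name: quotas[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical reference distribution of social-problem categories.
reference = {"housing": 50, "health": 30, "education": 20}
plan = proportional_allocation(reference, 10)
```

Under these assumptions, each stratum's share of the generated corpus mirrors its share of the reference distribution, regardless of how large the corpus is overall.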
