Evaluation of open-source transformers for cloud question answering
In recent years, there has been a surge of interest in Question Answering (QA) systems, making this an important area of research within Natural Language Processing (NLP). These systems play a crucial role in applications involving human interaction, such as customer service chatbots and virtual assistants like Siri and Alexa. The primary goal of a QA system is to generate natural language responses that answer user questions satisfactorily. In practice, however, this requires more than simply retrieving explicitly stated answers from a text.
As described by Saeidi et al. and Mensio et al., many practical QA challenges require a deeper understanding of text, with answers being derived based on background knowledge and context. This requires reasoning ability and insight into subtle meanings within a question. Mohnish et al. emphasized that raw answers from QA systems often do not meet user expectations, while Strzalkowski et al. observed that users value detailed, context-rich information more than short, direct answers.
Developments in transformer models for QA
Modern NLP approaches, such as transformer models, are used to address this complexity in QA tasks. Unlike traditional sequential models, which process input step by step, transformers use self-attention mechanisms to compute pairwise relationships between all elements of an input sequence. This allows meaningful dependencies between components that are far apart to be captured. Research by Izacard et al. shows that supplying additional knowledge via retrieval significantly improves the QA performance of transformer models.
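The pairwise self-attention computation described above can be sketched as follows. The matrix shapes and random projection weights are purely illustrative and not taken from any particular model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project every token to query/key/value
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # similarity between ALL pairs of positions
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution over positions
    return weights @ V, weights                     # context vectors + attention map

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # toy sequence: 4 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)           # attn[i, j]: how much token i attends to j
```

Because every position attends to every other position in a single step, the distance between two tokens has no effect on whether their relationship can be captured, in contrast to the step-by-step propagation of sequential models.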
With the emergence of Large Language Models (LLMs), which contain millions to billions of parameters, QA systems have evolved further. Proprietary LLMs such as OpenAI's GPT-3 and its successors have achieved great success across a variety of NLP tasks, with ChatGPT reaching over 100 million active users within two months of its launch in late 2022. Despite these achievements, concerns remain around privacy, including the provenance of training data, the handling of user data, and potential vulnerabilities. This underscores the importance of exploring decentralized and open-source (OS) QA solutions that prioritize privacy and transparency.
Research focus and evaluation
This study focused on evaluating the effectiveness of various open-source transformer models—Extractive Pre-trained Transformer (EPT), Generative Pre-trained Transformer (GPT), and Text-to-Text Transfer Transformer (T5)—within a realistic QA use case. These models were compared to OpenAI's proprietary LLM, ChatGPT3.5-Turbo, to assess their performance and competitiveness.
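To make the contrast between the model families concrete: extractive models (the EPT family) answer by selecting a span from the source text, whereas GPT and T5 generate free-form text. A minimal sketch of the extractive decoding step is shown below; the context and logit values are hypothetical, standing in for what an encoder's QA head would produce:

```python
import numpy as np

def extract_answer_span(tokens, start_logits, end_logits, max_len=8):
    """Extractive QA decoding: pick the (start, end) span maximizing start + end score."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(tokens)):
        for e in range(s, min(s + max_len, len(tokens))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    s, e = best
    return " ".join(tokens[s:e + 1])

context = "Kubernetes is an open source container orchestration platform".split()
# Hypothetical logits, as an extractive QA head might score the question
# "What kind of platform is Kubernetes?":
start = np.full(len(context), -1.0); start[5] = 5.0   # peak at "container"
end = np.full(len(context), -1.0); end[6] = 5.0       # peak at "orchestration"
print(extract_answer_span(context, start, end))       # → "container orchestration"
```

An extractive model can therefore only return text that literally appears in the source, while the generative families can paraphrase or synthesize, which is one reason the two behave differently on open-ended and procedural questions.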
The domain chosen for this research is cloud computing, with a specific focus on Kubernetes technology. The QA system uses the public Kubernetes documentation, supplemented with real-time searches via Google, as its primary source of knowledge. To evaluate the models, a novel Machine-trained Evaluation Score (MTES) called the Estimated Human Label (EHL) was developed: a machine learning model trained to map N-gram-based metrics to human-assigned quality labels. The evaluation covers a variety of question types, including closed-ended, open-ended, conceptual, and procedural questions.
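A score of this kind could be sketched as follows: N-gram overlap features are computed between a candidate answer and a reference, and a small model is fit to map those features to human labels. The feature set, training examples, and linear fit below are illustrative assumptions, not the actual EHL implementation:

```python
import numpy as np

def ngram_overlap(cand, ref, n):
    """Fraction of candidate n-grams that also appear in the reference."""
    grams = lambda toks: [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    c, r = grams(cand.split()), set(grams(ref.split()))
    return sum(g in r for g in c) / max(len(c), 1)

def features(cand, ref):
    # Unigram, bigram, and trigram precision as a minimal feature vector.
    return np.array([ngram_overlap(cand, ref, n) for n in (1, 2, 3)])

# Hypothetical training triples: (candidate answer, reference answer, human score in [0, 1]).
data = [
    ("a pod runs one or more containers", "a pod runs one or more containers", 1.0),
    ("pods group containers together",    "a pod runs one or more containers", 0.6),
    ("kubectl applies manifests",         "a pod runs one or more containers", 0.1),
]
X = np.stack([features(c, r) for c, r, _ in data])
y = np.array([s for _, _, s in data])
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)  # fit a linear model + bias

def ehl(cand, ref):
    """Estimated Human Label: learned combination of the n-gram metrics."""
    return float(np.r_[features(cand, ref), 1.0] @ w)
```

The point of the learned combination is that no single N-gram metric tracks human judgment well on its own; the trained model weights the metrics so the composite score correlates with the human labels.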