What are Large Language Models?
Large Language Models (LLMs) are a class of deep learning models that have gained significant attention and popularity in the field of Natural Language Processing (NLP) due to their impressive performance on a wide range of language-related tasks. These models are characterized by their vast size, containing millions to billions of parameters, and are typically based on the transformer architecture.

LLMs are versatile tools with numerous applications across diverse domains and have the potential to revolutionize industries. They can generate text, translate languages, summarize content, analyze sentiment, answer questions, and understand natural language. LLMs find uses in content creation, translation services, sentiment analysis, virtual assistants, content recommendation, and code generation, and they play a role in healthcare, content moderation, legal document review, and educational assistance. The capabilities of LLMs are continually evolving, and new applications are being discovered as research in natural language processing advances. Ethical considerations, such as bias and fairness, also play a crucial role in deploying LLMs across these applications.

There are four broad categories of LLMs: the Base or Foundational LLM, the Fine-tuned LLM, the Retrieval-Augmented Generation (RAG) LLM, and the Domain-Adaptive Pretrained LLM.
Base or Foundational Large Language Model
In this category, a Large Language Model (LLM) is trained from scratch: a complex and resource-intensive process in which a deep learning model, typically based on the transformer architecture, is trained on a massive amount of text data without any pre-existing knowledge.
Examples: ChatGPT (GPT-3.5) and GPT-4 by OpenAI, Bard by Google, Claude 2 by Anthropic, Llama 1 and Llama 2 by Meta, Falcon by TII (UAE), MPT by MosaicML, etc.
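To make the process concrete, here is a minimal sketch of pretraining a small GPT-style model from randomly initialized weights using the Hugging Face Transformers and Datasets libraries. The toy model size, the WikiText-2 corpus, and all hyperparameters are illustrative assumptions only; a real foundational LLM trains on orders of magnitude more data across large GPU/TPU clusters.

```python
# Minimal sketch: pretraining a small GPT-style model from scratch.
# Model size, dataset, and hyperparameters are toy placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # reuse an existing tokenizer for brevity
tokenizer.pad_token = tokenizer.eos_token

# Randomly initialized weights -- the model starts with no pre-existing knowledge.
config = GPT2Config(n_layer=4, n_head=4, n_embd=256)
model = GPT2LMHeadModel(config)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scratch-lm",
                           per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```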
Advantages:
- Flexibility: Training from scratch allows customization of the model architecture and training process to suit specific requirements.
- Dataset control: Attain complete command over the training datasets employed for pre-training, exerting a direct influence on model quality and addressing concerns related to bias and toxicity.
- Novelty: Training LLMs from scratch enables the exploration of new research directions and the development of cutting-edge models.
Disadvantages:
- Computational Resources: Training an LLM from scratch requires substantial computational power, including high-performance GPUs or TPUs, and can be time-consuming.
- Data Collection: Gathering a large and diverse corpus of data for training can be challenging, especially for niche or specialized domains.
- Expertise: Training LLMs from scratch demands expertise in machine learning, natural language processing, and large-scale distributed computing.
Fine-tuned LLM
Fine-tuning refers to the process of taking a pretrained (foundational) language model and training it further on a smaller, task-specific dataset for a particular natural language processing (NLP) task. Fine-tuning is a crucial step in deploying LLMs for many applications.
Examples: Alpaca by Stanford, Vicuna by LMSYS, WizardLM, Orca by Microsoft, h2oGPT by H2O.ai, etc.
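As an illustration, the sketch below fine-tunes a small pretrained checkpoint on a sentiment classification task with Hugging Face Transformers. The checkpoint, the IMDB dataset, and the hyperparameters are assumptions chosen for brevity, not a recommended recipe.

```python
# Minimal sketch: fine-tuning a pretrained model on a task-specific dataset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # small pretrained foundation model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-sentiment",
                           per_device_train_batch_size=16, num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small slice for the sketch
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()
```

Because the pretrained weights already encode general language understanding, even a small labeled slice like this is often enough to reach reasonable accuracy, which is the transfer-learning advantage noted below.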
Advantages:
- Reduced Computational Requirements: Fine-tuning is typically faster and requires less computational power compared to training LLMs from scratch, as it builds on the pretrained model's knowledge.
- Transfer Learning: Fine-tuning allows leveraging the pretrained model's general language understanding to improve performance on specific tasks.
- Specialization: Fine-tuning enables adapting the pretrained LLM to specific domains or applications, enhancing its performance in those areas.
Disadvantages:
- Limited Generalization: Fine-tuning an LLM on a specific task or domain may lead to overfitting, where the model becomes excessively tailored to the training data. This can result in reduced generalization capability, making it less effective in handling diverse or out-of-domain inputs.
- Data Requirements: Fine-tuning typically requires task-specific labeled or annotated data for training. Collecting or creating such datasets can be time-consuming, costly, or challenging, especially for niche or specialized domains where labeled data may be scarce.
- Bias Amplification: If the fine-tuning dataset contains biases, the LLM may amplify and perpetuate those biases during the training process. This can lead to biased or unfair predictions, reinforcing existing societal or cultural biases.
- Dependency on Pretrained Models: Fine-tuning relies heavily on the quality and suitability of the pretrained LLM. If the pretrained model does not align well with the target task or domain, the fine-tuning process may not yield significant improvements.
Retrieval Augmented Generation (RAG) LLM
A Retrieval Augmented Generation (RAG) LLM is another type of LLM that combines elements of both retrieval and generation-based approaches. It is designed to enhance the generation of text or responses by incorporating a retrieval step.
Retrieval: The model first performs a retrieval step where it searches through a large database of text or documents to find relevant information or context related to the given query or prompt.
Augmentation: The retrieved information is then used to augment the generation process. The model uses this retrieved context to inform its output and generate more contextually relevant and coherent responses, improving the accuracy and context-awareness of the generated text. This also helps mitigate the hallucination issue seen in both base and fine-tuned models.
Retrieval-augmented LLMs are especially useful for tasks that require generating responses grounded in specific knowledge or context. They combine the strengths of retrieval models, which are good at finding relevant information, and generation models, which are good at producing human-like text. RAG also addresses the data-freshness problem (for example, ChatGPT originally lacked awareness of events occurring after September 2021).
Examples: PrivateGPT, LocalGPT, LangChain with local data (PDF, DOCX, MD, etc.).
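The retrieve-then-augment loop described above can be sketched in a few lines. The embedding model, in-memory document list, and prompt template here are illustrative assumptions; production systems typically use a vector database and a hosted LLM API for the final generation step.

```python
# Minimal sketch of the retrieve-then-augment loop behind RAG.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "The warranty covers manufacturing defects for 24 months.",
    "Returns are accepted within 30 days with a receipt.",
    "Shipping to EU countries takes 3-5 business days.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query, k=2):
    """Step 1 -- Retrieval: rank documents by cosine similarity to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query):
    """Step 2 -- Augmentation: prepend retrieved context to ground the answer."""
    context = "\n".join(retrieve(query))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# The assembled prompt is then sent to any generative LLM of your choice.
print(build_prompt("How long is the warranty?"))
```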
Advantages:
- Improved Contextual Understanding: By incorporating a retrieval component, these models can access external knowledge sources, such as documents, websites, or databases, to enhance their contextual understanding. This enables them to provide more accurate and informed responses.
- Enhanced Accuracy: Retrieval-augmented LLMs can leverage the retrieved information to generate more accurate and relevant responses, reducing the likelihood of generating incorrect or misleading information.
- Domain Adaptation: The retrieval component allows for targeted information retrieval from domain-specific knowledge sources. This enables the model to adapt and provide better responses within specific domains or specialized contexts.
- Better Handling of Out-of-Distribution Inputs: Retrieval-augmented LLMs can handle inputs that fall outside their training distribution by retrieving relevant information from external sources. This helps them provide meaningful responses even for queries or topics they haven't encountered during training.
Disadvantages:
- Knowledge-base dependence: RAG is sensitive to the quality of the external knowledge base; incomplete, outdated, or poorly indexed sources lead to irrelevant retrieved context and, in turn, degraded responses.
Domain-Adaptive Pretrained LLM
Continued pretraining on domain-specific datasets, known as domain-adaptive pretraining, has proven effective in tailoring natural language understanding models to particular domains. This strategy lets a language model retain its general capabilities while integrating domain-specific expertise, improving performance on specialized tasks at relatively low cost. Preliminary experiments across three domains (biomedicine, finance, and law) using raw domain corpora (Cheng et al., 2023) indicate that continued pretraining on raw text can hurt prompting performance, even though it remains beneficial for fine-tuning evaluation and knowledge-probing assessments.

To harness domain-specific expertise while also preserving prompting performance, the authors (Cheng et al., 2023) present a straightforward approach for converting large, unprocessed corpora into reading comprehension materials: each original text is enriched with a set of tasks pertinent to its subject matter. Their 7B-parameter model attains performance competitive with much larger domain-specific models, such as BloombergGPT-50B. They also show that including domain-specific reading comprehension texts can improve the model's performance even on standard benchmarks, highlighting the potential for a versatile model spanning an even broader range of domains.
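As a purely hypothetical illustration of the reading-comprehension idea, the sketch below appends toy task templates to a raw domain passage. The templates and answers are invented for this example; Cheng et al. (2023) mine a richer set of task types (e.g., summarization, NLI, paraphrase) from each text with pattern-based rules before continuing pretraining on the result.

```python
# Hypothetical sketch: enriching raw domain text with reading-comprehension
# tasks, in the spirit of Cheng et al. (2023). Templates and answers here
# are invented placeholders, not the paper's actual mining rules.
raw_passage = (
    "A tort is a civil wrong that causes a claimant to suffer loss or harm, "
    "resulting in legal liability for the person who commits the tortious act."
)

def to_reading_comprehension(passage: str) -> list[dict]:
    """Append toy comprehension tasks to a raw passage."""
    return [
        {"text": f"{passage}\n\nQuestion: What term does the passage define?"
                 f"\nAnswer: A tort."},
        {"text": f"{passage}\n\nTask: Paraphrase the passage in one sentence."
                 f"\nAnswer: A tort is a civil wrong that creates legal "
                 f"liability for the harm it causes."},
    ]

# These enriched examples would then be mixed into the continued-pretraining
# corpus alongside general instructions.
for example in to_reading_comprehension(raw_passage):
    print(example["text"], end="\n\n")
```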