# NLP with Transformers & Large Language Models

## Table of Contents

1. [Introduction to NLP & Transformers](#introduction)
2. [Understanding Transformer Architecture](#transformers)
3. [Pre-trained Language Models](#pretrained)
4. [Fine-tuning & Transfer Learning](#finetuning)
5. [Working with LLMs](#llms)
6. [Advanced Techniques](#advanced)
7. [Production Deployment](#deployment)
8. [Future Directions](#future)

## Introduction to NLP & Transformers

Natural Language Processing (NLP) has undergone a revolutionary transformation with the introduction of Transformer models. This course covers state-of-the-art techniques for working with transformers and large language models (LLMs).

### Evolution of NLP

Historical progression:

- **Statistical NLP**: Rule-based and probabilistic models
- **Word Embeddings**: Word2Vec, GloVe (2013-2015)
- **RNNs/LSTMs**: Sequential processing (2014-2016)
- **Attention Mechanism**: Introduced for neural machine translation (Bahdanau et al., 2014)
- **Transformers**: Parallel processing revolution (Vaswani et al., 2017)
- **Large Language Models**: GPT, BERT, T5, and beyond

### Why Transformers?

Key advantages:

- Parallel processing capability
- Better handling of long-range dependencies
- Easier to scale to large datasets
- Transfer learning friendly
- State-of-the-art performance across tasks

## Understanding Transformer Architecture

### Self-Attention Mechanism

The core innovation of transformers:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute scaled dot-product attention.

    Args:
        query: (batch, seq_len, d_k)
        key: (batch, seq_len, d_k)
        value: (batch, seq_len, d_v)
        mask: Optional mask to prevent attention to certain positions

    Returns:
        output: (batch, seq_len, d_v)
        attention_weights: (batch, seq_len, seq_len)
    """
    d_k = query.shape[-1]

    # Compute attention scores
    scores = torch.matmul(query, key.transpose(-2, -1)) / torch.sqrt(
        torch.tensor(d_k, dtype=torch.float32)
    )

    # Apply mask if provided
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Apply softmax
    attention_weights = F.softmax(scores, dim=-1)

    # Compute output
    output = torch.matmul(attention_weights, value)

    return output, attention_weights
```

### Multi-Head Attention

Parallel attention representations:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear layers for projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]

        # Project and split into heads
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply attention
        attn_output, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, -1, self.d_model)

        # Final linear projection
        output = self.W_o(attn_output)
        return output
```

### Complete Transformer Block

A complete encoder-style block combining self-attention and a feed-forward network with residual connections and layer normalization:

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout_rate=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x
```
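To see the pieces working together, here is a minimal sanity check (an illustrative sketch, not part of any library) that stacks two of the blocks defined above and pushes a dummy batch through them; the dimensions are arbitrary example values:

```python
import torch

# Illustrative sketch: stack the TransformerBlock defined above and verify shapes.
# d_model, num_heads, d_ff and the batch sizes are arbitrary example values.
d_model, num_heads, d_ff = 512, 8, 2048
blocks = torch.nn.ModuleList([TransformerBlock(d_model, num_heads, d_ff) for _ in range(2)])

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
for block in blocks:
    x = block(x)

print(x.shape)  # torch.Size([2, 10, 512]) -- shape is preserved through each block
```

Because each block maps `(batch, seq_len, d_model)` back to the same shape, blocks can be stacked to arbitrary depth, which is exactly how full encoder stacks are built.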
## Pre-trained Language Models

### BERT (Bidirectional Encoder Representations from Transformers)

```python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input
text = "The cat sat on the mat"
tokens = tokenizer(text, return_tensors='pt')

# Get embeddings
with torch.no_grad():
    outputs = model(**tokens)
    last_hidden_state = outputs.last_hidden_state
    pooled_output = outputs.pooler_output

print(f"Sequence length: {last_hidden_state.shape[1]}")
print(f"Hidden dimension: {last_hidden_state.shape[2]}")
```
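The `pooler_output` is mainly useful as input to a classification head; for sentence-level comparisons, a common heuristic is to mean-pool the token embeddings over the non-padding positions. The sketch below illustrates that heuristic (it is not an official BERT API), reusing the `tokenizer` and `model` loaded above:

```python
import torch

def mean_pooled_embedding(text):
    """Mean-pool BERT token embeddings, ignoring padding via the attention mask."""
    enc = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state           # (1, seq_len, 768)
    mask = enc['attention_mask'].unsqueeze(-1)            # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)

a = mean_pooled_embedding("The cat sat on the mat")
b = mean_pooled_embedding("A kitten rested on the rug")
print(f"Cosine similarity: {torch.cosine_similarity(a, b).item():.3f}")
```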
### GPT (Generative Pre-trained Transformer)

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate with sampling constraints
output = model.generate(
    input_ids,
    max_length=100,
    num_return_sequences=3,
    do_sample=True,          # required for temperature/top_p sampling and multiple sequences
    temperature=0.7,
    top_p=0.9,
    no_repeat_ngram_size=2
)

for i, sample in enumerate(output):
    print(f"Sample {i+1}:")
    print(tokenizer.decode(sample, skip_special_tokens=True))
```

### T5 (Text-to-Text Transfer Transformer)

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load T5
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# T5 prefix-based tasks
tasks = {
    'translate': 'translate English to French: The quick brown fox',
    'summarize': 'summarize: The quick brown fox jumps over the lazy dog',
    'question': 'question: What is the capital of France?'
}

for task, input_text in tasks.items():
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids
    output = model.generate(input_ids, max_length=50)
    result = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"{task}: {result}")
```

## Fine-tuning & Transfer Learning

### Fine-tuning BERT for Classification

```python
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import numpy as np

# Load and tokenize the dataset
dataset = load_dataset('imdb')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize, batched=True)

# Load model for classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    learning_rate=2e-5,
)

# Accuracy metric for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': (predictions == labels).mean()}

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    compute_metrics=compute_metrics,
)

# Train
trainer.train()

# Evaluate
eval_results = trainer.evaluate()
print(f"Accuracy: {eval_results['eval_accuracy']:.4f}")
```

### Fine-tuning with LoRA (Low-Rank Adaptation)

Efficient fine-tuning for large models:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained('gpt2')

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['c_attn', 'c_proj']
)

# Get PEFT model
model = get_peft_model(base_model, lora_config)

# Only the LoRA parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")
```

## Working with LLMs

### Using LLM APIs

```python
import anthropic
import openai

# Using the Claude API
client = anthropic.Anthropic(api_key="your-api-key")

message = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain transformers in simple terms"}
    ]
)
print(message.content[0].text)

# Using the OpenAI API (legacy pre-1.0 SDK interface)
openai.api_key = "your-api-key"

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
```

### Prompt Engineering

Techniques for better results:

```python
def few_shot_prompt(examples, query):
    """Build a few-shot classification prompt from labeled examples."""
    prompt = "Classify the following sentiment:\n\n"
    for example in examples:
        prompt += f"Text: {example['text']}\nSentiment: {example['sentiment']}\n\n"
    prompt += f"Text: {query}\nSentiment:"
    return prompt

examples = [
    {"text": "This movie is great!", "sentiment": "Positive"},
    {"text": "I didn't enjoy the book.", "sentiment": "Negative"},
]

query = "The game is amazing"
prompt = few_shot_prompt(examples, query)
```
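The assembled prompt can then be sent through either of the APIs shown above. Here is a minimal sketch using the Anthropic client from the previous section (the API key is a placeholder, and a small `max_tokens` keeps the completion to the label):

```python
import anthropic

# Illustrative sketch: classify with the few-shot prompt built above.
client = anthropic.Anthropic(api_key="your-api-key")

response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=10,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text.strip())  # e.g. "Positive"
```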
### Retrieval-Augmented Generation (RAG)

Combining LLMs with external knowledge:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# Load documents (load_documents is a placeholder for your own document loader)
documents = load_documents("path/to/documents")

# Split documents
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# Create embeddings
embeddings = HuggingFaceEmbeddings()

# Create vector store
vectorstore = FAISS.from_documents(docs, embeddings)

# Create QA chain
llm = OpenAI()
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query
result = qa_chain.run("What is the main topic?")
print(result)
```

## Advanced Techniques

### Quantization for Efficiency

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load quantized model (gated checkpoint; requires accepting the license on the Hub)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# The model now stores weights in 8-bit precision, roughly halving memory
# versus fp16 (about 4x smaller than fp32)
```

### Multi-GPU Training

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training (launched with torchrun, one process per GPU)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Wrap an existing model for distributed data parallelism
model = model.to(torch.device('cuda', local_rank))
model = DDP(model, device_ids=[local_rank])

# Training continues as normal
# Gradients are automatically synchronized across GPUs
```

## Production Deployment

### Model Serving

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load model once at startup
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

class TextInput(BaseModel):
    text: str

@app.post("/classify")
async def classify_text(input: TextInput):
    result = classifier(input.text)
    return {
        "text": input.text,
        "label": result[0]["label"],
        "score": result[0]["score"]
    }

# Run with: uvicorn app:app --reload
```

## Future Directions

The field of NLP continues to evolve rapidly:

- **Multimodal Models**: Combining text, images, and audio
- **Efficient Models**: Smaller, faster alternatives to large LLMs
- **Reasoning**: Better handling of complex logical tasks
- **Interpretability**: Understanding model decisions
- **Real-time Adaptation**: Updating knowledge without retraining

Master these foundational concepts and you'll be well-positioned to work with cutting-edge NLP technologies.