Hugging Face Reference
About Hugging Face Reference
The Hugging Face Transformers Reference is a searchable quick-reference for the transformers, datasets, evaluate, and huggingface_hub Python libraries. It covers eight practical categories — Models, Tokenizers, Pipelines, Trainer, Datasets, Evaluation, Quantization, and Hub — giving machine learning engineers instant access to the exact API signatures and code snippets they need while building NLP and generative AI applications.
Researchers and ML engineers use this reference to recall patterns such as AutoModelForCausalLM.from_pretrained with device_map="auto", Trainer with TrainingArguments for learning rate scheduling, dataset.map with batched tokenization, and compute_metrics callbacks with the evaluate library. The reference also covers advanced topics including BitsAndBytesConfig for 4-bit NF4 quantization, LoRA/PEFT with target_modules, GPTQ and AWQ quantized model loading, and Gradio Spaces deployment.
Content is organized to follow a typical fine-tuning workflow: load a pretrained model and tokenizer, prepare a dataset from the Hugging Face Hub or a local dict, define training arguments, train with the Trainer API, evaluate with BLEU/ROUGE or accuracy metrics, optionally apply LoRA or 4-bit quantization to reduce memory footprint, and finally push the fine-tuned model back to the Hub. This makes the reference useful for both quick API lookups and end-to-end project planning.
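The workflow above can be condensed into one sketch. The checkpoint (distilbert-base-uncased), dataset (imdb), and repo id are illustrative placeholders, not recommendations, and the first run downloads the model and data from the Hub.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load a pretrained model and tokenizer (placeholder checkpoint).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Prepare a dataset from the Hub and tokenize it in batches.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# Define training arguments, train, and evaluate.
args = TrainingArguments(output_dir="./out", num_train_epochs=1,
                         learning_rate=2e-5, per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["test"])
trainer.train()
trainer.evaluate()

# Push the fine-tuned model back to the Hub (placeholder repo id).
model.push_to_hub("username/my-finetuned-model")
```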
Key Features
- AutoModel family: AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM, AutoModelForTokenClassification with from_pretrained options
- Tokenizer API: AutoTokenizer.from_pretrained, encode/decode, padding and truncation with max_length and return_tensors
- Pipeline tasks: text-classification, text-generation, question-answering, translation, summarization, zero-shot-classification
- Trainer and TrainingArguments: learning_rate, weight_decay, evaluation_strategy, per_device_train_batch_size, num_train_epochs
- Datasets library: load_dataset from Hub, dataset.map with batched processing, filter, train_test_split, Dataset.from_dict
- Evaluation metrics: evaluate.load for accuracy, BLEU, ROUGE; compute_metrics callback; perplexity from eval_loss
- Quantization: BitsAndBytesConfig (4-bit NF4), GPTQ and AWQ model loading, LoRA/PEFT with LoraConfig and target_modules
- Hub operations: push_to_hub, save_pretrained, huggingface-cli login, HfApi, create_repo, snapshot_download, Gradio Spaces
Frequently Asked Questions
What is the Hugging Face Transformers library?
The Hugging Face Transformers library provides thousands of pretrained models for NLP, vision, and audio tasks. It offers a unified API (AutoModel, AutoTokenizer, pipeline) to load models from the Hub and fine-tune them with the Trainer class. It supports PyTorch, TensorFlow, and JAX backends and integrates with the datasets and evaluate libraries.
What is the pipeline API and when should I use it?
The pipeline API is the simplest way to run inference with a pretrained model. Calling pipeline("text-classification") returns a callable that accepts raw strings and returns structured predictions. It is ideal for prototyping and production inference when you do not need custom training logic. For fine-tuning you would use Trainer or a custom training loop instead.
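A minimal sketch of that pattern; with no model argument, pipeline downloads a default checkpoint for the task on first use, so the exact label and score depend on that checkpoint.

```python
from transformers import pipeline

# One call builds tokenizer + model + postprocessing for the task.
classifier = pipeline("text-classification")

# The callable accepts raw strings and returns structured predictions.
result = classifier("This library makes inference trivially easy.")
# result is a list of dicts such as [{"label": "POSITIVE", "score": ...}]
```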
How do I fine-tune a model with the Trainer class?
Instantiate TrainingArguments with output_dir, num_train_epochs, learning_rate, and per_device_train_batch_size. Then create a Trainer passing the model, args, train_dataset, eval_dataset, and optionally a compute_metrics callback. Call trainer.train() to start training and trainer.evaluate() to compute validation metrics. The Trainer handles gradient accumulation, mixed precision, and distributed training automatically.
What is the difference between GPTQ, AWQ, and BitsAndBytes 4-bit quantization?
BitsAndBytesConfig with load_in_4bit performs post-training quantization at load time using the bitsandbytes library — convenient but slightly lower quality. GPTQ pre-quantizes weights using calibration data, storing a quantized model on the Hub (e.g., TheBloke repositories). AWQ (Activation-aware Weight Quantization) also pre-quantizes but preserves salient weights for better accuracy. For everyday inference use BitsAndBytes; for production deployment prefer GPTQ or AWQ.
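A configuration sketch of the BitsAndBytes route; it requires a CUDA GPU and the bitsandbytes package, and the model id is only an example of a checkpoint you might quantize at load time.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization applied when the weights are loaded.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```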
How does LoRA / PEFT reduce fine-tuning memory requirements?
LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable rank-decomposition matrices to specific linear layers (target_modules like q_proj, v_proj). Only these adapter weights are updated during training, typically shrinking the trainable parameter count to well under 1% of the total and sharply reducing the optimizer-state and gradient memory needed compared to full fine-tuning. The peft library wraps any model with get_peft_model(model, LoraConfig(...)). After training, adapters can be merged back into the base model or kept separate.
How do I compute BLEU or ROUGE scores with the evaluate library?
Install the evaluate library and call evaluate.load("rouge") or evaluate.load("bleu"). Pass predictions (generated text) and references (ground truth) as lists of strings to metric.compute(). For ROUGE this returns rouge1, rouge2, rougeL, and rougeLsum scores. You can integrate this into Trainer via the compute_metrics callback, which receives an EvalPrediction namedtuple with predictions (the model's logits) and label_ids.
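A short sketch of the metric.compute call; evaluate.load fetches the metric script on first use, and the rouge metric additionally requires the rouge_score package to be installed.

```python
import evaluate

rouge = evaluate.load("rouge")

# Predictions and references are parallel lists of strings.
predictions = ["the cat sat on the mat"]
references = ["the cat is on the mat"]

scores = rouge.compute(predictions=predictions, references=references)
# scores is a dict with rouge1, rouge2, rougeL, and rougeLsum entries
```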
How do I upload a fine-tuned model to the Hugging Face Hub?
Log in with huggingface-cli login or login(token="hf_xxx") in Python. Then call model.save_pretrained("./my_model") and tokenizer.save_pretrained("./my_model") to save locally, and model.push_to_hub("username/my-model") to upload. You can make the repository private with create_repo("my-model", private=True) before pushing. The HfApi class provides programmatic control over repository management.
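The upload steps sketched in order, assuming a fine-tuned model and tokenizer are already in scope; the token and the username/my-model repo id are placeholders for your own.

```python
from huggingface_hub import create_repo, login

# Authenticate (or run `huggingface-cli login` in a shell instead).
login(token="hf_xxx")  # placeholder token

# Save weights, config, and tokenizer files locally.
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Create a private repo, then upload both model and tokenizer.
create_repo("username/my-model", private=True, exist_ok=True)
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")
```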
What is model.generate() and how do I control output length and sampling?
model.generate() runs autoregressive text generation. Pass input_ids from the tokenizer and use max_new_tokens to limit output length. Control quality with parameters like do_sample=True (enables sampling), temperature (lower = more focused), top_p (nucleus sampling), and repetition_penalty. For greedy decoding omit do_sample. The generated token IDs must be decoded with tokenizer.decode(outputs[0], skip_special_tokens=True).
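These parameters can be exercised with a tiny randomly initialised GPT-2 so the sketch runs without downloads; with a real checkpoint you would obtain input_ids from AutoTokenizer and decode the output with tokenizer.decode.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model standing in for a real pretrained causal LM.
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32,
                    n_layer=1, n_head=2)
model = GPT2LMHeadModel(config)
model.eval()

input_ids = torch.tensor([[1, 2, 3]])  # normally tokenizer(...).input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=10,       # cap on newly generated tokens
    do_sample=True,          # sampling instead of greedy decoding
    temperature=0.7,         # lower = more focused
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.2,
    pad_token_id=0,          # avoids a warning when no pad token is set
)
# outputs[0] holds the prompt plus the new token ids; with a real
# tokenizer: tokenizer.decode(outputs[0], skip_special_tokens=True)
```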