Setting up Hugging Face
Hugging Face packages
The Hugging Face ecosystem is spread across multiple packages that provide different functionalities. We’ll go through some of the major ones.
Transformers
Transformers is a core package of Hugging Face that defines what Hugging Face models and pipelines are. It makes it possible to run inference on models from the Hugging Face model hub.
The name of the library is transformers in conda-forge and in PyPI.
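As a minimal sketch, running inference through the pipeline API looks like this (the task’s default model is downloaded from the model hub on first use):

from transformers import pipeline

# Create a pipeline for a task; a default model for the task is
# downloaded from the Hugging Face model hub on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face pipelines make inference easy."))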
Datasets
Datasets is a library for downloading, using, and sharing datasets from the Hugging Face datasets hub.
The name of the library is datasets in conda-forge and in PyPI.
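As a quick sketch, downloading a dataset is a single call (the public imdb dataset is used here purely as an example):

from datasets import load_dataset

# Download (and cache) a dataset from the Hugging Face datasets hub
dataset = load_dataset("imdb")
print(dataset["train"][0])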
Accelerate
Accelerate is a library that simplifies model training with PyTorch. It supports distributed training as well.
The name of the library is accelerate in conda-forge and in PyPI.
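A minimal training-loop sketch shows the idea: you wrap your PyTorch objects with prepare() and let Accelerate place them on whatever hardware is available. The tiny model and random data below are placeholders for illustration only:

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # detects the available device(s) automatically

# Placeholder model, optimizer and data for illustration only
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 10), batch_size=8)

# prepare() moves everything to the right device and wraps the
# dataloader for distributed training if it is in use
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).mean()
    accelerator.backward(loss)  # used instead of loss.backward()
    optimizer.step()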
Hugging Face Hub Client
The Hugging Face Hub client library allows you to interact with the Hugging Face hub. Datasets and Transformers use it to download datasets and models from the Hugging Face hub. There is also a command line interface (CLI) that you can use to log in to the Hugging Face hub and to download models manually.
The name of the library is huggingface_hub in conda-forge and in PyPI. If you want the command line client when installing from PyPI, use huggingface_hub[cli].
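You can also use the library directly to fetch individual files. A small sketch (the repository and filename here are just examples):

from huggingface_hub import hf_hub_download

# Download one file from a model repository and return its local path
path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
print(path)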
Other libraries
In these demos we also use:
sentencepiece
bitsandbytes
Logging into Hugging Face hub
If you have created an account on the Hugging Face hub, you can log in with hf, the Hugging Face command line interface (in older versions this was huggingface-cli).
Run this in a command line:
hf auth login
If you do not have a token created, go to Settings > Access Tokens and create one. It’s a good idea to have individual tokens for different systems so you can revoke them if necessary.
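If you prefer to stay in Python (for example in a notebook), the same login can be done programmatically with the login helper from huggingface_hub:

from huggingface_hub import login

# Prompts for the access token interactively and stores it for later use;
# alternatively, pass token="hf_..." directly
login()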
Environment variables and storage locations
Hugging Face uses many environment variables (a more detailed list is here) to determine where it should store various files.
By default, the Hugging Face client library will store everything under HF_HOME, which is ~/.cache/huggingface on Linux systems.
Stored data includes:
Authentication tokens
Downloaded datasets
Downloaded models
On many systems you do not want to use this default location. On high-performance clusters you’ll most likely want to use a storage location reserved for your project for the large datasets.
Some of the most common environment variables to move things around are:
HF_HOME - Default location for storing everything related to the Hugging Face hub.
HF_HUB_CACHE - Location where cached models and datasets are downloaded. Defaults to $HF_HOME/hub.
HF_TOKEN_PATH - Location where your Hugging Face hub token is stored. Defaults to $HF_HOME/token. (The related HF_TOKEN variable holds the token value itself rather than a path.)
Usually you’ll want to set these in your submission scripts, but if you’re using notebooks, you might want to use notebook cells like these before importing Hugging Face libraries:
import os

# Here the environment variable WRKDIR points to a personal work directory
os.environ["HF_HOME"] = f"{os.environ['WRKDIR']}/huggingface"
# Point HF_TOKEN_PATH at an existing token file; expand ~ explicitly,
# since it is not expanded automatically in environment variables
os.environ["HF_TOKEN_PATH"] = os.path.expanduser("~/.cache/huggingface/token")