open source AI

Exponential growth brews 1 million AI models on Hugging Face

The sand has come alive —

Hugging Face cites community-driven customization as fuel for diverse AI model boom.

Benj Edwards – Sep 26, 2024 7: 50 pm UTC

On Thursday, AI hosting platform Hugging Face surpassed 1 million AI model listings for the first time, marking a milestone in the rapidly expanding field of machine learning. An AI model is a computer program (often using a neural network) trained on data to perform specific tasks or make predictions. The platform, which started as a chatbot app in 2016 before pivoting to become an open source hub for AI models in 2020, now hosts a wide array of tools for developers and researchers.

The machine-learning field represents a far bigger world than just large language models (LLMs) like the kind that power ChatGPT. In a post on X, Hugging Face CEO Clément Delangue wrote about how his company hosts many high-profile AI models, like “Llama, Gemma, Phi, Flux, Mistral, Starcoder, Qwen, Stable diffusion, Grok, Whisper, Olmo, Command, Zephyr, OpenELM, Jamba, Yi,” but also “999,984 others.”

The reason why, Delangue says, stems from customization. “Contrary to the ‘1 model to rule them all’ fallacy,” he wrote, “smaller specialized customized optimized models for your use-case, your domain, your language, your hardware and generally your constraints are better. As a matter of fact, something that few people realize is that there are almost as many models on Hugging Face that are private only to one organization—for companies to build AI privately, specifically for their use-cases.”

Enlarge / A Hugging Face-supplied chart showing the number of AI models added to Hugging Face over time, month to month.

Hugging Face’s transformation into a major AI platform follows the accelerating pace of AI research and development across the tech industry. In just a few years, the number of models hosted on the site has grown dramatically along with interest in the field. On X, Hugging Face product engineer Caleb Fahlgren posted a chart of models created each month on the platform (and a link to other charts), saying, “Models are going exponential month over month and September isn’t even over yet.”

The power of fine-tuning

As hinted by Delangue above, the sheer number of models on the platform stems from the collaborative nature of the platform and the practice of fine-tuning existing models for specific tasks. Fine-tuning means taking an existing model and giving it additional training to add new concepts to its neural network and alter how it produces outputs. Developers and researchers from around the world contribute their results, leading to a large ecosystem.

For example, the platform hosts many variations of Meta’s open-weights Llama models that represent different fine-tuned versions of the original base models, each optimized for specific applications.

Hugging Face’s repository includes models for a wide range of tasks. Browsing its models page shows categories such as image-to-text, visual question answering, and document question answering under the “Multimodal” section. In the “Computer Vision” category, there are sub-categories for depth estimation, object detection, and image generation, among others. Natural language processing tasks like text classification and question answering are also represented, along with audio, tabular, and reinforcement learning (RL) models.

Enlarge / A screenshot of the Hugging Face models page captured on September 26, 2024.

Hugging Face

When sorted for “most downloads,” the Hugging Face models list reveals trends about which AI models people find most useful. At the top, with a massive lead at 163 million downloads, is Audio Spectrogram Transformer from MIT, which classifies audio content like speech, music, and environmental sounds. Following that, with 54.2 million downloads, is BERT from Google, an AI language model that learns to understand English by predicting masked words and sentence relationships, enabling it to assist with various language tasks.

Rounding out the top five AI models are all-MiniLM-L6-v2 (which maps sentences and paragraphs to 384-dimensional dense vector representations, useful for semantic search), Vision Transformer (which processes images as sequences of patches to perform image classification), and OpenAI’s CLIP (which connects images and text, allowing it to classify or describe visual content using natural language).

No matter what the model or the task, the platform just keeps growing. “Today a new repository (model, dataset or space) is created every 10 seconds on HF,” wrote Delangue. “Ultimately, there’s going to be as many models as code repositories and we’ll be here for it!”

Debate over “open source AI” term brings new push to formalize definition

AI, AI regulation, AI workshops, All Things Open, Biz & IT, chatgpt, chatgtp, flux, free software, GPT-4o, machine learning, Meta, Open Source, open source AI, Open Source Initiative, open weights, Open weights AI, openai, OSI, Raleigh, SB-1047, source available / Paul Patrick / August 27, 2024

Enlarge / A man peers over a glass partition, seeking transparency.

The Open Source Initiative (OSI) recently unveiled its latest draft definition for “open source AI,” aiming to clarify the ambiguous use of the term in the fast-moving field. The move comes as some companies like Meta release trained AI language model weights and code with usage restrictions while using the “open source” label. This has sparked intense debates among free-software advocates about what truly constitutes “open source” in the context of AI.

For instance, Meta’s Llama 3 model, while freely available, doesn’t meet the traditional open source criteria as defined by the OSI for software because it imposes license restrictions on usage due to company size or what type of content is produced with the model. The AI image generator Flux is another “open” model that is not truly open source. Because of this type of ambiguity, we’ve typically described AI models that include code or weights with restrictions or lack accompanying training data with alternative terms like “open-weights” or “source-available.”

To address the issue formally, the OSI—which is well-known for its advocacy for open software standards—has assembled a group of about 70 participants, including researchers, lawyers, policymakers, and activists. Representatives from major tech companies like Meta, Google, and Amazon also joined the effort. The group’s current draft (version 0.0.9) definition of open source AI emphasizes “four fundamental freedoms” reminiscent of those defining free software: giving users of the AI system permission to use it for any purpose without permission, study how it works, modify it for any purpose, and share with or without modifications.

By establishing clear criteria for open source AI, the organization hopes to provide a benchmark against which AI systems can be evaluated. This will likely help developers, researchers, and users make more informed decisions about the AI tools they create, study, or use.

Truly open source AI may also shed light on potential software vulnerabilities of AI systems, since researchers will be able to see how the AI models work behind the scenes. Compare this approach with an opaque system such as OpenAI’s ChatGPT, which is more than just a GPT-4o large language model with a fancy interface—it’s a proprietary system of interlocking models and filters, and its precise architecture is a closely guarded secret.

OSI’s project timeline indicates that a stable version of the “open source AI” definition is expected to be announced in October at the All Things Open 2024 event in Raleigh, North Carolina.

“Permissionless innovation”

In a press release from May, the OSI emphasized the importance of defining what open source AI really means. “AI is different from regular software and forces all stakeholders to review how the Open Source principles apply to this space,” said Stefano Maffulli, executive director of the OSI. “OSI believes that everybody deserves to maintain agency and control of the technology. We also recognize that markets flourish when clear definitions promote transparency, collaboration and permissionless innovation.”

The organization’s most recent draft definition extends beyond just the AI model or its weights, encompassing the entire system and its components.

For an AI system to qualify as open source, it must provide access to what the OSI calls the “preferred form to make modifications.” This includes detailed information about the training data, the full source code used for training and running the system, and the model weights and parameters. All these elements must be available under OSI-approved licenses or terms.

Notably, the draft doesn’t mandate the release of raw training data. Instead, it requires “data information”—detailed metadata about the training data and methods. This includes information on data sources, selection criteria, preprocessing techniques, and other relevant details that would allow a skilled person to re-create a similar system.

The “data information” approach aims to provide transparency and replicability without necessarily disclosing the actual dataset, ostensibly addressing potential privacy or copyright concerns while sticking to open source principles, though that particular point may be up for further debate.

“The most interesting thing about [the definition] is that they’re allowing training data to NOT be released,” said independent AI researcher Simon Willison in a brief Ars interview about the OSI’s proposal. “It’s an eminently pragmatic approach—if they didn’t allow that, there would be hardly any capable ‘open source’ models.”

Debate over “open source AI” term brings new push to formalize definition Read More »