The Current State of GPT4All

22 Dec 2024

Abstract and 1. Introduction

2 The Original GPT4All Model

2.1 Data Collection and Curation

2.2 Model Training, 2.3 Model Access and 2.4 Model Evaluation

3 From a Model to an Ecosystem

3.1 GPT4All-J: Repository Growth and the implications of the LLaMA License

3.2 GPT4All-Snoozy: the Emergence of the GPT4All Ecosystem

3.3 The Current State of GPT4All

4 The Future of GPT4All

Limitations and References

3.3 The Current State of GPT4All

Today, GPT4All is focused on improving the accessibility of open source language models. The repository provides compressed versions of open source models for use on commodity hardware, stable and simple high-level model APIs, and a GUI for no-code model experimentation. The project continues to grow in popularity and, as of August 1, 2023, has garnered over 50,000 GitHub stars and over 5,000 forks.

Figure 1: TSNE visualizations showing the progression of the GPT4All train set. Panel (a) shows the original uncurated data. The red arrow denotes a region of highly homogeneous prompt-response pairs. The coloring denotes which open dataset contributed the prompt. Panel (b) shows the original GPT4All data after curation. This panel, as well as panels (c) and (d), is colored by topic, which Atlas automatically extracts. Notice that the large homogeneous prompt-response blobs no longer appear. Panel (c) shows the GPT4All-J dataset. The "starburst" clusters introduced on the right side of the panel correspond to the newly added creative data. Panel (d) shows the final GPT4All-Snoozy dataset. All datasets have been released to the public and can be interactively explored online. In the web version of this article, you can click on a panel to be taken to its interactive visualization.

Table 1: Evaluations of all language models in the GPT4All ecosystem as of August 1, 2023. Code models are not included. OpenAI’s text-davinci-003 is included as a point of comparison. The best overall performing model in the GPT4All ecosystem, Nous-Hermes2, achieves over 92% of the average performance of text-davinci-003. Models marked with an asterisk were available in the ecosystem as of the release of GPT4All-Snoozy. Note that at release, GPT4All-Snoozy had the best average performance of any model in the ecosystem. Bolded numbers indicate the best performing model as of August 1, 2023.
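The "92% of the average performance" figure in the Table 1 caption can be read as a ratio of benchmark averages. The sketch below shows that arithmetic under two assumptions: that the comparison is the mean score of one model divided by the mean score of the baseline over the same benchmarks, and that the scores are stored as plain dictionaries. The function name `relative_average` and the benchmark names are hypothetical, and the numbers are placeholders, not values from Table 1.

```python
from statistics import mean


def relative_average(model_scores, baseline_scores):
    """Return mean(model) / mean(baseline) over the benchmarks both share."""
    shared = sorted(set(model_scores) & set(baseline_scores))
    if not shared:
        raise ValueError("no benchmarks in common")
    model_avg = mean(model_scores[name] for name in shared)
    baseline_avg = mean(baseline_scores[name] for name in shared)
    return model_avg / baseline_avg


# Placeholder scores for illustration only; they are NOT taken from Table 1.
baseline = {"task_a": 80.0, "task_b": 60.0}
candidate = {"task_a": 72.0, "task_b": 57.0}

ratio = relative_average(candidate, baseline)
print(f"{ratio:.1%}")  # prints 92.1%
```

Averaging first and dividing second keeps one very low baseline score from dominating the result, which an average of per-benchmark ratios would not do. Whether the paper used exactly this definition is not stated in the caption.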

Figure 2: Comparison of the GitHub star growth of GPT4All, Meta’s LLaMA, and Stanford’s Alpaca. We conjecture that GPT4All achieved and maintains faster ecosystem growth due to its focus on access, which allows more users to meaningfully participate.

GPT4All currently provides native support and benchmark data for over 35 models (see Figure 1), and includes several models co-developed with industry partners such as Replit and Hugging Face. GPT4All also provides high-level model APIs in languages including Python, TypeScript, Go, C#, and Java, among others. Furthermore, the GPT4All no-code GUI currently supports the workflows of over 50,000 monthly active users, with over 25% of users coming back to the tool every day of the week. (Note that all GPT4All user data is collected on an opt-in basis.) GPT4All has become the top language model integration in the popular open source AI orchestration library LangChain (Chase, 2022), and powers many popular open source projects such as PrivateGPT (imartinez, 2023), Quiver (StanGirard, 2023), and MindsDB (MindsDB, 2023), among others. GPT4All is the third fastest growing GitHub repository of all time (Leo, 2023), and is the 185th most popular repository on the platform by star count.
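As a rough illustration of what a high-level Python API of this kind looks like in use, the sketch below wraps a single prompt-and-response call. It assumes the `gpt4all` package from PyPI, whose `GPT4All` class exposes a `generate(prompt, max_tokens=...)` method. The helper names `ask`, `load_local_model`, and `demo` are hypothetical, and the model file name is an example that may not match the models currently distributed by the project.

```python
from typing import Protocol


class TextGenerator(Protocol):
    """Anything exposing generate(prompt, max_tokens=...) and returning text."""

    def generate(self, prompt: str, max_tokens: int = 200) -> str: ...


def ask(model: TextGenerator, question: str, max_tokens: int = 128) -> str:
    """Send one prompt to the model and return the completion, whitespace-trimmed."""
    return model.generate(question, max_tokens=max_tokens).strip()


def load_local_model(model_file: str) -> TextGenerator:
    """Load a model through the gpt4all bindings (assumed installed via pip)."""
    # Imported lazily so the helpers above work without the package installed.
    from gpt4all import GPT4All

    # The bindings fetch the file on first use if it is not already on disk.
    return GPT4All(model_file)


def demo() -> None:
    """Example run; the model file name is an assumption, not from the paper."""
    model = load_local_model("orca-mini-3b-gguf2-q4_0.gguf")
    print(ask(model, "Name three uses of a local language model."))
```

Calling `demo()` downloads the model on first use and runs inference on the local CPU. Typing `ask` against a small protocol rather than the concrete class keeps the helper usable with a stub object, which is convenient for testing without a model file.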

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Yuvanesh Anand, Nomic AI;

(2) Zach Nussbaum, Nomic AI;

(3) Adam Treat, Nomic AI;

(4) Aaron Miller, Nomic AI;

(5) Richard Guo, Nomic AI;

(6) Ben Schmidt, Nomic AI;

(7) GPT4All Community, Planet Earth;

(8) Brandon Duderstadt, Nomic AI, with Shared Senior Authorship;

(9) Andriy Mulyar, Nomic AI, with Shared Senior Authorship.