OpenAI DevDay was this week. What delicious and/or terrifying things await?
First off, we have GPT-4-Turbo.
Today we’re launching a preview of the next generation of this model, GPT-4 Turbo.
GPT-4 Turbo is more capable and has knowledge of world events up to April 2023. It has a 128k context window so it can fit the equivalent of more than 300 pages of text in a single prompt. We also optimized its performance so we are able to offer GPT-4 Turbo at a 3x cheaper price for input tokens and a 2x cheaper price for output tokens compared to GPT-4.
GPT-4 Turbo is available for all paying developers to try by passing
gpt-4-1106-preview
in the API and we plan to release the stable production-ready model in the coming weeks.
Knowledge up to April 2023 is a big game. Cutting the price in half is another big game. A 128k context window retakes the lead on that from Claude-2. That chart from last week of how GPT-4 was slow and expensive, opening up room for competitors? Back to work, everyone.
What else?
Function calling updates
Function calling lets you describe functions of your app or external APIs to models, and have the model intelligently choose to output a JSON object containing arguments to call those functions. We’re releasing several improvements today, including the ability to call multiple functions in a single message: users can send one message requesting multiple actions, such as “open the car window and turn off the A/C”, which would previously require multiple roundtrips with the model (learn more). We are also improving function calling accuracy: GPT-4 Turbo is more likely to return the right function parameters.
This kind of feature seems highly fiddly and dependent. When it starts working well enough, suddenly it is great, and I have no idea if this will count. I will watch out for reports. For now, I am not trying to interact with any APIs via GPT-4. Use caution.
Improved instruction following and JSON mode
GPT-4 Turbo performs better than our previous models on tasks that require the careful following of instructions, such as generating specific formats (e.g., “always respond in XML”). It also supports our new JSON mode, which ensures the model will respond with valid JSON. The new API parameter
response_format
enables the model to constrain its output to generate a syntactically correct JSON object. JSON mode is useful for developers generating JSON in the Chat Completions API outside of function calling.
Better instruction following is incrementally great. Always frustrating when instructions can’t be relied upon. Could allow some processes to be profitably automated.
Reproducible outputs and log probabilities
The new
seed
parameter enables reproducible outputs by making the model return consistent completions most of the time. This beta feature is useful for use cases such as replaying requests for debugging, writing more comprehensive unit tests, and generally having a higher degree of control over the model behavior. We at OpenAI have been using this feature internally for our own unit tests and have found it invaluable. We’re excited to see how developers will use it. Learn more.We’re also launching a feature to return the log probabilities for the most likely output tokens generated by GPT-4 Turbo and GPT-3.5 Turbo in the next few weeks, which will be useful for building features such as autocomplete in a search experience.
I love the idea of seeing the probabilities of different responses on the regular, especially if incorporated into ChatGPT. It provides so much context for knowing what to make of the answer. The distribution of possible answers is the true answer. Super excited in a good way.
Updated GPT-3.5 Turbo
In addition to GPT-4 Turbo, we are also releasing a new version of GPT-3.5 Turbo that supports a 16K context window by default. The new 3.5 Turbo supports improved instruction following, JSON mode, and parallel function calling. For instance, our internal evals show a 38% improvement on format following tasks such as generating JSON, XML and YAML. Developers can access this new model by calling
gpt-3.5-turbo-1106
in the API. Applications using thegpt-3.5-turbo
name will automatically be upgraded to the new model on December 11. Older models will continue to be accessible by passinggpt-3.5-turbo-0613
in the API until June 13, 2024. Learn more.
Some academics will presumably grumble that the old version is going away. Such incremental improvements seem nice, but with GPT-4 getting a price cut and turbo boost, should be less call for 3.5. I can still see using it in things like multi-agent world simulations.
This claims you can now use GPT 3.5 at only a modest additional marginal cost versus Llama-2.
Hamel Husain: What’s wild is the new pricing for GPT 3.5 is competitive with commercially hosted ~ 70B Llama endpoints like those offered by anyscale and http://fireworks.ai. Cost is eroding as a moat gpt-3.5-turbo-1106 Pricing is $1/1M input and $2/1M output. [Versus 0.15 and 0.20 per million tokens]
I don’t interpret the numbers that way yet. There is still a substantial difference at scale, a factor of five or six. If you cannot afford the superior GPT-4 for a given use case, you may want the additional discount. And as all costs go down, there will be temptation to use far more queries. A factor of five is not nothing.
I’m going to skip ahead a bit to take care of all the incremental stuff first:
All right, back to normal unscary things, there’s new modalities?
New modalities in the API
GPT-4 Turbo with vision
GPT-4 Turbo can accept images as inputs in the Chat Completions API, enabling use cases such as generating captions, analyzing real world images in detail, and reading documents with figures. For example, BeMyEyes uses this technology to help people who are blind or have low vision with daily tasks like identifying a product or navigating a store. Developers can access this feature by using
gpt-4-vision-preview
in the API. We plan to roll out vision support to the main GPT-4 Turbo model as part of its stable release. Pricing depends on the input image size. For instance, passing an image with 1080×1080 pixels to GPT-4 Turbo costs $0.00765. Check out our vision guide.DALL·E 3
Developers can integrate DALL·E 3, which we recently launched to ChatGPT Plus and Enterprise users, directly into their apps and products through our Images API by specifying
dall-e-3
as the model. Companies like Snap, Coca-Cola, and Shutterstock have used DALL·E 3 to programmatically generate images and designs for their customers and campaigns. Similar to the previous version of DALL·E, the API incorporates built-in moderation to help developers protect their applications against misuse. We offer different format and quality options, with prices starting at $0.04 per image generated. Check out our guide to getting started with DALL·E 3 in the API.Text-to-speech (TTS)
Developers can now generate human-quality speech from text via the text-to-speech API. Our new TTS model offers six preset voices to choose from and two model variants,
tts-1
andtts-1-hd
.tts
is optimized for real-time use cases andtts-1-hd
is optimized for quality. Pricing starts at $0.015 per input 1,000 characters. Check out our TTS guide to get started.
I can see the DALL-E 3 prices adding up to actual money. When I use Stable Diffusion, it is not so unusual that I ask for the full 100 generations, then go away for a while and come back, why not? Of course, it would be worth it for the quality boost, provided DALL-E 3 was willing to do whatever I happened to want that day. The text-to-speech seems not free but highly reasonably priced. All the voices seem oddly similar. I do like them. When do we get our licensed celebrity voice options? So many good choices.
Model customization
GPT-4 fine tuning experimental access
We’re creating an experimental access program for GPT-4 fine-tuning. Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains realized with GPT-3.5 fine-tuning. As quality and safety for GPT-4 fine-tuning improves, developers actively using GPT-3.5 fine-tuning will be presented with an option to apply to the GPT-4 program within their fine-tuning console.
All right, sure, I suppose it is that time, makes sense that improvement is harder. Presumably it is easier if you want a quirkier thing. I do not know how the fine-tuning is protected against jailbreak attempts, anyone want to explain?
Custom models
For organizations that need even more customization than fine-tuning can provide (particularly applicable to domains with extremely large proprietary datasets—billions of tokens at minimum), we’re also launching a Custom Models program, giving selected organizations an opportunity to work with a dedicated group of OpenAI researchers to train custom GPT-4 to their specific domain. This includes modifying every step of the model training process, from doing additional domain specific pre-training, to running a custom RL post-training process tailored for the specific domain. Organizations will have exclusive access to their custom models. In keeping with our existing enterprise privacy policies, custom models will not be served to or shared with other customers or used to train other models. Also, proprietary data provided to OpenAI to train custom models will not be reused in any other context. This will be a very limited (and expensive) program to start—interested orgs can apply here.
Expensive is presumably the watchword. This will not be cheap. Then again, compared to the potential, could be very cheap indeed.
So far, so incremental, you love to see it, and… wait, what?
Today, we’re releasing the Assistants API, our first step towards helping developers build agent-like experiences within their own applications. An assistant is a purpose-built AI that has specific instructions, leverages extra knowledge, and can call models and tools to perform tasks. The new Assistants API provides new capabilities such as Code Interpreter and Retrieval as well as function calling to handle a lot of the heavy lifting that you previously had to do yourself and enable you to build high-quality AI apps.
This API is designed for flexibility; use cases range from a natural language-based data analysis app, a coding assistant, an AI-powered vacation planner, a voice-controlled DJ, a smart visual canvas—the list goes on. The Assistants API is built on the same capabilities that enable our new GPTs product: custom instructions and tools such as Code interpreter, Retrieval, and function calling.
A key change introduced by this API is persistent and infinitely long threads, which allow developers to hand off thread state management to OpenAI and work around context window constraints. With the Assistants API, you simply add each new message to an existing
thread
.Assistants also have access to call new tools as needed, including:
Code Interpreter: writes and runs Python code in a sandboxed execution environment, and can generate graphs and charts, and process files with diverse data and formatting. It allows your assistants to run code iteratively to solve challenging code and math problems, and more.
Retrieval: augments the assistant with knowledge from outside our models, such as proprietary domain data, product information or documents provided by your users. This means you don’t need to compute and store embeddings for your documents, or implement chunking and search algorithms. The Assistants API optimizes what retrieval technique to use based on our experience building knowledge retrieval in ChatGPT.
Function calling: enables assistants to invoke functions you define and incorporate the function response in their messages.
As with the rest of the platform, data and files passed to the OpenAI API are never used to train our models and developers can delete the data when they see fit.
You can try the Assistants API beta without writing any code by heading to the Assistants playground.
It has been months since I wrote On AutoGPT and everyone was excited. All the hype around agents died off and everyone seemed to despair of making them work in the current model generation. OpenAI had the inside track in many ways, so perhaps they made it work a lot better? We will find out. If you’re not a little nervous, that seems like a mistake.
All right, what’s up with these ‘GPTs’?
First off, horrible name, highly confusing, please fix. Alas, they won’t.
All right, what do we got?
Ah, one of the obvious things we should obviously do that will open up tons of possibilities, I feel bad I didn’t say it explicitly and get zero credit but we were all thinking it.
We’re rolling out custom versions of ChatGPT that you can create for a specific purpose—called GPTs. GPTs are a new way for anyone to create a tailored version of ChatGPT to be more helpful in their daily life, at specific tasks, at work, or at home—and then share that creation with others. For example, GPTs can help you learn the rules to any board game, help teach your kids math, or design stickers.
Anyone can easily build their own GPT—no coding is required. You can make them for yourself, just for your company’s internal use, or for everyone. Creating one is as easy as starting a conversation, giving it instructions and extra knowledge, and picking what it can do, like searching the web, making images or analyzing data.
Example GPTs are available today for ChatGPT Plus and Enterprise users to try out including Canva and Zapier AI Actions. We plan to offer GPTs to more users soon.
There will be an app GPT store, your privacy is safe, if you are feeling frisky you can connect your APIs and then perhaps noting is safe. This is a minute-long demo of a Puppy Hotline, which is an odd example since I’m not sure why all of that shouldn’t work normally anyway.
An incrementally better example is Sam Altman’s creation of the Startup Mentor, which he has grill the user on why they are not growing faster. Again, this is very much functionally a configuration of an LLM, a GPT, rather than an agent. It might include some if-then statements, perhaps. Which is all good, these are things we want and don’t seem dangerous.
Tyler Cowen’s GOAT is another example. Authors can upload a book plus some instructions, suddenly you can chat with the book, takes minutes to hours of your time.
The educational possibilities alone write themselves. The process is so fast you can create one of these daily for a child’s homework assignment, or create one for yourself to spend an hour learning about something.
One experiment I want someone to run is to try using this to teach someone a foreign language. Consider Scott Alexander’s proposed experiment, where you start with English, and then gradually move over to Japanese grammar and vocabulary over time. Now consider doing that with a GPT, as you do what you were doing anyway, and where you can pause and ask if anything is ever confusing, and you can reply back in a hybrid as well.
The right way to use ChatGPT going forward might be to follow the programmer’s maxim that if you do it three times, you should automate it, except now the threshold might be two and if something is nontrivial it also might be one. You can use others’ versions, but there is a lot to be said for rolling one’s own if the process is easy. If it works well enough, of course. But if so, game changer.
Also goes without saying that if you could combine this with removing the adult content filtering, especially if you still had image generation and audio but even without them, that would be a variety of products in very high demand.
Ethan Mollick sums up the initial state of GPTs this way:
Right now, GPTs are the easiest way of sharing structured prompts, which are programs, written in plain English (or another language), that can get the AI to do useful things. I discussed creating structured prompts last week, and all the same techniques apply, but the GPT system makes structured prompts more powerful and much easier to create, test, and share. I think this will help solve some of the most important AI use cases (how do I give people in my school, organization, or community access to a good AI tool?)
GPTs show a near future where AIs can really start to act as agents, since these GPTs have the ability to connect to other products and services, from your email to a shopping website, making it possible for AIs to do a wide range of tasks. So GPTs are a precursor of the next wave of AI.
They also suggest new future vulnerabilities and risks. As AIs are connected to more systems, and begin to act more autonomously, the chance of them being used maliciously increases.
…
The easy way to make a GPT is something called GPT Builder. In this mode, the AI helps you create a GPT through conversation. You can also test out the results in a window on the side of the interface and ask for live changes, creating a way to iterate and improve your work.
…
Behind the scenes, based on the conversation I had, the AI is filling out a detailed configuration of the GPT, which I can also edit manually.
…
To really build a great GPT, you are going to need to modify or build the structured prompt yourself.
As usual, reliability is not perfect, and mistakes are often silent, a warning not to presume or rely upon a GPT will properly absorb details.
The same thing is true here. The file reference system in the GPTs is immensely powerful, but is not flawless. For example, I fed in over 1,000 pages of rules across seven PDFs for an extremely complex game, and the AI was able to do a good job figuring out the rules, walking me through the process of getting started, and rolling dice to help me set up a character. Humans would have struggled to make all of that work. But it also made up a few details that weren’t in the game, and missed other points entirely. I had no warning that these mistakes happened, and would not have noticed them if I wasn’t cross-referencing the rules myself.
I am totally swamped right now. I am also rather excited to build once I get access.
Alas, for now, that continues to wait.
Sam Altman (November 8): usage of our new features from devday is far outpacing our expectations. we were planning to go live with GPTs for all subscribers Monday but still haven’t been able to. we are hoping to soon. there will likely be service instability in the short term due to load. sorry :/
Kevin Fischer: I like to imagine this is GPT coming alive.
Luckily that is obviously not what is happening, but for the record I do not like to imagine that, because I like remaining alive.
There are going to be a lot of cool things. Also a lot of things that aspire to be cool, that claim they will be cool, that are not cool.
Charles Frye: hope i am proven wrong in my fear that “GPTs” will 100x this tech’s reputation for vaporous demoware.
Vivek Ponnaiyan: It’ll be long tail dynamics like the apple App Store.
Agents are where claims of utility go to die. Even if some of it starts working, expect much death to continue.
From the presentation it seems they will be providing a copyright shield to ChatGPT users in case they get sued. This seems like a very SBF-style moment? It is a great idea, except when maybe just maybe it destroys your entire company, but that totally won’t happen right?
Will this kill a bunch of start-ups, the way Microsoft would by incorporating features into Windows? Yes. It is a good thing, the new way is better for everyone, time to go build something else. Should have planned ahead.
Laura Wendel: new toxic relationship just dropped
Brotzky: All the jokes about OpenAI killing startups with each new release have some validity.
We just removed Pinecone and Langchain from our codebase lowering our monthly fees and removing a lot of complexity.
New Assistants API is fantastic ✨
Downside: having to poll Runs endpoint for a result..
Some caveats
– our usecase is “simple” and assistants api fit us perfectly
– we don’t use agents yet
– we use files a lot Look forward to all this AI competition bringing costs down.
Sam Hogan: Just tested OpenAI’s new Assistant’s API.
This is now all the code you need to create a custom ChatGPT trained on an entire website.
Less than 30 lines 🤯
McKay Wrigley: I’m blown away by OpenAI DevDay… I can’t put into words how much the world just changed. This is a 1000x improvement. We’re living in the infancy of an AI revolution that will bring us a golden age beyond our wildest dreams. It’s time to build.
Interestingly, I had 100x typed out originally and changed it to 1000x. These are obviously not measurable, but it more accurately conveys how I feel. I don’t think people truly grasp (myself included!) what just got unlocked and what is about to happen.
Look, no, stop. This stuff is cool and all. I am super excited to play with GPTs and longer context windows and feature integration. Is it all three orders of magnitude over the previous situation? I mean, seriously, what are you even talking about? I know words do not have meaning, but what of the sacred trust that is numbers?
I suppose if you think of it as ‘10 times better’ meaning ‘good enough that you could use this edge to displace an incumbent service’ rather than ‘this is ten times as useful or valuable’ then yes if it they did their jobs this seems ten times better. Maybe even ten times better twice. But to multiply that together is at best to mix metaphors, and this does not plausibly constitute three consecutive such disruptions.
Unless these new agents actually work far better than anyone expects, in which case who knows. I will note that if so, that does not seem like especially… good… news, in a ‘I hope we are not all about to die’ kind of way.
It is also worth noting that this all means that when GPT-5 does arrive, there will be all this infrastructure waiting to go, that will suddenly get very interesting.
Paul Graham retweeted the above quote, and also this related one:
Paul Graham: This is not a random tech visionary. This is a CEO of an actual AI company. So when he says more has happened in the last year than the previous ten, it’s not just a figure of speech.
Alexander Wang (CEO Scale AI): more has happened in the last year of AI than the prior 10 we are unmistakably in the fiery takeoff of the most important technology of the rest of our lives everybody—governments, citizens, technologists—is awaiting w/bated breath (some helplessly) the next version of humanity.
Being the CEO of an AI company does not seem incompatible with what is traditionally referred to as ‘hype.’ When people in AI talk about factors of ten, one cannot automatically take it at face value, as we saw directly above.
Also, yes, ‘the next version of humanity’ sounds suspiciously like tech speak for ‘also we are all quite possibly going to die, but that is a good thing.’
Ben Thompson has covered a lot of keynote presentations, and found this to be an excellent keynote presentation, perhaps a sign they will make a comeback. While he found the new GPTs exciting, he focuses in terms of business impact where I initially focused, which was on the ordinary and seamless feature improvements.
Users get faster responses, a larger context window, more up to date knowledge, and better integration of modalities of web browsing, vision, hearing, speech and image generation. Quality of life improvements, getting rid of annoyances, filling in practical gaps that made a big marginal difference.
How good are the new context windows? Greg Kamradt reports.
Findings:
GPT-4’s recall performance started to degrade above 73K tokens
Low recall performance was correlated when the fact to be recalled was placed between at 7%-50% document depth
If the fact was at the beginning of the document, it was recalled regardless of context length
So what: No Guarantees – Your facts are not guaranteed to be retrieved. Don’t bake the assumption they will into your applications
Less context = more accuracy – This is well know, but when possible reduce the amount of context you send to GPT-4 to increase its ability to recall
Position matters – Also well know, but facts placed at the very beginning and 2nd half of the document seem to be recalled better.
It makes sense to allow 128k tokens or even more than that, even if performance degrades starting around 73k. For practical purposes, sounds like we want to stick to the smaller amount, but it is good not to have a lower hard cap.
Will Thompson be right that the base UI, essentially no UI at all, is where most users will want to remain and what matters most? Or will we all be using GPTs all the time?
In the short term he is clearly correct. The incremental improvements matter more. But as we learn to build GPTs for ourselves both quick and dirty and bespoke, and learn to use those of others, I expect there to be very large value adds, even if it is ‘you find the 1-3 that work for you and always use them.’
Kevin Fisher notes something I have found as well: LLMs can use web browsing in a pinch, but when you have the option you usually want to avoid this. ‘Do not use web browsing’ will sometimes be a good message to include. Kevin is most concerned about speed but I’ve also found other problems.
Nathan Lebenz suggested on a recent podcast that the killer integration is GPT-4V plus web browsing, allowing the LLM to browse the web and accomplish things. Here is vimGPT, a first attempt. Here are some more demos. We should give people time to see what they can come up with.
Currently conspicuously absent is the ability to make a GPT that selects between a set of available GPTs and then seamlessly calls the one most appropriate to your query. That would combine the functionality into the invisible ‘ultimate’ UI of a text box and attachment button, an expansion of seamless switching between existing base modalities. For now, one can presumably still do this ‘the hard way’ by calling other things that then call your GPTs.