Apple claims its ReALM is better than GPT-4 at this task. What is it?

Apple researchers on Friday released a preprint paper on the company’s ReALM large language model, claiming that it can “substantially outperform” OpenAI’s GPT-4 on particular benchmarks. ReALM is designed to understand references in different contexts: in theory, a user could point to something on the screen, or running in the background, and ask the model about it.

Reference resolution is the linguistic problem of working out what a particular expression refers to. When we speak, we use references like “they” or “that,” and what those words point to is usually obvious to a human listener who has the surrounding context. A chatbot like ChatGPT, however, may sometimes struggle to understand exactly what you are referring to.
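To make the task concrete, here is a toy sketch of reference resolution: given an utterance containing a pronoun and a list of candidate entities, pick the entity the pronoun refers to. This is purely illustrative and not from Apple’s paper; a real system would use a language model rather than the simple word-overlap scorer below.

```python
# Toy stand-in for reference resolution: rank candidate entities by
# how many words they share with the user's utterance. A real system
# (like ReALM) would use a trained language model for this ranking.

def resolve_reference(utterance, candidates):
    """Return the candidate entity whose description best overlaps
    with the utterance."""
    words = set(utterance.lower().split())
    return max(candidates, key=lambda c: len(words & set(c.lower().split())))

# "Call that number" should resolve to the phone-number entity,
# not the podcast.
entities = ["phone number 555-0123 shown on screen",
            "podcast playing in the background"]
print(resolve_reference("call that number", entities))
```

The hard cases, of course, are the ones word overlap cannot solve, which is where a model that understands screen and conversation context comes in.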

This ability to understand exactly what is being referred to is important for chatbots. Letting users refer to something on a screen as “that” or “it” and having a chatbot understand them perfectly would be crucial to creating a truly hands-free screen experience, according to Apple.

This latest paper is the third one on AI that Apple has published in the past few months. While it is still too early to predict anything, these papers could be read as an early teaser of features the company plans to bring to software offerings like iOS and macOS.

In the paper, researchers wrote that they want to use ReALM to understand and identify three kinds of entities — onscreen entities, conversational entities, and background entities. Onscreen entities are things that are displayed on the user’s screen. Conversational entities are those that are relevant to the conversation. For example, if you say “what workouts am I supposed to do today?” to a chatbot, it should be able to work out from previous conversations that you are on a 3-day workout schedule and what the schedule for the day is.

Background entities are things that do not fall into the previous two categories but are still relevant: a podcast playing in the background, say, or a notification that has just sounded. Apple wants ReALM to be able to understand when a user refers to these as well.
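The three categories above can be pictured as a simple data structure. The sketch below is our own illustration, not Apple’s code; the class and field names are hypothetical. It also shows the general idea of flattening such context into plain text so that a text-only model can reason over it.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical illustration of the three entity categories the paper
# describes; the names here are ours, not Apple's.

class EntityKind(Enum):
    ONSCREEN = "onscreen"              # visible on the user's display
    CONVERSATIONAL = "conversational"  # mentioned earlier in the dialogue
    BACKGROUND = "background"          # e.g. a playing podcast, a notification

@dataclass
class Entity:
    kind: EntityKind
    description: str

def serialize_context(entities):
    """Flatten the entities into plain text, one per line, so a
    text-only language model can take them as part of its prompt."""
    return "\n".join(f"[{e.kind.value}] {e.description}" for e in entities)

context = [
    Entity(EntityKind.ONSCREEN, "button labelled 'Pay now'"),
    Entity(EntityKind.BACKGROUND, "podcast episode currently playing"),
]
print(serialize_context(context))
```

Serializing everything to text is one plausible way a purely textual model can handle on-screen and background references without ever seeing pixels.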

“We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5 per cent for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it,” wrote the researchers in the paper.

But note that GPT-3.5 accepts only text, so the researchers’ input to it was the prompt alone; for GPT-4, they also provided a screenshot for the task, which improved its performance substantially.

“Note that our ChatGPT prompt and prompt+image formulation are, to the best of our knowledge, in and of themselves novel. While we believe it might be possible to further improve results, for example, by sampling semantically similar utterances up until we hit the prompt length, this more complex approach deserves further, dedicated exploration, and we leave this to future work,” added the researchers in the paper.

So while ReALM works better than GPT-4 on this particular benchmark, it would be far from accurate to say it is a better model overall. ReALM simply beat GPT-4 on a task it was specifically designed to be good at. It is also not immediately clear when, or how, Apple plans to integrate ReALM into its products.
