Despite not having launched any AI models since the generative AI craze began, Apple is working on a number of AI projects. Just last week, Apple researchers shared a paper unveiling a new language model the company is working on, and insiders have reported that Apple has two AI-powered robots in the works. Now, the release of another research paper suggests that Apple is just getting started.
On Monday, Apple researchers published a research paper presenting Ferret-UI, a new Multimodal Large Language Model (MLLM) capable of understanding mobile user interface (UI) screens.
MLLMs differ from standard LLMs in that they go beyond text, with a deeper understanding of multimodal elements such as images and audio. In this case, Ferret-UI is trained to recognize various elements of the user’s home screen, such as app icons and small text.
Identifying app screen elements has been a challenge for MLLMs in the past because those elements tend to be small. To overcome this problem, according to the paper, the researchers added “any resolution” on top of Ferret, which allows the model to magnify details on the screen.
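To make the idea concrete, here is a minimal sketch of what an “any resolution”-style split might look like, assuming a simple two-way division along the screen’s longer axis. The function name, file name, and split logic below are illustrative assumptions, not Apple’s implementation:

```python
from PIL import Image

def anyres_split(screenshot: Image.Image) -> list[Image.Image]:
    """Illustrative "any resolution" split (an assumption, not the paper's
    exact recipe): cut the screen along its longer axis so each half can be
    encoded at higher effective resolution than the downscaled full image."""
    w, h = screenshot.size
    if h >= w:
        # Portrait phone screen: top and bottom halves.
        crops = [screenshot.crop((0, 0, w, h // 2)),
                 screenshot.crop((0, h // 2, w, h))]
    else:
        # Landscape screen: left and right halves.
        crops = [screenshot.crop((0, 0, w // 2, h)),
                 screenshot.crop((w // 2, 0, w, h))]
    # The full image plus the magnified crops are all passed to the vision
    # encoder, so small UI elements such as icons and fine text survive.
    return [screenshot] + crops

# Hypothetical usage with a saved home-screen capture:
images = anyres_split(Image.open("home_screen.png"))
```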
Building on this, Apple’s MLLM also boasts “referring, grounding, and reasoning capabilities,” which, according to the paper, allow Ferret-UI to fully understand UI screens and execute instructions based on a screen’s content, as seen in the image below.
K. You et al.
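For a sense of what “referring” and “grounding” mean in practice: referring tasks hand the model a screen region and ask what is there, while grounding tasks hand it a description and ask where it appears. The exchange below is a hypothetical illustration of that difference; the prompt format and coordinates are invented for the example, not taken from the paper:

```python
# Referring: region in, description out.
referring_query = {
    "image": "home_screen.png",  # hypothetical screenshot
    "prompt": "Describe the widget inside the box [24, 180, 400, 360].",
}
# Expected kind of answer: "A weather widget showing 64F and light rain."

# Grounding: description in, location out.
grounding_query = {
    "image": "home_screen.png",
    "prompt": "Where is the Settings app icon?",
}
# Expected kind of answer: "[48, 910, 132, 994]" -- raw pixel coordinates.
```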
To measure the model’s performance against other MLLMs, Apple researchers compared Ferret-UI to GPT-4V, OpenAI’s MLLM, on public benchmarks, elementary tasks, and advanced tasks.
Ferret-UI outperformed GPT-4V in almost every task in the elementary category, including icon recognition, OCR, widget classification, find icon, and find widget tasks on both iPhone and Android. The only exception was the “find text” task on iPhone, where GPT-4V performed slightly better than the Ferret models, as seen in the chart below.
K. You et al.
When it comes to grounding conversations in UI content, GPT-4V has a slight edge, beating Ferret-UI 93.4% to 91.7%. However, the researchers note that Ferret-UI’s performance is still “remarkable” because it generates raw coordinates instead of choosing from the set of pre-defined boxes GPT-4V relies on. You can find an example below.
K. You et al.
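That difference is worth dwelling on, because it changes the difficulty of the task: choosing from pre-defined boxes is effectively a multiple-choice question, while emitting raw coordinates requires the model to localize the element itself. Roughly (an illustrative contrast with made-up numbers, not either system’s actual interface):

```python
# GPT-4V-style grounding: pick one of the candidate boxes drawn on the image.
candidate_boxes = {1: (40, 900, 140, 1000), 2: (160, 900, 260, 1000)}
gpt4v_answer = 1  # effectively multiple choice among pre-annotated regions

# Ferret-UI-style grounding: produce the bounding box directly.
ferret_ui_answer = (48, 910, 132, 994)  # no candidates given; harder task
```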
The paper does not indicate what Apple plans to use the technology for, or whether it will use it at all. Instead, the researchers state more broadly that Ferret-UI’s advanced capabilities have the potential to positively impact UI-related applications.
“The advent of these enhanced capabilities promises substantial growth for many downstream UI applications, thus increasing the potential benefits provided by Ferret-UI in this domain,” the researchers wrote.
The ways Ferret-UI could improve Siri are obvious. Given the model’s thorough understanding of a user’s app screens and its knowledge of how to perform certain tasks, Ferret-UI could be used to supercharge Siri into performing tasks for you.
Consumers are definitely interested in an assistant that does more than just answer questions. New AI gadgets like the Rabbit R1 have gotten plenty of attention for being able to carry out an entire task for you, such as booking a flight or ordering food, without you having to spell out every step.