What is Ferret?
Ferret is a multimodal large language model (MLLM) that can both refer to and ground objects in images based on instructions. The model combines a hybrid region representation with a spatial-aware visual sampler to achieve precise referring and grounding. It was trained on the GRIT dataset and is evaluated on a companion benchmark called Ferret-Bench. The code and checkpoints for Ferret are available on GitHub.
In an October post on X, Apple AI/ML research scientist Zhe Gan described Ferret as a system with the ability to “identify and pinpoint anything in an image at any level of detail.” In practice, this means it can analyze regions of arbitrary shape within an image, not just rectangular boxes.
Put simply, the model can focus on a specific region marked within an image, recognize the elements there that are relevant to a user’s query, pinpoint them, and outline them with bounding boxes. It can then fold those recognized elements into its answer and respond in ordinary conversational language.
For instance, if you highlight an animal in an image and ask what it is, Ferret can identify the species and understand that you are singling out one creature among many. Drawing on the context of the other objects it detects in the image, it can also offer related information.
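To make that round trip concrete, here is a minimal Python sketch of what a referring-and-grounding exchange could look like. The `<region>` tag, the coordinate convention, and both helper functions are illustrative assumptions for this article, not the actual interface of the apple/ml-ferret repository: the region the user marks goes in as coordinates alongside the question, and the reply ties each mentioned object back to its own box.

```python
# Hedged sketch of a referring-and-grounding round trip. The <region> tag,
# coordinate convention, and both helpers are assumptions made for
# illustration; they are not taken from the apple/ml-ferret code.
import re

def make_referring_prompt(question: str, box: tuple[int, int, int, int]) -> str:
    """Embed the user-marked region as discrete coordinates in the prompt."""
    x1, y1, x2, y2 = box
    return f"{question} <region>[{x1}, {y1}, {x2}, {y2}]</region>"

def extract_grounding(reply: str) -> list[tuple[str, tuple[int, ...]]]:
    """Pull (phrase, box) pairs from a reply that grounds each phrase with
    inline coordinates, e.g. 'a husky [120, 140, 320, 400]'."""
    pattern = re.compile(r"([\w\s]+?)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")
    return [(m.group(1).strip(), tuple(int(n) for n in m.groups()[1:]))
            for m in pattern.finditer(reply)]

# Referring: the highlighted animal goes in as a box next to the question.
prompt = make_referring_prompt("What animal is this?", (120, 140, 320, 400))
print(prompt)

# Grounding: a reply in this style ties every object back to coordinates.
reply = ("The marked animal is a husky [120, 140, 320, 400]; "
         "a second dog [410, 120, 610, 380] sits beside it.")
print(extract_grounding(reply))
```

The key point is the symmetry: coordinates go in with the question (referring) and come back attached to phrases in the answer (grounding).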
What Are People Discussing?
- Reddit’s r/Apple noticed that Ferret was trained on 8 A100 GPUs, each with 80GB of memory.
- Threads on Hacker News (news.ycombinator.com)
- Threads on the AppleInsider forums (forums.appleinsider.com)
Some of the discussion revolves around Apple’s artificial-intelligence capabilities, their potential in the consumer market, and the shortcomings of its current ML products (such as Siri and predictive text). There is also talk about the limitations of Core ML and the potential for MLX to fill that gap.