Today's Open Source (2024-10-29): Meta Open-Sources LongVU Large Model
LongVU enhances long video comprehension with spatiotemporal compression. CoI-Agent revolutionizes research via LLMs, and x.infer simplifies CV inference for 1,000+ models.
Here are some interesting AI open-source models and frameworks I wanted to share today:
Project: LongVU
The LongVU project aims to enhance long video language comprehension through spatiotemporal adaptive compression technology.
This project integrates advanced visual encoders and language models, effectively processing and understanding complex information in long videos.
LongVU provides multiple resource versions, supporting both local deployment and online demos, making it suitable for a wide range of applications requiring video and language data processing.
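The core intuition behind temporal compression is easy to sketch: adjacent video frames are often nearly identical, so frames whose features closely match the last kept frame can be dropped. The snippet below is an illustrative toy, not LongVU's actual pipeline (the function names and threshold are assumptions for the example):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def compress_frames(features, threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        if cosine(features[i], features[kept[-1]]) < threshold:
            kept.append(i)
    return kept

# Frames 1 and 3 are near-duplicates of their predecessors and get dropped.
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
print(compress_frames(frames))  # -> [0, 2]
```

LongVU's actual method adaptively compresses along both spatial and temporal axes; this sketch only shows the temporal half of that idea.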
Project: CoI-Agent
Chain of Ideas (CoI) Agent is a project designed to revolutionize research and idea development through large language model (LLM) agents.
The project offers a systematic approach to generating and developing research ideas, using advanced natural language processing and machine learning models to help researchers explore and innovate more efficiently.
https://github.com/DAMO-NLP-SG/CoI-Agent
Project: AgenticIR
This project tackles complex image restoration, using an intelligent agent system for tasks such as deblurring, dehazing, and image enhancement.
By leveraging learning and experience, the system effectively restores real-world image quality.
https://github.com/Kaiwen-Zhu/AgenticIR
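The agent idea here can be illustrated with a toy dispatcher: detect which degradations an image suffers from, then schedule the matching restoration tools. The tool names and planning logic below are illustrative assumptions, not AgenticIR's real implementation:

```python
# Hypothetical mapping from detected degradation to restoration tool.
TOOLS = {
    "blur": "deblur",
    "haze": "dehaze",
    "low_light": "enhance",
}

def plan(degradations):
    """Return the ordered list of restoration tools for the detected degradations."""
    return [TOOLS[d] for d in degradations if d in TOOLS]

print(plan(["haze", "blur"]))  # -> ['dehaze', 'deblur']
```

The real system goes further, using accumulated experience to decide the order in which tools are applied and to retry when a step degrades quality.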
Project: x.infer
x.infer is a framework-agnostic computer vision inference library that lets you run inference on a wide range of models with just a few lines of Python.
It supports various frameworks and over 1,000 models, providing a unified interface and modular design, allowing users to easily integrate and replace models.
x.infer also supports interactive interfaces through Gradio.
https://github.com/dnth/x.infer
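The "unified interface" design that makes this possible is essentially a model registry plus a common `infer` contract. The sketch below shows that pattern in miniature; the names (`register`, `create_model`, the dummy model) are illustrative and not x.infer's actual API:

```python
_REGISTRY = {}

def register(name):
    # Decorator that records a model class under a string identifier.
    def deco(cls):
        _REGISTRY[name] = cls
        return cls
    return deco

@register("dummy-classifier")
class DummyClassifier:
    def infer(self, image):
        # A real adapter would wrap a framework-specific model here.
        return {"label": "cat", "score": 0.9}

def create_model(name):
    """Look up a registered model by name and instantiate it."""
    return _REGISTRY[name]()

model = create_model("dummy-classifier")
print(model.infer("img.jpg"))  # -> {'label': 'cat', 'score': 0.9}
```

Because every model sits behind the same interface, swapping one of the 1,000+ supported models for another only requires changing the name passed to the factory.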
Project: PyramidDrop
PyramidDrop is a project aimed at accelerating large vision-language models by reducing visual redundancy.
The core idea is to drop visual tokens progressively across layer stages, exploiting the fact that redundancy among image tokens grows in deeper layers, which improves the efficiency of both training and inference.
PyramidDrop accelerates models during training and can also be used as a plug-and-play strategy for inference acceleration, offering both high performance and low inference costs.
https://github.com/Cooperx521/PyramidDrop
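The staged-dropping idea can be sketched as follows: at each stage, keep only the top-ranked fraction of visual tokens by some importance score, so that deeper layers process progressively fewer tokens. This is a simplified illustration under assumed parameters, not the paper's exact ranking criterion:

```python
def pyramid_drop(tokens, scores, stages=3, keep_ratio=0.5):
    """At each stage, keep the top keep_ratio fraction of tokens by score."""
    for _ in range(stages):
        k = max(1, int(len(tokens) * keep_ratio))
        # Rank token indices by importance, keep the top-k, restore order.
        ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
        ranked.sort()
        tokens = [tokens[i] for i in ranked]
        scores = [scores[i] for i in ranked]
    return tokens

# 8 tokens shrink to 4, then 2, then 1 across the three stages.
toks = list(range(8))
scs = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
print(pyramid_drop(toks, scs))  # -> [0]
```

With a 0.5 keep ratio per stage, the attention cost over visual tokens shrinks geometrically with depth, which is where the training and inference speedups come from.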
Project: LLaVA-MoD
LLaVA-MoD is an efficient framework designed to train small-scale multimodal language models by distilling knowledge from large-scale multimodal language models.
The project optimizes the network by integrating a sparse Mixture-of-Experts (MoE) architecture and adopts a two-stage knowledge transfer strategy: imitation distillation followed by preference distillation.
Experiments show that LLaVA-MoD outperforms existing models on multimodal benchmarks while activating fewer parameters and incurring lower computational cost.
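The imitation-distillation stage is commonly formulated as minimizing the KL divergence between the teacher's and student's next-token distributions; the sketch below illustrates that objective in pure Python (the exact loss used by LLaVA-MoD may differ in details such as temperature scaling):

```python
from math import log, exp

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q): how much the student distribution q diverges from teacher p."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([2.0, 1.0, 0.1])   # large-model output distribution
student = softmax([1.5, 1.2, 0.3])   # small-model output distribution
loss = kl_div(teacher, student)
print(loss > 0)  # -> True: the student has not yet matched the teacher
```

Minimizing this loss over the training corpus pushes the small model to imitate the large one token by token; the preference-distillation stage then refines which responses the student favors.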