Published work:
WOK papers: 34;
M³ixup: A multi-modal data augmentation approach for image captioning
Image Captioning via Dynamic Path Customization
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
3D-GRES: Generalized 3D Referring Expression Segmentation
Multi-branch Collaborative Learning Network for 3D Visual Grounding
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
HRSAM: Efficiently Segment Anything in High-Resolution Images
Evaluating and Analyzing Relationship Hallucinations in LVLMs
Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation
Image Captioning via Dynamic Path Customization
X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation
Toward Open-Set Human Object Interaction Detection
X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks
Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
Semi-Supervised Panoptic Narrative Grounding
Semi-Supervised Panoptic Narrative Grounding
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation
M3PS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization in E-commerce
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
Towards local visual modeling for image captioning
X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance
Towards Local Visual Modeling for Image Captioning
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance
Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
Dual-Level Collaborative Transformer for Image Captioning