Published works:
WOK papers: 65;
M³ixup: A Multi-Modal Data Augmentation Approach for Image Captioning
Deep Hybrid Transformer Network for Robust Modulation Classification in Wireless Communications
Image Captioning via Dynamic Path Customization
Image Captioning via Dynamic Path Customization
CycleTrans: Learning Neutral Yet Discriminative Features via Cycle Construction for Visible-Infrared Person Re-Identification
Deep Instruction Tuning for Segment Anything Model
Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization
MoIL: Momentum Imitation Learning for Efficient Vision-Language Adaptation
Towards Omni-supervised Referring Expression Segmentation
PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
M3PS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization in E-commerce
Towards Language-Guided Visual Recognition via Dynamic Convolutions
Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Towards Local Visual Modeling for Image Captioning
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Active Teacher for Semi-Supervised Object Detection
Towards End-to-end Semi-supervised Learning for One-stage Object Detection
Towards Efficient Visual Adaption via Structural Re-parameterization
Towards Local Visual Modeling for Image Captioning
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
Semantic-Guided Selective Representation for Image Captioning
HSM-QA: Question Answering System Based on Hierarchical Semantic Matching
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach
Learning Dynamic Prior Knowledge for Text-to-Face Pixel Synthesis
Towards Open-Ended Text-to-Face Generation, Combination and Manipulation
CycleTrans: Learning Neutral yet Discriminative Features for Visible-Infrared Person Re-Identification
What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation
SeqTR: A Simple yet Universal Network for Visual Grounding
Plenty is Plague: Fine-Grained Learning for Visual Question Answering
Knowing What it is: Semantic-Enhanced Dual Attention Transformer
Towards Lightweight Transformer Via Group-Wise Transformation for Vision-and-Language Tasks
Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning
Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning
DIFNet: Boosting Visual Information Flow for Image Captioning
Active Teacher for Semi-Supervised Object Detection
SeqTR: A Simple Yet Universal Network for Visual Grounding
PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation
Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach
Uncovering Media Bias via Social Network Learning
TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words
A Real-Time Global Inference Network for One-Stage Referring Expression Comprehension
Knowledge-Driven Generative Adversarial Network for Text-to-Image Synthesis
K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering
Cascade Grouped Attention Network for Referring Expression Segmentation
Attacking Image Captioning towards Accuracy-Preserving Target Words Removal
Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Free VQA Models from Knowledge Inertia by Pairwise Inconformity Learning
Dynamic Capsule Attention for Visual Question Answering