Published work:
WOK-indexed papers: 87;
Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
Towards Omni-supervised Referring Expression Segmentation
Semi-Supervised Panoptic Narrative Grounding
Semi-Supervised Panoptic Narrative Grounding
PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval
JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation
Towards Language-Guided Visual Recognition via Dynamic Convolutions
Continual Face Forgery Detection via Historical Distribution Preserving
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Towards General Visual-Linguistic Face Forgery Detection
Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer
End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Towards local visual modeling for image captioning
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance
Active Teacher for Semi-Supervised Object Detection
Towards End-to-end Semi-supervised Learning for One-stage Object Detection
Towards Efficient Visual Adaption via Structural Re-parameterization
Towards Local Visual Modeling for Image Captioning
Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network
HSM-QA: Question Answering System Based on Hierarchical Semantic Matching
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension
X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance
Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach
Learning Dynamic Prior Knowledge for Text-to-Face Pixel Synthesis
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Towards Open-Ended Text-to-Face Generation, Combination and Manipulation
Clover: Towards A Unified Video-Language Alignment and Fusion Model
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation
End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation
SeqTR: A Simple yet Universal Network for Visual Grounding
Global2Local: A Joint-Hierarchical Attention for Video Captioning
Differentiated Relevances Embedding for Group-based Referring Expression Comprehension
Plenty is Plague: Fine-Grained Learning for Visual Question Answering
Fast Monocular Depth Estimation via Side Prediction Aggregation with Continuous Spatial Refinement
Knowing What it is: Semantic-Enhanced Dual Attention Transformer
Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning
Towards Lightweight Transformer Via Group-Wise Transformation for Vision-and-Language Tasks
Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning
DIFNet: Boosting Visual Information Flow for Image Captioning
Active Teacher for Semi-Supervised Object Detection
An Information Theoretic Approach for Attention-Driven Face Forgery Detection
SeqTR: A Simple Yet Universal Network for Visual Grounding
PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation
Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach
TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
Dual-Level Collaborative Transformer for Image Captioning
Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words
A Real-Time Global Inference Network for One-Stage Referring Expression Comprehension
Deep Semantic Parsing of Freehand Sketches With Homogeneous Transformation, Soft-Weighted Loss, and Staged Learning
Knowledge-Driven Generative Adversarial Network for Text-to-Image Synthesis
Modeling long-term video semantic distribution for temporal action proposal generation
Detecting High Frequency Oscillations for Stereoelectroencephalography in Epilepsy via Hypergraph Learning
Sketch-specific data augmentation for freehand sketch recognition
K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering
Cascade Grouped Attention Network for Referring Expression Segmentation
Attacking Image Captioning towards Accuracy-Preserving Target Words Removal
Exploring Language Prior for Mode-Sensitive Visual Attention Modeling
Semi-Supervised Adversarial Monocular Depth Estimation
Similarity-Preserving Linkage Hashing for Online Image Retrieval
Hadamard Matrix Guided Online Hashing
SSAH: Semi-supervised adversarial deep hashing with self-paced hard sample generation
Multi-task collaborative network for joint referring expression comprehension and segmentation
Multi-modal multi-layer fusion network with average binary center loss for face anti-spoofing
Hypergraph induced convolutional manifold networks
Free VQA models from knowledge inertia by pairwise inconformity learning
Towards optimal discrete online hashing with balanced similarity
Dynamic capsule attention for visual question answering
Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer
Variational structured semantic inference for diverse image captioning
Information competing process for learning diversified representations