Hisham Cholakkal

Assistant Professor, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

Multimodal Large Language Models · Healthcare Foundation Models · Vision Encoders

Research Interests

Hisham Cholakkal’s research lies at the intersection of computer vision and multimodal learning, with a focus on foundation models and large multimodal models (LMMs). His objective is to build omnimodal AI companions that seamlessly integrate vision, audio, speech and text across different languages and cultures, and to deploy these systems in smart wearables such as smart glasses. He is also interested in applying LMMs and AI companions to healthcare and social good. To support this vision, his research program is structured around three interconnected pillars: multimodal learning, healthcare foundation models and advanced visual recognition architectures.

Bio

Hisham Cholakkal is an Assistant Professor at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Prior to joining MBZUAI, he held research and technical leadership positions at the Inception Institute of Artificial Intelligence (IIAI) in the UAE, Mercedes‑Benz R&D India, BEL’s Central Research Laboratory (India) and the Advanced Digital Sciences Center in Singapore. With more than twelve years of experience in computer vision and multimodal AI, he bridges fundamental research, teaching and AI product development at scale.

As a Principal Investigator at MBZUAI, he has secured over eight research grants and awards, including the Meta Llama Impact Innovation Award (2024), NVIDIA Academic Grant (2025), Meta Regional Research Grant (2025) and Google Gift Research Award (2023). His last‑authored papers have received awards and recognition, including the SAC Highlights Award at EMNLP 2025. His teaching contributions at MBZUAI have been recognized through the inaugural MBZUAI Teaching Excellence Award (2025) and the Founding Service Award. He serves in leadership roles at top AI conferences, including General Chair of ACM Multimedia Asia 2026 and Local Chair of ACCV 2028, and has served as Area Chair at CVPR, ICLR, NeurIPS, ACM Multimedia, ECCV and BMVC. He also organizes workshops on foundation models and vision transformers at major venues including CVPR, ICCV, NeurIPS, ACCV and ICME.

Honors & Awards

Services

Selected Talks & Keynotes

Positions Available

I am looking for exceptional candidates for several positions at MBZUAI, including PostDocs, Research Engineers, PhD students, MSc students and Research Interns. PostDoc applicants should have strong experience in multimodal models and LLMs/VLMs. Research Engineers must demonstrate strong development skills through past projects; a background in LLMs/VLMs and generative AI for healthcare is a plus. Research Interns and PhD candidates should have a strong academic record from BS/MS programs focused on computer vision or machine learning, with relevant coursework and projects. Previous publications in top conferences such as CVPR, ECCV, ICCV, NeurIPS, ICLR or ICML are desirable. If you are interested, please send me your CV and a link to your GitHub profile.

News

Teaching

Hisham Cholakkal teaches core and advanced computer vision courses at MBZUAI. His teaching has been recognized through both the inaugural Teaching Excellence Award and the Computer Vision Department Teaching and Mentorship Award (2026). Below are brief summaries of his courses.

Human and Computer Vision (CV701)

This course covers fundamental image processing concepts (filtering, edge detection, color planes), classical computer vision methods (corner and blob detection, SIFT), camera geometry and optics (intrinsic and extrinsic parameters, stereo matching and 3D vision), machine learning and deep learning for vision, and advanced topics such as human action recognition and pose estimation.

Offered: Spring 2021, Fall 2021, Fall 2022.

Visual Recognition and Detection (CV703)

This course explores the fundamentals of visual recognition; CNN‑based and transformer‑based architectures; single‑stage and two‑stage object detection; segmentation architectures; video tracking; and cutting‑edge detection frameworks.

Offered: Fall 2021, Spring 2022, Spring 2023, Spring 2024, Spring 2025, Spring 2026.

Advanced Computer Vision (CV801)

A two‑part course for first‑semester PhD students. Part I establishes foundations in convolutional and transformer architectures for detection and segmentation. Part II introduces advanced topics, including foundation models, vision–language models, large multimodal and language models, the Segment Anything Model, efficient computer vision, remote sensing change detection and diffusion‑based image generation.

Offered: Fall 2023, Fall 2024, Fall 2025.

Team (Primary Supervision)

I have had the privilege of working with exceptional students, colleagues and collaborators. Below is a partial list:

Postdoctoral Researchers

Research Engineers

Current Students (Primary Supervision)

Alumni

Previous Roles

Education

Ph.D. in Computer Engineering, Nanyang Technological University (NTU), Singapore (2016). Research focused on visual attention, visual recognition and learning with limited supervision.

M.Tech. in Digital Signal Processing, Indian Institute of Technology (IIT) Guwahati, India (2009). Research focused on video compression.

Research Demonstrations, Public Engagement & Media Coverage

Our research outputs are frequently showcased as live demonstrations at major international events, widely covered in the media and shared with the public to inspire the next generation of AI practitioners.

Highlights of media coverage featuring our demonstrations and research projects.

Selected Publications

Professor Hisham Cholakkal has published over 100 research papers and holds more than eight granted U.S. patents across three primary research pillars: multimodal learning, healthcare foundation models and efficient visual recognition architectures. Representative research publications are listed below. A complete and continuously updated publication list is available on Google Scholar.

Multimodal LLM

Healthcare Foundation Models

Vision Encoders & Visual Recognition