AI Engineer - Multimodal
Granules India Limited
Posted on: February 26, 2026
About the Company:
Granules is a fully integrated pharmaceutical manufacturer specializing in Active Pharmaceutical Ingredients (APIs), Pharmaceutical Formulation Intermediates (PFIs), and Finished Dosages (FDs), with operations in over 80 countries and a focus on large-scale pharmaceutical manufacturing. Founded in 1984 and headquartered in Hyderabad, India, the company is publicly traded and employs 5,001 to 10,000 people.
About the Role:
We are hiring an AI Engineer – Multimodal to design and build real-time multimodal/omni AI systems that generate audio, video, and language for conversational, human-like interfaces. The role focuses on developing models that tightly couple speech, visual behavior, and language to enable natural, low-latency interactions.
You will work at the intersection of conversational AI, neural audio, and audio-visual generation, contributing both foundational research and production-ready systems. This is a hands-on role with strong ownership over technical direction.
Responsibilities:
• Research and develop multimodal/omni generation models for conversational systems, including neural avatars, talking heads, and audio-visual outputs.
• Build and fine-tune expressive neural audio / TTS systems, incorporating prosody, emotion, and non-verbal cues.
• Design and operate real-time, streaming inference pipelines optimized for low latency and natural turn-taking.
• Experiment with and apply diffusion-based models (DDPMs, LDMs) and other generative approaches for audio, image, or video generation.
• Develop models that align conversation flow with verbal and non-verbal behavior across modalities.
• Collaborate with applied ML and engineering teams to transition research into production-grade systems.
• Track, evaluate, and apply emerging research in multimodal and generative modeling.
Qualifications:
• Master’s or PhD (or equivalent hands-on experience) in ML, AI, Computer Vision, Speech, or a related field.
• 4–8 years of hands-on experience in applied AI/ML research or engineering, with a strong focus on multimodal and generative systems.
Required Skills:
• Strong experience modeling and generating human behavior, including facial expressions, affect, or speech, preferably in conversational or interactive settings.
• Deep understanding of sequence modeling across video, audio, and language domains.
• Strong foundation in deep learning, including Transformers, diffusion models, and practical training techniques.
• Familiarity with large-scale model training, including LLMs and/or vision-language models (VLMs).
• Excellent programming skills in PyTorch, with hands-on experience in GPU-based training and inference.
• Proven experience deploying and operating real-time or streaming AI systems in production.
• Strong intuition for human-like speech and behavior generation, including diagnosing and improving unnatural outputs.
Nice to Have:
• Experience with long-form audio or video generation.
• Exposure to 3D graphics, Gaussian splatting, or large-scale training pipelines.
• Familiarity with production ML or software engineering best practices.
• Research publications in respected venues (e.g., CVPR, NeurIPS, ICASSP, BMVC).
Equal Opportunity Statement:
We are committed to diversity and inclusivity in our hiring practices.
About Company
Granules India Limited
Telangana, IN
https://granulesindia.com