About the Company:
Granules is a fully integrated pharmaceutical manufacturer specializing in Active Pharmaceutical Ingredients (APIs), Pharmaceutical Formulation Intermediates (PFIs), and Finished Dosages (FDs), with operations in over 80 countries and a focus on large-scale pharmaceutical manufacturing. Founded in 1984 and headquartered in Hyderabad, India, the company is publicly traded and employs 5,001 to 10,000 people.

About the Role:
We are hiring an AI Engineer – Multimodal to design and build real-time multimodal/omni AI systems that generate audio, video, and language for conversational, human-like interfaces. The role focuses on developing models that tightly couple speech, visual behavior, and language to enable natural, low-latency interactions. You will work at the intersection of conversational AI, neural audio, and audio-visual generation, contributing both foundational research and production-ready systems. This is a hands-on role with strong ownership over technical direction.

Responsibilities:
• Research and develop multimodal/omni generation models for conversational systems, including neural avatars, talking heads, and audio-visual outputs.
• Build and fine-tune expressive neural audio / TTS systems, incorporating prosody, emotion, and non-verbal cues.
• Design and operate real-time, streaming inference pipelines optimized for low latency and natural turn-taking.
• Experiment with and apply diffusion-based models (DDPMs, LDMs) and other generative approaches for audio, image, or video generation.
• Develop models that align conversation flow with verbal and non-verbal behavior across modalities.
• Collaborate with applied ML and engineering teams to transition research into production-grade systems.
• Track, evaluate, and apply emerging research in multimodal and generative modeling.

Qualifications:
• Master’s or PhD (or equivalent hands-on experience) in ML, AI, Computer Vision, Speech, or a related field.
• 4–8 years of hands-on experience in applied AI/ML research or engineering, with a strong focus on multimodal and generative systems.

Required Skills:
• Strong experience modeling human behavior and generation, including facial expressions, affect, or speech, preferably in conversational or interactive settings.
• Deep understanding of sequence modeling across video, audio, and language domains.
• Strong foundation in deep learning, including Transformers, diffusion models, and practical training techniques.
• Familiarity with large-scale model training, including LLMs and/or vision-language models (VLMs).
• Excellent programming skills in PyTorch, with hands-on experience in GPU-based training and inference.
• Proven experience deploying and operating real-time or streaming AI systems in production.
• Strong intuition for human-like speech and behavior generation, including diagnosing and improving unnatural outputs.

Nice to Have:
• Experience with long-form audio or video generation.
• Exposure to 3D graphics, Gaussian splatting, or large-scale training pipelines.
• Familiarity with production ML or software engineering best practices.
• Research publications in respected venues (e.g., CVPR, NeurIPS, ICASSP, BMVC).

Equal Opportunity Statement:
We are committed to diversity and inclusivity in our hiring practices.

AI Engineer - Multimodal
