We are seeking a skilled developer to design, optimize, and deploy offline/on-premise generative AI models and LLMs. This role focuses on building high-throughput, low-latency inference systems and robust MLOps pipelines for private environments.

Responsibilities:
• Deploy and optimize LLMs using frameworks like TGI, vLLM, and DeepSpeed
• Accelerate models with TensorRT-LLM, quantization, and GPU optimizations
• Build offline RAG/agent systems with LangChain, Google ADK, and SGLang
• Fine-tune models (LoRA/QLoRA) for edge and constrained environments
• Develop CI/CD pipelines with Docker and Kubernetes/OpenShift
• Create production-ready APIs with FastAPI and SQLAlchemy/Alembic
• Research new inference algorithms and quantization techniques

Qualifications:
• Bachelor’s/Master’s in CS, Engineering, or a related field
• 4–5 years of ML experience, including at least 2 years in LLM optimization/deployment
• Strong Python and CUDA/GPU expertise
• Proven record of deploying offline, high-performance LLMs

Preferred:
• Experience in air-gapped/non-cloud environments
• Knowledge of Spark/Kafka for RAG pipelines
• Contributions to open-source LLM frameworks

GenAI and Open-source LLM Developer
