Senior/Lead Technical Support Engineer
Hyderabad, TS, India
Full Time
Senior Executive
Experience: 5-10 years
Role Overview: This role is responsible for troubleshooting and resolving production incidents. It acts as a bridge between the support and development teams, handling technical investigations, applying quick fixes, and escalating critical issues. By managing and resolving incidents effectively, it frees the development team to focus on R&D and feature development.
Key Responsibilities:
- Incident Management and Troubleshooting
  - Take ownership of production incidents, perform deep-dive investigations, and provide immediate resolutions or workarounds.
  - Monitor production alerts, logs, and error notifications in real time to ensure rapid incident response.
  - Escalate unresolved issues to the development team only when necessary, minimizing their involvement in routine incidents.
  - Document all production issues, resolutions, and lessons learned to improve troubleshooting efficiency.
  - Develop and maintain incident response plans to ensure a structured troubleshooting approach.
- Collaboration and Support Enablement
  - Work closely with the support team on technical escalations to ensure customer issues are addressed quickly.
  - Coordinate with the development team to report recurring issues that need long-term fixes, while reducing their direct involvement in incident handling.
  - Communicate incident status, impact, and resolution progress to key stakeholders and leadership.
- System Monitoring and Performance Optimization
  - Monitor support emails, process failure notification emails, and Prometheus alerts to detect problems early and prevent incidents before they occur.
  - Work with DevOps to improve observability, logging, and alerting strategies.
- Suggest Workarounds and Implement Quick Fixes
  - Understand the product and customer use cases to provide workaround solutions when needed.
  - Execute minor SQL queries and data fixes to resolve customer issues without requiring development team intervention.
- Leadership and Team Management
  - Lead and mentor a team of junior support engineers, ensuring they follow best practices in incident handling.
  - Train the support team on troubleshooting common production issues.
  - Establish clear ownership of incident response to reduce ad-hoc escalations to the development team.
Required Qualifications:
Technical Skills:
- 5+ years of experience in production support, incident management, or site reliability engineering.
- Strong expertise in Linux/Unix systems and troubleshooting.
- Experience with monitoring tools such as ELK Stack, Grafana, Prometheus, and CloudWatch.
- Proficiency in SQL (MySQL, PostgreSQL, or Oracle) for running queries and applying minor data fixes.
- Hands-on experience with log analysis and debugging using ELK Stack.
- Knowledge of scripting languages such as Shell, Python, or Groovy to automate incident handling.
- Familiarity with microservices, REST APIs, and message queues like RabbitMQ and Kafka.
- Strong problem-solving and troubleshooting skills under pressure.
- Ability to mentor junior engineers and effectively lead small teams.
- Excellent communication skills for collaborating with engineering, CS, and DevOps teams.
- Proactive mindset to reduce developer involvement in incident handling and improve overall system reliability.