Kafka Tier 3 Support Engineer
Job
Tata Consultancy Services Limited
Canton, MA (In Person)
$130,000 Salary, Full-Time
Job Description
Must Have Technical/Functional Skills
Kafka & Streaming
- Strong hands-on experience with Apache Kafka
- Experience supporting at least one of:
  o AWS MSK
  o Confluent Platform / Confluent Cloud
  o Self-managed Kafka (VM or Kubernetes)
- Deep understanding of:
  o Brokers, partitions, replication, ISR, leader election
  o Consumer groups and rebalancing
  o Producer/consumer internals and failure modes
Operations & Performance
- Expertise in diagnosing:
  o Consumer lag and throughput bottlenecks
  o Broker disk, network, and JVM performance
  o Metadata and controller instability
- Experience with monitoring and observability tools (Kafka metrics, CloudWatch, Prometheus, Grafana, etc.)
Security & Governance
- Knowledge of Kafka security concepts:
  o TLS, authentication (IAM/SASL/SCRAM), ACLs/RBAC
  o Principle of least privilege
- Experience supporting regulated or multi-tenant environments
Preferred / Nice to Have Skills
- Experience with Kafka Connect, Schema Registry, or streaming frameworks
- Exposure to KRaft-based Kafka deployments
- Cloud platforms (AWS preferred; Azure/GCP beneficial)
- Automation and IaC experience for Kafka operations
- Experience in SRE or DevOps-aligned environments
Roles & Responsibilities
Key Responsibilities
1. Tier 3 Incident Management & Escalation Support
- Act as the highest technical escalation point for Kafka production incidents (Sev 1 / Sev 2).
- Lead deep troubleshooting across:
  o Broker instability, controller elections, ISR shrinkage
  o Under-replicated partitions and leader imbalance
  o Producer/consumer failures, lag spikes, and rebalance storms
  o Disk, network, JVM, and request handler saturation
- Provide hands-on remediation for complex issues, including:
  o Partition reassignment and leader rebalance
  o Broker configuration tuning
  o Throttle/quota strategies for noisy producers or consumers
- Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
- Guide Tier 2 teams during major incidents and validate restoration actions.
2. Kafka Performance Engineering & Optimization
- Analyze Kafka workloads for performance and scalability risks:
  o Partition skew and hot partitions
  o Inefficient producer batching/compression
  o Consumer lag root cause analysis
  o Thread pool, I/O, and network bottlenecks
- Recommend and validate:
  o Topic design (partition count, replication factor, retention, compaction)
  o Producer and consumer configuration best practices
  o Quotas, quota enforcement, and multi-tenant controls
- Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.
3. Platform Stability, Reliability & Resilience
- Diagnose and resolve systemic Kafka stability issues:
  o Repeated broker failures or flapping
  o Metadata/controller instability (ZooKeeper or KRaft)
  o Recovery issues following failovers or maintenance events
- Support resilience initiatives:
  o Multi-AZ cluster health validation
  o Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
  o Failover testing and validation
- Define and improve Kafka SLOs for availability, durability, and latency.
4. Change, Upgrade & Configuration Leadership
- Lead medium- to high-risk Kafka changes, including:
  o Broker and cluster configuration changes
  o Partition expansion or large-scale reassignment
  o Topic policy changes impacting durability or performance
- Support and plan:
  o Kafka version upgrades
  o MSK / Confluent upgrade cycles
  o Client compatibility and rollout strategies
- Participate in CAB reviews, assess risk, and design rollback and validation plans.
5. Root Cause Analysis & Continuous Improvement
- Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).
- Identify recurring failure patterns and architectural gaps.
- Recommend platform-level improvements:
  o Automation opportunities
  o Guardrails and standards
  o Monitoring and alerting enhancements
- Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.
6. Mentorship & Collaboration
- Provide technical guidance and mentoring to Tier 2 Kafka support teams.
- Collaborate with:
  o Application teams on Kafka client usage and best practices
  o Platform and SRE teams on capacity planning and reliability engineering
  o Security teams on access control, encryption, and compliance requirements
- Act as a subject matter expert for Kafka within the organization.
TCS Employee Benefits Summary:
- Discretionary Annual Incentive.
- Comprehensive Medical Coverage: Medical & Health, Dental & Vision, Disability Planning & Insurance, Pet Insurance Plans.
- Family Support: Maternal & Parental Leaves.
- Insurance Options: Auto & Home Insurance, Identity Theft Protection.
- Convenience & Professional Growth: Commuter Benefits & Certification & Training Reimbursement.
- Time Off: Vacation, Time Off, Sick Leave & Holidays.
- Legal & Financial Assistance: Legal Assistance, 401K Plan, Performance Bonus, College Fund, Student Loan Refinancing.