Site Reliability Engineer (US) at WALT Labs in Spring, TX

Job Description

<div>
 At WALT Labs, we are committed to empowering businesses to leverage the transformative power of cloud technology, facilitating innovation and operational efficiency. We specialize in managed services across Google Cloud Platform (GCP) and Amazon Web Services (AWS), and we are seeking a local Site Reliability Engineer (SRE) who is passionate about technology, excels at problem-solving, and is dedicated to providing unparalleled customer service. You will become the subject-matter expert (SME) on the scale, resiliency, and uptime of our own environments and the customer environments we support.
 <br/>
 <br/>
 <strong>
  Role Summary
  <br/>
  <br/>
 </strong>
 As a critical member of our team, the SRE will provide technical support and expertise to our managed services clients. This role involves diagnosing and resolving complex issues across diverse cloud environments and technologies, ensuring high performance and reliability. The ideal candidate is a tech enthusiast, eager to expand their knowledge and skills daily, committed to problem-solving and delivering customer-focused solutions within defined Service Level Agreement (SLA) guidelines.
 <br/>
 <br/>
 <strong>
  Key Responsibilities:
  <br/>
  <br/>
 </strong>
 <ul>
  <li>
   Ensure high availability and reliability of software systems and infrastructure by building out SLOs &amp; SLAs and continuously improving system reliability
  </li>
  <li>
   Design, implement, and maintain monitoring and alerting systems to detect and address issues proactively, mainly using Datadog, GCP Cloud Monitoring, and PagerDuty/incident.io
  </li>
  <li>
   Debug and troubleshoot production issues across various customer environments, technology stacks, and cloud providers, primarily focusing on GCP and AWS
  </li>
  <li>
   Participate in an on-call rotation to respond to and resolve production incidents, and conduct RCAs/postmortems to identify and address issues
  </li>
  <li>
   Develop and maintain runbooks and playbooks for incident response and troubleshooting
  </li>
  <li>
   Proactively identify bottlenecks and areas for improvement, and optimize systems and application environments accordingly
  </li>
  <li>
   Conduct load testing and capacity planning to ensure systems can handle expected traffic and growth
  </li>
  <li>
   Develop and maintain IaC (Terraform) and configuration management (for example, Ansible and Helm)
  </li>
  <li>
   Work closely with development teams to understand system architecture, identify potential reliability risks, and implement solutions
  </li>
  <li>
   Collaborate with operations teams to ensure smooth deployment and operation of software systems
  </li>
  <li>
   Master a broad range of technologies, including but not limited to VMs, container orchestration, networking, security, databases, data warehouses, serverless technologies, and storage solutions
  </li>
  <li>
   Deploy applications to Kubernetes using Helm, and handle Kubernetes administration and troubleshooting
  </li>
  <li>
   Provide direct support to clients during production outages, offering expert assistance to swiftly rectify issues, adhering to SLA expectations
  </li>
  <li>
   Diligently document solutions and processes, constantly seeking to improve knowledge, skills, and operational efficiency
   <br/>
   <br/>
   <br/>
  </li>
 </ul>
 <strong>
  Requirements
  <br/>
  <br/>
 </strong>
 <ul>
  <li>
   Candidates located in the Houston, TX area are preferred. We also accept fully remote candidates within the United States.
  </li>
  <li>
   3+ years of experience in an SRE role
  </li>
  <li>
   A deep-rooted understanding of how important SLOs, SLIs, and KPIs are to the systems you support, with observability as your daily grounding point
  </li>
  <li>
   Extensive knowledge of all major services in GCP (Cloud Run, BigQuery, GKE, etc.)
  </li>
  <li>
   In-depth knowledge of all major services in AWS
  </li>
  <li>
   Experience setting up and managing monitoring solutions such as Datadog, Google Cloud Operations Suite, CloudWatch, Nagios, and Zabbix
  </li>
  <li>
   Familiarity with various CI/CD systems (Jenkins, Codefresh, GitLab CI, GitHub Actions, Argo CD)
  </li>
  <li>
   Exceptional problem-solving capabilities, the ability to work under pressure, and strong critical thinking skills
  </li>
  <li>
   Ability to act as the voice and commander of incidents, both internal and customer-facing
  </li>
  <li>
   A passion for technology and an unquenchable thirst for learning new skills
  </li>
  <li>
   A customer-focused mindset, dedicated to delivering the highest level of service
   <br/>
   <br/>
   <br/>
  </li>
 </ul>
 <strong>
  Benefits
  <br/>
  <br/>
 </strong>
 <ul>
  <li>
   We cover 100% of your base medical plan!
  </li>
  <li>
   Dental, vision, disability, and life insurance available
  </li>
  <li>
   Generous PTO policy that increases with longevity
  </li>
  <li>
   401(k)
  </li>
  <li>
   Professional development and advancement opportunities
  </li>
  <li>
   Bonus incentives
  </li>
 </ul>
</div>

Salary Benefits

Salary details not provided

Want to apply directly?

Apply for the Site Reliability Engineer (US) position at WALT Labs in Spring, TX using https://www.linkedin.com/jobs/view/3955860318
