Site Reliability Jobs - Remote Work From Home & Flexible
-
30+ days ago: As our Site Reliability Engineer, you will be responsible for collaborating on the design, build, and maintenance of reliable and scalable infrastructure and software systems. This will be accomplished by tracking error budgets against service level...
-
22 days ago: As a Site Reliability Engineer in our SRE Team you will be responsible for deploying and maintaining the SaaS Infrastructure in regulated and sensitive environments, such as AWS GovCloud. You will partner with engineers across the company to...
-
30+ days ago: Option for telecommuting. Candidate will handle daily monitoring and maintenance of applications, deploy new releases across multiple SaaS customers, and help software engineering on core applications. Must have experience building systems in AWS.
-
24 days ago: Develop, deploy and operate services, following standard procedures and guidelines, delivering high performance, scalable and zero downtime releases in AWS environments. Carry out assigned tasks to modernize company services and related infrastructure.
-
30+ days ago: Provide feedback to developers on how their products operate at scale; write code, submit bugs, and work with other teams within the company. Must have a bachelor's degree and experience with software development and Python. WFH.
-
30+ days ago: Establish SLIs and SLOs for the key customer workflows that your team owns. Diagnose the factors that most threaten SLOs and identify necessary improvements. Improve observability, measurement, and diagnostics for key customer SLIs and SLOs...
-
22 days ago: Responsible for keeping all user-facing services such as our booking platform/marketplace systems running smoothly by ensuring that the underlying infrastructure is running smoothly, and that systems and tools are working as expected.
-
8 days ago: We're looking for a Senior Site Reliability Engineer to improve reliability and stability of customer-facing, production infrastructure, serving millions of page views per hour. Our product is used by over 2 million users worldwide across 190...
-
30+ days ago: Implement and maintain systems that monitor networks, server health, and application performance. Configure infrastructure systems to provide load balancing, application firewalls, reverse proxying, and related services.
-
New! 2 days ago: We are looking for either a DevOps / SRE engineer who can code (Node / Ruby) or a developer who knows about infrastructure. You'll work to make sure our applications provide a quality experience for our customers while working to maintain efficiency...
-
30+ days ago: Build CI and CD pipelines. Optimize and scale workloads. Secure containers and web services. We'd like you to know Docker, Kubernetes, GCP, AWS, Go, Postgres, and Redis, and to have familiarity with JavaScript and excellent communication skills (English)...
-
30+ days ago: Handle incoming IAM requests from other teams. You will help create and implement least-privilege-based IAM solutions to meet other teams' project and access requirements. You will work with requesters to promote a good experience.
-
30+ days ago: Work closely with product engineers to implement scalable and highly reliable systems. Scale existing backend systems to handle ever-increasing amounts of traffic and new product requirements. Collaborate with other developers to understand tooling.
-
30+ days ago: Play a vital role in ensuring the stability, scalability, and reliability of our systems and services. Will work closely with all the company's engineering teams as part of supporting shared infrastructure and developer tools.
-
26 days ago: This is a perfect opportunity for someone who is passionate about automating away classic system reliability issues, coming up with creative solutions to scale impact. We are looking for someone who has an aversion to the throw more people at the on-...
-
30+ days ago: Our site reliability engineers bring Python software-engineering skills and rigour to the operations domain. Architect and run OpenStack, Kubernetes and software defined storage. Software Engineering or Computer Science degree.
-
30+ days ago: Responsible for keeping all customer-facing services and other production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound software engineering principles, operational discipline, and mature au...
-
20 days ago: Design, build and drive adoption of a platform that enables service resilience testing/chaos engineering to validate that the architecture is resilient to failure. Build and own a performance testing framework/environment to enable our R&D teams to...
-
30+ days ago: Build systems for declarative application and infrastructure lifecycle management: continuous deployment, continuous integration, Kubernetes cluster management, service and workload inventory. Prioritize and help troubleshoot problems, downtime, and...
-
30+ days ago: Work with modern EKS infrastructure and deployment tools like fluxcd and argocd. Support hosted database platforms like Mongo Atlas. Mentor other team members within your areas of subject matter expertise, to avoid creating knowledge silos. Build...
-
30+ days ago: Respond to, investigate and fix service issues, whether they are deep in the OS kernel or in the application code. Design, build and maintain the infrastructure we need to support orders of magnitude more customers. 5+ years of experience...
-
30+ days ago: Design and implement reliable and scalable systems using software engineering best practices. Develop and deploy automation and monitoring tools to proactively detect and mitigate incidents, and to prevent outages. Partner with engineers across...
-
30+ days ago: Design and implement logging, monitoring, and observability capabilities as well as bespoke tools to manage our products and services running on global multi-cloud infrastructure. Participate in design discussions with other teams to promote SRE...
-
Featured, 3 weeks ago: As the Site Reliability Engineer, you will play a key role in designing, developing, and maintaining reliable, scalable, and highly available infrastructure for our API services. You will contribute heavily to the high-impact challenges behind...
-
30+ days ago: Pair-programming to collaboratively improve the services that power us. Establishing SLIs and SLOs for the key customer workflows that your team owns. Diagnosing the factors that most threaten SLOs and identifying necessary improvements. Improving...
-
30+ days ago: As a Principal Site Reliability Engineer, you will focus on innovating and providing strong technical vision, and will work with the team to build reliable, scalable and highly available datastores on a constantly growing multi-region platform.
-
30+ days ago: Join an on-call rotation to respond to incidents that impact availability and drive the efforts to restore service within SLAs. Conduct neutral postmortems of issues and events to identify the root cause. Use your on-call shift schedule to...
-
1 week ago: Design, build, and maintain core MTA infrastructure pieces that allow company scaling to support real-time processing and delivery of billions of messages. Experience in managing and working with MTAs including MTA administration as a postmaster.
-
8 days ago: Design, build, and maintain core MTA infrastructure pieces that allow scaling to support real-time processing and delivery of billions of messages. Plan the growth of MTA infrastructure. Automate the deployment process to make it as boring as possible.
-
30+ days ago: Champion and implement a culture of SRE to maintain a high-quality platform infrastructure. Champion and implement application and infrastructure monitoring and alerting to prevent client-impacting issues by ensuring system availability and performance.
-
27 days ago: Improve our software delivery pipeline in a way that makes it expedient and encourages a culture of high code quality. Work with business units to scale and design systems that are highly available and resilient. Make substantial code contributions to apps.
-
15 days ago: Implement and iterate the infrastructure underlying the company production environment in Cloud and On-Prem. Participate in performance tuning, availability, and disaster recovery for production environments. Work with engineering stakeholders to...
-
30+ days ago: Assist in the design and continuously improve our team's processes, tools and solutions used to build, deploy, monitor, maintain and scale production systems. Assist in the design and improve our monitoring, alerting and remediation solutions with...
-
3 weeks ago: Collaborate with software developers, infrastructure engineers, and data scientists to help implement robust and scalable software products. Create workshops, training, and knowledge sharing across the organization. On-call rotation for production.
-
1 week ago: Design and develop the CI/CD systems developers use and the infrastructure for all current and future websites and services. Diagnose and debug production incidents and then improve systems to prevent the problem from recurring. Collaborate with web...
-
Featured, 15 days ago: Design, implementation and maintenance of public facing infrastructure and services. Use of configuration management and deployment tools. Architectural design and operation at scale. Monitoring of systems and services, optimization of performance and...
-
Featured, New! 2 days ago: Design, implementation and maintenance of public facing infrastructure and services. Use of configuration management and deployment tools. Architectural design and operation at scale. Monitoring of systems and services, optimization of performance and resources.
-
Featured, 30+ days ago: 2+ years of experience in an SRE/Operations/DevOps role as part of a team. Experience with operating highly available infrastructure. Comfortable with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.).
-
Featured, 23 days ago: Be part of our team responsible for designing, writing, and delivering software to improve the availability, scalability, latency, and efficiency of services. Work with a team of software and systems engineers on projects for users responsible for...
-
Featured, 1 week ago: Design, implementation and maintenance of public facing infrastructure and services. Use of configuration management and deployment tools. Architectural design and operation at scale. Monitoring of systems and services, optimization of performance and...
-
27 days ago: Define and operationalize service level objectives (SLOs) and find sustainable methods for monitoring, managing, and scaling our platforms and services. Distill and synthesize non-functional requirements into discrete and meaningful iterations that can...
-
22 days ago: You will be an integral part of designing and operating large-scale highly available distributed systems in the cloud. Collaborate with our application development teams to ensure the reliability and performance of our infrastructure. Write quality code.
-
New! 3 days ago: Help craft the best architectures and processes and be a role model for the Engineering team in both delivery and collaboration. A reliability-focused technical leader who focuses on automation and will help drive our operational excellence to a new...
-
30+ days ago: Facilitating the development process and operations. Identifying setbacks and shortcomings. Creating suitable DevOps channels across the organization. Establishing continuous build environments to speed up software development. Designing efficient...
-
New! Yesterday: Build tools and applications to extend and improve the company's developer onboarding and software development processes. Strong software development experience (particularly Python or Golang), especially writing software to compose pipelines and...
-
1 week ago: You'll be responsible for identifying, troubleshooting, and reporting platform problems to product engineers in order to ensure that the thousands of clusters we manage are providing a stable and reliable service. Experience with Java. 100% remote work.
-
30+ days ago: Automate software operations for reusability and consistency across private and public clouds, taking into consideration the complexities of distributed systems. Python software development experience, with large projects.
-
30+ days ago: You will work with developers to create more scalable services and help us build self-service paved roads to simplify writing services and provisioning infrastructure. You will help to isolate, trap, and respond to the inevitability of system failure...
-
New! 2 days ago: This role will provide support to development teams using development tools like Ansible, Jenkins and Git for Java applications. Facility planning, storage systems, server systems, website and web applications, LAN, and other IT-related systems functions.