Senior Advanced Research Computing Systems Administrator, Victoria

Victoria, Canada
Last edited: less than a week ago

Description

Mandate

Reporting to the Manager and Architect of Advanced Research Computing Infrastructure, the Senior Advanced Research Computing Systems Administrator works as part of a team to design, build, and ensure the operational effectiveness of the university’s research servers and storage. Members of this team maintain systems critical to many research groups on campus and beyond, including web servers, database servers, high‑performance computing (HPC) systems, cloud infrastructure, and container orchestration. These systems are used by researchers at UVic, at institutions across the country, and in international collaborations. They are required to be in operation 24 hours a day, 365 days a year, and decisions regarding them can affect UVic’s obligations to parties beyond the institution.

Objectives

The Senior Advanced Research Computing Systems Administrator’s work includes the design, installation, configuration, and maintenance of hardware and software, problem determination and resolution, resource allocation, performance and security monitoring, and usage reporting. Each position has specialized expertise in multiple domains: storage technologies such as Ceph, dCache, GPFS, Lustre, and IBM Spectrum Protect (TSM); deployment technologies such as xCAT, Cobbler, Ansible, Puppet, and Terraform; compute and virtualization technologies such as Kubernetes and OpenStack; HPC schedulers such as SLURM, HTCondor, and Moab; and systems monitoring. The specific technologies used in this role will change over time, and this position is responsible for helping guide decisions on how future technologies are selected and deployed.

This position requires significant problem‑solving skills to analyze and correct software and hardware problems and to automate administration tasks. It also requires effective communication skills to provide technical assistance and advice to peers and the user community, and to inform user areas of the impact and implications of system failures, maintenance, and cybersecurity incidents. The role leads project teams and provides recommendations on the university’s server and storage infrastructure.

System maintenance is usually performed off‑hours, and major issues are responded to on a 24/7 basis. The incumbent may need to work outside normal hours on an emergency or pre‑scheduled basis and may be required to travel out of town or out of the country.

The position requires a Bachelor’s degree in Computer Science or another relevant discipline plus at least five years of experience in system administration in a large enterprise or academic/research environment. An equivalent combination of education and experience may be considered.

Required Knowledge, Skills, and Abilities

Expert knowledge of Red Hat Enterprise Linux and/or its derivatives (e.g., AlmaLinux, Rocky Linux)

In‑depth experience installing and operating at least one of OpenStack, Kubernetes, or Ceph

In‑depth experience with scripting and revision control (e.g., Bash, Perl, Python, Git, or Subversion)

Working knowledge of provisioning and configuration management tools (e.g., Ansible, Terraform, xCAT, Cobbler)

Experience supporting cloud computing and/or containerized environments

Excellent communication skills, both written and verbal

Ability to build and maintain productive working relationships with all stakeholders

Ability to work collaboratively in a team environment

Proven track record of achieving project goals on time and producing deliverables of high quality

High degree of attention to detail and ability to understand complex technical concepts; requires maintaining broad and in‑depth technical knowledge of all aspects of servers and server operating systems

High level of problem‑solving ability; must effectively identify and resolve unusual and highly complex technical problems

Ability to effectively manage multiple tasks and priorities, and work under pressure to meet time‑sensitive and mission‑critical deadlines in a complex environment

Ability to take initiative and work with limited direction

Ability to mentor and coach technical staff and teams, and act as a resource

Ability to contribute to complex projects by developing project work plans and monitoring and directing the activities of a project team

Commitment to valuing diversity and contributing to an inclusive and respectful working and learning environment

Assets or Preferences

Working knowledge of load balancers and HA environments

Experience supporting HPC environments

Experience supporting compute and/or storage systems in a research or academic setting

Experience participating with and contributing to open‑source software projects

Working knowledge of GPU acceleration of computational workloads, preferably in a virtualized environment

Working knowledge of KVM/QEMU virtualization, containerd or Docker container runtimes, and Calico, Linux bridge, or Open vSwitch virtual networking
