National Cancer Centre of Singapore Pte Ltd

Data Infrastructure Engineer / Data Scientist (DCS)

Job Category:  Research
Posting Date:  19 Dec 2024

NCCS Data and Computational Science (DCS) is a newly established computational hub within the National Cancer Centre of Singapore (NCCS) that focuses on leveraging data analytics and computational methods to advance cancer research and treatment. DCS features high-powered computing resources capable of processing ‘big data’ profiles and running advanced interpretable machine learning algorithms and robust statistical techniques. DCS offers in-house, centralised solutions for NCCS researchers who require computational analysis, without the need to buy specialised equipment or contract with third-party vendors. DCS aims to maximise the efficiency of data processes and accelerate research outcomes. With access to national-level medical data spanning clinical, imaging and omics datasets, our efforts are concentrated on harvesting the innate value of these rich datasets to improve cancer patient care and treatment delivery through the production of world-class research.

You will provide support and insights for DCS's core multi-omics research and big data analysis computing infrastructure, which focuses on using next-generation sequencing (NGS), radiological imaging and other multi-modal data types to develop biomarkers predictive of clinical responses in cancer patients. You will be expected to wrangle and optimise large datasets and execute complex parallel computational pipelines to better understand the complexity of cancer progression and treatment resistance across multiple cancer types.

Key Responsibilities:
- IT Infrastructure Design: Develop strategic IT infrastructure plans (hardware, software and network) that ensure scalability, security, and performance.
- IT Infrastructure Procurement: Source and liaise with vendors to procure IT infrastructure that supports the team's expansion needs.
- IT Infrastructure Maintenance: Monitor system performance, identify and resolve bottlenecks or issues, and ensure minimal downtime. Apply software updates and patches. Back up data to prevent data loss.
- Data Integration and ETL: Design, implement and maintain data processing pipelines for ingesting, transforming, and loading data from various sources. Implement and optimise ETL processes for efficiency and reliability.
- Security and Compliance: Define computing resource and data access controls, encryption, and authentication mechanisms. Ensure compliance with data privacy regulations (e.g. GDPR, PDPA and other APAC data protection laws) and organisational policies.
- Collaboration: Work closely with Principal Investigators, clinicians, data scientists, researchers and other stakeholders to understand data requirements. Collaborate with other IT team members to support data initiatives and maintain a consistent, high-quality data delivery architecture across projects.

Job Requirements:
- Bachelor's degree or higher in a relevant STEM discipline.
- Knowledge of production data pipelines, especially in a bioinformatics or clinical healthcare setting.
- Familiarity with Linux or other Unix flavours, preferably with administrator/superuser and server maintenance experience.
- Familiarity with data security and access control measures.
- Strong programming expertise in at least one major language (e.g. Bash, Python, R, C/C++, Rust).
- A keenness for incrementally designing, building and testing software components to ensure correct end-to-end running of primary production pipelines.
- A keenness to continually learn and integrate new tools and parameters to keep up with industry best practices, adapting them to local needs.
- Ability to independently plan and execute data analysis and ad-hoc projects, in collaboration with teammates and external parties.
- Strong organisational, interpersonal and presentation skills.
- Familiarity with pipeline management systems (Nextflow, Snakemake, CWL, WDL).
- Familiarity with job schedulers (SLURM, PBS, SGE, LSF).
- Familiarity with container/virtualization systems (Docker, Singularity, Podman, Kubernetes).
- Familiarity with multimodal data formats and tooling (e.g. Polars, Arrow, vector DBs and column stores).
- Interest in optimising GPU workloads and large models.
- Interest in front-end and back-end development for data analytics and parallel computation.