Privacy and analytics

HPC-Pilot is an AI project built to help the HPC community troubleshoot issues. We use optional usage analytics to understand performance, feature usage, and error trends. These events stay anonymized and are only enabled after you consent.

HPC-Pilot: Tenant-Aware AI Copilot for University HPC Support

Analyze cluster failures, enrich the answer with public university HPC wiki context, and get grounded next actions for complex incidents.

Universities: 14
Source groups: 15
Topics: 3

Backend

Log-to-Fix

Paste one HPC incident and Pilot turns the failure into a grounded sequence of diagnosis, next actions, validation, and source links.

Ready

Workspace

Public Wiki Context

Configured public HPC wiki pages are indexed ahead of time so the selected university can add local operational context to the answer.

9 seed URLs

PII

PII Scrubbing

Logs are sanitized before analysis: email addresses, IP addresses, usernames, and internal-looking paths are redacted before model processing.

Scrubbed before analysis
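
As a rough illustration of the scrubbing step described above, the following Python sketch masks the four identifier types with regular expressions. The patterns, the placeholder tokens, and the `scrub` function name are assumptions made for this example; the rules HPC-Pilot actually applies are not shown on this page.

```python
import re

# Hypothetical patterns and placeholders, chosen only for illustration.
PATTERNS = [
    # Email addresses such as name@host.tld
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    # Dotted IPv4 addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    # First path component under common per-user roots (assumed roots)
    (re.compile(r"(/(?:home|scratch|work)/)[^/\s]+"), r"\1<USER>"),
    # Explicit user=... or username: ... fields
    (re.compile(r"\buser(?:name)?[=:]\s*\S+", re.IGNORECASE), "user=<USER>"),
]


def scrub(text: str) -> str:
    """Apply each masking pattern in order and return the sanitized text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Regex masking of this kind is a best-effort filter: it catches common shapes of identifiers, not every possible one.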

Configured tenants

Universities currently available in the system.

These institutions currently have public HPC wiki context configured for troubleshooting and retrieval.

14 enabled tenants

Bielefeld University (uni_bielefeld)
Bielefeld currently relies on shared HPC.nrw wiki guidance until a stronger site-specific source is added.
1 source group, 1 seed URL

Forschungszentrum Juelich (fz_juelich)
Use JSC user support documentation for JUWELS, JURECA, and JUPITER guidance.
1 source group, 1 seed URL

Heinrich Heine University Duesseldorf (uni_hhu_duesseldorf)
Use HHU ZIM HPC documentation for HilberD-related support.
1 source group, 1 seed URL

HPC.NRW (hpc_nrw_shared)
Use the central HPC Wiki as umbrella public context for site-independent NRW HPC guidance.
1 source group, 9 seed URLs

Ruhr University Bochum (uni_rub)
Use HPC@RUB help documentation for Elysium-related support.
1 source group, 1 seed URL

RWTH Aachen (uni_rwth)
Use RWTH Aachen Slurm documentation during retrieval.
1 source group, 2 seed URLs

TU Dortmund University (uni_tu_dortmund)
Use LiDO cluster documentation for LiDO3 and LiDO4 guidance.
1 source group, 1 seed URL

University of Bonn (uni_bonn)
Use Bonn HPC wiki content for Marvin and Bender troubleshooting.
1 source group, 1 seed URL

University of Cologne (uni_koeln)
University of Cologne points to an ITCC GitLab wiki, but the current static scraper cannot extract its body content yet. Shared HPC.nrw fallback remains available until a GitLab-specific adapter is added.
1 source group, 1 seed URL

University of Duisburg-Essen (uni_due)
University of Duisburg-Essen currently relies on shared HPC.nrw wiki guidance until a stronger site-specific source is added.
1 source group, 1 seed URL

University of Muenster (uni_muenster)
Use PALMA documentation for cluster usage and troubleshooting.
1 source group, 1 seed URL

University of Paderborn PC2 (uni_paderborn_pc2)
Use PC2 documentation for Noctua 2 and Otus support guidance.
1 source group, 1 seed URL

University of Siegen (uni_siegen)
Use University of Siegen OMNI HPC documentation during retrieval.
2 source groups, 8 seed URLs

University of Wuppertal (uni_wuppertal)
Use Pleiades documentation for cluster access and usage guidance.
1 source group, 1 seed URL

Graphical flow

How the registry becomes retrieval-ready context.

Sources are ingested once, cleaned for tenant-aware retrieval, and prepared with a lighter embedding model before the generator is asked to respond.

01

Registry ingestion

YAML source definitions and scraper output are normalized into a repeatable document set that can be audited and refreshed.
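
To make the normalization idea concrete, here is a minimal Python sketch that turns one already-parsed registry entry into a typed record with deduplicated seed URLs. The field names (`source_groups`, `seed_urls`, `display_name`), the `Tenant` and `SourceGroup` classes, and the example URLs are all hypothetical; the actual YAML schema is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class SourceGroup:
    name: str
    seed_urls: list[str] = field(default_factory=list)


@dataclass
class Tenant:
    university_id: str
    display_name: str
    source_groups: list[SourceGroup] = field(default_factory=list)


def normalize(raw: dict) -> Tenant:
    """Convert one parsed registry entry (assumed shape) into a Tenant."""
    groups = []
    for group in raw.get("source_groups", []):
        seen, urls = set(), []
        for url in group.get("seed_urls", []):
            url = url.strip().rstrip("/")  # treat trailing-slash variants as equal
            if url and url not in seen:    # drop blanks and repeats, keep order
                seen.add(url)
                urls.append(url)
        groups.append(SourceGroup(name=group["name"], seed_urls=urls))
    return Tenant(raw["university_id"], raw["display_name"], groups)


# Hypothetical entry, as it might look after loading a YAML file.
entry = {
    "university_id": "uni_siegen",
    "display_name": "University of Siegen",
    "source_groups": [
        {"name": "docs", "seed_urls": ["https://example.org/a/", "https://example.org/a", " "]},
    ],
}
tenant = normalize(entry)
```

Normalizing into one fixed shape is what makes the document set repeatable: the same registry input always yields the same list of URLs to scrape.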

02

Embedding preparation

PII is scrubbed, content is chunked, and each record is tagged with `university_id` before vectors are written.
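
The order of operations in this step can be sketched in Python as follows: scrub first, then chunk, then tag each chunk with `university_id`. The chunk size, the overlap, the record fields other than `university_id`, and the function names are assumptions for illustration, and the `scrub` argument stands in for whatever sanitizer is actually used.

```python
from typing import Callable


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
        start += size - overlap
    return chunks


def prepare_records(
    university_id: str,
    source_url: str,
    text: str,
    scrub: Callable[[str], str] = lambda s: s,  # placeholder sanitizer
) -> list[dict]:
    """Scrub, chunk, and tag one document before it is embedded."""
    clean = scrub(text)  # scrub before chunking so no raw identifier lands in a chunk
    return [
        {
            "university_id": university_id,
            "source_url": source_url,
            "chunk_index": index,
            "text": chunk,
        }
        for index, chunk in enumerate(chunk_text(clean))
    ]
```

Tagging every record at write time is what later allows retrieval to filter by tenant without inspecting the text itself.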

03

Model savings

Embeddings use `models/gemini-embedding-001`, while answer generation stays on `gemini-3.1-flash-lite-preview`, keeping the heavier model out of indexing work.

Current focus

Embeddings: models/gemini-embedding-001
Generation: gemini-3.1-flash-lite-preview

System_Arch_v2.4

End-to-End Pipeline

From sanitized logs to verified remediation.

Pilot combines sanitized incident input, public university HPC wiki context, and VectorDB-backed retrieval before generating findings for the selected university scope.

Step 01

Ingestion

Public university HPC wiki pages are configured ahead of time and prepared for retrieval.

Step 02

Anonymization

Sensitive identifiers are scrubbed before model processing.

Step 03

VectorDB Retrieval

The configured VectorDB retrieves public wiki excerpts scoped to the selected university.

Step 04

Synthesis

Gemini Flash Lite turns the retrieved public wiki context into findings, command guidance, and validation steps.

hpc-pilot / public-wiki-context
// Public wiki context retrieval

university_id: "hpc_nrw_shared"
source_scope: "public_hpc_wiki"
vector_db: "configured tenant-scoped retrieval"
vector_filter: { "university_id": "hpc_nrw_shared" }

Success: Retrieval restricted to configured public documentation.
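
The effect of a `vector_filter` like the one above can be illustrated with a small in-memory Python sketch: records are filtered by `university_id` first, and only the remaining ones are ranked by cosine similarity. This is not the configured VectorDB's API; the record layout, the `retrieve` function, and the toy two-dimensional vectors are assumptions for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: list[float], records: list[dict],
             university_id: str, top_k: int = 3) -> list[dict]:
    """Return the top_k most similar records inside one tenant's scope."""
    scoped = [r for r in records if r["university_id"] == university_id]
    ranked = sorted(scoped, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]


# Toy records; real embeddings have far more dimensions.
records = [
    {"university_id": "uni_siegen", "text": "a", "vector": [1.0, 0.0]},
    {"university_id": "uni_siegen", "text": "b", "vector": [0.0, 1.0]},
    {"university_id": "uni_rwth", "text": "c", "vector": [1.0, 0.0]},
]
hits = retrieve([1.0, 0.0], records, "uni_siegen", top_k=2)
```

Because the filter runs before ranking, a perfectly matching excerpt from another tenant (record "c" here) can never appear in the results.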

Public Wiki Context

Grounded in public university HPC documentation.

Pilot does not guess local cluster rules from scratch. It uses public HPC wiki data configured for the selected university to add source-grounded context to the diagnosis, actions, and citations.

  • Supporting context comes from public HPC wiki pages configured for the selected university.
  • Retrieved excerpts stay inside the selected university scope before the answer is generated.
  • Pilot links the response back to the public source pages used for the guidance.

Why this exists

Built from real HPC support pressure at University of Siegen.

In the ZIMT HPC department at the University of Siegen, many support tickets repeated the same HPC and Slurm failure patterns.

With the rise of LLMs and RAG, HPC-Pilot was designed as a practical AI system that helps users solve those issues directly from public wiki context, saving time for both researchers and department staff.

Origin story

"Repeated HPC support tickets became the signal: the same pain was showing up again and again, and public wiki context was already there waiting to be used better."

HPC-Pilot turns that observation into a practical workflow: users bring the incident, the system retrieves the relevant public wiki context, and the answer comes back in an operator-friendly order.

Workspace

Incident input

Add the university, the failure output, and any optional execution context you already have. Pilot supplements the answer with public wiki context from the selected institution.

How to use it

Add one incident the same way an operator would describe it.

This input area works best when the failure is entered in sequence: scope first, evidence second, execution context third.

Step 1 (Required)

Choose the university

Select the institution whose public HPC wiki should ground the answer.

Step 2 (Required)

Paste the failing log or error

Use the first failing lines plus the final error so Pilot can anchor the diagnosis.

Step 3 (Optional)

Add job goal and script context

Submission scripts, modules, partitions, and commands make the answer more specific.

Step 4 (Recommended)

Review retrieval depth and run analysis

Start with a tighter setting, then broaden only if you need more local documentation context.
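
For Step 2, a long log can be cut down to "the first failing lines plus the final error" before pasting. The Python sketch below is one possible way to do that by hand; the keyword list, the `trim_log` helper, and the sample log lines are invented for illustration and are not part of HPC-Pilot.

```python
import re

# Assumed keywords that mark a failing line; adjust for your own logs.
ERROR_HINT = re.compile(r"error|fail|fatal|killed|denied|exceeded", re.IGNORECASE)


def trim_log(log: str, head: int = 3) -> str:
    """Keep `head` lines starting at the first failure, plus the last failing line."""
    lines = [line for line in log.splitlines() if line.strip()]
    hits = [i for i, line in enumerate(lines) if ERROR_HINT.search(line)]
    if not hits:
        return "\n".join(lines[-head:])  # no obvious failure: keep the tail
    first, last = hits[0], hits[-1]
    keep = list(range(first, min(first + head, len(lines))))
    if last not in keep:
        keep.append(last)
    return "\n".join(lines[i] for i in keep)


# Made-up log text, not output from any real scheduler.
sample = """setup complete
ERROR: task 3 ran out of memory
stopping remaining tasks
ERROR: one task was terminated
cleanup done
job 123 FAILED"""
trimmed = trim_log(sample)
```

The trimmed text keeps the first failure with a little following context and the final error, and drops the unrelated setup and cleanup lines.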

Choose the university first so retrieval stays grounded in the correct public HPC wiki context.

Pilot retrieves supporting context from configured public HPC wiki pages for the selected university.

1 source group / 9 seed URLs

Use tighter retrieval for focused incidents and broader retrieval when the failure may span multiple local docs pages.

How much local wiki context should be used?

Lower values keep retrieval tight. Higher values draw on more local documentation but can make the answer less focused.

Optional

Sample for testing

Samples are only for a quick demo. They do not change the workflow beyond filling the form.

If the model key is not configured, the interface returns a direct setup error instead of a vague failure.
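
A fail-fast check of this kind might look like the Python sketch below. The environment variable name `HPC_PILOT_MODEL_KEY`, the `SetupError` class, and the error wording are hypothetical; the page does not say how the key is actually configured.

```python
import os
from typing import Mapping, Optional


class SetupError(RuntimeError):
    """Raised when required configuration is missing."""


def require_model_key(env: Optional[Mapping[str, str]] = None,
                      var: str = "HPC_PILOT_MODEL_KEY") -> str:
    """Return the configured key, or raise an error that names the missing setting."""
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise SetupError(
            f"Model key is not configured: set {var} before running an analysis."
        )
    return key
```

Checking the key before any analysis starts means the user sees a message that names the missing setting, not an opaque failure from a later request.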

Findings

Guidance and supporting evidence

Pilot organizes the answer into the same sequence most operators use: failure, cause, next action, validation, and sources.
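
The five-part sequence can be pictured as a small data structure that always renders in the same order. The Python sketch below is an illustration only; the `Findings` class, its field names, and the example values are assumptions, not HPC-Pilot's actual response schema.

```python
from dataclasses import dataclass, field


@dataclass
class Findings:
    failure: str
    cause: str
    next_actions: list[str]
    validation: list[str]
    sources: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the sections in operator order: failure through sources."""
        sections = [
            ("Failure", [self.failure]),
            ("Cause", [self.cause]),
            ("Next actions", self.next_actions),
            ("Validation", self.validation),
            ("Sources", self.sources),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


# Placeholder content; the URL is not a real source page.
report = Findings(
    failure="The job stopped after running out of memory.",
    cause="The requested memory was lower than the job needed.",
    next_actions=["Request more memory in the submission script."],
    validation=["Resubmit and confirm the job completes."],
    sources=["https://example.org/wiki/memory"],
).render()
```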

Run an incident analysis and the findings will appear here.

Need to validate a workflow or discuss the project?

Use the contact form for deployment questions, public wiki source setup, or research discussions around the HPC-Pilot workflow.
