HPC-Pilot: Tenant-Aware AI Copilot for University HPC Support

Analyze cluster failures, enrich the answer with public university HPC wiki context, and get grounded next actions for complex incidents.

View architecture

Log-to-Fix

Paste one HPC incident and Pilot turns the failure into a grounded sequence of diagnosis, next actions, validation, and source links.

Public Wiki Context

Configured public HPC wiki pages are indexed ahead of time so the selected university can add local operational context to the answer.

9 seed URLs

PII Scrubbing

Logs are sanitized before analysis: email addresses, IPs, usernames, and internal-looking paths are redacted before model processing.

Scrubbed before analysis
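
As a rough illustration of this kind of scrubbing, here is a minimal Python sketch. The regular expressions, the placeholder tokens (`<EMAIL>`, `<IP>`, `<USER>`, `<PATH>`), and the `scrub` function are assumptions made for this example; the rules Pilot actually applies are not shown on this page.

```python
import re

# Hypothetical redaction rules for illustration only; the patterns HPC-Pilot
# actually applies are not documented on this page.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_FIELD = re.compile(r"\b(user(?:name)?)=\S+", re.IGNORECASE)
INTERNAL_PATH = re.compile(r"(/(?:home|work|scratch))/[^\s:,'\"]+")


def scrub(log_text: str) -> str:
    """Replace identifiers with placeholder tokens before any model sees the log."""
    text = EMAIL.sub("<EMAIL>", log_text)
    text = IPV4.sub("<IP>", text)
    text = USER_FIELD.sub(r"\1=<USER>", text)
    # Keep the top-level directory so the storage area stays visible,
    # but drop the user-specific remainder of the path.
    text = INTERNAL_PATH.sub(r"\1/<PATH>", text)
    return text


sample = (
    "slurmstepd: error: user=jdoe42 from 10.12.0.7 cannot write "
    "/home/jdoe42/run/out.log, contact jane.doe@uni-siegen.de"
)
scrubbed = scrub(sample)
print(scrubbed)
```

Keeping the top-level directory (`/home`, `/work`, `/scratch`) is a design choice in this sketch: it preserves a hint about which filesystem was involved while removing the user-specific part.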

Configured tenants

Universities currently available in the system.

These institutions currently have public HPC wiki context configured for troubleshooting and retrieval.

3 enabled tenants

HPC.NRW

hpc_nrw_shared

Use the central HPC Wiki as umbrella public context for site-independent NRW HPC guidance.

1 source group / 9 seed URLs

RWTH Aachen

uni_rwth

Use RWTH Slurm documentation during retrieval.

1 source group / 2 seed URLs

University of Siegen

uni_siegen

Use University of Siegen OMNI HPC documentation during retrieval.

1 source group / 3 seed URLs

Graphical flow

How the registry becomes retrieval-ready context.

Sources are ingested once, cleaned for tenant-aware retrieval, and prepared with a lighter embedding model before the generator is asked to respond.

Registry ingestion

YAML source definitions and scraper output are normalized into a repeatable document set that can be audited and refreshed.
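
One way to picture that normalization step is the sketch below. It assumes the YAML has already been parsed into a plain dictionary; the field names (`enabled`, `source_groups`), the `SeedDocument` type, the group names, and the `example.org` URLs are all hypothetical and stand in for the real registry schema, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedDocument:
    """One registry entry after normalization (hypothetical shape)."""
    university_id: str
    source_group: str
    url: str


def normalize_registry(registry: dict) -> list[SeedDocument]:
    """Flatten a parsed registry into a deduplicated, stably ordered document set."""
    docs = set()
    for university_id, tenant in registry.items():
        if not tenant.get("enabled", True):
            continue  # disabled tenants contribute no documents
        for group_name, urls in tenant.get("source_groups", {}).items():
            for url in urls:
                # Trim whitespace and trailing slashes so duplicates collapse.
                docs.add(SeedDocument(university_id, group_name, url.strip().rstrip("/")))
    # A fixed sort order makes repeated refreshes easy to diff and audit.
    return sorted(docs, key=lambda d: (d.university_id, d.source_group, d.url))


# Placeholder registry; tenant ids come from this page, everything else is invented.
registry = {
    "uni_siegen": {
        "enabled": True,
        "source_groups": {
            "omni_docs": [
                "https://example.org/omni/slurm/",
                "https://example.org/omni/slurm",
                "https://example.org/omni/access",
            ]
        },
    },
    "uni_rwth": {
        "enabled": False,
        "source_groups": {"slurm_docs": ["https://example.org/rwth/slurm"]},
    },
}
docs = normalize_registry(registry)
for doc in docs:
    print(doc)
```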

Embedding preparation

PII is scrubbed, content is chunked, and each record is tagged with `university_id` before vectors are written.
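
A minimal sketch of the chunk-and-tag step, assuming fixed-size character windows with overlap. The chunk size, the overlap, the record fields other than `university_id`, and the function names are illustrative assumptions; the embedding call and the vector write are deliberately left out.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks


def build_records(university_id: str, source_url: str, text: str) -> list[dict]:
    """Tag every chunk with its tenant before it is embedded and stored."""
    return [
        {
            "university_id": university_id,
            "source_url": source_url,
            "chunk_index": index,
            "text": chunk,
        }
        for index, chunk in enumerate(chunk_text(text))
    ]


# Placeholder page text and URL, used only to exercise the sketch.
records = build_records("uni_siegen", "https://example.org/omni/slurm", "a" * 500)
print(len(records), [len(r["text"]) for r in records])
```

Tagging at write time is what makes the later tenant filter cheap: the retrieval step only has to compare one field per record.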

Model savings

Embeddings use `models/gemini-embedding-001`, while answer generation stays on `gemini-3.1-flash-lite-preview`, keeping the heavier model out of indexing work.

Current focus

Embeddings: `models/gemini-embedding-001`

Generation: `gemini-3.1-flash-lite-preview`

System_Arch_v2.4

End-to-End Pipeline

From sanitized logs to verified remediation.

Pilot combines sanitized incident input, public university HPC wiki context, and VectorDB-backed retrieval before generating findings for the selected university scope.

Step 01

Ingestion

Public university HPC wiki pages are configured ahead of time and prepared for retrieval.

Step 02

Anonymization

Sensitive identifiers are scrubbed before model processing.

Step 03

VectorDB Retrieval

The configured VectorDB retrieves public wiki excerpts scoped to the selected university.

Step 04

Synthesis

Gemini Flash Lite turns the retrieved public wiki context into findings, command guidance, and validation steps.

hpc-pilot / public-wiki-context

// Public wiki context retrieval

university_id: "hpc_nrw_shared"
source_scope: "public_hpc_wiki"
vector_db: "configured tenant-scoped retrieval"
vector_filter: { "university_id": "hpc_nrw_shared" }

Success: Retrieval restricted to configured public documentation.
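
The effect of a filter like `vector_filter` can be sketched with a small in-memory example. This is not the configured VectorDB's API: the record layout, the `retrieve` function, the `top_k` parameter, and the two-dimensional vectors are assumptions chosen to show the idea that filtering by `university_id` happens before similarity ranking.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: list[float], records: list[dict],
             university_id: str, top_k: int = 3) -> list[dict]:
    """Filter to one tenant first, then rank the remaining records by similarity."""
    scoped = [r for r in records if r["university_id"] == university_id]
    ranked = sorted(scoped, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]


# Toy records with invented ids and vectors.
records = [
    {"id": "siegen-a", "university_id": "uni_siegen", "vector": [1.0, 0.0]},
    {"id": "siegen-b", "university_id": "uni_siegen", "vector": [0.8, 0.6]},
    {"id": "siegen-c", "university_id": "uni_siegen", "vector": [0.0, 1.0]},
    {"id": "rwth-x", "university_id": "uni_rwth", "vector": [1.0, 0.0]},
]
hits = retrieve([1.0, 0.0], records, university_id="uni_siegen", top_k=2)
print([h["id"] for h in hits])
```

In this toy data, `rwth-x` matches the query exactly but never appears in the results, because it is removed by the tenant filter before ranking.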

Public Wiki Context

Grounded in public university HPC documentation.

Pilot does not guess local cluster rules from scratch. It uses public HPC wiki data configured for the selected university to add source-grounded context to the diagnosis, actions, and citations.

Supporting context comes from public HPC wiki pages configured for the selected university.
Retrieved excerpts stay inside the selected university scope before the answer is generated.
Pilot links the response back to the public source pages used for the guidance.

Why this exists

Built from real HPC support pressure at University of Siegen.

Support tickets handled in the ZIMT HPC department at the University of Siegen kept repeating the same HPC and Slurm failure patterns.

With the rise of LLMs and RAG, HPC-Pilot was designed as a practical AI system that helps users solve those issues directly from public wiki context, saving time for both researchers and department staff.

Origin story

"Repeated HPC support tickets became the signal: the same pain was showing up again and again, and public wiki context was already there waiting to be used better."

HPC-Pilot turns that observation into a practical workflow: users bring the incident, the system retrieves the relevant public wiki context, and the answer comes back in an operator-friendly order.

Workspace

Incident input

Add the university, the failure output, and any optional execution context you already have. Pilot supplements the answer with public wiki context from the selected institution.

How to use it

Add one incident the same way an operator would describe it.

This input area works best when the failure is entered in sequence: scope first, evidence second, execution context third.

Step 1 (Required)

Choose the university

Select the institution whose public HPC wiki should ground the answer.

Step 2 (Required)

Paste the failing log or error

Use the first failing lines plus the final error so Pilot can anchor the diagnosis.

Step 3 (Optional)

Add job goal and script context

Submission scripts, modules, partitions, and commands make the answer more specific.

Step 4 (Recommended)

Review retrieval depth and run analysis

Start with a tighter setting, then broaden only if you need more local documentation context.
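
For the log step above, "the first failing lines plus the final error" can be pulled out of a long log with a small helper like the one below. It is a sketch for preparing input by hand, not part of Pilot: the keyword list, the `head` parameter, and the sample log are all assumptions.

```python
import re

# Hypothetical keyword list; adjust it to the scheduler and applications in use.
ERROR_HINT = re.compile(r"error|fail|killed|denied|oom", re.IGNORECASE)


def incident_excerpt(log_text: str, head: int = 3) -> list[str]:
    """Return the first failing lines plus the final error line of a log."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    hits = [i for i, ln in enumerate(lines) if ERROR_HINT.search(ln)]
    if not hits:
        return lines[-head:]  # nothing looks like an error: fall back to the tail
    first, last = hits[0], hits[-1]
    excerpt = lines[first:first + head]
    if last >= first + head:
        excerpt.append(lines[last])  # final error sits outside the first window
    return excerpt


# Invented log lines, used only to exercise the helper.
sample_log = """Loading module gcc/12
Starting job 4211
srun: error: node042: task 3: Out Of Memory
step 1 cleanup
step 2 cleanup
step 3 cleanup
slurmstepd: error: Detected 1 oom_kill event in StepId=4211.0
"""
excerpt = incident_excerpt(sample_log)
print("\n".join(excerpt))
```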

University (Required)

Choose the university first so retrieval stays grounded in the correct public HPC wiki context.

Pilot retrieves supporting context from configured public HPC wiki pages for the selected university.

1 source group / 9 seed URLs

Documentation context (Recommended)

Use tighter retrieval for focused incidents and broader retrieval when the failure may span multiple local docs pages.

How much local wiki context should be used?

Lower values retrieve fewer, more closely matched excerpts. Higher values pull in more local documentation but can dilute the answer with less relevant material.

Optional

Sample for testing

Samples are only for a quick demo. They do not change the workflow beyond filling the form.

Findings

Guidance and supporting evidence

Pilot organizes the answer into the same sequence most operators use: failure, cause, next action, validation, and sources.
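
That ordering can be thought of as a small data structure. The sketch below is an assumption about shape only: the `Findings` class, its field names, the rendered layout, and the example content are invented here and do not describe Pilot's actual response format.

```python
from dataclasses import dataclass, field


@dataclass
class Findings:
    """Hypothetical container mirroring the operator-friendly order."""
    failure: str
    cause: str
    next_actions: list[str]
    validation: list[str]
    sources: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Emit sections in a fixed order: failure, cause, action, validation, sources."""
        sections = [
            ("Failure", [self.failure]),
            ("Cause", [self.cause]),
            ("Next action", self.next_actions),
            ("Validation", self.validation),
            ("Sources", self.sources),
        ]
        out = []
        for title, items in sections:
            out.append(f"{title}:")
            out.extend(f"  - {item}" for item in items)
        return "\n".join(out)


# Invented example content; the URL is a placeholder, not a real wiki page.
example = Findings(
    failure="Job step was killed by the out-of-memory handler",
    cause="Requested memory was lower than the job's peak usage",
    next_actions=["Raise --mem in the submission script and resubmit"],
    validation=["Compare MaxRSS from sacct against the new request"],
    sources=["https://example.org/hpc-wiki/slurm-memory"],
)
report = example.render()
print(report)
```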

Run an incident analysis and the findings will appear here.

Need to validate a workflow or discuss the project?

Use the contact form for deployment questions, public wiki source setup, or research discussions around the HPC-Pilot workflow.
