Log-to-Fix
Paste one HPC incident and Pilot turns the failure into a grounded sequence of diagnosis, next actions, validation, and source links.
Ready
Privacy and analytics
HPC-Pilot is an AI project built to help the HPC community troubleshoot issues. We use optional usage analytics to understand performance, feature usage, and error trends. These events stay anonymized and are only enabled after you consent.
Analyze cluster failures, enrich the answer with public university HPC wiki context, and get grounded next actions for complex incidents.
Universities: 14
Source groups: 15
Topics: 3
Configured public HPC wiki pages are indexed ahead of time so the selected university can add local operational context to the answer.
9 seed URLs
Logs are sanitized before analysis: email addresses, IPs, usernames, and internal-looking paths are redacted before model processing.
Scrubbed before analysis
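A minimal sketch of what such a scrubbing pass could look like, assuming simple regular-expression rules. The patterns, the placeholder tokens, and the `scrub` function name are illustrative assumptions, not Pilot's actual sanitizer, and bare usernames outside of paths are not handled here.

```python
import re

# Hypothetical patterns; the real scrubbing rules are not published on this page.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_PATH = re.compile(r"/(?:home|scratch|work)/[^\s/:]+")


def scrub(log_text: str) -> str:
    """Replace emails, IPv4 addresses, and per-user path segments with placeholders."""
    text = EMAIL.sub("<EMAIL>", log_text)
    text = IPV4.sub("<IP>", text)
    # Keep the top-level directory but hide the username segment that follows it.
    text = USER_PATH.sub(lambda m: m.group(0).rsplit("/", 1)[0] + "/<USER>", text)
    return text
```

Emails are replaced first so that the later patterns never see fragments of an address.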
Configured tenants
These institutions currently have public HPC wiki context configured for troubleshooting and retrieval.
Graphical flow
Sources are ingested once, cleaned for tenant-aware retrieval, and prepared with a lighter embedding model before the generator is asked to respond.
01
YAML source definitions and scraper output are normalized into a repeatable document set that can be audited and refreshed.
02
PII is scrubbed, content is chunked, and each record is tagged with `university_id` before vectors are written.
03
Embeddings use `models/gemini-embedding-001`, while answer generation stays on `gemini-3.1-flash-lite-preview`, keeping the heavier model out of indexing work.
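The chunk-and-tag step in 02 can be sketched as follows. This is an illustration under stated assumptions: the `Chunk` record, the word-window sizes, and the `chunk_page` helper are hypothetical, and the embedding call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    university_id: str  # tenant tag used later to scope retrieval
    source_url: str
    position: int
    text: str


def chunk_page(text: str, university_id: str, source_url: str,
               max_words: int = 120, overlap: int = 20) -> list[Chunk]:
    """Split a cleaned wiki page into overlapping word windows tagged with the tenant id."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append(Chunk(university_id, source_url, position, " ".join(window)))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the page
    return chunks
```

Tagging every chunk at write time means the tenant filter can be applied at query time without re-reading the source pages.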
Current focus
Embeddings: models/gemini-embedding-001
Generation: gemini-3.1-flash-lite-preview
System_Arch_v2.4
End-to-End Pipeline
Pilot combines sanitized incident input, public university HPC wiki context, and VectorDB-backed retrieval before generating findings for the selected university scope.
Step 01
Public university HPC wiki pages are configured ahead of time and prepared for retrieval.
Step 02
Sensitive identifiers are scrubbed before model processing.
Step 03
The configured VectorDB retrieves public wiki excerpts scoped to the selected university.
Step 04
Gemini Flash Lite turns the retrieved public wiki context into findings, command guidance, and validation steps.
// Public wiki context retrieval
university_id: "hpc_nrw_shared"
source_scope: "public_hpc_wiki"
vector_db: "configured tenant-scoped retrieval"
vector_filter: { "university_id": "hpc_nrw_shared" }
Success: Retrieval restricted to configured public documentation.
Public Wiki Context
Pilot does not guess local cluster rules from scratch. It uses public HPC wiki data configured for the selected university to add source-grounded context to the diagnosis, actions, and citations.
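A rough sketch of tenant-scoped retrieval, assuming an in-memory list of records in place of the configured VectorDB. The record layout, the `retrieve` function, and the `top_k` parameter are assumptions for illustration. Only the `university_id` filter comes from the configuration shown above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], records: list[dict],
             university_id: str, top_k: int = 3) -> list[dict]:
    """Return the top_k records for one tenant, ranked by cosine similarity."""
    # Filter first so documents from other tenants can never be ranked.
    scoped = [r for r in records if r["university_id"] == university_id]
    ranked = sorted(scoped, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]
```

A smaller `top_k` corresponds to a tighter retrieval setting, and a larger one pulls in more of the selected university's documentation.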
Why this exists
During work in the ZIMT HPC department at the University of Siegen, the same HPC and Slurm failure patterns kept recurring across support tickets.
With the rise of LLMs and RAG, HPC-Pilot was designed as a practical AI system that helps users solve those issues directly from public wiki context, saving time for both researchers and department staff.
Origin story
"Repeated HPC support tickets became the signal: the same pain was showing up again and again, and public wiki context was already there waiting to be used better."
HPC-Pilot turns that observation into a practical workflow: users bring the incident, the system retrieves the relevant public wiki context, and the answer comes back in an operator-friendly order.
Workspace
Add the university, the failure output, and any optional execution context you already have. Pilot supplements the answer with public wiki context from the selected institution.
How to use it
This input area works best when the failure is entered in sequence: scope first, evidence second, execution context third.
Select the institution whose public HPC wiki should ground the answer.
Use the first failing lines plus the final error so Pilot can anchor the diagnosis.
Submission scripts, modules, partitions, and commands make the answer more specific.
Start with a tighter setting, then broaden only if you need more local documentation context.
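If the log is long, the evidence hint above (first failing lines plus the final error) can be approximated with a small helper before pasting. This is a sketch only: the keyword list and the `incident_excerpt` function are hypothetical and should be adapted to the schedulers and runtimes in use.

```python
import re

# Hypothetical failure markers; extend the list for your own tools.
FAILURE = re.compile(r"\b(?:error|fatal|failed|killed|cancelled)\b|out of memory",
                     re.IGNORECASE)


def incident_excerpt(log_text: str, first_n: int = 3) -> list[str]:
    """Return the first failing lines plus the final failing line of a log."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    hits = [i for i, line in enumerate(lines) if FAILURE.search(line)]
    if not hits:
        return lines[-first_n:]  # nothing matched: fall back to the tail of the log
    first, last = hits[0], hits[-1]
    excerpt = lines[first:first + first_n]
    if last >= first + first_n:
        excerpt.append(lines[last])  # final error lies outside the first window
    return excerpt
```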
Choose the university first so retrieval stays grounded in the correct public HPC wiki context.
Pilot retrieves supporting context from configured public HPC wiki pages for the selected university.
1 source group / 9 seed URLs
Use tighter retrieval for focused incidents and broader retrieval when the failure may span multiple local docs pages.
How much local wiki context should be used?
Lower values keep retrieval tight and focused. Higher values draw on more local documentation, at the cost of broader, less specific context.
Optional
Sample for testing
Samples are only for a quick demo. They do not change the workflow beyond filling the form.
Findings
Pilot organizes the answer into the same sequence most operators use: failure, cause, next action, validation, and sources.
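One way to picture that ordering is as a small record that always renders its sections in the same sequence. The `Findings` class and its field names below are illustrative assumptions, not Pilot's actual response schema.

```python
from dataclasses import dataclass, field


@dataclass
class Findings:
    failure: str
    cause: str
    next_actions: list[str] = field(default_factory=list)
    validation: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render sections in operator order: failure, cause, next action, validation, sources."""
        sections = [
            ("Failure", [self.failure]),
            ("Cause", [self.cause]),
            ("Next action", self.next_actions),
            ("Validation", self.validation),
            ("Sources", self.sources),
        ]
        out = []
        for title, items in sections:
            out.append(title)
            out.extend(f"  - {item}" for item in items)
        return "\n".join(out)
```

Keeping the order fixed in one place means every answer reads the same way, whatever the model returns first.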
Use the contact form for deployment questions, public wiki source setup, or research discussions around the HPC-Pilot workflow.