Aengus McGuinness
Software engineer working on systems and infrastructure. Currently building path-critical infrastructure at Axilon. Previously at Los Alamos National Laboratory and the Harvard MCB department, where I spent four years on computational biology and high-performance computing.
I’m most interested in the places where careful engineering meets hard real-world problems — distributed systems, low-latency networking, ML infrastructure, and the layers below.
Now
Early engineer on path-critical infrastructure at Axilon, which brings LLMs into industrial operations and engineering workflows; the work sits at the intersection of applied ML, reliability engineering, and customer-specific deployment. Backed by BoxGroup and RTX Ventures.
Before
Designed distributed deep learning pipelines on GPU clusters under SLURM. Integrated Bayesian optimization into existing OpenFold workflows, achieving 2× validation improvement while keeping execution scalable across compute nodes.
Optimized distributed HPC pipelines processing terabyte-scale biological datasets on SLURM clusters. Reduced workflow runtime by 80% through parallelization, memory tuning, and GPU-accelerated data processing modules.
Selected Projects
libibverbs · report
A key-value cache with three communication paths: TCP/RPC, two-sided RDMA, and one-sided RDMA reads over a registered hash-table memory region. The one-sided path bypasses the server CPU entirely.
On CloudLab Mellanox hardware: 974k ops/s with stable 12 μs p99 latency on one-sided reads — a 13× throughput improvement and 20× tail-latency reduction over the TCP baseline. Adding RDMA FETCH_AND_ADD atomics for recency tracking imposes a consistent 3–4 μs p99 penalty, isolating the cost of cache-policy maintenance on the read path.
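Roughly, the one-sided read path can be sketched with libibverbs as below. This is an illustrative sketch rather than the project's code: it assumes a connected RC queue pair, a registered local buffer, and a remote address and rkey for the server's hash-table region exchanged during setup; the function and variable names are placeholders.

// Sketch: posting a one-sided RDMA read against the server's registered
// hash-table region. Assumes a connected RC queue pair (qp), a local buffer
// registered as local_mr, and the remote address/rkey exchanged out of band
// during setup. Names are illustrative, not the project's actual code.
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

int post_one_sided_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge;
    std::memset(&sge, 0, sizeof(sge));
    sge.addr   = reinterpret_cast<uintptr_t>(local_buf); // where the read lands
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    struct ibv_send_wr wr, *bad_wr = nullptr;
    std::memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;                 // echoed back in the completion
    wr.opcode              = IBV_WR_RDMA_READ;  // one-sided: no server CPU involvement
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; // generate a completion entry
    wr.wr.rdma.remote_addr = remote_addr;       // bucket address inside the registered region
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);     // 0 on success
}

// The completion is then reaped by polling the CQ, e.g.:
//   struct ibv_wc wc;
//   while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin */ }

The server stays passive on this path; the NIC services the read directly from the registered region, which is what drives the throughput and tail-latency gap over the TCP baseline.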
A two-phase study of hardware prefetching via stream buffers: Jouppi’s original fixed-depth design and Palacharla & Kessler’s adaptive extension, both implemented as Intel Pin tools with a two-level cache hierarchy and explicit latency model.
The adaptive policy learns stream-length distributions online via a histogram and chooses prefetch depth dynamically. Across SPEC CPU2006 workloads it achieves 5.02× speedup on libquantum and 83% prefetch accuracy on dealII (versus 78% for static next-line), while reducing wasted memory bandwidth on irregular workloads by an order of magnitude. Total hardware cost: 250–400 bytes.
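The depth-selection idea can be sketched as a small policy object, shown below. This is a toy illustration, not the Pin tool itself; the 16-entry histogram, the 90% coverage target, and all names are assumptions made for the example.

// Toy sketch of adaptive prefetch-depth selection: record how long each
// detected stream lived in a small histogram, then pick the smallest depth
// that covers most observed streams. Sizes and thresholds are illustrative.
#include <array>
#include <cstdint>

class AdaptiveDepthPolicy {
public:
    static constexpr int kMaxDepth = 16;

    // Called when a stream buffer is deallocated after servicing `length` hits.
    void record_stream_length(int length) {
        if (length > kMaxDepth) length = kMaxDepth;
        if (length < 1) length = 1;
        ++histogram_[length - 1];
        ++total_;
    }

    // Pick the smallest depth whose cumulative mass covers ~90% of observed
    // streams; prefetching deeper than that mostly wastes bandwidth on
    // streams that die early.
    int choose_depth() const {
        if (total_ == 0) return 4;  // cold-start default
        uint64_t covered = 0;
        for (int d = 1; d <= kMaxDepth; ++d) {
            covered += histogram_[d - 1];
            if (covered * 10 >= total_ * 9) return d;
        }
        return kMaxDepth;
    }

private:
    std::array<uint64_t, kMaxDepth> histogram_{};  // stream-length counts
    uint64_t total_ = 0;
};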
An asynchronous RPC client built around gRPC completion queues and a multi-threaded polling architecture. Increased throughput from ~8k to 55k+ RPC/s via batching, non-blocking I/O, and flow-control tuning.
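The completion-queue pattern looks roughly like the sketch below: one path posts asynchronous RPCs and tags each with a per-call state object, while poller threads drain the queue. The KvService, GetRequest, and GetReply types are hypothetical stand-ins for generated proto classes; the CompletionQueue and ClientAsyncResponseReader calls are the standard gRPC C++ async API.

// Sketch of the async client pattern. The proto types and service name are
// hypothetical; only the gRPC completion-queue machinery is real.
#include <grpcpp/grpcpp.h>
#include "kv.grpc.pb.h"  // hypothetical generated header for KvService
#include <memory>
#include <string>

struct AsyncCall {  // per-RPC state, owned until the completion fires
    GetReply reply;
    grpc::ClientContext context;
    grpc::Status status;
    std::unique_ptr<grpc::ClientAsyncResponseReader<GetReply>> rpc;
};

class AsyncKvClient {
public:
    explicit AsyncKvClient(std::shared_ptr<grpc::Channel> channel)
        : stub_(KvService::NewStub(channel)) {}

    // Issue path: never blocks on the network.
    void IssueGet(const std::string& key) {
        GetRequest req;
        req.set_key(key);
        auto* call = new AsyncCall;
        call->rpc = stub_->AsyncGet(&call->context, req, &cq_);
        call->rpc->Finish(&call->reply, &call->status, call);  // tag = call ptr
    }

    // Each poller thread blocks in Next() and handles completions.
    void PollLoop() {
        void* tag;
        bool ok;
        while (cq_.Next(&tag, &ok)) {
            auto* call = static_cast<AsyncCall*>(tag);
            // ... handle call->status and call->reply ...
            delete call;
        }
    }

private:
    std::unique_ptr<KvService::Stub> stub_;
    grpc::CompletionQueue cq_;
};

Running several PollLoop threads against one queue is what decouples issue rate from completion handling; batching and flow-control tuning then determine how far the issue path can be pushed.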
Writing
Coming soon.