Mark Hildebrand

Logo

A Website

About

I like solving problems. I am most excited when I don’t initially know what the solution is going to look like because that provides the most opportunity for learning new things.

Projects

Here’s a list of programming projects that I have worked on, either as part of my graduate research, during internships, or for other extracurricular purposes. I’ve attached approximate dates associated with each project. These dates mainly correspond to the time when the projects were under active development. Many of these projects continued to be used to run experiments and received minor improvements and bug fixes.

Table of Contents

Intel Internship Projects

Research Related Projects: Most of the technical work for my PhD was completed by late 2021. After that, it was mostly a matter of minor bug fixes and running experiments for my dissertation.

Miscellaneous Projects


Internship Projects (Intel)

Dates: Approximately December 2020 or so (I was initially working on other related projects) to March 2021 full time on this project, then from April 2021 to March 2022 at 10% time.

This was work that I conducted while interning at Intel. Although I can’t link to any of the written code, I can provide some links to articles written by my awesome collaborators:

In a nut-shell, the name of the game is:

The approximate part comes from relaxing the restriction that the k nearest neighbors are the exact nearest neighbors. In other words, we’re willing to trade exactness for performance (to a point, we could always return k random neighbors for a really fast algorithm, but that wouldn’t be particularly useful).

Role: I was the primary developer behind this submission (writing code, debugging, performance optimization). Highlights of the library include:


AutoTM

Dates: We first had the idea for this around February 2019. I was able to write the code and generate results for the submission in late August of 2019. After that, I spent a couple months cleaning up the code for artifact submission and working on some improvements for other research.

Role: I worked with my advisors on the high level ideas behind this project and on designing what kind of experiments we wanted to run and which data to collect. Other than that, I wrote all of the code and performed all of the experiments.

Description: This is the project for my 2020 ASPLOS paper. At a high level, the goal of this project was given:

  1. A static computation graph for a convolutional neural network.
  2. Two memories, one fast and one slow with a size constraint on the fast memory.

To minimize graph execution time by controlling the placement and movement of intermediate data (tensors) in the fast and slow memories. The technique to achieve this was to first profile all kernels in the execution graph for their runtime performance for all possible memory locations of input and output parameters. Then, using this information and the computation graph, express the optimization problem as an Integer Linear Program (ILP) and offload all the thinking to Gurobi.

In detail, this project involved:

This involved messing around with the internals of nGraph, which was quite fun.


OneDNN Wrapper for Julia

Dates: I had the original idea in June 2020. Work done on-and-off from April 2021 to November 2021 to turn it into a functioning project.

Description: For my PhD work on heterogeneous memory, I needed high-performance neural network primitives where I controlled the pointers for the intermediate tensors. The solution was to create a Julia wrapper for Intel’s OneDNN library, which basically supplies the fastest CPU-based neural network primitives around.

This wrapper library features a bunch of cool things:


Embedding Tables

Dates: Serious performance oriented development begin in August 2021 and went through October of the same year. I did some performance re-engineering in March and April of 2022.

Description: High-performance embedding table lookup and stochastic gradient descent (SGD) operations. I implemented this as part of a very fast DLRM implementation. Some highlights include:

In addition, at one point in time, I had access to a Cooperlake CPU, which has hardware support for the BFloat16 data type. At that time, I was able to extend support for that data type as well as the “split SGD” optimization outlined in this paper. I’m looking forward to BFloat16 hardware becoming more available in the future because it offers interesting trade-offs when it comes to optimizing numerical software!


CachedArrays

Dates: The initial work on this project was done from April 2020 to June 2020 as a follow-on to AutoTM. I did an internship at Intel and came back to this project between about April 2021 and October 2021.

Description: In many respects, this is a generalization of the ideas presented in AutoTM. The idea is to introduce an array type in Julia that can both live and migrate between fast and slow memory, but isn’t limited to static computation graphs like AutoTM was.

CachedArrays Overview A general overview of the architecture of CachedArrays is shown in the figure above. The manager maintains two memory heaps: one for fast memory and one for slow. Data arrays that are allocated are backed by an “object”, which contains a pointer to the actual data backing the array. The actual data live in either fast or slow memory and can migrate between memory pools during runtime. The policy component is customizable on a per-application basis and can take semantic hints from the user to help direct data orchestration.

Heap Evict

In order to support migrating data from one memory pool to the other, the memory heaps in CachedArrys need to support eviction. Because memory allocations are not uniform in size, this potentially requires evicting multiple memory blocks at a time. The figure above shows what this looks like. Beginning at some entry point, the heap manager calls a call-back function on each allocated block it experiences. This callback is responsible for migrating the existing data to the remote memory pool. Free regions are skipped over. Once enough memory has been freed, a new larger allocation can be conducted.

More information on CachedArrays as well as its usage in CNN experiments can be found in Chapter 4 of my dissertation [link coming].


CounterTools

Dates: Some initial exploration in November 2019. The majority of this tool was developed in March 2020 when I finally decided that I really needed it.

Description: A small library I wrote for configuring and reading both core and uncore hardware performance counters on x86 Intel CPUs. This mainly works on Skylake/Cascade Lake server CPUs with some support for the Ice Lake IMC (integrated memory controller) counters.

Now, you’re probably asking yourself: “why do we need yet another performance counter library when things like perf and pcm exist?” The answer is: “Well, we really don’t.”

My motivation for writing this was:

This library worked really well for me, but I completely do not recommend it in favor of more industrial strength tools. 😛


SystemSnoop

Dates: Mostly developed between October 2018 to February 2019. This originally contained a lot more functionality, including reuse distance for Linux applications using idle page tracking. However, it was simplified to the more minimal tool it is now when I was working on AutoTM.

Description:SystemSnoop is one of those tiny utility packages that just end up being super useful. Though-out my PhD work, I did a lot of performance monitoring such as periodically collecting memory consumption of running applications, reading performance counters, taking time stamps, etc. This package helped automate that process using a simple measurement API:

Multiple instruments are collected into a NamedTuple which automatically broadcasts the API functions along each element of the tuple and constructs a NamedTuple of measurements. Though this is simple, it provided a uniform and reliable way of grouping together my desired measurements, collecting the results, and easily allowed for new measurement types to be added.


Miscellaneous Projects

Many-Core Place-and-Route

Dates: Work started on this tool in earnest in October 2017 (after a few previous no-so-great attempts) and continued until around September 2018. I was doing an internship in San Jose between July 2018 and late August 2018, so not much work was done on this project during that time.

Description: This was the first real project I worked on when I started graduate school at UC Davis at the VLSI Computation Lab. The “Mapper” is a research place-and-route tool for applications written targeting Kilocore. While it was originally meant to just automate the process of assigning tasks to cores for better simulation results (plus, no one wants to do this by hand for 1000+ tasks), I realized this could be a much more general tool to allow for design space exploration, both in network topology as well in place-and-route algorithms.

Here’s a link to a published paper where Figure 3 was placed by my tool: PDF Warning

This final tool works on arbitrary grids (i.e., rectilinear, hexagonal) as well as irregular grids (arbitrary processors can be removed and the tool will place around that). The two figures below show the same application (I think it’s some kind of FFT application) mapped to two different topologies. In these figures, squares represent processors (or memory tiles like on the bottom), cyan represents a task mapped to a tile, and the lines (black, blue, or red) indicate communication links between tiles.

This is a test Figure: Rectilinear Topology with memory along the bottom.

Figure: Hexagonal Topology with memory along the bottom.

Figure: Routing around a missing block of processors.

Additionally, the tool allows for extension of the placement objective. For example, the figure below shows two different placement strategies for the older AsAP 2 architecture, one where the total number of hops is minimized (left) and one where the maximum hop length is minimized (right). Though the placement on the left has fewer total hops, the one on the right is faster because the maximum hop between two communicating processors is 2 (AsAP 2 had a quirk where link frequency decreased with hop count). The tool could support all kinds of flexible tweaks like this. Plus, it was very fast.

Figure: Tailoring placement objective to minimize the maximum link length. Left: 237 hops with a maximum length of 3. Right: 262 hops with a maximum length of 2.


DE10-Lite HDL

Dates: A few days here and there during August and September 2018.

Description: I worked as a teaching assistant (TA) for the UC Davis “intro to Verilog-based digital design” class for many quarters (like 5 or 6? I don’t remember). Over time, I was tasked with creating the lab exercises the students did every week. Now, this class used the DE10 Lite from Terasic with an Intel Max 10 FPGA and includes all these fun peripherals like SDRAM, an accelerometer, VGA port etc. However, we basically never used them in the class because:

  1. None of the TAs had bothered getting them to work and the reference code was scary to say the least.
  2. This was an intro class so we could only get so complicated with the labs we created.

To that end, I spent some time getting all the peripherals working with (hopefully) clean, simple interfaces that students could actually use and (again hopefully) clear enough Verilog to serve as an example for student’s designs. Features of this library include:

Nothing too fancy, but I did get it to the point where I could transfer a “gif” from my computer to the FPGA and have it be displayed on a screen through the board’s VGA cable, so that was fun. I hope future TAs of the class found it useful.