Analyzing your Program's Memory by Inspecting the Linux Page Table

When optimizing code for performance, it is often useful to have a very clear idea about what happens in memory. Reducing page faults, cache misses, TLB misses et cetera can be a major factor in speeding up your code. In this post I will demonstrate how you can inspect paging information regarding the memory of your process under Linux.

To get meaningful information from the paging data, one must roughly understand how the paging system works on Linux, so I will start by giving a coarse overview in the section Linux Paging Primer. If you already know your way around Linux's paging system, you can skip ahead. I will use C/C++ throughout this article, but everything in here should work in every language that gives you access to raw pointers or raw memory addresses. Some details of paging differ between CPU architectures. I am using x86-64/amd64 in this article.

Linux Paging Primer

The memory addresses our program sees (i.e., the numeric values of pointers) are addresses in the so-called virtual memory space. Virtual memory is associated with some “real” memory using paging. Every running program gets its own virtual memory space; for example, our program can have a pointer with value 0xc0ffee point to some data in physical memory, and a different program, running in parallel, can have a pointer with value 0xc0ffee point to some completely different data.[1] Virtual memory addresses can also point to things other than “data in physical memory”; for example, we can have pointers into memory-mapped files. In the following, I will ignore these special cases and just care about data in memory.

Virtual memory is divided into contiguous pages of 4kB, or 4096 bytes, each (ignoring huge pages, which are larger). The first page starts at address 0x0, the second page starts at 0x1000, the third page at 0x2000, and so on. To actually do something useful with an address in virtual memory, the page containing that address must be mapped to something. Since we only consider pages pointing into physical memory in this post, we can assume that a page is either unmapped, or mapped to a frame in physical memory. A frame is the physical-memory equivalent of a page: a 4kB block of contiguous physical memory. See Figure 1 for a visualization.
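To make the address arithmetic concrete, here is a minimal sketch that splits a virtual address into its page index and its offset inside that page. The names kPageSize, page_index_of, page_offset_of and page_start_of are made up for this illustration, and the 4096-byte page size is the assumption used throughout this article (on a real system you could query it with sysconf(_SC_PAGESIZE)).

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4096-byte pages, as everywhere in this article.
constexpr std::uint64_t kPageSize = 4096;

// Index of the page that contains addr (page 0 starts at address 0x0).
constexpr std::uint64_t page_index_of(std::uint64_t addr) {
  return addr / kPageSize;
}

// Offset of addr inside its page, in bytes.
constexpr std::uint64_t page_offset_of(std::uint64_t addr) {
  return addr % kPageSize;
}

// First address of the page that contains addr.
constexpr std::uint64_t page_start_of(std::uint64_t addr) {
  return addr - page_offset_of(addr);
}

// Small self-check mirroring the page start addresses mentioned in the text.
void check_page_arithmetic() {
  assert(page_index_of(0x0) == 0);
  assert(page_index_of(0x1000) == 1);
  assert(page_index_of(0x2000) == 2);
}
```

For example, the address 0x2abc lies in the page with index 2, at offset 0xabc from the page start 0x2000.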

Figure 1: Pages and Frames

The Page Table, the MMU and the TLB

With the relationship between pages and frames as above, we need some bookkeeping to keep track of which page is mapped to which frame, so that we can resolve virtual addresses to physical ones. This is done in the page table. The page table is, despite its name, not just one big table, but actually a hierarchical data structure with multiple levels. However, it is hard to make general statements about “the page table”, since its structure is completely defined by the CPU architecture, and thus differs from architecture to architecture. Fortunately, the Linux kernel provides a nice, more-or-less architecture-independent interface to the page table, as we will see in the next section.

The reason that the page table format is defined by the CPU architecture is that the operation of resolving a virtual address to a physical address is implemented in hardware in the CPU, or more precisely in its Memory Management Unit (MMU), which is part of virtually all modern CPUs. If one had to invoke the OS kernel every time an address needs resolving, that would be way too slow.

In fact, it would even be too slow if the MMU had to consult the whole multi-level page table for every address resolution. That is why MMUs usually include a Translation Lookaside Buffer (TLB). The TLB is essentially a small cache of mappings from pages to frames. The hope is that most applications have a property called locality of reference, or, in simpler terms: consecutive memory accesses usually land close to each other. That means that right after a page-to-frame lookup was performed, we hope that the next memory accesses will happen in the same page, whose mapping the MMU now has in its TLB.
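To illustrate why locality matters for the TLB, the following sketch counts how many distinct pages a regular access pattern touches; every distinct page needs its own page-to-frame mapping. The function distinct_pages_touched is hypothetical and assumes 4096-byte pages. It only models addresses and does not measure actual TLB behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Counts the distinct pages hit by `count` accesses that start at address
// `base` and advance by `stride` bytes each time.
// Assumption: 4096-byte pages.
std::size_t distinct_pages_touched(std::uint64_t base, std::uint64_t stride,
                                   std::size_t count) {
  constexpr std::uint64_t kPageSize = 4096;
  std::set<std::uint64_t> pages;
  for (std::size_t i = 0; i < count; ++i) {
    pages.insert((base + i * stride) / kPageSize);
  }
  return pages.size();
}

// Usage example: dense versus page-strided access.
void locality_example() {
  // 512 sequential 8-byte accesses from a page boundary stay in one page.
  assert(distinct_pages_touched(0, 8, 512) == 1);
  // The same number of accesses with a 4096-byte stride hits 512 pages.
  assert(distinct_pages_touched(0, 4096, 512) == 512);
}
```

The dense pattern can be served from a single TLB entry, while the strided pattern needs a new mapping for every single access.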

How to Programmatically Read Paging Information

Under Linux, there are three main “files” in the proc pseudo-filesystem that are of interest if we want to know about the paging of our process. Two of them are found under /proc/self, which is a shortcut for the procfs entries of the current process. If we want to inspect the memory of a process other than the current one, we need to know its process ID (PID), and look inside /proc/<PID>. The three files we will use are:

  • /proc/self/maps (or /proc/<PID>/maps) contains information about which areas of our virtual address space are mapped to what (remember that virtual addresses can not only be mapped to physical memory, but also to files).
  • /proc/self/pagemap (or /proc/<PID>/pagemap) gives more detailed information about each mapped page, and especially gives us the number of the frame that a page is mapped to, if that page is mapped to physical memory.
  • /proc/kpageflags provides status information about frames. To use it, we need to know the frame number of the page we are interested in.

Note that pagemap and kpageflags can effectively[2] only be read by user root.[3] This is because detailed information about memory status, especially frame numbers and some of the status bits in /proc/kpageflags, was used to facilitate attacks such as Rowhammer, which rely on knowledge about the physical layout of memory. As a consequence, Linux started in version 4.0 to restrict access to this information to privileged processes.

So, if we want to get information about the paging of a certain memory address, we need a two-step process: First, get the frame number of the respective page from the pagemap file, then look up the correct frame in kpageflags. This is what I demonstrate in the next section. If we want to get more general information about the whole of our application’s allocated memory, e.g. the number of swapped-out pages in our memory, it becomes a three-step process: First, we must determine the range of pages in our virtual memory mapped to physical memory by inspecting /proc/self/maps, and then perform the two steps from above for each such page. I demonstrate this process in the last section.

Getting Pageflags for a Fixed Address

Let’s start by retrieving the page flags that apply to some virtual memory address, given as a pointer void * ptr. First, we need to get the index of the page that this memory is located in. Conceptually, we round the memory address down to the previous multiple of 4096, and then divide by 4096. Since C/C++ integer division of unsigned values truncates, a single division does both:

// Index of the (4096-byte) page that contains ptr.
size_t page_index(void *ptr) {
  return reinterpret_cast<size_t>(ptr) / 4096;
}

The pagemap format specifies that for each page, there is a 64-bit entry in /proc/<PID>/pagemap, with the lower 55 bits (bits 0 to 54) holding the page frame number (PFN) if the page is present. To get this PFN, we first seek to the correct position, then we read 64 bits and extract the PFN from it:

#include <cstdint>
#include <cstdio>

size_t get_pfn(void *ptr) {
  size_t pi = page_index(ptr); // see above

  // /proc/self points to the procfs folder of the current process, so
  // we don't need to figure out our PID first.
  auto fp = std::fopen("/proc/self/pagemap", "rb");
  if (fp == nullptr) {
    return 0; // could not open pagemap; treat like "not present"
  }

  // Each entry in pagemap is 64 bits, i.e., 8 bytes, so the entry for
  // page number pi starts at byte pi * 8. Read it into a uint64_t.
  uint64_t page_info = 0;
  bool ok = std::fseek(fp, static_cast<long>(pi * 8), SEEK_SET) == 0 &&
            std::fread(&page_info, 8, 1, fp) == 1;
  std::fclose(fp);
  if (!ok) {
    return 0;
  }

  // Check if the page is present (bit 63). Otherwise, there is no PFN.
  if (page_info & (static_cast<uint64_t>(1) << 63)) {
    // Create a mask that has ones in the lowest 55 bits (bits 0 to 54),
    // and use that to extract the PFN.
    uint64_t pfn = page_info & ((static_cast<uint64_t>(1) << 55) - 1);
    return static_cast<size_t>(pfn);
  } else {
    // page not present
    return 0;
  }
}
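Besides the PFN, a pagemap entry carries a handful of flag bits. The sketch below decodes one raw 64-bit entry into a struct. The bit positions are the ones I take from the kernel's pagemap documentation (bits 0 to 54 PFN, bit 55 soft-dirty, bit 56 exclusively mapped, bit 61 file-backed or shared-anonymous, bit 62 swapped, bit 63 present); the names PagemapEntry, bit_set and decode_pagemap_entry are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Decoded view of one 64-bit pagemap entry (hypothetical helper type).
struct PagemapEntry {
  bool present;              // bit 63: page is in physical memory
  bool swapped;              // bit 62: page is swapped out
  bool file_or_shared_anon;  // bit 61: file-backed or shared-anonymous page
  bool exclusive;            // bit 56: page is mapped exclusively
  bool soft_dirty;           // bit 55: soft-dirty bit
  std::uint64_t pfn;         // bits 0-54, only meaningful if present
};

constexpr bool bit_set(std::uint64_t value, unsigned bit) {
  return ((value >> bit) & 1u) != 0;
}

PagemapEntry decode_pagemap_entry(std::uint64_t raw) {
  constexpr std::uint64_t kPfnMask = (std::uint64_t{1} << 55) - 1;
  PagemapEntry e{};
  e.present = bit_set(raw, 63);
  e.swapped = bit_set(raw, 62);
  e.file_or_shared_anon = bit_set(raw, 61);
  e.exclusive = bit_set(raw, 56);
  e.soft_dirty = bit_set(raw, 55);
  // For non-present pages the low bits do not hold a PFN, so report 0.
  e.pfn = e.present ? (raw & kPfnMask) : 0;
  return e;
}

// Usage example with a hand-built entry: present, PFN 0x1234.
void decode_example() {
  const std::uint64_t raw = (std::uint64_t{1} << 63) | 0x1234;
  const PagemapEntry e = decode_pagemap_entry(raw);
  assert(e.present && !e.swapped && e.pfn == 0x1234);
}
```

Keeping the decoding separate from the file access makes it easy to test the bit fiddling without root privileges.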

Now that we have the page frame number in hand, we can use it as an index into /proc/kpageflags to get to our page’s flags. Again, each entry is 64 bits long, so we must skip ahead in 8-byte steps:

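size_t get_pflags(void *ptr) {
  size_t pfn = get_pfn(ptr);
  if (pfn == 0) { return 0; } // non-present pages (or PFN hidden from us)

  auto fp = std::fopen("/proc/kpageflags", "rb");
  if (fp == nullptr) { return 0; } // typically: not running as root

  // Each entry is 64 bits, so the entry for frame pfn starts at byte pfn * 8.
  uint64_t pflags = 0;
  bool ok = std::fseek(fp, static_cast<long>(pfn * 8), SEEK_SET) == 0 &&
            std::fread(&pflags, 8, 1, fp) == 1;
  std::fclose(fp);

  return ok ? static_cast<size_t>(pflags) : 0;
}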

I again refer to the excellent pagemap documentation for a list of available flags (and their numeric values). If we want to know, for example, whether our pointer is allocated inside a huge page, this is how we could do it:

bool is_huge(void *ptr) {
  uint64_t flags = get_pflags(ptr);

  return
    (flags & (static_cast<uint64_t>(1) << 17)) || // "huge" flag (KPF_HUGE)
    (flags & (static_cast<uint64_t>(1) << 22));   // "transparent huge" (KPF_THP)
}
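When exploring, it can be handy to print all set flags by name instead of testing single bits. Here is a minimal sketch that does this for a subset of the kpageflags bits. The bit numbers are the ones I know from the kernel's pagemap documentation; the table is deliberately incomplete, and FlagName, kFlagNames and flag_names are names I made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct FlagName {
  unsigned bit;
  const char *name;
};

// Subset of the kpageflags bits, sorted by bit number (not exhaustive).
constexpr FlagName kFlagNames[] = {
    {0, "LOCKED"},         {2, "REFERENCED"},     {4, "DIRTY"},
    {5, "LRU"},            {6, "ACTIVE"},         {11, "MMAP"},
    {12, "ANON"},          {13, "SWAPCACHE"},     {14, "SWAPBACKED"},
    {15, "COMPOUND_HEAD"}, {16, "COMPOUND_TAIL"}, {17, "HUGE"},
    {18, "UNEVICTABLE"},   {21, "KSM"},           {22, "THP"},
};

// Returns the names of all known flags that are set in `flags`,
// in ascending bit order.
std::vector<std::string> flag_names(std::uint64_t flags) {
  std::vector<std::string> names;
  for (const FlagName &f : kFlagNames) {
    if ((flags >> f.bit) & 1u) {
      names.emplace_back(f.name);
    }
  }
  return names;
}

// Usage example: an anonymous page inside a transparent huge page.
void flag_names_example() {
  const std::uint64_t flags =
      (std::uint64_t{1} << 12) | (std::uint64_t{1} << 22);
  const auto names = flag_names(flags);
  assert(names.size() == 2 && names[0] == "ANON" && names[1] == "THP");
}
```

You would feed this the value returned by get_pflags(); bits that are not in the table are silently ignored.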

Getting Some Flags Straight from pagemap

Remember that the PFN is only readable if our process has the CAP_SYS_ADMIN capability; otherwise it is set to zero. If we don’t have that capability, we can still get some information which is directly contained in the pagemap entries. See the documentation for a full list of available flags. One interesting flag is the present flag, which tells us whether a page is currently associated with a frame in physical memory.[4] This would do the trick:

bool present_from_pte(void *ptr) {
  // vvvvvv same as get_pfn() vvvvvv
  size_t pi = page_index(ptr);
  auto fp = std::fopen("/proc/self/pagemap", "rb");
  if (fp == nullptr) { return false; }
  uint64_t page_info = 0;
  bool ok = std::fseek(fp, static_cast<long>(pi * 8), SEEK_SET) == 0 &&
            std::fread(&page_info, 8, 1, fp) == 1;
  std::fclose(fp);
  // ^^^^^^ same as get_pfn() ^^^^^^

  // Bit 63 is the 'present' bit
  return ok && (page_info & (static_cast<uint64_t>(1) << 63)) != 0;
}
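As an alternative that does not touch procfs at all, the mincore system call can report whether a page is resident in memory. The following is only a sketch: resident_via_mincore is a name I made up, and it queries exactly one page, the one containing the given pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Returns true if mincore() reports the page containing ptr as resident.
// Returns false if the page is not resident or if mincore() fails
// (e.g. because the address is not mapped at all).
bool resident_via_mincore(const volatile void *ptr) {
  const long page_size = sysconf(_SC_PAGESIZE);
  assert(page_size > 0);
  const auto size = static_cast<std::uintptr_t>(page_size);

  // mincore() requires a page-aligned start address, so round down.
  const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  void *page_start = reinterpret_cast<void *>(addr - addr % size);

  unsigned char vec = 0; // one status byte per page; we query one page
  if (mincore(page_start, size, &vec) != 0) {
    return false;
  }
  // The least significant bit of each byte is the residency bit.
  return (vec & 1) != 0;
}
```

Applied to the address of a local variable that is currently in use, this should report true, since the stack page holding it has just been touched.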

Computing Statistics about our Application’s Memory

Getting information for a specific address is nice, but most of the time you probably want to compute some statistics for the memory used by your application. We need a way to determine which virtual memory areas are mapped, i.e., part of the page table, at all. For this, we have the file /proc/<PID>/maps. This file contains lines of this form:[5]

        address           perms offset   dev   inode      pathname
        00400000-00452000 r-xp  00000000 08:02 173521     /usr/bin/dbus-daemon
        00e03000-00e24000 rw-p  00000000 00:00 0          [heap]
    35b1a21000-35b1a22000 rw-p  00000000 00:00 0
7fffb2c0d000-7fffb2c2e000 rw-p  00000000 00:00 0          [stack]

Each line (note that the ‘header line’ in this example is not part of the file) corresponds to one area of mapped virtual memory addresses. The address part is the start and end virtual memory address of the area; the end address is exclusive. Remember that virtual memory addresses can also point into files. In those cases, the pathname contains the path to the file, the inode field contains the file’s inode, the dev field contains the major and minor identifier of the device the file is on, and offset contains the offset within the file to which the start of the area corresponds. We ignore these file-backed areas in this article.
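To show how the fields fit together, here is a small sketch that splits one such line into its parts using a string stream. MapsEntry and parse_maps_line are hypothetical names for this illustration, and the parser assumes the field layout shown above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// One parsed line of /proc/<PID>/maps (hypothetical helper type).
struct MapsEntry {
  std::uint64_t start;   // first address of the area
  std::uint64_t end;     // first address after the area
  std::string perms;     // e.g. "rw-p"
  std::uint64_t offset;  // offset into the file, if file-backed
  std::string dev;       // "major:minor"
  std::uint64_t inode;
  std::string pathname;  // empty for unnamed areas
};

std::optional<MapsEntry> parse_maps_line(const std::string &line) {
  std::istringstream in(line);
  MapsEntry e{};
  char dash = 0;
  // Addresses and the offset are hexadecimal, the inode is decimal.
  in >> std::hex >> e.start >> dash >> e.end >> e.perms >> e.offset >> e.dev
     >> std::dec >> e.inode;
  if (!in || dash != '-') {
    return std::nullopt; // not a well-formed maps line
  }
  // The pathname is optional; if it is missing, it simply stays empty.
  std::getline(in >> std::ws, e.pathname);
  return e;
}

// Usage example with the [heap] line from the listing above.
void parse_maps_example() {
  const auto e = parse_maps_line(
      "00e03000-00e24000 rw-p 00000000 00:00 0          [heap]");
  assert(e && e->start == 0xe03000 && e->end == 0xe24000);
  assert(e->pathname == "[heap]");
}
```

The size of an area in pages is then (end - start) / 4096, which for the [heap] line above gives 33 pages.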

The areas we are interested in are those with either an empty pathname, or something in square brackets. The names in square brackets are “pseudo pathnames”, the two most interesting ones are [heap] and [stack]. The [heap] areas (yes, there may be multiple) contain our application’s heap (the area(s) malloc() and friends usually take their memory from), and [stack] contains the stack of the first thread of our application. Then there are the unnamed areas - those are either stacks of other threads, or memory allocated (most likely by libc) via mmap.

We can analyze our application’s memory with the tools we already have by parsing these lines and then applying functions like get_pflags() at 4kB steps within the address range. Say we want to count the number of present pages in our heap(s). We first determine the address ranges of all [heap] areas by using std::regex to parse the lines in /proc/self/maps:

#include <fstream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<size_t, size_t>> get_heap_ranges() {
  // This only matches lines ending in "[heap]", and captures the respective
  // start/end address in the first/second group.
  std::regex re(
      "([0-9a-f]*)-([0-9a-f]*) .{4} [0-9a-f]* .{2}:.{2} [0-9]* *\\[heap\\]");
  std::ifstream mapsfile("/proc/self/maps");
  std::string line;

  std::vector<std::pair<size_t, size_t>> result;
  while (std::getline(mapsfile, line)) {
    std::smatch match_result;
    bool matched = std::regex_match(line, match_result, re);
    if (!matched) {
      // not a "heap" line
      continue;
    }

    // start/end addresses are in hex, so we use base 16 here
    size_t start = std::stoul(match_result[1].str(), nullptr, 16);
    size_t end = std::stoul(match_result[2].str(), nullptr, 16);
    result.emplace_back(start, end);
  }

  return result;
}

With that information, we can apply our present_from_pte() function from earlier in page-size steps:

#include <iostream>

size_t count_present_in_heap() {
  size_t present_count = 0;
  size_t total = 0;

  auto ranges = get_heap_ranges();
  for (auto [heap_start, heap_end] : ranges) {
    // We 'probe' in 4kB (i.e., page size) steps, so present_from_pte() is
    // applied to exactly one address from each page.
    for (size_t current = heap_start; current < heap_end; current += 4096) {
      total++;

      if (present_from_pte(reinterpret_cast<void *>(current))) {
        present_count++;
      }
    }
  }

  std::cout << "Counted " << present_count << " pages present in " << total
            << " pages of heap memory.\n";
  return present_count;
}

Note that this solution did not involve page frame numbers or information from /proc/kpageflags at all, so it can be run without the CAP_SYS_ADMIN capability!
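The same idea works for other pagemap bits, for example the swapped bit (62) if you want to count swapped-out pages. Opening the file once per page is wasteful, though, so here is a sketch that reads all entries of an address range with a single pread call. The name count_pagemap_bit is hypothetical, the 4096-byte page size is again an assumption, and for very large ranges you would probably want to read in chunks instead of allocating one big buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <optional>
#include <unistd.h>
#include <vector>

// Counts the pages overlapping [start, end) whose pagemap entry has `bit`
// set, e.g. bit 63 (present) or bit 62 (swapped).
// Returns std::nullopt if the pagemap file cannot be read.
std::optional<std::size_t> count_pagemap_bit(std::uintptr_t start,
                                             std::uintptr_t end,
                                             unsigned bit) {
  constexpr std::uintptr_t kPageSize = 4096; // assumption, see above
  assert(bit < 64 && start <= end);

  const int fd = open("/proc/self/pagemap", O_RDONLY);
  if (fd < 0) {
    return std::nullopt;
  }

  // Index of the first page, and of the page one past the last page.
  const std::size_t first = start / kPageSize;
  const std::size_t last = (end + kPageSize - 1) / kPageSize;

  // One 8-byte entry per page; entry i lives at byte offset i * 8.
  std::vector<std::uint64_t> entries(last - first);
  const std::size_t bytes = entries.size() * sizeof(std::uint64_t);
  const ssize_t got =
      pread(fd, entries.data(), bytes,
            static_cast<off_t>(first * sizeof(std::uint64_t)));
  close(fd);
  if (got < 0 || static_cast<std::size_t>(got) != bytes) {
    return std::nullopt;
  }

  std::size_t count = 0;
  for (const std::uint64_t entry : entries) {
    if ((entry >> bit) & 1u) {
      ++count;
    }
  }
  return count;
}
```

Combined with get_heap_ranges(), calling count_pagemap_bit(heap_start, heap_end, 62) for each range would give the number of swapped-out heap pages, still without needing CAP_SYS_ADMIN.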


  1. Note that processes can also share memory. If they do, they can, but do not have to, map the shared data to the same virtual addresses.

  2. Yes, you can open pagemap without any special permissions, but the interesting information, the page frame number (PFN), will be zeroed out.

  3. More precisely, by a process with the CAP_SYS_ADMIN capability, but there should be no meaningful difference to just running the process as root.

  4. You could also use the mincore system call to test for page presence.

  5. The example is taken straight from the proc man page.
