When optimizing code for performance, it is often useful to have a very clear idea about what happens in memory. Reducing page faults, cache misses, TLB misses et cetera can be a major factor in speeding up your code. In this post I will demonstrate how you can inspect paging information regarding the memory of your process under Linux.
To get meaningful information from the paging data, one must roughly understand how the paging system works on Linux, so I will start with a coarse overview in the section Linux Paging Primer. If you already know your way around Linux's paging system, you can skip ahead. I will use C/C++ throughout this article, but everything in here should work in any language that gives you access to raw pointers or raw memory addresses. Some details of paging differ between CPU architectures; I am using x86-64/amd64 in this article.
The memory addresses our program sees (i.e., the numeric values of pointers) are addresses in the so-called virtual memory space. Virtual memory is associated with some “real” memory using paging. Every running program gets its own virtual memory space; for example, our program can have a pointer with value 0xc0ffee point to some data in physical memory, and a different program, running in parallel, can have a pointer with value 0xc0ffee point to some completely different data. [1] Virtual memory addresses can also point to things other than “data in physical memory”; for example, we can have pointers into memory-mapped files. In the following, I will ignore these special cases and only consider data in memory.
Virtual memory is divided into contiguous pages of 4kB, or 4096 bytes, each. The first page starts at address 0x0, the second page starts at 0x1000, the third page at 0x2000, and so on. To actually do something useful with an address in virtual memory, the page containing that address must be mapped to something. Since we only consider pages pointing into physical memory in this post, we can assume that a page is either unmapped or mapped to a frame in physical memory. A frame is the physical-memory equivalent of a page: a 4kB block of contiguous physical memory. See Figure 1 for a visualization.
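The address arithmetic behind this layout can be sketched in a few lines. This is only an illustration under the assumption of 4kB pages; the names kPageSize, page_number, page_offset and page_base are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4kB pages, the default page size on x86-64.
constexpr std::uint64_t kPageSize = 4096;

// Index of the page that contains `addr` (page 0 starts at address 0x0).
constexpr std::uint64_t page_number(std::uint64_t addr) { return addr / kPageSize; }

// Offset of `addr` within its page.
constexpr std::uint64_t page_offset(std::uint64_t addr) { return addr % kPageSize; }

// First address of the page that contains `addr`.
constexpr std::uint64_t page_base(std::uint64_t addr) { return addr - page_offset(addr); }

// Small self-check using the page boundaries named in the text.
inline void check_page_math() {
  assert(page_number(0x0fff) == 0);  // last byte of the first page
  assert(page_number(0x1000) == 1);  // first byte of the second page
  assert(page_number(0x2000) == 2);  // first byte of the third page
}
```

For the pointer value 0xc0ffee used above, this yields page number 0xc0f, offset 0xfee and page base 0xc0f000.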
With the relationship between pages and frames as above, we need some bookkeeping to keep track of which page is mapped to which frame, so that we can resolve virtual addresses to physical ones. This is done in the page table. The page table is - despite its name - not just one big table, but actually a hierarchical data structure with multiple levels. However, it is hard to make general statements about “the page table”, since the structure of the page table is completely defined by the CPU architecture, and thus differs from architecture to architecture. Fortunately, the Linux kernel provides a nice, more-or-less architecture independent interface to the page table, as we will see in the next section.
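To make the "multiple levels" a bit more concrete, here is a sketch of how a virtual address is split up on x86-64. It assumes the common 4-level paging mode with 4kB pages (newer CPUs can also use 5 levels, and huge pages skip levels), where each level is indexed by 9 bits of the address. The struct and function names are made up for this example; the field names follow the usual x86-64 terminology.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: x86-64, 4-level paging, 4kB pages.
struct PageTableIndices {
  unsigned pml4;    // bits 39-47: index into the top-level table
  unsigned pdpt;    // bits 30-38: index into the page directory pointer table
  unsigned pd;      // bits 21-29: index into the page directory
  unsigned pt;      // bits 12-20: index into the page table
  unsigned offset;  // bits 0-11: offset within the 4kB page
};

PageTableIndices split_address(std::uint64_t vaddr) {
  PageTableIndices idx{};
  idx.pml4 = static_cast<unsigned>((vaddr >> 39) & 0x1ff);
  idx.pdpt = static_cast<unsigned>((vaddr >> 30) & 0x1ff);
  idx.pd = static_cast<unsigned>((vaddr >> 21) & 0x1ff);
  idx.pt = static_cast<unsigned>((vaddr >> 12) & 0x1ff);
  idx.offset = static_cast<unsigned>(vaddr & 0xfff);
  return idx;
}

// Small self-check with the example pointer value 0xc0ffee.
inline void check_split_address() {
  const PageTableIndices idx = split_address(0xc0ffee);
  assert(idx.pml4 == 0 && idx.pdpt == 0);
  assert(idx.pd == 6 && idx.pt == 15 && idx.offset == 0xfee);
}
```

None of this is needed to use the procfs interface; it only illustrates why a full page table walk touches several tables per address.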
The page table format is defined by the CPU architecture because resolving a virtual address to a physical address is implemented in hardware, in the CPU's Memory Management Unit (MMU), which is part of virtually all modern CPUs. If the OS kernel had to be invoked every time an address needs resolving, that would be far too slow.
In fact, it would even be too slow if the MMU had to consult the whole multi-level page table for every address resolution. That is why MMUs usually include a Translation Lookaside Buffer (TLB). The TLB is essentially a small cache of mappings from pages to frames. The hope is that most applications have a property called locality of reference, or, in simpler terms: consecutive memory accesses usually land close to each other. That means that right after a page-to-frame lookup was performed, we hope that the next memory accesses will happen in the same page, whose mapping the MMU now has in its TLB.
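The effect of locality on the TLB can be illustrated with a toy simulation. This is not how a real TLB is organized (real ones have several levels and are set-associative); the sketch assumes a fully associative cache with LRU replacement and 4kB pages, and the names ToyTlb and count_misses are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: a fully associative TLB with LRU replacement.
class ToyTlb {
 public:
  explicit ToyTlb(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Simulates translating `addr`. Returns true on a TLB hit.
  bool access(std::uint64_t addr) {
    const std::uint64_t page = addr / 4096;
    const auto it = std::find(pages_.begin(), pages_.end(), page);
    const bool hit = (it != pages_.end());
    if (hit) {
      pages_.erase(it);
    } else if (pages_.size() == capacity_) {
      pages_.pop_back();  // evict the least recently used page
    }
    pages_.insert(pages_.begin(), page);  // most recently used page in front
    return hit;
  }

 private:
  std::size_t capacity_;
  std::vector<std::uint64_t> pages_;  // ordered from most to least recently used
};

// Counts TLB misses for `count` accesses that are `stride` bytes apart.
std::size_t count_misses(std::size_t count, std::uint64_t stride,
                         std::size_t tlb_entries) {
  ToyTlb tlb(tlb_entries);
  std::size_t misses = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (!tlb.access(i * stride)) {
      ++misses;
    }
  }
  return misses;
}
```

In this model, 4096 accesses with a stride of 8 bytes touch only 8 pages and therefore miss 8 times, while 4096 accesses with a stride of 4096 bytes miss every single time.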
Under Linux, there are three main “files” in the proc pseudo-filesystem that are of interest if we want to know about the paging of our process. Two of them are found under /proc/self, which is a shortcut for the procfs entries of the current process. If we want to inspect the memory of a process other than the current one, we need to know its process ID (PID) and look inside /proc/<PID>. The three files we will use are:
- /proc/self/maps (or /proc/<PID>/maps) contains information about which areas of our virtual address space are mapped to what (remember that virtual addresses can be mapped not only to physical memory, but also to files).
- /proc/self/pagemap (or /proc/<PID>/pagemap) gives more detailed information about each mapped page, and especially gives us the number of the frame that a page is mapped to, if that page is mapped to physical memory.
- /proc/kpageflags provides status information about frames. To use it, we need to know the frame number of the page we are interested in.

Note that pagemap and kpageflags can effectively [2] only be read by user root. [3] This is because detailed information about memory status, especially frame numbers and some of the status bits in /proc/kpageflags, was used to facilitate attacks such as Rowhammer, which rely on knowledge about the physical layout of memory. As a consequence, starting with version 4.0, Linux restricts access to this information to privileged processes.
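A program can check up front whether the frame-level information is available to it. The following is a rough heuristic, not an official check: it simply tries to open /proc/kpageflags for reading. The helper names are made up for this example.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>

// Returns true if `path` can be opened for reading by the current process.
bool can_open_for_reading(const char *path) {
  assert(path != nullptr);
  const int fd = open(path, O_RDONLY);
  if (fd < 0) {
    return false;
  }
  close(fd);
  return true;
}

// Heuristic (an assumption of this sketch): if kpageflags can be opened,
// the process is privileged enough for the frame-level lookups.
bool frame_info_available() {
  return can_open_for_reading("/proc/kpageflags");
}
```

Being able to open /proc/self/pagemap is not a useful test, since unprivileged processes can open it but read zeroed frame numbers.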
So, if we want to get information about the paging of a certain memory address, we need a two-step process: First, get the frame number of the respective page from the pagemap file, then look up the correct frame in kpageflags. This is what I demonstrate in the next section. If we want to get more general information about the whole of our application's allocated memory, e.g. the number of swapped-out pages in our memory, it becomes a three-step process: First, we must determine the range of pages in our virtual memory mapped to physical memory by inspecting /proc/self/maps, and then perform the two steps from above for each such page. I demonstrate this process in the last section.
Let's start by retrieving the page flags that apply to some virtual memory address, given as a pointer void * ptr. First, we need the index of the page that this memory is located in. To get it, we round the memory address down to a multiple of 4096 and then divide by 4096. Since C/C++ integer division truncates towards zero, the division alone does both steps:
size_t page_index(void *ptr) {
  return reinterpret_cast<size_t>(ptr) / 4096;
}
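The constant 4096 is correct for the default page size on x86-64, but it is not universal. A more portable variant could ask the system for the page size at run time via sysconf(_SC_PAGESIZE). This is a sketch; the names system_page_size and page_index_runtime are made up for this example.

```cpp
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>

// Page size as reported by the operating system.
std::size_t system_page_size() {
  const long size = sysconf(_SC_PAGESIZE);
  assert(size > 0);
  return static_cast<std::size_t>(size);
}

// Same idea as page_index(), but without the hard-coded 4096.
std::size_t page_index_runtime(const void *ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) / system_page_size();
}
```

On a system with 4kB pages, both variants return the same index.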
The pagemap format specifies that for each page, there is a 64-bit entry in /proc/<PID>/pagemap, with the lower 55 bits (bits 0 to 54) holding the page frame number (PFN). To get this PFN, we first seek to the correct position, then we read 64 bits and extract the PFN from them:
#include <cstdint>
#include <cstdio>

size_t get_pfn(void *ptr) {
  size_t pi = page_index(ptr); // see above

  // /proc/self points to the procfs folder of the current process, so
  // we don't need to figure out our PID first.
  auto fp = std::fopen("/proc/self/pagemap", "rb");
  if (fp == nullptr) {
    return 0; // pagemap not accessible
  }

  // Each entry in pagemap is 64 bits, i.e., 8 bytes. Seek to our entry
  // and read it into a uint64_t.
  uint64_t page_info = 0;
  bool ok = std::fseek(fp, static_cast<long>(pi * 8), SEEK_SET) == 0 &&
            std::fread(&page_info, 8, 1, fp) == 1;
  std::fclose(fp);

  // Check that the entry was read and that the page is present (bit 63).
  // Otherwise, there is no PFN.
  if (!ok || !(page_info & (static_cast<uint64_t>(1) << 63))) {
    return 0;
  }

  // Create a mask that has ones in the lowest 55 bits (bits 0-54), and use
  // that to extract the PFN.
  uint64_t pfn = page_info & ((static_cast<uint64_t>(1) << 55) - 1);
  return static_cast<size_t>(pfn);
}
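Besides the PFN, a pagemap entry carries a few status bits. The following sketch decodes a raw 64-bit entry into a struct. The bit positions are taken from the kernel's pagemap documentation as I understand it (present: 63, swapped: 62, file-page or shared-anon: 61, exclusively mapped: 56, soft-dirty: 55); the struct and function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Decoded view of one 64-bit pagemap entry.
struct PagemapEntry {
  bool present;              // bit 63: page is in physical memory
  bool swapped;              // bit 62: page is swapped out
  bool file_or_shared_anon;  // bit 61: file-mapped or shared anonymous page
  bool exclusive;            // bit 56: page is exclusively mapped
  bool soft_dirty;           // bit 55: soft-dirty bit of the PTE
  std::uint64_t pfn;         // bits 0-54, only meaningful if present
};

// Returns true if bit `n` of `value` is set.
inline bool bit_set(std::uint64_t value, unsigned n) {
  assert(n < 64);
  return ((value >> n) & 1u) != 0;
}

PagemapEntry decode_pagemap_entry(std::uint64_t raw) {
  PagemapEntry e{};
  e.present = bit_set(raw, 63);
  e.swapped = bit_set(raw, 62);
  e.file_or_shared_anon = bit_set(raw, 61);
  e.exclusive = bit_set(raw, 56);
  e.soft_dirty = bit_set(raw, 55);
  // For swapped pages the low bits hold swap information instead of a PFN,
  // so the PFN is reported only for present pages.
  e.pfn = e.present ? (raw & ((std::uint64_t{1} << 55) - 1)) : 0;
  return e;
}
```

Without CAP_SYS_ADMIN, the flags are still filled in, but the pfn field will read as zero.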
Now that we have the page frame number in hand, we can use it as an index into /proc/kpageflags to get our page's flags. Again, each entry is 64 bits long, so we must skip ahead in 8-byte steps:
size_t get_pflags(void *ptr) {
  size_t pfn = get_pfn(ptr);
  if (pfn == 0) { return 0; } // non-present pages

  auto fp = std::fopen("/proc/kpageflags", "rb");
  if (fp == nullptr) { return 0; } // kpageflags not accessible

  uint64_t pflags = 0;
  bool ok = std::fseek(fp, static_cast<long>(pfn * 8), SEEK_SET) == 0 &&
            std::fread(&pflags, 8, 1, fp) == 1;
  std::fclose(fp);

  return ok ? static_cast<size_t>(pflags) : 0;
}
I again refer to the excellent pagemap documentation for a list of available flags (and their numeric values). If we want to know, for example, whether our pointer is allocated inside a huge page, this is how we could do it:
bool is_huge(void *ptr) {
  uint64_t flags = get_pflags(ptr);

  return
    (flags & (static_cast<uint64_t>(1) << 17)) || // "huge" flag
    (flags & (static_cast<uint64_t>(1) << 22));   // "transparent huge"
}
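For debugging, it can be handy to print all flags of a frame by name instead of testing single bits. The sketch below maps bits 0 to 22 to the names I know from the kernel's pagemap documentation; newer kernels define further bits, which this example simply ignores. The names kFlagNames and flag_names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Names of kpageflags bits 0-22 (the array index is the bit position).
static const char *const kFlagNames[] = {
    "LOCKED",        "ERROR",     "REFERENCED",  "UPTODATE", "DIRTY",
    "LRU",           "ACTIVE",    "SLAB",        "WRITEBACK", "RECLAIM",
    "BUDDY",         "MMAP",      "ANON",        "SWAPCACHE", "SWAPBACKED",
    "COMPOUND_HEAD", "COMPOUND_TAIL", "HUGE",    "UNEVICTABLE", "HWPOISON",
    "NOPAGE",        "KSM",       "THP"};
static const std::size_t kNumFlagNames = sizeof(kFlagNames) / sizeof(kFlagNames[0]);

// Returns the names of all known flags that are set in `flags`,
// ordered by bit position.
std::vector<std::string> flag_names(std::uint64_t flags) {
  assert(kNumFlagNames == 23);
  std::vector<std::string> names;
  for (std::size_t bit = 0; bit < kNumFlagNames; ++bit) {
    if ((flags >> bit) & 1u) {
      names.emplace_back(kFlagNames[bit]);
    }
  }
  return names;
}
```

The result of get_pflags() can be passed directly to flag_names(); bits 17 and 22 are the two that is_huge() tests.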
Remember that the PFN is only readable if our process has the CAP_SYS_ADMIN capability; otherwise it is set to zero. If we don't have that capability, we can still get some information that is directly contained in the pagemap entries. See the documentation for a full list of available flags. One interesting flag is the present flag, which tells us whether a page is currently associated with a frame in physical memory. [4] This would do the trick:
bool present_from_pte(void *ptr) {
  // vvvvvv same as get_pfn() vvvvvv
  size_t pi = page_index(ptr);
  auto fp = std::fopen("/proc/self/pagemap", "rb");
  if (fp == nullptr) { return false; } // pagemap not accessible
  std::fseek(fp, static_cast<long>(pi * 8), SEEK_SET);
  uint64_t page_info = 0;
  std::fread(&page_info, 8, 1, fp);
  // ^^^^^^ same as get_pfn() ^^^^^^
  std::fclose(fp);

  // Bit 63 is the 'present' bit
  return (page_info & (static_cast<uint64_t>(1) << 63)) != 0;
}
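The mincore system call is an alternative that needs neither pagemap nor special privileges. The following is a minimal sketch of how it could be used for a single address. Note that "resident" in the sense of mincore is not guaranteed to coincide with the present bit in every case, so treat the two as related rather than identical. The function name is made up for this example.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Returns true if mincore() reports the page containing `ptr` as resident.
// Also returns false if mincore() fails, e.g. for an unmapped address.
bool resident_from_mincore(const void *ptr) {
  const long page_size_raw = sysconf(_SC_PAGESIZE);
  assert(page_size_raw > 0);
  const auto page_size = static_cast<std::uintptr_t>(page_size_raw);

  // mincore() requires a page-aligned start address.
  const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
  void *page_start = reinterpret_cast<void *>(addr - addr % page_size);

  // One status byte per page; we only ask about a single page.
  unsigned char status = 0;
  if (mincore(page_start, 1, &status) != 0) {
    return false;
  }
  return (status & 1) != 0;  // the lowest bit signals residency
}
```

For whole address ranges, mincore can fill a vector with one status byte per page in a single call.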
Getting information for a specific address is nice, but most of the time you probably want to compute some statistics for the memory used by your application. We need a way to determine which virtual memory areas are mapped, i.e., part of the page table, at all. For this, we have the file /proc/<PID>/maps. This file contains lines of this form [5]:
address                   perms offset   dev   inode  pathname
00400000-00452000         r-xp  00000000 08:02 173521 /usr/bin/dbus-daemon
00e03000-00e24000         rw-p  00000000 00:00 0      [heap]
35b1a21000-35b1a22000     rw-p  00000000 00:00 0
7fffb2c0d000-7fffb2c2e000 rw-p  00000000 00:00 0      [stack]
Each line (note that the ‘header line’ in this example is not part of the file) corresponds to one area of mapped virtual memory addresses. The address part gives the start and end virtual memory address of the area. Remember that virtual memory addresses can also point into files. In those cases, the pathname contains the path to the file, the inode field contains the file's inode, the dev field contains the major and minor identifier of the device the file is on, and offset contains the offset within the file to which the start of the area corresponds. We ignore these file-backed areas in this article.
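As an illustration of this line format, here is one possible way to split a single line into its fields using stream extraction. It is a sketch under the assumption that lines look like the examples above; the names MapsEntry and parse_maps_line are made up, and std::optional requires C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// The fields of one line of /proc/<PID>/maps.
struct MapsEntry {
  std::uint64_t start = 0;   // first address of the area
  std::uint64_t end = 0;     // first address after the area
  std::string perms;         // e.g. "rw-p"
  std::uint64_t offset = 0;  // offset into the file, if file-backed
  std::string dev;           // "major:minor"
  std::uint64_t inode = 0;   // 0 for areas that are not file-backed
  std::string pathname;      // may be empty
};

// Parses one line; returns std::nullopt if the line does not have the
// expected shape.
std::optional<MapsEntry> parse_maps_line(const std::string &line) {
  std::istringstream in(line);
  MapsEntry e;
  char dash = 0;
  // Addresses and offset are hexadecimal, the inode is decimal.
  in >> std::hex >> e.start >> dash >> e.end >> e.perms >> e.offset >> e.dev >>
      std::dec >> e.inode;
  if (!in || dash != '-') {
    return std::nullopt;
  }
  // The rest of the line (if any) is the pathname, which may contain spaces.
  std::getline(in >> std::ws, e.pathname);
  return e;
}

// Small self-check with the [heap] line from the example.
inline void check_parse_maps_line() {
  const auto e = parse_maps_line("00e03000-00e24000 rw-p 00000000 00:00 0  [heap]");
  assert(e.has_value() && e->pathname == "[heap]");
}
```

The size of an area in pages is then (end - start) / 4096, since both addresses are page-aligned.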
The areas we are interested in are those with either an empty pathname or something in square brackets. The names in square brackets are “pseudo pathnames”; the two most interesting ones are [heap] and [stack]. The [heap] areas (yes, there may be multiple) contain our application's heap (the area(s) malloc() and friends usually take their memory from), and [stack] contains the stack of the first thread of our application. Then there are the unnamed areas - those are either stacks of other threads, or memory allocated (most likely by libc) via mmap.
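The distinction between these kinds of areas can be made from the pathname field alone. A small sketch, with a made-up enum and function name, and the simplifying assumption that everything in square brackets is a pseudo pathname:

```cpp
#include <cassert>
#include <string>

enum class AreaKind { Anonymous, Pseudo, FileBacked };

// Classifies an area by the pathname field of its line in the maps file.
AreaKind classify_area(const std::string &pathname) {
  if (pathname.empty()) {
    return AreaKind::Anonymous;  // unnamed area, e.g. memory from mmap
  }
  if (pathname.front() == '[' && pathname.back() == ']') {
    return AreaKind::Pseudo;  // e.g. [heap] or [stack]
  }
  return AreaKind::FileBacked;  // ignored in this article
}

// Small self-check with pathnames from the example listing.
inline void check_classify_area() {
  assert(classify_area("[heap]") == AreaKind::Pseudo);
  assert(classify_area("") == AreaKind::Anonymous);
}
```

Areas classified as Anonymous or Pseudo are the ones this article cares about.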
We can analyze our application's memory with the tools we already have by parsing these lines and then applying functions like get_pflags() at 4kB steps within the address range. Say we want to count the number of present pages in our heap(s). We first determine the address ranges of all [heap] areas by using std::regex to parse the lines in /proc/self/maps:
#include <fstream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<size_t, size_t>> get_heap_ranges() {
  // This only matches lines ending in "[heap]", and captures the respective
  // start/end address in the first/second group.
  std::regex re(
      "([0-9a-f]*)-([0-9a-f]*) .{4} [0-9a-f]* .{2}:.{2} [0-9]* *\\[heap\\]");
  std::ifstream mapsfile("/proc/self/maps");
  std::string line;

  std::vector<std::pair<size_t, size_t>> result;
  while (std::getline(mapsfile, line)) {
    std::smatch match_result;
    bool matched = std::regex_match(line, match_result, re);
    if (!matched) {
      // not a "heap" line
      continue;
    }

    // start/end addresses are in hex, so we use base 16 here
    size_t start = std::stoul(match_result[1].str(), nullptr, 16);
    size_t end = std::stoul(match_result[2].str(), nullptr, 16);
    result.emplace_back(start, end);
  }

  return result;
}
With that information, we can apply our present_from_pte() function from earlier in page-size steps:
#include <iostream>

size_t count_present_in_heap() {
  size_t present_count = 0;
  size_t total = 0;

  auto ranges = get_heap_ranges();
  for (auto [heap_start, heap_end] : ranges) {
    // We 'probe' in 4kB, i.e., page size, steps, so we apply
    // present_from_pte() to exactly one address from each page.
    for (size_t current = heap_start; current < heap_end; current += 4096) {
      total++;

      if (present_from_pte(reinterpret_cast<void *>(current))) {
        present_count++;
      }
    }
  }

  std::cout << "Counted " << present_count << " pages present in " << total
            << " pages of heap memory.\n";
  return present_count;
}
Note that this solution did not involve page frame numbers or information from /proc/kpageflags at all, so it can be run without the CAP_SYS_ADMIN capability!
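Opening pagemap once per page is simple but slow for large ranges. One possible optimization, sketched here under the assumption of 4kB pages, is to open the file once and fetch all entries of a range with a single pread call. The function name count_present is made up, and for very large ranges you would probably want to read in chunks instead of allocating one entry per page.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Counts the present pages overlapping the address range [start, end).
// Returns std::nullopt if pagemap cannot be opened or read completely.
std::optional<std::size_t> count_present(std::uintptr_t start, std::uintptr_t end) {
  constexpr std::uintptr_t kPageSize = 4096;  // assumption: 4kB pages
  if (end <= start) {
    return std::size_t{0};
  }
  const std::size_t first_page = start / kPageSize;
  const std::size_t last_page = (end - 1) / kPageSize;
  std::vector<std::uint64_t> entries(last_page - first_page + 1);

  const int fd = open("/proc/self/pagemap", O_RDONLY);
  if (fd < 0) {
    return std::nullopt;
  }
  // One 8-byte entry per page, starting at the entry of the first page.
  const std::size_t bytes = entries.size() * sizeof(std::uint64_t);
  const ssize_t got = pread(fd, entries.data(), bytes,
                            static_cast<off_t>(first_page * sizeof(std::uint64_t)));
  close(fd);
  if (got < 0 || static_cast<std::size_t>(got) != bytes) {
    return std::nullopt;
  }

  std::size_t present = 0;
  for (const std::uint64_t entry : entries) {
    if ((entry >> 63) != 0) {  // bit 63 is the 'present' bit
      ++present;
    }
  }
  return present;
}
```

Like the loop above, this only looks at the present bit and therefore works without CAP_SYS_ADMIN.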
[1] Note that processes can also share memory. If they do, they can, but do not have to, map the shared data to the same virtual addresses.
[2] Yes, you can open pagemap without any special permissions, but the interesting information, the page frame number (PFN), will be zeroed out.
[3] More precisely, by a process with the CAP_SYS_ADMIN capability, but there should be no meaningful difference to just running the process as root.
[4] You could also use the mincore system call to test for page presence.